
How to Use BOLT, Binary Optimization and Layout Tool

A look at how this post-link optimizer, developed to speed up large applications, can be a simpler way to achieve better code layout.
Dec 23rd, 2022 9:00am by
Feature image via Unsplash

Data center applications are generally very large and complex, which makes code layout an important optimization for improving their performance. The class of techniques that drives such layout decisions with profiling information is called feedback-driven optimization (FDO) or profile-guided optimization (PGO). However, because of these applications' large size, applying instrumentation-based FDO to them incurs significant memory and computation overhead, which makes the technique practically infeasible.

To overcome this scalability issue, sample-based profiling techniques have been introduced by different systems, such as Ispike, AutoFDO and HFSort. These apply at different points in the compilation chain: AutoFDO at compile time, LIPO and HFSort at link time, and Ispike at post-link time. Among them, the post-link optimizers have been relatively unpopular compared to the compile-time ones, since the profile data is injected late in the compilation chain.

However, BOLT demonstrates that post-link optimization is still useful, for two reasons. First, injecting the profile data later enables more accurate use of that information for better code layout. Second, mapping profile data, which is collected at the binary level, back to the binary level (instead of to the compiler's intermediate representation) is much simpler, enabling efficient low-level optimizations such as code layout.

This BOLT is not to be confused with Bolt, the open source tool from Puppet that runs ad hoc commands and scripts across infrastructure.

Frequently Asked Questions about BOLT

Q. What does BOLT stand for?
A.
Binary Optimization and Layout Tool

Q. What does BOLT do?
A. BOLT has the following rewriting pipeline for a given executable binary:

  • Function discovery
  • Read debug information
  • Read profile data
  • Disassembly (using LLVM’s Tablegen-generated disassembler)
  • CFG construction
  • Optimization pipeline
  • Emit and link functions
  • Rewrite binary file

Q. Can any of the optimization techniques be moved to earlier phases of compilation?
A.
It depends on the situation.

  • Sample-based or instrumentation-based
    • Code efficiency vs. runtime overhead
  • Whether re-compilation is allowed
    • Object files/executable binary in link/post-link phase vs. compiler IR in compile phase

Q. Why does BOLT run on the binary level but not on the source code level or compiler IR level?
A. First, profiling data typically collects binary-level events, and there are challenges in mapping such events to higher-level code representation. Figure 1 shows such a challenge.

Figure 1. An example of a challenge in mapping binary-level events back to higher-level code representations

Second, operating directly on binaries means existing user programs (object code) can be improved almost instantly, with minimal effort.

Q. Why is BOLT implemented as a separate tool?
A. There are two reasons:

  • There are multiple open source linkers and selecting one of them to use for any particular application depends on a number of circumstances that may also change over time.
  • To facilitate the tool’s adoption.

Q. What kind of optimizations does BOLT perform?
A. BOLT’s optimization pipeline applies the following passes, in order:

  1. strip-rep-ret: Strip repz from repz retq instructions used for legacy AMD processors
  2. icf: Identical code folding: folds identical functions; yields additional benefits for functions compiled without the -ffunction-sections flag and for functions with jump tables
  3. icp: Indirect call promotion: leverages call frequency information to mutate an indirect function call into a more performant direct version
  4. peepholes: Simple peephole optimizations
  5. simplify-rodata-loads: Fetches constant data in .rodata whose address is known statically and mutates a load into a move instruction
  6. icf: Identical code folding (second run)
  7. plt: Remove indirection from PLT calls
  8. reorder-bbs: Reorder basic blocks and split hot/cold blocks into separate sections (layout optimization)
  9. peepholes: Simple peephole optimization (second run)
  10. uce: Eliminate unreachable basic blocks
  11. fixup-branches: Fix basic block terminator instructions to match the CFG and the current layout (redone by reorder-bbs)
  12. reorder-functions: Apply HFSort to reorder functions (layout optimization)
  13. sctc: Simplify conditional tail calls
  14. frame-opts: Remove unnecessary caller-saved register spilling
  15. shrink-wrapping: Move callee-saved register spills closer to where they are needed, if profiling data shows it is better to do so

Q. Can BOLT be used for dynamically loaded libraries?
A. Yes. It just requires an additional profiling step for the dynamically loaded libraries.

Q. Which profiling data does BOLT use?
A. BOLT uses the Linux perf utility to collect training input, including:

  • CPU cycles (in user mode only)
  • Sampled taken branches (and type of branches)

Please refer to the Linux perf documentation for the details of these events.

Q. What applications were tested to benchmark BOLT?
A. Larger applications (more than 100MB). For such binaries, it pays to aggressively reduce instruction-cache occupancy, since the I-cache is one of the most constrained resources in the data center space. The following applications were tested by Facebook using BOLT:

  • HHVM: PHP/Hack virtual machine that powers the web servers
  • TAO: a highly distributed, in-memory data-caching service
  • Proxygen: a cluster load balancer
  • Multifeed: a selection of what is shown in the Facebook News Feed
  • Clang: a compiler front end for C, C++ and Objective-C
  • GCC: an optimizing compiler by GNU Project

Current Status of BOLT

The original research paper was published at CGO 2019 by Facebook engineers. The source code has been released and maintained on GitHub since 2015. The BOLT project was merged into the mainline LLVM project in version 14, in March 2022.

BOLT operates on x86-64 and AArch64 ELF binaries. The binaries should have an unstripped symbol table; to get maximum performance gains, they should be linked with relocations (the --emit-relocs or -q linker flag).

BOLT is currently incompatible with the -freorder-blocks-and-partition compiler option. GCC 8 and later versions enable this option by default; you must explicitly disable it by adding a -fno-reorder-blocks-and-partition flag.

At the time of writing, the most recent commits are four months old and contain only non-functional changes.

How to Build and Test BOLT

This section describes how to build BOLT and test with simple executables.

Building BOLT

Step 1. Get source code.
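A minimal sketch, assuming BOLT is built from the upstream LLVM monorepo (which has included BOLT since version 14) and cloned into a directory named llvm-bolt to match the PATH note below:

```shell
# Clone the LLVM monorepo; the target directory name llvm-bolt is an
# assumption so that the build lands in ./llvm-bolt/build/bin.
git clone https://github.com/llvm/llvm-project.git llvm-bolt
```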


Step 2. Build BOLT.
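A minimal build following the upstream BOLT README; the exact CMake options may vary with your toolchain:

```shell
cd llvm-bolt
# Configure a release build with only the BOLT project enabled,
# targeting the two architectures BOLT supports.
cmake -S llvm -B build -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_PROJECTS="bolt" \
  -DLLVM_TARGETS_TO_BUILD="X86;AArch64"
# Build the llvm-bolt, perf2bolt and related tools.
ninja -C build bolt
```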


Note that you might need to modify the PATH variable in your environment to include ./llvm-bolt/build/bin.

Test with Simple Executable

Step 1. Write t.cc.


Step 2. Write a Makefile.
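A minimal sketch of such a Makefile, assuming GCC: -Wl,--emit-relocs keeps the relocations BOLT needs, and -fno-reorder-blocks-and-partition avoids the GCC 8+ incompatibility described above.

```makefile
CXX      = g++
# Keep debug info and disable the pass that conflicts with BOLT.
CXXFLAGS = -O2 -g -fno-reorder-blocks-and-partition
# Keep relocations in the output binary so BOLT can rewrite it.
LDFLAGS  = -Wl,--emit-relocs

t: t.cc
	$(CXX) $(CXXFLAGS) $(LDFLAGS) -o t t.cc

clean:
	rm -f t t.bolt p.data p.fdata
```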


 Step 3. Build an executable from t.cc.
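With the Makefile in place, the build is just:

```shell
# Build ./t; the link step must keep relocations
# (e.g. via -Wl,--emit-relocs) for BOLT to work on it.
make
```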


Step 4. Get profile data p.data from executable t by running perf utility.
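Following the upstream BOLT documentation, profile collection uses hardware branch sampling (LBR) in user mode, for example:

```shell
# Sample user-mode cycles and taken branches while running t,
# writing the raw profile to p.data.
perf record -e cycles:u -j any,u -o p.data -- ./t
```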


Step 5. Convert perf data, p.data, to BOLT format, p.fdata, by executing perf2bolt.
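For example (perf2bolt is built alongside llvm-bolt):

```shell
# Aggregate the raw perf profile into BOLT's fdata format,
# resolving samples against the binary ./t.
perf2bolt -p p.data -o p.fdata ./t
```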


Note that you might need to grant users permission to execute perf.
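One common fix, shown here as an assumption about your setup, is to relax the kernel's perf_event_paranoid setting:

```shell
# Allow unprivileged users to collect profiles.
# -1 is the most permissive value; consider reverting it afterwards.
sudo sysctl kernel.perf_event_paranoid=-1
```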


Step 6. Generate optimized binary t.bolt from t.
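A typical invocation, using the optimization flags suggested in the upstream BOLT README (flag spellings have varied slightly across releases):

```shell
# Rewrite t into t.bolt using the collected profile, reordering
# basic blocks and functions and splitting hot/cold code.
llvm-bolt ./t -o t.bolt -data=p.fdata \
  -reorder-blocks=ext-tsp -reorder-functions=hfsort \
  -split-functions -split-all-cold -split-eh -dyno-stats
```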


Step 7. Compare the file size and the execution time for t and t.bolt.
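For example:

```shell
ls -l t t.bolt   # compare file sizes
time ./t         # baseline run time
time ./t.bolt    # BOLT-optimized run time
```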

Simple Trial with Maple JavaScript

In their research paper, the Facebook team used two categories of binaries to evaluate BOLT. The first comprises actual workloads running in Facebook’s data centers: (1) HHVM, the PHP/Hack virtual machine; (2) TAO, a distributed, in-memory data-caching service; (3) Proxygen, a cluster load balancer built on top of the open source library of the same name; and (4) Multifeed, a service for Facebook News Feed. The second category comprises the (1) Clang and (2) GCC compilers.

For our own trial, we chose a different target binary to optimize: the Maple JavaScript engine, an in-house JavaScript runtime developed by Futurewei Technologies. Two workloads were used with Maple JavaScript: the first is prime.js, which finds the prime numbers below 1 million, and the second is 3d-cube.js, which performs matrix computations to rotate a 3D cube.

Step 1: The CMake build script must be changed to keep relocations in the executable file.
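A hypothetical fragment illustrating the kind of change; the actual Maple JavaScript CMake files are not shown here:

```cmake
# Keep relocations in the final executable so BOLT can rewrite it.
set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -Wl,--emit-relocs")
```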


Step 2: Build the binary for the Maple JavaScript engine.
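A generic out-of-source CMake build sketch; the engine's real build targets may differ:

```shell
# Configure and build a release binary of the engine.
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
```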

Step 3: Modify the run script to get profile data.
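A hypothetical version of the run script's launch line, wrapping the engine in perf instead of invoking it directly; the binary name maple_js is an assumption:

```shell
# Launch the engine under perf so every benchmark run is profiled.
perf record -e cycles:u -j any,u -o p.data -- ./maple_js "$@"
```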


 Step 4: Write the benchmark JavaScript application, for example, prime.js.
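A possible prime.js; the sieve of Eratosthenes below counts the primes below one million:

```javascript
// Hypothetical prime.js benchmark: count primes below n
// with a simple sieve of Eratosthenes.
function countPrimesBelow(n) {
  const composite = new Uint8Array(n);
  let count = 0;
  for (let i = 2; i < n; i++) {
    if (!composite[i]) {
      count++;
      for (let j = i * i; j < n; j += i) composite[j] = 1;
    }
  }
  return count;
}

console.log(countPrimesBelow(1000000)); // → 78498
```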


Step 5: Get profile data by running prime.js with the Maple JavaScript engine.
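For example, assuming the engine binary is named maple_js:

```shell
# Collect branch samples while the engine runs the benchmark.
perf record -e cycles:u -j any,u -o p.data -- ./maple_js prime.js
```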


 Step 6: Convert perf data output to BOLT format.
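As with the simple executable earlier, assuming the engine binary is named maple_js:

```shell
# Aggregate the raw perf profile into BOLT's fdata format.
perf2bolt -p p.data -o p.fdata ./maple_js
```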