
How to Use BOLT, Binary Optimization and Layout Tool

A look at how this post-link optimizer, developed to speed up large applications, can be a simpler way to achieve better code layout.
Dec 23rd, 2022 9:00am

Data center applications are generally large and complex, which makes code layout an important optimization for improving their performance. One such code layout technique is feedback-driven optimization (FDO), also called profile-guided optimization (PGO). However, due to these applications’ large sizes, applying FDO to them leads to scalability issues: the significant memory and computation usage and cost make the technique practically infeasible.

To overcome this scalability issue, sample-based profiling techniques have been introduced by different systems, such as Ispike, AutoFDO and HFSort. These systems apply the profile at different points in the compilation chain: AutoFDO at compile time, LIPO and HFSort at link time, and Ispike at post-link time. Among them, the post-link optimizers have been relatively unpopular compared to the compile-time ones, since the profile data is injected late in the compilation chain.

However, BOLT demonstrates that post-link optimizations are still useful, for two reasons. First, injecting the profile data later enables more accurate use of the information for better code layout. Second, mapping profile data, which is collected at the binary level, back to the binary level (instead of to the compiler’s intermediate representation) is much simpler, enabling efficient low-level optimizations such as code layout.

BOLT is not to be confused with Bolt, the open source tool from Puppet for running ad hoc commands and scripts across infrastructure.

Frequently Asked Questions about BOLT

Q. What does BOLT stand for?
A. Binary Optimization and Layout Tool

Q. What does BOLT do?
A. BOLT has the following rewriting pipeline for a given executable binary:

  • Function discovery
  • Read debug information
  • Read profile data
  • Disassembly (using LLVM’s Tablegen-generated disassembler)
  • CFG construction
  • Optimization pipeline
  • Emit and link functions
  • Rewrite binary file

Q. Can any of the optimization techniques be moved to earlier phases of compilation?
A. It depends on the situation, mainly on two trade-offs:

  • Sample-based vs. instrumentation-based profiling
    • Code efficiency vs. runtime overhead
  • Whether re-compilation is allowed
    • Object files/executable binaries in the link/post-link phase vs. compiler IR in the compile phase

Q. Why does BOLT run on the binary level but not on the source code level or compiler IR level?
A. First, profiling data typically collects binary-level events, and there are challenges in mapping such events to higher-level code representation. Figure 1 shows such a challenge.

Figure 1. An example of a challenge in mapping binary-level events back to higher-level code representations

Second, user programs (object code) can be improved almost instantly, with minimal effort.

Q. Why is BOLT implemented as a separate tool?
A. There are two reasons:

  • There are multiple open source linkers, and selecting one of them for any particular application depends on a number of circumstances that may also change over time.
  • Being a separate tool facilitates adoption.

Q. What kind of optimizations does BOLT perform?
A. BOLT optimization pipeline uses:

  1. strip-rep-ret: Strip ‘repz’ from ‘repz retq’ instructions used for legacy AMD processors
  2. icf: Identical code folding; yields additional benefits over link-time folding for functions compiled without the -ffunction-sections flag and for functions with jump tables
  3. icp: Indirect call promotion; leverages call frequency information to mutate an indirect call into a more performant direct version
  4. peepholes: Simple peephole optimizations
  5. simplify-rodata-loads: Fetch constant data in .rodata whose address is known statically and mutate a load into a move instruction
  6. icf: Identical code folding (second run)
  7. plt: Remove indirection from PLT calls
  8. reorder-bbs: Reorder basic blocks and split hot/cold blocks into separate sections (layout optimization)
  9. peepholes: Simple peephole optimization (second run)
  10. uce: Eliminate unreachable basic blocks
  11. fixup-branches: Fix basic block terminator instructions to match the CFG and the current layout (redone by reorder-bbs)
  12. reorder-functions: Apply HFSort to reorder functions (layout optimization)
  13. sctc: Simplify conditional tail calls
  14. frame-opts: Remove unnecessary caller-saved register spilling
  15. shrink-wrapping: Move callee-saved register spills closer to where they are needed, if profiling data shows it is better to do so

Q. Can BOLT be used for dynamically loaded libraries?
A. Yes; it just requires an additional profiling step with the dynamically loaded libraries.

Q. Which profiling data does BOLT use?
A. BOLT uses the Linux perf utility to collect training input, including:

  • CPU cycles (in user mode only)
  • Sampled taken branches (and type of branches)

Please refer to the Linux perf documentation for the details of these perf events.

Q. What applications were tested to benchmark BOLT?
A. Larger applications (more than 100MB). For binaries of this size, it pays to aggressively reduce I-cache occupancy, since the instruction cache is one of the most constrained resources in the data center space. The following applications were tested by Facebook using BOLT:

  • HHVM: PHP/Hack virtual machine that powers the web servers
  • TAO: a highly distributed, in-memory, data-caching service
  • Proxygen: a cluster load balancer
  • Multifeed: a selection of what is shown in the Facebook News Feed
  • Clang: a compiler front end for C-family programming languages
  • GCC: an optimizing compiler by GNU Project

Current Status of BOLT

The original research paper was published at CGO 2019 by Facebook engineers. The source code has been released and maintained on GitHub since 2015. The BOLT project was merged into the mainline LLVM project in March 2022, starting with LLVM 14.

BOLT operates on x86-64 and AArch64 ELF binaries. The binaries should have an unstripped symbol table; to get maximum performance gains, they should be linked with relocations (the --emit-relocs or -q linker flag).

BOLT is currently incompatible with the -freorder-blocks-and-partition compiler option. GCC 8 and later versions enable this option by default; you must explicitly disable it by adding a -fno-reorder-blocks-and-partition flag.

As of this writing, the latest code commits were made four months ago, and they are non-functional changes.

How to Build and Test BOLT

This section describes how to build BOLT and test with simple executables.

Building BOLT

Step 1. Get source code.
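The original command listing was lost in formatting; since BOLT now lives in the mainline LLVM monorepo, fetching the source can look like this (the checkout directory name llvm-bolt is assumed to match the PATH note later in this section):

```shell
git clone https://github.com/llvm/llvm-project.git llvm-bolt
```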


Step 2. Build BOLT.
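A typical build, following the flags suggested in the upstream BOLT README (CMake and Ninja are assumed to be installed):

```shell
cd llvm-bolt
mkdir build && cd build
cmake -G Ninja ../llvm \
    -DLLVM_ENABLE_PROJECTS="bolt" \
    -DLLVM_TARGETS_TO_BUILD="X86;AArch64" \
    -DCMAKE_BUILD_TYPE=Release \
    -DLLVM_ENABLE_ASSERTIONS=ON
ninja bolt
```

This produces llvm-bolt, perf2bolt and related tools under build/bin.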


Note that you might need to modify the PATH variable in your environment to include ./llvm-bolt/build/bin.

Test with Simple Executable

Step 1. Write t.cc.


Step 2. Write a Makefile.
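A minimal Makefile consistent with BOLT's requirements (relocations kept in the final binary):

```makefile
CXX      := g++
CXXFLAGS := -O2 -g
# BOLT needs relocations in the final binary (--emit-relocs or -q).
LDFLAGS  := -Wl,--emit-relocs

t: t.cc
	$(CXX) $(CXXFLAGS) $(LDFLAGS) -o $@ $<

clean:
	rm -f t t.bolt p.data p.fdata perf.data
```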


 Step 3. Build an executable from t.cc.


Step 4. Get profile data p.data from executable t by running perf utility.
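Using the perf events listed in the FAQ above (user-mode cycles plus sampled taken branches):

```shell
# -e cycles:u samples user-mode CPU cycles only; -j any,u records
# taken branches (requires hardware LBR support).
perf record -e cycles:u -j any,u -o p.data -- ./t
```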


Step 5. Convert perf data, p.data, to BOLT format, p.fdata, by executing perf2bolt.
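The conversion step can look like:

```shell
# perf2bolt aggregates the perf samples and maps them onto the binary.
perf2bolt -p p.data -o p.fdata ./t
```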


Note that you might need to grant users permission to execute perf.
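One common way to do this (a system-wide setting; adjust to your site's security policy):

```shell
# Lower the paranoia level so unprivileged users can sample; the value
# resets on reboot unless persisted in /etc/sysctl.conf.
sudo sysctl kernel.perf_event_paranoid=-1
```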


Step 6. Generate optimized binary t.bolt from t.
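Using the optimization flags suggested in the upstream BOLT README (older releases spell some of these differently, e.g. -reorder-blocks=cache+ and -split-functions=2):

```shell
llvm-bolt ./t -o ./t.bolt -data=p.fdata \
    -reorder-blocks=ext-tsp -reorder-functions=hfsort \
    -split-functions -split-all-cold -split-eh -dyno-stats
```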


Step 7. Compare the file size and the execution time for t and t.bolt.
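For example:

```shell
ls -l t t.bolt    # compare file sizes
time ./t          # original binary
time ./t.bolt     # BOLT-optimized binary
```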

Simple Trial with Maple JavaScript

In their research paper, the Facebook teams use two categories of binaries to evaluate BOLT. The first comprises actual workloads running in Facebook’s data centers: (1) HHVM, the PHP/Hack virtual machine; (2) TAO, a distributed, in-memory, data-caching service; (3) Proxygen, a cluster load balancer built on top of the open source library of the same name; and (4) Multifeed, a service for Facebook News Feed. The second category comprises the (1) Clang and (2) GCC compilers.

First, we tried the Maple JavaScript engine as our target binary to optimize. Maple JavaScript is an in-house JavaScript runtime engine developed by Futurewei Technologies. Two workloads were used for Maple JavaScript: the first is prime.js, which finds prime numbers less than 1 million, and the second is 3d-cube.js, which performs matrix computations for rotating a 3D cube.

Step 1: The CMake build script must be changed to keep relocations in the executable file.
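One way to do this is shown below; the exact placement depends on Maple JavaScript's CMakeLists.txt, so treat this as an illustrative fragment:

```cmake
# Keep relocations in the linked outputs so BOLT can rewrite them later.
set(CMAKE_EXE_LINKER_FLAGS    "${CMAKE_EXE_LINKER_FLAGS} -Wl,--emit-relocs")
set(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} -Wl,--emit-relocs")
```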


Step 2: Build the binary for the Maple JavaScript engine.

Step 3: Modify the run script to get profile data.
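A hypothetical excerpt; the engine binary name maple_js below is a placeholder, and the actual run script's command line should be wrapped the same way:

```shell
# Wrap the original engine invocation with perf to collect a profile.
perf record -e cycles:u -j any,u -o perf.data -- ./maple_js "$@"
```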


 Step 4: Write the benchmark JavaScript application, for example, prime.js.
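A minimal prime.js matching the description above (console.log is assumed here; an embedded engine may expose a different print function):

```javascript
// Count primes below one million by trial division.
function isPrime(n) {
  if (n < 2) return false;
  for (let d = 2; d * d <= n; d++) {
    if (n % d === 0) return false;
  }
  return true;
}

let count = 0;
for (let i = 0; i < 1000000; i++) {
  if (isPrime(i)) count++;
}
console.log(count + " primes below 1000000");
```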


Step 5: Get profile data by running prime.js with the Maple JavaScript engine.
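Again with the placeholder engine name maple_js:

```shell
perf record -e cycles:u -j any,u -o p.data -- ./maple_js prime.js
```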


 Step 6: Convert perf data output to BOLT format.
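Since the hot code lives in the shared runtime library, the library (not the launcher binary) is passed to perf2bolt:

```shell
perf2bolt -p p.data -o p.fdata libmplre-dyn.so
```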


 Step 7: Rename the Maple JavaScript runtime library libmplre-dyn.so.
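Renaming lets llvm-bolt write the optimized library under the name the run script already expects (using the README-suggested flags; older releases spell some of them differently):

```shell
# Keep the original library, then emit the optimized one in its place.
mv libmplre-dyn.so libmplre-dyn.so.orig
llvm-bolt libmplre-dyn.so.orig -o libmplre-dyn.so -data=p.fdata \
    -reorder-blocks=ext-tsp -reorder-functions=hfsort \
    -split-functions -split-all-cold -split-eh -dyno-stats
```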


Step 8: Execute prime.js with the Maple JavaScript engine by using the original run script.


Step 9: Compare the file size and the execution time.


However, the benefits of binary optimization using BOLT for the Maple JavaScript engine were not clearly visible. We believe the main reason is that the workloads used with Maple JavaScript were not as complex as the ones used by the original authors of the paper. The workloads contain only a few conditional branches, so BOLT may not have had good opportunities to optimize the Maple JavaScript binary. Also, their execution times are very short compared to those of the workloads the authors used.

BOLT Optimization for Clang

So we decided to use the same benchmark workload on our setup that was used in the paper: the Clang compiler. The detailed steps to reproduce the result presented in the paper are documented in BOLT’s GitHub repository. Most of our steps were identical, but we selected the newer Clang 14 instead of Clang 7. Here is a summary of the setup.

  • Tested app: Clang 14 (14.x branch of GitHub source code)
  • Tested environment: Ubuntu 18.04.4 LTS, 40-core CPU, 800GB memory
  • Different optimizations
    • PGO+LTO: baseline setup without BOLT (Profile Guided Optimization + Link-Time Optimization provided by LLVM/Clang)
    • PGO+LTO+BOLT: BOLT optimizations enabled (suggested by BOLT GitHub project)
      • Algorithm for reordering of functions: hfsort+
      • Algorithm for reordering of basic blocks: cache+ (layout optimizing I-cache behavior)
      • Level of function splitting: Three (all functions)
      • Fold functions with identical code
    • BOLT-reorder functions: BOLT optimizations excluding reordering of functions
    • BOLT-reorder blocks: BOLT optimizations excluding reordering of basic blocks
    • BOLT-hot/cold split: BOLT optimizations excluding hot/cold splitting
    • BOLT-ICF: BOLT optimizations excluding identical code folding

The main purpose of this test is to identify how much of the performance benefit comes from which of BOLT’s optimization options. PGO+LTO, which enables the basic optimizations based on the PGO and LTO support in LLVM/Clang, was selected as the baseline for the performance comparison.

PGO+LTO+BOLT indicates that all BOLT optimizations were enabled on top of PGO and LTO. BOLT-reorder functions enables all the BOLT optimizations (described in the documentation) except reordering of functions. Similarly, BOLT-reorder blocks, BOLT-hot/cold split and BOLT-ICF enable all the BOLT optimizations except reordering of basic blocks, hot/cold splitting and identical code folding, respectively.

Table 1 shows the execution times of different optimization configurations.

Table 1. Execution time of Clang with different optimization configurations

From the table of execution times, the single options that most affect the execution time are, in order: (1) reorder blocks, (2) hot/cold function splitting, (3) reorder functions and (4) identical code folding.

Figure 2 shows the contributions of the different optimization options to L1-icache-misses.

Figure 2. Contribution of different BOLT optimizations

As seen, the single options that most affect L1-icache-misses are, in order: (1) reorder blocks, (2) hot/cold split and reorder functions (tied) and (3) identical code folding.

Table 3 shows further results for the different optimization options from the point of view of other system parameters.


Table 3. Contribution of different BOLT optimizations

From Table 3, two additional system parameters are strongly affected by the different BOLT optimization options: cpu-cycles and L1-icache-load-misses. Cpu-cycles is most affected by (1) reorder blocks, (2) hot/cold split and reorder functions (tied) and (3) identical code folding, in that order; L1-icache-load-misses by (1) reorder blocks, (2) hot/cold split, (3) reorder functions and (4) identical code folding.
