How to Use BOLT, Binary Optimization and Layout Tool

Data center applications are generally very large and complex, which makes code layout an important optimization for improving their performance. One class of techniques for improving code layout is feedback-driven optimization (FDO), also known as profile-guided optimization (PGO). However, because of the applications' large sizes, applying FDO to them raises scalability issues: the significant memory and computation cost makes the technique practically infeasible.
To overcome this scalability issue, sample-based profiling techniques have been introduced by different systems, such as Ispike, AutoFDO and HFSort. They apply at different points in the compilation chain: AutoFDO at compile time, LIPO and HFSort at link time, and Ispike at post-link time. Among them, post-link optimizers have been relatively unpopular compared to compile-time ones, since the profile data is injected late in the compilation chain.
However, BOLT demonstrates that post-link optimization is still useful, for two reasons: injecting the profile data later enables more accurate use of that information for code layout, and mapping profile data collected at the binary level back to the binary level (instead of to the compiler's intermediate representation) is much simpler, enabling efficient low-level optimizations such as code layout.
BOLT is not to be confused with the open source tool from Puppet for running ad-hoc commands and scripts across infrastructure, which is also called Bolt.
Frequently Asked Questions about BOLT
Q. What does BOLT stand for?
A. Binary Optimization and Layout Tool
Q. What does BOLT do?
A. BOLT has the following rewriting pipeline for a given executable binary:
- Function discovery
- Read debug information
- Read profile data
- Disassembly
- CFG construction (using LLVM’s Tablegen-generated disassembler)
- Optimization pipeline
- Emit and link functions
- Rewrite binary file
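The rewriting pipeline above can be sketched in miniature as follows. This is a hypothetical Python model for illustration only; BOLT itself is a C++ tool, and only the stage names are taken from the list above.

```python
# A toy sketch of BOLT's rewriting pipeline (hypothetical; BOLT itself is C++).
# Each stage reads and extends a shared rewrite state, in pipeline order.

def discover_functions(state): state["functions"] = ["main", "p", "f"]
def read_debug_info(state):    state["debug"] = True
def read_profile(state):       state["profile"] = {"main": 1000, "p": 900}
def disassemble(state):        state["insns"] = {f: [] for f in state["functions"]}
def build_cfg(state):          state["cfg"] = {f: [] for f in state["functions"]}
def optimize(state):           state["optimized"] = True
def emit_and_link(state):      state["emitted"] = True
def rewrite_binary(state):     state["output"] = "a.bolt"

PIPELINE = [discover_functions, read_debug_info, read_profile, disassemble,
            build_cfg, optimize, emit_and_link, rewrite_binary]

state = {}
for stage in PIPELINE:
    stage(state)
print(state["output"])  # the rewritten output binary
```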
Q. Can any of the optimization techniques be moved to earlier phases of compilation?
A. It depends on the situation.
- Sample-based or instrumentation-based
- Code efficiency vs. runtime overhead
- Whether re-compilation is allowed
- Object files/executable binary in link/post-link phase vs. compiler IR in compile phase
Q. Why does BOLT run on the binary level but not on the source code level or compiler IR level?
A. First, profiling data typically collects binary-level events, and there are challenges in mapping such events back to higher-level code representations. Figure 1 shows one such challenge.

Figure 1. An example of a challenge in mapping binary-level events back to higher-level code representations
Second, existing user programs (object code) can be improved almost instantly, with minimal effort and without recompiling from source.
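The mapping challenge can be illustrated with a toy example (the addresses and line numbers below are hypothetical, not from a real binary): after inlining, a single binary address corresponds to a whole stack of source locations, so a binary-level sample has no unique higher-level home.

```python
# Toy illustration of why binary-level samples are hard to map back to
# higher-level code: after inlining, one binary address belongs to several
# source locations at once. (Hypothetical addresses and line numbers.)

inline_map = {
    # binary address -> inline stack, innermost frame first
    0x400520: [("p", 7), ("main", 22)],   # p() inlined into main()
    0x400524: [("p", 8), ("main", 22)],
}

def attribute_sample(addr):
    """A profiler must split or duplicate this sample's weight across
    every source location in the inline stack."""
    return inline_map.get(addr, [("unknown", 0)])

print(attribute_sample(0x400520))
```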
Q. Why is BOLT implemented as a separate tool?
A. There are two reasons:
- There are multiple open source linkers and selecting one of them to use for any particular application depends on a number of circumstances that may also change over time.
- To facilitate the tool’s adoption.
Q. What kind of optimizations does BOLT perform?
A. The BOLT optimization pipeline uses the following passes:
- strip-rep-ret: Strip repz from repz retq instructions used for legacy AMD processors
- icf: Identical code folding: additional benefits from functions compiled without the -ffunction-sections flag and functions with jump tables
- icp: Indirect call promotion: leverages call frequency information to mutate a function call into a more performant version
- peepholes: Simple peephole optimizations
- simplify-rodata-loads: Fetch constant data in .rodata whose address is known statically and mutate a load into a move instruction
- icf: Identical code folding (second run)
- plt: Remove indirection from PLT calls
- reorder-bbs: Reorder basic blocks and split hot/cold blocks into separate sections (layout optimization)
- peepholes: Simple peephole optimizations (second run)
- uce: Eliminate unreachable basic blocks
- fixup-branches: Fix basic block terminator instructions to match the CFG and the current layout (redone after reorder-bbs)
- reorder-functions: Apply HFSort to reorder functions (layout optimization)
- sctc: Simplify conditional tail calls
- frame-opts: Remove unnecessary caller-saved register spilling
- shrink-wrapping: Move callee-saved register spills closer to where they are needed, if profiling data shows it is better to do so
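To illustrate the layout idea behind reorder-bbs, here is a minimal sketch (not BOLT's actual algorithm) of profile-driven basic block reordering: greedily place each block's hottest unplaced successor next, so that the frequent path becomes fall-through and rarely taken branches move out of line.

```python
# Greedy profile-driven basic block layout: follow the hottest edge out of
# each block so the hot path is laid out as straight-line fall-through code.

def reorder_blocks(cfg, entry):
    """cfg: {block: {successor: taken_count}}; returns a layout order."""
    layout, placed = [], set()
    block = entry
    while block is not None:
        layout.append(block)
        placed.add(block)
        # candidate successors not yet placed in the layout
        succs = {s: c for s, c in cfg.get(block, {}).items() if s not in placed}
        block = max(succs, key=succs.get) if succs else None
    # append any unplaced (cold) blocks at the end
    layout += [b for b in cfg if b not in placed]
    return layout

# Hot loop path A -> B, with a rarely taken exit A -> C
cfg = {"A": {"B": 900, "C": 1}, "B": {"A": 899}, "C": {}}
print(reorder_blocks(cfg, "A"))  # → ['A', 'B', 'C']
```

The cold block C ends up after the hot pair, which is the same intuition behind BOLT's hot/cold splitting into separate sections.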
Q. Can BOLT be used for dynamically loaded libraries?
A. Yes; it just requires an additional profiling step for the dynamically loaded libraries.
Q. Which profiling data does BOLT use?
A. BOLT uses the Linux perf utility to collect training input, including:
- CPU cycles (in user mode only)
- Sampled taken branches (and type of branches)
Please refer to the Linux perf documentation for details of these events.
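To give a feel for what perf2bolt later does with this training input, here is a toy sketch that collapses sampled branch records into from/to branch counts. The record format is a simplified, hypothetical mock-up modeled on `perf script` branch-stack output, not perf2bolt's actual parser.

```python
# A sketch of the kind of aggregation perf2bolt performs: collapsing sampled
# LBR (last branch record) entries into per-branch (from, to) counts.
# Entry format (simplified): from_addr/to_addr/flags...

from collections import Counter

lbr_samples = [
    "0x4004f7/0x4004d0/P/-/-/0",   # taken branch seen twice
    "0x4004f7/0x4004d0/P/-/-/0",
    "0x400512/0x400530/P/-/-/0",
]

branch_counts = Counter()
for entry in lbr_samples:
    frm, to = entry.split("/")[:2]
    branch_counts[(frm, to)] += 1

print(branch_counts[("0x4004f7", "0x4004d0")])  # → 2
```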
Q. What applications were tested to benchmark BOLT?
A. Larger applications (more than 100MB). For these it pays to aggressively reduce I-cache occupation, since the instruction cache is one of the most constrained resources in the data center space. The following applications were tested by Facebook using BOLT:
- HHVM: PHP/Hack virtual machine that powers the web servers
- TAO: a highly distributed, in-memory data-cache service
- Proxygen: a cluster load balancer
- Multifeed: a selection of what is shown in the Facebook News Feed
- Clang: a compiler front end for C-family programming languages
- GCC: an optimizing compiler by GNU Project
Current Status of BOLT
The original research paper was published at CGO 2019 by Facebook engineers. The source code has been released and maintained on GitHub since 2015. The BOLT project was merged into the mainline LLVM project in version 14, released in March 2022.
BOLT operates on x86-64 and AArch64 ELF binaries. The binaries should have an unstripped symbol table; to get maximum performance gains, they should be linked with relocations (the --emit-relocs or -q linker flag).
BOLT is currently incompatible with the -freorder-blocks-and-partition compiler option. GCC 8 and later versions enable this option by default, so you must explicitly disable it by adding the -fno-reorder-blocks-and-partition flag.
As of this writing, the most recent commits were made four months ago and consist of non-functional changes.
How to Build and Test BOLT
This section describes how to build BOLT and test with simple executables.
Building BOLT
Step 1. Get source code.
```
git clone https://github.com/facebookincubator/BOLT llvm-bolt
```
Step 2. Build BOLT.
```
cd llvm-bolt
mkdir build
cd build
cmake -G Ninja ../llvm -DLLVM_TARGETS_TO_BUILD="X86;AArch64" -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_ENABLE_PROJECTS="clang;lld;bolt"
ninja
```
Note that you might need to modify the PATH variable in your environment to include ./llvm-bolt/build/bin.
Test with Simple Executable
Step 1. Write t.cc.
```
// t.cc
#include <iostream>
#include <vector>
using namespace std;

int x[5] = { 0xba, 0xbb, 0xbc, 0xbd, 0xbe };

bool p(int n) {
  for (int i = 2; i*i <= n; i++) {
    if (n % i == 0)
      return false;
  }
  return true;
}

int f(int i) {
  return x[i];
}

int main() {
  int sum = 0;
  for (int k = 2; k < 1000000; k++) {
    if (p(k)) {
      sum++;
    }
  }
  cout << sum << endl;
}
```
Step 2. Write a Makefile.
```
# Makefile
t: t.cc
	clang++ -Wl,--emit-relocs -o t t.cc

clean:
	rm t
```
Step 3. Build an executable from t.cc.
```
make
```
Step 4. Get profile data p.data from executable t by running perf utility.
```
$ perf record -e cycles:u -j any,u -o p.data -- ./t
78498
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 0.526 MB p.data (1280 samples) ]
```
Step 5. Convert perf data, p.data, to BOLT format, p.fdata, by executing perf2bolt.
```
$ perf2bolt -p p.data -o p.fdata ./t
PERF2BOLT: Starting data aggregation job for p.data
PERF2BOLT: spawning perf job to read branch events
PERF2BOLT: spawning perf job to read mem events
PERF2BOLT: spawning perf job to read process events
PERF2BOLT: spawning perf job to read task events
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 88c70afe9d388ad430cc150cc158641701397f70
BOLT-INFO: first alloc address is 0x400000
BOLT-INFO: creating new program header table at address 0x800000, offset 0x400000
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling strict relocation mode for aggregation purposes
BOLT-INFO: pre-processing profile using perf data aggregator
BOLT-WARNING: build-id will not be checked because we could not read one from input binary
PERF2BOLT: waiting for perf mmap events collection to finish...
PERF2BOLT: parsing perf-script mmap events output
PERF2BOLT: waiting for perf task events collection to finish...
PERF2BOLT: parsing perf-script task events output
PERF2BOLT: input binary is associated with 1 PID(s)
PERF2BOLT: waiting for perf events collection to finish...
PERF2BOLT: parse branch events...
PERF2BOLT: read 1280 samples and 20335 LBR entries
PERF2BOLT: 0 samples (0.0%) were ignored
PERF2BOLT: traces mismatching disassembled function contents: 0 (0.0%)
PERF2BOLT: out of range traces involving unknown regions: 253 (1.3%)
BOLT-WARNING: Ignored 0 functions due to cold fragments.
PERF2BOLT: processing branch events...
PERF2BOLT: wrote 17 objects and 0 memory objects to p.fdata
```
Note that you might need to grant users permission to execute perf.
```
$ sudo sysctl kernel.perf_event_paranoid=-1
kernel.perf_event_paranoid = -1
```
Step 6. Generate optimized binary t.bolt from t.
```
$ llvm-bolt ./t -o ./t.bolt -data=p.fdata -reorder-blocks=cache+ -reorder-functions=hfsort -split-functions=2 -split-all-cold -split-eh -dyno-stats
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 88c70afe9d388ad430cc150cc158641701397f70
BOLT-INFO: first alloc address is 0x400000
BOLT-INFO: creating new program header table at address 0x800000, offset 0x400000
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-INFO: pre-processing profile using branch profile reader
BOLT-WARNING: Ignored 0 functions due to cold fragments.
BOLT-INFO: 2 out of 16 functions in the binary (12.5%) have non-empty execution profile
BOLT-INFO: 10 instructions were shortened
BOLT-INFO: basic block reordering modified layout of 2 (9.09%) functions
BOLT-INFO: UCE removed 0 blocks and 0 bytes of code.
BOLT-INFO: splitting separates 76 hot bytes from 51 cold bytes (59.84% of split functions is hot).
BOLT-INFO: 0 Functions were reordered by LoopInversionPass
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:

    13531 : executed forward branches
     6165 : taken forward branches
        0 : executed backward branches
        0 : taken backward branches
    13644 : executed unconditional branches
      141 : all function calls
        0 : indirect calls
        0 : PLT calls
    96484 : executed instructions
    41335 : executed load instructions
     7716 : executed store instructions
        0 : taken jump table branches
        0 : taken unknown indirect branches
    27175 : total branches
    19809 : taken branches
     7366 : non-taken conditional branches
     6165 : taken conditional branches
    13531 : all conditional branches

     7258 : executed forward branches (-46.4%)
       16 : taken forward branches (-99.7%)
     6273 : executed backward branches (+627200.0%)
     6246 : taken backward branches (+624500.0%)
      174 : executed unconditional branches (-98.7%)
      141 : all function calls (=)
        0 : indirect calls (=)
        0 : PLT calls (=)
    82987 : executed instructions (-14.0%)
    41335 : executed load instructions (=)
     7716 : executed store instructions (=)
        0 : taken jump table branches (=)
        0 : taken unknown indirect branches (=)
    13705 : total branches (-49.6%)
     6436 : taken branches (-67.5%)
     7269 : non-taken conditional branches (-1.3%)
     6262 : taken conditional branches (+1.6%)
    13531 : all conditional branches (=)
BOLT-INFO: SCTC: patched 0 tail calls (0 forward) tail calls (0 backward) from a total of 0 while removing 0 double jumps and removing 0 basic blocks totalling 0 bytes of code. CTCs total execution count is 0 and the number of times CTCs are taken is 0.
BOLT-INFO: padding code to 0xc00000 to accommodate hot text
BOLT-INFO: setting _end to 0x600df0
BOLT-INFO: setting __hot_start to 0xa00000
BOLT-INFO: setting __hot_end to 0xa00092
```
Step 7. Compare the file size and the execution time for t and t.bolt.
```
$ ls -l t t.bolt
-rwxrwxr-x 1 wjeon wjeon   10400 Feb 10 17:10 t
-rwxrwxrwx 1 wjeon wjeon 8394880 Feb 10 17:18 t.bolt
$ time ./t
78498

real    0m0.309s
user    0m0.309s
sys     0m0.000s
$ time ./t.bolt
78498

real    0m0.259s
user    0m0.259s
sys     0m0.000s
```
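From the two wall-clock times measured above (0.309 s for the original binary, 0.259 s after BOLT), the improvement can be computed directly; the snippet below is just a small arithmetic check on those numbers.

```python
# Speedup of the toy binary t.bolt over t, from the `time` output above.

t_orig, t_bolt = 0.309, 0.259
reduction = (t_orig - t_bolt) / t_orig * 100   # % runtime reduction
speedup = t_orig / t_bolt                      # relative speedup factor

print(f"{reduction:.1f}% faster, {speedup:.2f}x speedup")  # → 16.2% faster, 1.19x speedup
```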
Simple Trial with Maple JavaScript
In their research paper, the Facebook teams use two categories of binaries to evaluate BOLT. The first is the actual workloads running on Facebook’s data centers: (1) HHVM, the PHP/Hack virtual machine, (2) TAO, a distributed, in-memory data-caching service, (3) Proxygen, a cluster load balancer built on top of the open source library of the same name and (4) Multifeed, a service for Facebook News Feed. The second category of binaries is (1) the Clang and (2) GCC compilers.
First, we tried the Maple JavaScript engine as our target binary to optimize. Maple JavaScript is an in-house JavaScript runtime engine developed by Futurewei Technologies. Two workloads were used with Maple JavaScript: the first is prime.js, which finds prime numbers less than 1 million, and the second is 3d-cube.js, which performs matrix computations for rotating a 3D cube.
Step 1: The Cmake build script must be changed to keep relocations in the executable file.
```
diff --git a/maple_engine/src/CMakeLists.txt b/maple_engine/src/CMakeLists.txt
index 8eec9d1..323f1a2 100644
--- a/maple_engine/src/CMakeLists.txt
+++ b/maple_engine/src/CMakeLists.txt
@@ -74,6 +74,8 @@ find_library( PBjaddr2_LIB java_addr2line "${CMAKE_CURRENT_SOURCE_DIR}/../lib/*"
 find_library( PBmplre_LIB mplre "${CMAKE_CURRENT_SOURCE_DIR}/../lib/*" )
 find_library( PBunwind_LIB unwind "${CMAKE_CURRENT_SOURCE_DIR}/../lib/*" )
 
+target_link_options(mplre-dyn PRIVATE -Wl,--emit-relocs)
+
 target_link_libraries( mplre-dyn "${CMAKE_CURRENT_SOURCE_DIR}/../../../mapleall/out/ark-clang-release/lib/64/libHWSecureC.a" "${CMAKE_CURRENT_SOURCE_DIR}/../../../mapleall/jscre/build/libjscre.a" icuio icui18n icuuc icudata)
 #target_link_libraries( mplsh ${PBmpl_LIB} )
 #target_link_libraries( mplsh ${PBcorea_LIB} )
```
Step 2: Build the binary for the Maple JavaScript engine.
Step 3: Modify the run script to get profile data.
```
diff --git a/maple_build/tools/run-js-app.sh b/maple_build/tools/run-js-app.sh
index 0af9c8d..a4c0cae 100755
--- a/maple_build/tools/run-js-app.sh
+++ b/maple_build/tools/run-js-app.sh
@@ -46,4 +46,5 @@ $MPLCG -O2 --quiet --no-pie --verbose-asm --fpic $file.mmpl
 /usr/bin/x86_64-linux-gnu-g++-5 -g3 -pie -O2 -x assembler-with-cpp -c $file.s -o $file.o
 /usr/bin/x86_64-linux-gnu-g++-5 -g3 -pie -O2 -fPIC -shared -o $file.so $file.o -rdynamic
 export LD_LIBRARY_PATH=$MAPLE_RUNTIME_ROOT/lib/x86_64
-$DBCMD $MPLSH -cp $file.so
+#$DBCMD $MPLSH -cp $file.so
+perf record -e cycles:u -j any,u -o perf.data -- $DBCMD $MPLSH -cp $file.so
```
Step 4: Write the benchmark JavaScript application, for example, prime.js.
```
if (typeof console == "object") print = console.log;
if (typeof console === 'undefined') console = {log:print};

function p(n) {
  for (let i = 2; i * i <= n; i++) {
    if (n % i == 0) {
      return false;
    }
  }
  return true;
}

var sum = 0;
for (var k = 2; k < 1000000; k++) {
  if (p(k)) {
    sum++;
  }
}
print(sum);
```
Step 5: Get profile data by running prime.js with the Maple JavaScript engine.
```
$ run-js-app.sh prime.js
78498
[ perf record: Woken up 37 times to write data ]
[ perf record: Captured and wrote 9.468 MB perf.data (22989 samples) ]
```
Step 6: Convert the perf data to BOLT format (perf.fdata) with perf2bolt, as in Step 5 of the previous section, then generate the optimized library libmplre-dyn.bolt.so by executing llvm-bolt.
```
$ llvm-bolt libmplre-dyn.so -o libmplre-dyn.bolt.so -data=perf.fdata -reorder-blocks=cache+ -reorder-functions=hfsort -split-functions=2 -split-all-cold -split-eh -dyno-stats
BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 88c70afe9d388ad430cc150cc158641701397f70
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: creating new program header table at address 0x400000, offset 0x400000
BOLT-WARNING: debug info will be stripped from the binary. Use -update-debug-sections to keep it.
BOLT-INFO: enabling relocation mode
BOLT-WARNING: disabling -split-eh for shared object
BOLT-INFO: enabling lite mode
BOLT-INFO: pre-processing profile using branch profile reader
BOLT-WARNING: Ignored 0 functions due to cold fragments.
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _ZN5maple21InvokeInterpretMethodERNS_12DynMFunctionE
BOLT-INFO: 14 out of 1205 functions in the binary (1.2%) have non-empty execution profile
BOLT-INFO: 1 function with profile could not be optimized
BOLT-INFO: profile for 1 objects was ignored
BOLT-INFO: the input contains 25 (dynamic count : 1) opportunities for macro-fusion optimization. Will fix instances on a hot path.
BOLT-INFO: 1241 instructions were shortened
BOLT-INFO: removed 5 empty blocks
BOLT-INFO: basic block reordering modified layout of 10 (0.47%) functions
BOLT-INFO: UCE removed 0 blocks and 0 bytes of code.
BOLT-INFO: splitting separates 3334 hot bytes from 3048 cold bytes (52.24% of split functions is hot).
BOLT-INFO: 0 Functions were reordered by LoopInversionPass
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:

    21327 : executed forward branches
    10516 : taken forward branches
      652 : executed backward branches
      459 : taken backward branches
      648 : executed unconditional branches
     2085 : all function calls
      988 : indirect calls
      988 : PLT calls
   327015 : executed instructions
    81409 : executed load instructions
    56643 : executed store instructions
     8029 : taken jump table branches
        0 : taken unknown indirect branches
    22627 : total branches
    11623 : taken branches
    11004 : non-taken conditional branches
    10975 : taken conditional branches
    21979 : all conditional branches

    21205 : executed forward branches (-0.6%)
      255 : taken forward branches (-97.6%)
      774 : executed backward branches (+18.7%)
      513 : taken backward branches (+11.8%)
      329 : executed unconditional branches (-49.2%)
     2085 : all function calls (=)
      988 : indirect calls (=)
      988 : PLT calls (=)
   326401 : executed instructions (-0.2%)
    81409 : executed load instructions (=)
    56643 : executed store instructions (=)
     8029 : taken jump table branches (=)
        0 : taken unknown indirect branches (=)
    22308 : total branches (-1.4%)
     1097 : taken branches (-90.6%)
    21211 : non-taken conditional branches (+92.8%)
      768 : taken conditional branches (-93.0%)
    21979 : all conditional branches (=)
BOLT-INFO: SCTC: patched 0 tail calls (0 forward) tail calls (0 backward) from a total of 0 while removing 0 double jumps and removing 0 basic blocks totalling 0 bytes of code. CTCs total execution count is 0 and the number of times CTCs are taken is 0.
BOLT-INFO: padding code to 0x800000 to accommodate hot text
BOLT-INFO: setting _end to 0x80e43c
BOLT-INFO: setting __hot_start to 0x600000
BOLT-INFO: setting __hot_end to 0x6205e7
```
Step 7: Rename the Maple JavaScript runtime library libmplre-dyn.so.
```
$ mv libmplre-dyn.so libmplre-dyn.so~
$ mv libmplre-dyn.bolt.so libmplre-dyn.so
```
Step 8: Execute prime.js with the Maple JavaScript engine by using the original run script.
```
run-js-app.sh prime.js
```
Step 9: Compare the file size and the execution time.
```
$ ls -l libmplre-dyn*
-rwxrwxrwx 1 wjeon wjeon  8676416 Feb 10 18:22 libmplre-dyn.so
-rwxrwxr-x 1 wjeon wjeon 19387232 Feb 10 11:37 libmplre-dyn.bolt.so

// original
$ time run-js-app.sh prime.js
78498

real    0m5.743s
user    0m5.714s
sys     0m0.046s

// with BOLT
$ time run-js-app.sh prime.js
78498

real    0m5.738s
user    0m5.710s
sys     0m0.045s

// original
$ time run-js-app.sh 3d-cube.js

real    0m51.210s
user    0m51.183s
sys     0m0.040s

// with BOLT
$ time run-js-app.sh 3d-cube.js

real    0m51.425s
user    0m51.368s
sys     0m0.073s
```
However, no clear benefit from binary optimization with BOLT was observed for the Maple JavaScript engine. We believe the main reason is that the workloads we used with Maple JavaScript were not as complex as the ones used by the original authors of the paper. The workloads contain only a few conditional branches, so BOLT may not have had good opportunities to optimize the Maple JavaScript binary. Also, their execution times are very short compared to those of the workloads the authors used.
BOLT Optimization for Clang
So we decided to run the same benchmark workload used in the paper, the Clang compiler, on our setup. The detailed steps to reproduce the results presented in the paper are documented in the BOLT GitHub repository. Most of our steps were identical, except that Clang 14 was selected instead of Clang 7. Here is a summary of the setup:
- Tested app: Clang 14 (14.x branch of GitHub source code)
- Tested environment: Ubuntu 18.04.4 LTS, 40-core CPU, 800GB memory
- Different optimizations
  - PGO+LTO: baseline setup without BOLT (Profile-Guided Optimization + Link-Time Optimization provided by LLVM/Clang)
  - PGO+LTO+BOLT: BOLT optimizations enabled (as suggested by the BOLT GitHub project)
    - Algorithm for reordering of functions: hfsort+
    - Algorithm for reordering of basic blocks: cache+ (layout optimizing I-cache behavior)
    - Level of function splitting: three (all functions)
    - Fold functions with identical code
  - BOLT-reorder functions: BOLT optimizations excluding reordering of functions
  - BOLT-reorder blocks: BOLT optimizations excluding reordering of basic blocks
  - BOLT-hot/cold split: BOLT optimizations excluding hot/cold splitting
  - BOLT-ICF: BOLT optimizations excluding identical code folding
The main purpose of this test is to identify how much of the performance benefit comes from each BOLT optimization option. PGO+LTO, which enables the basic PGO- and LTO-based optimizations supported by LLVM, was selected as the baseline for the performance comparison.
PGO+LTO+BOLT indicates that all BOLT optimizations were enabled on top of PGO and LTO. BOLT-reorder functions enables all the BOLT optimizations (described in their documentation) except reordering of functions. Similarly, BOLT-reorder blocks, BOLT-hot/cold split and BOLT-ICF enable all the BOLT optimizations except reordering of basic blocks, hot/cold splitting, and identical code folding, respectively.
Table 1 shows the execution times of different optimization configurations.

Table 1. Execution time of Clang with different optimization configurations
From the table of execution times, the single optimizations that most affect the execution time are, in order: (1) reorder blocks, (2) hot/cold function splitting, (3) reorder functions and (4) identical code folding.
Figure 2 shows the contributions of the different optimization options to L1-icache-misses.

Figure 2. Contribution of different BOLT optimizations
As seen, the single BOLT optimization options that most affect L1-icache-misses are, in order: (1) reorder blocks, (2) hot/cold split and reorder functions (tie) and (3) identical code folding.
Table 3 shows more results on different optimization options from other system parameters’ point of view.

Table 3. Contribution of different BOLT optimizations
From Table 3, two additional system parameters are most affected by the different BOLT optimization options: cpu-cycles and L1-icache-load-misses. cpu-cycles is most affected by (1) reorder blocks, (2) hot/cold split and reorder functions (tie) and (3) identical code folding, in that order, while L1-icache-load-misses is most affected by (1) reorder blocks, (2) hot/cold split, (3) reorder functions and (4) identical code folding.