Intel’s Generational On-Chip Change APX Will Make All the Apps Faster

Intel has made an under-the-radar change in its chips that has long-term implications for helping software run faster on servers and PCs.
The chip maker has doubled the number of general-purpose registers in its x86 chip architecture, which will give developers a boost in application performance on chips that support the change.
Coders simply need to recompile code to take advantage of the new features. Intel executives said programs do not need to be rewritten.
Intel's x86 chips previously had 16 general-purpose registers, but the new Advanced Performance Extensions (APX) double that to 32. With more values held in registers, programs need fewer loads from and stores to memory, which makes them faster and more power efficient.
“APX is around general purpose, integer computing, it gives you more registers. Recompile, and you get better performance,” said Ronak Singhal, a senior fellow at Intel.
The compiler uses registers to store local variables, and it previously ran out after 16. After that, compilers had to go to memory to manage the variables, which affected performance.
“Now the compiler can go all the way to 32 local variables — we’re running out of these registers all the time and the compiler has to manage those in memory, which costs some of the runtime. We are giving the compiler room to optimize, basically, because there is more space to do things faster,” said Arjan van de Ven, also a senior fellow at Intel.
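To make the register-pressure point concrete, here is a minimal sketch in C of a loop that keeps more than 16 values live at once. The function name lane_sum and the lane count of 20 are assumptions made up for this illustration, not anything from Intel. Whether a given compiler keeps the running totals in general-purpose registers, spills some of them to memory, or vectorizes the loop depends on the compiler and its flags, so treat this as a picture of the problem rather than a benchmark.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lane count, chosen only to exceed 16 live values. */
enum { LANES = 20 };

/* Keeps LANES independent running totals over `data`, then adds them up.
   `data` must hold n_blocks * LANES values. */
uint64_t lane_sum(const uint64_t *data, size_t n_blocks)
{
    assert(data != NULL || n_blocks == 0);

    /* 20 totals plus the pointer and two counters: more live values
       than 16 registers can hold at once. */
    uint64_t acc[LANES] = {0};
    for (size_t b = 0; b < n_blocks; b++) {
        for (size_t i = 0; i < LANES; i++) {
            /* Each lane has its own weight, so the totals stay distinct. */
            acc[i] += data[b * LANES + i] * (uint64_t)(i + 1);
        }
    }

    uint64_t total = 0;
    for (size_t i = 0; i < LANES; i++) {
        total += acc[i];
    }
    return total;
}
```

The source stays the same for a 16-register and a 32-register target. Only the compiler's register allocation changes, which is why a recompile is all that is asked of developers.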
Modest Gains
Intel’s goal with APX is to provide an incremental benefit on every workload, but do not expect a radical 10x-type boost, Singhal said.
“It’s not hard for people to use. It does not require a restructuring of your application, coming up with new algorithms for application, none of that,” Singhal said.
Intel did not comment on when the APX instructions would be in chips. Intel’s chip lineup includes server chips Emerald Rapids (due later this year), Granite Rapids and Sierra Forest (next year), and PC chips Meteor Lake (this year) and Arrow Lake (next year). Intel typically boosts performance and power efficiency with new generations of chips.
“Who will be the first people to use this — think of the companies that control their own code, and it’s easy for them to recompile and get that benefit. And they are also the savviest users,” Singhal said.
Intel has worked to make the transition simpler, but it will take time for APX-related tools to reach developers. The company has posted documentation on APX to begin engaging the open source community.
“GCC — we have patches coming out soon. Same for the LLVM compiler. As you can imagine, we will be doing similar things with Microsoft and their compilers, all of them,” van de Ven said.
Application developers have the choice to boost performance for an entire application, or parts of the source code that are performance sensitive.
Binary Compatibility
“The nice thing with APX is we keep compatibility with existing binaries completely. You can mix and match. You can make one piece new and faster, and the part of the application that is not performance sensitive you do not have to do that. You can just keep it,” van de Ven said.
For developers who care about performance, it might be useful to ship two copies of an application, or of the parts of it that are very performance sensitive, van de Ven said.
“You can imagine that the Linux distribution might be able to ship a second build of the same source code that is … optimized for the processor and it’s always completely compatible,” van de Ven said.
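One way to picture shipping two builds of the same code is a small dispatcher that picks an implementation at startup. The sketch below is in C and is an assumption about how a developer might structure it, not a description of Intel's tooling. The names sum_baseline, sum_apx_build and select_sum are hypothetical, and the cpu_has_apx value is passed in by the caller because the actual detection mechanism is not covered here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*sum_fn)(const uint32_t *values, size_t count);

/* Baseline build: runs on any x86-64 chip. */
static uint64_t sum_baseline(const uint32_t *values, size_t count)
{
    assert(values != NULL || count == 0);
    uint64_t total = 0;
    for (size_t i = 0; i < count; i++) {
        total += values[i];
    }
    return total;
}

/* Stand-in for the same source recompiled with APX enabled. In a real
   project this would be a separately compiled object file; here it is
   ordinary C so the sketch runs anywhere. */
static uint64_t sum_apx_build(const uint32_t *values, size_t count)
{
    assert(values != NULL || count == 0);
    uint64_t total = 0;
    for (size_t i = 0; i < count; i++) {
        total += values[i];
    }
    return total;
}

/* Hypothetical selector: the caller supplies the result of whatever
   CPU feature check it uses. */
sum_fn select_sum(bool cpu_has_apx)
{
    return cpu_has_apx ? sum_apx_build : sum_baseline;
}
```

A program would call select_sum once at startup and use the returned function pointer for the hot path, leaving the rest of the application on the baseline build.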
APX can also benefit discerning programmers who write code directly to GPUs, CPUs, and other hardware.
“When you’re a compiler writer or you do things like that, yes, this gives you more freedom, which tends to result in better code that is generated,” van de Ven said.
Singhal said more registers were needed as programming is much more complex today than it was 20 to 30 years ago. Computing is forking in multiple ways with applications and programming frameworks.
Applications have also evolved to run in parallel across CPUs and accelerators, which pushed Intel to bring more parallelism onto its chips. APX creates a foundation to boost performance in other parallel environments.
Intel has added new instruction sets such as AMX for on-chip AI acceleration, and TDX for on-chip data security. APX will also provide a minor boost to those instructions.
APX also reduces reliance on some of the risky performance-improvement features that Intel has implemented in previous chips.
The company uses a feature called “speculative execution” to anticipate which instructions a program will need next. By predicting behavior, the chip is able to reduce delays and run some applications much faster.
But speculative execution has its own issues and was at the center of the Meltdown vulnerability detected on Intel chips in 2018.
The APX instructions provide an opportunity to remove some branches from code. Without them, branch prediction has to guess whether a condition will turn out “true” or “false” before deciding which task to execute.
“We can remove that and turn it into a conditional move. If that condition is this, then move this or that? No branch needed,” Singhal said.
Intel always had capabilities for conditionals, but “this makes it much richer and much easier for compilers to take advantage,” Singhal said.
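The difference Singhal describes can be sketched at the source level in C. The two hypothetical functions below compute the same clamped value: the first is written with early-return branches, the second as a chain of selects that a compiler can map onto conditional moves. This is only an illustration. An optimizing compiler is free to generate the same machine code for both, and which form it picks depends on the compiler and the target.

```c
#include <assert.h>
#include <stdint.h>

/* Branch form: the processor has to predict which way each test goes. */
int64_t clamp_branch(int64_t x, int64_t lo, int64_t hi)
{
    assert(lo <= hi);
    if (x < lo) {
        return lo;
    }
    if (x > hi) {
        return hi;
    }
    return x;
}

/* Select form: "if the condition holds, move this value, otherwise
   keep that one". No jump is required to express it. */
int64_t clamp_select(int64_t x, int64_t lo, int64_t hi)
{
    assert(lo <= hi);
    int64_t r = x;
    r = (r < lo) ? lo : r;
    r = (r > hi) ? hi : r;
    return r;
}
```

Both functions return the same result for every input with lo <= hi. The select form simply gives the compiler an easier path to branch-free code.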
Intel also introduced the new AVX10 instructions, which affect coders who write applications for high-performance computing.
The AVX10 instructions are a successor to the AVX-512 instructions, which are used for scientific computing, machine learning, security, and other applications.
The biggest improvement in the new AVX10 is around usability, not performance. Intel has had multiple generations of AVX, but the listing of features got convoluted, which made it difficult for programmers to match up the right set of features and CPUs.
“If we have a hard time figuring this out, outside of Intel, nobody has a chance,” van de Ven said.
AVX10 reorganizes the versioning into a linear form, such as 10.1 or 10.2, with each enumeration listing the new features. Customers do not need to check various versions to identify CPUs and match up the AVX features.
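The usability gain can be sketched in C as the difference between two kinds of capability check. With separate feature flags, software has to test for the exact combination it needs. With linear versions, one comparison is enough. Everything below is a hypothetical model: the flag bit positions, the function names and the way the version is represented are assumptions made for illustration and do not reflect how the processor actually reports these features.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Old style: every extension is its own flag. The names follow AVX-512
   subsets, but the bit positions are made up for this sketch. */
enum {
    FEAT_AVX512F  = 1 << 0,
    FEAT_AVX512BW = 1 << 1,
    FEAT_AVX512VL = 1 << 2,
    FEAT_AVX512DQ = 1 << 3
};

/* True only if every required flag is present on the CPU. */
bool flags_satisfy(uint32_t cpu_flags, uint32_t required)
{
    return (cpu_flags & required) == required;
}

/* New style: a version such as 10.2 is modeled by its minor number (2).
   Assumption: numbering starts at 1, and each version includes
   everything in the versions before it. */
bool avx10_at_least(unsigned cpu_minor, unsigned required_minor)
{
    assert(required_minor >= 1);
    return cpu_minor >= required_minor;
}
```

Under this model, code built for 10.1 needs a single check such as avx10_at_least(cpu_minor, 1), rather than a test against a list of individual flags.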
The AVX10 feature set will first appear in the upcoming Xeon server chip code-named Granite Rapids, which is due early next year. Granite Rapids is based on an entirely new processor design, and it will be made on the Intel 3 manufacturing process, in which the chip maker will use Extreme Ultraviolet (EUV) technologies to etch finer features on chips.
Intel has been aggressively pushing its oneAPI computing tools, which let developers build a common codebase that can be easily ported across hardware and accelerators. Intel may first include the APX and AVX10 tools in oneAPI, and make them available through Intel Dev Cloud.