MIT-Created Compiler Speeds up Python Code
Python is a popular, beginner-friendly language. It’s also an interpreted language, which makes it easy to use but slower than a compiled language such as C or C++. At large scale, that becomes a problem, as Ariya Shajii, an MIT CSAIL Ph.D. graduate, and his colleague Ibrahim Numanagić noticed when working in genomics, which involves large data sequences.
They realized that previous efforts to create faster versions of Python took a top-down approach: they started with the standard implementation and then tried to make it faster with just-in-time compilation, which compiles the code as the program runs, Shajii said.
“The clear advantage of that is you can get a lot of backwards compatibility, but you’re really limited in the types of things you can do,” Shajii told The New Stack. “For example, Python has this thing called a global interpreter lock, which basically prevents you from doing parallel or multithreaded applications. And that’s a big problem if you really want high performance.”
Instead, Shajii and Numanagić took a bottom-up approach, implementing everything from the ground up, independent of the standard Python implementation, he said. That led them to an unusual solution: compiling Python with Codon, a tool they created with a team at MIT.
“It gives you a lot more flexibility to do interesting things and generate optimized code, and things like that,” Shajii said. “That’s why we’re able to get such a better performance than some of these other compilation approaches, which maybe get 2 to 4 times, for example, but with Codon it’s usually like 10 to 100 times.”
The MIT team tested Codon on approximately ten commonly used genomics applications, all written in Python and compiled using Codon. The team achieved speed-ups of five to ten times over the original hand-optimized implementations.
Codon’s Origin Story
Originally, Shajii and Numanagić planned to build a domain-specific language for genomics, since that was their background. What they found, however, is that people didn’t want to learn a new and specialized language — they like Python.
“That’s why we just made everything as Pythonic as possible,” he said. “Then over time, we just closed the gaps farther and farther to the point where we had sort of a general Python, sort of Python replacement pretty much.”
The team then refactored their tool into the Codon compiler by moving all of its genomics-specific libraries, data structures, and sequence-handling methods into an extension. This approach lets Codon support other domain-specific languages, which are programming languages that offer higher-level abstractions for a specific class of problems, all wrapped in a comfortable Python-like environment.
“The whole system is extensible with plugins, so you can write a plugin that has new libraries, new compiler optimizations; you can even add new keywords to the language if you want it to, or new syntax,” Shajii said. “But from the user standpoint, they’re still writing very high-level Pythonic code.”
One of the first puzzles the team had to solve was how to feed the compiler Python code. The compiler’s first step is “type checking,” a process in which it works out the data type (string, integer, floating-point number, and so on) of each variable and function. In regular Python, that information is resolved as the program runs, which is one of the reasons Python is slow. Codon does this type checking before the program runs. Doing so allows the compiler to convert the code to native machine code, avoiding the overhead of handling data types at runtime.
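To make the difference concrete, here is a minimal sketch in ordinary Python. The function names are hypothetical and chosen only for illustration. The first function shows the runtime type lookup described above. The second shows the kind of code whose types can be worked out before the program runs. It does not show Codon’s internals.

```python
from typing import List


def double(value):
    # Standard Python decides what "+" means only when this line runs,
    # by inspecting the runtime type of "value" on every call.
    return value + value


def total(values: List[int]) -> int:
    # Every type here is knowable before the program runs, so a compiler
    # can, in principle, emit plain integer additions with no per-operation
    # type lookup.
    result = 0
    for v in values:
        result += v
    return result


print(double(21))        # integer addition
print(double("ab"))      # string concatenation
print(total([1, 2, 3]))  # sum over a list of integers
```

The same `double` call resolves to different operations depending on its argument, which is the flexibility an interpreter pays for at runtime.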
They then focused on optimizations in the compiler.
“If you’re working with the genomics plugin, for example, that will do its own set of optimizations that are specific to that computing domain, which involves working with genomic sequences and other biological data, for example. The result? An executable file that runs at the speed of C or C++, or even faster once domain-specific optimizations are applied,” MIT stated.
Shajii and the team published a paper detailing how Codon works.
Compiling Python Caveats
There are a few caveats with compiling Python, however. Codon does not support dynamically changing data types at runtime, for instance.
“We said, okay, we’re targeting scientific applications, and it’s rare to do stuff like that, so let’s just like shift our focus to statically analyzable things,” Shajii explained. “So some of those dynamic features we don’t support.”
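As an illustration of the kind of dynamic behavior Shajii is describing, the sketch below rebinds one variable to values of different types, then shows a statically analyzable rewrite. Both functions are hypothetical examples written in standard Python. Whether Codon accepts or rejects any particular pattern is not something the article specifies, so treat this as a sketch of the general idea, not a statement about the compiler’s exact rules.

```python
from typing import Optional


def parse_dynamic(text):
    # "result" starts life as a str, then is rebound to an int or to None.
    # Its type depends on the data seen at runtime.
    result = text.strip()
    if result.isdigit():
        result = int(result)
    else:
        result = None
    return result


def parse_static(text: str) -> Optional[int]:
    # Each name keeps a single, declared type for its whole lifetime,
    # which makes the function easier to analyze ahead of time.
    cleaned: str = text.strip()
    number: Optional[int] = int(cleaned) if cleaned.isdigit() else None
    return number


print(parse_dynamic(" 42 "), parse_static(" 42 "))
print(parse_dynamic("abc"), parse_static("abc"))
```

Both versions return the same values. The second simply avoids reusing one name for unrelated types.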
Some of these omitted features are on Codon’s roadmap and some aren’t. For instance, parts of the Python standard library aren’t supported yet, but the MIT team is working on it.
“It’s a huge, huge library, but we’ve tried to implement the main ones that we typically see used […] in the kinds of applications that we’re targeting,” he said.
There are also data type differences. For example, integers in Codon are 64-bit, whereas in Python they’re “arbitrarily long,” he said.
Also, while Codon is designed to help projects scale up, don’t expect a seamless experience yet.
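The practical effect of that integer difference can be sketched in standard Python. The helper `wrap_int64` below is hypothetical. It simulates what a signed 64-bit integer would hold, assuming two’s-complement wraparound on overflow. The article states only that Codon’s integers are 64-bit, so the wraparound behavior is an assumption made for illustration.

```python
def wrap_int64(n: int) -> int:
    """Reduce n to the value a signed 64-bit two's-complement integer would hold."""
    n &= (1 << 64) - 1                                # keep the low 64 bits
    return n - (1 << 64) if n >= (1 << 63) else n     # reinterpret as signed


big = 2 ** 63            # one past the largest signed 64-bit value
print(big)               # 9223372036854775808 (Python integers grow as needed)
print(wrap_int64(big))   # -9223372036854775808 (assumed 64-bit wraparound)
```

Code that relies on very large integers, such as some hashing or cryptographic routines, is where this difference is most likely to show up.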
“Larger code bases, you’ll probably end up having some [of the] incompatibilities that I mentioned. So, you know, oftentimes we give you error messages: that you need to go and change this, or [we] don’t support this yet,” he said.
There are other ways to use Codon in larger Python applications, he said, noting that there is a decorator that lets developers compile one particular function, such as a bottleneck, while everything else stays in Python.
“That’s to address this problem of an all-or-nothing approach,” he said. “Often, if you have some Python application, what people would typically do is they would write the really performance-critical pieces of that in C; or Cython, for example, is another tool that’s used for that. So we’re releasing something pretty soon that lets you do that same thing in Codon, so you never have to leave the Python environment, which, again, is sort of the underlying theme of all this.”
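The decorator approach might look something like the sketch below. It assumes a `codon` Python package that exposes a `jit` decorator. The article does not name the decorator, so treat that name as an assumption. If the package is not installed, the sketch falls back to a no-op decorator so that it still runs as plain Python. The function names are hypothetical.

```python
try:
    # Assumption: a "codon" package exposing a "jit" decorator is installed.
    import codon
    jit = codon.jit
except (ImportError, AttributeError):
    def jit(func):
        # Fallback: leave the function as ordinary interpreted Python.
        return func


@jit
def count_gc(sequence: str) -> int:
    # The bottleneck: a tight loop over every base in the sequence.
    total = 0
    for base in sequence:
        if base == "G" or base == "C":
            total += 1
    return total


def gc_fraction(sequence: str) -> float:
    # The rest of the application stays in regular Python.
    return count_gc(sequence) / len(sequence) if sequence else 0.0


print(gc_fraction("GATTACAGC"))
```

Only `count_gc` is marked for compilation. The caller, and the rest of the program, stays in regular Python.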
Codon’s Coming Soon: WebAssembly and More
Codon was released in December and is at version 0.15. It’s available for free use in academic or personal applications.
The team wants to incorporate several dynamic features and expand its Python library coverage. There’s one planned feature, however, that may appeal to frontend and web developers: The team plans to support compiling to WebAssembly.
“We use LLVM as a backend. LLVM is a very common sort of compiler infrastructure/framework that a lot of compilers use, and LLVM has support for WebAssembly,” he said. “So one of the things that we plan to add support for is WebAssembly for Codon, so [that] you can take a Python program and compile it to WebAssembly.”