
Automate Quality, Security Checks for Python Library Dependencies

7 Jun 2022 10:00am, by Fridolín Pokorný and Andy Oram
Fridolín Pokorný
Fridolín is a passionate Pythonista. He works at Red Hat on scalable platforms and machine learning applications in Red Hat’s office of the CTO. Nature lover.

The power behind open source allows communities to create and share sophisticated software libraries and packages worldwide. With this great power comes the great responsibility of making sure the packages are kept in a healthy state, which often requires domain knowledge and expertise.

In this article, we will look at efforts to accumulate such knowledge centrally and apply it during program builds. The approach is best represented by the open source Python cloud resolver called Thoth, a project on which co-author Fridolín is a developer and committer. Thoth’s resolution engine takes advantage of several resources in the Python ecosystem and guides consumers of open source Python software in building applications with secure, high-performing and compatible dependencies.

A classic xkcd comic, #2347, brings home just how crucial each application dependency is. The history of open source software, unfortunately, provides plenty of examples of the dangers caused by insufficient testing and vetting, including a bug in JavaScript’s npm and the recent Log4j issue. To protect open source developers and users, technology companies are discussing how to collect knowledge about open source software and the communities that produce it.

In this article, we’ll step through typical Python builds and see how tools can automate the process of checking for robust versions of dependencies.

Python Packaging

Andy Oram
Andy Oram writes and edits works about many aspects of computing, ranging in size from blog postings to full-length books. For many years, Andy worked as an editor at O’Reilly & Associates. His topics there covered a wide range of computer technologies.

Most Python developers are familiar with the Python Package Index (PyPI). PyPI is the main place to get open source Python packages. The tool used most often for installing Python dependencies, pip, resolves application dependencies and installs them in the desired environment. Let’s take a look at this resolution process.

The first task in resolution is to make a list of the packages imported by the application and to determine which dependencies they have. Packages are discovered in a cascading fashion because each package may require other packages that are called transitive dependencies.
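This cascading discovery can be sketched as a graph traversal. The dependency data below is made up purely for illustration; a real resolver obtains this metadata from each package’s artifacts:

```python
from collections import deque

# Toy dependency metadata. In reality, pip discovers this by downloading
# each package's metadata; the names and edges here are illustrative only.
DEPENDENCIES = {
    "app": ["requests"],
    "requests": ["urllib3", "idna", "certifi"],
    "urllib3": [],
    "idna": [],
    "certifi": [],
}

def collect_transitive(package):
    """Breadth-first walk of the dependency graph, as a resolver would do."""
    seen, queue = set(), deque([package])
    while queue:
        pkg = queue.popleft()
        for dep in DEPENDENCIES.get(pkg, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return sorted(seen)

# "app" pulls in requests directly, and urllib3, idna and certifi transitively.
print(collect_transitive("app"))
```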

Next, the resolver must determine which version of each package to install. Many Python developers are content with letting pip install the most recent stable version of each package, which is the default behavior. But there are many reasons to override this choice of the latest version. The developer might specify a particular version, or a range of acceptable versions, because something about the latest version is incompatible with the application.
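Such overrides are expressed with standard version specifiers, for example in a requirements.txt file (the package names and pins below are illustrative, not recommendations):

```
# requirements.txt -- illustrative pins only
tensorflow==2.8.0        # exact version
numpy>=1.21,<1.23        # any version in an acceptable range
requests!=2.27.0         # any version except a known-bad release
```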

Pip is also intelligent enough to determine that some packages need particular versions of dependencies in order to be stable. Sometimes pip even backtracks after checking a dependency and switches versions of a package that depends on it. Although this sophisticated resolution process can help a developer by automatically figuring out which dependencies work together, it can sometimes cause undesired downloads.

Other open source tools, such as Pipenv, Poetry and pip-tools, can also manage a lock file, which stores a list of the versions of all dependencies needed in an application. By creating a lock file, a developer can ensure a reproducible installation, which is valuable for preventing unpleasant surprises when you’re rebuilding an application frequently in a DevOps environment.
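With pip-tools, for instance, the workflow might look like the following; the lock-file excerpt is illustrative, but the format matches what pip-compile emits (pinned versions annotated with the package that required them):

```
$ pip-compile requirements.in --generate-hashes   # produce a pinned lock file
$ pip-sync requirements.txt                       # install exactly what is locked

# requirements.txt (excerpt)
click==8.1.3
    # via flask
flask==2.1.2
    # via -r requirements.in
```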

All the tools we’ve mentioned perform the resolution process on clients’ machines, where application dependencies are subsequently installed. As of now, Python packaging does not expose dependency information on PyPI, which means that the tools have to actually download artifacts and obtain the dependency information during the installation process.

Hardware and other requirements can also affect resolution. Project Thoth began largely to meet the specialized requirements of machine learning programs, which can call for GPUs. When resolving such popular libraries as PyTorch and TensorFlow, the resolver should consider the available CUDA, cuDNN or CPU instruction set.

If the target environment is not isolated, developers should also consider the security of dependencies. As of the writing of this article, open source TensorFlow has 299 reported vulnerabilities in releases hosted on PyPI (see Figure 1). When installing TensorFlow as a direct dependency of an application, if the quality of any dependency (either direct or transitive) is weak, an application can be vulnerable or can misbehave. In the worst case, a dependency can go missing entirely from publicly available sources.

Figure 1: CVE database on the popular TensorFlow library.

Aggregating Information about Dependencies

To avoid unnecessary downloads, dependency information could be extracted from individual packages and exposed directly on PyPI. Recent efforts to expose static wheel metadata through Warehouse’s API endpoints support this feature.

Another possible solution is to monitor new package releases and extract dependency information out of wheels or even source distributions. If the extracted information is stored in a queryable form for desired environments, a resolver can use preaggregated dependency data to resolve application requirements for target environments. This resolver does not need to run on client machines if clients send information about the runtime environment used.

The time saved during the resolution process can be invested in finding a resolved set of software packages that meets desired quality criteria. In this case, the resolution process can come up with dependencies that are not the latest possible versions based on the dependency graph (as in the case of pip), but versions that are not prone to known vulnerabilities, that perform well in the target environment, and that consist of well-tested, stable library combinations.

advisory-db, maintained by the Python Packaging Authority, provides information about vulnerabilities in open source software. The information is available as YAML files and can be easily consumed by machines. An example of a tool exploiting this database is pip-audit, which audits packages present in the environment.

Using Information about Dependencies in the Resolution Process

The advisory-db can also be a valuable source of security information that can be directly plugged into the resolution process. In that case, the resolver can immediately resolve application dependencies that are not prone to known vulnerabilities.
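The idea can be illustrated with a toy filter that removes known-vulnerable candidates before the resolver considers them. The package names, versions and vulnerability records below are invented for the example:

```python
# Hypothetical vulnerability records, keyed by (package, version).
# A real resolver would derive these from a source such as advisory-db.
KNOWN_VULNERABLE = {
    ("example-lib", "1.0.0"),
    ("example-lib", "1.1.0"),
}

def safe_candidates(package, versions):
    """Keep only candidate versions with no known vulnerability on record."""
    return [v for v in versions if (package, v) not in KNOWN_VULNERABLE]

# Only 1.2.0 survives; the resolver never even considers the vulnerable releases.
print(safe_candidates("example-lib", ["1.0.0", "1.1.0", "1.2.0"]))
```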

Unlike the resolver implemented in pip, which uses a backtracking algorithm to resolve application dependencies, Thoth’s resolution process in the cloud resolver is a Markov decision process (MDP). The resolution process satisfies the essential Markov property: Any future state of the resolution process depends only upon the current state and the future actions taken, not on the sequence of preceding actions. The production deployment of Thoth’s resolver uses temporal difference learning to resolve application dependencies.

A state in the resolution process is defined by three main attributes: an already resolved set of dependencies based on the traversal done so far in the dependency graph, a set of dependencies to be resolved and a score defining the quality of the given state. A state that does not have any unresolved dependencies in the final state (based on the MDP) can be used to produce a lock file.

The resolver internally keeps multiple states that are generated during the traversal of the application dependency graph. New states are produced by taking an action available in the dependency graph (the search space); each action resolves one unresolved dependency.

These actions are scored positively or negatively, taking into account the quality of the dependency that is resolved. If the dependency does not meet desired criteria (e.g., the dependency has a vulnerability but the resolved set of dependencies must be free of vulnerabilities), the given action in the resolution process can be marked as invalid, causing the resolver to find another resolution path that would meet the desired quality.
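The state-and-action idea can be sketched in a few lines. This is not Thoth’s implementation, only a minimal illustration: each state holds resolved dependencies, unresolved dependencies and a score, and an action that would resolve a vulnerable version is marked invalid, forcing the resolver down another path. All package data is hypothetical:

```python
# Hypothetical candidate versions and vulnerability records.
CANDIDATES = {"example-lib": ["2.0.0", "1.9.0"]}
VULNERABLE = {("example-lib", "2.0.0")}

def expand(state):
    """Generate successor states by resolving one unresolved dependency."""
    resolved, unresolved, score = state
    successors = []
    for pkg in unresolved:
        for version in CANDIDATES[pkg]:
            if (pkg, version) in VULNERABLE:
                continue  # invalid action: prune this resolution path
            successors.append((
                resolved | {(pkg, version)},     # newly resolved dependency
                unresolved - {pkg},              # one fewer unresolved package
                score + 1.0,                     # positive score for a valid choice
            ))
    return successors

# Start with nothing resolved; final states have no unresolved dependencies
# and could be used to produce a lock file.
initial = (frozenset(), frozenset({"example-lib"}), 0.0)
final_states = [s for s in expand(initial) if not s[1]]
print(final_states)
```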

To create a pluggable interface to the resolution process, actions are scored by a resolution pipeline. This pipeline is made up of multiple units that score each action to be taken and can additionally adjust the resolution process. These adjustments can include fixing underpinning or overpinning issues, where the developer has been too lax or too strict in requiring specific versions, respectively.

Further adjustments include inserting dependencies into the dependency graph, removing dependencies from it and replacing some of its nodes (for instance, resolving tensorflow-gpu for GPU-enabled environments instead of tensorflow-cpu).

To create a high-level interface for the resolution process, the resolver allows a developer to declare pipeline units in YAML files called prescriptions. Prescriptions provide a declarative interface to the resolution process and guide the resolver. Project Thoth offers prescriptions for open source Python projects in its prescriptions repository. This repository acts as a database of knowledge about open source Python packages and is directly used during the resolution process. Anyone can contribute to this shared database by opening pull requests that get reviewed.

Prescriptions in this open database encode many kinds of knowledge, such as flagging releases with known vulnerabilities, recommending builds suited to particular hardware and warning about incompatible package combinations.
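As a rough sketch of what a prescription can look like, consider the fragment below. The field names approximate the real format but are not authoritative; consult the prescriptions repository for the exact schema, and note that the package, version range and message are invented for illustration:

```yaml
# Illustrative sketch only -- see thoth-station/prescriptions for the real schema.
units:
  wraps:
    - name: ExampleCVEWarning
      type: wrap
      should_include:
        adviser_pipeline: true
      match:
        state:
          resolved_dependencies:
            - name: example-lib
              version: "<=1.1.0"
      run:
        justification:
          - type: WARNING
            message: "example-lib <=1.1.0 is affected by a known CVE"
```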

More prescriptions can be found by browsing the prescriptions repository.

Recommendations Specific to Target Environments

As can be seen from the examples in the previous section, some prescriptions also include information outside of the Python packaging standards. For example, Python packaging standards don’t currently include information about the CUDA or cuDNN version that should be present in the environment to resolve a certain package. This information lies outside of tags used in wheels.

To allow such resolutions, Thoth’s resolver accepts information about the application’s runtime environment. As an example, Thoth’s resolver can pick a specific CUDA 11.1 build of PyTorch hosted on the PyTorch index for environments where CUDA 11.1 is available. Other dependencies required to run the application can be taken from another index, such as PyPI.

Developers can also provide information about the base container image used as a runtime environment for the given application. In such a case, the resolver can take into account analysis of the container image (which is done in the background in Thoth) and resolve requirements for the containerized environment where the application will be executed.

Thoth can additionally consider requirements for native dependencies (such as a specific RPM to be present in the runtime environment), Python packages already present in the environment or the availability of a certain ABI. As the content of the containerized environments is analyzed, the resolver can also spot possible issues or vulnerabilities in containerized environments outside of the Python ecosystem (e.g., vulnerabilities in RPM packages reported thanks to Quay Clair).

Information about the base container image is optional. If no base container image is provided, the resolver considers only information about the runtime environment supplied from the client’s configuration file. This resolution is similar to pip, Pipenv or Poetry, but it also guides users with respect to the quality of application dependencies resolved.

Safe Cross-Index Resolution without Dependency Confusion

To correctly combine different sources of Python package indexes, Thoth’s resolution process considers each package index as a separate source of packages. Python packaging tools, such as pip, treat multiple package indexes like mirrors and do not allow the developer to specify sources on package level. To avoid this limitation, developers can specify requirements on a package using hashes of artifacts, but this solution is unintuitive and error-prone. Treating multiple Python package indexes as mirrors opens the possibility for dependency confusion, which could create serious problems in supply chain security.
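Declaring an explicit source per package makes this concrete. The Pipfile fragment below is illustrative (the version pins are placeholders), but it shows the pattern: each index is declared as a named source, and a package is bound to one source instead of all indexes being treated as interchangeable mirrors:

```toml
[[source]]
name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true

[[source]]
name = "pytorch"
url = "https://download.pytorch.org/whl/cu111"
verify_ssl = true

[packages]
# torch comes only from the pytorch index; requests only from PyPI.
torch = {version = "*", index = "pytorch"}
requests = {version = "*", index = "pypi"}
```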

If Python indexes host different builds outside of the manylinux standards (such as the CUDA 11.1 builds of PyTorch mentioned earlier), the resolution process makes sure the desired source is used based on the software package aspects required, not solely based on the version numbers of available builds.

Using and Contributing to the Python Cloud Resolver

The Thoth Python cloud resolver is a community project sponsored by Red Hat. The resolver is available to the community, so anyone can use it.

The main integration points for interacting with Thoth’s resolver are the Thamos command-line interface (CLI) and jupyterlab-requirements. The CLI manages the developer’s environment and contacts Thoth’s resolver from the terminal. On the other hand, jupyterlab-requirements is an extension for Jupyter Notebook that helps manage dependencies directly in notebooks.

Check out the tutorial on how to set up your environment and use the Thoth cloud resolver. The tutorial walks through some security aspects. An extended video tutorial takes you through some key features of the resolver with a demo.

If you would like to contribute to our community, feel free to add new prescriptions that can improve the quality of Python open source. The implementation of the Python cloud resolver is open source, and the Thoth team accepts community contributions. We hope that Python programmers everywhere will find Thoth’s resolver valuable.

The New Stack is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Resolve.

Feature image via Pixabay.