Understanding the Docker Cache for Faster Builds
Using Docker on a daily basis has produced a few insights about the cache that others may find helpful. Docker caches the results of the first build of a Dockerfile, allowing subsequent builds to be much faster. That’s no secret and it is well documented. But using the cache successfully requires the engineer to understand how it works.
At each occurrence of a RUN command in the Dockerfile, Docker will create and commit a new layer to the image. A layer is just a set of tightly coupled directories containing the file structures that make up a Docker image. In a default install, these are located in /var/lib/docker.
During a new build, all of these file structures have to be created and written to disk in that location, which is also where Docker stores base images. Once created, the container (and subsequent new ones) will be stored in a folder in this same area.
What makes the cache important? If the objects on the file system that Docker is about to produce are unchanged between builds, reusing a cache of a previous build on the host is a great time-saver. It makes building a new container very fast. None of those file structures have to be created and written to disk this time; the reference to them is sufficient to locate and reuse the previously built structures.
This can be an order of magnitude faster than a fresh build. If you’re building many containers, this reduced build time means getting that container into production costs less, as measured by compute time.
However, that cache is used pretty aggressively and may cause issues when you want the updated output of a RUN command to make it into the new container. As it stands, unless the RUN command *itself* changes (and thus invalidates Docker’s on-host cache), Docker will reuse the previous results from cache. This is clearly disadvantageous when the RUN command is a source code checkout, for example a git clone that is RUN as the initial step of a project’s build.
Fortunately, there are a couple of ways around this.
An engineer can run a Docker build with the ‘--no-cache’ option, which completely ignores all cache and thus makes every build take as much time as the first.
That approach is clearly binary: the cache is used or not used. When the ‘--no-cache’ option is passed to ‘docker build…’, that build will always start from scratch, writing a new image to the file system even if nothing in the Dockerfile has changed. This is guaranteed not to reuse stale results, but will always take the maximum amount of time.
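As a small illustration of that all-or-nothing choice, the sketch below prints the two forms of the build command. The image tag ‘myapp:latest’ and the build context ‘.’ are hypothetical placeholders, and the script only echoes the commands so it can be tried on a machine without Docker installed.

```shell
#!/bin/sh
# Hypothetical image tag and build context; substitute your own.
IMAGE_TAG="myapp:latest"
BUILD_CONTEXT="."

# Print the docker build command.
# Pass "fresh" as the first argument to bypass the cache entirely.
build_command() {
    if [ "${1:-}" = "fresh" ]; then
        echo "docker build --no-cache -t $IMAGE_TAG $BUILD_CONTEXT"
    else
        echo "docker build -t $IMAGE_TAG $BUILD_CONTEXT"
    fi
}

build_command        # prints: docker build -t myapp:latest .
build_command fresh  # prints: docker build --no-cache -t myapp:latest .
```

To actually run a build rather than print it, the output of build_command could be passed to ‘sh -c’, or the echo could simply be dropped.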
To get a fresh clone while still taking advantage of the cache, the engineer could place the source control operations into the last RUN that is listed in the Dockerfile.
But in order to make this a unique command, thus ensuring it gets run each time, the engineer could wrap the Docker build in another script that generates a uniquely-numbered mini-script for the clone operation.
This step would insert the invocation of that script into the Dockerfile that is generated on-the-fly just prior to build time, such that for the operation that must be run every time – the clone – its RUN statement is indeed unique, i.e.
RUN /bin/sh /foo-1234567abc.sh
where ‘foo-1234567abc’ is uniquely generated for each build (and subsequent executions create something like ‘foo-26190def.sh’) and contains the clone operation, i.e.
cd /some/dir && /usr/bin/git clone http://some.git.server/your-repo.git
which itself may change only very infrequently, or never.
This guarantees that Docker will run the clone during each build while having the advantages of being both fully automated and ensuring that the cache is used right up to that last unique RUN.
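The wrapper described above might be sketched roughly as follows. This is one possible shape, not a definitive implementation. The build directory, the base image, the package install step, the COPY instruction used to get the script into the image, and the timestamp-plus-PID naming scheme are all assumptions made for illustration. The script generates the files and stops short of invoking the build, so it can be inspected safely.

```shell
#!/bin/sh
set -eu

# Hypothetical directory used as the Docker build context.
BUILD_DIR="${BUILD_DIR:-./docker-build}"
mkdir -p "$BUILD_DIR"

# Remove clone scripts left over from earlier runs.
rm -f "$BUILD_DIR"/foo-*.sh

# Assumed naming scheme: epoch seconds plus this shell's PID.
UNIQUE_ID="$(date +%s)-$$"
CLONE_SCRIPT="foo-${UNIQUE_ID}.sh"

# The clone operation itself stays the same; only the script's name changes.
cat > "$BUILD_DIR/$CLONE_SCRIPT" <<'EOF'
#!/bin/sh
cd /some/dir && /usr/bin/git clone http://some.git.server/your-repo.git
EOF

# Generate the Dockerfile just prior to build time. The stable steps come
# first so they stay cached; the uniquely named steps go last.
# The base image and package install are placeholders.
cat > "$BUILD_DIR/Dockerfile" <<EOF
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y git && mkdir -p /some/dir
COPY $CLONE_SCRIPT /$CLONE_SCRIPT
RUN /bin/sh /$CLONE_SCRIPT
EOF

echo "Generated $BUILD_DIR/Dockerfile ending in: RUN /bin/sh /$CLONE_SCRIPT"
# To build, run something like: docker build -t your-image "$BUILD_DIR"
```

Because the script name differs on every run, the text of the final instructions differs too, so Docker cannot match them against cached layers and re-runs the clone, while everything above them is still served from cache.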
Using these approaches, the engineer can make best use of Docker’s cache to ensure consistent, fast and accurate build results.
John Minnihan is the co-founder and CEO of Modern Repo.