This started with a consulting snafu: Government organisation A got government organisation B to develop a web application. Government organisation B subcontracted part of the work to somebody. Hosting and maintenance of the project was later contracted out to a private-sector company C. Company C discovered that the subcontracted somebody (who was long gone) had built a custom Docker image and made it a dependency of the build system, but without committing the original Dockerfile. That left company C with a contractual obligation to manage a Docker image they had no source code for. Company C calls me in once in a while to do various things, so doing something about this mystery meat Docker image became my job.
Fortunately, the Docker image format is a lot more transparent than it could be. A little detective work is needed, but a lot can be figured out just by pulling apart an image file. As an example, here’s a quick walkthrough of an image for the Prettier code formatter.
First let’s get the Docker daemon to pull the image, then extract the image to a file:
Yes, the file is just an archive in the classic tarball format:
As you can see, Docker uses hashes a lot for naming things. Let’s have a look at the
manifest.json. It’s in hard-to-read compacted JSON, but the
jq JSON Swiss Army knife
can pretty print it for us:
Note that the three layers correspond to the three hash-named directories. We’ll look at them later. For now, let’s
look at the JSON file pointed to by the
Config key. It’s a
little long, so I’ll just dump the first bit here:
The most interesting part is the
history list, which lists
every single layer in the image. A Docker image is a stack of these layers. Almost every statement in a Dockerfile
turns into a layer that describes the changes to the image made by that statement. If you have a
RUN script.sh statement that creates
really_big_file that you then delete with
RUN rm really_big_file, you actually get two layers in the Docker image:
one that contains
really_big_file, and one that contains a
.wh.really_big_file tombstone to cancel it out. The overall
image file isn’t any smaller. That’s why you often see Dockerfile statements chained together like
RUN script.sh && rm really_big_file — it ensures all changes are
coalesced into one layer.
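As a toy illustration (generate.sh is a made-up script, not anything from this image), these two variants leave the same final filesystem but produce different archives:

```dockerfile
# Two layers: the archive stores big_file once, plus a tombstone deleting it
RUN ./generate.sh big_file
RUN rm big_file

# One layer: big_file never appears in the image archive at all
RUN ./generate.sh big_file && rm big_file
```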
Here are all the layers recorded in the Docker image. Notice that most layers don’t change the filesystem image and are marked
"empty_layer": true. Only three are non-empty,
which matches up with what we saw before.
Fantastic! All the statements are right there in the
created_by fields, so we can almost reconstruct the Dockerfile just from
this. Almost. The
ADD statement at the very top doesn’t actually give us the file that was added, and
COPY statements are also going to be opaque. We also lose
FROM statements because they expand out to all the layers inherited from
the base Docker image.
We can group the layers by their originating Dockerfile by looking at the timestamps. Most layer timestamps are under a minute apart,
reflecting how long each layer took to build. However, the first two layers are from
2020-04-24, and the rest of the layers are from
2020-04-29. This would be because the first two layers are from a base
Docker image. Ideally we’d figure out a
FROM statement that
gets us that image, so that we have a maintainable Dockerfile.
manifest.json tells us which hash-named directory holds the first non-empty layer’s layer.tar. Let’s take a look:
Okay, that looks like it might be an operating system base image, which is what you’d expect from a typical Dockerfile. There are 488 entries in the tarball, and if you scroll through them, some interesting ones stand out:
...
dev/
etc/
etc/alpine-release
etc/apk/
etc/apk/arch
etc/apk/keys/
email@example.com
firstname.lastname@example.org
email@example.com
etc/apk/protected_paths.d/
etc/apk/repositories
etc/apk/world
etc/conf.d/
...
Sure enough, it’s an Alpine image, which you might have guessed if you
noticed that the other layers used an
apk command to install
packages. Let’s extract the tarball and look around:
If you pull
alpine:3.11.6 and extract it, you’ll find that
there’s one non-empty layer inside it, and its layer.tar is
identical to the
layer.tar in the base layer of the Prettier image.
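You can verify that yourself by saving the official Alpine image the same way and comparing checksums; a sketch, assuming the Prettier image is already extracted under prettier/ and the Alpine one goes under alpine/:

```shell
docker pull alpine:3.11.6
docker save alpine:3.11.6 -o alpine.tar
mkdir -p alpine
tar -xf alpine.tar -C alpine

# The two layer.tar files should have identical checksums
sha256sum "alpine/$(jq -r '.[0].Layers[0]' alpine/manifest.json)" \
          "prettier/$(jq -r '.[0].Layers[0]' prettier/manifest.json)"
```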
Just for the heck of it, what’s in the other two non-empty layers? The second layer is the main layer containing the Prettier installation. It has 528 entries, including Prettier, a bunch of dependencies and certificate updates:
...
usr/lib/libuv.so.1
usr/lib/libuv.so.1.0.0
usr/lib/node_modules/
usr/lib/node_modules/prettier/
usr/lib/node_modules/prettier/LICENSE
usr/lib/node_modules/prettier/README.md
usr/lib/node_modules/prettier/bin-prettier.js
usr/lib/node_modules/prettier/doc.js
usr/lib/node_modules/prettier/index.js
usr/lib/node_modules/prettier/package.json
usr/lib/node_modules/prettier/parser-angular.js
usr/lib/node_modules/prettier/parser-babel.js
usr/lib/node_modules/prettier/parser-flow.js
usr/lib/node_modules/prettier/parser-glimmer.js
usr/lib/node_modules/prettier/parser-graphql.js
usr/lib/node_modules/prettier/parser-html.js
usr/lib/node_modules/prettier/parser-markdown.js
usr/lib/node_modules/prettier/parser-postcss.js
usr/lib/node_modules/prettier/parser-typescript.js
usr/lib/node_modules/prettier/parser-yaml.js
usr/lib/node_modules/prettier/standalone.js
usr/lib/node_modules/prettier/third-party.js
usr/local/
usr/local/share/
usr/local/share/ca-certificates/
usr/sbin/
usr/sbin/update-ca-certificates
usr/share/
usr/share/ca-certificates/
usr/share/ca-certificates/mozilla/
usr/share/ca-certificates/mozilla/ACCVRAIZ1.crt
usr/share/ca-certificates/mozilla/AC_RAIZ_FNMT-RCM.crt
usr/share/ca-certificates/mozilla/Actalis_Authentication_Root_CA.crt
...
The third layer is created by the
WORKDIR /work statement,
and it contains exactly one entry.