This started with a consulting snafu: Government organisation A got government organisation B to develop a web application. Government organisation B subcontracted part of the work to somebody. Hosting and maintenance of the project was later contracted out to a private-sector company C. Company C discovered that the subcontracted somebody (who was long gone) had built a custom Docker image and made it a dependency of the build system, but without committing the original Dockerfile. That left company C with a contractual obligation to manage a Docker image they had no source code for. Company C calls me in once in a while to do various things, so doing something about this mystery meat Docker image became my job.
Fortunately, the Docker image format is a lot more transparent than it could be. A little detective work is needed, but a lot can be figured out just by pulling apart an image file. As an example, here’s a quick walkthrough of an image for the Prettier code formatter. (In fact, it’s so easy, there’s a tool for it. Thanks Ezequiel Gonzalez Rial.)
First let’s get the Docker daemon to pull the image, then extract the image to a file:
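The commands for this step look something like the following. The repository and tag here are placeholders, since the walkthrough image's exact name isn't something I'll vouch for:

```shell
# Placeholder repository/tag; substitute the image you're investigating
docker pull example/prettier:latest
# docker save serialises the whole image (all layers plus metadata) to a tarball
docker save example/prettier:latest -o prettier.tar
```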
Yes, the file is just an archive in the classic tarball format:
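Assuming the saved image file is called prettier.tar (my name for it), a plain tar listing is enough to see the structure:

```shell
# List the top-level contents: manifest.json plus hash-named layer directories
tar -tf prettier.tar
```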
As you can see, Docker uses hashes a lot for naming things. Let's have a look at the manifest.json. It's in hard-to-read compacted JSON, but jq, the JSON Swiss Army knife, can pretty-print it for us:
Note that the three layers correspond to the three hash-named directories. We'll look at them later. For now, let's look at the JSON file pointed to by the Config key. It's a little long, so I'll just dump the first bit here:
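jq can chase the Config key for us (prettier/ being my assumed unpacking directory):

```shell
# manifest.json's Config key names the image config JSON inside the archive
config="$(jq -r '.[0].Config' prettier/manifest.json)"
jq . "prettier/$config" | head -n 30
```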
The most interesting part is the history list, which lists every single layer in the image. A Docker image is a stack of these layers. Almost every statement in a Dockerfile turns into a layer that describes the changes to the image made by that statement. If you have a RUN script.sh statement that creates really_big_file that you then delete with RUN rm really_big_file, you actually get two layers in the Docker image: one that contains really_big_file, and one that contains a .wh.really_big_file tombstone to cancel it out. The overall image file isn't any smaller. That's why you often see Dockerfile statements chained together like RUN script.sh && rm really_big_file: it ensures all the changes are coalesced into one layer.
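As a sketch, these two hypothetical Dockerfile fragments produce the same final filesystem, but only the second keeps the big file out of the shipped image:

```dockerfile
# Two layers: one contains really_big_file, the next just whiteouts it.
# Both layers ship in the image.
RUN ./script.sh
RUN rm really_big_file

# One coalesced layer: really_big_file never appears in any shipped layer
RUN ./script.sh && rm really_big_file
```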
Here are all the layers recorded in the Docker image. Notice that most layers don't change the filesystem image and are marked "empty_layer": true. Only three are non-empty, which matches up with what we saw before.
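jq can filter the history down to just the layer-creating statements (prettier/ again being my assumed unpacking directory):

```shell
# Print created_by for every history entry that isn't an empty layer
config="$(jq -r '.[0].Config' prettier/manifest.json)"
jq -r '.history[] | select(.empty_layer | not) | .created_by' "prettier/$config"
```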
Fantastic! All the statements are right there in the created_by fields, so we can almost reconstruct the Dockerfile just from this. Almost. The ADD statement at the very top doesn't actually give us the file we need to ADD. COPY statements are also going to be opaque. We also lose FROM statements because they expand out to all the layers inherited from the base Docker image.
We can group the layers by Dockerfile by looking at the timestamps. Most layer timestamps are under a minute apart, representing how long each layer took to build. However, the first two layers are from 2020-04-24, and the rest of the layers are from 2020-04-29. That would be because the first two layers come from a base Docker image. Ideally we'd figure out a FROM statement that gets us that image, so that we have a maintainable Dockerfile.
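The timestamps come straight out of the same history list, so the 2020-04-24 / 2020-04-29 split is easy to eyeball (prettier/ being my assumed unpacking directory):

```shell
# Print the created timestamp of every layer in order
config="$(jq -r '.[0].Config' prettier/manifest.json)"
jq -r '.history[].created' "prettier/$config"
```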
The manifest.json says that the first non-empty layer is a9cc4ace48cd792ef888ade20810f82f6c24aaf2436f30337a2a712cd054dc97/layer.tar. Let's take a look:
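It's just more tar (paths relative to my assumed prettier/ unpacking directory):

```shell
# List the first few entries of the first non-empty layer
tar -tf prettier/a9cc4ace48cd792ef888ade20810f82f6c24aaf2436f30337a2a712cd054dc97/layer.tar | head -n 20
```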
Okay, that looks like it might be an operating system base image, which is what you’d expect from a typical Dockerfile. There are 488 entries in the tarball, and if you scroll through them, some interesting ones stand out:
...
dev/
etc/
etc/alpine-release
etc/apk/
etc/apk/arch
etc/apk/keys/
etc/apk/keys/alpine-devel@lists.alpinelinux.org-4a6a0840.rsa.pub
etc/apk/keys/alpine-devel@lists.alpinelinux.org-5243ef4b.rsa.pub
etc/apk/keys/alpine-devel@lists.alpinelinux.org-5261cecb.rsa.pub
etc/apk/protected_paths.d/
etc/apk/repositories
etc/apk/world
etc/conf.d/
...
Sure enough, it's an Alpine image, which you might have guessed if you noticed that the other layers used an apk command to install packages. Let's extract the tarball and look around:
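Something like this (the layer path again relative to my assumed prettier/ unpacking directory):

```shell
# Unpack the base layer into its own directory and check the release file
mkdir layer0
tar -xf prettier/a9cc4ace48cd792ef888ade20810f82f6c24aaf2436f30337a2a712cd054dc97/layer.tar -C layer0
cat layer0/etc/alpine-release
```

Printing etc/alpine-release should tell us exactly which Alpine release the base image was built from.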
If you pull alpine:3.11.6 and extract it, you'll find that there's one non-empty layer inside it, and that layer.tar is identical to the layer.tar in the base layer of the Prettier image.
Just for the heck of it, what’s in the other two non-empty layers? The second layer is the main layer containing the Prettier installation. It has 528 entries, including Prettier, a bunch of dependencies and certificate updates:
...
usr/lib/libuv.so.1
usr/lib/libuv.so.1.0.0
usr/lib/node_modules/
usr/lib/node_modules/prettier/
usr/lib/node_modules/prettier/LICENSE
usr/lib/node_modules/prettier/README.md
usr/lib/node_modules/prettier/bin-prettier.js
usr/lib/node_modules/prettier/doc.js
usr/lib/node_modules/prettier/index.js
usr/lib/node_modules/prettier/package.json
usr/lib/node_modules/prettier/parser-angular.js
usr/lib/node_modules/prettier/parser-babel.js
usr/lib/node_modules/prettier/parser-flow.js
usr/lib/node_modules/prettier/parser-glimmer.js
usr/lib/node_modules/prettier/parser-graphql.js
usr/lib/node_modules/prettier/parser-html.js
usr/lib/node_modules/prettier/parser-markdown.js
usr/lib/node_modules/prettier/parser-postcss.js
usr/lib/node_modules/prettier/parser-typescript.js
usr/lib/node_modules/prettier/parser-yaml.js
usr/lib/node_modules/prettier/standalone.js
usr/lib/node_modules/prettier/third-party.js
usr/local/
usr/local/share/
usr/local/share/ca-certificates/
usr/sbin/
usr/sbin/update-ca-certificates
usr/share/
usr/share/ca-certificates/
usr/share/ca-certificates/mozilla/
usr/share/ca-certificates/mozilla/ACCVRAIZ1.crt
usr/share/ca-certificates/mozilla/AC_RAIZ_FNMT-RCM.crt
usr/share/ca-certificates/mozilla/Actalis_Authentication_Root_CA.crt
...
The third layer is created by the WORKDIR /work statement, and it contains exactly one entry:
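The layer paths in manifest.json make it easy to check (prettier/ once more being my assumed unpacking directory); the single entry should just be the work/ directory itself:

```shell
# The third entry in the manifest's Layers array is the WORKDIR layer
layer3="$(jq -r '.[0].Layers[2]' prettier/manifest.json)"
tar -tf "prettier/$layer3"
```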