Understanding Container Image Layers
Building Layered Images
To create an image, you typically use a Dockerfile, which defines the contents of the container. This file contains a series of commands, such as:
FROM scratch
RUN echo "hello" > /work/message.txt
COPY content.txt /work/content.txt
RUN rm -rf /work/message.txt
Underneath the surface, the container engine runs these commands sequentially, forming a “layer” for each one. You can visualize each layer as a directory containing all the modified files.
Let’s walk through a potential implementation method.
- FROM scratch suggests that this container begins with no contents. This first layer could be represented as an empty directory, /img/layer1.
- Create a second directory, /img/layer2, and copy everything from /img/layer1 into it. Next, execute the subsequent Dockerfile command (which writes a file to /work/message.txt). The contents are written to /img/layer2/work/message.txt, forming the second layer.
- Create a third directory, /img/layer3, and copy everything from /img/layer2 into it. The following Dockerfile command requires copying content.txt from the host to that directory. This file is written to /img/layer3/work/content.txt, creating the third layer.
- Lastly, create a fourth directory, /img/layer4, and copy everything from /img/layer3 into it. The next command deletes the message file, /img/layer4/work/message.txt, forming the fourth layer.
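The walkthrough above can be replayed with plain directory operations. The following is a minimal sketch, not how any real engine builds images: the helper names new_layer and build_example are made up for illustration, and the RUN steps are simulated with direct file operations instead of executing commands.

```python
import shutil
from pathlib import Path
from typing import List, Optional


def new_layer(parent: Optional[Path], layer: Path) -> Path:
    """Create a layer directory, starting from a full copy of its parent."""
    if parent is None:
        layer.mkdir(parents=True)          # no parent: start empty
    else:
        shutil.copytree(parent, layer)     # copy everything from the parent
    return layer


def build_example(root: Path, content_src: Path) -> List[Path]:
    """Replay the four Dockerfile steps as directory operations (hypothetical)."""
    layer1 = new_layer(None, root / "layer1")           # FROM scratch

    layer2 = new_layer(layer1, root / "layer2")         # RUN echo "hello" > ...
    (layer2 / "work").mkdir()
    (layer2 / "work" / "message.txt").write_text("hello\n")

    layer3 = new_layer(layer2, root / "layer3")         # COPY content.txt ...
    shutil.copy(content_src, layer3 / "work" / "content.txt")

    layer4 = new_layer(layer3, root / "layer4")         # RUN rm -rf ...
    (layer4 / "work" / "message.txt").unlink()

    return [layer1, layer2, layer3, layer4]
```

Calling build_example(Path("/img"), Path("content.txt")) would produce the four directories from the list, each one a complete copy of its parent plus one change.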
To share these layers, the simplest method is to create a compressed .tar.gz for each directory. To minimize total file size, any files that are unchanged copies of data from a previous layer would be removed. A “whiteout file” can be used as a placeholder to indicate when a file was deleted. This file would simply prefix .wh. to the original filename. For instance, the fourth layer would replace the deleted file with a placeholder named .wh.message.txt. When unpacking a layer, any file starting with .wh. marks a deletion: the file it names is removed from the unpacked directory, and the placeholder itself is discarded.
Continuing our example, the compressed files would contain:
File | Contents |
---|---|
layer1.tar.gz | Empty file |
layer2.tar.gz | Contains /work/message.txt |
layer3.tar.gz | Contains /work/content.txt (since message.txt was not modified) |
layer4.tar.gz | Contains /work/.wh.message.txt (since message.txt was deleted). The file content.txt was not modified, so it is not included. |
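To make the packaging step concrete, here is a rough sketch that compares a layer directory with its parent, keeps only new or modified files, and adds an empty .wh. placeholder for each deleted file. The function names are hypothetical, and the sketch ignores directories, symlinks, permissions, and other metadata that a real image format has to handle.

```python
import filecmp
import tarfile
from pathlib import Path


def relative_files(root: Path) -> set:
    """All regular files under root, as POSIX-style relative paths."""
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def layer_diff(parent: Path, layer: Path):
    """Return (changed, whiteouts) for a layer relative to its parent."""
    before, after = relative_files(parent), relative_files(layer)
    # New files, plus files whose bytes differ from the parent's copy.
    changed = sorted(
        f for f in after
        if f not in before
        or not filecmp.cmp(str(parent / f), str(layer / f), shallow=False)
    )
    # Deleted files become ".wh.<name>" placeholders in the same directory.
    whiteouts = sorted(
        (Path(f).parent / (".wh." + Path(f).name)).as_posix()
        for f in before - after
    )
    return changed, whiteouts


def write_layer(parent: Path, layer: Path, archive: Path) -> None:
    """Write the diff between parent and layer as a .tar.gz archive."""
    changed, whiteouts = layer_diff(parent, layer)
    with tarfile.open(str(archive), "w:gz") as tar:
        for name in changed:
            tar.add(str(layer / name), arcname=name)
        for name in whiteouts:
            tar.addfile(tarfile.TarInfo(name))   # zero-length placeholder
```

Run against the layer3 and layer4 directories from the example, layer_diff reports no changed files and a single whiteout, work/.wh.message.txt, matching the last row of the table.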
Creating numerous images in this way would result in many “layer1” directories. To ensure uniqueness, the compressed file is named according to a digest of its contents, much as Git names objects by a hash of their content. This approach helps identify identical content and detect any corruption in the files during download. If the digest doesn’t match the file name, the file is corrupt.
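Content-based naming is easy to sketch with a standard hash function. The example below assumes SHA-256 (the algorithm that appears in the digests later in this article); store_blob and is_intact are illustrative names, not part of any container tool.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(str(path), "rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_blob(archive: Path, blob_dir: Path) -> Path:
    """Move an archive into blob_dir, named after its own digest."""
    blob_dir.mkdir(parents=True, exist_ok=True)
    target = blob_dir / sha256_of(archive)
    if target.exists():
        archive.unlink()                 # identical content is already stored
    else:
        shutil.move(str(archive), str(target))
    return target


def is_intact(blob: Path) -> bool:
    """A blob is valid only if its content still hashes to its file name."""
    return sha256_of(blob) == blob.name
```

Two archives with identical bytes end up at the same path, which is what lets identical layers be stored once and lets a download be verified by name alone.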
To make the results reproducible, a manifest is needed. This file outlines how to arrange the layers and indicates which files should be downloaded and the order in which they should be unpacked. This process allows for the recreation of directory structures. Moreover, it facilitates the reuse and sharing of layers between images, reducing local storage requirements.
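As an illustration only, a manifest can be thought of as an ordered list of layer digests. The JSON shape below is a deliberately simplified assumption (real manifests carry more fields, such as media types and sizes), and both function names are hypothetical.

```python
import json
from typing import Iterable, List


def make_manifest(layer_digests: List[str]) -> str:
    """Serialize layers in unpack order: base layer first, top layer last."""
    layers = [{"digest": "sha256:" + d} for d in layer_digests]
    return json.dumps({"layers": layers}, indent=2)


def missing_layers(manifest_json: str, local_digests: Iterable[str]) -> List[str]:
    """Digests listed in the manifest that are not stored locally, in order."""
    have = set(local_digests)
    wanted = [entry["digest"] for entry in json.loads(manifest_json)["layers"]]
    return [d for d in wanted if d not in have]
```

Because the list is ordered, the same manifest always yields the same directory structure, and any layer already present locally can be skipped rather than downloaded again.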
In practice, further optimizations are possible. For instance, FROM scratch implies no parent layer, so our example really starts with the contents of layer2. The engine can inspect the files used in the build to determine if a layer needs to be recreated, forming the basis for layer caching. This minimizes the need to build or recreate layers. As an additional optimization, steps that don’t depend on the previous layer can use COPY --link to indicate that the layer won’t need to delete or modify any files from the previous layer. This allows the compressed layer file to be created in parallel with the other steps.
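One way to picture layer caching is as a key derived from everything a step depends on. The sketch below is an assumption about how such a key could be built, not a description of any particular build engine: it hashes the parent step's key, the instruction text, and the contents of the files the step reads.

```python
import hashlib
from typing import Dict


def cache_key(parent_key: str, instruction: str, input_files: Dict[str, bytes]) -> str:
    """Hypothetical cache key for one build step.

    input_files maps each path the step reads to that file's bytes.
    """
    h = hashlib.sha256()
    h.update(parent_key.encode())
    h.update(b"\0")
    h.update(instruction.encode())
    for path in sorted(input_files):            # sorted for a stable key
        h.update(b"\0" + path.encode() + b"\0")
        h.update(hashlib.sha256(input_files[path]).digest())
    return h.hexdigest()
```

If the key for a step matches one seen in an earlier build, the existing layer can be reused. Any change to the parent, the instruction, or an input file produces a different key and forces that layer (and everything after it) to be rebuilt.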
Snapshots
A container requires a file system to mount before it can run. Essentially, this means it needs a directory containing all necessary files. Although the compressed layer files house the components of this file system, they can’t be directly mounted and used. Instead, they must be unpacked and structured into a file system, creating what is known as a snapshot.
Creating a snapshot is the reverse of image building. The process begins with downloading the manifest and compiling a list of layers to download. For each layer, a directory, known as the active snapshot, is created containing the contents of the layer’s parent. Following this, a diff applier unpacks the compressed layer file and applies the changes to the active snapshot. The resulting directory, referred to as a committed snapshot, is the final version that is mounted as the container’s file system.
Using our earlier example:
- The initial layer, FROM scratch, implies we start with an empty directory and move on to the next layer. There is no parent.
- A directory for layer2 is created. This empty directory is now an active snapshot. The file layer2.tar.gz is downloaded, validated (by comparing the digest to the filename), and unpacked into the directory. The result is a directory containing /work/message.txt. This becomes the first committed snapshot.
- A directory for layer3 is created, and the contents of layer2 are copied into it. This is a new active snapshot. The file layer3.tar.gz is downloaded, validated, and unpacked. The result is a directory containing /work/message.txt and /work/content.txt. This is the second committed snapshot.
- A directory for layer4 is created, and the contents of layer3 are copied into it. The file layer4.tar.gz is downloaded, validated, and unpacked. The diff applier recognizes the whiteout file, /work/.wh.message.txt, and deletes /work/message.txt, leaving just /work/content.txt. This is the third committed snapshot.
- Since layer4 was the last layer, it serves as the basis for a container. To support read and write operations, a new snapshot directory is created and the contents of layer4 are copied into it. This directory is mounted as the container’s file system. Any changes made by the running container will occur in this directory.
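The steps above amount to a small “diff applier”. The sketch below is a simplified illustration with a hypothetical apply_layer helper: it copies the parent into a new active snapshot, unpacks the archive, and treats any .wh. entry as a deletion. Downloading and digest validation are left out, and a real implementation would also need to reject archive members whose paths escape the snapshot directory.

```python
import shutil
import tarfile
from pathlib import Path
from typing import Optional

WHITEOUT_PREFIX = ".wh."


def apply_layer(archive: Path, parent: Optional[Path], snapshot: Path) -> Path:
    """Create a snapshot from a parent snapshot plus one layer archive."""
    # Active snapshot: an empty directory, or a full copy of the parent.
    if parent is None:
        snapshot.mkdir(parents=True)
    else:
        shutil.copytree(str(parent), str(snapshot))

    with tarfile.open(str(archive), "r:gz") as tar:
        for member in tar.getmembers():
            path = Path(member.name)
            if path.name.startswith(WHITEOUT_PREFIX):
                # Whiteout: delete the named file instead of extracting.
                victim = snapshot / path.parent / path.name[len(WHITEOUT_PREFIX):]
                if victim.is_dir():
                    shutil.rmtree(str(victim))
                elif victim.exists():
                    victim.unlink()
            else:
                tar.extract(member, path=str(snapshot))
    return snapshot   # now a committed snapshot
```

Applying layer2, layer3, and layer4 in order, each with the previous result as its parent, reproduces the three committed snapshots described in the list.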
If any of these directories already exist, it suggests that another image had the same dependency. Therefore, the engine can skip the download and diff applier and use the layer as-is. In practice, each of these directories and files is named based on the digest of the contents for easier identification. For example, a set of snapshots might look like this:
/var/path/to/snapshots/blobs
└─ sha256
├─ 635944d2044d0a54d01385271ebe96ec18b26791eb8b85790974da36a452cc5c
├─ 9de59f6b211510bd59d745a5e49d7aa0db263deedc822005ed388f8d55227fc1
├─ fb0624e7b7cb9c912f952dd30833fb2fe1109ffdbcc80d995781f47bd1b4017f
└─ fb124ec4f943662ecf7aac45a43b096d316f1a6833548ec802226c7b406154e9
or alternatively:
Image | Parent |
---|---|
sha256:635944d2044d0a54d01385271ebe96ec18b26791eb8b85790974da36a452cc5c | |
sha256:9de59f6b211510bd59d745a5e49d7aa0db263deedc822005ed388f8d55227fc1 | sha256:635944d2044d0a54d01385271ebe96ec18b26791eb8b85790974da36a452cc5c |
sha256:fb0624e7b7cb9c912f952dd30833fb2fe1109ffdbcc80d995781f47bd1b4017f | sha256:9de59f6b211510bd59d745a5e49d7aa0db263deedc822005ed388f8d55227fc1 |
sha256:fb124ec4f943662ecf7aac45a43b096d316f1a6833548ec802226c7b406154e9 | sha256:fb0624e7b7cb9c912f952dd30833fb2fe1109ffdbcc80d995781f47bd1b4017f |
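The parent column is enough to recover the order in which snapshots must be prepared. As a small sketch (the unpack_order function and the dictionary representation are assumptions made for illustration), walking from the top snapshot back to the one with no parent and reversing the result gives the base-first order:

```python
from typing import Dict, List, Optional


def unpack_order(parents: Dict[str, Optional[str]], top: str) -> List[str]:
    """Follow parent links from the top snapshot down to the base.

    parents maps a snapshot digest to its parent digest; the base snapshot
    is either absent from the mapping or mapped to None.
    """
    chain: List[str] = []
    current: Optional[str] = top
    while current is not None:
        if current in chain:
            raise ValueError("cycle in parent links at " + current)
        chain.append(current)
        current = parents.get(current)
    chain.reverse()          # base first, top last
    return chain
```

With the table above expressed as a dictionary, calling unpack_order with the last digest as top returns the four digests in the same order as the table rows.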
The current snapshot system supports plugins that enhance certain functions. For instance, plugins can enable snapshots to be pre-composed and unpacked, which accelerates the process. This feature also allows snapshots to be stored remotely. Moreover, it enables specialized optimizations like just-in-time downloading of necessary files and layers.
Overlays
Although mounting snapshots is straightforward, it often leads to a lot of file turnover and duplication. This not only slows down the initial container start-up but also wastes space. Fortunately, the file system can manage many aspects of the containerization process. For instance, Linux natively supports the mounting of directories as overlays, thereby simplifying the process considerably.
In Linux (or a Linux container running as --privileged or with --cap-add=SYS_ADMIN):
- Create a tmpfs mount (a memory-based file system that will be used to explore the overlay process):

      mkdir /tmp/overlay
      mount -t tmpfs tmpfs /tmp/overlay
- Create directories for our process. We’ll use lower for the lower (parent) layer, upper for the upper (child) layer, work as a working directory for the file system, and merged to contain the merged file system.

      mkdir /tmp/overlay/{lower,upper,work,merged}
- Create some files for the experiment. Optionally, you can add files in upper as well.

      cd /tmp/overlay
      echo hello > lower/hello.txt
      echo "I'm only here for a moment" > lower/delete-me.txt
      echo message > upper/upper-message.txt
- Mount these directories as an overlay type file system. This will create a new file system in the merged directory that contains the combined contents of the lower and upper directories. The work directory will be used to track changes to the file system.

      mount -t overlay overlay -o lowerdir=lower,upperdir=upper,workdir=work merged
- Explore the file system. You’ll notice that merged contains the combined contents of upper and lower. Then, make some changes:

      rm -rf merged/delete-me.txt
      echo "I'm new" > merged/new.txt
      echo world >> merged/hello.txt
- As expected, delete-me.txt is removed from merged and a new file, new.txt, is created in the same directory. If you tree the directories, you’ll see something interesting:

      |-- lower
      |   |-- delete-me.txt
      |   `-- hello.txt
      |-- merged
      |   |-- hello.txt
      |   |-- new.txt
      |   `-- upper-message.txt
      |-- upper
      |   |-- delete-me.txt
      |   |-- hello.txt
      |   |-- new.txt
      |   `-- upper-message.txt

  And running ls -l upper shows:

      total 12
      c--------- 2 root root 0, 0 Jan 20 00:17 delete-me.txt
      -rw-r--r-- 1 root root 12 Jan 20 00:20 hello.txt
      -rw-r--r-- 1 root root 8 Jan 20 00:17 new.txt
      -rw-r--r-- 1 root root 8 Jan 20 00:17 upper-message.txt
While merged displays the effects of our changes, upper stores the changes themselves, much like the layer directories in our manual process. It includes the new file new.txt and the modified hello.txt. A whiteout file has also been created. The overlay filesystem represents the deletion by replacing the file with a character device that has a 0, 0 device number. Simply put, it has everything we need to package up the directories!
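A packaging tool would need to recognize those character devices when walking upper. The check can be sketched with the standard os and stat modules; the function names are hypothetical, and the predicate takes the raw mode and device values so that it can be exercised without creating device nodes (which requires elevated privileges).

```python
import os
import stat
from typing import List


def is_overlay_whiteout(mode: int, rdev: int) -> bool:
    """True for a character device whose device number is 0, 0."""
    return stat.S_ISCHR(mode) and os.major(rdev) == 0 and os.minor(rdev) == 0


def find_whiteouts(upper: str) -> List[str]:
    """Relative paths of all whiteout entries beneath an upper directory."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(upper):
        for name in filenames:               # device nodes are listed here too
            full = os.path.join(dirpath, name)
            st = os.lstat(full)              # lstat: do not follow symlinks
            if is_overlay_whiteout(st.st_mode, st.st_rdev):
                found.append(os.path.relpath(full, upper))
    return sorted(found)
```

Run against the upper directory from the experiment, find_whiteouts should report delete-me.txt, which a packaging step could then translate into a .wh.delete-me.txt placeholder in the layer archive.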
This approach could also be utilized to implement a snapshot system. The mount command can natively accept a colon (:) delimited list of lowerdir paths, which are all unioned together into a single filesystem. This is inherent to modern containers: they are assembled using native operating system features.
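As a sketch of how a snapshot system might assemble that option string, the hypothetical helper below takes the layer directories in base-first order and reverses them, because overlayfs treats the leftmost lowerdir entry as the topmost layer. The directory paths in the usage note are made up, and the helper simply rejects paths containing a colon or comma instead of escaping them.

```python
from typing import List


def overlay_options(lower_dirs: List[str], upper: str, work: str) -> str:
    """Build the -o option string for an overlay mount.

    lower_dirs is ordered base layer first, top layer last.
    """
    if not lower_dirs:
        raise ValueError("at least one lower directory is required")
    for path in list(lower_dirs) + [upper, work]:
        if ":" in path or "," in path:
            raise ValueError("unsupported character in path: " + path)
    # overlayfs stacks lowerdir entries right to left: leftmost is on top.
    lowerdir = ":".join(reversed(lower_dirs))
    return "lowerdir={},upperdir={},workdir={}".format(lowerdir, upper, work)
```

For example, overlay_options(["/snap/layer2", "/snap/layer3", "/snap/layer4"], "/run/upper", "/run/work") yields lowerdir=/snap/layer4:/snap/layer3:/snap/layer2,upperdir=/run/upper,workdir=/run/work, which could be passed to mount -t overlay overlay -o. With this arrangement the committed snapshots stay read-only and shared, while each container gets its own upper directory for writes.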
That’s essentially all there is to creating a basic system. In fact, the containerd runtime, used by Kubernetes and the recently released Docker Desktop 4.27.0, employs a similar approach to build and manage its images, with the in-depth details covered in Content Flow (https://github.com/containerd/containerd/blob/main/docs/content-flow.md). Hopefully, this has helped to demystify how container images function!
Reference: https://www.kenmuse.com/blog/understanding-container-image-layers/