Docker basics | What is a container?
Containers are one of the things that fueled the DevOps and Agile movements, and being able to run Docker containers has become a basic skill. But even though everyone is familiar with Docker, very few people understand its amazingly simple core principles. This article is going to fix that.
Docker has made the use of containers really easy. The downside is that it hides what is actually going on within our system, which often leads to misconceptions and security issues. So let’s take another perspective and check out what is going on inside the kernel when we run containers.
The first thing we have to understand is that the Linux kernel does not know that it is running containers; it has no notion of a "container" at all. Containers simply combine some newer kernel features, like namespaces and cgroups, that were developed in the two decades from 2000 to 2020.
What makes up a container is basically three things: isolation, security and resources.
The most important thing is isolation. A container is only able to see what it contains (hence the name container). It cannot go beyond its boundaries. These boundaries are called kernel namespaces. As of today there are eight different namespaces in the Linux kernel: the mount namespace, PID namespace, network namespace, IPC namespace, UTS namespace, user namespace, cgroup namespace and time namespace. Depending on which kernel you use, the time namespace may be missing. It is the newest one on that list and was introduced with Linux 5.6.
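To see which of these namespaces a process currently lives in, we can look at the symlinks under /proc/self/ns. The following is a minimal sketch in Python, assuming a Linux system with /proc mounted; the function names parse_ns_link and current_namespaces are made up for this illustration. On other systems it simply returns an empty result.

```python
import os
from typing import Dict, Tuple

NS_DIR = "/proc/self/ns"  # Linux-only; each entry is a symlink such as "pid:[4026531836]"


def parse_ns_link(target: str) -> Tuple[str, int]:
    """Split a link target like 'pid:[4026531836]' into ('pid', 4026531836)."""
    kind, _, rest = target.partition(":")
    return kind, int(rest.strip("[]"))


def current_namespaces(ns_dir: str = NS_DIR) -> Dict[str, int]:
    """Map each namespace entry of the calling process to its inode number."""
    result: Dict[str, int] = {}
    try:
        names = sorted(os.listdir(ns_dir))
    except OSError:
        return result  # not Linux, or /proc is not mounted
    for name in names:
        try:
            _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        except (OSError, ValueError):
            continue  # skip entries we cannot read or parse
        result[name] = inode
    return result


if __name__ == "__main__":
    for name, inode in current_namespaces().items():
        print(f"{name:20} {inode}")
```

Two processes whose entries show the same inode number for a given type share that namespace. Running the script once on the host and once inside a container should therefore print different numbers for the namespaces the container runtime has unshared.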
Let’s take the PID namespace as an example to examine more closely how a namespace works. Whenever we boot our machine, the kernel creates a list of processes, starting with the init process, which gets PID 1. For each process that list holds things like the owner UID and GID, runtime, memory allocation, priority, linked libraries, open files and so on.
Whenever we start a process in its own PID namespace, the kernel creates an additional list with its own numbering, which starts with PID 1 again. It is important to note that the new list exists in addition to the original one and does not replace it. The same process still shows up on the main list, where it may have PID 20341, for instance. So the process has a different PID in each namespace.
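The kernel exposes these parallel PIDs in the NSpid line of /proc/<pid>/status (available since Linux 4.1). The sketch below parses that line; it assumes a Linux /proc, and the helper names parse_nspid and read_own_nspid are hypothetical. The values are listed from the PID namespace of the procfs mount inwards, so a process looked at from the host might show something like "NSpid: 20341 1".

```python
from typing import List


def parse_nspid(status_text: str) -> List[int]:
    """Return the PIDs from the 'NSpid:' line, outermost visible namespace first."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []  # older kernels do not expose this field


def read_own_nspid(path: str = "/proc/self/status") -> List[int]:
    """Read the NSpid entry of the calling process; empty list if unavailable."""
    try:
        with open(path) as handle:
            return parse_nspid(handle.read())
    except OSError:
        return []  # not Linux, or /proc is not mounted


if __name__ == "__main__":
    print("PID per namespace:", read_own_nspid())
```

A process that is not in a nested PID namespace has a single entry. To inspect a containerized process from the host, the same parser can be pointed at /proc/<host-pid>/status instead of /proc/self/status.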
For more detailed explanations see
man 7 namespaces
Containers can be more secure than virtual machines if done right. But if done wrong they can also be a great threat to security. As an example of how containers are secured, let’s assume we want to change the time setting within our container. The wall clock (CLOCK_REALTIME) is not namespaced, so that change is a global one for the entire machine and all containers running on that node. The way to prevent this is to disallow the system calls that set the clock, such as clock_settime and settimeofday.
Thus, containers should run with a limited set of capabilities. Granting the CAP_SYS_TIME capability is something we should avoid. A time namespace does not change this: it only virtualizes the monotonic and boot-time clocks (CLOCK_MONOTONIC and CLOCK_BOOTTIME), not the wall clock. And a word of caution: the mere fact that we use a kernel that provides a time namespace does not necessarily mean our containers use it.
Setting capabilities is always a tradeoff. On the one hand, if a container has too few capabilities, it may be completely useless for our application. On the other hand, if we grant too many capabilities, we can create a loophole that allows an attacker to break out of the container.
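To check which capabilities a process actually holds, we can decode the CapEff bitmask from /proc/self/status. This is a minimal sketch, assuming a Linux /proc; the helper names parse_cap_eff and has_capability are invented for this example, and the bit number 25 for CAP_SYS_TIME is taken from linux/capability.h.

```python
from typing import Optional

CAP_SYS_TIME = 25  # bit number as defined in <linux/capability.h>


def parse_cap_eff(status_text: str) -> Optional[int]:
    """Return the effective capability set as an integer bitmask, or None."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)  # the kernel prints the mask in hex
    return None


def has_capability(mask: int, cap: int) -> bool:
    """Check whether the bit for capability number `cap` is set in `mask`."""
    return bool((mask >> cap) & 1)


if __name__ == "__main__":
    try:
        with open("/proc/self/status") as handle:
            mask = parse_cap_eff(handle.read())
    except OSError:
        mask = None  # not Linux, or /proc is not mounted
    if mask is None:
        print("CapEff is not available on this system")
    else:
        print(f"CapEff={mask:#018x}")
        print("CAP_SYS_TIME granted:", has_capability(mask, CAP_SYS_TIME))
```

Run inside a container, this shows whether the process could set the system clock. A restricted sample mask such as 00000000a80425fb does not have bit 25 set, while a full mask such as 000001ffffffffff does.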
For an exhaustive explanation of capabilities see
man 7 capabilities
Last but not least there are cgroups (control groups). They allow us to limit the amount of resources that a process (or a group of processes) can consume. Let’s say there is a memory leak in a piece of software we run inside a container. That memory leak can eat up the entire memory of the host machine if we do not limit the amount of memory available to the container.
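Whether such a memory limit is in place can be read from the cgroup filesystem. The sketch below assumes cgroup v2 (the unified hierarchy) mounted at /sys/fs/cgroup, which is common on current distributions but not guaranteed; the helper names are hypothetical. It finds the cgroup of the current process in /proc/self/cgroup and reads that cgroup's memory.max file.

```python
import os
from typing import Optional

CGROUP_ROOT = "/sys/fs/cgroup"  # assumption: cgroup v2 is mounted here


def parse_cgroup_v2_path(cgroup_text: str) -> Optional[str]:
    """Return the cgroup v2 path from /proc/<pid>/cgroup content (the '0::' line)."""
    for line in cgroup_text.splitlines():
        hierarchy_id, _, rest = line.partition(":")
        controllers, _, path = rest.partition(":")
        if hierarchy_id == "0" and controllers == "":
            return path
    return None  # only cgroup v1 entries were found, or none at all


def parse_memory_max(text: str) -> Optional[int]:
    """Return the limit in bytes, or None if memory.max says 'max' (no limit)."""
    value = text.strip()
    return None if value == "max" else int(value)


if __name__ == "__main__":
    try:
        with open("/proc/self/cgroup") as handle:
            cgroup_path = parse_cgroup_v2_path(handle.read())
    except OSError:
        cgroup_path = None  # not Linux, or /proc is not mounted
    if cgroup_path is None:
        print("no cgroup v2 entry found for this process")
    else:
        limit_file = os.path.join(CGROUP_ROOT, cgroup_path.lstrip("/"), "memory.max")
        try:
            with open(limit_file) as handle:
                limit = parse_memory_max(handle.read())
            print("memory limit:", "unlimited" if limit is None else f"{limit} bytes")
        except (OSError, ValueError):
            print("memory.max is not readable at", limit_file)
```

If the output is "unlimited", a leaking process in that cgroup can keep allocating until the host itself runs out of memory. The root cgroup has no memory.max file, so on a host shell outside any limited group the script reports that the file is not readable.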
While it is not necessary to understand these basics in order to use Docker in a development environment, it is essential to understand them if we want to run containers in a production environment.