While editing the container capabilities page for How Containers Work, I needed to explain why CDMY0CDMY does not work in Docker. Here's what happened when I ran CDMY1CDMY in a Docker container on my laptop:

$ docker run -it ubuntu:18.04 /bin/bash
$ # ... install strace ...
root@e27f594da870:/# strace ls
strace: ptrace(PTRACE_TRACEME, ...): Operation not permitted

CDMY2CDMY works through the CDMY3CDMY system call, so if CDMY4CDMY is not permitted, nothing will work! It's easy to fix, though, and on my laptop this did it:

docker run --cap-add=SYS_PTRACE -it ubuntu:18.04 /bin/bash

But I was less interested in fixing the problem than in figuring out why it happens at all. So why does CDMY5CDMY not work, and why does CDMY6CDMY fix everything?

Hypothesis 1: Container processes do not have the CDMY7CDMY capability

Since CDMY8CDMY reliably fixes the problem, I had always assumed that Docker container processes simply do not have the CDMY9CDMY capability by default. But for two reasons, that doesn't add up.

Reason 1: As an experiment, logged in as a regular user, I could easily run CDMY10CDMY on any of my own processes. However, when I checked whether my current process had the CDMY11CDMY capability, it turned out to have none:

$ getpcaps $$
Capabilities for `11589': =
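The same check can be made without getpcaps by reading the effective capability bitmask (the CapEff line) from /proc/self/status. The sketch below assumes Linux and assumes that CAP_SYS_PTRACE is capability number 19, as in the kernel headers; the helper names hasCapability and parseCapEff are made up for this illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// capSysPtrace is the bit index of CAP_SYS_PTRACE in the capability
// bitmask (assumed to be 19, as in the Linux kernel headers).
const capSysPtrace = 19

// hasCapability reports whether bit capBit is set in a capability mask.
func hasCapability(mask uint64, capBit uint) bool {
	return mask&(1<<capBit) != 0
}

// parseCapEff extracts the effective capability mask from the text of
// /proc/<pid>/status. It returns an error if no CapEff line is present.
func parseCapEff(status string) (uint64, error) {
	scanner := bufio.NewScanner(strings.NewReader(status))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) == 2 && fields[0] == "CapEff:" {
			return strconv.ParseUint(fields[1], 16, 64)
		}
	}
	return 0, fmt.Errorf("no CapEff line found")
}

func main() {
	data, err := os.ReadFile("/proc/self/status") // Linux only
	if err != nil {
		fmt.Println("cannot read /proc/self/status:", err)
		return
	}
	mask, err := parseCapEff(string(data))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("CapEff=%016x CAP_SYS_PTRACE=%v\n", mask, hasCapability(mask, capSysPtrace))
}
```

Run as a regular user, a program like this should report an all-zero mask, which matches the empty set printed by getpcaps.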

Reason 2: CDMY12CDMY says the following about the CDMY13CDMY capability:

CAP_SYS_PTRACE
       * Trace arbitrary processes using ptrace(2);

The whole point of CDMY14CDMY is to let you, like root, trace arbitrary processes owned by any user. CDMY15CDMY doesn't need this capability for an ordinary process owned by your own user.

I also ran one more check: I started a Docker container with CDMY16CDMY, then dropped the CDMY17CDMY capability, and CDMY18CDMY kept working correctly even without it. Why?!

Hypothesis 2: Something to do with user namespaces?

My next (and much weaker) hypothesis was something like "hmm, maybe the process is in a different user namespace and CDMY19CDMY doesn't work... for some reason?" That isn't a very coherent chain of reasoning, but I looked into it anyway.
So, is the process in a different user namespace? Here is how it looks in the container:

root@e27f594da870:/# ls /proc/$$/ns/user -l
... /proc/1/ns/user -> 'user:[4026531837]'

And here is how it looks on the host:

bork@kiwi:~$ ls /proc/$$/ns/user -l
... /proc/12177/ns/user -> 'user:[4026531837]'

Root in the container is the same user as root on the host, because both processes are in the same user namespace (4026531837), so nothing here should prevent CDMY20CDMY from working. As you can see, the hypothesis wasn't a great one, but at the time I hadn't yet realized that the users in the container and on the host are the same, so it seemed worth exploring.
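The comparison above boils down to reading the /proc/<pid>/ns/user symlink for two processes and checking whether the identifiers in brackets match. Here is a minimal sketch of that check, assuming Linux; the helper names parseNamespaceID and sameNamespace are hypothetical, and passing a PID as the first argument is just one possible way to use it.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// parseNamespaceID pulls the identifier out of a namespace link target
// such as "user:[4026531837]". The second result is false if the text
// does not have that shape.
func parseNamespaceID(link string) (string, bool) {
	open := strings.Index(link, ":[")
	if open < 0 || !strings.HasSuffix(link, "]") {
		return "", false
	}
	return link[open+2 : len(link)-1], true
}

// sameNamespace reports whether two link targets name the same namespace.
func sameNamespace(a, b string) bool {
	idA, okA := parseNamespaceID(a)
	idB, okB := parseNamespaceID(b)
	return okA && okB && idA == idB
}

func main() {
	self, err := os.Readlink("/proc/self/ns/user") // Linux only
	if err != nil {
		fmt.Println("cannot read own namespace:", err)
		return
	}
	fmt.Println("self:", self)

	// Optional: compare with the process whose PID is given as an argument.
	if len(os.Args) > 1 {
		other, err := os.Readlink("/proc/" + os.Args[1] + "/ns/user")
		if err != nil {
			fmt.Println("cannot read other namespace:", err)
			return
		}
		fmt.Println("other:", other, "same:", sameNamespace(self, other))
	}
}
```

Reading another process's namespace link may require sufficient permissions on that process, so the second lookup can fail for processes you do not own.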

Hypothesis 3: The CDMY21CDMY system call is blocked by a CDMY22CDMY rule

I already knew that Docker uses CDMY23CDMY rules to stop container processes from making a large number of system calls, and it turned out that CDMY24CDMY is on the list of calls blocked by default! (Strictly speaking, the list is an allowlist and CDMY25CDMY simply is not on it, but the result is the same.)

That explains why CDMY26CDMY does not work in a Docker container: if CDMY27CDMY is completely blocked, then obviously it cannot be called.

Let's test this hypothesis and see if we can use CDMY28CDMY in the Docker container if we disable all seccomp rules:

$ docker run --security-opt seccomp=unconfined -it ubuntu:18.04 /bin/bash
$ strace ls
execve("/bin/ls", ["ls"], 0x7ffc69a65580 /* 8 vars */) = 0
... it works fine ...

Great! Everything works, and the mystery is solved! Except...

Why does CDMY29CDMY solve the problem?

We still have not explained why CDMY30CDMY fixes the problem. The CDMY31CDMY man page describes the CDMY32CDMY argument as follows:

--cap-add=[] Add Linux capabilities 

None of this has anything to do with seccomp rules! What's going on?

Let's take a look at the source code for Docker.

When the documentation doesn't help, all that's left is to dive into the source.
Go has one nice property: because dependencies are vendored into a Go repository, you can search the entire repository with CDMY33CDMY and find the code you are interested in. So I cloned CDMY34CDMY and searched it for expressions like CDMY35CDMY.

Here is what I think is going on. In the seccomp implementation used for the container, there is this section:

case "CAP_SYS_PTRACE":
	s.Syscalls = append(s.Syscalls, specs.LinuxSyscall{
		Names: []string{
			"kcmp",
			"process_vm_readv",
			"process_vm_writev",
			"ptrace",
		},
		Action: specs.ActAllow,
		Args:   []specs.LinuxSeccompArg{},
	})

In other words, when the container has the CAP_SYS_PTRACE capability, these four system calls are added to the list of calls that seccomp allows.

There is also code in moby, and in the default seccomp profile, that does something very similar, so we have probably found our answer!
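To make the mechanism concrete, here is a toy model of it, not Docker's actual code: an allowlist starts from a base set of system calls, and holding a capability adds the calls tied to that capability. Only the four names listed for CAP_SYS_PTRACE come from the excerpt above; the extraSyscalls table, the buildAllowlist helper, and the three-call base profile are simplifications invented for this sketch.

```go
package main

import "fmt"

// extraSyscalls maps a capability to the system calls that become allowed
// when the container holds it. Only the CAP_SYS_PTRACE entry is taken from
// the excerpt; the table itself is a simplified stand-in.
var extraSyscalls = map[string][]string{
	"CAP_SYS_PTRACE": {"kcmp", "process_vm_readv", "process_vm_writev", "ptrace"},
}

// buildAllowlist starts from a base allowlist and adds the system calls
// tied to each granted capability. Anything absent from the result is
// treated as blocked.
func buildAllowlist(base, caps []string) map[string]bool {
	allowed := make(map[string]bool)
	for _, name := range base {
		allowed[name] = true
	}
	for _, c := range caps {
		for _, name := range extraSyscalls[c] {
			allowed[name] = true
		}
	}
	return allowed
}

func main() {
	base := []string{"read", "write", "execve"} // toy base profile, not Docker's real one

	without := buildAllowlist(base, nil)
	with := buildAllowlist(base, []string{"CAP_SYS_PTRACE"})

	fmt.Println("ptrace allowed without the capability:", without["ptrace"])
	fmt.Println("ptrace allowed with the capability:   ", with["ptrace"])
}
```

In this model the capability and the seccomp allowlist are separate inputs, and the capability only matters for system calls because the allowlist builder consults it.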

In Docker, CDMY36CDMY does more than it says

As a result, it seems that CDMY37CDMY does not do exactly what the man page says; it behaves more like CDMY38CDMY. And that makes sense: if you have a capability like CDMY39CDMY, which is supposed to let you use the CDMY40CDMY system call, but that call is blocked by the seccomp profile, the capability won't help you much. So allowing the CDMY41CDMY and CDMY42CDMY system calls when CDMY43CDMY is used looks reasonable.

It turns out CDMY44CDMY works in the latest versions of Docker

Thanks to this commit, Docker 19.03 finally allows CDMY45CDMY system calls on kernel versions 4.8 and higher. But the Docker on my laptop is still version 18.09.7, so it obviously doesn't include that commit.
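The kernel condition amounts to comparing the running kernel's release against 4.8. Below is a minimal sketch of such a comparison; kernelAtLeast is a hypothetical helper, the sample release strings are made up, and none of this is Docker's actual code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// kernelAtLeast reports whether a kernel release string such as
// "4.15.0-54-generic" is at least major.minor. Unparseable input
// yields false.
func kernelAtLeast(release string, major, minor int) bool {
	parts := strings.SplitN(release, ".", 3)
	if len(parts) < 2 {
		return false
	}
	gotMajor, err := strconv.Atoi(parts[0])
	if err != nil {
		return false
	}
	// The minor field may carry a suffix (e.g. "8-rc1"); keep leading digits.
	digits := parts[1]
	for i, r := range digits {
		if r < '0' || r > '9' {
			digits = digits[:i]
			break
		}
	}
	gotMinor, err := strconv.Atoi(digits)
	if err != nil {
		return false
	}
	if gotMajor != major {
		return gotMajor > major
	}
	return gotMinor >= minor
}

func main() {
	for _, rel := range []string{"4.4.0-21-generic", "4.8.0", "5.3.0-26-generic"} {
		fmt.Printf("%-20s passes the 4.8 gate: %v\n", rel, kernelAtLeast(rel, 4, 8))
	}
}
```

Comparing major and minor as integers avoids the trap of string comparison, where "4.15" would sort before "4.8".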

That's it!

This was a fun problem to dig into, and I think it's a good example of how containers are made of many moving parts that interact in non-obvious ways.

If you liked this post, you might like my zine How Containers Work, which explains in 24 pages the Linux kernel features that make containers work. You can also read its pages on capabilities and seccomp-bpf.