Linux Capabilities and Seccomp Profiles: The Developer’s Guide to Docker Security Hardening

Linux capabilities and seccomp profiles are two of the most effective container security controls available — and two of the least consistently implemented. The barrier is not knowledge of what these controls do; most security-aware engineers understand capabilities and seccomp conceptually. The barrier is practical: configuring them correctly for a specific application requires knowing exactly what the application does at the system call level, which requires behavioral data that is rarely captured during normal development.


Linux Capabilities: What They Are and Why They Matter

Traditional Unix systems had a binary privilege model: either you were root and could do everything, or you were a non-root user with limited permissions. Linux capabilities broke root privilege into discrete units. A process can be granted specific capabilities without having full root access.

Examples of relevant capabilities for containers:

  • CAP_NET_BIND_SERVICE: Bind to ports below 1024
  • CAP_NET_RAW: Create raw sockets (used by ping, some network diagnostics)
  • CAP_SYS_ADMIN: A very broad capability covering mount, pivot_root, sethostname, various privileged ioctl operations, and many other administrative operations
  • CAP_CHOWN: Change file ownership
  • CAP_DAC_READ_SEARCH: Bypass file read permission checks

Docker’s default capability set drops many capabilities that containers should not need, but retains others, such as CAP_NET_RAW and CAP_MKNOD, that a hardened container can often remove. The principle of least privilege recommends granting only the specific capabilities your application requires.
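
To check which capabilities a process actually holds, the effective set can be read from the CapEff line of /proc/<pid>/status and decoded bit by bit. The following is a minimal sketch of such a decoder: the function names are hypothetical, the table follows the bit numbering in linux/capability.h, and the sample mask is the value commonly reported inside a container started with Docker's default settings (an assumption worth verifying on your own host).

```python
# Capability names indexed by bit position, following <linux/capability.h>.
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]


def decode_caps(mask_hex):
    """Turn a hex capability mask into a list of capability names."""
    mask = int(mask_hex, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


def effective_caps(status_text):
    """Extract and decode the CapEff line from /proc/<pid>/status contents."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return decode_caps(line.split()[1])
    return []


# Assumed sample: the mask typically seen in a default Docker container.
for cap in decode_caps("00000000a80425fb"):
    print(cap)
```

Running effective_caps(open("/proc/self/status").read()) inside the container before and after adding --cap-drop flags is one way to confirm that a change took effect.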

The challenge: identifying which capabilities your application needs requires knowing which system calls it makes that require elevated capabilities. Determining this from static analysis alone is unreliable — capability requirements are often in library code, not application code, and may only be triggered by specific code paths.


Seccomp: System Call Filtering

Seccomp (secure computing mode) restricts which system calls a process can make. Docker’s default seccomp profile blocks about 44 potentially dangerous syscalls out of the 300+ that Linux provides, leaving the commonly needed ones available.

A custom seccomp profile for a specific application allows only the syscalls that application actually uses. For a typical web API container, that might be only 80-100 syscalls. Everything else can be blocked.
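
An allowlist profile of this kind is a small JSON document: a default action that rejects everything, plus a rule that allows the named syscalls. The sketch below builds one in the layout Docker accepts for --security-opt seccomp=. The function name is hypothetical, the architecture is assumed to be x86-64, and the observed list is illustrative only; a real container needs considerably more syscalls just to start.

```python
import json


def build_seccomp_profile(allowed_syscalls):
    """Return an allowlist seccomp profile as a dict ready for json.dump."""
    return {
        # Any syscall not listed below fails with an errno instead of running.
        "defaultAction": "SCMP_ACT_ERRNO",
        # Assumption: the container runs on x86-64 only.
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [
            {
                # De-duplicate and sort so the file diffs cleanly in review.
                "names": sorted(set(allowed_syscalls)),
                "action": "SCMP_ACT_ALLOW",
            }
        ],
    }


# Illustrative input, not a complete list for any real application.
observed = ["read", "write", "openat", "close", "accept4", "epoll_wait", "read"]
profile = build_seccomp_profile(observed)
print(json.dumps(profile, indent=2))
```

Sorting and de-duplicating the names is a design choice that keeps successive versions of the profile easy to compare in code review.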

The security benefit: if an attacker exploits a vulnerability in the container and attempts to escalate privileges through a kernel exploit, the exploit may require syscalls that the seccomp profile has blocked. The attack surface exposed to the kernel is reduced.

The practical challenge: determining which syscalls an application uses, so that a custom profile can be written, requires one of the following:

  1. Exhaustive review of all application and library code (impractical for large applications with many dependencies)
  2. Running the application with strace and analyzing the output (tedious, requires complete execution coverage)
  3. Runtime profiling that captures syscall behavior automatically
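
The strace option can be partly automated. The following sketch extracts the set of syscall names from strace output; it assumes plain strace -f output without timestamp options, the function name is hypothetical, and the sample lines are made up for illustration. It does nothing about the coverage problem: it only reports what the traced run happened to execute.

```python
import re

# Optional "[pid N] " or "N " prefix, then the syscall name and "(".
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def syscalls_from_strace(lines):
    """Collect the distinct syscall names that appear in strace output."""
    found = set()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:
            found.add(match.group(1))
    # Signal lines ("--- SIG... ---"), exit lines ("+++ ... +++") and
    # "<... name resumed>" continuations do not match and are skipped;
    # a resumed call was already counted on its "unfinished" line.
    return found


# Made-up sample lines in the shape strace -f typically prints.
sample = [
    'openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 3',
    '[pid  4321] read(3, "127.0.0.1 localhost\\n", 4096) = 20',
    '4321  close(3) = 0',
    '[pid  4321] <... epoll_wait resumed>[], 64, 1000) = 0',
    '--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED} ---',
    '+++ exited with 0 +++',
]
print(sorted(syscalls_from_strace(sample)))
```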

Why Trial-and-Error Is Not Acceptable

The approach many teams use for seccomp and capability configuration is iterative: run the application under a restrictive profile, see what breaks, add back the syscalls or capabilities that were blocked, and repeat.

This approach has fundamental problems:

Coverage is incomplete: Trial-and-error only discovers syscall requirements for code paths that were exercised during testing. Edge cases, error handling paths, and infrequent operations may have different syscall requirements than the happy path. An application that passes all tests with a restrictive seccomp profile may fail in production when an unusual code path is triggered.

The process is slow: Each iteration of add-profile → test → observe-failure → update-profile takes time. For complex applications, this process can take days and still produce profiles that are either too permissive or break production edge cases.

Security teams cannot validate independently: A seccomp profile developed through ad-hoc testing has no documentation of why each allowed syscall was included. Security reviewers cannot determine whether the profile is minimal or whether it includes unnecessary permissions that were added to fix failures.
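
One way to make a profile reviewable is to compare it against recorded observations. The sketch below, with a hypothetical function name and made-up sample data, reports which allowed syscalls have no supporting observation and which observed syscalls the profile would block. It assumes the profile uses the Docker JSON layout with SCMP_ACT_ALLOW rules.

```python
def audit_profile(profile, observed):
    """Compare a seccomp allowlist against syscalls seen at runtime."""
    allowed = set()
    for rule in profile.get("syscalls", []):
        if rule.get("action") == "SCMP_ACT_ALLOW":
            allowed.update(rule.get("names", []))
    observed = set(observed)
    return {
        # Allowed by the profile but never seen: candidates for removal.
        "unjustified": sorted(allowed - observed),
        # Seen at runtime but not allowed: would fail under this profile.
        "missing": sorted(observed - allowed),
    }


# Made-up example: a hand-edited profile and one profiling run.
hand_built = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["read", "write", "ptrace", "mount"], "action": "SCMP_ACT_ALLOW"}
    ],
}
report = audit_profile(hand_built, ["read", "write", "openat"])
print(report)
```

An entry under "unjustified" is not proof that the syscall is unnecessary, only that the recorded runs never exercised it, which is exactly the question a reviewer needs to ask.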


Runtime Profiling as the Foundation

Runtime profiling that captures system calls and capability usage during representative application execution solves the fundamental information problem. Instead of guessing which syscalls and capabilities an application requires, you observe what it actually uses.

A profiling run that exercises the application’s full execution paths — startup, all API endpoint types, background job execution, error handling, shutdown — produces a definitive list of:

  • System calls made during normal operation
  • Linux capabilities invoked
  • Libraries loaded (which influences both capability requirements and package removal decisions)

This execution data is the input to container hardening: capability configuration, seccomp profile generation, and package removal all derive from the same runtime profile. The result is an internally consistent set of controls — the capability set matches what the application uses, the seccomp profile matches the syscalls the application makes, and the package set matches the libraries the application loads.
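
Because a representative profile usually combines several runs, it helps to record which phase produced each observation. The following is a minimal sketch under that assumption; the function name, phase names, and syscall sets are all illustrative.

```python
from collections import defaultdict


def merge_phases(phase_observations):
    """Map each syscall to the sorted list of phases in which it was seen."""
    provenance = defaultdict(set)
    for phase, syscalls in phase_observations.items():
        for name in syscalls:
            provenance[name].add(phase)
    return {name: sorted(phases) for name, phases in sorted(provenance.items())}


# Illustrative data: one set of observed syscalls per profiling phase.
runs = {
    "startup": {"execve", "openat", "bind", "listen"},
    "api": {"accept4", "read", "write", "openat"},
    "shutdown": {"close", "exit_group"},
}
merged = merge_phases(runs)
for name, phases in merged.items():
    print(f"{name}: {', '.join(phases)}")
```

The keys of the merged mapping are the allowlist; the values give each entry a recorded justification that a reviewer can check.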


Interaction Between Package Removal and Syscall Surface

An underappreciated consequence of package removal: removing packages that are never executed also removes their system call requirements from the application’s behavioral profile.

A container image that includes system administration utilities — which most base images do — includes packages that use syscalls like mount, ptrace, syslog, and others that a web application never needs. These packages are present, and their syscall requirements inflate the apparent syscall surface of the container.

After removing these packages through profiling-based hardening, the syscall surface of the container reflects only the application’s actual needs. A seccomp profile generated from the hardened image’s execution profile allows fewer syscalls than a profile generated from the un-hardened image — even though the application behavior is identical.

Hardened container images have smaller syscall surfaces, which enables more restrictive seccomp profiles, which reduces the kernel attack surface for any exploit attempting to use syscalls the application does not need. The security controls are multiplicative: package removal, capability restriction, and seccomp profiles each reduce attack surface, and they reduce each other’s scope as well.



Frequently Asked Questions

What are Linux capabilities in Docker security hardening?

Linux capabilities break the traditional binary root/non-root privilege model into discrete units, allowing containers to be granted only the specific permissions they require. In Docker security hardening, dropping all capabilities with --cap-drop ALL and adding back only what the application needs, such as CAP_NET_BIND_SERVICE for binding to low ports, ensures a container cannot perform privileged operations it has no legitimate reason to perform. Identifying which capabilities an application actually requires is best done through runtime profiling rather than trial and error.

What is a seccomp profile and why does it matter for Docker security?

Seccomp (secure computing mode) is a Linux kernel feature that restricts which system calls a container process can make. Docker applies a default seccomp profile blocking about 44 dangerous syscalls, but a custom application-specific profile can restrict the allowed set to the 80-100 syscalls a typical web API actually uses. The security benefit is that a kernel exploit that depends on a blocked syscall is rejected by the seccomp filter before it reaches the vulnerable kernel code, significantly reducing the attack surface for privilege escalation.

How do Linux capabilities and seccomp profiles work together for Docker security hardening?

Linux capabilities and seccomp profiles are complementary and multiplicative controls: capabilities restrict what privileged operations a process can invoke, while seccomp restricts which system calls it can make at the kernel level. When applied together alongside package removal through runtime profiling, the three controls reinforce each other — fewer installed packages means a smaller syscall surface, which enables a more restrictive seccomp profile, which reduces the kernel attack surface for any exploit. A runtime profile derived from representative application execution provides the behavioral data needed to configure all three controls consistently.

Why is trial-and-error insufficient for configuring Docker security profiles?

Trial-and-error seccomp and capability configuration only captures requirements for code paths exercised during testing, leaving edge cases, error handlers, and infrequent operations undiscovered. An application that passes all tests under a restrictive profile may silently fail in production when an unusual path is triggered, and security teams have no way to verify that the profile is minimal rather than permissive. Runtime profiling that exercises the full application execution path produces a definitive, auditable set of syscalls and capability requirements that eliminates both coverage gaps and unnecessary permissions.


Generating and Deploying Profiles

Once runtime profiling data is available, the generated seccomp profile and capability list can be wired into the deployment:

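# Kubernetes pod spec with seccomp profile
apiVersion: v1
kind: Pod
metadata:
  name: myapp                # placeholder name
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      # Path is relative to the kubelet's seccomp directory on the node
      localhostProfile: profiles/myapp-seccomp.json
  containers:
    - name: myapp
      image: myapp:latest    # placeholder image reference
      securityContext:
        capabilities:
          drop: ["ALL"]
          add: ["NET_BIND_SERVICE"]  # Only capabilities the app requires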

The profile file contains only the syscalls observed during profiling. Deployment with this profile blocks all other syscalls at the kernel level, regardless of what an attacker might attempt.
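
On hosts that run plain Docker rather than Kubernetes, the same settings map onto docker run flags. The sketch below assembles that command line; the function name and the image reference myapp:latest are hypothetical, while --cap-drop, --cap-add, and --security-opt seccomp= are the standard Docker CLI options.

```python
import shlex


def docker_run_args(image, seccomp_profile, add_caps=()):
    """Build a docker run argument list with hardened security options."""
    args = ["docker", "run", "--cap-drop", "ALL"]
    for cap in add_caps:
        # Add back only the capabilities the profiling run showed in use.
        args += ["--cap-add", cap]
    args += ["--security-opt", f"seccomp={seccomp_profile}", image]
    return args


cmd = docker_run_args(
    "myapp:latest",                    # hypothetical image reference
    "profiles/myapp-seccomp.json",     # profile path on the Docker client host
    ["NET_BIND_SERVICE"],
)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string means it can be passed directly to subprocess.run without shell quoting concerns.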

Combining capability restriction with seccomp profiles and package removal produces defense in depth at the execution layer: limited capabilities, a minimal syscall surface, and no unnecessary packages for an attacker to exploit.