Control Groups - CSE, IIT Bombay

Control Groups
- Prashanth, IIT-B
Problem Statement
To develop a mechanism to provide fine-grained resource control on a process/set of
processes running on a machine
Motivation
Operating systems were originally designed to manage and control the different resources
present on a host system. Processes (or process groups), on the other hand, lack control
over resources, which makes it difficult to enforce policies and to build robust applications
with specific resource constraints.
Requirements
Process groups (an individual process or a group of processes) in a system must be able to:
1. Account for the different resources used by them
2. Specify the resource constraints they require to the operating system
3. Constraints may be resource-specific (CPU/IO/Memory)
4. Constraints may have soft and hard limits
5. The operating system must be able to enforce these policies on behalf of the
process groups
6. Enforcing the above mechanisms should incur minimal overhead
Solution
The solution to the above problem was proposed by Google in 2007. It was originally
called Generic Process Containers and was later renamed Control Groups (Cgroups) to
avoid confusion with the term Containers.
• A Cgroup refers to a resource controller for a certain type of system resource, e.g. the Memory Cgroup, the Network Cgroup, etc.
• It derives ideas from, and extends, the process-tracking design used by the cpusets subsystem
already present in the Linux kernel.
• There are 12 different Cgroups (subsystems), one for each classified resource type.
• A process group attaches itself to a single node in each of the hierarchies
mounted on the system.
Illustration 1: Cgroups Implementation, Hierarchies and Kernel Data Structures
Cgroup Hierarchies
• It is possible to have 12 different hierarchies, one for each Cgroup, a single
hierarchy with all 12 Cgroups attached, or any combination in between.
• Every hierarchy has a root node.
• Each node (group) in a hierarchy derives a portion of its resources from its parent
node, and resources used by a child are also charged to the parent.
• Each process in the system is attached to one node in each hierarchy.
• A process starts in the root (default) node of each hierarchy.
• The root node typically has no constraints specified.
User Space APIs
• Cgroups are managed through a pseudo-filesystem.
• The root Cgroups filesystem directory contains a list of directories, each referring to
a different hierarchy present in the system.
• A hierarchy refers to a single Cgroup subsystem, or a group of Cgroup subsystems, attached
together.
• The mount() syscall indicates which Cgroups are to be bound to a hierarchy.
• Each hierarchy has its own tree of directories (nodes).
• Every node in a hierarchy contains files which expose the control elements of that node.
• Only unused subsystems may be mounted.
• Initially, all processes are in the root container of every hierarchy mounted on the
system.
• Creating a new directory (mkdir()) inside a container directory creates a new child container.
• Containers can be nested in any desired manner.
• A special control file, tasks, is used to track all processes in the container.
• rmdir() can be used to remove a container, provided no processes are attached to it.
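The API above can be sketched as a short shell session. This assumes a cgroup-v1 kernel with root privileges; the mount point /sys/fs/cgroup, the subsystem combination, and the group name mygroup are illustrative:

```shell
# Mount a single hierarchy with the cpu and cpuacct subsystems bound together
mkdir -p /sys/fs/cgroup/cpu,cpuacct
mount -t cgroup -o cpu,cpuacct cgroup /sys/fs/cgroup/cpu,cpuacct

# Create a child container; its control files appear automatically
mkdir /sys/fs/cgroup/cpu,cpuacct/mygroup
ls /sys/fs/cgroup/cpu,cpuacct/mygroup

# Attach the current shell to the child, then verify via the tasks file
echo $$ > /sys/fs/cgroup/cpu,cpuacct/mygroup/tasks
cat /sys/fs/cgroup/cpu,cpuacct/mygroup/tasks

# A container can be removed only once no tasks remain in it
echo $$ > /sys/fs/cgroup/cpu,cpuacct/tasks   # move the shell back to the root
rmdir /sys/fs/cgroup/cpu,cpuacct/mygroup
```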
Implementation
The following data structures were introduced to incorporate Cgroups:
1. container_subsys
• A single resource controller; it provides callbacks on process group
creation/modification/destruction
2. container_subsys_state
• Represents the base type from which subsystem state objects are derived
3. css_group
• Holds one container_subsys_state pointer for each registered subsystem
• A css_group pointer is added to the task struct
• Tasks with the same process-group membership are bound to the same
css_group
The kernel code was modified as follows:
1. A hook at the start of fork() to take a reference count on the css_group
2. A hook at the end of fork() to invoke subsystem callbacks
3. A hook in exit() to release the reference count and call exit callbacks
Cgroups Subsystems
Memory Cgroup
• Uses a common data structure and support library for tracking usage and imposing
limits: the "resource counter" (res_counter).
• The resource counter is an existing Linux mechanism for tracking resource usage.
• The Memory Cgroup subsystem allocates three res_counters (one each for physical
memory, kernel memory, and total memory).
Accounting
• Accounts memory for each process group
• Keeps track of the pages used by each group:
1. File pages – reads/writes/mmap from block devices
2. Anonymous pages – stack, heap, etc.
3. Active – recently used pages
4. Inactive – pages that are candidates for eviction
Limits
• Each group can be given two types of limits:
1. Soft limit – when the kernel is forced to reclaim memory, it evicts pages from
groups which have crossed their soft limit
2. Hard limit – exceeding it forces the kernel to kill a process from the container
• To avoid a process being evicted because a different process in the same container
is hogging memory, "one process per container" is highly recommended.
• Limits can be set, in bytes, for:
1. Physical memory
2. Kernel memory
3. Total memory (physical + swap)
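The three byte-valued limits map onto the memory controller's control files. A minimal sketch, assuming cgroup v1 and an existing child group (memcg1 here is an example name; the memory.memsw.* files are only present when swap accounting is enabled at boot):

```shell
cd /sys/fs/cgroup/memory/memcg1   # memcg1 is an example child group

# Soft limit: reclaimed against first when the kernel is under memory pressure
echo 64M  > memory.soft_limit_in_bytes

# Hard limit on physical memory: exceeding it invokes the OOM killer
echo 128M > memory.limit_in_bytes

# Hard limit on total memory (physical + swap); needs swap accounting enabled
echo 256M > memory.memsw.limit_in_bytes

# Hard limit on kernel memory
echo 32M  > memory.kmem.limit_in_bytes
```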
OOM Killer
• Invoked to kill a process when a process group reaches its hard limit
HugeTLB Cgroup
• Also makes use of a single "resource counter"
• Restricts the number of huge pages usable by a process
CPU Cgroup
• Manages how the scheduler shares CPU time among different process groups
• Allows specifying CPU weights and quotas (time limits)
• The quota acts as a token-bucket filter on top of the existing scheduler
• The CPU Cgroup creates a parallel hierarchy of struct sched_entity structures, which
the scheduler uses to store proportional weights and virtual runtimes – one
scheduler hierarchy per logical CPU
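Both mechanisms can be sketched with the v1 control files (group names big and small are illustrative; the default share is 1024):

```shell
cd /sys/fs/cgroup/cpu
mkdir big small

# Proportional weight: under contention, "big" gets roughly twice
# the CPU time of "small"
echo 2048 > big/cpu.shares
echo 1024 > small/cpu.shares

# Hard quota (token bucket): at most 50ms of CPU time per 100ms period,
# i.e. half of one logical CPU, even when the machine is otherwise idle
echo 100000 > small/cpu.cfs_period_us
echo  50000 > small/cpu.cfs_quota_us
```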
CPU-set Cgroup
• Existing support in the kernel – the architecture predates Cgroups
• Pins groups to specific logical CPUs
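A minimal sketch of pinning a group (v1 names; both cpuset.cpus and cpuset.mems must be set before any task can be attached, and the group name pinned is illustrative):

```shell
cd /sys/fs/cgroup/cpuset
mkdir pinned

# Restrict the group to logical CPUs 0-1 and NUMA memory node 0;
# both files must be non-empty before tasks can join the group
echo 0-1 > pinned/cpuset.cpus
echo 0   > pinned/cpuset.mems

# Attach a process; it will now only be scheduled on CPUs 0 and 1
echo {{pid-of-process}} > pinned/tasks
```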
Cpuacct
• All it does is accounting; it does not exert any control
• Reports the total CPU time used by all processes in the group – per logical CPU and in total
• Breaks the total CPU time down into "user" and "system" time
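The accounting files can simply be read (v1 names, shown here for the root group):

```shell
cd /sys/fs/cgroup/cpuacct

# Total CPU time (in nanoseconds) used by all tasks in the group
cat cpuacct.usage

# The same total broken down per logical CPU
cat cpuacct.usage_percpu

# Breakdown into "user" and "system" time (in USER_HZ ticks)
cat cpuacct.stat
```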
Block IO Cgroup
• Keeps track of I/O for each group:
1. Per block device
2. Reads vs writes
3. Sync vs async
• Can set throttle limits for each group:
1. Per block device
2. For reads or writes
3. In block operations or bytes
• Can also set relative weights for each group
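These three knobs map onto the blkio controller's files. A sketch, assuming cgroup v1 (the device numbers 8:0 and the group name slowio are illustrative):

```shell
cd /sys/fs/cgroup/blkio
mkdir slowio

# Throttle reads from the device with major:minor 8:0 (often /dev/sda)
# to 1 MB/s; writes can be limited independently
echo "8:0 1048576" > slowio/blkio.throttle.read_bps_device
echo "8:0 1048576" > slowio/blkio.throttle.write_bps_device

# Limits can also be expressed in I/O operations per second
echo "8:0 100" > slowio/blkio.throttle.read_iops_device

# Relative weight (100-1000) used by the proportional I/O scheduler
echo 500 > slowio/blkio.weight
```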
Net_cls & net_priority Cgroups
• Sets a class or priority for traffic generated by the group
• net_cls – assigns a class to the traffic, which then has to be matched by iptables
or the traffic classifier
• net_prio – assigns a priority, which is then used by the queueing discipline
• Both these Cgroups delegate their work to the respective network interfaces, which
leads to additional work at a lower layer
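A sketch of the net_cls side, assuming cgroup v1 and the cls_cgroup tc filter (the interface eth0, the class numbers, and the group name webtraffic are illustrative):

```shell
cd /sys/fs/cgroup/net_cls
mkdir webtraffic

# classid is a 32-bit value 0xAAAABBBB, read by tc as class AAAA:BBBB;
# 0x00100001 corresponds to tc class 10:1
echo 0x00100001 > webtraffic/net_cls.classid

# Match the class in the traffic-control layer and rate-limit it
tc qdisc add dev eth0 root handle 10: htb
tc class add dev eth0 parent 10: classid 10:1 htb rate 1mbit
tc filter add dev eth0 parent 10: protocol ip prio 10 handle 1: cgroup
```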
Device Cgroups
• Lets you control which devices can be accessed by a group
• A group can be configured so that, by default, it cannot access any device
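The deny-by-default policy is built from a deny-all entry plus a whitelist. A sketch (v1 files; the group name sandbox is illustrative):

```shell
cd /sys/fs/cgroup/devices
mkdir sandbox

# Revoke access to every device, then whitelist only /dev/null
# (entries are "type major:minor access": c=char, b=block; r/w/m)
echo "a"         > sandbox/devices.deny
echo "c 1:3 rwm" > sandbox/devices.allow

# Inspect the resulting whitelist
cat sandbox/devices.list
```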
Freezer Cgroup
• Allows freezing (suspending) and thawing (resuming) a process group
• Processes in the Cgroup are unaware of the time spent frozen
• Used for batch scheduling, process migration, etc.
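Freezing and thawing are driven through a single state file. A sketch (v1 names; the group name batch is illustrative):

```shell
cd /sys/fs/cgroup/freezer
mkdir batch

# Put a process into the group, then suspend every task in the group
echo {{pid-of-process}} > batch/tasks
echo FROZEN > batch/freezer.state

# freezer.state reports FREEZING while tasks are still being stopped,
# then FROZEN; write THAWED to resume the group
cat batch/freezer.state
echo THAWED > batch/freezer.state
```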
Perf Event Cgroup
• Collects various performance data for a set of processes
• Counts hardware events such as instructions executed, cache misses suffered, or
branches mispredicted
Debug Cgroup
• Makes a number of internal details of individual groups, or of the Cgroup system as
a whole, visible via virtual files within the Cgroup filesystem
• Exposes some data structures and the settings of some internal flags
Choosing the right hierarchy
• debug, net_cls, net_prio, devices, freezer, perf_event, cpuset, and cpuacct: none of
these make very heavy use of hierarchy and, in almost all cases, the functionality
provided by hierarchy can be achieved using a single administrative hierarchy.
• The major resource controllers – CPU, memory, block I/O, and network I/O – maintain
separate hierarchies to manage their resources.
• Networking is managed separately.
• It is not an implicit part of Cgroups – the Network Traffic Control hierarchies can
delegate part of a hierarchy to a separate administrative domain, which allows a
separate hierarchy for each interface.
Demonstration
Memory Cgroup
1. Navigate to the cgroups pseudo filesystem – it contains all the cgroup subsystems
which are currently attached to the system,
cd /sys/fs/cgroup
2. Login as root user
sudo su
3. Create a new directory named memory (if it doesn't exist) – this will be used to
manage the memory subsystem,
mkdir memory
4. Mount the memory cgroup (if it hasn't been mounted already),
mount -t cgroup -o memory cgroup /sys/fs/cgroup/memory
5. Change directory to memory
cd memory
6. Now we are in the root memory cgroup. Remember that a root cgroup is typically not
configurable; it merely provides statistics and accounting information
7. List all the files in the current directory – each file corresponds to a different
attribute (possibly configurable) of a child cgroup
ls
8. Create a new child cgroup of the root memory cgroup by making a new directory
mkdir memcg1
9. Now navigate into this newly created cgroup; you can browse its contents, which
look similar to its parent's
cd memcg1
10. The tasks file stores the tasks belonging to the current cgroup. Display the contents of
this file; it is initially empty as no tasks have been added.
cat tasks
11. Open a parallel terminal and start a new process and note its process id. (If you are
creating a new firefox process, make sure it wasn't running earlier),
ctrl + shift + t
firefox &
12. Now come back to the original terminal and add the created process into the current
cgroup (memcg1),
echo {{pid-of-process}} > cgroup.procs
Illustration 2: Starting a new process using new terminal
13. Now display the contents of cgroup.procs again and you will find the pids of all the
processes you added,
cat cgroup.procs
14. Now you can view the memory statistics of the current cgroup by,
cat memory.stat
Illustration 3: Screen shot of memory stats
15. Initially the memory limit is inherited from the parent cgroup node; it can be set using
echo 128M > memory.limit_in_bytes
16. You can check various resource-accounting information such as the current memory
usage, the maximum memory used, the limit on memory, etc.
cat memory.usage_in_bytes
cat memory.max_usage_in_bytes
cat memory.limit_in_bytes
Illustration 4: When the process exceeds its limit, it may stall or even be terminated
17. One important parameter for tracking memory OOM events is failcnt; it tells us how
many times a cgroup has attempted to exceed its allotted limit
cat memory.failcnt
18. Similarly, memory.kmem.* and memory.kmem.tcp.* statistics can be
accounted/controlled
19. Remember that, by default, processes that exceed the memory limit may be killed
by the OOM killer. To avoid this, disable it as shown below
echo 1 > memory.oom_control
20. You may create many more cgroup children under memory or memcg1, depending
on how you wish to nest them
21. You may delete a cgroup child when it doesn't have any tasks running in it