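Container systems such as Docker are widely used to cut the effort of setting up environments.
Many people know how to use Docker, but few understand the mechanisms behind it.
The best way to understand how something works is to build it yourself.
So this post implements a mini container in C++ that isolates the two things at the core of a container: "physical resources = CPU" and "resources on the machine = files".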
Namespaces
A standard feature built into Linux.
They can isolate things such as the following.
- UID/GID
- IPC
- Network devices
- File systems
- Process IDs
- hostname
For a detailed explanation, see this article.
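As a way to look at these namespaces from code, the kernel exposes each namespace of a process as a symbolic link under /proc/<pid>/ns, with link text such as uts:[4026531838]. The following is a minimal sketch and not part of the original program; the function name readNamespaces is hypothetical.

```cpp
#include <filesystem>
#include <map>
#include <string>
#include <system_error>

// Returns a map such as {"uts" -> "uts:[4026531838]", ...} for the given PID.
// The default "self" refers to the calling process.
// An empty map is returned when /proc/<pid>/ns cannot be read.
std::map<std::string, std::string> readNamespaces(const std::string &pid = "self") {
  namespace fs = std::filesystem;
  std::map<std::string, std::string> result;
  std::error_code ec;
  fs::directory_iterator it(fs::path("/proc") / pid / "ns", ec);
  if (ec) return result;
  for (const fs::directory_entry &entry : it) {
    // Each entry is a symlink whose target text identifies the namespace.
    const fs::path target = fs::read_symlink(entry.path(), ec);
    if (!ec) result[entry.path().filename().string()] = target.string();
  }
  return result;
}
```

Calling readNamespaces() in the parent and readNamespaces(std::to_string(childPid)) for the child and comparing the values is one way to see which namespaces were actually separated.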
The flags argument passed to clone specifies what is shared between the parent process and the child process.
clone(2) man page
Both clone() and clone3() allow a flags bit mask that modifies their behavior and allows the caller to specify what is shared between the calling process and the child process. This bit mask—the flags argument of clone() or the cl_args.flags field passed to clone3()—is referred to as the flags mask in the remainder of this page. The flags mask is specified as a bitwise OR of zero or more of the constants listed below. Except as noted below, these flags are available (and have the same effect) in both clone() and clone3().
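To make the "bitwise OR of constants" concrete, here is a small sketch that takes a flags mask apart again. It lists which namespace flags are present and extracts the exit signal, which clone() keeps in the low byte of the mask. The helper names namespaceFlagNames and exitSignal are hypothetical and only cover the flags used in this post.

```cpp
#include <sched.h>
#include <signal.h>
#include <string>
#include <vector>

// Names of the namespace-creating flags that are set in a clone() flags mask.
std::vector<std::string> namespaceFlagNames(int flags) {
  struct Entry {
    int bit;
    const char *name;
  };
  static const Entry table[] = {
      {CLONE_NEWUSER, "CLONE_NEWUSER"}, {CLONE_NEWIPC, "CLONE_NEWIPC"},
      {CLONE_NEWNET, "CLONE_NEWNET"},   {CLONE_NEWUTS, "CLONE_NEWUTS"},
      {CLONE_NEWPID, "CLONE_NEWPID"},   {CLONE_NEWNS, "CLONE_NEWNS"},
  };
  std::vector<std::string> names;
  for (const Entry &e : table) {
    if (flags & e.bit) names.push_back(e.name);
  }
  return names;
}

// The low byte of the mask holds the signal sent to the parent when the
// child terminates (SIGCHLD in the programs of this post).
int exitSignal(int flags) { return flags & 0xff; }
```

With flags = CLONE_NEWUTS | CLONE_NEWPID | SIGCHLD, the first function returns the two namespace names and the second returns SIGCHLD.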
#include <cstring>
#include <errno.h>
#include <fcntl.h>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <sched.h>
#include <string>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#define STACK_SIZE (1024 * 1024) // Stack size for cloned child
static char child_stack[STACK_SIZE]; // Memory for child's stack
// Function executed by the child process for the run step
static int childFuncRun(void *arg) {
char *const shell_args[] = {"/bin/sh", NULL};
execv(shell_args[0], shell_args);
perror("execv");
return 1;
}
void run() {
int flags = CLONE_NEWUSER | CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWUTS |
CLONE_NEWPID | CLONE_NEWNS | SIGCHLD;
pid_t pid = clone(childFuncRun, child_stack + STACK_SIZE, flags, NULL);
std::cout << "Starting child process with PID=" << pid << std::endl;
if (pid == -1) {
perror("clone");
exit(EXIT_FAILURE);
}
if (waitpid(pid, NULL, 0) == -1) {
perror("waitpid");
exit(EXIT_FAILURE);
}
std::cout << "Child process exited" << std::endl;
exit(EXIT_SUCCESS);
}
int main(int argc, char *argv[]) {
run();
return 0;
}
Below are the results with flags other than SIGCHLD also set, and with only SIGCHLD set.
With flags other than SIGCHLD also specified

With only SIGCHLD specified

Setting the UID/GID
By default, the user running sh inside the namespace is nobody/nogroup.

This can be configured by writing to /proc/pid/gid_map and /proc/pid/uid_map.
However, unless "deny" is written to /proc/pid/setgroups beforehand, the write to gid_map fails with an error.
user_namespaces(7) man page
Interaction with system calls that change process UIDs or GIDs
In a user namespace where the uid_map file has not been written, the system calls that change user IDs will fail. Similarly, if the gid_map file has not been written, the system calls that change group IDs will fail. After the uid_map and gid_map files have been written, only the mapped values may be used in system calls that change user and group IDs. For user IDs, the relevant system calls include setuid(2), setfsuid(2), setreuid(2), and setresuid(2). For group IDs, the relevant system calls include setgid(2), setfsgid(2), setregid(2), setresgid(2), and setgroups(2). Writing “deny” to the /proc/pid/setgroups file before writing to /proc/pid/gid_map will permanently disable setgroups(2) in a user namespace and allow writing to /proc/pid/gid_map without having the CAP_SETGID capability in the parent user namespace.
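One detail worth keeping in mind for these writes: fprintf on a FILE* is buffered, so a rejection by the kernel may only surface when the stream is flushed or closed. A possible alternative is to write with open/write directly so the error is visible at once. The following is a sketch under that assumption; the helper name writeOnce is hypothetical and it expects the target file to exist already, as the /proc files do.

```cpp
#include <cerrno>
#include <fcntl.h>
#include <string>
#include <unistd.h>

// Writes `content` to an existing file with a single write(2) call.
// Returns 0 on success, otherwise the errno value of the failing call
// (EIO is used when the kernel accepted only part of the data).
int writeOnce(const std::string &path, const std::string &content) {
  const int fd = open(path.c_str(), O_WRONLY | O_CLOEXEC);
  if (fd == -1) return errno;
  const ssize_t written = write(fd, content.data(), content.size());
  int err = 0;
  if (written == -1) {
    err = errno;
  } else if (static_cast<std::size_t>(written) != content.size()) {
    err = EIO;
  }
  close(fd);
  return err;
}
```

Used in the order described above, this would be writeOnce("/proc/self/setgroups", "deny") first, then the uid_map and gid_map writes, checking each return value against 0.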
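The content to write follows this format.
(first ID inside the namespace) (first ID outside the namespace) (range)
First ID inside the namespace: the UID/GID used inside the namespace.
First ID outside the namespace: the outside UID/GID that corresponds to the ID given as (first ID inside the namespace).
Range: the number of IDs usable inside the namespace.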
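The three fields can be modeled directly. The sketch below builds a map line and translates an inside ID to its outside counterpart for one mapping entry; the type IdMapping and both function names are hypothetical illustrations, not part of the container program.

```cpp
#include <cassert>
#include <optional>
#include <string>

// One line of uid_map / gid_map.
struct IdMapping {
  unsigned long inside;   // first ID inside the namespace
  unsigned long outside;  // first ID outside the namespace
  unsigned long count;    // number of consecutive IDs covered
};

// Builds the text to write, e.g. "0 1000 1\n".
std::string formatMapLine(const IdMapping &m) {
  assert(m.count > 0);  // a zero-length range would map nothing
  return std::to_string(m.inside) + " " + std::to_string(m.outside) + " " +
         std::to_string(m.count) + "\n";
}

// Translates an ID inside the namespace to the outside ID,
// or returns nullopt when the ID is not covered by the mapping.
std::optional<unsigned long> toOutsideId(const IdMapping &m, unsigned long insideId) {
  if (insideId < m.inside || insideId - m.inside >= m.count) return std::nullopt;
  return m.outside + (insideId - m.inside);
}
```

For IdMapping{0, 1000, 1}, ID 0 inside corresponds to 1000 outside, and every other inside ID is unmapped.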
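The following code maps 0 (root) inside the namespace to 1000 outside it. Because an unprivileged process may only map its own effective UID/GID, this works as-is only when the program is run by a user whose UID and GID are 1000.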
#include <cstring>
#include <errno.h>
#include <fcntl.h>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <sched.h>
#include <string>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#define STACK_SIZE (1024 * 1024) // Stack size for cloned child
static char child_stack[STACK_SIZE]; // Memory for child's stack
// Function executed by the child process for the run step
static int childFuncRun(void *arg) {
int uid = 1000;
int gid = 1000;
FILE *setgroups_file = fopen("/proc/self/setgroups", "w");
if (setgroups_file == NULL) {
perror("fopen setgroups");
return 1;
}
if (fprintf(setgroups_file, "deny") < 0) {
perror("fprintf setgroups");
fclose(setgroups_file);
return 1;
}
fclose(setgroups_file);
// Write UID mapping
FILE *uid_map_file = fopen("/proc/self/uid_map", "w");
if (uid_map_file == NULL) {
perror("fopen uid_map");
return 1;
}
if (fprintf(uid_map_file, "0 %d 1\n", uid) < 0) {
perror("fprintf uid_map");
fclose(uid_map_file);
return 1;
}
fclose(uid_map_file);
// Write GID mapping
FILE *gid_map_file = fopen("/proc/self/gid_map", "w");
if (gid_map_file == NULL) {
perror("fopen gid_map");
return 1;
}
if (fprintf(gid_map_file, "0 %d 1\n", gid) < 0) {
perror("fprintf gid_map");
fclose(gid_map_file);
return 1;
}
fclose(gid_map_file);
char *const shell_args[] = {"/bin/sh", NULL};
execv(shell_args[0], shell_args);
perror("execv");
return 1;
}
void run() {
int flags = CLONE_NEWUSER | CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWUTS |
CLONE_NEWPID | CLONE_NEWNS | SIGCHLD;
pid_t pid = clone(childFuncRun, child_stack + STACK_SIZE, flags, NULL);
std::cout << "Starting child process with PID=" << pid << std::endl;
if (pid == -1) {
perror("clone");
exit(EXIT_FAILURE);
}
if (waitpid(pid, NULL, 0) == -1) {
perror("waitpid");
exit(EXIT_FAILURE);
}
std::cout << "Child process exited" << std::endl;
exit(EXIT_SUCCESS);
}
int main(int argc, char *argv[]) {
run();
return 0;
}
To map whichever user runs the parent process to 0 (root) inside the namespace, modify the code as follows.
#include <cstring>
#include <errno.h>
#include <fcntl.h>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <sched.h>
#include <string>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#define STACK_SIZE (1024 * 1024) // Stack size for cloned child
static char child_stack[STACK_SIZE]; // Memory for child's stack
struct clone_args {
pid_t uid;
pid_t gid;
};
// Function executed by the child process for the run step
static int childFuncRun(void *arg) {
struct clone_args *args = (struct clone_args *)arg;
pid_t uid = args->uid;
pid_t gid = args->gid;
FILE *setgroups_file = fopen("/proc/self/setgroups", "w");
if (setgroups_file == NULL) {
perror("fopen setgroups");
return 1;
}
if (fprintf(setgroups_file, "deny") < 0) {
perror("fprintf setgroups");
fclose(setgroups_file);
return 1;
}
fclose(setgroups_file);
// Write UID mapping
FILE *uid_map_file = fopen("/proc/self/uid_map", "w");
if (uid_map_file == NULL) {
perror("fopen uid_map");
return 1;
}
if (fprintf(uid_map_file, "0 %d 1\n", uid) < 0) {
perror("fprintf uid_map");
fclose(uid_map_file);
return 1;
}
fclose(uid_map_file);
// Write GID mapping
FILE *gid_map_file = fopen("/proc/self/gid_map", "w");
if (gid_map_file == NULL) {
perror("fopen gid_map");
return 1;
}
if (fprintf(gid_map_file, "0 %d 1\n", gid) < 0) {
perror("fprintf gid_map");
fclose(gid_map_file);
return 1;
}
fclose(gid_map_file);
char *const shell_args[] = {"/bin/sh", NULL};
execv(shell_args[0], shell_args);
perror("execv");
return 1;
}
void run() {
struct clone_args args;
args.uid = getuid();
args.gid = getgid();
int flags = CLONE_NEWUSER | CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWUTS |
CLONE_NEWPID | CLONE_NEWNS | SIGCHLD;
pid_t pid = clone(childFuncRun, child_stack + STACK_SIZE, flags, &args);
std::cout << "Starting child process with PID=" << pid << std::endl;
if (pid == -1) {
perror("clone");
exit(EXIT_FAILURE);
}
if (waitpid(pid, NULL, 0) == -1) {
perror("waitpid");
exit(EXIT_FAILURE);
}
std::cout << "Child process exited" << std::endl;
exit(EXIT_SUCCESS);
}
int main(int argc, char *argv[]) {
run();
return 0;
}
With either of the programs above, the user ID inside the namespace becomes 0 (root).

Changing the hostname
At this point, the hostname is still the same as the parent process's.
This is the default behavior when a new UTS namespace is created.
clone(2) man page
CLONE_NEWUTS (since Linux 2.6.19)
If CLONE_NEWUTS is set, then create the process in a new UTS namespace, whose identifiers are initialized by duplicating the identifiers from the UTS namespace of the calling process. If this flag is not set, then (as with fork(2)) the process is created in the same UTS namespace as the calling process. For further information on UTS namespaces, see uts_namespaces(7). Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWUTS.
Parent process

Child process

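The UTS namespace that determines the hostname has itself been newly created, so the hostname is rewritten inside this namespace.
sethostname requires root privileges, so the code is split into run and init to guarantee the order "UID/GID = 0, then sethostname".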
#include <cstring>
#include <errno.h>
#include <fcntl.h>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <sched.h>
#include <string>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#define STACK_SIZE (1024 * 1024) // Stack size for cloned child
static char child_stack[STACK_SIZE]; // Memory for child's stack
struct clone_args {
pid_t uid;
pid_t gid;
};
static int childFuncInit() {
// Set hostname
if (sethostname("container", 9) == -1) {
perror("sethostname");
return 1;
}
// Execute shell
char *const shell_args[] = {"/bin/sh", NULL};
execv(shell_args[0], shell_args);
perror("execv");
return 1;
}
// Function executed by the child process for the run step
static int childFuncRun(void *arg) {
struct clone_args *args = (struct clone_args *)arg;
pid_t uid = args->uid;
pid_t gid = args->gid;
FILE *setgroups_file = fopen("/proc/self/setgroups", "w");
if (setgroups_file == NULL) {
perror("fopen setgroups");
return 1;
}
if (fprintf(setgroups_file, "deny") < 0) {
perror("fprintf setgroups");
fclose(setgroups_file);
return 1;
}
fclose(setgroups_file);
// Write UID mapping
FILE *uid_map_file = fopen("/proc/self/uid_map", "w");
if (uid_map_file == NULL) {
perror("fopen uid_map");
return 1;
}
if (fprintf(uid_map_file, "0 %d 1\n", uid) < 0) {
perror("fprintf uid_map");
fclose(uid_map_file);
return 1;
}
fclose(uid_map_file);
// Write GID mapping
FILE *gid_map_file = fopen("/proc/self/gid_map", "w");
if (gid_map_file == NULL) {
perror("fopen gid_map");
return 1;
}
if (fprintf(gid_map_file, "0 %d 1\n", gid) < 0) {
perror("fprintf gid_map");
fclose(gid_map_file);
return 1;
}
fclose(gid_map_file);
char *const args_exec[] = {"/proc/self/exe", "init", NULL};
execv(args_exec[0], args_exec);
perror("execv");
return 1;
}
void run() {
struct clone_args args;
args.uid = getuid();
args.gid = getgid();
int flags = CLONE_NEWUSER | CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWUTS |
CLONE_NEWPID | CLONE_NEWNS | SIGCHLD;
pid_t pid = clone(childFuncRun, child_stack + STACK_SIZE, flags, &args);
std::cout << "Starting child process with PID=" << pid << std::endl;
if (pid == -1) {
perror("clone");
exit(EXIT_FAILURE);
}
if (waitpid(pid, NULL, 0) == -1) {
perror("waitpid");
exit(EXIT_FAILURE);
}
std::cout << "Child process exited" << std::endl;
exit(EXIT_SUCCESS);
}
void usage(const char *prog_name) {
std::cerr << "Usage: " << prog_name << " run" << std::endl;
exit(EXIT_FAILURE);
}
int main(int argc, char *argv[]) {
if (argc <= 1) {
usage(argv[0]);
}
std::string command = argv[1];
if (command == "run") {
run();
} else if (command == "init") {
if (childFuncInit() == 1) {
std::cerr << "Initialization failed." << std::endl;
exit(EXIT_FAILURE);
}
} else {
usage(argv[0]);
}
}
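The program above passes the hostname length as the literal 9. As a small variation, the sketch below derives the length from the string and adds a reader that can confirm the result from inside the container. Both helper names are hypothetical; setHostname still needs the same privileges as the sethostname call in the program.

```cpp
#include <climits>
#include <string>
#include <unistd.h>

// Returns the hostname of the UTS namespace the caller lives in,
// or an empty string when gethostname fails.
std::string currentHostname() {
  char buf[HOST_NAME_MAX + 1] = {0};
  if (gethostname(buf, sizeof(buf)) == -1) return "";
  buf[HOST_NAME_MAX] = '\0';  // guarantee termination
  return std::string(buf);
}

// Sets the hostname, taking the length from the string itself.
// Returns false for an empty or over-long name, or when the call fails.
bool setHostname(const std::string &name) {
  if (name.empty() || name.size() > static_cast<std::size_t>(HOST_NAME_MAX)) {
    return false;
  }
  return sethostname(name.c_str(), name.size()) == 0;
}
```

In init, setHostname("container") followed by currentHostname() should then report "container", while the host keeps its own name.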
Isolating PIDs
As things stand, all of the parent's processes are still visible.

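For this to work as a container, ideally only the processes started inside the container should be visible. Tools such as ps read process information from /proc, so init mounts a fresh procfs on /proc inside the new PID namespace.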
#include <cstring>
#include <errno.h>
#include <fcntl.h>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <sched.h>
#include <string>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#define STACK_SIZE (1024 * 1024) // Stack size for cloned child
static char child_stack[STACK_SIZE]; // Memory for child's stack
struct clone_args {
pid_t uid;
pid_t gid;
};
static int childFuncInit() {
// Set hostname
if (sethostname("container", 9) == -1) {
perror("sethostname");
return 1;
}
// Mount /proc
if (mount("proc", "/proc", "proc",
MS_NOEXEC | MS_NOSUID | MS_NODEV, "") == -1) {
perror("mount /proc");
return 1;
}
// Execute shell
char *const shell_args[] = {"/bin/sh", NULL};
execv(shell_args[0], shell_args);
perror("execv");
return 1;
}
// Function executed by the child process for the run step
static int childFuncRun(void *arg) {
struct clone_args *args = (struct clone_args *)arg;
pid_t uid = args->uid;
pid_t gid = args->gid;
FILE *setgroups_file = fopen("/proc/self/setgroups", "w");
if (setgroups_file == NULL) {
perror("fopen setgroups");
return 1;
}
if (fprintf(setgroups_file, "deny") < 0) {
perror("fprintf setgroups");
fclose(setgroups_file);
return 1;
}
fclose(setgroups_file);
// Write UID mapping
FILE *uid_map_file = fopen("/proc/self/uid_map", "w");
if (uid_map_file == NULL) {
perror("fopen uid_map");
return 1;
}
if (fprintf(uid_map_file, "0 %d 1\n", uid) < 0) {
perror("fprintf uid_map");
fclose(uid_map_file);
return 1;
}
fclose(uid_map_file);
// Write GID mapping
FILE *gid_map_file = fopen("/proc/self/gid_map", "w");
if (gid_map_file == NULL) {
perror("fopen gid_map");
return 1;
}
if (fprintf(gid_map_file, "0 %d 1\n", gid) < 0) {
perror("fprintf gid_map");
fclose(gid_map_file);
return 1;
}
fclose(gid_map_file);
char *const args_exec[] = {"/proc/self/exe", "init", NULL};
execv(args_exec[0], args_exec);
perror("execv");
return 1;
}
void run() {
struct clone_args args;
args.uid = getuid();
args.gid = getgid();
int flags = CLONE_NEWUSER | CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWUTS |
CLONE_NEWPID | CLONE_NEWNS | SIGCHLD;
pid_t pid = clone(childFuncRun, child_stack + STACK_SIZE, flags, &args);
std::cout << "Starting child process with PID=" << pid << std::endl;
if (pid == -1) {
perror("clone");
exit(EXIT_FAILURE);
}
if (waitpid(pid, NULL, 0) == -1) {
perror("waitpid");
exit(EXIT_FAILURE);
}
std::cout << "Child process exited" << std::endl;
exit(EXIT_SUCCESS);
}
void usage(const char *prog_name) {
std::cerr << "Usage: " << prog_name << " run" << std::endl;
exit(EXIT_FAILURE);
}
int main(int argc, char *argv[]) {
if (argc <= 1) {
usage(argv[0]);
}
std::string command = argv[1];
if (command == "run") {
run();
} else if (command == "init") {
if (childFuncInit() == 1) {
std::cerr << "Initialization failed." << std::endl;
exit(EXIT_FAILURE);
}
} else {
usage(argv[0]);
}
}
Only the sh and ps processes are shown.
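The same check can be done from code instead of ps, since every visible process appears as a numeric directory under /proc. The following is a minimal sketch; the function name visiblePids is hypothetical.

```cpp
#include <algorithm>
#include <cctype>
#include <filesystem>
#include <string>
#include <system_error>
#include <vector>

// Lists the PIDs visible through a procfs mount, in ascending order.
// Returns an empty list when the directory cannot be read.
std::vector<int> visiblePids(const std::string &procRoot = "/proc") {
  namespace fs = std::filesystem;
  std::vector<int> pids;
  std::error_code ec;
  fs::directory_iterator it(procRoot, ec);
  if (ec) return pids;
  for (const fs::directory_entry &entry : it) {
    const std::string name = entry.path().filename().string();
    // Process directories are the ones whose names consist of digits only.
    const bool numeric =
        !name.empty() && std::all_of(name.begin(), name.end(), [](unsigned char c) {
          return std::isdigit(c) != 0;
        });
    if (numeric) pids.push_back(std::stoi(name));
  }
  std::sort(pids.begin(), pids.end());
  return pids;
}
```

Inside the container, with /proc remounted, the list should start at PID 1 and contain only the container's own processes.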

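Isolating the file system
The goal is to give the container its own root filesystem and hide everything outside it.
As preparation, create the directory that will become the container's root.
mkdir -p /root/rootfs/proc
mkdir -p /root/rootfs/bin
mkdir -p /root/rootfs/lib
cp /bin/sh /root/rootfs/bin
cp /bin/ls /root/rootfs/bin
ln -s lib /root/rootfs/lib64
(copy the dependent libraries reported by ldd /bin/sh and ldd /bin/ls into /root/rootfs/lib/)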
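The last step, collecting the libraries that ldd reports, can be partly automated. The sketch below only handles the parsing part: it pulls the absolute path out of one line of typical glibc ldd output. The function name libraryPathFromLddLine is hypothetical, and paths containing spaces are not handled.

```cpp
#include <optional>
#include <string>

// Extracts the absolute library path from one line of ldd output.
// Handles "name => /path (addr)" and "/path (addr)" lines, and returns
// nullopt for lines without a path (e.g. linux-vdso or "not found").
std::optional<std::string> libraryPathFromLddLine(const std::string &line) {
  std::size_t start = std::string::npos;
  const std::size_t arrow = line.find("=>");
  if (arrow != std::string::npos) {
    start = line.find('/', arrow);  // path follows the arrow
  } else {
    start = line.find_first_not_of(" \t");
    if (start != std::string::npos && line[start] != '/') return std::nullopt;
  }
  if (start == std::string::npos) return std::nullopt;
  const std::size_t end = line.find_first_of(" \t", start);
  if (end == std::string::npos) return line.substr(start);
  return line.substr(start, end - start);
}
```

Each returned path could then be copied into /root/rootfs/lib, for example with std::filesystem::copy_file, which matches the layout above where lib64 is a symlink to lib.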
Use pivot_root to switch the container's root filesystem to rootfs.
#include <cstring>
#include <errno.h>
#include <fcntl.h>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <sched.h>
#include <string>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#define STACK_SIZE (1024 * 1024) // Stack size for cloned child
static char child_stack[STACK_SIZE]; // Memory for child's stack
struct clone_args {
pid_t uid;
pid_t gid;
};
static int childFuncInit() {
// Set hostname
if (sethostname("container", 9) == -1) {
perror("sethostname");
return 1;
}
// Mount /proc
if (mount("proc", "/root/rootfs/proc", "proc",
MS_NOEXEC | MS_NOSUID | MS_NODEV, "") == -1) {
perror("mount /proc");
return 1;
}
// Change directory to /root
if (chdir("/root") == -1) {
perror("chdir /root");
return 1;
}
// Bind mount rootfs
if (mount("rootfs", "/root/rootfs", "", MS_BIND | MS_REC, "") == -1) {
perror("mount rootfs");
return 1;
}
// Create oldrootfs directory
if (mkdir("/root/rootfs/oldrootfs", 0700) == -1) {
perror("mkdir /root/rootfs/oldrootfs");
return 1;
}
// Pivot root to new rootfs
if (syscall(SYS_pivot_root, "rootfs", "/root/rootfs/oldrootfs") == -1) {
perror("pivot_root");
return 1;
}
// Unmount old root
if (umount2("/oldrootfs", MNT_DETACH) == -1) {
perror("umount /oldrootfs");
return 1;
}
// Remove oldrootfs directory
if (rmdir("/oldrootfs") == -1) {
perror("rmdir /oldrootfs");
return 1;
}
// Change directory to /
if (chdir("/") == -1) {
perror("chdir /");
return 1;
}
// Execute shell
char *const shell_args[] = {"/bin/sh", NULL};
execv(shell_args[0], shell_args);
perror("execv");
return 1;
}
// Function executed by the child process for the run step
static int childFuncRun(void *arg) {
struct clone_args *args = (struct clone_args *)arg;
pid_t uid = args->uid;
pid_t gid = args->gid;
FILE *setgroups_file = fopen("/proc/self/setgroups", "w");
if (setgroups_file == NULL) {
perror("fopen setgroups");
return 1;
}
if (fprintf(setgroups_file, "deny") < 0) {
perror("fprintf setgroups");
fclose(setgroups_file);
return 1;
}
fclose(setgroups_file);
// Write UID mapping
FILE *uid_map_file = fopen("/proc/self/uid_map", "w");
if (uid_map_file == NULL) {
perror("fopen uid_map");
return 1;
}
if (fprintf(uid_map_file, "0 %d 1\n", uid) < 0) {
perror("fprintf uid_map");
fclose(uid_map_file);
return 1;
}
fclose(uid_map_file);
// Write GID mapping
FILE *gid_map_file = fopen("/proc/self/gid_map", "w");
if (gid_map_file == NULL) {
perror("fopen gid_map");
return 1;
}
if (fprintf(gid_map_file, "0 %d 1\n", gid) < 0) {
perror("fprintf gid_map");
fclose(gid_map_file);
return 1;
}
fclose(gid_map_file);
char *const args_exec[] = {"/proc/self/exe", "init", NULL};
execv(args_exec[0], args_exec);
perror("execv");
return 1;
}
void run() {
struct clone_args args;
args.uid = getuid();
args.gid = getgid();
int flags = CLONE_NEWUSER | CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWUTS |
CLONE_NEWPID | CLONE_NEWNS | SIGCHLD;
pid_t pid = clone(childFuncRun, child_stack + STACK_SIZE, flags, &args);
std::cout << "Starting child process with PID=" << pid << std::endl;
if (pid == -1) {
perror("clone");
exit(EXIT_FAILURE);
}
if (waitpid(pid, NULL, 0) == -1) {
perror("waitpid");
exit(EXIT_FAILURE);
}
std::cout << "Child process exited" << std::endl;
exit(EXIT_SUCCESS);
}
void usage(const char *prog_name) {
std::cerr << "Usage: " << prog_name << " run" << std::endl;
exit(EXIT_FAILURE);
}
int main(int argc, char *argv[]) {
if (argc <= 1) {
usage(argv[0]);
}
std::string command = argv[1];
if (command == "run") {
run();
} else if (command == "init") {
if (childFuncInit() == 1) {
std::cerr << "Initialization failed." << std::endl;
exit(EXIT_FAILURE);
}
} else {
usage(argv[0]);
}
}
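Isolating CPU resources
The basic container functionality is in place, but a program that keeps the CPU busy inside the container would be a nuisance.
To limit CPU usage inside the container, cgroups are used.
Creating /sys/fs/cgroup/my-container automatically creates several configuration files inside my-container/.
Writing a PID to my-container/cgroup.procs applies my-container's settings to that process.
Write the CPU limit to my-container/cpu.max.
(limit) (period)
Within each period, the CPU can be used only for the amount of time set as the limit. Both values are in microseconds.
cgroup v2 reference documentation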
cpu.max
A read-write two value file which exists on non-root cgroups. The default is “max 100000”. The maximum bandwidth limit. It’s in the following format:: $MAX $PERIOD which indicates that the group may consume up to $MAX in each $PERIOD duration. “max” for $MAX indicates no limit. If only one number is written, $MAX is updated.
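However, writing
10000 10000
does not necessarily mean "no limit".
The limit applies to CPU time summed over all CPUs, so with four CPUs, for example, up to 40000 worth of work can be done within a period of 10000.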
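The relationship between the two cpu.max values and the number of CPUs can be written down as a small calculation: limit divided by period gives how many CPUs' worth of time the group may use, capped by the number of CPUs on the host. This is a sketch for illustration; the type CpuMax and both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <sstream>
#include <stdexcept>
#include <string>

struct CpuMax {
  std::optional<long long> quotaUs;  // nullopt stands for "max" (no limit)
  long long periodUs;
};

// Parses text in the "$MAX $PERIOD" format, e.g. "max 100000" or "10000 10000".
std::optional<CpuMax> parseCpuMax(const std::string &text) {
  std::istringstream in(text);
  std::string quota;
  long long period = 0;
  if (!(in >> quota >> period) || period <= 0) return std::nullopt;
  CpuMax result{std::nullopt, period};
  if (quota != "max") {
    try {
      std::size_t used = 0;
      const long long q = std::stoll(quota, &used);
      if (used != quota.size() || q <= 0) return std::nullopt;
      result.quotaUs = q;
    } catch (const std::exception &) {
      return std::nullopt;  // not a number
    }
  }
  return result;
}

// How many CPUs' worth of time the group may consume at most.
double cpuEquivalents(const CpuMax &m, unsigned hostCpus) {
  assert(hostCpus > 0 && m.periodUs > 0);
  if (!m.quotaUs) return hostCpus;
  return std::min<double>(hostCpus, static_cast<double>(*m.quotaUs) / m.periodUs);
}
```

On a four-CPU host, "10000 10000" evaluates to 1.0, that is one CPU out of four, while "max 100000" evaluates to 4.0.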
Container code with the cgroup added
#include <cstring>
#include <errno.h>
#include <fcntl.h>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <sched.h>
#include <string>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#define STACK_SIZE (1024 * 1024) // Stack size for cloned child
static char child_stack[STACK_SIZE]; // Memory for child's stack
struct clone_args {
pid_t uid;
pid_t gid;
};
// Function executed by the child process for the init step
static int childFuncInit() {
// Set hostname
if (sethostname("container", 9) == -1) {
perror("sethostname");
return 1;
}
struct stat statDir;
if (stat("/sys/fs/cgroup/my-container", &statDir) != 0) {
if (mkdir("/sys/fs/cgroup/my-container", 0700) == -1) {
perror("mkdir /sys/fs/cgroup/my-container");
return 1;
}
}
FILE *task_file = fopen("/sys/fs/cgroup/my-container/cgroup.procs", "w");
if (task_file == NULL) {
perror("fopen task_file");
return 1;
}
if (fprintf(task_file, "%d\n", getpid()) < 0) {
perror("fprintf task to pid");
fclose(task_file);
return 1;
}
fclose(task_file);
FILE *cpu_quota_file = fopen("/sys/fs/cgroup/my-container/cpu.max", "w");
if (cpu_quota_file == NULL) {
perror("fopen cpu_quota_file");
return 1;
}
if (fprintf(cpu_quota_file, "1000 1000\n") < 0) {
perror("fprintf cpu_quota to 1000");
fclose(cpu_quota_file);
return 1;
}
fclose(cpu_quota_file);
// Mount /proc
if (mount("proc", "/root/rootfs/proc", "proc",
MS_NOEXEC | MS_NOSUID | MS_NODEV, "") == -1) {
perror("mount /proc");
return 1;
}
// Change directory to /root
if (chdir("/root") == -1) {
perror("chdir /root");
return 1;
}
// Bind mount rootfs
if (mount("rootfs", "/root/rootfs", "", MS_BIND | MS_REC, "") == -1) {
perror("mount rootfs");
return 1;
}
// Create oldrootfs directory
if (mkdir("/root/rootfs/oldrootfs", 0700) == -1) {
perror("mkdir /root/rootfs/oldrootfs");
return 1;
}
// Pivot root to new rootfs
if (syscall(SYS_pivot_root, "rootfs", "/root/rootfs/oldrootfs") == -1) {
perror("pivot_root");
return 1;
}
// Unmount old root
if (umount2("/oldrootfs", MNT_DETACH) == -1) {
perror("umount /oldrootfs");
return 1;
}
// Remove oldrootfs directory
if (rmdir("/oldrootfs") == -1) {
perror("rmdir /oldrootfs");
return 1;
}
// Change directory to /
if (chdir("/") == -1) {
perror("chdir /");
return 1;
}
// Execute shell
char *const shell_args[] = {"/bin/sh", NULL};
execv(shell_args[0], shell_args);
perror("execv");
return 1;
}
// Function executed by the child process for the run step
static int childFuncRun(void *arg) {
struct clone_args *args = (struct clone_args *)arg;
pid_t uid = args->uid;
pid_t gid = args->gid;
// Disable setgroups
FILE *setgroups_file = fopen("/proc/self/setgroups", "w");
if (setgroups_file == NULL) {
perror("fopen setgroups");
return 1;
}
if (fprintf(setgroups_file, "deny") < 0) {
perror("fprintf setgroups");
fclose(setgroups_file);
return 1;
}
fclose(setgroups_file);
// Write UID mapping
FILE *uid_map_file = fopen("/proc/self/uid_map", "w");
if (uid_map_file == NULL) {
perror("fopen uid_map");
return 1;
}
if (fprintf(uid_map_file, "0 %d 1\n", uid) < 0) {
perror("fprintf uid_map");
fclose(uid_map_file);
return 1;
}
fclose(uid_map_file);
// Write GID mapping
FILE *gid_map_file = fopen("/proc/self/gid_map", "w");
if (gid_map_file == NULL) {
perror("fopen gid_map");
return 1;
}
if (fprintf(gid_map_file, "0 %d 1\n", gid) < 0) {
perror("fprintf gid_map");
fclose(gid_map_file);
return 1;
}
fclose(gid_map_file);
// Re-execute the process with the init argument
char *const args_exec[] = {"/proc/self/exe", "init", NULL};
execv(args_exec[0], args_exec);
perror("execv");
return 1;
}
void run() {
struct clone_args args;
args.uid = getuid();
args.gid = getgid();
int flags = CLONE_NEWUSER | CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWUTS |
CLONE_NEWPID | CLONE_NEWNS | SIGCHLD;
pid_t pid = clone(childFuncRun, child_stack + STACK_SIZE, flags, &args);
std::cout << "Starting child process with PID=" << pid << std::endl;
if (pid == -1) {
perror("clone");
exit(EXIT_FAILURE);
}
if (waitpid(pid, NULL, 0) == -1) {
perror("waitpid");
exit(EXIT_FAILURE);
}
std::cout << "Child process exited" << std::endl;
exit(EXIT_SUCCESS);
}
void usage(const char *prog_name) {
std::cerr << "Usage: " << prog_name << " run" << std::endl;
exit(EXIT_FAILURE);
}
int main(int argc, char *argv[]) {
if (argc <= 1) {
usage(argv[0]);
}
std::string command = argv[1];
if (command == "run") {
run();
} else if (command == "init") {
if (childFuncInit() == 1) {
std::cerr << "Initialization failed." << std::endl;
exit(EXIT_FAILURE);
}
} else {
usage(argv[0]);
}
return 0;
}
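To verify the cgroup, I used a program that does not isolate the file system. The verification is done by changing the value written to cpu.max.
Program with the cgroup but without file system isolation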
#include <cstring>
#include <errno.h>
#include <fcntl.h>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <sched.h>
#include <string>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#define STACK_SIZE (1024 * 1024) // Stack size for cloned child
static char child_stack[STACK_SIZE]; // Memory for child's stack
struct clone_args {
pid_t uid;
pid_t gid;
};
// Function executed by the child process for the init step
static int childFuncInit() {
// Set hostname
if (sethostname("container", 9) == -1) {
perror("sethostname");
return 1;
}
struct stat statDir;
if (stat("/sys/fs/cgroup/my-container", &statDir) != 0) {
if (mkdir("/sys/fs/cgroup/my-container", 0700) == -1) {
perror("mkdir /sys/fs/cgroup/my-container");
return 1;
}
}
FILE *task_file = fopen("/sys/fs/cgroup/my-container/cgroup.procs", "w");
if (task_file == NULL) {
perror("fopen task_file");
return 1;
}
if (fprintf(task_file, "%d\n", getpid()) < 0) {
perror("fprintf task to pid");
fclose(task_file);
return 1;
}
fclose(task_file);
FILE *cpu_quota_file = fopen("/sys/fs/cgroup/my-container/cpu.max", "w");
if (cpu_quota_file == NULL) {
perror("fopen cpu_quota_file");
return 1;
}
if (fprintf(cpu_quota_file, "1 1000\n") < 0) {
perror("fprintf cpu_quota to 1000");
fclose(cpu_quota_file);
return 1;
}
fclose(cpu_quota_file);
// Mount /proc
if (mount("proc", "/root/rootfs/proc", "proc",
MS_NOEXEC | MS_NOSUID | MS_NODEV, "") == -1) {
perror("mount /proc");
return 1;
}
// Execute shell
char *const shell_args[] = {"/bin/sh", NULL};
execv(shell_args[0], shell_args);
perror("execv");
return 1;
}
// Function executed by the child process for the run step
static int childFuncRun(void *arg) {
struct clone_args *args = (struct clone_args *)arg;
pid_t uid = args->uid;
pid_t gid = args->gid;
// Disable setgroups
FILE *setgroups_file = fopen("/proc/self/setgroups", "w");
if (setgroups_file == NULL) {
perror("fopen setgroups");
return 1;
}
if (fprintf(setgroups_file, "deny") < 0) {
perror("fprintf setgroups");
fclose(setgroups_file);
return 1;
}
fclose(setgroups_file);
// Write UID mapping
FILE *uid_map_file = fopen("/proc/self/uid_map", "w");
if (uid_map_file == NULL) {
perror("fopen uid_map");
return 1;
}
if (fprintf(uid_map_file, "0 %d 1\n", uid) < 0) {
perror("fprintf uid_map");
fclose(uid_map_file);
return 1;
}
fclose(uid_map_file);
// Write GID mapping
FILE *gid_map_file = fopen("/proc/self/gid_map", "w");
if (gid_map_file == NULL) {
perror("fopen gid_map");
return 1;
}
if (fprintf(gid_map_file, "0 %d 1\n", gid) < 0) {
perror("fprintf gid_map");
fclose(gid_map_file);
return 1;
}
fclose(gid_map_file);
// Re-execute the process with the init argument
char *const args_exec[] = {"/proc/self/exe", "init", NULL};
execv(args_exec[0], args_exec);
perror("execv");
return 1;
}
void run() {
struct clone_args args;
args.uid = getuid();
args.gid = getgid();
int flags = CLONE_NEWUSER | CLONE_NEWIPC | CLONE_NEWNET | CLONE_NEWUTS |
CLONE_NEWPID | CLONE_NEWNS | SIGCHLD;
pid_t pid = clone(childFuncRun, child_stack + STACK_SIZE, flags, &args);
std::cout << "Starting child process with PID=" << pid << std::endl;
if (pid == -1) {
perror("clone");
exit(EXIT_FAILURE);
}
if (waitpid(pid, NULL, 0) == -1) {
perror("waitpid");
exit(EXIT_FAILURE);
}
std::cout << "Child process exited" << std::endl;
exit(EXIT_SUCCESS);
}
void usage(const char *prog_name) {
std::cerr << "Usage: " << prog_name << " run" << std::endl;
exit(EXIT_FAILURE);
}
int main(int argc, char *argv[]) {
if (argc <= 1) {
usage(argv[0]);
}
std::string command = argv[1];
if (command == "run") {
run();
} else if (command == "init") {
if (childFuncInit() == 1) {
std::cerr << "Initialization failed." << std::endl;
exit(EXIT_FAILURE);
}
} else {
usage(argv[0]);
}
return 0;
}
For verification, run the following program (multi-threaded).
#include <iostream>
#include <thread>
#include <vector>
#include <chrono>
void cpu_bound_task() {
auto start = std::chrono::high_resolution_clock::now();
long long result = 0;
for (long long i = 0; i < 1e8; ++i) {
result += i * i;
}
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> duration = end - start;
std::cout << "Task completed in " << duration.count() << " seconds.\n";
}
int main() {
const int num_threads = 4;
std::vector<std::thread> threads;
for (int i = 0; i < num_threads; ++i) {
threads.emplace_back(cpu_bound_task);
}
for (auto& t : threads) {
t.join();
}
return 0;
}
Without creating a cgroup
Task completed in 0.0734886 seconds.
Task completed in 0.0734435 seconds.
Task completed in 0.0744386 seconds.
Task completed in 0.0755676 seconds.
With nothing written to cpu.max (default value: max 100000)
Task completed in 0.0737794 seconds.
Task completed in 0.0738442 seconds.
Task completed in 0.0739563 seconds.
Task completed in 0.0750249 seconds.
With 100000 100000 written to cpu.max
Task completed in 0.145839 seconds.
Task completed in 0.22167 seconds.
Task completed in 0.222576 seconds.
Task completed in 0.225495 seconds.
With 200000 100000 written to cpu.max
Task completed in 0.0999713 seconds.
Task completed in 0.105472 seconds.
Task completed in 0.107703 seconds.
Task completed in 0.109426 seconds.
With 400000 100000 written to cpu.max
Task completed in 0.0741137 seconds.
Task completed in 0.074401 seconds.
Task completed in 0.0745016 seconds.
Task completed in 0.0748775 seconds.
Next, run a verification program that uses only a single thread.
To draw a graph, the value written to cpu.max is changed on each iteration.
#include <chrono>
#include <cstdio>
#include <fstream>
#include <iostream>
#include <thread>
#include <vector>
float cpu_bound_task() {
auto start = std::chrono::high_resolution_clock::now();
long long result = 0;
for (long long i = 0; i < 1e6; ++i) {
result += i * i;
}
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> duration = end - start;
return duration.count();
}
int main() {
for (int mx = 1000; mx <= 100000; mx = mx + 100) {
FILE *cpu_quota_file = fopen("/sys/fs/cgroup/my-container/cpu.max", "w");
if (cpu_quota_file == NULL) {
perror("fopen cpu_quota_file");
return 1;
}
if (fprintf(cpu_quota_file, "%d 100000\n", mx) < 0) {
perror("fprintf cpu_quota");
fclose(cpu_quota_file);
return 1;
}
fclose(cpu_quota_file);
float sum = 0;
for (int i = 0; i < 100; i++) {
float res = cpu_bound_task();
sum += res;
}
float ave = sum / 100;
std::cout << mx << " " << ave << std::endl;
}
return 0;
}
Plotting the output with Python gave the following.
Python program used for plotting
import matplotlib.pyplot as plt
import numpy as np
data = np.loadtxt("./result.txt")
x = data[:,0]
y = data[:,1]
plt.xlabel("available CPU time(ms) during 100ms-period")
plt.ylabel("execution time")
plt.minorticks_on()
plt.grid()
plt.plot(x / 1000, y)
plt.show()

In this case, when the CPU time usable within each 100 ms period is set to 100 ms, the execution takes almost no time.
Also, doubling the available CPU time roughly halves the execution time.
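The inverse relationship can be reproduced with an idealized model of the limit: a single thread runs until the limit for the current period is used up, then waits for the next period. The sketch below is such a model, not derived from the measurement; it assumes the task starts exactly at a period boundary and ignores all other scheduling effects. The function name estimateWallTimeUs is hypothetical.

```cpp
#include <cassert>

// Estimated wall-clock time (microseconds) for a single-threaded task that
// needs `cpuNeededUs` of CPU time under a limit of `quotaUs` per `periodUs`.
long long estimateWallTimeUs(long long cpuNeededUs, long long quotaUs, long long periodUs) {
  assert(cpuNeededUs >= 0 && quotaUs > 0 && periodUs > 0);
  // One thread can use at most one full period of CPU time per period,
  // so a limit at or above the period never throttles it.
  if (quotaUs >= periodUs) return cpuNeededUs;
  const long long fullPeriods = cpuNeededUs / quotaUs;
  const long long remainder = cpuNeededUs % quotaUs;
  if (remainder == 0 && fullPeriods > 0) {
    // The task finishes exactly when the limit of its last period runs out.
    return (fullPeriods - 1) * periodUs + quotaUs;
  }
  // Throttled through `fullPeriods` whole periods, then runs the rest.
  return fullPeriods * periodUs + remainder;
}
```

For a task much longer than the limit, the estimate approaches cpuNeededUs * periodUs / quotaUs, so doubling the limit roughly halves the time, and once the limit reaches the period the task is no longer slowed down at all.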