Last week, Bernhard and I had to investigate high disk I/O load reported by one of our storage colleagues on the NFS server that serves the data for our Docker containers. We still have high disk load, because we are running lots of containers, so this post will not resolve a load issue, but it will give you some deep insights into the strange behaviors and technical details we discovered during our I/O deep dive.
First of all, the high I/O load is not a problem per se. We have plenty of reserves in our storage and we are not investigating any performance issues. But the question asked by our storage colleagues was simple: which container (or project) generates the I/O load?
Short answer: We do not know, and we are not able to track it down. Not now and not with the current setup. Read on to understand why.
No, really, Docker does not track all block I/O. This took us some time to understand, but let's go through it step by step. The first thing you will think of when triaging block I/O load is to run docker stats, which is absolutely correct. And that's where you reach the end of the world, because Docker, or to be more precise the Linux kernel, does not see block I/O which is served over an NFS mount! You don't believe it? Just look at the following example.
First, create and mount a file system over a loop device. Then mount an NFS share onto a folder inside this mount and monitor the block I/O on the loop device to see what happens, or rather, what you cannot see.
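The setup we used looked roughly like the following; the file names, sizes and the NFS server/export are placeholders, so adjust them to your environment:

```bash
# Create a 1 GB backing file and attach it to a loop device
dd if=/dev/zero of=/tmp/loopdisk.img bs=1M count=1024
losetup /dev/loop0 /tmp/loopdisk.img

# Put a file system on the loop device and mount it
mkfs.ext4 /dev/loop0
mkdir -p /mnt/testmountpoint
mount /dev/loop0 /mnt/testmountpoint

# Mount an NFS share onto a folder inside this mount
mkdir -p /mnt/testmountpoint/nfsmount
mount -t nfs <nfs-server>:/<export> /mnt/testmountpoint/nfsmount
```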
At this point, open a second console. In the first console, enter a dd command to write a file into /mnt/testmountpoint/nfsmount, and in the second console, start the iostat command to monitor the block I/O on your loop device.
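A sketch of the two consoles; the write size and the iostat interval are arbitrary:

```bash
# Console 1: write 100 MB through the NFS mount that lives below the loop device mount
dd if=/dev/zero of=/mnt/testmountpoint/nfsmount/testfile bs=1M count=100

# Console 2: monitor the block I/O on the loop device, refreshed every second
# (while dd writes into the NFS mount, loop0 shows no write activity)
iostat -d loop0 1
```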
As the output of this run shows, iostat does not register any block I/O, because the I/O never touches the underlying disk. If you do the same test without the mounted NFS share, you will see the block I/O in the iostat output as usual.
The same is true if you are using docker volume NFS mounts! The block I/O is not tracked, which is perfectly logical, because this block I/O never touches a local disk. Bad luck. The same holds for any other mount type that does not write to local disks, like Gluster (FUSE) and many more.
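For illustration, an NFS-backed named volume with the local volume driver typically looks like this; the server address, export path and volume name below are made up:

```bash
# Create a volume that is backed by an NFS export (addr and device are placeholders)
docker volume create --driver local \
  --opt type=nfs \
  --opt o=addr=nfs.example.com,rw \
  --opt device=:/export/projectdata \
  projectdata

# Writes into /data go over the network instead of a local block device,
# so they never show up as block I/O on the Docker host
docker run --rm -v projectdata:/data alpine \
  dd if=/dev/zero of=/data/testfile bs=1M count=100
```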
We think we will open an issue for this, because docker stats counts the block I/O wrong. You can test this by starting a container, running a deterministic dd command and watching the docker stats output of the container in parallel. See the terminal recording to get an idea.
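If you cannot watch the recording, the test boils down to something like this; the container name and the write size are arbitrary:

```bash
# Start a long-running test container
docker run -d --name blkio-test ubuntu sleep infinity

# Console 1: write 100 MB into the container's writable layer (run it a second time for the test below)
docker exec blkio-test dd if=/dev/zero of=/testfile bs=1M count=100 conv=fsync

# Console 2: watch the BLOCK I/O column in parallel
docker stats blkio-test
```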
As the recording shows, the first dd write goes completely unseen by the docker stats command. This might be OK, because there are several buffers involved in write operations. But when the dd command is issued a second time, writing an additional 100 megabytes, the docker stats command shows a total of 0B / 393MB, roughly 400 megabytes. The test wrote 200 megabytes, but docker stats shows double the amount of data written. Strange, but why does this happen?
At this point, more information is needed. Therefore it is recommended to query the Docker API to retrieve more detailed information about the container stats. This can be done with a recent version of curl (one that can talk directly to the Docker socket), which returns a large JSON document with the container stats.
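We fetched a single stats sample over the Docker socket, roughly like this; note that --unix-socket requires curl 7.40 or newer, and <container id> has to be replaced with the full container id:

```bash
# One stats snapshot of the container as JSON
curl -s --unix-socket /var/run/docker.sock \
  "http://localhost/containers/<container id>/stats?stream=false"
```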
Now, search for io_service_bytes_recursive in the JSON output. There will be something like this:
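The raw JSON from our run is not reproduced here, but the relevant part has the following shape. Each device also carries Read, Write, Sync and Async entries, and the byte values and minor numbers below are only illustrative, chosen so that the three totals add up to the roughly 393 MB discussed next:

```json
"blkio_stats": {
  "io_service_bytes_recursive": [
    { "major": 8,   "minor": 0, "op": "Total", "value": 131072000 },
    { "major": 253, "minor": 1, "op": "Total", "value": 131072000 },
    { "major": 253, "minor": 4, "op": "Total", "value": 131072000 }
  ]
}
```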
Oops, there are three block devices here. Where are they coming from? If the totals are summed up, we get the 393 megabytes we have seen before. The major and minor numbers identify the device type, and the Linux kernel documentation includes the complete list of device major and minor numbers. The major number 8 identifies a block device as a SCSI disk device, which is correct, as the server uses sd* for the local devices. The major number 253 refers to RESERVED FOR DYNAMIC ASSIGNMENT, which is also correct, because the container gets a local mount for the write layer. Therefore there are multiple devices: the real device sd* and the dynamic device for the writable image layer, which writes the data to the local disk. That's why the block I/O is counted multiple times!
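If you want to see which devices hide behind these major:minor pairs on your own host, you can check them like this:

```bash
# List block devices together with their major:minor numbers
lsblk -o NAME,MAJ:MIN,TYPE,MOUNTPOINT

# The drivers behind the major numbers are listed in /proc/devices
# (sd -> SCSI disks, device-mapper -> one of the dynamically assigned majors)
grep -E 'sd|device-mapper' /proc/devices
```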
But we can dig even deeper and inspect the cgroup information used by the Linux kernel to isolate the resources for the container. This information can be found under /sys/fs/cgroup/blkio/docker/<container id>, e.g. /sys/fs/cgroup/blkio/docker/195fd970ab95d06b0ca1199ad19ca281d8da626ce6a6de3d29e3646ea1b2d033. The file blkio.throttle.io_service_bytes contains the information about how much data was really transferred to the block devices.
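You can read these counters directly on the Docker host (on a cgroup v1 host); the format is one line per device and operation, followed by a final Total line:

```bash
# Byte counters for the container's blkio cgroup
cat /sys/fs/cgroup/blkio/docker/<container id>/blkio.throttle.io_service_bytes

# Each line looks like "<major>:<minor> <operation> <bytes>", e.g. "8:0 Write ...",
# and the last line "Total <bytes>" sums everything up
```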
For our test container, this is where the correct numbers show up: in sum, the Total is roughly 250 megabytes. 200 megabytes were written by the dd commands and the rest is logging and other I/O. This is the correct number. You can test this yourself by running a dd command and watching the blkio.throttle.io_service_bytes content.
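For example, with the test container from the sketch above (replace <container id> with the full id of that container):

```bash
# Console 1: write another 100 MB inside the container
docker exec blkio-test dd if=/dev/zero of=/testfile2 bs=1M count=100 conv=fsync

# Console 2: re-read the cgroup counters every second and watch the Total line grow
watch -n1 cat /sys/fs/cgroup/blkio/docker/<container id>/blkio.throttle.io_service_bytes
```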
The docker stats command is really helpful to get an overview of your block device I/O, but it does not show the full truth. It is still useful for monitoring containers that write local data, which may indicate that something is not configured correctly regarding data persistence. Furthermore, if you use network shares to let the containers persist their data, you cannot measure the block I/O on the Docker host the container is running on. The ugly part: if you are using one physical device (a large LVM, for example) on your network share server, you will only get one big number for the block I/O, and you will not be able to assign that I/O to a container, a group of containers or a project.
Facts:
- If you use NFS (or any other kind of share) backed by a single block device on the NFS server, you only get the sum of all block I/O and you cannot assign this block I/O to a concrete project or container
- Use separate block devices for your shares
- Even if you use Gluster, you will have exactly the same problem
- FUSE mounts are also not seen by the Linux kernel
We are currently evaluating thinly provisioned LVM devices in combination with Gluster to report the block I/O via iostat (the JSON output) to Elasticsearch; a rough sketch of that building block follows below. Stay tuned for more about this soon!
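As a rough idea of the direction, recent sysstat releases can emit iostat samples as JSON, which can then be pushed to Elasticsearch; the endpoint and index name below are, of course, placeholders:

```bash
# One 5-second iostat sample as JSON (needs a sysstat version with -o JSON support)
iostat -d -o JSON 5 1 > /tmp/iostat-sample.json

# Push the sample into an Elasticsearch index (URL and index are placeholders)
curl -s -X POST "http://elasticsearch.example.com:9200/blkio-stats/_doc" \
  -H 'Content-Type: application/json' \
  --data-binary @/tmp/iostat-sample.json
```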
-M