Site Reliability Engineering

Estimated reading time: 2 mins

The two-week holiday vacation is over, we are back at work, and the Docker Swarms ran fully unattended without a single outage during this period! Reliability means something different for every IT engineer, because it depends on the goals you have to achieve with your team.

Last year at the same time we ran roughly 450 containers in our Docker Swarms; this year we are already running more than 1,500 containers, almost three times as many.

For me, the Yin and Yang symbol is not the worst symbol to reflect the idea of Site Reliability Engineering, because there are always trade-offs you have to accept between the ultimately reliable system and the infinite time it would take to implement such a system. Different and often completely contrary needs have to work together to create a system that still fulfills all of them, like Yin and Yang.

The monitoring and alerting system observed the Docker Swarms autonomously, and today we reviewed the tracked data. The only thing that happened was the failure of a single Docker worker node, which rebooted. The Docker Swarm automatically started the missing containers on the remaining Docker workers, and nothing more happened.

I think we did a great job, because we had two full vacation weeks without any stress. The Docker Swarm didn't break, all services were always up and running, and the system handled a failing Docker worker as expected.

Times like these are always exciting because they prove whether the systems keep working even without people watching them. Happy new year and happy hacking!

Posted on: Mon, 07 Jan 2019 20:24:46 +0100 by Mario Kleinsasser

Mario Kleinsasser
Doing Linux since 2000 and containers since 2009. Like to hack new and interesting stuff. Containers, Python, DevOps, automation and so on. Interested in science and I like to read (if I found the time). Einstein said "Imagination is more important than knowledge. For knowledge is limited." - I say "The distance between faith and knowledge is infinite. (c) by me". Interesting contacts are always welcome - nice to meet you out there - if you like, do not hesitate and contact me!

Condensed DockerCon EU 2018 Barcelona

Estimated reading time: 4 mins

This year we had the great opportunity to attend DockerCon EU 2018 in Barcelona with four people. Two of us, Alex and Martin, are developers, and Bernhard and I are operators, so we did a real DevOps journey! The decision to go with both teams, in terms of DevOps, was the best we ever made, and we are very thankful that our company, STRABAG BRVZ, supported this idea. In fact, there were a lot of topics which were developer focused, and in parallel there were also a lot of breakouts that were more operator focused. So we got the best of both worlds.

We will not write a long summary of all the sessions, breakouts and workshops we attended, as you can find all the sessions online already - videos of all sessions are available here!

Rather, we will give you an inside view of a great community.

I (Mario) am an active Docker Community Leader, and therefore I got the chance to attend the Docker Community Leader Summit, which took place on Monday afternoon. I came late to the summit because our flight was delayed, but Lisa (Docker Community Manager) reserved a seat for me. So I was only able to take part in the last two hours of the summit, but this was still a huge benefit. You might think that at such summits only soft, watered-down discussions go on, but from my point of view I can tell that this was not the case. Instead, the discussion was very focused on the pros and cons of what Docker expects from the Community Leaders and what the Community Leaders can expect from Docker in terms of support for their meetups. In short, there will be a new Code of Conduct for the Community Leaders in the near future. The second discussion was about Bevy, the “Meetup” platform where the Docker Meetup pages are created and the Docker Meetups are announced. Not all of us are happy with the current community split between bevy.com and meetup.com, and we discussed both sides of the coin. This is obviously a topic we will have to look at more in the next few months, and we will see how things progress. Sadly, I had to leave the summit right on time, as Bernhard and I were going to hold a Hallway Track, and therefore I missed the Community Leader Summit group photo…

The Hallway Track we did was really fun and impressive. We shared our BosnD project, as we think that a lot of people are still struggling to run more than a handful of services in production. There are new load balancer concepts like Traefik out there, and there are also service meshes, but most of the time people just want to get up and running with the things they already have, but in containers and with the many benefits of an orchestrator (like Docker Swarm). And judging from our Hallway Track, and also from the Hallway Track held by Rachid Zarouali (AMA Docker Captains Track), which I attended too, this is still one of the main issues.

The DockerCon party was huge, and we had the chance to talk to a lot of people and friends. It was a very nice evening with great food and a large number of discussions. After DockerCon EU 2017, people said that Docker was dead and that the Docker experiment would be over soon. And yes, it was not clear how Docker Inc. would handle the challenges it was facing. One year later, after Microsoft bought GitHub and Red Hat was swallowed by IBM, Docker Inc. is now on a good course. Of course they have to run their Enterprise program, they have to earn money, but they are still dedicated to the community and, and this was surprising, to their customers. There were some really cool breakouts, like the one from Citizens Bank, which clearly showed that Docker Inc. (the company) is able to handle both Docker Swarm AND Kubernetes very well with their Docker EE product.

Well, we will see where this is all going, but in our opinion, Docker Inc. currently seems to be vital (look at their growing customer numbers) and their business model seems to work.

Posted on: Sun, 30 Dec 2018 20:00:09 +0100 by Mario Kleinsasser , Bernhard Rausch

  • Docker
  • DockerCon
Mario Kleinsasser
Doing Linux since 2000 and containers since 2009. Like to hack new and interesting stuff. Containers, Python, DevOps, automation and so on. Interested in science and I like to read (if I found the time). Einstein said "Imagination is more important than knowledge. For knowledge is limited." - I say "The distance between faith and knowledge is infinite. (c) by me". Interesting contacts are always welcome - nice to meet you out there - if you like, do not hesitate and contact me!
Bernhard Rausch
SysAdmin/OpsEngineer/CloudArchitect; loves to get things ordered the right way: "A tidy house, a tidy mind."; configuration management fetishist; loving backups; impressed by docker; Always up to get in contact with interesting people - do not hesitate to write a comment or to contact me!

Running Play with Docker on AWS

Estimated reading time: 10 mins

Some weeks ago I dived a little bit into the Play with Docker GitHub repository because I wanted to run Play with Docker (called PWD) locally, to have a backup option during a Docker Meetup in case something went wrong with the internet connectivity or with the prepared Docker workshop sessions.

The most important thing first: running PWD locally means running it on localhost by default, which obviously will not allow others to connect to the PWD setup on your localhost.

Second, I read a PWD GitHub issue where a user asked how to run PWD on AWS, and I thought that this would be nice to have and of course I would like to help this user. So this one is for you too, Kevin Hung.

Third, thanks to our job as Cloud Solution Architects at STRABAG BRVZ IT, we have the opportunity to try things out without having to worry about the framework conditions. This blog is a holiday gift from #strabagit. If you like it, share it, as sharing is caring. :-)

To be honest, this blog post will be very technical (again), and there are probably a lot of other ways to achieve the same result. This post is therefore not meant to be the holy grail, and it is far from perfect in terms of security, e.g. authentication. It is meant to be a cooking recipe - feel free to change the ingredients as you like! I will try to describe all steps in enough detail that everyone can adapt them to their personal needs and possibilities.

Tip: It might be helpful to read the whole article once before you start working with it!

Ingredient list

As every cooking recipe needs an ingredient list, here it comes:

The recipe

This is going to be a cloud solution, hosted on AWS. And as with nearly every cloud solution, it is hard to bootstrap the components in the correct order to get up and running, because there might be implicit dependencies. Before we can cover the installation of PWD, we have to prepare the environment. First of all we need the internet domain name we would like to use, as this name needs to be known later during the PWD configuration.

1. The domain and AWS Route53

As written above, a free domain from Freenom fits perfectly! Choose a domain name and register it on Freenom. At this point we have to do two things in parallel, as your domain name and the AWS Route53 configuration depend on each other!

If you have registered a domain name on Freenom, move to your AWS console and open the AWS Route53 dashboard. Create a public hosted zone there with your zone name from Freenom. What we would like to achieve is a so-called DNS delegation. To achieve this, write down the NS records you get when you create a hosted zone with AWS Route53. For example, I registered m4r10k.cf at Freenom. Therefore I created a hosted zone called m4r10k.cf in AWS Route53, which results in a list of NS records, in my case e.g. ns-296.awsdns-37.com. and ns-874.awsdns-45.net.. Head over to Freenom, edit your domain name, and under the domain configuration choose DNS and use the DNS NS records provided by AWS Route53. See the picture on the right for details.
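
To verify that the delegation works before you continue, you can query the NS records yourself. Here is a quick check, assuming the example domain m4r10k.cf and the Route53 name servers from above (replace them with your own):

# Once the delegation has propagated, the answer should list the
# AWS Route53 name servers for your zone.
dig NS m4r10k.cf +short

# Optionally, ask one of the Route53 name servers directly.
dig NS m4r10k.cf @ns-296.awsdns-37.com +short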

We will need the AWS Route53 hosted domain later to automatically register our AWS EC2 instance with an appropriate DNS CNAME entry called pwd.m4r10k.cf.

2. The AWS EC2 instance and Play with Docker installation

As mentioned above, we are using Ansible to automate our cloud setups, but you can of course do all the next steps manually. I will reference the Ansible tasks in the correct sequence to show you how to set up Play with Docker on an AWS EC2 instance. The process itself is fully automated, but once again, you can do all this manually too.

First we start the AWS EC2 instance, which is pretty easy with Ansible. The documentation for every module, in this example ec2, can be found in the Ansible documentation. The most important thing here is that the created instance is tagged, so we can find it later by the provided tag. As the operating system (AMI) we use Ubuntu 18.04, as it is easier to install go-dep, which is needed later.

    - name: Launch instance
      ec2:
         key_name: "{{ ssh_key_name }}"
         group: "{{ security_group }}"
         instance_type: "{{ instance_type }}"
         image: "{{ image }}"
         wait: true
         region: "{{ region }}"
         assign_public_ip: yes
         vpc_subnet_id: "{{ vpc_subnet_id }}"
         instance_tags: "{{ instance_tags }}"
      register: ec2
      with_sequence: start=0 end=0

After that, we install the needed software into the newly created AWS EC2 instance. This is the longer part of the Ansible playbook. Be aware that you might have to wait a little bit until the SSH connection to the AWS EC2 instance is ready. You can use the following task to wait for it. The ec2training inventory is dynamically built at runtime.

- hosts: ec2training
  gather_facts: no
  vars:
    ansible_user: ubuntu
  tasks:
    - name: Wait 300 seconds for port 22 to become open and contain "OpenSSH" on "{{inventory_hostname}}"
      wait_for:
        port: 22
        host: "{{inventory_hostname}}"
        search_regex: OpenSSH
        delay: 10
      vars:
        ansible_connection: local

The next thing we have to do is install Python, as the AWS EC2 Ubuntu AMI does not include it and Python is needed for the Ansible modules. Therefore we install Python into the AWS EC2 instance the hard way.

- hosts: ec2training
  gather_facts: no
  vars:
    ansible_user: ubuntu
  tasks:
    - name: install python 2
      raw: test -e /usr/bin/python || (sudo apt -y update && sudo apt install -y python-minimal)

Now we go on and install the whole Docker and PWD software. Here comes the description of the tasks in the playbook. The most important step here is that you replace localhost in the config.go file of PWD with your Freenom domain!

- hosts: ec2training
  gather_facts: yes
  vars:
    ansible_user: ubuntu
    docker_version: "docker-ce=18.06.1~ce~3-0~ubuntu"
  tasks:
    - name: Ping pong
      ping:

    - name: Add Docker GPG key
      apt_key: url=https://download.docker.com/linux/ubuntu/gpg
      become: yes

    - name: Add Docker APT repository
      apt_repository:
        repo: deb [arch=amd64] https://download.docker.com/linux/ubuntu {{ansible_distribution_release}} stable
      become: yes

    - name: Install Docker
      apt:
        name: "{{ docker_version }}"
        state: present
        update_cache: yes
      become: yes
    
    - name: Apt mark hold Docker
      shell: apt-mark hold "{{ docker_version }}"
      become: yes

    - name: Install go-dep
      apt:
        name: "go-dep"
        state: present
        update_cache: yes
      become: yes

    - name: Install docker-compose
      shell: curl -L "https://github.com/docker/compose/releases/download/1.22.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
      become: yes

    - name: Set docker-compose permissions
      shell: chmod +x /usr/local/bin/docker-compose
      become: yes

    - name: Add ubuntu user to Docker group
      shell: gpasswd -a ubuntu docker
      become: yes
    
    - name: Run Docker Swarm Init
      shell: docker swarm init
      become: yes

    - name: Git clone Docker PWD
      git:
        force: yes
        repo: 'https://github.com/play-with-docker/play-with-docker.git'
        dest: /home/ubuntu/go/src/github.com/play-with-docker/play-with-docker

    - name: Run go dep
      shell: cd /home/ubuntu/go/src/github.com/play-with-docker/play-with-docker && dep ensure
      environment:
        GOPATH: /home/ubuntu/go

    - name: Replace localhost in config.go of PWD
      replace:
        path: /home/ubuntu/go/src/github.com/play-with-docker/play-with-docker/config/config.go
        regexp: 'localhost'
        replace: 'pwd.m4r10k.cf'
        backup: no

    - name: Docker pull franela/dind
      shell: docker pull franela/dind
      environment:
        GOPATH: /home/ubuntu/go

    - name: Run docker compose
      shell: docker-compose up -d
      args:
        chdir: /home/ubuntu/go/src/github.com/play-with-docker/play-with-docker
      environment:
        GOPATH: /home/ubuntu/go

3. Automatically create the AWS Route53 CNAME records

Now the only thing left is to create the AWS Route53 CNAME records. We can use Ansible for this too. The most important thing here is that you also create a wildcard entry for your domain. If you later run Docker images which expose ports, like Nginx for example, PWD will automatically map the ports to a dynamic domain name which resides under your PWD domain.

- hosts: localhost
  gather_facts: no
  connection: local
  vars:
    ssh_key_name: pwd-m4r10k
    region: eu-central-1
  tasks:
    - name: List instances
      ec2_instance_facts:
        region: "{{ region }}"
        filters:
          "tag:type": pwd-m4r10k
          instance-state-name: running
      register: ec2

    - name: Debug
      debug: var=ec2

    - name: Add all instance public IPs to host group
      add_host: 
        name: "{{ item.public_ip_address }}"
        groups:
          - ec2training
      with_items: "{{ ec2.instances }}"

    - name: Create pwd CNAME record
      route53:
        state: present
        zone: m4r10k.cf
        record: pwd.m4r10k.cf
        type: CNAME
        value: "{{ item.public_dns_name  }}" 
        ttl: 30
        overwrite: yes
      with_items: "{{ ec2.instances }}"
    
    - name: Create "*.pwd" CNAME record
      route53:
        state: present
        zone: m4r10k.cf
        record: "*.pwd.m4r10k.cf"
        type: CNAME
        value: "{{ item.public_dns_name  }}" 
        ttl: 30
        overwrite: yes
      with_items: "{{ ec2.instances }}"

What it looks like

After the setup is up and running, you can point your browser to your chosen domain, which in my case is http://pwd.m4r10k.cf. Then you can just click the start button to start your PWD session. Create some instances and start an Nginx, for example. Just wait a little bit and the dynamic port number, usually 33768, will come up, and you can just click on it to see the Nginx welcome page.
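
If you prefer the command line, a simple request against the domain is a quick smoke test; this is just a sketch using the example domain from above:

# An HTTP response (instead of a timeout) indicates that the DNS records
# point to the instance and the PWD web interface is reachable on port 80.
curl -I http://pwd.m4r10k.cf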

Sum Up

This blog post should show that it is possible to set up a Play with Docker environment for your personal use in Amazon's AWS cloud, fully automated with Ansible. You can use the PWD setup for different purposes, like your Docker Meetups. Furthermore, you do not have to use Ansible; all steps can of course also be done manually or with another automation framework.

Have a lot of fun, happy hacking, nice Holidays and a happy new year!

-M

Posted on: Fri, 28 Dec 2018 13:20:53 +0100 by Mario Kleinsasser

  • Docker
  • PWD
Mario Kleinsasser
Doing Linux since 2000 and containers since 2009. Like to hack new and interesting stuff. Containers, Python, DevOps, automation and so on. Interested in science and I like to read (if I found the time). Einstein said "Imagination is more important than knowledge. For knowledge is limited." - I say "The distance between faith and knowledge is infinite. (c) by me". Interesting contacts are always welcome - nice to meet you out there - if you like, do not hesitate and contact me!

Docker, block I/O and the black hole

Estimated reading time: 7 mins

Last week, Bernhard and I had to investigate high disk I/O load reported by one of our storage colleagues on our NFS server which serves the data for our Docker containers. We still have high disk load, because we are running lots of containers, so this post will not resolve a load issue, but it will give you some deep insights into some strange behaviors and technical details which we discovered during our I/O deep dive.

The question: Which container (or project) generates the I/O load?

First of all, the high I/O load is not the problem per se. We have plenty of reserves in our storage, and we are not investigating any performance issues. But the question asked by our storage colleagues was as simple as: which container (or project) generates the I/O load?

Short answer: we do not know, and we are not able to track it down. Not now and not with the current setup. Read ahead to understand why.

Finding 1: docker stats does not track all block I/O

No, really, it doesn't track all block I/O. This took us some time to understand, but let's do it step by step. The first thing you will think about when triaging block I/O load is to run docker stats, which is absolutely correct. And that's where you reach the end of the world, because Docker, or to be more precise the Linux kernel, does not see block I/O which is served over an NFS mount! You don't believe it? Just look at the following example.

First, create and mount a file system over a loop device. Then mount an NFS share onto a folder inside this mount and monitor the block I/O on the loop device to see what happens, or rather, what you cannot see.

dd if=/dev/zero of=testdisk bs=50M count=100
mkfs.ext4 testdisk 
mkdir /mnt/testmountpoint
mount -o loop=/dev/loop0 testdisk /mnt/testmountpoint
mkdir /mnt/testmountpoint/nfsmount
mount -t nfs <mynfsserver>:/myshare /mnt/testmountpoint/nfsmount

At this point, open a second console. In the first console, enter a dd command to write a file into /mnt/testmountpoint/nfsmount, and in the second console, start the iostat command to monitor the block I/O on your loop device.

# First console
dd if=/dev/zero of=/mnt/testmountpoint/nfsmount/testfile bs=40M count=100
# Second console
iostat -dx /dev/loop0 1

Here is an output from this run, and as you can see, iostat does not recognize any block I/O because the I/O never touches the underlying disk. If you do the same test without using the mounted NFS share, you will see the block I/O in the iostat output as usual.

The same is true if you are using Docker volume NFS mounts! The block I/O is not tracked, which is perfectly logical because this block I/O never touches a local disk. Bad luck. The same holds for any other mount type that is not written to local disks, like Gluster (FUSE) and many more.
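
For illustration, such an NFS-backed named volume can be created with the local volume driver; the server name and export below are placeholders:

# Create a named volume that is backed by an NFS export.
docker volume create --driver local \
  --opt type=nfs \
  --opt o=addr=mynfsserver.example.com,rw \
  --opt device=:/myshare \
  nfsvolume

# Anything a container writes below /data goes over NFS and never touches
# the local block layer, so it never shows up in the docker stats block I/O.
docker run --rm -v nfsvolume:/data alpine dd if=/dev/zero of=/data/testfile bs=1M count=100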

Finding 2: docker stats does not track block I/O fully correctly

We think we will open an issue for this, because docker stats counts the block I/O wrongly. You can test this by starting a container, running a deterministic dd command and watching the docker stats output of the container in parallel. See the terminal recording to get an idea.

As the recording shows, the first dd write is completely unseen by the docker stats command. This might be OK, because there are several buffers for write operations involved. But as the dd command is issued a second time, to write an additional 100 megabytes, the docker stats command shows a summary of 0B / 393MB, roughly 400 megabytes. The test wrote 200 megabytes, but docker stats shows double the amount of data written. Strange, but why does this happen?

At this point, more information is needed. Therefore it is recommended to query the Docker API to retrieve more detailed information about the container stats. This can be done with a current version of curl, which generates the following output.

docker run -ti -v /var/run/docker.sock:/var/run/docker.sock nathanleclaire/curl sh
curl --unix-socket /var/run/docker.sock http:/containers/195fd970ab95/stats
{"read":"2018-09-30T07:18:13.870693255Z","preread":"0001-01-01T00:00:00Z","pids_stats":{"current":1},"blkio_stats":{"io_service_bytes_recursive":[{"major":8,"minor":0,"op":"Read","value":0},{"major":8,"minor":0,"op":"Write","value":144179200},{"major":8,"minor":0,"op":"Sync","value":0},{"major":8,"minor":0,"op":"Async","value":144179200},{"major":8,"minor":0,"op":"Total","value":144179200},{"major":253,"minor":0,"op":"Read","value":0},{"major":253,"minor":0,"op":"Write","value":104857600},{"major":253,"minor":0,"op":"Sync","value":0},{"major":253,"minor":0,"op":"Async","value":104857600},{"major":253,"minor":0,"op":"Total","value":104857600},{"major":8,"minor":0,"op":"Read","value":0},{"major":8,"minor":0,"op":"Write","value":144179200},{"major":8,"minor":0,"op":"Sync","value":0},{"major":8,"minor":0,"op":"Async","value":144179200},{"major":8,"minor":0,"op":"Total","value":144179200}],"io_serviced_recursive":[{"major":8,"minor":0,"op":"Read","value":0},{"major":8,"minor":0,"op":"Write","value":100},{"major":8,"minor":0,"op":"Sync","value":0},{"major":8,"minor":0,"op":"Async","value":100},{"major":8,"minor":0,"op":"Total","value":100},{"major":253,"minor":0,"op":"Read","value":0},{"major":253,"minor":0,"op":"Write","value":50},{"major":253,"minor":0,"op":"Sync","value":0},{"major":253,"minor":0,"op":"Async","value":50},{"major":253,"minor":0,"op":"Total","value":50},{"major":8,"minor":0,"op":"Read","value":0},{"major":8,"minor":0,"op":"Write","value":100},{"major":8,"minor":0,"op":"Sync","value":0},{"major":8,"minor":0,"op":"Async","value":100},{"major":8,"minor":0,"op":"Total","value":100}],"io_queue_recursive":[{"major":8,"minor":0,"op":"Read","value":0},{"major":8,"minor":0,"op":"Write","value":0},{"major":8,"minor":0,"op":"Sync","value":0},{"major":8,"minor":0,"op":"Async","value":0},{"major":8,"minor":0,"op":"Total","value":0}],"io_service_time_recursive":[{"major":8,"minor":0,"op":"Read","value":0},{"major":8,"minor":0,"op":"Write","value":3145508615},{"major":8,"minor":0,"op":"Sync","value":0},{"major":8,"minor":0,"op":"Async","value":3145508615},{"major":8,"minor":0,"op":"Total","value":3145508615}],"io_wait_time_recursive":[{"major":8,"minor":0,"op":"Read","value":0},{"major":8,"minor":0,"op":"Write","value":2942389316},{"major":8,"minor":0,"op":"Sync","value":0},{"major":8,"minor":0,"op":"Async","value":2942389316},{"major":8,"minor":0,"op":"Total","value":2942389316}],"io_merged_recursive":[{"major":8,"minor":0,"op":"Read","value":0},{"major":8,"minor":0,"op":"Write","value":0},{"major":8,"minor":0,"op":"Sync","value":0},{"major":8,"minor":0,"op":"Async","value":0},{"major":8,"minor":0,"op":"Total","value":0}],"io_time_recursive":[{"major":8,"minor":0,"op":"","value":94781197}],"sectors_recursive":[{"major":8,"minor":0,"op":"","value":281600}]},"num_procs":0,"storage_stats":{},"cpu_stats":{"cpu_usage":{"total_usage":323236434,"percpu_usage":[138755707,5333466,2227292,2923874,24996475,16587971,129569530,2842119],"usage_in_kernelmode":280000000,"usage_in_usermode":20000000},"system_cpu_usage":10647668950000000,"online_cpus":8,"throttling_data":{"periods":0,"throttled_periods":0,"throttled_time":0}},"precpu_stats":{"cpu_usage":{"total_usage":0,"usage_in_kernelmode":0,"usage_in_usermode":0},"throttling_data":{"periods":0,"throttled_periods":0,"throttled_time":0}},"memory_stats":{"usage":107589632,"max_usage":118456320,"stats":{"active_anon":487424,"active_file":0,"cache":104857600,"dirty":0,"hierarchical_memory_limit":9223372036854771712,"hierarchical_memsw_limit":0,"inactive_anon":0,
"inactive_file":104857600,"mapped_file":0,"pgfault":6829,"pgmajfault":0,"pgpgin":57351,"pgpgout":31632,"rss":487424,"rss_huge":0,"total_active_anon":487424,"total_active_file":0,"total_cache":104857600,"total_dirty":0,"total_inactive_anon":0,"total_inactive_file":104857600,"total_mapped_file":0,"total_pgfault":6829,"total_pgmajfault":0,"total_pgpgin":57351,"total_pgpgout":31632,"total_rss":487424,"total_rss_huge":0,"total_unevictable":0,"total_writeback":0,"unevictable":0,"writeback":0},"limit":16818872320},"name":"/testct","id":"195fd970ab95d06b0ca1199ad19ca281d8da626ce6a6de3d29e3646ea1b2d033","networks":{"eth0":{"rx_bytes":49518,"rx_packets":245,"rx_errors":0,"rx_dropped":0,"tx_bytes":0,"tx_packets":0,"tx_errors":0,"tx_dropped":0}}}

Now search for io_service_bytes_recursive in the JSON output. There will be something like this:

{
  "blkio_stats": {
    "io_service_bytes_recursive": [
      {
        "major": 8,
        "minor": 0,
        "op": "Total",
        "value": 144179200
      },
      {
        "major": 253,
        "minor": 0,
        "op": "Total",
        "value": 104857600
      },
      {
        "major": 8,
        "minor": 0,
        "op": "Total",
        "value": 144179200
      }
    ]
  }
}

Oops, there are three block devices here. Where do they come from? If the totals are summed up, we get the 393 megabytes we have seen before. The major and minor numbers identify the device type. The documentation of the Linux kernel includes the complete list of device major and minor numbers. The major number 8 identifies a block device as a SCSI disk device, and this is correct, as the server uses sd* for the local devices. The major number 253 refers to RESERVED FOR DYNAMIC ASSIGNMENT, which is also correct, because the container gets a local mount for the write layer. Therefore there are multiple devices: the real device sd* and the dynamic device for the writable image layer, which writes the data to the local disk. That's why the block I/O is counted multiple times!
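
You can check which driver owns a major number on your own host; the output differs per system, and the device name below (sda) is just an example:

# /proc/devices maps major numbers to drivers, e.g. "8 sd" for SCSI disks
# and, on many systems, "253 device-mapper" for the dynamically assigned one.
grep -A 20 "Block devices" /proc/devices

# ls -l prints the major and minor numbers of a device node (here 8, 0).
ls -l /dev/sda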

But we can dig even deeper and inspect the cgroup information used by the Linux kernel to isolate the resources for the container. This information can be found under /sys/fs/cgroup/blkio/docker/<container id>, e.g. /sys/fs/cgroup/blkio/docker/195fd970ab95d06b0ca1199ad19ca281d8da626ce6a6de3d29e3646ea1b2d033. The file blkio.throttle.io_service_bytes contains the information about what data was really transferred to the block devices. For this test container the output is:

8:0 Read 0
8:0 Write 144179200
8:0 Sync 0
8:0 Async 144179200
8:0 Total 144179200
253:0 Read 0
253:0 Write 104857600
253:0 Sync 0
253:0 Async 104857600
253:0 Total 104857600
Total 249036800

There we have the correct output. In the sum total we have roughly 250 megabytes: 200 megabytes were written by the dd commands, and the rest would be logging and other I/O. This is the correct number. You can test this yourself by running a dd command and watching the blkio.throttle.io_service_bytes content.
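
If you want to read these numbers for all running containers at once, a small loop over the cgroup files is enough. This is a minimal sketch, assuming the cgroupfs driver and the cgroup v1 blkio controller mounted under /sys/fs/cgroup/blkio/docker as shown above:

# Print the blkio byte total for every running container.
for id in $(docker ps -q --no-trunc); do
  name=$(docker inspect -f '{{ .Name }}' "$id")
  file="/sys/fs/cgroup/blkio/docker/$id/blkio.throttle.io_service_bytes"
  total=$(awk '/^Total/ {print $2}' "$file" 2>/dev/null)
  printf '%s %s bytes\n' "$name" "${total:-n/a}"
done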

Conclusion

The docker stats command is really helpful to get an overview of your block device I/O, but it does not show the full truth. It is still useful to monitor containers that are writing local data, which may indicate that something is not correctly configured regarding data persistence. Furthermore, if you use network shares to allow the containers to persist data, you cannot measure the block I/O on the Docker host the container is running on. The ugly part is: if you are using one physical device (a large LVM, for example) on your network share server, you will only get one big block I/O number, and you will not be able to assign the I/O to a container, a group of containers or a project.

Facts:
- If you use NFS (or any other kind of share) backed by a single block device on the NFS server, you only get the sum of all block I/O and you cannot assign this block I/O to a concrete project or container.
- Use separate block devices for your shares.
- Even if you use Gluster, you will have exactly the same problem.
- FUSE mounts are also not seen by the Linux kernel.

Follow up

We are currently evaluating a combination of thin-provisioned LVM devices and Gluster to report the block I/O via iostat (the JSON output) to Elasticsearch. Stay tuned for more about this soon!

-M

Posted on: Sun, 30 Sep 2018 10:47:39 +0100 by Mario Kleinsasser

  • Docker
Mario Kleinsasser
Doing Linux since 2000 and containers since 2009. Like to hack new and interesting stuff. Containers, Python, DevOps, automation and so on. Interested in science and I like to read (if I found the time). Einstein said "Imagination is more important than knowledge. For knowledge is limited." - I say "The distance between faith and knowledge is infinite. (c) by me". Interesting contacts are always welcome - nice to meet you out there - if you like, do not hesitate and contact me!

Writing a Docker Volume Plugin for CephFS

Estimated reading time: 7 mins

Currently we are evaluating Ceph for our on-premises Docker/Kubernetes cluster for persistent volume storage. Kubernetes officially supports CephRBD and CephFS as storage volume drivers. Docker currently does not offer a Docker volume plugin for CephFS.

But there are some plugins available online. A Google search comes up with a handful of plugins that support the CephFS protocol, but the results are quite old (> 2 years) and outdated, or they have too many dependencies, like direct Ceph cluster communication.

This blog post will be a little longer, as it is necessary to provide some basic facts about Ceph, and because there are some odd pitfalls during the plugin creation. Without the great Docker volume plugin for SSHFS written by Victor Vieux, it would not have been possible for me to understand the Docker volume plugin structure! Thank you for your work!

Source code of the Docker Volume Plugin for CephFS can be found here.

About Ceph

Basically, Ceph is a storage platform that provides three types of storage: RBD (RADOS Block Device), CephFS (shared file system) and object storage (S3-compatible protocol). Besides this, Ceph offers API interfaces to operate the Ceph storage remotely. Usually, mounting RBD and CephFS is enabled by installing the Ceph client part on your Linux machine via APT, YUM or whatever is available. This client-side software installs a Linux kernel module which can be used with a classic mount command like mount -t ceph .... Alternatively, the use of FUSE is also possible. The usage of the client-side bindings can be tricky when different versions of the Ceph cluster (e.g. the Mimic release) and the Ceph client (e.g. Luminous) are in use. This may lead to a situation where someone creates an RBD device which has a newer feature set than the client, which may result in a non-mountable file system.

RBD devices are meant to be exclusively mounted by exactly one end system, like a container, which is pretty clear, as you would also never share a physical device between two end systems. RBD block devices therefore cannot be shared between multiple containers. Most of the RBD volume plugins are able to create such a device during the creation of a volume if it does not exist. This means that the plugin must be able to communicate with the Ceph cluster, either via the installed Ceph client software on the server or via the implementation of one of the Ceph API libraries.

CephFS is a shared file system which is backed by the Ceph cluster and which can be shared between multiple end systems, like any other shared file system you may know. It has some nice features, like file system paths which can be authorized separately.

The Kubernetes Persistent Volume documentation contains a matrix about the different file systems and which modes (ReadWriteOnce, ReadOnlyMany, ReadWriteMany) they support.

Docker Volume Plugin Anatomy

Thanks to the great work of Victor Vieux, I was able to get used to the anatomy of a Docker volume plugin, as the official Docker documentation is a little bit, uhm, short. I'm not a programmer. Especially the Docker go-plugins-helpers GitHub repository contains a lot of useful stuff, and in sum I was able to copy/paste/change the plugin within a day.

The api.go file of the plugin helpers contains the description of the interface that a plugin needs to implement.

// Driver represent the interface a driver must fulfill.
type Driver interface {
	Create(*CreateRequest) error
	List() (*ListResponse, error)
	Get(*GetRequest) (*GetResponse, error)
	Remove(*RemoveRequest) error
	Path(*PathRequest) (*PathResponse, error)
	Mount(*MountRequest) (*MountResponse, error)
	Unmount(*UnmountRequest) error
	Capabilities() *CapabilitiesResponse
}

Some words about the interface:

Get and List are used to retrieve information about a volume and to list the volumes provided by the volume plugin when someone executes docker volume ls.

Create creates the volume with the volume plugin, but it does not call the mount command at this time. The volume is only created, nothing more.

Mount is called when a container that uses the created volume starts.

Path is used to track the mount paths for the container.

Unmount is called when the container stops.

Remove is called when the deletion of the volume is requested.

Capabilities is used to describe the needed capabilities of the Docker volume plugin, for example net=host if the plugin needs network communication.
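
From the user's point of view, these methods are triggered by ordinary Docker CLI commands. Here is a rough mapping, using a hypothetical plugin name cephfs-plugin:

# docker volume create triggers Create, docker volume ls triggers List/Get.
docker volume create -d cephfs-plugin myvolume
docker volume ls

# Starting a container with the volume triggers Mount, stopping it Unmount.
docker run --rm -v myvolume:/data alpine ls /data

# Removing the volume triggers Remove.
docker volume rm myvolume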

Besides this, every plugin contains a config.json file which describes the configuration (and capabilities) of the plugin.

The plugin itself must use a special file structure, called a rootfs!

How to write the plugin

OK, I admit, I just copied the Docker volume SSHFS plugin :-) and after that I did the following (besides learning the structure):

1) I changed the config.json of the plugin and removed all the things that my plugin does not need.
2) I changed the functions mentioned above to reflect the needs of my plugin.
3) I packed everything together, tested it and uploaded it.

Points 1) and 2) are just programming and configuring. But 3) is more interesting, because that is where the pitfalls are, and these pitfalls are described in the following section.

The pitfalls

Pitfall 1 Vendors

The first thing I did during development was to refresh the vendors. And this was also my first problem, as it was not possible to get the plugin up and running. There is a little bug in the api.go of the helper: CreatedAt cannot be JSON encoded if it is empty. There is already a GitHub PR for it, which simply adds the needed annotations. You can use the PR, or you just add the needed annotations to the struct like this:

type Volume struct {
	Name       string
	Mountpoint string
	CreatedAt  string                 `json:",omitempty"`
	Status     map[string]interface{} `json:",omitempty"`
}

Pitfall 2 Make

The SSHFS Docker volume plugin is great! Make your life easier and use the provided Makefile! You can create the plugin rootfs with it (make rootfs) and you can easily create the plugin with it (make create)!

Pitfall 3 Push

After I had done all the work, I uploaded the source code to GitLab and created a pipeline to push the resulting Docker container image to Docker Hub so everyone can use it. But this doesn't work. After fiddling around for an hour, I had the eye opener: the docker plugin command has a separate push function. So you have to use docker plugin push to push a Docker plugin to Docker Hub!

Be aware: the Docker Hub repository must not exist before your first push! If you create a repository manually or push a container image into it, it will be flagged as a container repository and you can never ever push a plugin to it! The error message will be: denied: requested access to the resource is denied.

To be able to push the plugin, it must be installed (at least created) in your local Docker engine. Otherwise you cannot push it!
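
The workflow therefore looks roughly like this; the repository name is a placeholder, and the directory passed to docker plugin create is expected to contain the config.json and the rootfs directory:

# Build the plugin from the prepared plugin directory (config.json + rootfs/).
docker plugin create youruser/cephfs-plugin:latest ./plugin

# Push it with the dedicated plugin push command, not docker push.
docker plugin push youruser/cephfs-plugin:latest

# On the target host: install and enable the plugin.
docker plugin install youruser/cephfs-plugin:latest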

Pitfall 4 Wrong Docker image

Be aware that you use the correct Docker base image when you are writing a plugin. If you build your binary on Ubuntu, you might not be able to run it inside your final Docker volume plugin container, because the image you use there is based on Alpine (or the other way around).

Pitfall 5 Unresolved dependencies

Be sure to include all your dependencies in your Docker image build process. For example, if you need the gluster-client, you will have to install it in your Dockerfile so that the dependencies are in place when the Docker volume plugin image is loaded by the container engine.

Pitfall 6 Linux capabilities

Inside the Docker plugin configuration, you have to specify all Linux capabilities your plugin needs. If you miss a capability, the plugin will not do what you want it to do. E.g.:

...
  },
  "linux": {
    "capabilities": [
      "CAP_SYS_ADMIN"
    ],
    "devices": [
      {
          "path": "/dev/fuse"
      }
    ]
  },
...

Debug

A word about debugging a Docker volume plugin. Besides the information you get from the Docker documentation (debugging via the Docker socket), I found it helpful to just run the resulting Docker volume plugin image as a normal container via docker run. This gives you the ability to test whether the Docker image includes everything you need, so that you can do what you want with your plugin later. If you go this way, you have to use the correct docker run options with all the capabilities, devices and the privileged flag. Yes, Docker volume plugins run privileged! Here is an example command: docker run -ti --rm --privileged --net=host --cap-add SYS_ADMIN --device /dev/fuse myrootfsimage bash. After this, test if all features are working.

That's all! If you have questions, just contact me via the various channels. -M

Posted on: Thu, 06 Sep 2018 11:25:39 +0200 by Mario Kleinsasser

  • Docker
  • Ceph
Mario Kleinsasser
Doing Linux since 2000 and containers since 2009. Like to hack new and interesting stuff. Containers, Python, DevOps, automation and so on. Interested in science and I like to read (if I found the time). Einstein said "Imagination is more important than knowledge. For knowledge is limited." - I say "The distance between faith and knowledge is infinite. (c) by me". Interesting contacts are always welcome - nice to meet you out there - if you like, do not hesitate and contact me!