Today, shortly before our working day ended, one of our Docker Swarm clusters, the test environment cluster, nearly died. This wasn’t a Docker Swarm fault, but a combination of several different causes. We would like to share how we handle such situations, so here comes the timeline!
At this time, a developer contacted us because he had a problem deploying his application in our Docker Swarm test cluster. The application container hadn’t started correctly and he didn’t know why.
So we had a look at the service the developer mentioned with the docker service ps command to get some additional information.
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
0cnafdmxdrvf wmc_wmc.1 ... xyz123 Running Assigned 8 minutes ago
qd2muj3pdy1d \_ wmc_wmc.1 ... abc123 Remove Running ... ago
me85ue4xii3f \_ wmc_wmc.1 ... 3pp5zmsfe3jz2n5o54azylgtf Shutdown ... ago "task: non-zero exit (143)"
ssbthaef0093 \_ wmc_wmc.1 ... smqghxgmbkyxi5dn9odd9r39v Shutdown ... ago
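Several columns in this view are truncated. One way to dig a bit deeper is to ask for the untruncated task list and the service logs; the service name wmc_wmc is taken from the output above, everything else is plain Docker CLI:

docker service ps --no-trunc wmc_wmc
docker service logs --tail 100 wmc_wmc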
The service hung on remove! That’s never a good sign, as removing a container should normally be done pretty fast. If something like this happens, strange things are going on.
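To see why the old task wouldn’t go away, one can also inspect it directly; the task ID qd2muj3pdy1d is taken from the docker service ps output above, and the --format filter just narrows the output down to the task status:

docker inspect --format '{{json .Status}}' qd2muj3pdy1d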
A look into journalctl --since 2h -u docker.service showed that around 16:22 the Docker Swarm raft was broken. Now the question was: why? In the logs we saw that at this point in time a Docker Swarm deployment was running, which is OK, since we are using GitLab as our GitOps/CI tool.
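To narrow the daemon logs down to the Swarm-related messages, a simple filter like the following helps; the grep pattern is just an example of the keywords we were looking for:

journalctl -u docker.service --since "2 hours ago" | grep -iE "raft|manager|heartbeat"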
On the host where the remove wasn’t finished, we found additional information in the journalctl --since 2h -u docker.service log.
Jun 03 16:22:14 abc123 dockerd[1400]: time="2019-06-03T16:22:14.014844062+02:00" level=error msg="heartbeat to manager { sm.example.com:2377} failed" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" method="(*session).heartbeat" module=node/agent node.id=vti4nk735ubv9gk9x126dk9tj session.id=cgpntgnefhtqqj5f5yrci4023 sessionID=cgpntgnefhtqqj5f5yrci4023
This log message says that the Docker host can’t connect to the Docker Swarm cluster manager, which is not really good, but it pointed us to the next system: DNS.
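From the affected worker, a few quick checks can confirm whether the manager name from the log message resolves and whether the Swarm port is reachable; sm.example.com and port 2377 are taken from the log line above, and docker node ls has to be run on a manager:

dig +short sm.example.com
nc -zv sm.example.com 2377
docker node ls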
A look at the logs of our CoreDNS installation showed many of the following messages.
2019-06-03T16:22:18.168+02:00 [ERROR] plugin/errors: 0 ns.dns.api.example.com. NS: context deadline exceeded
2019-06-03T16:22:23.347+02:00 [ERROR] plugin/errors: 0 el01.api.example.com. A: context deadline exceeded
2019-06-03T16:22:23.410+02:00 [ERROR] plugin/errors: 0 sl.example.com. A: context deadline exceeded
So DNS wasn’t working correctly. The database behind our CoreDNS is our etcd cluster…
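Since the CoreDNS etcd plugin reads its records from etcd, a quick look at the keys it uses tells us whether the backend still answers. A small sketch, assuming the etcd v3 API; the key prefix matches the path shown in the etcd log below, the endpoint is just a placeholder for one of our etcd members:

ETCDCTL_API=3 etcdctl --endpoints=http://etcd-1.example.com:2379 get --prefix /dns-internal/com/example/sm/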
The etcd cluster logs showed the following.
2019-06-03 16:22:08.156352 W | etcdserver: read-only range request "key:\"/dns-internal/com/example/sm/\" range_end:\"/dns-internal/com/example/sm0\" " took too long (3.243964451s) to execute
2019-06-03 16:22:17.059472 E | etcdserver/api/v2http: got unexpected response error (etcdserver: request timed out)
Our etcd cluster couldn’t read its data from the disk/storage fast enough. A quick phone call to our storage colleagues informed us that they were doing planned firmware upgrades, and therefore the storage controllers had to perform planned “failovers”.
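In such a situation, the etcd health and status commands show quickly whether the members are struggling; the endpoint list below is illustrative for our three members:

ETCDCTL_API=3 etcdctl --endpoints=http://etcd-1.example.com:2379,http://etcd-2.example.com:2379,http://etcd-3.example.com:2379 endpoint health
ETCDCTL_API=3 etcdctl --endpoints=http://etcd-1.example.com:2379,http://etcd-2.example.com:2379,http://etcd-3.example.com:2379 endpoint status --write-out=table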
In our Ansible scripts we use a DNS name to join the Docker worker hosts to the Docker manager hosts, and this DNS information is stored somewhere in the Docker raft database. We are also not using a DNS caching mechanism in our CoreDNS installation, which can cause really bad outages in a situation like this because the DNS name isn’t resolvable. Our Docker Swarm test cluster was in an inconsistent state at this point, and we had to restart our 3 managers one after another so that they were able to rebuild/validate themselves with the cluster information of the running stacks/services/tasks. With Docker 18.09.3 this works pretty well, and we had our cluster control back again in less than half an hour (including the analysis). No running services were affected except the one with the deploy problem at the beginning.
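One lesson for us is the missing DNS cache. A minimal Corefile sketch of what such a setup could look like, assuming the CoreDNS etcd plugin with our /dns-internal path; the zone, the endpoints and the cache TTL are illustrative:

example.com:53 {
    errors
    log
    etcd {
        path /dns-internal
        endpoint http://etcd-1.example.com:2379 http://etcd-2.example.com:2379 http://etcd-3.example.com:2379
    }
    # keep answers for up to 300 seconds, so short etcd hiccups don't break resolution of cached names
    cache 300
}

A cache can of course only bridge gaps for as long as the TTL allows, but even a short one might have kept names like sm.example.com resolvable during the storage failover.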