Why Open Source is great - Fix an issue in under 48 hours

Estimated reading time: 4 mins

This is a follow-up post to Reviving a near-death Docker Swarm cluster, where we showed that a Docker Swarm cluster can be hurt badly if DNS does not work (because of a storage hiccup). Therefore it was obvious that we had to enable caching in our coreDNS servers.

A short recap of the situation: we use coreDNS with etcd as the storage backend for the DNS records. This is a common setup - it is the same one Kubernetes uses. We follow the same concept as Kubernetes, but for slightly different purposes, because etcd has an easy-to-use API. We started using coreDNS well before it became the DNS service in Kubernetes. We also helped to implement the APEX records, and we did some bug triage in the past.

Enabling caching in coreDNS is simple: just add the cache statement to the Corefile, as documented in the plugin.
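
A minimal Corefile sketch with caching enabled (the zone, the etcd endpoints and the cache TTL are illustrative placeholders, not our production configuration):

```
example.com {
    etcd {
        path /dns-internal
        endpoint http://etcd-1:2379 http://etcd-2:2379 http://etcd-3:2379
    }
    cache 30
    log
    errors
}
```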

The problem

So yes, we enabled caching, and a few minutes later our monitoring system showed several systems that were no longer able to complete their Puppet agent runs. This happened on a Tuesday afternoon around 3 pm. When the monitoring alerted us to the problem, we already guessed that it had something to do with the cache we had enabled in our coreDNS instances shortly before. A rollback would have been possible without any problems, because we run coreDNS inside containers: their image is built via GitLab CI/CD and the docker run is issued by Puppet on the given hosts. So a rollback is pretty easy! But we didn't roll back, because only some of our hosts had a DNS resolve error - the rest (hundreds) were running fine!

Analyzing the problem

We suspended the rollback to a previous Corefile (the coreDNS config) and took a closer look at the affected hosts. We soon knew that only older Linux OSes were hit by this problem. Bernhard started to search the internet for this specific problem, because we also got the following log output from an rsync run (and a similar one from the Puppet agent):

rsync: getaddrinfo: rsync.example.com 873: No address associated with hostname
rsync error: error in socket IO (code 10) at clientserver.c(122) [sender=3.0.9]
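
For illustration, the same class of getaddrinfo failure can be reproduced in a few lines of Python (the hostname below uses the reserved .invalid TLD so the lookup can never succeed; this is just a sketch of the error path, not the rsync client code):

```python
import socket

# Sketch: trigger the same getaddrinfo error path that rsync reported.
# ".invalid" is a reserved TLD (RFC 2606), so this name never resolves.
try:
    socket.getaddrinfo("rsync.invalid", 873, proto=socket.IPPROTO_TCP)
except socket.gaierror as exc:
    # rsync wraps this into "No address associated with hostname" or similar
    print("lookup failed:", exc)
```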

We found two GitHub issues, this one in coreDNS and this one in Kubernetes, and in addition a Stack Overflow post.

OK, there was something strange going on with some old clients. We decided to share our information in the issues above. You can read the issues and the pull requests if you would like the full details. In short, after we had chatted on GitHub, I mailed privately with Miek Gieben, one of the coreDNS maintainers, to share some tcpdumps with him. DNS is really something you don't want to mess with at that depth. It's ugly, and I feel great respect for those who work in this field, like Miek does! Kudos to all of you!

The result

After chatting via e-mail, we quickly came to the point that the switching of the authoritative/non-authoritative flag - one(!) single bit in the DNS header of a query response - confuses older clients, because at first they get an authoritative answer and on the next query (within the TTL of the record) they get a non-authoritative answer. Some older DNS client code chokes at exactly this point.
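
To make the "one single bit" concrete: the DNS header is 12 bytes, and the AA (authoritative answer) flag is bit 0x0400 of the flags word. A small Python sketch (the flag values are typical examples, not taken from our tcpdumps):

```python
import struct

AA = 0x0400  # Authoritative Answer bit in the DNS header flags word

def set_aa(flags: int, authoritative: bool) -> int:
    """Set or clear the AA bit of a DNS flags word."""
    return flags | AA if authoritative else flags & ~AA

flags = 0x8180                  # QR=1 (response), RD=1, RA=1
auth = set_aa(flags, True)      # 0x8580: authoritative answer
nonauth = set_aa(flags, False)  # 0x8180: non-authoritative answer

# The two responses differ in exactly this one bit ...
assert auth ^ nonauth == AA

# ... inside a 12-byte header (id, flags, qd/an/ns/ar counts).
header = struct.pack("!6H", 0x1234, auth, 1, 1, 0, 0)
print(hex(auth), hex(nonauth), len(header))  # → 0x8580 0x8180 12
```

A cache that answers on behalf of the authoritative server clears this bit, which is exactly the flip-flop the old clients stumbled over.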

Miek provided a PR for this, I opened an issue, and on Thursday morning I did a manual build of coreDNS including this PR - and everything worked fine. As a mitigation in the meantime, Bernhard rolled out a hosts entry for our Puppet master domain on all affected hosts. Thanks! But some hosts with quite old software were still affected, so the PR fixes the problem much more thoroughly.

Thank you!

We would like to say "Thank you!" to all who are working in Open Source, and in this case especially to Chris O'Haver, Miek Gieben and Stefan Budeanu! This shows why Open Source is unbeatable when it comes to problems. Of course you have to know how to work together, like we did in this case, but you have the opportunity to do so! Don't be afraid - try it! Getting a fix for a problem within 48 hours is absolutely impressive and stunning! I am sure that this would not be possible with closed source.

Posted on: Fri, 14 Jun 2019 04:39:06 +0200 by Mario Kleinsasser

  • Culture
Mario Kleinsasser
Doing Linux since 2000 and containers since 2009. Like to hack new and interesting stuff. Containers, Python, DevOps, automation and so on. Interested in science and I like to read (if I found the time). Einstein said "Imagination is more important than knowledge. For knowledge is limited." - I say "The distance between faith and knowledge is infinite. (c) by me". Interesting contacts are always welcome - nice to meet you out there - if you like, do not hesitate and contact me!

About X vs Y in tech

Estimated reading time: 4 mins

Actually, I was going to write a blog post called "Why Docker might be enough", but I soon caught myself: that post would have been one of those where someone (me) tries to convince somebody to use one thing (in this case Docker) instead of another (Kubernetes, maybe). Bad, because then the article would have been nothing more than one of those "X vs Y in tech" blog posts the internet is already flooded with.

Why is it that hard to accept one another's position?

One reason might be that we have been trained to compare everything, all the time, since we were young. For example, at school we got our marks, and we all know that a one is better than a five (in school systems with numeric marks). Pretty clear, isn't it? I think it's not, because that is only one of probably many ways to look at it. At school, one mark does not make a report card. There are many different subjects, and someone who is bad at math might be good at PE.

Another example is sports. There is always a "versus", in every game. And most of the time there is one, and only one, winner. The second or third place hardly matters. This is something that is trained over years, since our early days. Therefore, we are conditioned so that in most of our discussions only the first place matters. "The winner takes it all, the loser standing small." [ABBA]

And this same point of view tends to be taken when we talk about X vs Y in tech. Often, only technical parameters are compared, for example: Kubernetes clusters can scale out to more than 5000 nodes, while Docker Swarm handles about 1000 safe and sound. So, ding ding, point for Kubernetes! Pretty clear, isn't it? I think it's not, because a lot of other parameters are not taken into account. One important one is experience; the other is the environment.

About experiences and environments

As we proceed through our tech lives, we gather different experiences, good ones and bad ones. And of course, as we proceed, we prefer the solutions that worked well. That's why we tend to convince others of our solution. That's human, but sometimes it would be helpful if we first tried to understand the environment of the person we are discussing with. An example: at KubeCon EU 2019, my colleagues and I talked with a lot of people. In nearly every chat, sooner or later we were asked where and how we use Kubernetes. Our answer was: we are currently not using Kubernetes (neither in the cloud nor as an on-premises enterprise solution); we use native Docker Swarm. The reactions ranged from wondering to horrified - until we, the devs and ops, explained our environment, our experience, our knowledge, and also the why behind our current decision (we simply don't need it at the moment) - which is not set in stone!

Co-operations would be better

In our spare time, we (Bernhard, I and some other colleagues) also play computer games for fun. And you know what? We always play co-operative games, where three or four people have to co-operate to achieve a given goal. It is much more satisfying than playing a player-versus-player game. If you are interested in the background of co-operative behavior, take a look at the Prisoner's dilemma.

Summary

Instead of focusing on the "versus" when comparing two (or more) things, I think it would be much more helpful to focus on the co-operation or the synergy between the two objects. Like in the picture of this post, it might be absolutely OK that two persons are both right about the same thing at the same time, with different points of view on the object of interest. And if they co-operate, they can find a way where both can still be right. In this case, the common result will exceed the individual achievement.

Maybe, next time I will write an article about “The synergy of container technologies”… πŸ˜‚

Posted on: Sat, 08 Jun 2019 04:39:06 +0200 by Mario Kleinsasser

  • Culture

Reviving a near-death Docker Swarm cluster

Estimated reading time: 3 mins

or why a storage update can hurt your cluster badly.

Today, shortly before our working day ended, one of our Docker Swarm clusters, the test environment cluster, nearly died. This wasn't a Docker Swarm fault, but a coincidence of several different causes. We would like to share how we handle such situations, so here comes the timeline!

16:30

At this time a developer contacted us because he had a problem deploying his application in our Docker Swarm test cluster. The application container hadn't started correctly and he didn't know why.

16:35

So we had a look at the service the developer mentioned with the docker service ps command to get some additional information.

ID                  NAME                IMAGE NODE                        DESIRED STATE       CURRENT STATE             ERROR                         PORTS
0cnafdmxdrvf        wmc_wmc.1           ...   xyz123                      Running             Assigned 8 minutes ago                                  
qd2muj3pdy1d         \_ wmc_wmc.1       ...   abc123                      Remove              Running ... ago                                 
me85ue4xii3f         \_ wmc_wmc.1       ...   3pp5zmsfe3jz2n5o54azylgtf   Shutdown            ... ago                   "task: non-zero exit (143)"   
ssbthaef0093         \_ wmc_wmc.1       ...   smqghxgmbkyxi5dn9odd9r39v   Shutdown            ... ago                                 

The service hung on remove! That's never a good sign, as the removal of a container should finish pretty fast. If something like this happens, strange things are going on.

16:40

A look into journalctl --since 2h -u docker.service showed that around 16:22 the Docker Swarm raft was broken. Now the question was: why? In the logs we saw that at this point in time a Docker Swarm deployment was running. Which is OK, since we are using GitLab as our GitOps/CI tool.

16:45

On the host where the remove hadn't finished, we found an additional piece of information in the journalctl --since 2h -u docker.service log.

Jun 03 16:22:14 abc123 dockerd[1400]: time="2019-06-03T16:22:14.014844062+02:00" level=error msg="heartbeat to manager { sm.example.com:2377} failed" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" method="(*session).heartbeat" module=node/agent node.id=vti4nk735ubv9gk9x126dk9tj session.id=cgpntgnefhtqqj5f5yrci4023 sessionID=cgpntgnefhtqqj5f5yrci4023

This log message says that the Docker host can't connect to the Docker Swarm cluster manager, which is not good at all, but it pointed us to the next system - DNS.

16:50

A look at the logs of our coreDNS servers showed many of the following messages.

2019-06-03T16:22:18.168+02:00 [ERROR] plugin/errors: 0 ns.dns.api.example.com. NS: context deadline exceeded
2019-06-03T16:22:23.347+02:00 [ERROR] plugin/errors: 0 el01.api.example.com. A: context deadline exceeded
2019-06-03T16:22:23.410+02:00 [ERROR] plugin/errors: 0 sl.example.com. A: context deadline exceeded 

So DNS wasn’t working correctly. Our database for coreDNS is our etcd cluster…

16:55

The etcd cluster logs showed the following.

2019-06-03 16:22:08.156352 W | etcdserver: read-only range request "key:\"/dns-internal/com/example/sm/\" range_end:\"/dns-internal/com/example/sm0\" " took too long (3.243964451s) to execute
2019-06-03 16:22:17.059472 E | etcdserver/api/v2http: got unexpected response error (etcdserver: request timed out)

Our etcd cluster couldn't read its data from the disk/storage in time. A quick phone call to our storage colleagues told us that they were doing planned firmware upgrades, and therefore the storage controllers had to perform planned "failovers".
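
As a side note, the key in the etcd log above shows how the coreDNS etcd plugin lays out its data: the domain name is reversed beneath the configured path. A quick Python sketch of that mapping (the function name is ours, for illustration only):

```python
def etcd_key(domain: str, path: str = "/dns-internal") -> str:
    """Map a DNS name to the key prefix the coreDNS etcd plugin reads."""
    labels = domain.rstrip(".").split(".")
    return path + "/" + "/".join(reversed(labels)) + "/"

# sm.example.com is stored under /dns-internal/com/example/sm/,
# matching the read-only range request in the etcd log above.
print(etcd_key("sm.example.com"))
```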

Conclusion

We use a DNS name in our Ansible scripts to join the Docker worker hosts to the Docker manager hosts, and that DNS information is stored somewhere in the Docker raft database. We also were not using a DNS caching mechanism in our coreDNS installation, which can cause really bad outages in a situation like this, because the DNS name isn't resolvable. Our Docker Swarm test cluster was in an inconsistent state at this point, and we had to restart our three managers one after another so that they were able to rebuild/validate themselves with the cluster information of the running stacks/services/tasks. With Docker 18.09.3 this works pretty well, and we had our cluster control back in less than half an hour (including the analysis). No running services were affected except the one with the deploy problem at the beginning.

Posted on: Mon, 03 Jun 2019 15:39:06 +0200 by Mario Kleinsasser , Bernhard Rausch

  • Docker
Bernhard Rausch
CloudSolutionsArchitect/SysOps; loves to get things ordered the right way: "A tidy house, a tidy mind."; configuration management fetishist; loving backups; impressed by docker; Always up to get in contact with interesting people - do not hesitate to write a comment or to contact me!

KubeCon + CloudNativeCon Europe 2019 - Quo Vadis?

Estimated reading time: 8 mins

The KubeCon + CloudNativeCon Europe 2019 in Barcelona (Spain) was great! But there were also some things I would like to mention that weren't that good. The KubeCon EU is undisputedly the largest conference covering all the topics, starting with Kubernetes itself and ending with the more than 600 projects of the CNCF landscape.

As a two-time visitor I am still fairly new to this conference, but there was a difference between the KubeCon EU 2018 in Copenhagen and this year's KubeCon EU 2019 in Barcelona - and for sure, it's definitely not about the cities. πŸ˜‚

As always, this is a fully personal review and it's only my opinion, nothing more. I know that every comparison between conferences limps, but because I had the opportunity to visit some other conferences in the last two years - the DockerCons, DevConf.cz in Brno, the DevOps Gathering in Bochum and some more - I think it is OK to point out some parallels and differences.

Therefore, this KubeCon EU 2019 recap might be a little bit different from what you expect, but nevertheless hopefully still interesting. Where there is light, there is also (some) shadow. Now let’s have a look into the details.

The venue

With about 7700 attendees, the KubeCon EU 2019 was the largest conference I have ever attended, and it was exciting! The venue was huge and everyone had enough space to float around. But the size of the venue had some disadvantages. I am pretty fit and good on foot, but the long walks between the halls are not suitable for everyone. There was enough time between the breakouts to change halls and to move between hall 8.0 and 8.1, but walking over to the coffee bar, which was located between halls 6 and 7, took too much time. One cause was that there were only four escalators from hall 8.0 (basement) to the first floor, which you needed to get over to halls 6 and 7. And a lot of people like coffee! πŸ˜‚ Some coffee points like in Copenhagen last year would have been nice.

The venue offered enough space for the sponsor showcase, which was pretty cool, and it was completely air-conditioned! But hall 8.1 had a serious sound problem. Unfortunately I didn't take a picture of it, but I will try to describe it. In hall 8.1 there were six cubes (I think) built to divide the large hall into smaller "rooms". The idea was great, but these cubes were only covered with fabric! As a consequence, the moment the tracks started, you also heard the speakers from the other cubes - this was a little bit confusing. 😜

The breakout sessions and pancake breakfast

This year I have mixed feelings about the talks I listened to. Some of them were very impressive; some were of mixed quality. In addition, some tracks started out really interesting but became a little bit commercial towards the end. I am perfectly OK with that, but some people argue that the DockerCon is so "enterprise" and "commercial" - I am not sure about this anymore. I mean, hey, DockerCon is named after the Docker Inc. corporation, and it's their conference - you know what you will get there.

I visited the following sessions:

The pancake breakfast on the second day, sponsored by VMware, was very disillusioning. The discussion panel's topic was service mesh. After the introduction, the host asked the audience how many people are currently using a service mesh in production. Of course people came for the pancakes, but it was an eye-opener that only two out of approximately 200 people in the room raised their hands. Wow. The host's second question was how many of the audience are currently testing service meshes - now roughly 10 people raised their hands. That was the second surprise. Before the panel started, I had thought that my colleagues and I would be in the minority, because we are neither using nor currently testing service meshes. But it turned out that we are not alone. During the discussion, the host asked the service mesh users what, in their opinion, the greatest show stoppers are. All agreed that the additional abstraction layer makes it hard to debug problems, if there are any. That's currently my opinion too.

The sponsor showcase

The sponsor showcase during the KubeCon EU 2019 was huge! It was amazing to have the opportunity to talk to all these KubeCon and CNCF sponsors in one place. I talked to a lot of friends and, of course, to new people. Unfortunately there was no sticker table this time. The K8s Boothday Party was awesome! As you can see in the picture to the left, the sponsor booths were often really crowded. One exception was the Oracle booth - there were never a lot of people there.

The all attendee party

The Poble Espanyol was 😍😍😍! I’ve never seen something like that before. It was very interesting to walk around and to see the different regions of Spain at one place. The food was awesome and the mood was great!

This and that

First of all, I personally missed a sticker table! I am sure that at such a large conference it would not be easy to handle a sticker table, but it is not impossible. I understand that for the sponsors running a booth at the sponsor showcase it is better not to have a sticker table, because it's better for them if people visit the booths. That's fully OK, but it would be nice to have a sticker table where small projects have the opportunity to place their stickers. For example, I took my BosnD stickers with me to place them on the sticker table, to do some promotion for our small project. Therefore, a community sticker table would be a nice-to-have next time.

I like to meet community members and interested people, which may be the reason why I am a huge fan of the DockerCon Hallway Track and the DockerCon Pals program. For me it's not that hard to find friends around the venue, because I am already a member of the community. But for first-time visitors it can be challenging. Programs like a Hallway Track would be really helpful.

Together with Dominique Top and the other Docker Community Leaders, we were able to arrange a "Docker Community Leader Mini Summit" πŸ€—, which was really fun, and of course I got connected with new members from all over the world! That's really something I definitely enjoy - and it was one of the biggest moments of my KubeCon EU 2019 trip (see the picture to the left)!

There was a track about serverless technologies, but in the sponsor showcase serverless wasn't really a topic - which was surprising.

Personal hopes and resume

Quo vadis, KubeCon - where are you going? Personally, I think it would be better if the KubeCon + CloudNativeCon Europe were split up, because in my opinion the conference is too large, especially for people who are there for the first time. This time, three of my colleagues were with me at the KubeCon - all three were new to it, and two of them had never visited such a large tech conference before. I am really happy that I had already been to some conferences, because that way I was able to guide my colleagues. Am I a KubeCon Pal now? Maybe it would be helpful if the conference were a little bit smaller, like the KubeCon EU 2018 in Copenhagen with approximately 5500 attendees. Otherwise, an option would be to split the conference into two parts - one for the Kubernetes platform and one for all the other CNCF projects. One of the keynote speakers at the KubeCon EU 2019 said it clearly: "Kubernetes is a platform to build platforms" - this might be a reason to split up the conference.

One more thing is important: the community. In my opinion, the KubeCon has to focus on community. There should definitely be something like a Hallway Track and a Community Track, where people can meet others working on the same projects and projects can promote themselves. Otherwise, the whole conference might be taken over completely by business partners - and that would be really bad for such great communities.

For me the visit was more than worth it, because I met some new friends among the Docker Community Leaders, the coreDNS maintainers and some other people.

That's it. Have a lot of fun!

Posted on: Sat, 25 May 2019 15:39:06 +0200 by Mario Kleinsasser


IBM Power Systems and IBM i @ LinuxDay 2019

Estimated reading time: 2 mins

IBM Power Systems and IBM i @ LinuxDay / Carinthia / Austria

As I'm a graduate of the HTL-Villach, and the LinuxDay was co-organized by Mario Kleinsasser and Bernhard Rausch - both coworkers at STRABAG SE - I submitted a CFP: Node.js on Midrange Server?

The technical setup was the easy part. I had already set up Java code deployment to IBM i systems with GitLab CI/CD pipelines. Thanks to the efforts of Jesse Gorzinski, Kevin Adler and the rest of the IBMiOSS team at IBM, the installation of Node.js was done in a minute. With YUM and RPM on IBM i PASE this is no miracle at all.

I asked Roman, also a business colleague, to write a simple Node.js web page and an interacting 5250 screen. Thanks for presenting this part at our talk!

The idea for the outline was to show that open source software in a business context has to work together with legacy systems, and that this can be done with a popular open source framework. The second goal was to reach the students and show them that there is a big IT company within STRABAG SE as well.

So all was fine until I realized that this would be my first talk about IBM i in front of an audience that knows little to nothing about IBM Power Systems and/or the IBM i operating system.

It took me four after-work sessions to end up with the final outline:

Followed by Roman’s part:

I really loved doing the session and talking about the long and lasting story of IBM i, Power Systems and the active open source initiative. It was nice to chat with some of the visitors afterwards, and I'm looking forward to LinuxDay 2020 (Carinthia / Austria) - be there!

P.S.: Thanks to IT Power Services and IBM Γ–sterreich for organizing IBM-related giveaways for the event.

Posted on: Sun, 19 May 2019 20:10:59 +0200 by Markus Neuhold

  • IBM i
Markus Neuhold
IBM i (AS/400) SysAdmin since 1997, Linux fanboy and loving open source, docker and all about tech and science.