Kubernetes has its virtues and is worth investing in, but it is undoubtedly complex and comes with many operational challenges. We faced many of them on our journey toward "cloud native" at Zalando.
We constantly learned from other organizations that shared their failures and insights, so I started to compile a list of public failure horror stories related to Kubernetes. The goal was to make it easier for people tasked with operations to find outage reports to learn from.
Many of these failures had a few things in common. Here are the factors, in four major buckets, that contributed to failure.
Missing operational maturity
Infrastructure operations is a challenge for most organizations, and the transformation toward end-to-end responsibility (DevOps, "you build it, you run it") is often in full swing. Smaller organizations usually use a tool to bootstrap a cluster (e.g., kops), but do not dedicate time to setting up full continuous delivery for the infrastructure. This leads to painful manual Kubernetes upgrades, untested infrastructure changes, and brittle clusters.
The same situation applies to managed infrastructure, since cloud offerings never come with all batteries included. Infrastructure changes should get at least the same attention and rigor as your customer-facing app deployments.
Upstream Kubernetes/Docker issues
Some of the failures can be attributed to upstream issues, e.g., Docker daemon hanging, issues with a kubelet not reconnecting to the control plane, kernel CPU throttling bugs, unsafe CronJob defaults, and kubelet memory leaks.
If you hit an upstream issue, congratulations! You can follow or file an upstream issue and hope for a fix, or contribute one yourself and help many others. I would expect this class of failure causes to shrink over time as CNCF projects mature and the user base grows, making it less probable that you'll be the first to hit an upstream issue.
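One of the upstream issues named above, unsafe CronJob defaults, can be caught before a manifest reaches the cluster. The following is a minimal sketch, not a tool used at Zalando: it assumes the manifest has already been parsed into a Python dict, and the choice of which fields to flag (`concurrencyPolicy`, `startingDeadlineSeconds`, `activeDeadlineSeconds`) is my assumption about what "unsafe defaults" covers. The function name `lint_cronjob` and the example manifest are hypothetical.

```python
from typing import List


def lint_cronjob(manifest: dict) -> List[str]:
    """Return warnings for CronJob fields left at their defaults."""
    spec = manifest.get("spec", {})
    warnings = []

    # concurrencyPolicy defaults to "Allow", so slow runs can overlap and pile up.
    if spec.get("concurrencyPolicy", "Allow") == "Allow":
        warnings.append("concurrencyPolicy is Allow: overlapping runs are possible")

    # startingDeadlineSeconds is optional; unset means no deadline for late starts.
    if "startingDeadlineSeconds" not in spec:
        warnings.append("startingDeadlineSeconds is not set")

    # activeDeadlineSeconds on the Job is optional; unset means run time is unbounded.
    job_spec = spec.get("jobTemplate", {}).get("spec", {})
    if "activeDeadlineSeconds" not in job_spec:
        warnings.append("jobTemplate.spec.activeDeadlineSeconds is not set")

    return warnings


# Hypothetical manifest that relies on every default checked above.
example = {
    "apiVersion": "batch/v1",
    "kind": "CronJob",
    "metadata": {"name": "nightly-report"},
    "spec": {
        "schedule": "0 2 * * *",
        "jobTemplate": {"spec": {"template": {"spec": {"containers": []}}}},
    },
}

for warning in lint_cronjob(example):
    print(warning)
```

A check like this could run in the same pipeline that deploys the manifests, which ties back to treating infrastructure changes with the same rigor as application deployments.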
Cloud and other integrations
Kubernetes comes in more than one flavor—there are many possible combinations of Kubernetes components and configurations. Kubernetes needs to interact with your cloud platform, such as Google Cloud or AWS, and your existing IT landscape. And all of these integrations can lead to failure scenarios.
We saw Kubernetes' AWS cloud provider code easily hit AWS API rate limits and have problems with EBS persistent volume attachments. Using AWS Elastic Load Balancing with dynamic IPs caused problems with the kubelet losing connections. The AWS IAM integration (kube2iam) is notoriously prone to race conditions.
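A common way for a client to cope with API rate limits like these is to retry throttled calls with exponential backoff and jitter. The sketch below illustrates that general technique only; it is not the Kubernetes cloud provider code, and `ThrottledError`, `call_with_backoff`, and the default delays are hypothetical names and values chosen for illustration.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a cloud API 'rate limit exceeded' error."""


def call_with_backoff(func, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Call func(), retrying on ThrottledError with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return func()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Cap the exponential delay, then wait a random time below the cap
            # so that many clients do not retry in lockstep.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))


# Hypothetical API call that is throttled twice before it succeeds.
attempts = {"count": 0}


def flaky_describe():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise ThrottledError("rate limit exceeded")
    return "ok"


result = call_with_backoff(flaky_describe, base_delay=0.1)
print(result, "after", attempts["count"], "attempts")
```

The random wait matters as much as the exponential growth: when many controllers or nodes are throttled at once, identical retry intervals would send them all back to the API at the same moment.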
Human error
Let’s be clear: There is no such thing as "human error" as a root cause. If your root-cause analysis (RCA) concludes with "human error," start over and ask some hard questions.
Share what you learn
Nowadays everybody is talking about failure culture, but what organization is truly ready to share its failures and lessons learned publicly? Kubernetes gives us a common ground where we can all broadly benefit from sharing our experiences with one another.
Many contributing factors are not new: immature processes for infrastructure changes, Docker, distributed systems, and so on. But Kubernetes gives us a common language to talk through and address them. As shared experiences reduce the unknown unknowns of operating or using Kubernetes, it will get easier for everyone over time.
Do you have experiences to share? Post them below. And for more on Kubernetes failures, come to my talk, “Kubernetes Failure Stories and How to Crash Your Clusters,” at KubeCon + CloudNativeCon Europe 2019 in Barcelona, Spain, on May 20-23.