Jun 4, 2024

DevOps Nightmares — A Kubernetes Odyssey Gone Awry

Tin Nguyen

Go-To-Market

Imagine, if you will, a series of tragic tales from dark dimensions where nightmares come to life and best practices go to die. The stories you’re about to witness are all true, pieced together from the shattered psyches of those who lived to tell the tales. Accounts have been anonymized to safeguard the unfortunate souls who were caught in the crosshairs of catastrophe.


As you follow this motley crew of unsuspecting engineers navigating the murky waters of automation, integration, and delivery, each alarming anecdote will become another foreboding reminder that it could all happen to you one day.


Prepare for downtime, disasters, and dilemmas. Your pulse will quicken and your hardware will cringe as you travel through the hair-raising vortex of DevOps Nightmares…

A Kubernetes Odyssey Gone Awry

Caught between compliance crackdowns and K8s vendor chaos


“I've been through so many DevOps nightmares over the years, but one really stands out,” says the CTO of a regulated identity management company. “Over the course of three years, we adopted three different Kubernetes distributions and never used any of them.”


Still shaken five years later, the CTO recalls his team’s initial enthusiasm for modern infrastructure. “Automation was crucial. We had Mesos, Mesosphere, and Marathon managing containers, but saw Kubernetes becoming the standard.” At the time, it was the Betamax/VHS problem: when there’s no clear standard, how do you pick the one most likely to win? Or rather, the one least likely to lead to millions of misspent dollars?

Hitting roadblocks — and paying for the privilege


At the time of this nightmare, EKS and AWS weren’t quite ready for primetime and the containers needed to operate in GovCloud. The team didn’t have the time or budget to build something in-house, so the CTO had to be selective about finding the right vendor to work with. That’s when his team found Heptio, a company founded by former Google engineers who had helped create Kubernetes.


After signing the licensing paperwork, the identity provider faced an audit, which put everything on hold for four months. When the audit was finished, the infrastructure team found out that Heptio had been acquired by VMware. 


“We were afraid of getting turned into abandonware,” the CTO says. “So we wrote off the six-figure licensing fees and decided to manage open source Kubernetes until we could decide next steps.” He jokes, “It wasn’t easy, but at least it was really expensive.” 


For seven or eight months, the infra team spent far more time on servers and clusters than on adding value to the business, which was growing by about 100% year over year on infrastructure that couldn’t support the explosion of users. A blessing and a curse. Well, mostly a curse in this situation.

Regulatory compliance, audits, and a regrettable migration


Self-managing their client-facing Kubernetes was becoming too much to handle. The whole “managed Kubernetes” industry was still in flux with a lot of competing vendors, so finding the right path forward was tough. The DevOps team chased the latest and greatest, but almost none of the viable options were also regulatory-compliant.


So, the CTO made the difficult decision to move out of GovCloud to reduce constraints. It would cost a lot of time and money, but hopefully allow them to use more popular commercial options. They chose Red Hat OpenShift believing it could appease the gods of governmental contracts and regulatory compliance. 


During the contract phase with Red Hat, another audit occurred. Hooray! It resulted in good news and crushingly bad news. The good news: the government was totally cool with OpenShift. The crushingly bad news: the move away from GovCloud was a no-go. They had to move back to GovCloud. *womp womp*


By the time the migration back to GovCloud was complete, Red Hat had been acquired by IBM. IBM wasn’t known for its cloud expertise and the team didn’t want to hitch its wagon there. So once again they abandoned the effort, leaving another six-figure licensing fee on the table.


The most fun thing to do with your money: Giving it to huge companies for doing nothing.

Painful but pragmatic decisions and lessons learned


The team resigned itself to self-managing open-source Kubernetes. “We wrote Helm charts, parsed JSON and YAML, and managed infrastructure constantly,” laments the CTO. “I like managing infrastructure, but it didn't move business ambitions forward at all.”


Audits and ecosystem turbulence stalled real automation progress for three years. Three. Years. And while the team’s self-management efforts worked, six full-time infrastructure engineers were required to keep things going. 


The CTO’s takeaways:


  • If you can avoid it at all, don’t get into managing infrastructure

  • But if you have to manage infrastructure, see how much you can abstract away

  • Keep your team focused on delivering value and see how much of the low-level stuff you can automate, outsource, or delegate


“Managing your own infrastructure is a dead end,” says the CTO, asserting that it’s much better to spend time writing code and packaging it. “If I were smarter, I’d have started a Kubernetes management company in the midst of all this. Oh well.”


——————————————————————————————————————————————————————


Have your own tales of automation woes or delivery disasters? We want to hear them! If you've endured a devastating DevOps debacle and are willing to anonymously share the cringe-worthy details, please reach out to us at DevOpsStory@aptible.com.


Don’t hold back or hide the scars of your most frightening system scares. Together, we can immortalize the valuable lessons within your darkest DevOps hours. Your therapy is our treasured content, and we’ll gracefully craft your organizational mishap into a cautionary case study for the ages. And, in return for your candor, we'll ship some sweet swag your way as thanks.

Build Your Product.
Not Product Infra.

548 Market St #75826 San Francisco, CA 94104

© 2024. All rights reserved. Privacy Policy