VSHN.timer #66: The Elusive 9

Welcome to another VSHN.timer! Every Monday, 5 links related to Kubernetes, OpenShift, CI / CD, and DevOps; all stuff coming out of our own chat system, making us think, laugh, or simply work better.

This week we’re going to talk about noteworthy outages and availability issues that brought SLAs to their knees.

1. Let’s play some Jeopardy!, shall we? „Cloud services for 1000!“ „It happened to GitHub, Zoom, Slack, IBM Cloud and T-Mobile during the pandemic.“ You hit the buzzer and scream the right answer: „What is an outage?“ Unfortunately, all of these services had at least one major outage this year. It reminds us how fragile online services can be under stress, and that not even the biggest names in the industry are immune to failure.

https://statusgator.com/blog/2020/08/21/5-biggest-outages-of-q2-2020/

2. Sometimes Murphy’s Law hits so hard, it’s almost unfathomable. Take, for example, Basecamp, the popular project management service. They had three outages in the same week. What are the odds? The important takeaway from the post-mortem is not so much the measures and countermeasures they took, but the timely communication and the openness to say „I’m sorry“ to their customers.

https://m.signalvnoise.com/three-basecamp-outages-one-week-what-happened/

3. Speaking of post-mortems, some teams take the time to write really comprehensive ones, documenting every single detail of what happened. They make for a fascinating read and provide fantastic information for teams preparing for (or suffering through) such events.

https://signal.eu.org/blog/2020/09/09/post-mortem-of-a-dnssec-incident-at-eu-org/

4. Application developers are eternal optimists by nature. The default values in many programming languages and frameworks literally specify „infinite“ as a timeout. Developers must instead learn how to deal with flaky networks that can fail at any time. Roberto Vitillo from Microsoft urges software developers, on both the frontend and the backend, to override the default timeouts, and gives some useful examples of how to do that.

https://robertovitillo.com/default-timeouts/
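
To make the point concrete, here is a minimal sketch of our own (not taken from the linked article) using only Python's standard library. `urllib.request.urlopen` falls back to the global socket default when no timeout is given, and that default is normally no timeout at all, so a silent server can block the caller indefinitely. The helper name `fetch` and the 5-second value are hypothetical choices for illustration; pick a limit that fits your own service.

```python
import urllib.request

# Assumption for illustration: 5 seconds is an arbitrary limit, not a recommendation.
DEFAULT_TIMEOUT_SECONDS = 5.0


def fetch(url, timeout=DEFAULT_TIMEOUT_SECONDS):
    """Return the response body as bytes, or None if the request fails or times out."""
    try:
        # Without the timeout argument, urlopen uses the global socket default,
        # which is normally "wait forever".
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError:
        # URLError (including HTTPError) and socket timeouts all derive from OSError.
        return None
```

A caller can then decide what a missing answer means (retry, fall back, or report an error) instead of hanging on a connection that will never reply.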

5. The tool of the week is CrowdSec, „An open-source, lightweight agent to detect and respond to bad behaviours.“ The idea can be summarized as a next-generation firewall aiming to achieve „digital herd immunity“ for cloud services. Their website has an interesting list of objectives for the coming years, including the addition of machine learning to thwart attacks before they even happen. To say that this is intriguing would be an understatement.

https://github.com/crowdsecurity/crowdsec

Is your infrastructure ready to handle outages? Do you have a status page for your customers? Would you like to share any war stories with our readers? Get in touch with us through the form at the bottom of this page, and see you next week for another edition of VSHN.timer.

PS: Would you like to receive VSHN.timer every Monday in your inbox? Subscribe to our VSHN.timer newsletter!