VSHN.timer

VSHN.timer #75: Dealing with Catastrophe

11. Jan 2021

Welcome to another VSHN.timer! Every Monday, 5 links related to Kubernetes, OpenShift, CI / CD, and DevOps; all stuff coming out of our own chat system, making us think, laugh, or simply work better.

This week we’re going to talk about how teams deal with failures, sometimes of seemingly catastrophic proportions.

1. We will start with this gem by Lisa Seelye from Red Hat. It is a mind-bending article that answers a basic question: „Why do we externalize our successes but internalize our failures?“ Failure in systems is a matter of „when“ rather than „if“, and healthy teams, including their management, need to embrace it. They should also stop calling those situations „mistakes“ and treat them as „learning opportunities“ instead.

https://opensource.com/article/20/11/normalize-failure

Terrific thread. Shipping software should not be scary, it should be downright pedestrian. Boring. Unremarkable and undeserving of comment.

Praise is super powerful weapon, and if you praise effort rather than impact may you teach people to repeat the exact wrong behaviors. https://t.co/fUti38WlvN
— Charity Majors (@mipsytipsy) December 20, 2020

2. Failure can happen to anyone, even to the biggest IaaS provider on the planet. AWS suffered a major outage of its Kinesis service in the US-EAST-1 region on November 25th, 2020. Since Kinesis is used by many other major pieces of AWS infrastructure, the failure rippled through them like falling dominoes: first Cognito, which handles user authentication; then CloudWatch, which handles systems monitoring; and finally Lambda and EventBridge, both of which depend on CloudWatch. The post-mortem of this outage reads like a detective novel, the fascinating story of a hard day at the core of the cloud.

https://aws.amazon.com/message/11201/
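
The domino effect described above can be sketched as a breadth-first walk over a dependency graph. This is only a toy model: the `DEPENDS_ON` table encodes the simplified chain from the summary above, not the real AWS architecture, and the `impacted_by` helper is a hypothetical name invented for this illustration.

```python
from collections import deque

# Assumption: a simplified dependency table taken from the summary above,
# mapping each service to the services it depends on.
DEPENDS_ON = {
    "Cognito": ["Kinesis"],
    "CloudWatch": ["Kinesis"],
    "Lambda": ["CloudWatch"],
    "EventBridge": ["CloudWatch"],
}


def impacted_by(failed, depends_on):
    """Return the services transitively affected by `failed`, breadth-first."""
    # Invert the table: for each service, who depends on it?
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    seen = {failed}
    order = []
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in seen:
                seen.add(service)
                order.append(service)
                queue.append(service)
    return order


if __name__ == "__main__":
    print(impacted_by("Kinesis", DEPENDS_ON))
```

Running the same walk against your own service inventory, before an incident, gives a rough idea of the blast radius of each component.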

3. How do you deal with failures in OpenShift clusters? The Performance and Scalability team at Red Hat has published a short summary of the three biggest outages they faced in production environments: a rogue DaemonSet taking down a 2000-node cluster, an etcd database that refused to write things down, and the sad results of running etcd on slow storage. Extreme examples for sure, but interesting lessons nonetheless, even though one would rather read about them than experience them firsthand.

https://www.openshift.com/blog/openshift-failure-stories-at-scale-cluster-on-fire

4. DevOps engineering brings its own load of issues to consider. Take, for example, DNS records: their propagation and validity, and the availability of the systems they reference. Blake Stoddard from HEY tells the story of half a day spent at work „banging his head against the desk“ because he failed to RTFM. In this case the manual was RFC 1034, so please go and re-read it now, before you hit that „deploy“ button once again.

https://m.signalvnoise.com/how-to-waste-half-a-day-by-not-reading-rfc-1034/

5. Raphael Michel from Pretix explains how they solved a data loss failure caused by a video file overwritten by mistake… by grepping the contents of a disk, looking for the header of the FLV video format. Spoiler alert: it took 7 hours, and it’s absolutely epic.

https://behind.pretix.eu/2020/11/28/undelete-flv-file/
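
To give an idea of what „grepping a disk“ for a file header means, here is a minimal sketch that scans a binary stream for the FLV signature (the ASCII bytes „FLV“ followed by the version byte 0x01). It is not the procedure used in the article; the function name, the chunk size, and the device path in the comment are assumptions made for this illustration.

```python
import io

# An FLV file starts with the ASCII bytes "FLV" followed by a version byte (1).
FLV_SIGNATURE = b"FLV\x01"


def find_signature_offsets(stream, signature=FLV_SIGNATURE, chunk_size=1 << 20):
    """Yield the absolute byte offsets at which `signature` occurs in `stream`."""
    overlap = len(signature) - 1
    tail = b""  # last bytes of the previous window, to catch split matches
    base = 0    # absolute offset of the first byte of the current window
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        window = tail + chunk
        start = 0
        while True:
            index = window.find(signature, start)
            if index == -1:
                break
            yield base + index
            start = index + 1
        # Keep just enough bytes to detect a signature spanning two chunks.
        tail = window[-overlap:] if overlap else b""
        base += len(window) - len(tail)


if __name__ == "__main__":
    # Hypothetical usage on a raw device or disk image, opened read-only:
    #   with open("/dev/sdX", "rb") as disk:
    #       for offset in find_signature_offsets(disk):
    #           print(offset)
    sample = io.BytesIO(b"\x00" * 10 + FLV_SIGNATURE + b"\x00" * 5 + FLV_SIGNATURE)
    print(list(find_signature_offsets(sample, chunk_size=4)))
```

Finding the offsets is the easy part. Working out where each file ends, and whether its blocks are contiguous on disk, is where the hours go.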

How does your team manage failure? Do you keep a log or do you write post mortems after major outages? Do you have any failure handling tips you would like to share with the community? Get in touch with us through the form at the bottom of this page, and see you next week for another edition of VSHN.timer.

PS: would you like to receive VSHN.timer every Monday in your inbox? Sign up for our weekly VSHN.timer newsletter.

PS²: would you like to watch VSHN.timer on YouTube? Subscribe to our channel vshn.tv and give a „thumbs up“ to our videos.

PS³: check out our previous VSHN.timer editions about incidents and failures: #32, #41, #49, and #66.

Adrian Kosmaczewski

Adrian Kosmaczewski is responsible for Developer Relations at VSHN. He has been a software developer since 1996, and is also a trainer and published author. Adrian holds a Master's degree in Information Technology from the University of Liverpool.
