Lessons from the field – avoiding another “CrowdStrike”

“If builders built buildings the way programmers wrote programs, then the first woodpecker that came along would destroy civilization.”

– Gerald Weinberg, software engineering guru and psychologist

Weinberg was a legend. When he said this, he did not mean that programmers are incompetent and software is junk. What he meant was that it is possible to build better software. I came across this quote many years ago, and it has stuck with me.

Even those who had never heard of CrowdStrike before now know the name. Talk about being infamous! It is easy to blame CrowdStrike for the biggest global IT outage, on the 19th of July 2024. In my view, the responsibility sits (almost) entirely with the businesses that suffered. Why? They failed to prepare. CrowdStrike’s success rate in deploying these updates over a year would be four or five nines. These businesses were expecting it to be 100%. They never planned for failures that could impact their businesses. The same businesses could be struck again, by something else. Their BCP exercise is likely non-existent, a tick in the box, or one that does not cover real-life scenarios. Some are multi-billion-dollar businesses; some are not even worth a few million dollars. At least they got a huge wake-up call. Those who weren’t using CrowdStrike may be thinking they dodged a bullet. ‘Til something else strikes.

A lot has been written recently in the technology and business media on this subject.

This is my plain-speak view, lessons from the field, not a consultant’s take.

The problem: Technology is complex. Most people have no idea how computers really work. It would take more than a lifetime for one person to understand it all, right from the silicon wafer to the operating system to the data being transmitted as 1s and 0s on the wire (or wirelessly). It has taken many brains to put this all together. One does not need to know the intricate details of technology, but understanding what can go wrong with it is a gift.

In over twenty-five years in technology, mostly spent designing, building and operating technology infrastructure, I have not always been successful in preventing disasters. But I can tell you a thing or two, or maybe more, that may have prevented many.

A few weeks ago, I was chatting to a colleague about fighting fires (aka Major Incidents in IT speak). I have fought a few such fires, and I have often wondered how to stop them from occurring altogether. I get my high from preventing these fires in the first place. It may be boring, and it does not come with the accolades of bringing a broken system back to life.

“Rarely is anyone ever thanked for the work they did to prevent the disaster that never happened.”

– Mikko Hypponen, Finnish cybersecurity expert and author

The approach I learnt from my peers and leaders when designing technology infrastructure involves a bit of thinking. Thinking about systems as a whole. Thinking about complex systems. Thinking about how they are interconnected and interdependent. Thinking about what can fail where. And when. Thinking about the spread of impact when (not if) things fail, a.k.a. the blast radius. Then add to the list some layers: backup layers, protection layers and recoverability layers. One cannot keep thinking forever, unless a philosopher by profession! The act of putting all the thinking together, then testing and simulating failures, is rewarding. When it fails, in simulation or for real, lessons are captured and tweaks made. The cycle continues.

Systems Thinking: This is where it all begins. Even a single laptop that one can buy for $500 is hugely complex. If we talk about Microsoft Windows 11, the fundamentals have still not changed since the Windows 95/Windows NT days. The operating system, file system and authentication use the same principles. Some filenames are still the same (yes, the code has changed). Anyone want to talk regedit? Anyway, I digress. A computer connected to a corporate network and the Internet for its apps can be regarded as a complex system. People who deal with complex systems day in and day out know that complex systems fail. Complex systems have entropy; they will move from order to disorder. (Entropy, put simply, is the tendency of things to break down, to wear out over time.) Anticipate failure. I came across this in my LinkedIn feed: How Complex Systems Fail. It is a must-read if you deal with systems and failures. Talk about systemic failure, eh!

Points of Failure: There will always be multiple points of failure. Let’s first digest a local fact. Currently in New Zealand, there are no Tier 4 data centres, that is, a data centre that can guarantee 99.995% availability. (Read more about data centre tiers at the Uptime Institute.) If a service is hosted locally (private cloud or co-located), there is always a chance that it will fail. That does not mean that if it is hosted in the public cloud, it won’t. Where the service is hosted is one piece of the puzzle; how it is architected is the next. Business and technology leaders must recognise the points of failure.
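Those availability percentages are easier to digest when converted into downtime. A minimal sketch (the tier figures below are illustrative; the Uptime Institute defines the real thresholds):

```python
# Convert an availability percentage into the unplanned downtime it permits per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability_pct: float) -> float:
    """Maximum downtime per year implied by an availability percentage."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR

for label, pct in [("Tier 3 (99.982%)", 99.982),
                   ("Tier 4 (99.995%)", 99.995),
                   ("Five nines (99.999%)", 99.999)]:
    print(f"{label}: {downtime_minutes_per_year(pct):.1f} min/year")
```

Even a Tier 4 guarantee still allows roughly 26 minutes of downtime a year, which is the whole point: plan for it.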

Blast Radius: It is an interesting term, and it sticks with people. There are two aspects to this. One: the immediate damage. Two: the collateral damage. The collateral damage is bigger than the immediate damage; it kills reputation and bleeds money. A key principle is to keep the blast radius as small as possible when things go wrong. You can’t do it after the fact; systems need to be designed with it in mind. This practice dates back many years. When anti-virus updates or system patches were rolled out, they were tested and then rolled out in a staggered way. Yes, things failed even after testing. But you can only do that when it is a push-pull system (you pull the updates and then push them as you wish). What happens when you have no control over what gets pushed to your systems, and when? You cannot manage the blast radius. The entire organisation is in the blast radius. Some years ago, it was near impossible for vendors to sell software products that the business had no control over. With the CrowdStrike event, these discussions will be at the table again.
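The staggered rollout described above can be sketched in a few lines. This is not anyone’s actual deployment tooling; `deploy_to` and `healthy` are hypothetical hooks you would wire to your own systems:

```python
from typing import Callable

def staged_rollout(hosts: list[str],
                   deploy_to: Callable[[list[str]], None],
                   healthy: Callable[[list[str]], bool],
                   ring_sizes: tuple = (0.01, 0.10, 0.50, 1.0)) -> bool:
    """Push an update ring by ring; halt at the first unhealthy ring.

    ring_sizes are cumulative fractions of the fleet (1% canary, then 10%,
    then 50%, then everyone).
    """
    done = 0
    for fraction in ring_sizes:
        target = max(1, int(len(hosts) * fraction))
        ring = hosts[done:target]
        if not ring:
            continue
        deploy_to(ring)
        if not healthy(ring):
            # Halt: the blast radius is this ring, not the whole fleet.
            return False
        done = target
    return True
```

When the vendor controls the push and the schedule, there is no equivalent of that `healthy` check on your side, and the whole organisation sits inside the first ring.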

Layered Protection: When I first arrived in New Zealand, people told me about the layered clothing approach to deal with the four-seasons-in-a-day weather. The same applies to technology, except you don’t remove a layer! There are many layers of redundancy that need to be built. As we moved from centralised computing (mainframes) to distributed computing (Windows/Linux), the risk also got distributed. It got more complex as well. Multi-tier architecture introduced multiple layers that could fail. Designs need to enforce fault tolerance and an architecture built on a foundation of high availability. Imagine your core application in two datacentres in two geographic regions, fully in sync with each other, with network load balancing. You would think that an event like CrowdStrike would have no impact. Wrong! If only it were possible to limit the blast radius to one datacentre.

Backup of a backup: If a backup has ever saved your day (or night), you will realise the importance of it. One of my early lessons was about backups. The big challenge with backups is that you can only recover what has been backed up. This is commonly known as the Recovery Point: the point in time up to which you are prepared to lose your data. For large systems it takes many hours to back up (and restore) data. This is the last resort, when every idea has been explored and no stone left unturned. Businesses do not like this option. You lose all the transactions that have happened since the last backup; that could be 2, 8 or 24 hours or more. Imagine an ecommerce site restoring a backup that is 24 hours old. What happens to the orders placed after that? Call your lawyers!! Thankfully, technology architecture has evolved, and the recovery point and recovery time have both been shortened. In modern systems, you can restore almost to the last minute.
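The ecommerce example makes the Recovery Point concrete. A toy sketch, with made-up order IDs and timestamps, of what a restore actually discards:

```python
from datetime import datetime

def lost_since(orders: list[tuple[str, datetime]],
               last_backup: datetime) -> list[str]:
    """Order IDs placed after the restore point (the Recovery Point) are lost."""
    return [order_id for order_id, placed_at in orders if placed_at > last_backup]

backup_time = datetime(2024, 7, 18, 2, 0)    # nightly backup at 02:00
orders = [
    ("A100", datetime(2024, 7, 18, 1, 30)),   # before the backup: safe
    ("A101", datetime(2024, 7, 18, 9, 15)),   # after the backup: gone on restore
    ("A102", datetime(2024, 7, 18, 23, 50)),  # after the backup: gone on restore
]
print(lost_since(orders, backup_time))  # -> ['A101', 'A102']
```

Shrinking the Recovery Point (continuous replication, point-in-time recovery) shrinks that lost list, which is exactly why modern architectures can restore almost to the last minute.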

Tests and simulations: The most underrated and poorly performed technology function. We use so many sports analogies in leadership, but not in this case. Practice, practice, practice!!! I remember, when I worked for Citibank in Singapore in the early 2000s, there used to be an annual disaster recovery test. The planning lasted more than a month. A complete datacentre failure was initiated, switching to a secondary site, with many systems restored from backup. Tabletop exercises are now more common and do not take months of planning. That is great, but most of them miss the point and are not taken seriously. Like a fire drill, they need to be more regular and focus on multiple scenarios, not just security events. Many boards now require these tests to be reported to the Audit and Risk or a similar committee.

Recoverability: So far, I have talked about approaches that are preventive in nature. The fence at the top of the cliff. Systems will fail. The next big question is how quickly you can recover! It will depend upon the prevention approach, the type of failure and the extent of damage. In scenarios like the CrowdStrike event, where manual intervention is required for each affected system, it will be painstakingly long. At least in this case, the fix was quickly known. Many scenarios I have been in are needle-in-a-haystack situations, with a manager or a senior leader asking every 15 minutes, “Are we there yet?” A well-established and tested process, and an Incident Response Team that swings into action swiftly, are a blessing. The last thing you want is to be scrambling for the on-call roster to find your on-call Database Administrator!

[Dilbert comic strip omitted. © Scott Adams, creator of the superb Dilbert comic strip.]

Today, businesses and customers rely heavily upon technology. There is hardly any business untouched by technology in some way or another. Small businesses, the corner dairy, the local cafe may not be using a computer, but they rely upon payment systems that use technology in a nearly cashless society. As technologists, sometimes we (me included) end up chasing quick wins and forget our “duty of care” to our customers, to those who rely upon the systems we build, design, manage and operate. The CrowdStrike event is a wake-up call, from the woodpecker!

P.S. My apologies to CrowdStrike employees for using the unfortunate event as the title and content of my post. This is not a dig at you or your organisation. $*IT happens. Yes, you could have done better, and so could every other business that knowingly or unknowingly was your customer.



Discover more from Sid Kumar

Subscribe to get the latest posts to your email.

Or follow me on
