Enterprise

How to prevent your software update from being the next CrowdStrike

Comment

Times Square billboards displaying Windows blue screen of death after CrowdStrike outage on July 19, 2024.
Image Credits: Selcuk Acar/Anadolu / Getty Images

CrowdStrike released a relatively minor patch on Friday, and somehow it wreaked havoc on large swaths of the IT world running Microsoft Windows, bringing down airports, healthcare facilities and 911 call centers. While we know a faulty update caused the problem, we don’t know how it got released in the first place. A company like CrowdStrike very likely has a sophisticated DevOps pipeline with release policies in place, but even with that, the buggy code somehow slipped through.

In this case it was perhaps the mother of all buggy code. The company has suffered a steep hit to its reputation, and the stock price plunged from $345.10 on Thursday evening to $263.10 by Monday afternoon. It has since recovered slightly.

In a statement on Friday, the company acknowledged the consequences of the faulty update: “All of CrowdStrike understands the gravity and impact of the situation. We quickly identified the issue and deployed a fix, allowing us to focus diligently on restoring customer systems as our highest priority.”

Further, it explained the root cause of the outage, although not how it happened. That’s a post mortem process that will likely go on inside the company for some time as it looks to prevent such a thing from happening again.

Dan Rogers, CEO at LaunchDarkly, a firm that uses a concept called feature flags to deploy software in a highly controlled way, couldn’t speak directly to the CrowdStrike deployment problem, but he could speak to software deployment issues more broadly.

“Software bugs happen, but most of the software experience issues that someone would experience are actually not because of infrastructure issues,” he told TechCrunch. “They’re because someone rolled out a piece of software that doesn’t work, and those in general are very controllable.” With feature flags, you can control the speed of deployment of new features, and turn a feature off, if things go wrong to prevent the problem from spreading widely.

It is important to note however, that in this case, the problem was at the operating system kernel level, and once that has run amok, it’s harder to fix than say a web application. Still, a slower deployment could have alerted the company to the problem a lot sooner.

What happened at CrowdStrike could potentially happen to any software company, even one with good software release practices in place, said Jyoti Bansal, founder and CEO at Harness, a maker of DevOps pipeline developer tools. While he also couldn’t say precisely what happened at CrowdStrike, he talked generally about how buggy code can slip through the cracks.

Typically, there is a process in place where code gets tested thoroughly before it gets deployed, but sometimes an engineering team, especially in a large engineering group, may cut corners. “It’s possible for something like this to happen when you skip the DevOps testing pipeline, which is pretty common with minor updates,” Bansal told TechCrunch.

He says this often happens at larger organizations where there isn’t a single approach to software releases. “Let’s say you have 5,000 engineers, which probably will be divided into 100 teams of 50 or so different developers. These teams adopt different practices,” he said. And without standardization, it’s easier for bad code to slip through the cracks.

How to prevent bugs from slipping through

Both CEOs acknowledge that bugs get through sometimes, but there are ways to minimize the risk, including perhaps the most obvious one: practicing standard software release hygiene. That involves testing before deploying and then deploying in a controlled way.

Rogers points to his company’s software and notes that progressive rollouts are the place to start. Instead of delivering the change to every user all at once, you instead release it to a small subset and see what happens before expanding the rollout. Along the same lines, if you have controlled rollouts and something goes wrong, you can roll back. “This idea of feature management or feature control lets you roll back features that aren’t working and get people back to the prior version if things are not working.”

Bansal, whose company just bought feature flag startup Split.io in May, also recommends what he calls “canary deployments,” which are small controlled test deployments. They are called this because they hark back to canaries being sent into coal mines to test for carbon monoxide leakage. Once you prove the test roll out looks good, then you can move to the progressive roll out that Rogers alluded to.

As Bansal says, it can look fine in testing, but a lab test doesn’t always catch everything, and that’s why you have to combine good DevOps testing with controlled deployment to catch things that lab tests miss.

Rogers suggests when doing an analysis of your software testing regimen, you look at three key areas — platform, people and processes — and they all work together in his view. “It’s not sufficient to just have a great software platform. It’s not sufficient to have highly enabled developers. It’s also not sufficient to just have predefined workflows and governance. All three of those have to come together,” he said.

One way to prevent individual engineers or teams from circumventing the pipeline is to require the same approach for everyone, but in a way that doesn’t slow the teams down. “If you build a pipeline that slows down developers, they will at some point find ways to get their job done outside of it because they will think that the process is going to add another two weeks or a month before we can ship the code that we wrote,” Bansal said.

Rogers agrees that it’s important not to put rigid systems in place in response to one bad incident. “What you don’t want to have happen now is that you’re so worried about making software changes that you have a very long and protracted testing cycle and you end up stifling software innovation,” he said.

Bansal says a thoughtful automated approach can actually be helpful, especially with larger engineering groups. But there is always going to be some tension between security and governance and the need for release velocity, and it’s hard to find the right balance.

We might not know what happened at CrowdStrike for some time, but we do know that certain approaches help minimize the risks around software deployment. Bad code is going to slip through from time to time, but if you follow best practices, it probably won’t be as catastrophic as what happened last week.

More TechCrunch

The pharma giant won’t say how many patients were affected by its February data breach. A count by TechCrunch confirms that over a million people are affected.

Pharma giant Cencora is alerting millions about its data breach

Self-driving technology company Aurora Innovation is looking to raise hundreds of millions in additional capital as it races toward a driverless commercial launch by the end of 2024.  Aurora is…

Self-driving truck startup Aurora Innovation to sell up to $420M in shares ahead of commercial launch

Payments infrastructure firm Infibeam Avenues has acquired a majority 54% stake in Rediff.com for up to $3 million, a dramatic twist of fate for the 28-year-old business that was the…

Rediff, once an internet pioneer in India, sells majority stake for $3M

The ruling confirmed an earlier decision in April from the High Court of Podgorica which rejected a request to extradite the crypto fugitive to the United States.

Terraform Labs co-founder and crypto fugitive Do Kwon set for extradition to South Korea

A day after Meta CEO Mark Zuckerberg talked about his newest social media experiment Threads reaching “almost” 200 million users on the company’s Q2 2024 earnings call, the platform has…

Meta’s Threads crosses 200 million active users

TechCrunch Disrupt 2024 will be in San Francisco on October 28–30, and we’re already excited! Disrupt brings innovation for every stage of your startup journey, and we could not bring you this…

Connect with Google Cloud, Aerospace, Qualcomm and more at Disrupt 2024

Featured Article

A comprehensive list of 2024 tech layoffs

The tech layoff wave is still going strong in 2024. Following significant workforce reductions in 2022 and 2023, this year has already seen 60,000 job cuts across 254 companies, according to independent layoffs tracker Layoffs.fyi. Companies like Tesla, Amazon, Google, TikTok, Snap and Microsoft have conducted sizable layoffs in the…

A comprehensive list of 2024 tech layoffs

Intel announced it would layoff more than 15% of its staff, or 15,000 employees, in a memo to employees on Thursday. The massive headcount is part of a large plan…

Intel to lay off 15,000 employees

Following the recent lawsuit filed by the Recording Industry Association of America (RIAA) against music generation startups Udio and Suno, Suno admitted in a court filing on Thursday that it did, in…

AI music startup Suno claims training model on copyrighted music is ‘fair use’

In spite of a drop for the quarter, iPhone remained Apple’s most important category by a wide margin.

iPad sales help bail out Apple amid a continued iPhone slide

Molly Alter wears a lot of hats. She’s a mocumentary filmmaker working on a project about an alternate reality where charades is big business. She’s a caesar salad connoisseur and…

How filming a cappella concerts and dance recitals led Northzone’s newest partner Molly Alter to a career in VC

Microsoft has a long and tangled history with OpenAI, having invested a reported $13 billion in the ChatGPT maker as part of a long-term partnership. As part of the deal,…

Microsoft now lists OpenAI as a competitor in AI and search

The San Jose-based startup raised $60 million in a round that values it lower than the $500 million valuation it garnered in its most recent round, according to multiple sources.

Sequoia-backed Knowde raises Series C at a valuation cut

X (formerly Twitter) can no longer be accessed in the Mac App Store, suggesting that it has been officially delisted.  Searches for both “Twitter” and “X” on Apple’s platform no…

Twitter disappears from Mac App Store

Google Thursday said that it is introducing new Gemini-powered features for Chrome’s desktop version, including Lens for desktop, tab compare for shopping assistance, and natural language integration for search history.…

Google brings Gemini-powered search history and Lens to Chrome desktop

When Xiaoyin Qu was growing up in China, she was obsessed with learning how to build paper airplanes that could do flips in the air. Her parents, though, didn’t have…

Heeyo built an AI chatbot to be a billion kids’ interactive tutor and friend

While the company was awarded a massive, $4.2 billion contract to accelerate Starliner development in 2014, it was structured as a “fixed-price” model.

Boeing bleeds another $125M on Starliner program, bringing total losses to $1.6B

Welcome back to TechCrunch Mobility — your central hub for news and insights on the future of transportation. Sign up here for free — just click TechCrunch Mobility! Summer road…

Anthony Levandowski bets on off-road autonomy, Nuro plots a comeback and Applied Intuition gets more investor love

Google’s new features include Gemini in BigQuery and Looker to help users with data engineering and analysis.

Google Cloud expands its database portfolio with new AI capabilities

Rad Power Bikes, the Seattle-based e-bike startup that has raised more than $300 million from investors, went through another round of layoffs in July, TechCrunch has exclusively learned. This is…

VC darling Rad Power Bikes hit with another round of layoffs

Five years ago, as robotaxis and self-driving truck startups were still raking in millions in venture capital, Anthony Levandowski turned to off-road autonomy. Now, that decision — which brought the…

Why Anthony Levandowski returned to his off-road autonomous vehicle roots with AV startup Pronto

Commercial space station company Vast is building a private microgravity research lab as part of its wider Haven-1 station plans. The module is set to launch no earlier than the…

Vast plans microgravity lab on its Haven-1 private space station

Google Cloud is giving Y Combinator startups access to a dedicated, subsidized cluster of Nvidia graphics processing units and Google tensor processing units to build AI models. It’s part of…

Google Cloud now has a dedicated cluster of Nvidia GPUs for Y Combinator startups

StackShare is one of the more popular platforms for developers to discuss, track, and share the tools they use to build applications.

Open source startup FOSSA is buying StackShare, a site used by 1.5M developers

Featured Article

Indian startups gut valuations ahead of IPO push

Ola Electric and FirstCry are set to test investor appetite with public listing, both pricing their shares below their previous valuation asks.

Indian startups gut valuations ahead of IPO push

The European Union’s risk-based regulation for applications of artificial intelligence has come into force starting from today.

The EU’s AI Act is now in force

The company also said it has received regulatory clearance to start Phase 2 clinical trials for a new drug in the U.S. later this year.

Healx, an AI-enabled drug discovery platform for rare diseases, raises $47M

The European Commission (EC) has given the go-ahead to HPE’s planned megabucks acquisition of Juniper Networks.

EU greenlights HPE’s $14B Juniper Networks acquisition

Meta, which develops one of the biggest foundational open source large language models, Llama, believes it will need significantly more computing power to train models in the future. Mark Zuckerberg…

Zuckerberg says Meta will need 10x more computing power to train Llama 4 than Llama 3

Axle Energy is a B2B, back-end infrastructure business focused on connecting flexible assets, such as electric vehicles and home batteries, to energy markets that aren’t otherwise available for consumers to…

Axle Energy’s sprint to decarbonize the grid lights up with $9M seed led by Accel