OpenAI breach is a reminder that AI companies are treasure troves for hackers

12:49 PM PDT • July 5, 2024

Image Credits: Bryce Durbin / TechCrunch

There’s no need to worry that your secret ChatGPT conversations were obtained in a recently reported breach of OpenAI’s systems. The hack itself, while troubling, appears to have been superficial — but it’s a reminder that AI companies have in short order made themselves into one of the juiciest targets out there for hackers.

The New York Times reported the hack in more detail after former OpenAI employee Leopold Aschenbrenner hinted at it recently in a podcast. He called it a “major security incident,” but unnamed company sources told the Times the hacker only got access to an employee discussion forum. (I reached out to OpenAI for confirmation and comment.)

No security breach should really be treated as trivial, and eavesdropping on internal OpenAI development talk certainly has its value. But it’s far from a hacker getting access to internal systems, models in progress, secret roadmaps, and so on.

But it should scare us anyway, and not necessarily because of the threat of China or other adversaries overtaking us in the AI arms race. The simple fact is that these AI companies have become gatekeepers to a tremendous amount of very valuable data.

Let’s talk about three kinds of data OpenAI and, to a lesser extent, other AI companies created or have access to: high-quality training data, bulk user interactions, and customer data.

It’s uncertain what training data exactly they have, because the companies are incredibly secretive about their hoards. But it’s a mistake to think they are just big piles of scraped web data. Yes, they do use web scrapers or datasets like the Pile, but it’s a gargantuan task shaping that raw data into something that can be used to train a model like GPT-4o. A huge amount of human work hours are required to do this — it can only be partially automated.

AI training data has a price tag that only Big Tech can afford

Some machine learning engineers have speculated that of all the factors going into the creation of a large language model (or, perhaps, any transformer-based system), the single most important one is dataset quality. That’s why a model trained on Twitter and Reddit will never be as eloquent as one trained on every published work of the last century. (And probably why OpenAI reportedly used questionably legal sources like copyrighted books in their training data, a practice they claim to have given up.)

So the training datasets OpenAI has built are of tremendous value to competitors, from other companies to adversary states to regulators here in the U.S. Wouldn’t the Federal Trade Commission (FTC) or courts like to know exactly what data was being used, and whether OpenAI has been truthful about that?

But perhaps even more valuable is OpenAI’s enormous trove of user data — probably billions of conversations with ChatGPT on hundreds of thousands of topics. Just as search data was once the key to understanding the collective psyche of the web, ChatGPT has its finger on the pulse of a population that may not be as broad as the universe of Google users, but provides far more depth. (In case you weren’t aware, unless you opt out, your conversations are being used for training data.)

AI-powered scams and what you can do about them

In the case of Google, an uptick in searches for “air conditioners” tells you the market is heating up a bit. But those users don’t then have a whole conversation about what they want, how much money they’re willing to spend, what their home is like, manufacturers they want to avoid, and so on. You know this is valuable because Google is itself trying to convert its users to provide this very information by substituting AI interactions for searches!

Think of how many conversations people have had with ChatGPT, and how useful that information is, not just to developers of AIs, but also to marketing teams, consultants, analysts … It’s a gold mine.

The last category of data is perhaps of the highest value on the open market: how customers are actually using AI, and the data they have themselves fed to the models.

Hundreds of major companies and countless smaller ones use tools like OpenAI and Anthropic’s APIs for an equally large variety of tasks. And in order for a language model to be useful to them, it usually must be fine-tuned on or otherwise given access to their own internal databases.

This might be something as prosaic as old budget sheets or personnel records (e.g., to make them more easily searchable) or as valuable as code for an unreleased piece of software. What they do with the AI’s capabilities (and whether they’re actually useful) is their business, but the simple fact is that the AI provider has privileged access, just as any other SaaS product does.

These are industrial secrets, and AI companies are suddenly right at the heart of a great deal of them. The newness of this side of the industry carries with it a special risk in that AI processes are simply not yet standardized or fully understood.

Hugging Face says it detected ‘unauthorized access’ to its AI model hosting platform

Like any SaaS provider, AI companies are perfectly capable of providing industry standard levels of security, privacy, on-premises options, and generally speaking providing their service responsibly. I have no doubt that the private databases and API calls of OpenAI’s Fortune 500 customers are locked down very tightly! They must certainly be as aware or more of the risks inherent in handling confidential data in the context of AI. (The fact that OpenAI did not report this attack is their choice to make, but it doesn’t inspire trust for a company that desperately needs it.)

But good security practices don’t change the value of what they are meant to protect, or the fact that malicious actors and sundry adversaries are clawing at the door to get in. Security isn’t just picking the right settings or keeping your software updated — though of course the basics are important too. It’s a never-ending cat-and-mouse game that is, ironically, now being supercharged by AI itself: Agents and attack automators are probing every nook and cranny of these companies’ attack surfaces.

There’s no reason to panic — companies with access to lots of personal or commercially valuable data have faced and managed similar risks for years. But AI companies represent a newer, younger, and potentially juicier target than your garden-variety, poorly configured enterprise server or irresponsible data broker. Even a hack like the one reported above, with no serious exfiltrations that we know of, should worry anybody who does business with AI companies. They’ve painted the targets on their backs. Don’t be surprised when anyone, or everyone, takes a shot.

AI aids nation-state hackers but also helps US spies to find them, says NSA cyber director

More TechCrunch

Pharma giant Cencora is alerting millions about its data breach

Zack Whittaker

2 mins ago

The pharma giant won’t say how many patients were affected by its February data breach. A count by TechCrunch confirms that over a million people are affected.

Pharma giant Cencora is alerting millions about its data breach

Transportation

Self-driving truck startup Aurora Innovation to sell up to $420M in shares ahead of commercial launch

Rebecca Bellan

15 mins ago

Self-driving technology company Aurora Innovation is looking to raise hundreds of millions in additional capital as it races toward a driverless commercial launch by the end of 2024. Aurora is…

Self-driving truck startup Aurora Innovation to sell up to $420M in shares ahead of commercial launch

Venture

Rediff, once an internet pioneer in India, sells majority stake for $3M

Manish Singh

4 hours ago

Payments infrastructure firm Infibeam Avenues has acquired a majority 54% stake in Rediff.com for up to $3 million, a dramatic twist of fate for the 28-year-old business that was the…

Rediff, once an internet pioneer in India, sells majority stake for $3M

Crypto

Terraform Labs co-founder and crypto fugitive Do Kwon set for extradition to South Korea

Kate Park

6 hours ago

The ruling confirmed an earlier decision in April from the High Court of Podgorica which rejected a request to extradite the crypto fugitive to the United States.

Terraform Labs co-founder and crypto fugitive Do Kwon set for extradition to South Korea

Apps

Meta’s Threads crosses 200 million active users

Ivan Mehta

6 hours ago

A day after Meta CEO Mark Zuckerberg talked about his newest social media experiment Threads reaching “almost” 200 million users on the company’s Q2 2024 earnings call, the platform has…

Meta’s Threads crosses 200 million active users

TechCrunch Disrupt 2024

Connect with Google Cloud, Aerospace, Qualcomm and more at Disrupt 2024

Cindy Zackney

12 hours ago

TechCrunch Disrupt 2024 will be in San Francisco on October 28–30, and we’re already excited! Disrupt brings innovation for every stage of your startup journey, and we could not bring you this…

Connect with Google Cloud, Aerospace, Qualcomm and more at Disrupt 2024

Featured Article

A comprehensive list of 2024 tech layoffs

The tech layoff wave is still going strong in 2024. Following significant workforce reductions in 2022 and 2023, this year has already seen 60,000 job cuts across 254 companies, according to independent layoffs tracker Layoffs.fyi. Companies like Tesla, Amazon, Google, TikTok, Snap and Microsoft have conducted sizable layoffs in the…

Cody Corrall

Alyssa Stringer

16 hours ago

A comprehensive list of 2024 tech layoffs

Enterprise

Intel to lay off 15,000 employees

Maxwell Zeff

16 hours ago

Intel announced it would layoff more than 15% of its staff, or 15,000 employees, in a memo to employees on Thursday. The massive headcount is part of a large plan…

AI music startup Suno claims training model on copyrighted music is ‘fair use’

Lauren Forristal

17 hours ago

Following the recent lawsuit filed by the Recording Industry Association of America (RIAA) against music generation startups Udio and Suno, Suno admitted in a court filing on Thursday that it did, in…

AI music startup Suno claims training model on copyrighted music is ‘fair use’

Hardware

iPad sales help bail out Apple amid a continued iPhone slide

Brian Heater

17 hours ago

In spite of a drop for the quarter, iPhone remained Apple’s most important category by a wide margin.

iPad sales help bail out Apple amid a continued iPhone slide

Venture

How filming a cappella concerts and dance recitals led Northzone’s newest partner Molly Alter to a career in VC

Rebecca Szkutak

20 hours ago

Molly Alter wears a lot of hats. She’s a mocumentary filmmaker working on a project about an alternate reality where charades is big business. She’s a caesar salad connoisseur and…

How filming a cappella concerts and dance recitals led Northzone’s newest partner Molly Alter to a career in VC

Microsoft now lists OpenAI as a competitor in AI and search

Maxwell Zeff

21 hours ago

Microsoft has a long and tangled history with OpenAI, having invested a reported $13 billion in the ChatGPT maker as part of a long-term partnership. As part of the deal,…

Microsoft now lists OpenAI as a competitor in AI and search

Startups

Sequoia-backed Knowde raises Series C at a valuation cut

Rebecca Szkutak

22 hours ago

The San Jose-based startup raised $60 million in a round that values it lower than the $500 million valuation it garnered in its most recent round, according to multiple sources.

Sequoia-backed Knowde raises Series C at a valuation cut

Apps

Twitter disappears from Mac App Store

Lauren Forristal

22 hours ago

X (formerly Twitter) can no longer be accessed in the Mac App Store, suggesting that it has been officially delisted. Searches for both “Twitter” and “X” on Apple’s platform no…

Google brings Gemini-powered search history and Lens to Chrome desktop

Ivan Mehta

23 hours ago

Google Thursday said that it is introducing new Gemini-powered features for Chrome’s desktop version, including Lens for desktop, tab compare for shopping assistance, and natural language integration for search history.…

The EU’s AI Act is now in force

Natasha Lomas

1 day ago

The European Union’s risk-based regulation for applications of artificial intelligence has come into force starting from today.

Biotech & Health

Healx, an AI-enabled drug discovery platform for rare diseases, raises $47M

Paul Sawers

1 day ago

The company also said it has received regulatory clearance to start Phase 2 clinical trials for a new drug in the U.S. later this year.

Healx, an AI-enabled drug discovery platform for rare diseases, raises $47M

Enterprise

EU greenlights HPE’s $14B Juniper Networks acquisition

Paul Sawers

1 day ago

The European Commission (EC) has given the go-ahead to HPE’s planned megabucks acquisition of Juniper Networks.

EU greenlights HPE’s $14B Juniper Networks acquisition

Zuckerberg says Meta will need 10x more computing power to train Llama 4 than Llama 3

Ivan Mehta

1 day ago

Meta, which develops one of the biggest foundational open source large language models, Llama, believes it will need significantly more computing power to train models in the future. Mark Zuckerberg…

Zuckerberg says Meta will need 10x more computing power to train Llama 4 than Llama 3

Climate

Axle Energy’s sprint to decarbonize the grid lights up with $9M seed led by Accel

Natasha Lomas

1 day ago

Axle Energy is a B2B, back-end infrastructure business focused on connecting flexible assets, such as electric vehicles and home batteries, to energy markets that aren’t otherwise available for consumers to…

Axle Energy’s sprint to decarbonize the grid lights up with $9M seed led by Accel

OpenAI breach is a reminder that AI companies are treasure troves for hackers

More TechCrunch

Get the industry’s biggest tech news

TechCrunch Daily News

Startups Weekly

TechCrunch Fintech

TechCrunch Mobility

Tags