After AgentGPT’s success, Reworkd pivots to web-scraping AI agents

8:00 AM PDT • July 24, 2024

Reworkd’s founders went viral on GitHub last year with AgentGPT, a free tool for building AI agents that drew more than 100,000 daily users within a week. This earned them a spot in Y Combinator’s summer 2023 cohort, but the co-founders quickly realized that building general AI agents was too broad a goal. So now Reworkd is a web-scraping company, specifically building AI agents to extract structured data from the public web.

AgentGPT provided a simple interface in a browser where users could create autonomous AI agents. Soon, everyone was raving about how agents were the future of computing.

When the tool took off, Asim Shrestha, Adam Watkins, and Srijan Subedi were still living in Canada and Reworkd didn’t exist. The massive user influx caught them off guard; Subedi, now Reworkd’s COO, said the tool was costing them $2,000 a day in API calls. For that reason, they had to create Reworkd and get funded fast. One of the most popular use cases for AgentGPT was creating web scrapers, a relatively simple but high-volume task, so Reworkd made this its singular focus.

Web scrapers have become invaluable in the AI era. The number one reason organizations use public web data in 2024 is to build AI models, according to Bright Data’s latest report. The problem is that web scrapers are traditionally built by humans and must be customized for specific web pages, making them expensive. But Reworkd’s AI agents can scrape more of the web with fewer humans in the loop.

Customers can give Reworkd a list of hundreds, or even thousands, of websites to scrape and then specify the types of data they’re interested in. Then Reworkd’s AI agents use multimodal code generation to turn this into structured data. Agents generate unique code to scrape each website and extract that data for customers to use as they please.

For example, say you want stats on every NFL player, but every team’s website has a different layout. Instead of building a scraper for each website, Reworkd’s agents do that for you given just links and a description of the data you want to extract. With 32 teams, that could save you hours — but if there were 1,000 teams, it could save you weeks.
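
To make the NFL example concrete, here is a minimal sketch of per-site extraction code feeding one shared schema. Everything in it is an assumption made for illustration: the two page layouts, the `team-a.example` and `team-b.example` hosts, the `PlayerStat` fields and the hand-written patterns, which stand in for code an agent would generate. It is not Reworkd’s implementation.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PlayerStat:
    """Shared schema the customer asks for (hypothetical fields)."""
    name: str
    position: str
    games: int


# Two made-up team pages that publish the same data in different layouts.
PAGE_A = """
<table>
  <tr><td class="name">Ann Lee</td><td class="pos">QB</td><td class="gp">17</td></tr>
  <tr><td class="name">Bo Cruz</td><td class="pos">WR</td><td class="gp">15</td></tr>
</table>
"""

PAGE_B = """
<ul>
  <li data-position="RB" data-games="16">Cy Park</li>
</ul>
"""

# One extractor per site. Each pattern exposes the same named groups,
# so the output lands in one schema regardless of the page layout.
SITE_PATTERNS = {
    "team-a.example": re.compile(
        r'<td class="name">(?P<name>[^<]+)</td>'
        r'<td class="pos">(?P<position>[^<]+)</td>'
        r'<td class="gp">(?P<games>\d+)</td>'
    ),
    "team-b.example": re.compile(
        r'<li data-position="(?P<position>[^"]+)" '
        r'data-games="(?P<games>\d+)">(?P<name>[^<]+)</li>'
    ),
}


def extract(site: str, html: str) -> list[PlayerStat]:
    """Run the site-specific pattern and return records in the shared schema."""
    pattern = SITE_PATTERNS[site]
    return [
        PlayerStat(m["name"].strip(), m["position"], int(m["games"]))
        for m in pattern.finditer(html)
    ]


records = extract("team-a.example", PAGE_A) + extract("team-b.example", PAGE_B)
```

With 32 teams, the `SITE_PATTERNS` table would need 32 entries. Writing and maintaining those entries is the manual work the article says Reworkd’s agents take over.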

Reworkd raised a fresh $2.75 million in seed funding from Paul Graham, AI Grant (Nat Friedman and Daniel Gross’ startup accelerator), SV Angel, General Catalyst and Panache Ventures, among others, the startup exclusively told TechCrunch. Combined with a $1.25 million pre-seed investment last year from Panache Ventures and Y Combinator, this brings Reworkd’s total funding raised to date to $4 million.

AI that can use the internet

Shortly after forming Reworkd and moving to San Francisco, the team hired Rohan Pandey as a founding research engineer. He currently lives in AGI House SF, one of the Bay Area’s most popular hacker houses for the AI era. One investor described Pandey as a “one person research lab within Reworkd.”

“We see ourselves as the culmination of this 30-year dream of the Semantic Web,” said Pandey in an interview with TechCrunch, referring to a vision of world wide web inventor Tim Berners-Lee in which computers can read the entire internet. “Even though some websites don’t have markup, LLMs can understand the websites in the same ways that humans can, in such that we can expose basically any website as an API. So in some sense, Reworkd is like the universal API layer for the internet.”

Reworkd says it’s able to capture the long tail of customer data needs, meaning its AI agents are especially good at scraping thousands of smaller public websites that large competitors often skip over. Others, such as Bright Data, already have scrapers built for large websites like LinkedIn or Amazon, but it may not be worth the trouble for a human to build a scraper for every small website. Reworkd addresses this concern, but potentially raises others.

What exactly is “public” web data?

Though web scrapers have existed for decades, they have attracted controversy in the AI era. Unfettered scraping of huge swathes of data has thrown OpenAI and Perplexity into legal trouble: News and media organizations allege the AI companies extracted intellectual property from behind a paywall, reproducing it widely without payment. Reworkd is taking precautions to avoid these issues.

“We look at it as uplifting the accessibility of publicly available information,” said Shrestha, co-founder and CEO of Reworkd, in an interview with TechCrunch. “We’re only allowing information that’s publicly available; we’re not going through sign-in walls or anything like that.”

To go a step further, Reworkd says it’s avoiding scraping news altogether and being selective about who it works with. Watkins, the company’s CTO, says there are better tools elsewhere for aggregating news content, and it is not Reworkd’s focus.

As an example of what is, Reworkd described its work with Axis, a company that helps policy teams comply with government regulations. Axis uses Reworkd’s AI to extract data from thousands of government regulation documents for many countries across the European Union. Axis then trains and fine-tunes an AI model based on this data and offers it to clients as a product.

Starting a web-scraping company these days could be considered wading into dangerous territory, according to Aaron Fiske, partner at Silicon Valley-based law firm Gunderson Dettmer. The landscape is somewhat fluid right now, and the jury is still out on how “public” web data really is for AI models. However, Fiske says Reworkd’s approach, where customers decide what websites to scrape, may insulate it from legal liability.

“It’s like they invented the copying machine, and there’s this one use case for making copies that turned out to be hugely economically valuable, but also legally, really questionable,” said Fiske in an interview with TechCrunch. “It’s not like web scrapers servicing AI companies is necessarily risky, but working with AI companies that are really interested in harvesting copyrighted content is maybe an issue.”

That’s why Reworkd is being careful about who it works with. Web scrapers have so far escaped much of the blame in potential copyright infringement cases related to AI. In the OpenAI case, Fiske points out that The New York Times did not sue the web scraper that collected its articles, but rather the company that allegedly reproduced its work. But even there, it’s yet to be decided if what OpenAI did was truly copyright infringement.

There’s more evidence that web scrapers are legally in the clear during the AI boom. A court recently ruled in favor of Bright Data after it scraped Facebook and Instagram profiles via the web. One example in the court case was a dataset of 615 million records of Instagram user data, which Bright Data sells for $860,000. Meta sued the company, alleging this violated its terms of service. But a court ruled that this data is public and therefore available to scrape.

Investors think Reworkd scales with the big guys

Reworkd has attracted big names as early investors, from Y Combinator and Paul Graham to Daniel Gross and Nat Friedman. Some investors say this is because Reworkd’s technology stands to improve, and get cheaper, alongside new models. The startup says OpenAI’s GPT-4o is currently the best for its multimodal code generation and that a lot of Reworkd’s technology wasn’t possible until just a few months ago.

“If you try to compete with the rate of technology progress — not building on top of it — then I think that you’ll have a hard time as a founder,” General Catalyst’s Viet Le told TechCrunch. “Reworkd has the mindset of basing its solution on the rate of progress.”

Reworkd is creating AI agents that address a particular gap in the market: companies need more data because AI is advancing quickly. As more companies build custom AI models specific to their business, Reworkd stands to gain more customers. Fine-tuning models necessitates quality, structured data, and lots of it.

Reworkd says its approach is “self-healing,” meaning that its web scrapers won’t break down due to a web page update. The startup claims to avoid hallucination issues traditionally associated with AI models because Reworkd’s agents are generating code to scrape a website. It’s possible the AI could make a mistake and grab the wrong data from a website, but Reworkd’s team created Banana-lyzer, an open source evaluation framework, to regularly assess its accuracy.
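
As a rough illustration of what regularly assessing a scraper’s accuracy can mean, the sketch below scores extracted records against hand-labeled expected records, field by field. The function name, the sample records and the 0.9 threshold are all hypothetical. This is not Banana-lyzer’s API, only one simple way to score a scraper.

```python
def field_accuracy(expected: list[dict], extracted: list[dict]) -> float:
    """Fraction of expected fields that the scraper reproduced exactly.

    Records are compared position by position; expected records with no
    extracted counterpart count as entirely wrong.
    """
    total = sum(len(record) for record in expected)
    correct = 0
    for want, got in zip(expected, extracted):
        correct += sum(1 for key, value in want.items() if got.get(key) == value)
    return correct / total if total else 1.0


# Hand-labeled ground truth for one page (made-up data).
EXPECTED = [
    {"name": "Ann Lee", "position": "QB"},
    {"name": "Bo Cruz", "position": "WR"},
]

# What a scraper returned after a hypothetical page redesign: one field is wrong.
EXTRACTED = [
    {"name": "Ann Lee", "position": "QB"},
    {"name": "Bo Cruz", "position": "Roster"},
]

REGENERATE_BELOW = 0.9  # assumed threshold, chosen only for this example

score = field_accuracy(EXPECTED, EXTRACTED)
needs_regeneration = score < REGENERATE_BELOW
```

Here three of the four expected fields match, so the score is 0.75 and the scraper would be flagged. In a “self-healing” setup, a low score like this could be the signal to regenerate the scraping code for that site.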

Reworkd doesn’t have a large payroll — the team is just four people — but it does have to take on considerable inference costs for running its AI agents. The startup expects its pricing to get increasingly competitive as these costs trend downward. OpenAI just released GPT-4o mini, a smaller version of its industry-leading model with competitive benchmarks. Innovations like these could make Reworkd more competitive.

Paul Graham and AI Grant did not respond to TechCrunch’s request for comment.
