Web scraping archives

Cloudflare Introduces Pay-Per-Crawl Tool to Help Websites Monetize AI Bot Access

Cloudflare has unveiled a new tool designed to give website owners greater control over AI bot crawlers accessing their content, allowing them to block unauthorized scraping or set fees for access. The move aims to help publishers and content creators monetize the use of their material by artificial intelligence companies, which increasingly crawl websites to train AI models without sending traffic back or providing compensation.

The tool enables site owners to choose which AI crawlers can access their content and implement a “pay per crawl” pricing model, helping creators control how their work is used and ensure fair payment. This innovation comes amid declining referral traffic from search engines, which historically drove ad revenue to websites.
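
To make the idea concrete, here is a minimal sketch of a per-crawler access policy with three outcomes: allow, block, or charge. It is not Cloudflare's API or configuration format. The crawler names, the `POLICY` table, the `decide` function, and the price are all hypothetical and only illustrate the choices described above.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    CHARGE = "charge"


@dataclass(frozen=True)
class Decision:
    action: Action
    price_usd: float = 0.0  # only meaningful when action is CHARGE


# Hypothetical policy table: crawler names and the price are made up.
POLICY = {
    "ExampleSearchBot": Decision(Action.ALLOW),
    "ExampleAIBot": Decision(Action.CHARGE, price_usd=0.01),
}

# Assumption: crawlers the site owner has not listed are blocked.
DEFAULT_DECISION = Decision(Action.BLOCK)


def decide(crawler_name: str) -> Decision:
    """Return the site owner's decision for a given crawler."""
    return POLICY.get(crawler_name, DEFAULT_DECISION)


if __name__ == "__main__":
    for name in ("ExampleSearchBot", "ExampleAIBot", "UnknownBot"):
        print(name, decide(name))
```

Blocking unlisted crawlers by default is one possible design choice; a site owner could just as well allow by default and list only the crawlers to charge or block.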

Major publishers such as Condé Nast and the Associated Press, along with social platforms including Reddit and Pinterest, back the initiative. Cloudflare’s Chief Strategy Officer, Stephanie Cohen, explained that the tool is designed to establish a sustainable ecosystem for content creators and AI companies alike. She highlighted that rapid changes in traffic patterns demand new approaches, calling this tool “the beginning of a new model for the internet.”

Data from Cloudflare shows that Google’s ratio of crawls to visitor referrals has worsened from 6:1 to 18:1 in six months, meaning Google now crawls a site 18 times for every visitor it sends back. This suggests users increasingly get answers directly from Google search results or AI features rather than visiting original sites. However, Google’s crawl-to-referral ratio remains far lower than that of AI firms such as OpenAI, whose ratios are around 1,500:1, reflecting heavy content scraping with almost no referral traffic.
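
A short worked example shows what these ratios mean for a publisher. The three ratios come from the figures cited above; the volume of 18,000 crawls and the function name `expected_referrals` are assumptions chosen only to make the arithmetic easy to follow.

```python
def expected_referrals(crawls: int, crawls_per_referral: float) -> float:
    """Visitors sent back for a number of crawls at an N:1 crawl-to-referral ratio."""
    if crawls_per_referral <= 0:
        raise ValueError("crawls_per_referral must be positive")
    return crawls / crawls_per_referral


if __name__ == "__main__":
    crawls = 18_000  # hypothetical crawl volume, not a figure from the article
    ratios = [("Google, six months ago", 6), ("Google, now", 18), ("OpenAI", 1_500)]
    for label, ratio in ratios:
        print(f"{label}: {ratio}:1 -> {expected_referrals(crawls, ratio):.0f} referrals")
```

For the same 18,000 crawls, this gives 3,000 referred visitors at 6:1, 1,000 at 18:1, and 12 at 1,500:1.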

For decades, traditional search engines indexed web content and drove users to publishers, rewarding them for their work. But AI crawlers disrupt this model by harvesting data without sending visitors back, aggregating content in chatbots like ChatGPT, and reducing creators’ revenue and recognition.

Many AI companies bypass common publisher tools used to block scraping and argue their data collection is legal and fair use. This has led some publishers, including the New York Times, to sue AI firms for copyright infringement. Others have negotiated licensing agreements to protect their content and monetize usage.
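
The article does not name the publisher tools in question. The most widely used one is a `robots.txt` file, which is assumed here. The sketch below uses Python's standard `urllib.robotparser` to show how a well-behaved crawler would check such a file before fetching a page. The crawler name `ExampleAIBot` and the URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/article"
    print("ExampleAIBot allowed:", is_allowed("ExampleAIBot", url))
    print("OtherBot allowed:", is_allowed("OtherBot", url))
```

Compliance with `robots.txt` is voluntary: the file states the publisher's wishes, but nothing in it stops a crawler that never performs this check.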

Reddit, notably, has sued AI startup Anthropic for scraping user comments but also signed a licensing deal with Google, illustrating the complex responses from content owners seeking to protect their assets in the AI era.

Reddit Sues AI Firm Anthropic for Alleged Unauthorized Use of Data

Reddit has filed a lawsuit against artificial intelligence startup Anthropic, accusing it of illegally using Reddit’s content to train its AI models without permission or a licensing agreement. The suit was filed Wednesday in San Francisco Superior Court, marking the latest legal clash over AI companies’ use of third-party online content.

In the complaint, Reddit alleges that Anthropic has scraped and exploited data from the platform over 100,000 times, even though Anthropic publicly claimed last year that it had blocked its bots from accessing Reddit. According to Reddit, Anthropic’s Claude chatbot even acknowledged it was trained on at least some Reddit data, but could not confirm whether deleted content had been included.

“Anthropic refuses to respect Reddit’s guardrails and enter into a license agreement,” the complaint says, contrasting the company’s stance with that of Google and OpenAI, both of which have entered licensing arrangements with Reddit.

Reddit claims Anthropic’s actions violate its user policies and have allowed the startup to enrich itself by “tens of billions of dollars.” The lawsuit seeks unspecified restitution, punitive damages, and an injunction to stop Anthropic from further using Reddit content for commercial purposes.

Anthropic Responds

An Anthropic spokesperson said the company disagrees with Reddit’s claims and intends to defend itself vigorously. The lawsuit adds further scrutiny to Anthropic, whose backers include tech giants Amazon and Alphabet (Google).

Anthropic recently launched its latest Claude models, Opus 4 and Sonnet 4, on May 22, and has reportedly reached $3 billion in annualized revenue, according to sources familiar with the matter.

Growing Legal Tensions Over AI Training Data

This legal dispute highlights a broader industry-wide debate over how AI companies source and utilize data to train large language models. Many websites and publishers argue that AI firms are profiting from content without compensating the creators, while AI companies contend that publicly available internet data falls under fair use.

In a statement, Reddit Chief Legal Officer Ben Lee emphasized the platform’s support for an open internet but said AI companies need “clear limitations” when it comes to scraping and monetizing content.

Both companies are headquartered in San Francisco, just a few blocks apart.

The case is Reddit Inc v Anthropic PBC, California Superior Court, San Francisco County, No. CGC-25-524892.
