
Are AI Crawlers User-Driven Tools or Malicious Bots?


In response to Cloudflare blocking Perplexity’s AI bots, the AI company has issued a statement saying that Cloudflare has got nearly everything wrong about how modern AI assistants function.

The AI company argued that AI assistants differ from traditional bots that crawl websites’ content and claimed that AI agents fall under the principles of an open web.

Notably, Cloudflare told MediaNama that it does not block Perplexity unless a customer instructs it to do so. “Our point remains specific: content creators should have the right to control access to their content. We believe Perplexity’s admitted practices undermine this fundamental right,” read the official statement by Cloudflare.

The statement also acknowledges Perplexity’s attempt to change how AI agents and agent-driven web crawling are generally understood. This issue raises the question: Should Perplexity’s AI crawler qualify as a “user-driven automated tool” or as a traditional web-crawling bot?

How did Perplexity respond to Cloudflare’s blocking?

Perplexity explained that if the AI doesn’t have the answer to the queried question in its database, it goes to the internet to fetch the answers. “It goes to the relevant websites, reads the content, and brings back a summary tailored to your specific question,” explains its blog post.

Therefore, Perplexity is claiming that its crawlers don’t systematically visit millions of web pages to build “massive databases”. “Perplexity’s user-driven agents do not store the information or train with it,” the company claims.

Elsewhere, Cloudflare told MediaNama that Perplexity’s behaviour revealed two things: its bots don’t respect the Robots Exclusion Protocol (REP), and use obfuscated crawling methods to access content. “When we blocked the access, we immediately observed follow-on requests from other user agents, which Perplexity admits belong to a third-party tool they use,” said Cloudflare.

The cloud company was referring to a third-party service called BrowserBase that Perplexity confirms using “occasionally for highly specialised tasks”. For context, BrowserBase provides a cloud-based browser infrastructure that rents headless Chrome, a browser without a traditional user interface, to developers and AI agents.

Are AI Agents similar to Google’s crawling bots?

It has been over 30 years since REP was introduced as an internet standard to manage automated robots (simply referred to as bots). There are many types of bots, but web crawlers are among the most common on the internet. People also call these web crawlers “spiders”, because they crawl numerous websites and use a structured approach to extract data and metadata. Search engines like Google use these web crawling bots to index websites on their search results pages.
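To see what the protocol looks like in practice, here is a minimal sketch of how a well-behaved crawler consults robots.txt before fetching a page, using Python’s standard-library parser. The sample rules and URLs below are purely illustrative assumptions, not any real site’s policy.

```python
# Sketch: a compliant crawler checks robots.txt permissions per user agent
# before each fetch. The rules below are illustrative, not a real site's.
from urllib.robotparser import RobotFileParser

SAMPLE_ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# The first rule group bars the named bot from the whole site;
# the wildcard group bars everyone else only from /private/.
print(parser.can_fetch("PerplexityBot", "https://example.com/article"))   # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/article"))    # True
print(parser.can_fetch("SomeOtherBot", "https://example.com/private/x"))  # False
```

The protocol is voluntary: nothing in HTTP enforces these rules, which is why the dispute turns on whether a crawler chooses to honour them.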

On the question of violating the rules set out under REP (robots.txt), Perplexity drew parallels with Google’s preview features. It was referring to Google’s instant previews feature, which was reportedly discontinued due to low user adoption, and the snippets feature that gives a quick visual preview of web results.

However, Perplexity claims that no one has documented conclusive evidence of Google violating the robots.txt protocol to feature content in its instant previews or snippets. Interestingly, Cloudflare itself had earlier said that not every bot follows the rules in a robots.txt file.

How are AI agents different from traditional bots?

Elsewhere, Perplexity says that an AI assistant works like a human assistant, fetching information from the internet when it is not present in its database. It claims that query answering on AI agentic platforms happens in real time. The company also claims that it does not store the data in massive databases for future use or for training AI models.

Pointing at the fundamental difference between a user agent and a bot, it said: “User-driven agents only act when users make specific requests, and they only fetch the content needed to fulfil those requests.”


Perplexity says that it has two user agents acting as web crawlers that gather and index information available on the internet.

These are:

  • PerplexityBot: Designed to surface and link websites in Perplexity’s search results. The company also claims that it did not design this system to crawl content for training AI foundation models.
  • Perplexity‑User: Designed to support user actions within Perplexity. When users ask Perplexity a question, it might visit a web page to help provide an accurate answer and include a link to the page in its reply.
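The distinction Perplexity draws is that a user-driven agent fetches a single page, on demand, while identifying itself. The sketch below illustrates that idea with Python’s standard library; the user-agent string is an illustrative assumption, not Perplexity’s actual header value.

```python
# Sketch of a user-driven fetch: one request, for one page, made only when a
# user asks. The User-Agent string is illustrative, not Perplexity's real one.
from urllib.request import Request

def build_user_driven_request(url: str) -> Request:
    # Declaring an honest User-Agent lets site owners identify (and block)
    # the tool; Cloudflare's complaint concerns crawlers that obscure it.
    return Request(url, headers={"User-Agent": "Perplexity-User/1.0 (illustrative)"})

req = build_user_driven_request("https://example.com/article")
print(req.get_header("User-agent"))  # the identity the server would see
```

Whether a server treats such a request as “a user’s browser” or “a bot” comes down to exactly this header and the traffic pattern around it, which is the crux of the dispute.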

How Perplexity violates the robots.txt protocol

In its blog post, Perplexity says that Cloudflare is mischaracterising “user-driven AI assistants” as “malicious bots”. It also states that Cloudflare’s approach treats any “automated tool” serving users as suspect. However, Perplexity admits that it generally ignores robots.txt rules because it gathers the information at a user’s request.

Kiran Jonnalagadda, Co-Founder of HasGeek, explains how Perplexity is trying to sidestep the robots.txt protocol. He said that, ideally, nothing should happen on an AI agent unless the user makes it happen. “The AI agents need not respect the protocols mentioned in the robots.txt file, because it is a browser performing a task based on the user’s query or given task,” he explained.

By this reasoning, Perplexity may be within its rights to ignore robots.txt rules. The complexity arises, however, when the AI chatbot gives answers based on an already indexed database.

Jonnalagadda affirmed that Cloudflare wants to block unethical indexing through unauthorised means, especially violations of robots.txt protocols and Web Application Firewall (WAF) rules. A WAF protects web applications by filtering and monitoring HTTP (Hypertext Transfer Protocol) traffic.
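At its simplest, one kind of WAF rule filters HTTP requests by their declared User-Agent header. The sketch below is a toy illustration of that single mechanism; real WAFs, including Cloudflare’s, use far richer signals (IP reputation, behavioural fingerprints, TLS characteristics), and the blocklist here is an invented assumption.

```python
# Toy sketch of one WAF-style rule: filter HTTP requests by User-Agent.
# Real WAFs combine many signals; this substring blocklist is illustrative.
BLOCKED_AGENT_SUBSTRINGS = ["PerplexityBot", "Perplexity-User"]

def allow_request(headers: dict) -> bool:
    """Return True if the request passes this single user-agent rule."""
    ua = headers.get("User-Agent", "")
    return not any(bot in ua for bot in BLOCKED_AGENT_SUBSTRINGS)

print(allow_request({"User-Agent": "Mozilla/5.0"}))        # True
print(allow_request({"User-Agent": "PerplexityBot/1.0"}))  # False
print(allow_request({}))  # a request with no User-Agent passes this naive rule
```

The last case shows why header-only rules are weak: a crawler that omits or rotates its User-Agent slips through, which is precisely the “obfuscated crawling” behaviour Cloudflare alleges.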

Should the open web doctrine apply to AI agents?

The definition of the open web has two sides: the technical aspect concerns open-source code and open standards, while the democratic aspect concerns free expression and digital inclusion. “But there’s a single underlying principle connecting all these ideas: An open web is a web by and for all its users, not select gatekeepers or governments,” says Mark Surman, Executive Director of the Mozilla Foundation.

Citing the open web doctrine, Perplexity brings in the argument of freedom to access the information on the internet. For example, it says that if an AI assistant is blocked, then its users might lose access to valuable information. “This undermines user choice and threatens the open web’s accessibility for innovative services competing with established giants,” Perplexity argues.

The argument that AI agents are also part of the open web raises serious questions, such as:

  • Should the user-driven, AI-powered bots and crawlers that AI companies use to scrape or index content on the internet be treated as extensions of real human users, rather than as traditional bots?
  • If we exclude AI agents from traditional bot frameworks, do AI companies not gain an undue advantage in accessing (and possibly indexing) virtually every publicly accessible website and its content?
  • And what happens when both PerplexityBot and Perplexity-User interact on Perplexity to generate answers and perform tasks, regardless of whether they operate within the agentic framework? In that case, who gains the undue advantage, and who faces exploitation?
