Is Meta Scraping the Fediverse for AI?
Almost definitely, but what can we do?

A new report from Dropsite News claims that Meta has allegedly been scraping a large number of independent sites for content to train its AI. What’s worse is that this scraping operation appears to completely disregard robots.txt, a control list used to tell crawlers, search engines, and bots which parts of a site should be accessed and which parts should be avoided. It’s worth mentioning that the efficacy of such a list depends on the consuming software honoring it, and not every piece of software does.
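For illustration, a minimal robots.txt policy aimed at AI crawlers might look like the sketch below. The user-agent strings are examples of publicly documented AI crawlers (Meta’s meta-externalagent and OpenAI’s GPTBot) and should be checked against each vendor’s current documentation before relying on them.

```
# robots.txt — illustrative policy for discouraging AI training crawlers.
# User-agent names are examples; confirm them against current documentation.
User-agent: meta-externalagent
Disallow: /

User-agent: GPTBot
Disallow: /

# All other well-behaved crawlers may access public pages.
User-agent: *
Allow: /
```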
Meta Denies All Wrongdoing
Andy Stone, a communications representative for Meta, has gone on record claiming that the list is bogus and the story is incorrect. Unfortunately, Dropsite’s story hasn’t spread very far, and there haven’t been any other public statements about the list at this time. This makes it difficult to adequately critique the initial reporting, but the possibility is nevertheless a wake-up call.
However, it’s worth acknowledging Meta’s ongoing efforts to scrape data from many different sources. This includes user data, vast amounts of published books, and independent websites not part of Meta’s sprawling online infrastructure. Given that the Fediverse is very much a public network, it’s not surprising to see instances getting caught in Meta’s net.
Purportedly Affected Instances
The FediPact account has dug into the leaked PDF, and a considerable number of Fediverse instances appear on the list. The document itself is 1,659 pages of URLs, so we filtered it down to a number of matches based on platform keywords. Please keep in mind that these only account for sites that use a platform’s name in the domain:
- Mastodon: 46 matches
- Lemmy: 6 matches
- PeerTube: 46 matches
There are likely considerably more unique domain matches in the list for a variety of platforms. Admins are advised to review whether their own instances are documented there. Even if your instance’s domain isn’t on the list, consider whether your instance is federating with something on the list. Due to the way federation works, cached copies of posts from other parts of the network can still show up on an instance that’s been crawled.
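If you’d like to check the document yourself, here is a minimal sketch, assuming you’ve already converted the leaked PDF to plain text (for example with a tool like pdftotext) and saved it as leaked_urls.txt. The filename and keyword list are placeholders, not something taken from the original report.

```python
"""Search a plain-text export of the leaked URL list for keywords.

Assumes the PDF has been converted to text (e.g. with pdftotext) and
saved as leaked_urls.txt -- both names are placeholders.
"""
import sys

# Illustrative platform keywords; add or remove terms as needed.
KEYWORDS = ["mastodon", "lemmy", "peertube"]


def find_matches(path: str, terms: list[str]) -> dict[str, list[str]]:
    """Return every line of the file containing each term (case-insensitive)."""
    matches: dict[str, list[str]] = {term: [] for term in terms}
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            lowered = line.lower()
            for term in terms:
                if term in lowered:
                    matches[term].append(line.strip())
    return matches


if __name__ == "__main__":
    # Optionally pass your own instance domain(s) as extra search terms.
    terms = KEYWORDS + [arg.lower() for arg in sys.argv[1:]]
    for term, lines in find_matches("leaked_urls.txt", terms).items():
        print(f"{term}: {len(lines)} matching line(s)")
        for entry in lines[:5]:  # small preview per term
            print("   ", entry)
```

Running it with your own domain as an argument (for example, python check_list.py example.social) prints a count and a short preview of any matching lines for each term.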
Access the Leaked List
We are mirroring this document for posterity, in case the original article is taken offline.
Protective Measures to Take
Regardless of the accuracy of the Dropsite News article, there’s an open question as to what admins can do to protect their instances from being scraped. Due to the nature of the situation, there is likely no singular silver bullet to solve these problems, but there are a few different measures that admins can take:
- Establish Community Terms of Service – Write a Terms of Service for your instance that explicitly calls out scraping for the purposes of data collection and LLM training. While it may have little to no effect on Meta’s own scraping efforts, it at least establishes precedent and a paper trail for your own server community’s expectations and consent.
- Request Data Removal – Meta has a form buried within the Facebook Privacy Center that could be used to submit a formal complaint regarding instance data and posts being part of their AI training data. Whether or not Meta does anything is a matter of debate, but it’s nevertheless an option.
- (EU-Only) Send a GDPR Form – Similar to the step above, but aim to get the request in front of Meta’s GDPR representatives, who have to deal with compliance.
- Establish Blocking Measures Anyway – Even though private companies can choose to disregard things like robots.txt and HTTP headers such as X-Robots-Tag: noindex, you can still reduce your site’s exposure to the AI agents that do honor those signals (see the first sketch after this list).
- Set Up a Firewall – One popular software package that’s seeing a lot of recent adoption for blocking AI traffic is Anubis, which has configurable policies that you can adjust as needed to handle different kinds of traffic.
- Use Zip Bombs – When all else fails, take matters into your own hands. On the server side, use an Nginx or Apache configuration to detect User-Agent strings associated with AI crawlers and serve them ever-expanding compressed archives to slow them down (see the second sketch after this list).
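Here is a minimal Nginx sketch of the blocking measures above, assuming your instance sits behind Nginx. It advertises X-Robots-Tag: noindex to everyone and returns 403 to requests whose User-Agent matches a short, illustrative list of documented AI crawlers; verify current crawler names and adapt the directives to your existing configuration.

```
# Illustrative only — merge into your existing Nginx configuration.
# The map block belongs in the http context; crawler names are examples
# and should be checked against each vendor's current documentation.
map $http_user_agent $is_ai_crawler {
    default                0;
    ~*meta-externalagent   1;
    ~*GPTBot               1;
    ~*CCBot                1;
}

server {
    # ... your existing listen / server_name / location directives ...

    # Advisory signal for crawlers that honor it.
    add_header X-Robots-Tag "noindex, nofollow" always;

    # Hard refusal for anything matched above.
    if ($is_ai_crawler) {
        return 403;
    }
}
```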
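And a sketch of the zip bomb approach, again for Nginx and reusing the $is_ai_crawler map from the previous example. Instead of a 403, matched crawlers receive a pre-generated gzip file (here a hypothetical /var/www/traps/bomb.gz) served with a Content-Encoding: gzip header; such a file can be produced by compressing a very large stream of zeros, so it stays small on disk but balloons when the client tries to decompress it.

```
# Alternative to the 403 above: hand matched crawlers a decompression trap.
# /var/www/traps/bomb.gz is a placeholder path to a pre-generated gzip file.
server {
    # ... your existing directives ...

    if ($is_ai_crawler) {
        rewrite ^ /trap last;
    }

    location = /trap {
        internal;                          # only reachable via the rewrite above
        gzip off;                          # don't re-compress the payload
        default_type text/html;
        add_header Content-Encoding gzip always;
        alias /var/www/traps/bomb.gz;      # served as-is; the client inflates it
    }
}
```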
In reality, fighting AI scraping is still a relatively new problem, complicated by a lack of clear regulation and by companies deciding to do whatever they want. The best we can do for our communities is to adopt protective measures and stay informed of new developments in the space.