Is Meta Scraping the Fediverse for AI?

Almost definitely, but what can we do?

A new report from Dropsite News alleges that Meta is scraping a large number of independent sites for content to train its AI. What’s worse is that this scraping operation appears to completely disregard robots.txt, a control list used to tell crawlers, search engines, and bots which parts of a site may be accessed and which should be avoided. It’s worth mentioning that the efficacy of such a list depends on the consuming software honoring it, and not every piece of software does.
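
For reference, a minimal robots.txt asking known AI crawlers to stay away might look like the sketch below. The user-agent strings are examples drawn from vendors’ public crawler documentation and change over time, so verify them before relying on this; as noted above, compliance is entirely voluntary on the crawler’s side.

    # Ask specific AI crawlers to stay out entirely (advisory only).
    # User-agent names are examples; check each vendor's current docs.
    User-agent: GPTBot
    Disallow: /

    User-agent: Meta-ExternalAgent
    Disallow: /

    User-agent: FacebookBot
    Disallow: /

    # Everyone else may crawl the site normally
    User-agent: *
    Allow: /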

Meta Denies All Wrongdoing

Andy Stone, a communications representative for Meta, has gone on record claiming that the list is bogus and the story is incorrect. Unfortunately, the spread of Dropsite’s story has been relatively small, and there haven’t been any other public statements about the list at this time. This makes it difficult to adequately assess the initial reporting, but the situation is nevertheless a wake-up call.

However, it’s worth acknowledging Meta’s ongoing efforts to scrape data from many different sources. This includes user data, vast amounts of published books, and independent websites not part of Meta’s sprawling online infrastructure. Given that the Fediverse is very much a public network, it’s not surprising to see instances getting caught in Meta’s net.

Purportedly Affected Instances

The FediPact account has dug into the leaked PDF, and a considerable number of Fediverse instances appear on the list. The document itself is 1,659 pages of URLs, so we filtered for matches based on platform keywords; a rough sketch of that kind of filter follows the list below. Please keep in mind that these counts only account for sites that use a platform’s name in the domain:

  • Mastodon: 46 matches
  • Lemmy: 6 matches
  • PeerTube: 46 matches
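
We don’t know exactly how the search was performed, but a keyword filter over the extracted URLs could look something like this sketch. The file name and keyword list are placeholders, and it assumes the PDF’s URLs have already been dumped to a plain text file, one URL per line:

    # Rough sketch: count unique domains from the leaked list that contain
    # a platform's name. Assumes urls.txt holds one URL per line.
    from collections import Counter
    from urllib.parse import urlparse

    KEYWORDS = ["mastodon", "lemmy", "peertube"]  # platform names to look for

    counts = Counter()
    seen_domains = set()

    with open("urls.txt", encoding="utf-8") as handle:
        for line in handle:
            url = line.strip()
            if not url:
                continue
            # urlparse only fills in the domain when a scheme is present
            if "//" not in url:
                url = "https://" + url
            domain = urlparse(url).netloc.lower()
            if not domain or domain in seen_domains:
                continue
            seen_domains.add(domain)
            for keyword in KEYWORDS:
                if keyword in domain:
                    counts[keyword] += 1

    for keyword in KEYWORDS:
        print(f"{keyword}: {counts[keyword]} matching domains")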

There are likely considerably more unique domain matches in the list across a variety of platforms. Admins are advised to review whether their own instances are documented there. Even if your instance’s domain isn’t on the list, consider whether your instance federates with anything that is. Due to the way federation works, cached copies of posts from other parts of the network can still show up on an instance that’s been crawled.

Access the Leaked List

We are mirroring this document for posterity, in case the original article is taken offline.

Protective Measures to Take

Regardless of the accuracy of the Dropsite News article, there’s an open question as to what admins can do to protect their instances from being scraped. There is likely no single silver bullet, but there are a few different measures that admins can take:

  • Establish Community Terms of Service – Write a Terms of Service for your instance that explicitly calls out scraping for the purposes of data collection and LLM training. While it may have little to no effect on Meta’s own scraping efforts, it at least establishes a precedent and a paper trail for your own server community’s expectations and consent.
  • Request Data Removal – Meta has a form buried within the Facebook Privacy Center that can be used to submit a formal complaint regarding instance data and posts being part of its AI training data. Whether Meta does anything with these requests is debatable, but it’s nevertheless an option.
  • (EU-Only) Send a GDPR Form – Similar to the above step, but aims to get the request in front of Meta’s GDPR representatives, who have to deal with compliance.
  • Establish Blocking Measures Anyway – Even though companies can choose to disregard robots.txt and HTTP headers such as X-Robots-Tag: noindex, you can still reduce your site’s exposure to the AI agents that do honor those signals; a minimal server configuration sketch follows this list.
  • Set Up a Firewall – One popular software package that has seen a lot of recent adoption for blocking AI traffic is Anubis, which has configurable policies that you can adjust as needed to handle different kinds of traffic.
  • Use Zip Bombs – When all else fails, take matters into your own hands. On the server side, use an Nginx or Apache configuration to detect specific User Agents associated with AI crawlers, and serve them compressed archives that expand to enormous sizes when decompressed; a rough sketch of this appears after the list as well.
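
On the blocking side, the X-Robots-Tag header can be added at the web server level. Here’s a minimal Nginx sketch, assuming Nginx sits in front of the instance; the domain is a placeholder, and (as with robots.txt) the header is purely advisory:

    # Minimal sketch: advertise a no-indexing preference on every response.
    # Crawlers decide for themselves whether to honor it.
    server {
        listen 80;
        server_name example.social;   # placeholder domain

        add_header X-Robots-Tag "noindex, noarchive" always;

        # ... rest of your existing instance configuration ...
    }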
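
For the zip bomb approach, a rough Nginx sketch might look like the following. The user-agent patterns, domain, and file path are placeholders, and the decoy archive has to be generated ahead of time, for example by gzip-compressing a very large stream of zeroes:

    # Rough sketch: serve a pre-made gzip "bomb" to matching AI user agents.
    # Generate the archive first, e.g.:
    #   dd if=/dev/zero bs=1M count=10240 | gzip -9 > /var/www/bomb.gz
    map $http_user_agent $ai_scraper {
        default                 0;
        ~*meta-externalagent    1;
        ~*facebookbot           1;
        ~*gptbot                1;
    }

    server {
        listen 80;
        server_name example.social;   # placeholder domain

        location / {
            # Internally redirect suspected AI crawlers to the decoy archive.
            if ($ai_scraper) {
                rewrite ^ /bomb.gz last;
            }
            # ... normal instance configuration ...
        }

        location = /bomb.gz {
            internal;                 # only reachable via the rewrite above
            root /var/www;
            types { }
            default_type text/html;
            add_header Content-Encoding gzip;   # client decompresses the bomb
        }
    }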

In reality, fighting AI scraping is still a relatively new problem, complicated by a lack of clear regulation and by companies deciding to do whatever they want. The best we can do for our communities is to adopt protective measures and stay informed of new developments in the space.

Sean Tilley

Sean Tilley has been a part of the federated social web for over 15 years, starting with his experiences with Identi.ca back in 2008. Sean was involved with the Diaspora project as a Community Manager from 2011 to 2013, and helped the project move to a self-governed model. Since then, he has continued to study, discuss, and document the evolution of the space and the new platforms that have risen within it.
