Outlets like The Guardian and The New York Times are scrutinizing digital archives as potential backdoors for AI crawlers.

Hacker News

Major news publishers, including The Guardian, The New York Times, and Gannett-owned outlets, are blocking or limiting the Internet Archive's crawlers over concerns that AI companies could scrape their archived content for training data. The Guardian has excluded article pages from the Wayback Machine's APIs while maintaining access to homepages, and The New York Times is hard-blocking Internet Archive bots entirely. An analysis of 1,167 news sites found that 241 (about 21%) explicitly disallow at least one Internet Archive crawler; 87% of those 241 are Gannett properties. The restrictions put preservation of the historical web record in tension with publishers' efforts to protect their intellectual property from unauthorized use in AI training.
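
For readers who want to see what "explicitly disallow" could look like in practice, here is a minimal sketch that checks a robots.txt file for rules targeting Internet Archive crawlers. It assumes the blocks are expressed in robots.txt, which the summary does not state. The user-agent tokens `archive.org_bot` and `ia_archiver`, the `blocked_agents` helper, and the sample file are also assumptions made for illustration; this is not the method used in the analysis.

```python
from urllib.robotparser import RobotFileParser

# Assumed user-agent tokens for Internet Archive crawlers; the summary
# above does not name the crawlers that were checked.
IA_AGENTS = ("archive.org_bot", "ia_archiver")


def blocked_agents(robots_txt: str, url: str = "https://example.com/news/story") -> list[str]:
    """Return the assumed Internet Archive agents that may not fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return [agent for agent in IA_AGENTS if not parser.can_fetch(agent, url)]


# Hypothetical robots.txt that blocks one archive crawler and allows everyone else.
SAMPLE = """\
User-agent: archive.org_bot
Disallow: /

User-agent: *
Allow: /
"""

print(blocked_agents(SAMPLE))
```

Run against the sample file, this prints `['archive.org_bot']`. A check like this only covers robots.txt rules, so it would not detect server-side hard blocks of the kind attributed to The New York Times.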

News publishers limit Internet Archive access due to AI scraping concerns