

There is a lot of apparently bad, old, or inaccurate information out in the world about how to block the Internet Archive, otherwise known as "The Wayback Machine," from scraping your site. This is the most accurate information we can find about it as of this writing. Spoiler alert: the Internet Archive did remove our site once we asked, but the robots.txt method did not work.

The ia_archiver user agent belongs to Alexa, not to the Internet Archive. How do we know? The screenshot below comes from this Alexa webpage. That means that if you use a robots.txt exclusion like this:

User-agent: ia_archiver

it will not disallow the Internet Archive (Wayback Machine) but will instead block Alexa from crawling your site.

Is there a connection between the Internet Archive and Alexa?

There is: "The Wayback Machine was created as a joint effort between Alexa Internet and the Internet Archive when a three-dimensional index was built to allow for the browsing of archived web content" and "Brewster Kahle founded the archive in May 1996 at around the same time that he began the for-profit web crawling company Alexa Internet."

The folks at the Internet Archive said that robots.txt files don't serve the purpose of an archive site. You can read their post about it here, but one of the important points they claim is: "Over time we have observed that the robots.txt files that are geared toward search engine crawlers do not necessarily serve our archival purposes." While they seem very keen on fulfilling their purposes, they seem to have overlooked the wishes of website owners who do not want their intellectual property scraped and displayed.

Why does everyone think ia_archiver is an Internet Archive bot?

Because it used to be. According to the now defunct exclusion page:

"The Internet Archive is not interested in offering access to web sites or other Internet documents whose authors do not want their materials in the collection. To remove your site from the Wayback Machine, place a robots.txt file at the top level of your site (e.g. [...]). It will remove documents from your domain from the Wayback Machine. It will tell us not to crawl your site in the future. To exclude the Internet Archive’s crawler (and remove documents from the Wayback Machine) while allowing all other robots to crawl your site, your robots.txt file should say: User-agent: ia_archiver"

Ironically, you can still see the defunct exclusion page on the Wayback Machine. It might be assumed that the people at the Internet Archive have changed their minds and that now the Internet Archive is interested in offering access to web sites or other Internet documents whose authors do not want their materials in the collection.
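To see what a rule aimed at ia_archiver does under standard robots.txt semantics, here is a minimal Python sketch using the standard library's urllib.robotparser. Some parts are assumptions for illustration: only the User-agent line appears in the text above, so the "Disallow: /" line is our guess at the rule that followed it, and the example.com URL, the "Googlebot" comparison agent, and the is_allowed helper are hypothetical. The sketch only shows which user agents the rule addresses. It says nothing about whether a given crawler actually honors the file.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule: the "Disallow: /" line is not quoted in the text above.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/some/page.html"  # hypothetical page
    for agent in ("ia_archiver", "Googlebot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, url))
```

Under these assumptions the rule shuts out only a crawler that identifies itself as ia_archiver and leaves every other user agent unaffected, which is why it matters who actually operates that agent.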
