for people with large data sets
Google Simulator: Write a script that will appropriately simulate Google's user agents and headers to fool sites which only provide their content to Google.
WARC plugin for wget: Add an option to wget that will cause it to output WARC files instead of lots of little files.
Sitemap plugin for wget: Add an option to wget to read sitemap.xml files.
Scrape worldcat.org: 50M pages about books. Someone ought to crawl that.
AOL IP user: Write a script that will send packets through AOL's proxies, causing them to come from their IP bank.
Crawl harness: Runs crawls regularly, makes sure they work, deals with temporary errors gracefully, and notifies the owner if the crawl actually dies.
bulk.resource.org: Lotsa good stuff here.
infochimps: A free public repository for rich data sets. When you get good stuff, post it on infochimps.org for everyone else.
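For the crawl harness idea above, here is a rough Python sketch of the retry-and-notify part, not a finished tool. Everything named here is an assumption made for illustration: `TemporaryError`, `run_crawl`, and the `crawl` and `notify` callables are hypothetical, and the exponential backoff is just one reasonable way to deal with temporary errors gracefully. Running it regularly is left to something like cron.

```python
import time


class TemporaryError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. a timeout)."""


def run_crawl(crawl, notify, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Run crawl() once, retrying temporary errors with exponential backoff.

    Returns True on success. If the crawl actually dies, calls
    notify(message) once and returns False.
    """
    for attempt in range(max_retries + 1):
        try:
            crawl()
            return True
        except TemporaryError as exc:
            if attempt == max_retries:
                # Out of retries: treat the crawl as dead and tell the owner.
                notify(f"crawl died after {attempt + 1} attempts: {exc}")
                return False
            # Wait 1x, 2x, 4x, ... base_delay before trying again.
            sleep(base_delay * 2 ** attempt)
        except Exception as exc:
            # Anything else is assumed to be a real bug, so do not retry.
            notify(f"crawl died: {exc!r}")
            return False
    return False
```

In practice `crawl` might wrap a wget invocation and `notify` might send an email; both are left as plain callables so the sketch stays independent of any particular tool.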
last modified September 30, 2012