theinfo.org

for people with large data sets

getting theinfo: help

Google Simulator: Write a script that appropriately simulates Google's user agents and headers, to fool sites that only provide their content to Google.
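
A minimal starting point, in Python, might look like the sketch below. The user-agent string is the one Googlebot publishes; the function name `googlebot_request` and the choice of extra headers are assumptions made for illustration. Sites that verify crawlers by reverse DNS lookup will not be fooled by headers alone.

```python
import urllib.request

# Googlebot's published user-agent string.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def googlebot_request(url):
    """Build (but do not send) a request carrying Googlebot-style headers.

    Hypothetical helper: the name and the extra headers are illustrative.
    """
    headers = {
        "User-Agent": GOOGLEBOT_UA,
        # Assumption: these approximate what the real crawler sends.
        "From": "googlebot(at)googlebot.com",
        "Accept": "text/html,application/xhtml+xml,*/*;q=0.8",
    }
    return urllib.request.Request(url, headers=headers)
```

To actually fetch a page, pass the result to `urllib.request.urlopen(...)` and read the response.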

WARC plugin for wget: Add an option to wget that will cause it to output WARC files instead of lots of little files.
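
For a sense of what the output would contain, here is a rough Python sketch of a single WARC/1.0 response record: a version line, named header fields, a blank line, the raw HTTP response, then two CRLFs. The function name `warc_record` is hypothetical, and a real wget option would also write request and metadata records, which this skips.

```python
import uuid
from datetime import datetime, timezone


def warc_record(target_uri, http_bytes):
    """Wrap one raw HTTP response (status line, headers, body) in a WARC record.

    Hypothetical helper for illustration; emits only a 'response' record.
    """
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http; msgtype=response"),
        # Content-Length counts the bytes of the record block only.
        ("Content-Length", str(len(http_bytes))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers)
    # Blank line ends the header; two CRLFs end the record.
    return head.encode("utf-8") + b"\r\n" + http_bytes + b"\r\n\r\n"
```

Appending such records to one file, one after another, is what replaces the lots of little files.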

Sitemap plugin for wget: Add an option to wget to read sitemap.xml files.
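
Until wget grows such an option, a small script can do the reading part. This Python sketch pulls every `<loc>` URL out of a sitemap using the sitemaps.org namespace; the function name `sitemap_urls` is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org 0.9 protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return every <loc> URL found in a sitemap document.

    Hypothetical helper. In a sitemap index file the <loc> entries
    point at further sitemaps rather than at pages.
    """
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]
```

Writing the result to a file, one URL per line, gives something wget can consume with `-i`.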

Scrape worldcat.org: 50M pages about books. Someone ought to crawl that.

AOL IP user: Write a script that will send packets through AOL's proxies, so that they appear to come from AOL's bank of IP addresses.

Crawl harness: Runs crawls regularly, makes sure they work, deals with temporary errors gracefully, and notifies the owner if the crawl actually dies.
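
The retry-and-notify core of such a harness could be sketched in Python as below. Everything named here (`TemporaryError`, `run_crawl`, the `notify` callback) is hypothetical, and the sketch assumes the crawl signals retryable failures by raising `TemporaryError`. Running it regularly is left to something like cron.

```python
import time


class TemporaryError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 503s)."""


def run_crawl(crawl, notify, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run crawl(), retrying temporary errors with exponential backoff.

    notify(message) is called only when the crawl actually dies:
    either retries are exhausted or a non-temporary error occurs.
    """
    for attempt in range(1, attempts + 1):
        try:
            return crawl()
        except TemporaryError as err:
            if attempt == attempts:
                notify(f"crawl died after {attempts} attempts: {err}")
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
        except Exception as err:
            notify(f"crawl died: {err}")
            raise
```

The `notify` callback could send mail or post to a chat channel; passing `sleep` in makes the backoff easy to test without waiting.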

bulk.resource.org: Lotsa good stuff here.

infochimps: A free public repository for rich data sets. When you get good stuff, post it on infochimps.org for everyone else.

last modified March 17