theinfo.org

for people with large data sets

getting theinfo: help

Google Simulator: Write a script that appropriately simulates Google's user agents and headers, to fool sites that only provide their content to Google.
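
As a starting point, here is a minimal sketch in Python that builds a request carrying the user-agent string Googlebot publicly advertises. The function name googlebot_request is made up for illustration, and only the User-Agent header is set; whatever other headers Google sends would still need to be added. Sites that verify crawlers by reverse DNS lookup will probably not be fooled by this alone.

```python
import urllib.request

# The user-agent string Googlebot publicly identifies itself with.
GOOGLEBOT_UA = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; "
    "+http://www.google.com/bot.html)"
)


def googlebot_request(url):
    """Build (but do not send) a request that identifies as Googlebot.

    Hypothetical helper: only the User-Agent header is set here.
    """
    return urllib.request.Request(url, headers={"User-Agent": GOOGLEBOT_UA})
```

To actually fetch a page, pass the result to urllib.request.urlopen().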

WARC plugin for wget: Add an option to wget that will cause it to output WARC files instead of lots of little files.

Sitemap plugin for wget: Add an option to wget to read sitemap.xml files.
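
Until such an option exists, a rough workaround is to pull the URLs out of the sitemap yourself and hand them to wget with its -i (--input-file) option. The sketch below assumes a plain urlset sitemap in the standard sitemaps.org namespace; the function name sitemap_urls and the sample document are made up for illustration, and sitemap index files would need a second pass.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org 0.9 schema.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


# Illustrative sample, not taken from a real site.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/about</loc></url>
</urlset>"""

if __name__ == "__main__":
    # Print one URL per line, suitable for saving and passing to `wget -i`.
    print("\n".join(sitemap_urls(SAMPLE)))
```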

Scrape worldcat.org: 50M pages about books. Someone ought to crawl that.

AOL IP user: Write a script that will send packets through AOL's proxies, causing them to come from their IP bank.

Crawl harness: Runs crawls regularly, makes sure they work, deals with temporary errors gracefully, and notifies the owner if the crawl actually dies.
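
One possible shape for the core of such a harness, sketched in Python under some assumptions: the crawl is a callable, temporary failures are signalled with a made-up TemporaryError exception, and notify is whatever function reaches the owner (email, IRC, and so on). Running it regularly is left to something like cron.

```python
import time


class TemporaryError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 503s)."""


def run_crawl(crawl, notify, attempts=3, delay=60.0, sleep=time.sleep):
    """Run crawl(), retrying temporary errors; notify the owner if it dies.

    Returns True if the crawl finished, False otherwise.
    """
    for attempt in range(1, attempts + 1):
        try:
            crawl()
            return True
        except TemporaryError:
            # Back off a little longer after each temporary failure.
            if attempt < attempts:
                sleep(delay * attempt)
        except Exception as exc:
            # Anything else is treated as fatal: report it and stop.
            notify("crawl died: %r" % (exc,))
            return False
    notify("crawl gave up after %d temporary errors" % attempts)
    return False
```

The sleep parameter is there so the retry logic can be exercised without actually waiting.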

bulk.resource.org: Lotsa good stuff here.

infochimps.org: A free public repository for rich data sets. When you get good stuff, post it on infochimps.org for everyone else.

last modified September 30, 2012