theinfo.org

for people with large data sets

getting theinfo: tips and tricks

WARC: A developing standard for web archiving. WARC provides a convenient, future-proof way to store Web pages and their associated metadata while preserving as much of the original as possible. There is currently no reference implementation or any other implementation, though at least one is forthcoming.

ARC: A standard for web archiving, used by Heritrix. An ARC file is a simple stream of web pages stored in a single document with minimal metadata. It has been superseded by WARC.
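
To make the "simple stream of pages in one file" idea concrete, here is a minimal sketch of an ARC-style layout in Python. It is a simplification for illustration, not the exact ARC specification: the five header fields (URL, IP, date, content type, length), their order, and the helper names write_record and read_records are assumptions, and it assumes no field contains a space.

```python
import io

def write_record(stream, url, ip, date, content_type, body):
    """Append one record: a one-line header, the body bytes, then a newline."""
    # Assumed header layout (illustrative): five space-separated fields.
    header = f"{url} {ip} {date} {content_type} {len(body)}\n"
    stream.write(header.encode("utf-8"))
    stream.write(body)
    stream.write(b"\n")

def read_records(stream):
    """Yield (metadata dict, body bytes) for each record in the stream."""
    while True:
        line = stream.readline()
        if not line:
            break  # end of file
        if not line.strip():
            continue  # blank separator line between records
        url, ip, date, content_type, length = line.decode("utf-8").split()
        # The length field tells us exactly how many body bytes follow.
        body = stream.read(int(length))
        yield {"url": url, "ip": ip, "date": date, "type": content_type}, body

# Usage: two pages stored back to back in one in-memory "file".
archive = io.BytesIO()
write_record(archive, "http://example.com/", "192.0.2.1",
             "20080506120000", "text/html", b"<html>hello</html>")
write_record(archive, "http://example.com/a", "192.0.2.1",
             "20080506120005", "text/plain", b"line one\nline two\n")
archive.seek(0)
for meta, body in read_records(archive):
    print(meta["url"], meta["type"], len(body))
```

Because each header carries the body length, a reader can skip from record to record without parsing the page contents, which is what keeps this kind of format simple.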

Heritrix: An open-source, extensible web crawler developed by the Internet Archive.

Get more IPs: Servers will often cut you off if you hit them too hard from the same IP address. Luckily, you can get more IPs by routing requests through Tor and anonymous proxies.
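
One way to spread requests across several IPs is to rotate through a list of HTTP proxies. The sketch below uses only Python's standard urllib; the proxy addresses are placeholders from a documentation-only IP range, and the helper names (assign_proxies, make_opener, fetch_all) and the one-second delay are assumptions for illustration. Tor is not covered here, since it exposes a SOCKS proxy and urllib does not speak SOCKS on its own.

```python
import itertools
import time
import urllib.request

# Placeholder proxy addresses (documentation-only range); substitute real ones.
PROXIES = [
    "http://192.0.2.10:8080",
    "http://192.0.2.11:8080",
    "http://192.0.2.12:8080",
]

def assign_proxies(urls, proxies):
    """Pair each URL with a proxy, cycling through the proxy list round-robin."""
    return list(zip(urls, itertools.cycle(proxies)))

def make_opener(proxy_url):
    """Build a urllib opener that sends HTTP and HTTPS requests via one proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)

def fetch_all(urls, proxies=PROXIES, delay=1.0, timeout=30):
    """Fetch each URL through its assigned proxy; map URL to bytes or the error."""
    results = {}
    for url, proxy in assign_proxies(urls, proxies):
        try:
            with make_opener(proxy).open(url, timeout=timeout) as response:
                results[url] = response.read()
        except OSError as err:  # URLError and socket timeouts are OSError subclasses
            results[url] = err
        time.sleep(delay)  # keep a pause between requests even across IPs
    return results
```

Calling fetch_all(["http://example.com/a", "http://example.com/b"]) would send the first request through the first proxy and the second through the next one. Failures are stored rather than raised so that one dead proxy does not abort the whole run.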

Note: AOL's proxies now add the X-Forwarded-For header, which passes your original IP address along to the server, so routing requests through them no longer helps.
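
To see why X-Forwarded-For defeats the purpose of a proxy, here is a rough sketch of how a server might decide which address to attribute a request to. The function name apparent_client_ip and the plain-dict representation of headers are assumptions for illustration; real servers differ in whether and how they trust this header.

```python
def apparent_client_ip(peer_ip, headers):
    """Return the address a server would likely attribute the request to."""
    # HTTP header names are case-insensitive, so normalize the keys first.
    lowered = {name.lower(): value for name, value in headers.items()}
    forwarded = lowered.get("x-forwarded-for", "")
    # The header is a comma-separated list; by convention the first entry
    # is the original client and later entries are intermediate proxies.
    first = forwarded.split(",")[0].strip()
    # Without the header, the server only sees the connecting peer (the proxy).
    return first or peer_ip

# A proxy at 198.51.100.7 forwarding a request from 203.0.113.5:
print(apparent_client_ip("198.51.100.7", {"X-Forwarded-For": "203.0.113.5"}))
# A proxy that does not add the header hides the original address:
print(apparent_client_ip("198.51.100.7", {}))
```

A server that rate-limits on the value returned here would still count every request against your real address, however many such proxies you rotate through.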

Shoeleather: Nothing compares to old-fashioned reporting-style shoeleather. Start calling people on the phone. Start with the receptionist, ask who's responsible for a certain thing, get their number, and just keep pushing until you find someone who can make things happen.

last modified May 6