theinfo.org

for people with large data sets


getting theinfo: tips and tricks


WARC: A developing standard for web archiving. WARC provides a convenient, future-proof way to store Web pages and their associated metadata while preserving as much as possible. There is currently no reference implementation, nor any other implementation, though at least one is forthcoming.

ARC: A standard for web archiving, used by Heritrix. An ARC file is a simple stream of web pages stored in a single document with minimal metadata. It has been superseded by WARC.
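To make "a simple stream of web pages with minimal metadata" concrete, here is a rough sketch loosely modeled on ARC's one-line record header (URL, IP address, archive date, content type, length) followed by the raw content. It is a simplified illustration, not a conforming ARC reader or writer: it omits the version block at the start of a real ARC file, and the function names `write_record` and `read_records` are made up for this example.

```python
import io


def write_record(stream, url, ip, date, content_type, body):
    """Append one record: a space-separated header line, the raw body, a newline."""
    # Assumption: none of the header fields contain spaces.
    header = f"{url} {ip} {date} {content_type} {len(body)}\n"
    stream.write(header.encode("ascii"))
    stream.write(body)
    stream.write(b"\n")  # separator between records


def read_records(stream):
    """Yield records one at a time by reading a header, then exactly `length` bytes."""
    while True:
        line = stream.readline()
        if not line:          # end of stream
            return
        if line == b"\n":     # separator left over after the previous body
            continue
        fields = line.decode("ascii").rstrip("\n").split(" ")
        url, ip, date, content_type, length = fields
        body = stream.read(int(length))
        yield {
            "url": url,
            "ip": ip,
            "date": date,
            "content_type": content_type,
            "body": body,
        }


# Round trip through an in-memory buffer instead of a file on disk.
buf = io.BytesIO()
write_record(buf, "http://example.com/", "192.0.2.1", "20080101000000",
             "text/html", b"<html>\nhello\n</html>")
write_record(buf, "http://example.com/a.txt", "192.0.2.1", "20080101000005",
             "text/plain", b"plain text")
buf.seek(0)
records = list(read_records(buf))
```

Because each header carries the body length, a reader can skip from record to record without parsing the page content, which is what lets the format stay so minimal.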

Heritrix: An open-source, extensible web crawler developed by the Internet Archive.

Get more IPs: Servers will often cut you off if you hit them too hard from the same IP. Luckily, you can get more IPs by routing requests through Tor and anonymous proxies.
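One way to spread requests over several IPs is to rotate through a pool of proxies round-robin. The sketch below uses only the Python standard library. The proxy addresses are placeholders from a documentation-only range, and the `ProxyRotator` class is a hypothetical helper, not part of any library. Note that `urllib` speaks HTTP proxies only, so using Tor's SOCKS interface would need an extra component that is not shown here.

```python
import itertools
import urllib.request

# Hypothetical proxy addresses, for illustration only.
PROXIES = [
    "http://192.0.2.10:8080",
    "http://192.0.2.11:8080",
    "http://192.0.2.12:8080",
]


class ProxyRotator:
    """Hand out proxies round-robin so consecutive requests use different IPs."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("need at least one proxy")
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self):
        """Return the next proxy URL in the rotation."""
        return next(self._cycle)

    def opener(self):
        """Build a urllib opener that sends HTTP and HTTPS traffic via the next proxy."""
        proxy = self.next_proxy()
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        return urllib.request.build_opener(handler)


rotator = ProxyRotator(PROXIES)
picked = [rotator.next_proxy() for _ in range(4)]  # wraps around after the third
```

A fetch would then look like `rotator.opener().open(url, timeout=30)`, with each call leaving through the next proxy in the list.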


AOL's proxies now send the X-Forwarded-For header, which passes the original client IP along to the server, so routing requests through them no longer gets you a fresh IP.
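To see why X-Forwarded-For defeats this trick, here is a minimal sketch of what a server can do with it. By convention the header holds a comma-separated list with the original client first and each forwarding proxy after it. The function name `original_client_ip` and the plain-dict headers are assumptions made for this example; a real server framework will expose headers in its own way.

```python
def original_client_ip(headers, peer_ip):
    """Return the leftmost X-Forwarded-For entry, or the connecting IP if absent.

    `headers` is assumed to be a plain dict of header name to value;
    `peer_ip` is the address the TCP connection actually came from.
    """
    forwarded = ""
    for name, value in headers.items():
        if name.lower() == "x-forwarded-for":  # header names are case-insensitive
            forwarded = value
            break
    hops = [hop.strip() for hop in forwarded.split(",") if hop.strip()]
    return hops[0] if hops else peer_ip


# A request relayed by a proxy at 198.51.100.2 on behalf of 203.0.113.7.
seen = original_client_ip(
    {"X-Forwarded-For": "203.0.113.7, 198.51.100.2"},
    peer_ip="198.51.100.2",
)
```

A server that rate-limits on this value instead of the connecting address sees the same client no matter which proxy relayed the request.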


last modified 1 day ago