theinfo.org

for people with large data sets

getting theinfo: tools

wget: Old and venerable, wget still gets the job done 90% of the time. (True, but if you get a quarter of the way through a crawl and get cut off, you have to start from scratch, no?)
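
On the "start from scratch" worry: one common way around it is to checkpoint the crawl state to disk after every page, so an interrupted run can pick up where it stopped. The Python below is only a sketch of that idea, not anything wget does internally. The names crawl, fetch, extract_links, and state_path are hypothetical, and the caller is assumed to supply the two hook functions.

```python
import json
import os
from collections import deque


def crawl(start_urls, fetch, extract_links, state_path, max_pages=None):
    """Breadth-first crawl that checkpoints its state after every page.

    fetch(url) -> str and extract_links(url, body) -> iterable of URLs are
    caller-supplied hooks (hypothetical; they are not part of wget).
    """
    if os.path.exists(state_path):
        # Resume: reload the pending queue and the set of finished URLs.
        with open(state_path) as f:
            state = json.load(f)
        queue, done = deque(state["queue"]), set(state["done"])
    else:
        queue, done = deque(start_urls), set()

    fetched = 0
    while queue and (max_pages is None or fetched < max_pages):
        url = queue.popleft()
        if url in done:
            continue
        body = fetch(url)
        done.add(url)
        fetched += 1
        for link in extract_links(url, body):
            if link not in done and link not in queue:
                queue.append(link)
        # Write to a temp file, then rename, so a crash never leaves a
        # half-written checkpoint behind.
        tmp = state_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"queue": list(queue), "done": sorted(done)}, f)
        os.replace(tmp, state_path)
    return done
```

If the process dies in the middle of a fetch, the last checkpoint still lists that URL as pending, so the next run retries it rather than skipping it.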

Diffbot: Diffbot lets you subscribe to any public URL. Useful for the many data sources that do not provide RSS feeds.

w3mir: A great, simple recursive downloader (spider). The source may need a patch to work (there is one in the Debian bug tracker, IIRC), but once it does: Wow.

Beautiful Soup: Now with Python and Ruby versions, Beautiful Soup can help you breeze through otherwise convoluted HTML.

Feedity: A nifty RSS/XML generator for web pages that lack a syndication format (HTML to RSS/XML web feed). It also provides custom web data integration, data syndication, and content change tracking.

FOIA: When a government agency is being recalcitrant, try providing some extra legal encouragement by filing a Freedom of Information Act request. See the Reporters Committee guide to using FOIA.

archive.org: Infinite disk space, infinite storage, forever. How could you pass that up?

Crowbar: Crowbar lets you run JavaScript-based DOM scrapers from the command line using Mozilla XULRunner.

templatemaker: This clever piece of code learns the structure of a textual template from repeated examples, so that instead of hand-coding regexps for each new site you can write code that learns on its own.
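
To give a feel for learning a template from repeated examples, here is a rough Python sketch using the standard library's difflib. It is not templatemaker's actual algorithm or API. The names learn_template and extract, and the min_len cutoff, are assumptions made up for this illustration. Text shared by two sample pages is treated as the fixed template, and whatever differs is treated as a data field.

```python
import difflib
import re


def learn_template(a, b, min_len=3):
    """Return the literal chunks shared by both samples, in document order."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    # Very short matches are usually coincidences inside the data fields,
    # so anything under min_len characters is dropped.
    return [a[m.a:m.a + m.size]
            for m in matcher.get_matching_blocks() if m.size >= min_len]


def extract(chunks, text):
    """Pull out the variable fields that sit between the literal chunks."""
    pattern = "(.*?)".join(re.escape(c) for c in chunks)
    match = re.search(pattern, text, re.DOTALL)
    return list(match.groups()) if match else None


page_a = "<h1>Alice</h1><p>Age: 30</p>"
page_b = "<h1>Bob</h1><p>Age: 41</p>"
template = learn_template(page_a, page_b)
fields = extract(template, "<h1>Carol</h1><p>Age: 27</p>")
```

A known weakness of this shortcut: if two samples happen to share a long substring inside a field (say, "Alicia" and "Alice"), it is mistaken for template text. Learning from more than two samples helps.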

hpricot: A fast and delightful HTML parser for Ruby

Nutch: Nutch is an open source search engine based on Lucene Java that can be used to index internal or public web documents of interest.

Hadoop: Hadoop is a framework for distributed applications running on large clusters. It began as a Lucene subproject and is now a top-level Apache project.

UICrawler: UICrawler is an open source tool that lets you code your DB definitions and conversion programs and then create a Python package that you can share or reuse.

HtmlAgilityPack: HtmlAgilityPack is a very forgiving .NET/Mono parser library for HTML, which it presents as an IXPathNavigable implementation, permitting easy XPath queries and XSLT transformations.

YahooPipes: Pipes features a number of Source Modules for fetching data (e.g. the Fetch CSV, Feed, Data, and Page modules).

CyberNeko: An HTML to XML parser and tag balancer for Java.

zaokao: An online tool for retrieving FLV video from YouTube, vSocial, and Vimeo.

youtube-dl: A downloader for YouTube and other video hosting sites.

NCleaner-1.0: Software for automatic cleaning of Web pages (boilerplate removal)

Feed Bag: A Ruby script that can be used to periodically scan and archive RSS feeds to an SQL database.
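
The scan-and-archive idea can be sketched in a few lines of Python with only the standard library. This is not Feed Bag's code (which is Ruby). The function name archive_feed and the items table layout are assumptions for illustration, and it only handles plain RSS 2.0 item elements.

```python
import sqlite3
import xml.etree.ElementTree as ET


def archive_feed(conn, feed_xml):
    """Store every <item> of an RSS 2.0 document; return how many were new."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        "guid TEXT PRIMARY KEY, title TEXT, link TEXT, pub_date TEXT)"
    )
    before = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
    for item in ET.fromstring(feed_xml).iter("item"):
        link = item.findtext("link", default="")
        # Fall back to the link when the item carries no <guid>.
        guid = item.findtext("guid", default=link)
        row = (guid, item.findtext("title", default=""), link,
               item.findtext("pubDate", default=""))
        # INSERT OR IGNORE keeps repeated scans idempotent: items that are
        # already archived are skipped instead of duplicated.
        conn.execute("INSERT OR IGNORE INTO items VALUES (?, ?, ?, ?)", row)
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
    return after - before
```

Run it from cron (or any scheduler) against the freshly downloaded feed text. Because the guid is the primary key, re-scanning the same feed adds nothing.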

lynx -dump: A fast and cheap way to convert HTML to text.
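
Where lynx is not installed, a rough stand-in can be put together with Python's standard html.parser. This is only a sketch and not equivalent to lynx's output: there is no link list, no table layout, and no line wrapping. The class name TextDumper, the function html_to_text, and the choice of block-level tags are assumptions made for this example.

```python
from html.parser import HTMLParser


class TextDumper(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    BLOCK_TAGS = {"p", "div", "br", "li", "tr",
                  "h1", "h2", "h3", "h4", "h5", "h6"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")  # block elements start a new line

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    dumper = TextDumper()
    dumper.feed(html)
    dumper.close()
    # Collapse runs of whitespace and drop empty lines.
    lines = (" ".join(line.split())
             for line in "".join(dumper.parts).splitlines())
    return "\n".join(line for line in lines if line)
```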

pyPdf: A Python PDF library. Particularly useful is the extractText() method of the page objects returned by PdfFileReader.getPage().

Mod Security: ModSecurity is an open source web application firewall.

WikiTeam: Tools for wiki preservation and a repository of wikis.

last modified September 20, 2012