for people with large data sets
wget: old and venerable, wget still gets the job done 90% of the time. (If a crawl gets cut off a quarter of the way through, you don't have to start from scratch: -c/--continue resumes partially downloaded files, and -nc/--no-clobber skips files already on disk.)
Diffbot: Diffbot lets you subscribe to any public URL. Useful for the many data sources that do not provide RSS feeds.
w3mir: A great, simple recursive downloader (spider). The source may need a patch to work (the patch is in the Debian bug tracker, IIRC), but once it does: Wow.
Beautiful Soup: Now with Python and Ruby versions, Beautiful Soup can help you breeze through otherwise convoluted HTML.
Feedity: A nifty RSS/XML generator for web pages without a web syndication format (HTML to RSS/XML web feed). Provides custom Web data integration, data syndication, and content change tracking etc.
FOIA: When a government agency is being recalcitrant, try providing some extra legal encouragement by filing a Freedom of Information Act request. See the Reporters Committee guide to using FOIA.
archive.org: Infinite disk space, infinite storage, forever. How could you pass that up?
Crowbar: Crowbar lets you run Javascript-based DOM scrapers from the command line using Mozilla XULRunner.
templatemaker: This clever piece of code learns the structure of a textual template from repeated examples, so that instead of hand-coding regexps for each new site you can write code that learns on its own.
hpricot: A fast and delightful HTML parser for Ruby
Nutch: Nutch is an open source search engine based on Lucene Java that can be used to index internal or public web documents of interest.
Hadoop: Hadoop is a framework for distributed applications running on large clusters. It started as a Lucene subproject.
UICrawler: UICrawler is an open source tool that lets you code your DB definitions and conversion programs, then package them as a Python package that you can share or reuse.
HtmlAgilityPack: HtmlAgilityPack is a very forgiving .NET/Mono parser library for HTML, which it presents as an IXPathNavigable implementation, permitting easy XPATH queries and XSLT transformations.
YahooPipes: Pipes features a number of Source Modules for fetching data (e.g. the Fetch CSV, Feed, Data, and Page modules).
CyberNeko: An HTML to XML parser and tag balancer for Java.
zaokao: An online tool that helps you retrieve FLV video from YouTube, vSocial, and Vimeo.
youtube-dl: A downloader for YouTube and other video hosting sites.
NCleaner-1.0: Software for automatic cleaning of Web pages (boilerplate removal)
Feed Bag: A Ruby script that can be used to periodically scan and archive RSS feeds to an SQL database.
lynx -dump: Fast and cheap way to convert HTML to text.
pyPdf: A Python library for reading PDFs. Particularly useful is the extractText() method of the page objects returned by PdfFileReader.getPage().
Mod Security: ModSecurity is an open source web application firewall.
WikiTeam: tools for wiki preservation and a repository of wikis
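Several of the tools above (Beautiful Soup, hpricot, HtmlAgilityPack, CyberNeko) exist because real-world HTML is rarely well formed. As a rough illustration of what "forgiving" parsing buys you, here is a minimal sketch that pulls links out of sloppy markup. It uses only Python's standard-library html.parser, not any of the libraries listed, and the names LinkCollector and extract_links are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from possibly malformed HTML.

    Hypothetical helper for illustration; not part of any listed tool.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._close_link()  # tolerate a missing </a> before a new <a>
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_endtag(self, tag):
        if tag == "a":
            self._close_link()

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def _close_link(self):
        # Record the open link (if any) with its whitespace collapsed.
        if self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
        self._href = None
        self._text = []

    def close(self):
        super().close()      # flush any buffered trailing text
        self._close_link()   # tolerate a link left open at end of input


def extract_links(html):
    """Return a list of (href, text) pairs found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    messy = '<p>See <a href="a.html">first <b>link</a> and <a href=b.html>second'
    for href, text in extract_links(messy):
        print(href, text)
```

The sample input has an unclosed <b>, an unquoted attribute, and a link that is never closed, which is the kind of markup the dedicated libraries are built to survive. For anything beyond a quick one-off, one of the parsers listed above is likely the better choice.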
last modified September 20, 2012