for people with large data sets
R: Is it just me or are all the cool kids using R for their sexy mathematical processing these days?
SOCR Resources: A web of applets, computational libraries, educational and data resources.
Python, a simple SQL database layer, PIL, and SciPy are good tools of the trade. Examples forthcoming. Python + PIL has been found to cope in some cases where R falls over due to a surfeit of data. Also, Matplotlib, the most glorious image plotting library for Python, ever.
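As a rough illustration of the "Python plus a simple SQL layer" approach, here is a minimal sketch using the standard-library sqlite3 module. The table name, column names, and the fake_rows generator are made up for the example; the idea is just to insert in batches and let the database do the aggregation, so the whole data set never has to sit in a Python list.

```python
import sqlite3

def load_readings(conn, rows, batch_size=1000):
    """Insert (sensor, value) rows in batches instead of building one huge list."""
    conn.execute("CREATE TABLE IF NOT EXISTS readings (sensor TEXT, value REAL)")
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            conn.executemany("INSERT INTO readings VALUES (?, ?)", batch)
            batch.clear()
    if batch:  # flush whatever is left over
        conn.executemany("INSERT INTO readings VALUES (?, ?)", batch)
    conn.commit()

def mean_by_sensor(conn):
    """Aggregate inside the database; only one row per sensor comes back."""
    query = "SELECT sensor, AVG(value) FROM readings GROUP BY sensor ORDER BY sensor"
    return dict(conn.execute(query))

def fake_rows(n):
    """Hypothetical stand-in for a file too large to hold in memory."""
    for i in range(n):
        yield ("s%d" % (i % 3), float(i))

conn = sqlite3.connect(":memory:")  # use a file path for real data
load_readings(conn, fake_rows(9))
means = mean_by_sensor(conn)
```

For real data you would point sqlite3.connect at a file on disk (or swap in another database driver) and feed load_readings from a file reader instead of the toy generator.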
In Ruby, Skynet is a young implementation of Google's MapReduce.
Weka is an open-source Java library of machine learning algorithms for data mining tasks.
Nutch is open source web-search and crawl software for SEO
Hadoop is an open-source distributed filesystem and MapReduce implementation.
When you have a lot of crunching to do, Amazon's EC2 lets you pay for as many machines as you care to spin up. The bandwidth costs are a little high unless you're working from S3, so EC2 is better for heavy computation than large datasets.
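For anyone new to the MapReduce idea, here is a toy single-process sketch in Python. It is not Hadoop's API (or Skynet's); the function names are invented for illustration. It only shows the three steps: map each record to key/value pairs, group the pairs by key, then reduce each group.

```python
from collections import defaultdict

def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce step: collapse one key's values into a single result."""
    return key, sum(values)

def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data big tools", "data tools"])
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; the structure of the computation is the same.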
Prolog is a nice language for querying large datasets. You can convert Prolog terms to and from RDF.
WordNet is an open-source lexical database of common English terms, grouped into nouns, verbs, and adjectives.
Generic-CV: An open-source distributed implementation of the k-fold cross-validation protocol for performance evaluation of supervised learning algorithms.
LingPipe is a Java toolkit for linguistic text processing, designed to be efficient for large data sets. It does sentiment analysis, entity extraction, POS tagging and more. See tutorials. The source is open and can be downloaded for free, under either a free or a commercial license.
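For reference, the k-fold protocol itself is small enough to sketch in plain Python. This is not Generic-CV's code; the function names and the majority-class "model" are invented for the example. Each fold is trained and scored independently of the others, which is what makes the protocol easy to distribute.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_validate(xs, ys, k, fit, score):
    """Train on k-1 folds, score on the held-out fold, and average the scores."""
    scores = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in test], [ys[i] for i in test]))
    return sum(scores) / len(scores)

# Hypothetical learner: always predict the most common training label.
def fit_majority(xs, ys):
    return max(set(ys), key=ys.count)

def accuracy(model, xs, ys):
    return sum(1 for y in ys if y == model) / len(ys)

xs = list(range(6))
ys = [1, 1, 1, 1, 1, 0]
mean_accuracy = cross_validate(xs, ys, 3, fit_majority, accuracy)
```

The sketch splits the data in its given order; in practice you would usually shuffle the samples before splitting.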
Splunk is an application that specializes in indexing and searching time-based data from heterogeneous sources. It has decent ad-hoc reporting tools to generate various reports and charts. The base version is free; "enterprise" features are priced in tiers.
Kirix Strata Data Browser is a specialty web browser for accessing and manipulating data from the web (HTML, RSS, CSV), local data files and database systems (MySQL, Oracle, etc.). Despite having a spreadsheet-like interface, it's a fast tool and can handle hundreds of millions of records. It incorporates the Gecko engine for web rendering and has a full JavaScript implementation for manipulation/processing outside of the GUI. Strata is currently at the end of its beta cycle; the software is free to download, and they're offering free licenses for the official release to those who provide feedback during the beta period.
NLTK is Python's Natural Language Toolkit. It includes corpora and tools for natural language text analysis.
Shard-Query is a PHP script which uses Gearman to distribute a SQL statement between multiple cores and/or multiple servers for massive scale-up or scale-out of MySQL queries. New BSD license. See this blog post for some simple benchmarks.
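The general scatter-gather idea behind distributing one SQL statement can be sketched in Python. This is not Shard-Query's code and involves no Gearman or MySQL: the shards here are in-memory SQLite databases, the workers are threads, and all names are invented for the example. The point it illustrates is that an average cannot be combined from per-shard averages, so each shard returns a count and a sum and the coordinator merges them.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

def query_shard(rows):
    """Stand-in for one worker: load this shard's rows, run the partial aggregate."""
    conn = sqlite3.connect(":memory:")  # each worker owns its own connection
    conn.execute("CREATE TABLE t (value REAL)")
    conn.executemany("INSERT INTO t VALUES (?)", [(v,) for v in rows])
    count, total = conn.execute("SELECT COUNT(*), SUM(value) FROM t").fetchone()
    conn.close()
    return count, total or 0.0  # SUM over an empty table is NULL

def sharded_average(shards):
    """Run the partial query on every shard in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(query_shard, shards))
    count = sum(c for c, _ in partials)
    total = sum(s for _, s in partials)
    return total / count

shards = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
average = sharded_average(shards)
```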
last modified October 3, 2012