theinfo.org

for people with large data sets




wget: Old and venerable, wget still gets the job done 90% of the time. (True, but if you get a quarter of the way through a crawl and get cut off, you have to start from scratch, no?)
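On the restart question: wget's own `-c` (`--continue`) option resumes a partially downloaded file, and the usual mechanism behind that kind of resume is an HTTP Range request, which asks the server only for the bytes not yet on disk. The sketch below shows the idea in Python with the standard library only. `resume_headers` and `resume_download` are hypothetical helper names, and the code assumes the server honors Range requests.

```python
import os
import urllib.request


def resume_headers(path):
    """Return a Range header asking for everything after the bytes already saved."""
    size = os.path.getsize(path) if os.path.exists(path) else 0
    return {"Range": f"bytes={size}-"} if size else {}


def resume_download(url, path, chunk_size=64 * 1024):
    """Append the missing part of url to path (assumes the server supports Range)."""
    request = urllib.request.Request(url, headers=resume_headers(path))
    with urllib.request.urlopen(request) as response:
        # 206 means the server sent only the requested range; otherwise start over.
        mode = "ab" if response.status == 206 else "wb"
        with open(path, mode) as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
```

If the file is already complete, a server will typically answer with 416, which `urlopen` raises as an `HTTPError`, so a real crawler would want to catch that case.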

Diffbot: Diffbot lets you subscribe to any public URL. Useful for the many data sources that do not provide RSS feeds.

Hadoop: Essentially a mechanism for analyzing huge datasets, which do not necessarily need to be housed in a datastore. Hadoop abstracts MapReduce's massive data-analysis engine, making it more accessible to developers. Hadoop scales out to myriad nodes and can handle all of the activity and coordination related to data sorting. Yahoo! and countless other organizations have found it an efficient mechanism for analyzing mountains of bits and bytes. Hadoop is also fairly easy to get working on a single node; all you need is some data to analyze and familiarity with Java code, including generics.
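As a rough illustration of the MapReduce model that Hadoop builds on, here is a minimal single-process word count in Python. This is not Hadoop's API (real Hadoop jobs are typically written in Java and run across many nodes); the function names `map_phase`, `shuffle`, and `reduce_phase` are made up for this sketch.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values seen for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    print(word_count(["big data big crawl", "data sets"]))
```

In a real cluster the map and reduce calls run on different machines and the shuffle step moves data over the network; here everything happens in one process so the three phases are easy to see.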

Spark: An in-memory distributed computing framework that originated at UC Berkeley. Spark lets you load the data of interest from HDFS (or any persistent storage) into RAM across multiple servers and cache it. You can then run multiple queries against the cached data. Since the data is in RAM, queries are very fast. If a node dies, Spark automatically reconstructs the lost data from persistent storage.
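The load-once, query-many pattern described here can be sketched in plain Python. This is not Spark's API: `CachedDataset`, `load_from_storage`, and `drop_cache` are hypothetical names, and a single process stands in for the cluster.

```python
class CachedDataset:
    """Keeps data in memory after the first load; reloads it if the cache is lost."""

    def __init__(self, loader):
        self._loader = loader  # function that reads from persistent storage
        self._cache = None
        self.loads = 0         # how many times we went back to storage

    def _data(self):
        # Load from storage only when there is no in-memory copy.
        if self._cache is None:
            self._cache = list(self._loader())
            self.loads += 1
        return self._cache

    def query(self, predicate):
        """Run a filter against the in-memory copy."""
        return [row for row in self._data() if predicate(row)]

    def drop_cache(self):
        """Simulate a node dying and losing its in-memory copy."""
        self._cache = None


def load_from_storage():
    # Stand-in for reading a file from HDFS or another persistent store.
    return range(10)


if __name__ == "__main__":
    ds = CachedDataset(load_from_storage)
    print(ds.query(lambda x: x % 2 == 0))  # first query triggers the load
    print(ds.query(lambda x: x > 7))       # second query reuses the cache
    ds.drop_cache()                        # pretend the node died
    print(ds.query(lambda x: x < 2))       # data is rebuilt from storage
    print("loads from storage:", ds.loads)
```

The `loads` counter is only there to make the behavior visible: repeated queries do not touch storage again until the cached copy is lost.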

archive.org: Effectively unlimited storage, kept indefinitely. How could you pass that up?

WikiTeam: Tools for wiki preservation and a repository of wikis.

Cloud computing: This is the most expensive method, but it is also the fastest: you can process terabytes of data in minutes.

The gamer: This method involves linking together the computers of a lot of people who are gaming online and using an @home program to pool all that processing power.


last modified October 4, 2012