theinfo.org

for people with large data sets


processing theinfo: tips and tricks


RDF: It's always nice to have a consistent standard to convert things to. And, for better or worse, the one that people seem to be settling on is RDF. Yes, I know some of that Semantic Web stuff sounds crazy, but really the underlying concept behind RDF (simple subject, predicate, object triples) isn't so bad. So give it a try! The W3C RDF Primer is a good place to start.
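To make the triple idea concrete, here is a minimal sketch in Python that turns one flat record into triples and prints them in N-Triples syntax. The `example.org` base URI, the record, and the helper names `record_to_triples` and `to_ntriples` are all made up for illustration; a real conversion would pick an existing vocabulary and probably use an RDF library instead. It also assumes the record keys are safe to paste into a URI.

```python
# Hypothetical base URI; example.org is a placeholder, not a real vocabulary.
BASE = "http://example.org/"


def record_to_triples(subject_id, record):
    """Turn one flat dict into (subject, predicate, object) triples."""
    subject = BASE + "item/" + subject_id
    return [(subject, BASE + "prop/" + key, str(value))
            for key, value in record.items()]


def to_ntriples(triples):
    """Serialize triples as N-Triples lines, with every object as a plain literal."""
    lines = []
    for s, p, o in triples:
        # Escape the characters that would break a quoted literal.
        escaped = (o.replace("\\", "\\\\").replace('"', '\\"')
                    .replace("\n", "\\n").replace("\r", "\\r"))
        lines.append(f'<{s}> <{p}> "{escaped}" .')
    return "\n".join(lines)


triples = record_to_triples("42", {"title": "Moby Dick", "year": 1851})
print(to_ntriples(triples))
```

Every value becomes a plain string literal here, which loses the fact that `year` was a number; typed literals are the usual next step once the basic shape works.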

Topic Maps: Everything I do, I convert to Topic Maps. Sure, a lot of people use RDF, but Topic Maps are a lot easier to understand, parse and reuse. Look at something like the CSXTM format for a generic XML metadata format.

POX (Plain Old XML): Use it at least until you get your head round the dataset. You'll soon find out which items are key; list them for filtering, viewing, or whatever. XML is easier to play with until you decide what to do with the data. Keep the element names short, though, to reduce overhead.
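One way to find out which items are key is simply to count the element names and then pull out the ones you care about. The sketch below does that with Python's standard `xml.etree.ElementTree`; the sample document and its short element names (`recs`, `r`, `t`, `y`, `n`) are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up sample data with deliberately short element names.
SAMPLE = """<recs>
  <r><t>Moby Dick</t><y>1851</y></r>
  <r><t>Walden</t><y>1854</y><n>essay</n></r>
</recs>"""

root = ET.fromstring(SAMPLE)

# Count how often each field appears across all records.
tag_counts = Counter(field.tag for rec in root for field in rec)

# Once a key field is identified, list it for filtering or viewing.
titles = [rec.findtext("t") for rec in root.iter("r")]

print(tag_counts.most_common())
print(titles)
```

Fields that appear in every record are good candidates for keys; rare ones like `n` here can be looked at later.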

YAML is simple, clean, works with most languages and scripting tools, and does data binding / marshalling (where the data shows up in your program knowing its datatype and data structure) better than any other interchange format. There are well-tested libraries for Java, C, JavaScript, Ruby, Perl and Python, and it degrades nicely to JSON.

JSON is a good data-binding language in its own right, and is popular for data interchange with web service APIs.
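As a small illustration of data binding, the sketch below parses a JSON string with Python's standard `json` module and shows that each value arrives already typed, with no manual conversion. The record itself is invented for the example.

```python
import json

# Made-up record, as it might arrive from a web API.
text = ('{"title": "Moby Dick", "year": 1851, '
        '"tags": ["novel", "whaling"], "in_print": true}')

record = json.loads(text)

# Each value shows up as a native Python type.
types = {key: type(value).__name__ for key, value in record.items()}
print(types)

# Serializing and parsing again gives back an equal structure.
round_trip = json.loads(json.dumps(record))
print(round_trip == record)
```

The same typed-values idea is what the YAML entry above describes; JSON just has a smaller set of types to bind to.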

Web Content Mining Book: Bing Liu seems to have done a lot of research on web content mining, and has written a book on the subject. Some free slides covering web content mining are also available.

Programming Collective Intelligence: A book by Toby Segaran that starts with scraping the web and collecting information from large datasets, then moves on to ranking, data mining, decision trees and clustering. There are practical examples, such as how to cluster similar blogs based on their content. Download it here. (Note from the author: I don't mind you linking to a torrent of the book :) My website with downloadable code and updates is at kiwitobes.com) (Note from the torrent poster: Toby, your comment just made me buy the book. :)


last modified May 28