processing theinfo: tips and tricks (theinfo)

RDF: It's always nice to have a consistent standard to convert things to. And, for better or worse, it looks like the one that people are settling on is RDF. Yes, I know some of that Semantic Web stuff sounds crazy, but really the underlying concept behind RDF (simple triples) isn't so bad. So give it a try!
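To make the "simple triples" idea concrete, here is a minimal sketch in plain Python, with no RDF library involved. Each fact is a (subject, predicate, object) tuple. The example.org URIs, the sample data, and the `match` helper are all made up for illustration.

```python
# Each triple is (subject, predicate, object).
# All URIs and values below are hypothetical sample data.
triples = [
    ("http://example.org/dataset/1", "http://example.org/title", "Sample Dataset"),
    ("http://example.org/dataset/1", "http://example.org/maintainer", "http://example.org/person/1"),
    ("http://example.org/person/1", "http://example.org/name", "Jane Doe"),
]

def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

# Everything known about dataset 1:
about_dataset = match(triples, s="http://example.org/dataset/1")

# Every name in the data, whoever it belongs to:
names = [o for (_, _, o) in match(triples, p="http://example.org/name")]
```

Because every fact has the same three-part shape, data from different sources can be merged by concatenating the lists, which is much of the appeal.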

Topic Maps: Everything I do, I convert to Topic Maps. Sure, a lot of people love and use RDF, but Topic Maps are a lot easier to understand, parse and reuse. Look to something like the CSXTM format for a generic XML format for metadata.

POX (Plain Ol' XML): At least until you get your head round the dataset. You'll soon find out which items are key. List them for filtering, viewing, whatever. XML is easier to play with until you decide what to do with it. Keep the element names small, though, to reduce overhead.
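One way to find out which items are key is to count how often each element name appears. The sketch below uses Python's standard-library ElementTree; the sample document and its short element names (`recs`, `rec`, `t`, `yr`, `src`) are invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample data with deliberately short element names.
SAMPLE = """<recs>
  <rec><t>First</t><yr>2001</yr></rec>
  <rec><t>Second</t><yr>2003</yr><src>web</src></rec>
</recs>"""

def tag_counts(xml_text):
    """Count how many times each element name occurs in the document."""
    root = ET.fromstring(xml_text)
    return Counter(el.tag for el in root.iter())

counts = tag_counts(SAMPLE)

# Filter: pull out the titles of records that carry a <src> element.
root = ET.fromstring(SAMPLE)
titles_with_src = [
    rec.findtext("t") for rec in root.iter("rec") if rec.find("src") is not None
]
```

Elements that show up in every record are good candidates for filtering and viewing; rare ones may be optional extras.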

YAML is simple, clean, works with most languages and scripting tools, and does data binding / marshalling (where the data shows up in your program knowing its datatype and data structure) better than any other interchange format. There are well-tested libraries for Java, C, JavaScript, Ruby, Perl and Python, and it degrades nicely to JSON.

JSON is a good data-binding language in its own right, and is popular for data interchange with web services APIs.
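As a small illustration of data binding, the sketch below round-trips a record through JSON with Python's standard `json` module. The record and its field names are made up. The point is only that values come back as native types (int, float, bool, list, None) rather than as strings to be parsed by hand.

```python
import json

# Hypothetical record; the field names are for illustration only.
record = {
    "title": "Sample",
    "year": 2008,
    "score": 4.5,
    "public": True,
    "tags": ["a", "b"],
    "notes": None,
}

text = json.dumps(record, sort_keys=True)  # serialize to a JSON string
loaded = json.loads(text)                  # parse back into Python objects

# The types survive the round trip.
year_plus_one = loaded["year"] + 1
first_tag = loaded["tags"][0]
```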

Web Content Mining Book: Bing Liu seems to have done a lot of research on web content mining, including writing a book on the subject. Some free slides that cover web content mining are available.

theinfo.org is a community site; if you want to help run it, join the mailing list. (It was originally started by Aaron Swartz and is powered by infogami.)