• Posted by Konstantin 11.09.2008

    The internet nowadays is a data miner's paradise, providing unlimited ground for novel ideas and experiments. With just a brief look around one will easily find well-structured data about anything ranging from macroeconomic and business indicators to networks and genes.

    However, among this wild variety of datasets there are a few that seem to possess an especially high wow-factor, whether in terms of their fame, size, content, or the amount of potentially interesting information still waiting to be uncovered in them. Here is a short list of my text- and graph-mining favourites.

    1. Enron Emails. (download)
      Need a test set for your brand new text mining algorithm? Take this massive set of about 500 000 email messages from about 150 employees of the Enron Corporation, made public during the investigation of the famous fraud scandal that broke in 2001.
    2. AOL Search Logs. (download)
      A set of about 20 000 000 search queries made by about 500 000 users, released by AOL in 2006. The data contains an (anonymous) user id, the search keywords and the rank of the result link that was clicked. NB: Besides being a nice research resource, it's a comprehensive collection of (probably outdated) porn links.
    3. OpenCyc Common Knowledge. (download)
      A large database of seemingly arbitrary common knowledge facts, such as "girls' moccasin" is a "moccasin used by female children", which is, in other words, "(ArtifactTypeForUserTypeFn Moccasin FemaleChild)". Comes with a reasoning engine and some serious ambitions.
    4. RealityMining: People Tracks. (on request)
      A continuous activity log of 100 people over a period of 9 months, recorded via a Symbian phone application. Includes call logs, Bluetooth devices in proximity, cell tower IDs, application usage, and phone status. If you ever want to construct a model of human social behaviour (or are just interested in the emerging stalker technologies), this is certainly the place to look. In contrast to some other entries in this list, the download is just a 38 megabyte file.
    5. DMOZ100k06: Internet Documents and Tags. (download)
      This is a sample of 100 000 webpages from the Mozilla Open Directory project together with their metadata and tags. A must-see for those interested in bookmarks, tagging and how they relate to PageRank. Somewhat similar information could probably also be mined from CiteULike data.
    6. SwetoDblp: Ontology of CS Publications. (download)
      Interested in very large graphs and ontologies? Why not take the data of the Computer Science Digital Bibliography and Library Project, converted to RDF? It contains about 11 000 000 triples, relating authors, universities and publications.
    7. Wikipedia3: Wikipedia in RDF. (download)
      This is just a conversion of English Wikipedia links and category relations into RDF, which produces a massive 47 000 000 triple dataset, available as a single download. The maintainers of the project, however, promise to develop it further and add more interesting relations to it in the future.
    8. Overstock Product Data. (download)
      Information on more than a million various products and their review ratings, available in a convenient tabular format. I'm not really sure what can be done with it, but the sheer size of this data earns it a mention here.
    9. Ensembl: Genomes of 40+ organisms. (download)
      This dataset is of a slightly different flavour than all of the above, and comes from a completely different weight category, but nonetheless it is purely textual and I thought it would be unfair not to mention it. In fact, this is probably one of the most expensive, most promising and yet least understood textual datasets in the world. And as long as no one really knows what to do with it, bioinformatics will prosper.
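
    To give a feel for how little code a first look at such data can take, here is a minimal sketch that reads the parenthesised expression quoted in entry 3. It treats the expression as a plain prefix list (function name first, arguments after), which is an assumption about the notation and not a use of the OpenCyc reasoning engine. The `parse` helper is hypothetical and written only for this illustration.

```python
def parse(expr):
    """Turn a parenthesised prefix expression into nested Python lists.

    Assumption: symbols are separated by whitespace and contain no
    parentheses or quoted strings. Unbalanced input raises IndexError.
    """
    tokens = expr.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        # Returns the parsed item starting at tokens[pos] and the next position.
        if tokens[pos] == "(":
            items = []
            pos += 1
            while tokens[pos] != ")":
                item, pos = read(pos)
                items.append(item)
            return items, pos + 1  # skip the closing parenthesis
        return tokens[pos], pos + 1

    tree, _ = read(0)
    return tree


fact = parse("(ArtifactTypeForUserTypeFn Moccasin FemaleChild)")
function_name, *arguments = fact
print(function_name, arguments)
# prints: ArtifactTypeForUserTypeFn ['Moccasin', 'FemaleChild']
```

    Nested expressions come back as nested lists, so the same helper would also handle a fact whose arguments are themselves parenthesised.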

    I still lack one entry to make it a "top 10", so your suggestions are welcome.


    10. The Netflix Challenge Dataset. (register)
      A data mining challenge that raised the awareness of the commercial importance of machine learning as a field to a new level. Netflix promises $1 000 000 for the first one to beat their movie rating prediction algorithm by just 10%. The contest will soon celebrate its second year but no one has yet claimed the grand prize, and the stage is open.
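
    The entry does not say how "10%" is measured. The sketch below assumes the usual choice for rating prediction, root-mean-square error (RMSE) on held-out ratings, and computes the relative improvement of a candidate over a baseline. All ratings and predictions here are made up for illustration; none of them come from the Netflix data.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equally long rating sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def relative_improvement(baseline_rmse, candidate_rmse):
    """Fraction by which the candidate's error is lower than the baseline's."""
    return (baseline_rmse - candidate_rmse) / baseline_rmse


# Hypothetical held-out ratings and two sets of predictions.
actual = [4, 3, 5, 2]
baseline = [3, 3, 4, 3]
candidate = [3.5, 3, 4.5, 2.5]

gain = relative_improvement(rmse(baseline, actual), rmse(candidate, actual))
print(f"improvement: {gain:.1%}")
```

    Under this assumption, a submission would qualify once `gain` reaches 0.10 on the organisers' hidden test set.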

    Update from 2015, when the world of data is miles away from what it was in 2008: List of awesome datasets.

    Posted by Konstantin @ 11:51 pm



    1. Ilja on 12.09.2008 at 11:27

      Well, browsing bookmarks (btw, firefox3 has a great bookmark/tag integration) I found this link:

      From that list - http://bioinfo.uib.es/~joemiro/marvel.html perhaps?

      1. Konstantin on 12.09.2008 at 11:56

        If you analyze the post carefully enough, you'll find the datawrangling link there too (it is pretty well known). A brief review also shows that, unsurprisingly, that list contains at least 6 out of the 9 datasets I name here.

        Not that I wouldn't be utterly impressed by the possibility of uncovering the social network and the hidden life of Spiderman, but I think you still can't compare the Marvel dataset, either in size or in importance, to any of the above.

        1. Ilja on 13.09.2008 at 02:09

          Actually, its importance in terms of commercial revenue might be more than comparable. Choosing the right character to "expand" into movies and games can bring you much more money than finding a gene that might cause yet another cancer type.

          1. Mark on 13.09.2008 at 10:17

            If you start evaluating research by its usefulness then 90% of it all will be shut down 🙂

          2. Konstantin on 13.09.2008 at 12:52

            That's a very interesting idea I hadn't thought about, but I think it's slightly flawed. Firstly, I don't believe that the Marvel social network can be of much help in finding the "next big thing" in the entertainment industry - you need other kinds of data for that (ratings, at least). Secondly, you are greatly underestimating the revenues of the pharmaceutical companies: some googling will tell you that Pfizer annually makes about 2 times more than Sony Entertainment, for example.

            But you did remind me of this perfect candidate:

            10. The Netflix Challenge Dataset. (register)
            A data mining challenge that raised the awareness of the commercial importance of machine learning as a field to a new level. Netflix promises $1 000 000 for the first one to beat their movie rating prediction algorithm by just 10%. The contest will soon celebrate its second year but no one has yet claimed the grand prize, and the stage is open.

    2. Mark on 12.09.2008 at 12:10


      Westbury Lab USENET corpus, raw text, approx. 13 billion words.

      1. Konstantin on 12.09.2008 at 15:43

        Nice! A compressed size of 12 GB puts it in a rather heavy category here. But otherwise it's just untagged public text from the web. And once you mention this kind of data, you should also note the downloadable versions of Wikipedia and ArXiv (and probably some other similar resources). And then we'd have to put all of them in tenth place, together with the rest of the internet...
