Posted by Konstantin 11.09.2008

    The internet nowadays is a data miner's paradise, providing unlimited ground for novel ideas and experiments. With just a brief look around one will easily find well-structured data about anything ranging from macroeconomic and business indicators to networks and genes.

    However, among this wild variety of datasets there are a few that seem to possess an especially high wow-factor, whether in terms of their fame, size, content, or the amount of potentially interesting information still waiting to be uncovered in them. Here is a short list of my text- and graph-mining favourites.

    1. Enron Emails. (download)
      Need a test set for your brand new text mining algorithm? Take this massive set of about 500 000 email messages from about 150 employees of the Enron Corporation, made public during the investigation of the famous fraud scandal that brought the company down in 2001.
    2. AOL Search Logs. (download)
      A set of about 20 000 000 search queries, made by about 650 000 users, released by AOL in 2006. The data contains an (anonymous) user id, the search keywords and the rank of the result link that was clicked. NB: Besides being a nice research resource, it's a comprehensive collection of (probably outdated) porn links.
    3. OpenCyc Common Knowledge. (download)
      A large database of seemingly arbitrary common knowledge facts, such as "girls' moccasin" is a "moccasin used by female children", which is, in other words, "(ArtifactTypeForUserTypeFn Moccasin FemaleChild)". Comes with a reasoning engine and some serious ambitions.
    4. RealityMining: People Tracks. (on request)
      A continuous activity log of 100 people over a period of 9 months, recorded via a Symbian phone application. Includes call logs, Bluetooth devices in proximity, cell tower IDs, application usage, and phone status. If you ever want to construct a model of human social behaviour (or are just interested in the emerging stalker technologies), this is certainly the place to look. In contrast to some other entries in this list, the download is just a 38 megabyte file.
    5. DMOZ100k06: Internet Documents and Tags. (download)
      This is a sample of 100 000 webpages from the Mozilla Open Directory project together with their metadata and tags. A must-see for those interested in bookmarks, tagging and how these relate to PageRank. Somewhat similar information could probably also be mined from CiteULike data.
    6. SwetoDblp: Ontology of CS Publications. (download)
      Interested in very large graphs and ontologies? Why not take the Computer Science Digital Bibliography and Library Project's data, converted to RDF? It contains about 11 000 000 triples, relating authors, universities and publications.
    7. Wikipedia3: Wikipedia in RDF. (download)
      This is just a conversion of English Wikipedia links and category relations into RDF, which produces a massive dataset of 47 000 000 triples, available as a single download. The maintainers of the project, however, promise to develop it further and add more interesting relations to it in the future.
    8. Overstock Product Data. (download)
      Information on more than a million various products and their review ratings, available in a convenient tabular format. I'm not really sure what can be done with it, but the sheer size of this data earns it a mention here.
    9. Ensembl: Genomes of 40+ organisms. (download)
      This dataset is of a slightly different flavour than all of the above, and comes from a completely different weight category, but nonetheless it is purely textual and I thought it would be unfair not to mention it. In fact, this is probably one of the most expensive, most promising and yet least understood textual datasets in the world. And as long as no one really knows what to do with it, bioinformatics will prosper.
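
    To give a feeling for how little code a first look at a log like the AOL one needs, here is a minimal Python sketch. It only assumes what the description in entry 2 says: each record carries a user id, the search keywords and the rank of the clicked result. The tab-separated layout, the column names and the sample rows are all made up for illustration and are not the actual file format.

```python
import csv
import io
from collections import Counter

# Made-up sample rows. The tab-separated layout and the column names
# (user_id, query, clicked_rank) are assumptions for illustration only.
SAMPLE_LOG = (
    "user_id\tquery\tclicked_rank\n"
    "101\tenron emails\t1\n"
    "101\trdf tutorial\t3\n"
    "202\tmovie ratings\t\n"
    "202\topen directory\t2\n"
    "202\tgenome browser\t1\n"
)


def summarize(log_text):
    """Return (queries per user, mean clicked rank over rows that have a click)."""
    reader = csv.DictReader(io.StringIO(log_text), delimiter="\t")
    queries_per_user = Counter()
    ranks = []
    for row in reader:
        queries_per_user[row["user_id"]] += 1
        # An empty rank means the user did not click any result.
        if row["clicked_rank"]:
            ranks.append(int(row["clicked_rank"]))
    mean_rank = sum(ranks) / len(ranks) if ranks else None
    return queries_per_user, mean_rank


counts, mean_rank = summarize(SAMPLE_LOG)
print(counts.most_common())  # users ordered by number of queries
print(mean_rank)             # average rank of the clicked results
```

    For a real file one would open it with `open(...)` instead of `io.StringIO` and adjust the column names to whatever the header actually says.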

    I still lack one entry to make it a "top 10", so your suggestions are welcome.

    Update:

    1. The Netflix Challenge Dataset. (register)
      A data mining challenge that raised the awareness of the commercial importance of machine learning as a field to a new level. Netflix promises $1 000 000 for the first one to beat their movie rating prediction algorithm by just 10%. The contest will soon celebrate its second year but no one has yet claimed the grand prize, and the stage is open.

    Update from 2015, when the world of data is miles away from what it was in 2008: List of awesome datasets.
