This dataset consists of the "Science" subtree of the dmoz.org web directory. It contains 11625 topics and 104853 documents. The topics are numbered by integers from 0 to 11624; the documents are numbered by integers from 0 to 104852. The topics are organized into a tree (topic 0 is the root). Each document belongs to one or more topics; however, the vast majority of documents (102801 out of 104853) belong to exactly one topic.
The dataset has been provided as a set of files. The following are the file names and their contents:
Each line contains a topic number and the full name of that topic, separated by a tab character.
Each line contains a topic number and the description of that topic, separated by a tab character. For some topics, the description is a zero-length string.
Each line contains a pair of topic numbers (separated by a tab character). The first of these two topics is the parent of the second topic. Each topic has exactly one parent, except for the root (topic 0), which has no parent.
Each line contains a document number and its URL, separated by a tab character.
Each line contains a document number and its title, separated by a tab character.
Each line contains a document number and a topic number, separated by a tab character. This indicates that the document belongs to the given topic. If a document was originally assigned to some topic t that did not make it into our selection of 996 topics, it has been reassigned to the nearest ancestor of t that did make it into the selection. The original assignments of documents to topics are nevertheless available in a separate file,
The contents of the documents are provided separately.
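The tab-separated files described above could be read along the following lines. This is only a minimal sketch: the helper names (parse_pairs, build_parent_map, path_to_root, topics_per_document) are hypothetical, and the functions take any iterable of text lines (for example an open file object) because no file names are assumed here. The small in-memory samples are made-up illustrations, not actual records from the dataset.

```python
from collections import defaultdict


def parse_pairs(lines):
    """Yield (integer id, remainder) for each non-empty line.

    Each line is split at the first tab only, so the remainder may be a
    zero-length string (as happens for some topic descriptions).
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        key, _, value = line.partition("\t")
        yield int(key), value


def build_parent_map(hierarchy_lines):
    """Map each child topic to its parent; the root (topic 0) has no entry."""
    parent_of = {}
    for parent, child in parse_pairs(hierarchy_lines):
        parent_of[int(child)] = parent
    return parent_of


def path_to_root(topic, parent_of):
    """Return the list of topics from `topic` up to the root."""
    path = [topic]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return path


def topics_per_document(assignment_lines):
    """Map each document number to the set of topics it belongs to."""
    topics = defaultdict(set)
    for doc, topic in parse_pairs(assignment_lines):
        topics[doc].add(int(topic))
    return dict(topics)


# Made-up sample lines in the formats described above (assumption).
sample_hierarchy = ["0\t1\n", "0\t2\n", "1\t3\n"]
sample_assignments = ["0\t3\n", "1\t2\n", "1\t3\n"]

parent_of = build_parent_map(sample_hierarchy)
doc_topics = topics_per_document(sample_assignments)
single_topic_docs = sum(1 for t in doc_topics.values() if len(t) == 1)

print(path_to_root(3, parent_of))
print(single_topic_docs)
```

With the real files, the same functions could be called on open file handles; counting the documents whose topic set has size one should then reproduce the figure of 102801 single-topic documents quoted above, provided the assignment file is the one being read.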