"Science" Subdirectory of DMOZ Notes

This dataset consists of the "Science" subtree of the dmoz.org web directory. It contains 11625 topics and 104853 documents. The topics are numbered by integers from 0 to 11624; the documents are numbered by integers from 0 to 104852. The topics are organized into a tree (topic 0 is the root). Each document belongs to one or more topics; however, the vast majority of documents (102801 out of 104853) belong to exactly one topic.

List of files

The dataset is provided as a set of files. The file names and their contents are as follows:

TopicNames.txt

Each line contains a topic number and the full name of that topic, separated by a tab character.

TopicDescs.txt

Each line contains a topic number and the description of that topic, separated by a tab character. For some topics, the description is a zero-length string.
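TopicNames.txt and TopicDescs.txt share the same "number, tab, text" layout, so one small reader could handle both. The following is a minimal Python sketch, not part of the dataset: the function name parse_id_text and the sample lines are made up for illustration, and the file encoding (UTF-8 here) is an assumption that this document does not state.

```python
from typing import Dict, Iterable


def parse_id_text(lines: Iterable[str]) -> Dict[int, str]:
    """Parse lines of the form '<number><TAB><text>' into a dict.

    The text part may be empty (some topic descriptions are
    zero-length strings), with or without the tab being present.
    """
    result: Dict[int, str] = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines, e.g. a trailing newline at end of file
        # Split on the first tab only, so any tabs inside the text survive.
        number, _, text = line.partition("\t")
        result[int(number)] = text
    return result


# Hypothetical sample lines, not taken from the real files.
sample = ["0\tScience\n", "1\t\n"]
parsed = parse_id_text(sample)

# With the real files (encoding is an assumption):
# with open("TopicDescs.txt", encoding="utf-8") as f:
#     descs = parse_id_text(f)
```

Splitting on the first tab only keeps the rest of the line intact, which matters if a description itself happens to contain a tab.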

TopicHierarchy.txt

Each line contains a pair of topic numbers (separated by a tab character). The first of these two topics is the parent of the second topic. Each topic has exactly one parent, except for the root (topic 0), which has no parent.
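Because every topic except the root has exactly one parent, the hierarchy can be held in a simple child-to-parent map. The following Python sketch shows one way to read the pairs and walk from a topic up to the root. The function names and the sample lines are illustrative assumptions, not part of the dataset.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def parse_hierarchy(lines: Iterable[str]) -> Tuple[Dict[int, int], Dict[int, List[int]]]:
    """Read '<parent><TAB><child>' pairs into parent and children maps."""
    parent: Dict[int, int] = {}
    children: Dict[int, List[int]] = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        p, c = int(fields[0]), int(fields[1])
        parent[c] = p
        children[p].append(c)
    return parent, children


def path_to_root(topic: int, parent: Dict[int, int]) -> List[int]:
    """Return the list of topics from `topic` up to the root.

    The root is the only topic with no entry in `parent`.
    """
    path = [topic]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


# Hypothetical sample lines, not taken from TopicHierarchy.txt.
sample = ["0\t1\n", "0\t2\n", "1\t3\n"]
parent, children = parse_hierarchy(sample)
ancestors = path_to_root(3, parent)
```

The depth of a topic is then len(path_to_root(topic, parent)) - 1, with the root at depth 0.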

DocUrls.txt

Each line contains a document number and its URL, separated by a tab character.

DocTitles.txt

Each line contains a document number and its title, separated by a tab character.

DocTopics.txt

Each line contains a document number and a topic number, separated by a tab character. This indicates that the document belongs to the given topic. If a document was originally assigned to some topic t that did not make it into our selection of 996 topics, it has been reassigned to the nearest ancestor of t that did make it into the selection. The original assignments of documents to topics are nevertheless available in a separate file,
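Since a document may appear on several lines of DocTopics.txt (once per topic it belongs to), it is natural to collect the assignments into a set of topics per document. The Python sketch below does this and counts the documents that belong to exactly one topic. The function names and sample lines are illustrative assumptions, not part of the dataset.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def parse_doc_topics(lines: Iterable[str]) -> Dict[int, Set[int]]:
    """Read '<document><TAB><topic>' pairs into a document -> topics map."""
    topics_of: Dict[int, Set[int]] = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        topics_of[int(fields[0])].add(int(fields[1]))
    return topics_of


def count_single_topic_docs(topics_of: Dict[int, Set[int]]) -> int:
    """Count documents that are assigned to exactly one topic."""
    return sum(1 for topics in topics_of.values() if len(topics) == 1)


# Hypothetical sample lines, not taken from DocTopics.txt.
sample = ["0\t5\n", "1\t5\n", "1\t7\n", "2\t9\n"]
topics_of = parse_doc_topics(sample)
single = count_single_topic_docs(topics_of)
```

Run on the full file, the result of count_single_topic_docs could be compared against the 102801 single-topic documents stated in the introduction as a sanity check.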

Documents.zip

The contents of the documents, provided separately.