Computer Science Department
School of Computer Science, Carnegie Mellon University


Hypertext Classification

Sean Slattery

May 2002

Ph.D. Thesis

Keywords: Machine learning, text classification, hypertext classification, relational learning

Hypertext classification is the task of assigning labels to arbitrary hypertext documents, typically Web pages. One major problem with current techniques for this task is that they cannot easily be extended to incorporate hyperlink information. This dissertation explores the space of algorithms that use hyperlinks effectively and shows that such algorithms can improve classification accuracy.

I demonstrate how a first-order learner (FOIL) can be used for hypertext classification in a way that easily incorporates hyperlink information. This approach leads to better classification performance and also produces learned rules that tell us more about how hyperlinks can help classification.
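
To make the idea of a first-order rule over hyperlinks concrete, here is a minimal sketch of how such a rule could be evaluated. The rule, the page names, the word sets and the helper names (has_word, student_page) are all hypothetical and are not taken from the thesis; the sketch only shows that a rule can test the words of a neighbouring page reached through a link relation.

```python
# Toy hypertext corpus: page id -> set of words on the page (hypothetical data).
words = {
    "dept": {"people", "students", "faculty"},
    "alice": {"research", "advisor"},
    "bob": {"hobbies", "photos"},
    "news": {"events", "seminar"},
}

# Directed hyperlinks as (source, target) pairs (hypothetical data).
links = {("dept", "alice"), ("dept", "bob"), ("news", "dept")}


def has_word(page, word):
    """Background relation: True if the word occurs on the page."""
    return word in words.get(page, set())


def student_page(a):
    """Illustrative rule, written by hand rather than learned:
    student_page(A) :- link_to(B, A), has_word(B, 'students').
    """
    return any(
        target == a and has_word(source, "students")
        for source, target in links
    )


predicted = sorted(page for page in words if student_page(page))
print(predicted)
```

In this toy corpus the rule fires for the two pages linked from "dept", even though neither page contains a telling keyword itself.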

A drawback of this approach is that it builds rules which assess document content using the presence or absence of specific keywords. The word-distribution approach used by text classifiers such as Naive Bayes and k-Nearest Neighbour is more intuitively appealing for testing document content. I show how a new hypertext classifier, FOIL-PILFS, combines the ability to use hyperlinks easily (via FOIL) and test document content effectively (using Naive Bayes) to produce improved classification performance.
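
The contrast between a keyword test and a word-distribution test can be sketched as follows. This is not the FOIL-PILFS algorithm itself; it only illustrates, under assumed toy data, a hand-written rule whose content test on a linked page is a small multinomial Naive Bayes classifier with add-one smoothing. The function names, class names and documents are hypothetical.

```python
import math
from collections import Counter


def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model with add-one smoothing."""
    vocab = {w for d in docs for w in d}
    classes = sorted(set(labels))
    priors = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        counts[y].update(d)
    loglik = {}
    for c in classes:
        total = sum(counts[c].values()) + len(vocab)
        loglik[c] = {w: math.log((counts[c][w] + 1) / total) for w in vocab}
    return priors, loglik


def nb_predict(model, doc):
    """Return the most probable class; words outside the vocabulary are ignored."""
    priors, loglik = model

    def score(c):
        return priors[c] + sum(loglik[c][w] for w in doc if w in loglik[c])

    return max(priors, key=score)


# Hypothetical training documents for the content test.
train_docs = [
    ["students", "people", "faculty"],
    ["students", "directory", "people"],
    ["seminar", "events", "news"],
    ["photos", "hobbies", "travel"],
]
train_labels = ["directory", "directory", "other", "other"]
model = train_nb(train_docs, train_labels)

# Hypothetical pages and hyperlinks to classify.
pages = {
    "dept": ["people", "students", "staff"],
    "alice": ["research", "advisor"],
    "news": ["events", "seminar", "talks"],
}
links = {("dept", "alice"), ("news", "dept")}


def student_page(a):
    """Illustrative rule: student_page(A) :- link_to(B, A), nb_says(B, directory)."""
    return any(
        t == a and nb_predict(model, pages[s]) == "directory"
        for s, t in links
    )


predicted = sorted(p for p in pages if student_page(p))
print(predicted)
```

Unlike a single-keyword test, the classifier judges the linking page by its overall word distribution, so "dept" is recognised as a directory page even though it never contains the word "directory".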

Another useful source of information for improved classification can be the hyperlink structure of the test set. Given an initial labelling of the test documents, hyperlink patterns in the test set can allow us to achieve even better classification. The First-Order Hubs algorithm looks for one kind of hyperlink regularity in the test set, similar to Kleinberg's Hubs and Authorities regularity, and can improve upon an initial test-set classification. Of course, other types of regularity are possible, and I show how we might find and use these with First-Order Hubs.
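
The kind of hub regularity described above can be illustrated with a deliberately simple sketch. It is not the First-Order Hubs algorithm; it assumes a hypothetical heuristic in which a page counts as a hub for a class when enough of its outgoing links point to pages initially labelled with that class, and all pages it links to are then relabelled. The function name, thresholds and data are assumptions made for illustration.

```python
def relabel_with_hubs(initial, links, target, min_links=3, min_frac=0.6):
    """Relabel pages linked from hub-like pages (hypothetical heuristic).

    initial:   dict mapping page id -> initial class label
    links:     iterable of (source, target) hyperlink pairs
    target:    the class whose hubs we look for
    min_links: minimum number of distinct out-links for a hub candidate
    min_frac:  minimum fraction of out-links already labelled `target`
    """
    out = {}
    for s, t in links:
        out.setdefault(s, set()).add(t)

    updated = dict(initial)
    for hub, targets in out.items():
        hits = sum(1 for t in targets if initial.get(t) == target)
        if len(targets) >= min_links and hits / len(targets) >= min_frac:
            # The hub mostly points at `target` pages, so trust it for the rest.
            for t in targets:
                updated[t] = target
    return updated


# Hypothetical initial test-set labelling and hyperlinks.
initial = {
    "index": "other",
    "s1": "student",
    "s2": "student",
    "s3": "other",   # assumed to be misclassified by the initial classifier
    "s4": "student",
    "misc": "other",
}
links = [
    ("index", "s1"), ("index", "s2"), ("index", "s3"), ("index", "s4"),
    ("misc", "s3"),
]

updated = relabel_with_hubs(initial, links, "student")
print(updated["s3"])
```

Here "index" links to four pages, three of which were initially labelled "student", so the remaining page "s3" is relabelled to match; "misc" has too few out-links to be treated as a hub.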

134 pages
