CMU-CS-98-122
Computer Science Department
School of Computer Science, Carnegie Mellon University
Learning to Extract Symbolic Knowledge from the World Wide Web
Mark Craven, Dan DiPasquo, Dayne Freitag, Andrew McCallum,
Tom Mitchell, Kamal Nigam, Sean Slattery
September 1998
The World Wide Web is a vast source of information accessible to computers,
but understandable only to humans. The goal of the research described here is
to automatically create a computer-understandable knowledge
base whose content mirrors that of the World Wide Web. Such a knowledge base
would enable much more effective retrieval of Web information, and promote new
uses of the Web to support knowledge-based inference and problem solving. Our
approach is to develop a trainable information extraction system that takes
two inputs. The first is an ontology that defines the classes (e.g.,
Company, Person, Employee, Product) and relations (e.g.,
Employed.By, Produced.By) of interest when creating the knowledge
base. The second is a set of training data consisting of labeled regions of
hypertext that represent instances of these classes and relations. Given
these inputs, the system learns to extract information from other pages and
hyperlinks on the Web. This paper describes our general approach, several
machine learning algorithms for this task, and promising initial results with
a prototype system that has created a knowledge base describing university
people, courses, and research projects.
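
As a rough illustration of the two inputs described above, the following sketch represents an ontology and a set of labeled training examples as plain Python data structures. It is not the report's implementation. The names `Ontology`, `Relation`, `LabeledExample`, `build_example_ontology`, and `unknown_labels` are hypothetical, the argument classes chosen for `Employed.By` and `Produced.By` are assumptions made only for illustration, and the URLs are placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """A binary relation between two ontology classes."""
    name: str
    domain_cls: str  # class of the first argument
    range_cls: str   # class of the second argument


@dataclass
class Ontology:
    """First input: the classes and relations of interest."""
    classes: set[str] = field(default_factory=set)
    relations: dict[str, Relation] = field(default_factory=dict)

    def add_relation(self, name: str, domain_cls: str, range_cls: str) -> None:
        # Both argument classes must already be declared in the ontology.
        for cls in (domain_cls, range_cls):
            if cls not in self.classes:
                raise ValueError(f"unknown class: {cls}")
        self.relations[name] = Relation(name, domain_cls, range_cls)


@dataclass(frozen=True)
class LabeledExample:
    """Second input: a region of hypertext labeled with a class or relation."""
    url: str    # placeholder identifier for the labeled page or hyperlink
    label: str  # name of an ontology class or relation


def build_example_ontology() -> Ontology:
    # Class and relation names come from the abstract; the argument
    # classes assigned to each relation are illustrative assumptions.
    onto = Ontology(classes={"Company", "Person", "Employee", "Product"})
    onto.add_relation("Employed.By", "Employee", "Company")
    onto.add_relation("Produced.By", "Product", "Company")
    return onto


def unknown_labels(onto: Ontology, examples: list[LabeledExample]) -> list[LabeledExample]:
    """Return training examples whose label is not defined in the ontology."""
    return [
        ex for ex in examples
        if ex.label not in onto.classes and ex.label not in onto.relations
    ]
```

A check such as `unknown_labels` could be run before training to confirm that every labeled example refers to a class or relation the ontology actually defines.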
51 pages