Computer Science Department
School of Computer Science, Carnegie Mellon University


Facts and Reasons: Web Information Querying to Support
Agents and Human Decision Making

Mehdi Samadi

June 2015

Ph.D. Thesis


Keywords: Information Extraction, Information Validation, Knowledge Harvesting Systems, Web Mining, Knowledge Integration, Planning, Artificial Intelligence, Robotics

More than one million queries are made every minute on the Internet, and the number keeps growing. Researchers have developed Information Extraction (IE) systems that are able to address some of these queries. IE systems automatically construct machine-readable knowledge bases by extracting structured knowledge from the Web. Most of these IE systems, however, are designed for batch processing and favor high precision (i.e., few false positives) over high recall (i.e., few false negatives). They have also been developed to readily evaluate only factoid queries (e.g., What is the capital of France?). By contrast, many real-world applications, such as servicing knowledge requests from humans or automated agents, require broad coverage (high recall) and fast yet customizable response times for non-factoid and complex queries (e.g., Is shrimp meat healthy?). Users may be willing to trade off time for accuracy. Existing IE techniques are inherently unsuitable for meeting these requirements.

In this thesis, we investigate anytime applications in which information extraction tasks are initiated as queries from either automated agents or humans. The thesis introduces new models and approaches for learning to determine the truth of facts using unstructured web information, while considering the credibility of the sources of information.

We introduce OpenEval, a new anytime information validation technique that evaluates the truthfulness of knowledge statements. As input, agents or humans provide a set of queries stated as multi-argument predicate instances (e.g., DrugHasSideEffect(Aspirin, GI Bleeding)), which the system should evaluate for truthfulness. OpenEval achieves high recall with acceptable precision by using unstructured information on the Web to validate information.
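
To make the input format concrete, the following is a minimal sketch of how a predicate instance written in the notation above could be parsed into a predicate name and its arguments. The class PredicateInstance and the function parse_predicate_instance are hypothetical names introduced for illustration only; they are not part of OpenEval, and the sketch assumes that arguments contain no commas or nested parentheses.

```python
import re
from typing import NamedTuple, Tuple


class PredicateInstance(NamedTuple):
    """Hypothetical container for one query, e.g. DrugHasSideEffect(Aspirin, GI Bleeding)."""
    predicate: str
    arguments: Tuple[str, ...]


# Predicate name, then a parenthesized, comma-separated argument list.
_PATTERN = re.compile(r"^\s*(\w+)\s*\((.*)\)\s*$")


def parse_predicate_instance(text: str) -> PredicateInstance:
    """Split 'Name(arg1, arg2, ...)' into its name and arguments.

    Assumption: arguments contain no commas or nested parentheses.
    """
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a predicate instance: {text!r}")
    name, arg_text = match.groups()
    arguments = tuple(a.strip() for a in arg_text.split(",") if a.strip())
    if not arguments:
        raise ValueError(f"predicate instance has no arguments: {text!r}")
    return PredicateInstance(name, arguments)


query = parse_predicate_instance("DrugHasSideEffect(Aspirin, GI Bleeding)")
print(query.predicate, query.arguments)
```

A structured form like this keeps the predicate separate from its arguments, so a validation component could treat each predicate differently while sharing the same query interface.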

We extend the OpenEval approach to determine the response to a new query by integrating opinions from multiple knowledge harvesting systems. If a response is desired within a specific time budget (e.g., in less than 2 seconds), then only a subset of these resources can be queried. We propose a new method, AskWorld, which learns a policy that chooses which queries to send to which resources while accommodating varying budget constraints that are available only at query (test) time. Through extensive experiments on real-world datasets, we demonstrate AskWorld's capability in selecting the most informative resources to query within test-time constraints, resulting in improved performance compared to competitive baselines.
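
The budget constraint described above can be illustrated with a simple greedy baseline. This is not AskWorld's learned policy; it is only a sketch of the underlying selection problem, assuming that each resource has a known positive cost in seconds and a known expected value, and that resources are queried one after another. The resource names, costs, and values below are made up for the example.

```python
from typing import List, NamedTuple


class Resource(NamedTuple):
    """Hypothetical knowledge resource with an assumed cost and expected value."""
    name: str
    cost_seconds: float
    expected_value: float


def select_within_budget(resources: List[Resource], budget_seconds: float) -> List[Resource]:
    """Greedily pick resources by value per second until the budget is exhausted.

    Assumption: costs are positive and queries run sequentially, so costs add up.
    """
    if any(r.cost_seconds <= 0 for r in resources):
        raise ValueError("every resource needs a positive cost")
    ranked = sorted(resources, key=lambda r: r.expected_value / r.cost_seconds, reverse=True)
    chosen = []
    remaining = budget_seconds
    for resource in ranked:
        if resource.cost_seconds <= remaining:
            chosen.append(resource)
            remaining -= resource.cost_seconds
    return chosen


# Illustrative numbers only; none of these come from the thesis.
resources = [
    Resource("kb_lookup", 0.2, 0.30),
    Resource("web_search", 1.5, 0.60),
    Resource("slow_extractor", 3.0, 0.70),
]
picked = select_within_budget(resources, budget_seconds=2.0)
print([r.name for r in picked])
```

Because the budget is an argument rather than a fixed constant, the same function can be called with a different budget for every query, which mirrors the test-time constraint setting described above.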

We further extend our information validation approaches to automatically measure the credibility of different web information sources and incorporate it into claim validation. To address this problem, we present ClaimEval, a novel and integrated approach which, given a set of claims to validate, extracts a set of pro and con arguments from the Web using the OpenEval approach, and jointly estimates the credibility of sources and the correctness of claims. ClaimEval uses Probabilistic Soft Logic (PSL), resulting in a flexible and principled framework which makes it easy to state and incorporate different forms of prior knowledge. Through extensive experiments on real-world datasets, we demonstrate ClaimEval's capability in determining the validity of a set of claims, resulting in improved accuracy compared to state-of-the-art approaches.
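
The idea of estimating source credibility and claim correctness jointly can be sketched with a plain fixed-point iteration. This is a deliberately simplified stand-in and not ClaimEval's PSL model: it assumes each piece of evidence is a (source, claim, stance) triple with stance +1 for a pro argument and -1 for a con argument, and all source and claim names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Evidence = Tuple[str, str, int]  # (source, claim, stance), stance is +1 (pro) or -1 (con)


def estimate(evidence: List[Evidence], iterations: int = 10):
    """Alternate between updating claim beliefs and source credibilities.

    Simplified illustration only; this is not Probabilistic Soft Logic.
    """
    by_claim = defaultdict(list)
    by_source = defaultdict(list)
    for source, claim, stance in evidence:
        by_claim[claim].append((source, stance))
        by_source[source].append((claim, stance))

    credibility: Dict[str, float] = {s: 0.5 for s in by_source}
    belief: Dict[str, float] = {c: 0.5 for c in by_claim}

    for _ in range(iterations):
        # Claim belief: credibility-weighted vote, mapped from [-1, 1] to [0, 1].
        for claim, votes in by_claim.items():
            total = sum(credibility[s] for s, _ in votes)
            if total == 0:
                belief[claim] = 0.5
            else:
                signed = sum(credibility[s] * stance for s, stance in votes)
                belief[claim] = (signed / total + 1) / 2
        # Source credibility: average agreement with the current claim beliefs.
        for source, stances in by_source.items():
            agreement = [belief[c] if st > 0 else 1 - belief[c] for c, st in stances]
            credibility[source] = sum(agreement) / len(agreement)
    return belief, credibility


evidence = [
    ("site_a", "claim_1", +1),
    ("site_b", "claim_1", +1),
    ("site_c", "claim_1", -1),
    ("site_a", "claim_2", +1),
    ("site_c", "claim_2", -1),
]
belief, credibility = estimate(evidence)
print(belief, credibility)
```

In this toy input, site_c is outvoted on claim_1, which lowers its credibility and in turn reduces the weight of its con argument on claim_2. That coupling between the two estimates is the point of the sketch; a PSL formulation expresses such dependencies as weighted logical rules instead.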

Finally, we show how our information extraction techniques can be used to provide knowledge to anytime intelligent agents, in particular for a find-deliver task on a real mobile robot (CoBot) and for a trip planner agent. We show that OpenEval enables robots to actively query the Web to learn new background knowledge about the physical environment. The robot generates the maximum-utility plan corresponding to a sequence of locations it should visit, asks humans for the object, and then carries it to the requested destination. For the trip planner agent, we also contribute a novel method for a planner to actively query the open World Wide Web to acquire instant knowledge about the planning problem. We introduce a novel technique, called Open World Planner, that estimates the knowledge that is relevant to the initial state and the goal state of a planning problem, and then generates corresponding queries to the Web using our OpenEval query system.

227 pages

Thesis Committee:
Manuel Blum (Co-Chair)
Manuela Veloso (Co-Chair)
Tom Mitchell
Craig Knoblock (USC/ISI)

Frank Pfenning, Head, Computer Science Department
Andrew W. Moore, Dean, School of Computer Science
