COMPUTER SCIENCE TECHNICAL REPORT ABSTRACTS

CMU-CS-08-125
Computer Science Department
School of Computer Science, Carnegie Mellon University

CMU-CS-08-125

Information Mediation in the Presence of
Constraints and Uncertainties

Sandeep Pandey

May 2008

Ph.D. Thesis

Keywords: Information mediator, search engine, constrained optimization, uncertainty, exploration/exploitation tradeoff, web crawling, web page ranking, advertisement ranking, web page discovery, set cover, greedy, synchronization, web page refreshing, randomized ranking, entrenchment effect, rank promotion, multi-armed bandit, click-through rate, BMMP

People often require a unified view of multiple disparate information sources. This role is played by information mediators such as Web search engines and database integration systems. Internally, information mediators perform three basic tasks: acquisition, analysis, and presentation of content. For example, Web search engines acquire Web pages via crawling, analyze the text and link structure of the crawled pages, and present lists of pages in response to user search queries.

In environments such as the Web that have very large amounts of content, the content acquisition and presentation tasks can be viewed as constrained optimization problems. On the acquisition side, the resources required to discover, and maintain synchronization with, all available online content vastly exceed the capacity of even the most well-provisioned information mediators. On the presentation side, the constraint is on user attention: users cannot be expected to sift through large amounts of content. Rather than being exhaustive, an information mediator must be selective in acquiring and presenting content, with selections driven by some meaningful objective function. This dissertation studies formulations of, and solutions to, these optimization problems.

A complication that arises in this setting is that some parameters needed to solve the optimization problem are unknown. For example, in the Web search context, due to autonomy of sources a search engine may not know the update rate of pages, making it difficult to conduct an optimal synchronization strategy. Similarly, due to sparsity of user feedback, the search engine may not have accurate page quality measurements, making it difficult to present search results in the optimal order. This dissertation studies means of simultaneously estimating unknown parameters and exploiting current estimates. We focus specically on the Web search context, although many of our ideas apply to other contexts as well.

154 pages

Return to: SCS Technical Report Collection
School of Computer Science

This page maintained by reports@cs.cmu.edu