CMU-CS-03-168
Computer Science Department
School of Computer Science, Carnegie Mellon University

Finding Lists of People on the Web
Latanya Sweeney* 
July 2003  
Also appears as Institute for Software Research International
Technical Report CMU-ISRI-03-104

Keywords: Information retrieval, data linkage, data mining, privacy, policy
 
CMU-CS-03-168.ps
CMU-CS-03-168.pdf
 
 Among the vast amounts of personal information published on the World 
Wide Web ("Web") and indexed by search engines are lists of names 
of people. Examples include employees at companies, students enrolled 
in universities, officers in the military, law enforcement personnel, 
members of social organizations, and lists of acquaintances. 
Knowing who works where, attends what, or affiliates with whom 
provides strategic knowledge to competitors, marketers, and government 
surveillance efforts. However, finding online rosters of people does 
not lend itself to keyword lookup on search engines because the 
keywords tend to be common expressions such as "employees" or "students."
A typical search often retrieves hundreds of Web pages requiring many 
hours of human inspection to locate a page containing a list of names. 
As a result, people may falsely believe online rosters provide more 
privacy than they do. This paper presents RosterFinder, a set of 
simple algorithms for locating Web pages that consist predominantly 
of a list of names. The specific names are not known beforehand. 
RosterFinder works by identifying rosters from candidate Web pages 
based on the ratio of distinct known names to distinct words appearing 
in the page. Accurate classification by RosterFinder depends on the 
set of names used. Results are reported on real Web pages using: 
(1) dictionary lookup employing a limited set of known names; and, 
(2) dictionary lookup utilizing an extensive set of known names. 
Privacy implications are discussed using the example of FERPA and 
online student rosters.
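
The ratio test described above can be illustrated with a minimal sketch. This is not the report's implementation: the tiny name dictionary, the letters-only tokenization, the function names, and the 0.5 threshold are all assumptions chosen for illustration. The sketch only shows the idea of scoring a page by the ratio of distinct known names to distinct words.

```python
import re

# Hypothetical, very small dictionary of known names (assumption);
# the name lists used in the report are not reproduced here.
KNOWN_NAMES = frozenset(
    {"alice", "bob", "carol", "david", "smith", "jones", "nguyen"}
)


def name_ratio(text, known_names=KNOWN_NAMES):
    """Return (distinct known names) / (distinct words) for a page's text."""
    # Assumed tokenization: runs of ASCII letters, lowercased.
    words = {w.lower() for w in re.findall(r"[A-Za-z]+", text)}
    if not words:
        return 0.0
    return len(words & known_names) / len(words)


def looks_like_roster(text, threshold=0.5, known_names=KNOWN_NAMES):
    """Classify a page as a roster when its name ratio meets the threshold.

    The default threshold is an illustrative assumption, not a value
    taken from the report.
    """
    return name_ratio(text, known_names) >= threshold


if __name__ == "__main__":
    roster = "Alice Smith\nBob Jones\nCarol Nguyen\nDavid Smith"
    prose = "Our employees and students enjoy working with Alice"
    print(name_ratio(roster), looks_like_roster(roster))
    print(name_ratio(prose), looks_like_roster(prose))
```

As the abstract notes, classification accuracy depends on the set of names used: with a limited dictionary, a genuine roster full of names missing from the dictionary would score low under this sketch.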
 
22 pages 
*Institute for Software Research International, School of Computer
Science, Carnegie Mellon University