CMU-S3D-24-103
Software and Societal Systems Department
School of Computer Science, Carnegie Mellon University




Automatic Identification and Extraction
of Privacy Policy Text: GenScraper

Tong Jiao, Norman Sadeh

April 2024



Keywords: Natural Language Processing, Privacy Policies

Privacy policies are known to be long and challenging for people to read. To overcome this issue, a variety of natural language processing (NLP) techniques have been developed over the past dozen years to automate the analysis of these documents. The initial step in this process is the automatic extraction of privacy policy text, typically from a website. While a number of tools have been proposed for this purpose, they all suffer from limitations: webpages hosting privacy policies often contain additional unrelated text, and privacy policies are increasingly presented in a layered format. Additional challenges include URLs that point to the wrong page or to pages that do not exist. We discuss the development and testing of an enhanced scraping tool that leverages generative AI to interact with web pages and scrape privacy policies more effectively. Starting from privacy policy URLs provided by service providers, we aim to collect the full text of privacy policies. We compare the performance of our technique with that of previously proposed tools using a corpus of 275 privacy policy URLs for mobile apps in the iOS App Store. The URLs were selected to include both popular and less popular apps. Our findings indicate that our proposed technique successfully retrieved a privacy policy from 90.2% of the tested URLs. Performance was compared against the Polipy tool and a naive scraper: Polipy’s success rate was 77.8%, and the naive scraper performed even worse. This research underscores the potential of generative AI to significantly improve the automation of privacy policy scraping.

15 pages

