CMU-S3D-24-103
Software and Societal Systems Department
School of Computer Science, Carnegie Mellon University
Automatic Identification and Extraction
Tong Jiao, Norman Sadeh
April 2024
Privacy policies are known to be long and challenging for people to read. To overcome this issue, a variety of natural language processing (NLP) techniques have been developed over the past dozen years to automate the analysis of these documents. The initial step in this process is the automatic extraction of privacy policies, typically from a website. While a number of tools have been proposed to extract privacy policy text, these tools all suffer from various limitations. These limitations stem from the fact that webpages hosting privacy policies often contain additional unrelated text and that privacy policies are increasingly presented in a layered format. Additional challenges include URLs that point to the wrong page or to pages that do not exist. We discuss the development and testing of an enhanced scraping tool that leverages generative AI to interact with web pages and scrape privacy policies more effectively. Starting from privacy policy URLs provided by service providers, we aim to collect the full text of privacy policies. We compare the performance of our technique with that of previously proposed tools using a corpus of 275 privacy policy URLs for mobile apps in the iOS app store. The URLs were selected to include both popular and less popular apps. Our findings indicate that our proposed technique successfully retrieved a privacy policy from 90.2% of the tested URLs. We compared this performance against the Polipy tool and a naive scraper: Polipy’s success rate was 77.8%, and the naive scraper performed even worse. This research underscores the potential of generative AI to significantly improve the automation of privacy policy scraping.
15 pages