CMU-ISR-17-118R
Institute for Software Research
School of Computer Science, Carnegie Mellon University




Towards Automatic Classification of Privacy Policy Text

Fredrick Liu, Shomir Wilson, Peter Story, Sebastian Zimmeck, Norman Sadeh

June 2018

CMU-ISR-17-118R.pdf

Supersedes Institute for Software Research
Technical Report CMU-ISR-17-118.pdf


Also appears as Language Technology Institute
Technical Report CMU-LTI-17-010


Keywords: Privacy, machine learning, classification, CNN, neural network, privacy policy

Privacy policies notify Internet users about the privacy practices of websites, mobile apps, and other products and services. However, users rarely read them and struggle to understand their contents. Also, the entities that provide these policies are sometimes unmotivated to make them comprehensible. Recently, annotated corpora of privacy policies have been introduced to the research community. They open the door to the development of machine learning and natural language processing techniques to automate the annotation of these documents. In turn, these annotations can be passed on to interfaces (e.g., web browser plugins) that help users quickly identify and understand relevant privacy statements. We present advances in extracting privacy policy paragraphs (termed segments in this paper) and individual sentences that relate to expert-identified categories of policy contents, using methods in supervised learning. In particular, we show that relevant segments and sentences can be classified with average micro-F1 scores of 0.78 and 0.66 respectively, improving over prior work. We discuss how the techniques introduced in this paper have been used to automatically annotate the text of about 7,000 privacy policies. Our discussion highlights opportunities as well as limitations associated with our classification approach.

11 pages

