CMU-CS-24-111
Computer Science Department
School of Computer Science, Carnegie Mellon University



Taxonomy for Data Contamination in
Large Language Models

Medha Palavalli

M.S. Thesis

May 2024



Keywords: Large Language Models, Data Contamination, Taxonomy, Contamination Ratio

Large language models pretrained on extensive web corpora demonstrate remarkable performance across a wide range of downstream tasks. However, a growing concern is data contamination, in which evaluation datasets are unintentionally included in the pretraining corpus, inflating measured model performance. Not all contamination appears in the pretraining data in its original evaluation form: contaminants may be altered versions of the test set that evade detection during decontamination. Despite these concerns, it is not fully understood how different types of contamination affect the performance of language models on downstream tasks. In this thesis, we present a taxonomy that categorizes the types of contamination an LLM can encounter during the pretraining phase and identify which types pose the highest risk. We analyze the impact of contamination on two key NLP tasks, summarization and question answering, revealing how each type of contamination influences task performance at evaluation time. Our findings yield concrete recommendations for prioritizing data decontamination efforts for pretraining.

58 pages

Thesis Committee:
Matt Gormley (Chair)
Lori Levin

Srinivasan Seshan, Head, Computer Science Department
Martial Hebert, Dean, School of Computer Science

