CMU-CS-23-111
Computer Science Department
School of Computer Science, Carnegie Mellon University




Using Computer Vision and Machine
Learning to Unlock Historical Data

Jun Tao Luo

M.S. Thesis

May 2023



Keywords: Machine Learning, Computer Vision, OCR, Historical Records

Historical administrative records (e.g., property transfers, birth certificates, census data) can be extremely valuable for academic research and industry applications. However, such data is rarely digitized or accessible in analyzable formats.

We demonstrate how machine learning and computer vision methods can be combined to create a cost-effective digitization technique for historical property tax assessment records. We show that image processing combined with deep learning optical character recognition (OCR) models retrieves records with a mean absolute percentage error (MAPE) of 14.72%. For cases where OCR cannot be applied, such as when scanned documents are not available, we combine a small sample of manually labeled historical data with contemporary feature data to build regression models that retrieve records less accurately, with a MAPE of 17.48%. Both methods offer substantial savings over manually digitizing the same data: OCR achieves a cost reduction of 78%, and the regression model achieves a cost reduction of 89%.
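
For readers unfamiliar with the metric, the following is a minimal sketch of how MAPE is conventionally computed. It assumes the standard definition (the mean of |actual - predicted| / |actual|, expressed as a percentage); the thesis's exact evaluation procedure may differ. The function name `mape` and the sample values are hypothetical and are not data from the thesis.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, returned as a percentage.

    Assumes the conventional definition; records whose true value is
    zero are rejected because the ratio is undefined for them.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one record is required")

    total = 0.0
    for a, p in zip(actual, predicted):
        if a == 0:
            raise ValueError("MAPE is undefined when a true value is zero")
        total += abs((a - p) / a)
    return 100.0 * total / len(actual)


# Hypothetical assessed values (illustration only, not thesis data).
true_values = [1000.0, 2000.0, 4000.0]
retrieved_values = [900.0, 2200.0, 4000.0]

score = mape(true_values, retrieved_values)
print(f"MAPE: {score:.2f}%")
```

Under this definition, a lower MAPE means the retrieved values are closer to the true records, so the 14.72% reported for OCR indicates more accurate retrieval than the 17.48% reported for the regression models.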

83 pages

Thesis Committee:
Matthew Gormley (Chair)
Rayid Ghani

Srinivasan Seshan, Head, Computer Science Department
Martial Hebert, Dean, School of Computer Science

