CMU-CS-23-111
Computer Science Department
School of Computer Science, Carnegie Mellon University

Using Computer Vision and Machine
Jun Tao Luo
M.S. Thesis
May 2023
Historical administrative records (e.g., property transfers, birth certificates, census data) can be extremely valuable for academic research and industry applications. However, such data is rarely digitized or available in an analyzable format. We demonstrate how machine learning and computer vision methods can be combined into a cost-effective digitization technique for historical property tax assessment records. We show that image processing combined with deep learning optical character recognition (OCR) models retrieves records with a mean absolute percentage error (MAPE) of 14.72%. For cases where OCR cannot be applied, such as when scanned documents are not available, we combine a small sample of manually labeled historical data with contemporary feature data to build regression models that retrieve records with a higher error of 17.48% MAPE. Both methods offer substantial savings over manually digitizing the same data: OCR reduces cost by 78%, and the regression model reduces cost by 89%.
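For readers unfamiliar with the metric, the following is a minimal sketch of how a mean absolute percentage error is conventionally computed. The function name `mape` and the sample values are illustrative assumptions, not taken from the thesis, and the thesis may apply additional handling (for example, for zero-valued records) that is not shown here.

```python
def mape(actual, predicted):
    """Return the mean absolute percentage error, expressed in percent.

    Standard definition: the mean of |actual - predicted| / |actual|
    over all records, times 100. This is a generic sketch, not the
    thesis's evaluation code.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one record is required")

    total = 0.0
    for a, p in zip(actual, predicted):
        if a == 0:
            # The percentage error is undefined when the true value is zero.
            raise ValueError("actual values must be non-zero")
        total += abs((a - p) / a)
    return 100.0 * total / len(actual)


# Hypothetical assessed values (manually transcribed vs. retrieved).
true_values = [100.0, 200.0, 400.0]
retrieved_values = [110.0, 180.0, 400.0]
print(f"MAPE: {mape(true_values, retrieved_values):.2f}%")
```

Under this definition, a lower MAPE means the retrieved values are closer to the manually transcribed ones, so 14.72% (OCR) indicates a smaller average relative error than 17.48% (regression).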
83 pages
Thesis Committee:
Srinivasan Seshan, Head, Computer Science Department