CMU-CS-21-129
Computer Science Department
School of Computer Science, Carnegie Mellon University



Deep Learning Based Data Augmentation for
Breast Lesion Detection

Zhendong Yuan

M.S. Thesis

August 2021

CMU-CS-21-129.pdf


Keywords: Breast lesion detection systems, data imbalance, SMOTE, up-sampling, down-sampling, image processing, ResNet, GAN, deep learning

Deep learning has become increasingly popular across a wide range of applications in the past few years. Improvements in hardware and machine learning models have made it possible to train deeper and wider networks that achieve state-of-the-art (SOTA) performance in those applications. However, researchers still have to overcome several obstacles before a model is useful in practice. One common obstacle is the data itself. The training data collected from a small hospital may be limited in quantity, and a pre-trained model taken from other hospitals may generalize poorly because of differences in the X-ray machines and the environment in which the mammograms are taken [41]. Moreover, since most mammograms come from patients who have no illness, the training data can be seriously imbalanced between positive and negative cases. A model trained on such data can trivially achieve extremely high overall accuracy by predicting every case as normal, yet it has no practical value. Lesion/cancer detection, however, requires predictions that are accurate for both positive and negative cases, resilient to noise, and consistent across different data sources.
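The point about misleading accuracy can be made concrete with a small sketch. The counts below (990 normal cases, 10 lesion cases) are hypothetical and chosen only for illustration; they are not taken from the thesis or its dataset.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 = lesion, 0 = normal."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


# Hypothetical imbalanced set: 990 normal cases and 10 lesion cases.
y_true = [0] * 990 + [1] * 10
# A degenerate "model" that predicts every case as normal.
y_pred = [0] * len(y_true)

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
acc = (tp + tn) / len(y_true)            # overall accuracy
sensitivity = tp / (tp + fn)             # recall on lesion cases
specificity = tn / (tn + fp)             # recall on normal cases
balanced_acc = (sensitivity + specificity) / 2

print(acc, sensitivity, specificity, balanced_acc)
```

Under these assumed counts, overall accuracy is high while sensitivity is zero, which is why per-class metrics such as sensitivity or balanced accuracy are more informative than overall accuracy on imbalanced data.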

In this thesis, we provide workarounds for the issues mentioned above. Our experiments are based on the UPITT mammogram dataset, which comprises 79501 images collected from approximately 22267 distinct patients. To deal with the dataset size restriction and to achieve localized explanations, we use a patch-based model for lesion classification. We extract the normal patches from the breast tissue in images with a BIRADS level of 1. The lesion patches are extracted, using computer vision techniques, from the ROI (region of interest) labeled by the radiologist in images with a BIRADS level of 0 or 2. We design our own techniques to deal with the serious data imbalance via deep learning-based SMOTE [9] and GAN [6, 12, 18, 28], and we test those techniques with a deep convolutional model similar to VGG16 [35].
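For readers unfamiliar with SMOTE, the following is a minimal sketch of the classical interpolation step on plain feature vectors. It is not the deep learning-based variant developed in the thesis, whose details are not given in this abstract; the function name `smote_sketch`, the parameter defaults, and the toy points are all assumptions made for illustration.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=3, seed=0):
    """Generate synthetic minority samples by interpolating between
    a randomly chosen sample and one of its k nearest minority neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbors (Euclidean distance).
        neighbors = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        neighbor = minority[rng.choice(neighbors)]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy minority class: four hypothetical 2-D feature vectors.
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority_points, n_synthetic=6)
print(new_points)
```

Each synthetic point lies on a line segment between two existing minority samples, so the minority class is up-sampled without duplicating samples exactly.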

54 pages

Thesis Committee:
Adam Perer (Chair)
Zachary Lipton

Srinivasan Seshan, Head, Computer Science Department
Martial Hebert, Dean, School of Computer Science

