CMU-ML-10-106
Machine Learning Department
School of Computer Science, Carnegie Mellon University



Rare Category Analysis

Jingrui He

May 2010

Ph.D. Thesis

CMU-ML-10-106.pdf


Keywords: Majority class, minority class, rare category, supervised, unsupervised, detection, characterization, feature selection


In many real-world problems, rare categories (minority classes) play an essential role despite their extreme scarcity. For example, in financial fraud detection, the vast majority of financial transactions are legitimate, and only a small number may be fraudulent; in Medicare fraud detection, the percentage of bogus claims is small, but the total loss is significant; in network intrusion detection, malicious network activities are hidden among huge volumes of routine network traffic; in astronomy, only 0.001% of the objects in sky survey images are truly beyond the scope of current science and may lead to new discoveries; in spam image detection, near-duplicate spam images are difficult to discover among the large number of non-spam images; in rare disease diagnosis, rare diseases affect fewer than 1 in 2000 people, but the consequences can be very severe. Therefore, the discovery, characterization and prediction of rare categories or rare examples may protect us from fraudulent or malicious behaviors, aid scientific discovery, and even save lives.

This thesis focuses on rare category analysis, where the majority classes have a smooth distribution and the minority classes exhibit a compactness property. Furthermore, we focus on the challenging cases where the support regions of the majority and minority classes overlap. To the best of our knowledge, this thesis is the first end-to-end investigation of rare categories.

Depending on the availability of label information, we can perform either supervised or unsupervised rare category analysis. In the supervised setting, our first task is rare category detection, which is to discover at least one example from each minority class with the help of a labeling oracle. Then, given labeled examples from all the classes, our second task is rare category characterization. The goal here is to find a compact representation for the minority classes in order to identify all the rare examples with high precision and recall. In the unsupervised setting, on the other hand, we do not have access to a labeling oracle. Here we propose to co-select candidate examples from the minority classes and the relevant features, which benefits both tasks (rare category selection and feature selection). For each of the above tasks, we have developed effective algorithms with theoretical guarantees as well as good empirical results.

In the future, we plan to apply rare category analysis to rich data, such as medical images, texts/blogs, Electronic Health Records (EHR), web link graphs, and stream data; to build statistical models for the rare categories in order to understand how they emerge and evolve over time; to study complex fraud based on rare category analysis; to make use of transfer learning to help with our analysis; and to build a complete system for rare category analysis.

113 pages

