Lane Center for Computational Biology
School of Computer Science, Carnegie Mellon University


Active Learning for Drug Discovery

Joshua Kangas

February 2013

Ph.D. Thesis


Keywords: Drug discovery, active learning, machine learning, computational biology, high-throughput screening, high-content screening, protoplasts, bioimage informatics

The use of high-throughput screening methods has aided the drug discovery process, allowing numerous compounds to be tested for effects on a single target protein. However, because high-throughput screening focuses primarily on a single target, undesirable secondary effects are often detected late in the development process, after substantial investment has been made. To better detect effects on a whole system, high-content screening methods have been developed that use imaging technology in conjunction with machine learning to detect effects on living systems resulting from exposure to a drug stimulus. These methods have primarily been applied in animal systems, and we therefore explored approaches to extending high-content screening to plant cells. A pilot high-content screening approach was developed and used to test the effects of nine compounds on protoplasts from six lines of Arabidopsis thaliana expressing different fluorescently-tagged proteins. Various image analysis and machine learning techniques were used to determine which compounds affected the subcellular distributions of the proteins.

Both high-throughput and high-content screening methods are limited primarily in that very few target proteins are measured directly in these experiments. An alternative approach would be to perform a more global screen against many undesired effects early in the process, but the number of combinations of potential drugs and secondary targets makes this prohibitively expensive. Methods for making this global approach feasible through active machine learning were therefore developed. The active learning approach iteratively constructs models to predict the results of unobserved experiments and uses these models to guide experimentation efforts. Such methods were developed and applied to screening data for 20,000 compounds on 177 assays. It was shown through simulations that nearly 60% of all hits (compounds that have an effect on a particular assay) could be identified after exploring only 3% of the experimental space. Finally, an automated approach was developed for creating NIH 3T3 cell lines expressing fluorescently-tagged proteins via CD-tagging and for identifying the tagged protein. This was used to create a set of lines used to test active learning for detection of compound effects on the location patterns of the tagged proteins.
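
As a rough illustration of the iterative loop described above (build a model from observed experiments, predict the unobserved ones, choose the next batch, repeat), the following is a minimal sketch only, not the method used in the thesis. The synthetic data generator, the smoothed hit-rate model, the greedy selection rule, and all function and parameter names are assumptions made for this example. For scale, 20,000 compounds on 177 assays give 3,540,000 compound-assay pairs, so 3% of the experimental space is about 106,200 experiments; the sketch uses a much smaller matrix.

```python
import random


def make_synthetic_hits(n_compounds, n_assays, seed=0):
    """Hypothetical ground truth: a few compounds hit many assays, most hit few."""
    rng = random.Random(seed)
    hits = set()
    for c in range(n_compounds):
        # Assumed structure: about 10% of compounds are broadly active.
        p_hit = 0.5 if rng.random() < 0.1 else 0.01
        for a in range(n_assays):
            if rng.random() < p_hit:
                hits.add((c, a))
    return hits


def smoothed_rate(hits, trials, prior=0.05, strength=2.0):
    """Observed hit rate shrunk toward an assumed prior rate."""
    return (hits + prior * strength) / (trials + strength)


def active_learning(truth, n_compounds, n_assays,
                    budget_fraction=0.03, batch_size=20, seed=0):
    """Run experiments in batches, choosing each batch from model predictions.

    The model is a stand-in: the predicted hit probability of a
    (compound, assay) pair is the mean of the smoothed hit rates observed
    so far for that compound and for that assay.
    """
    rng = random.Random(seed)
    unobserved = {(c, a) for c in range(n_compounds) for a in range(n_assays)}
    budget = round(budget_fraction * len(unobserved))
    observed = {}
    row = [[0, 0] for _ in range(n_compounds)]  # [hits, trials] per compound
    col = [[0, 0] for _ in range(n_assays)]     # [hits, trials] per assay

    def run_experiment(cell):
        c, a = cell
        outcome = cell in truth  # stands in for a real measurement
        observed[cell] = outcome
        unobserved.discard(cell)
        row[c][0] += outcome
        row[c][1] += 1
        col[a][0] += outcome
        col[a][1] += 1

    def score(cell):
        c, a = cell
        return 0.5 * (smoothed_rate(*row[c]) + smoothed_rate(*col[a]))

    # Seed the model with one random batch.
    for cell in rng.sample(sorted(unobserved), min(batch_size, budget)):
        run_experiment(cell)

    # Update the model implicitly via row/col counts, then pick the
    # unobserved pairs with the highest predicted hit probability.
    while len(observed) < budget and unobserved:
        k = min(batch_size, budget - len(observed))
        ranked = sorted(unobserved, key=lambda cell: (-score(cell), cell))
        for cell in ranked[:k]:
            run_experiment(cell)
    return observed


if __name__ == "__main__":
    truth = make_synthetic_hits(200, 20, seed=1)
    observed = active_learning(truth, 200, 20, seed=1)
    found = sum(observed.values())
    print(f"explored {len(observed)} pairs, found {found} of {len(truth)} hits")
```

Greedy selection by predicted hit probability is only one possible choice; a strategy that also samples pairs where the model is uncertain would trade some immediate hits for a better model in later rounds.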

Our results suggest that active learning can be used to enable more complete characterization of compound effects across a diverse set of assays than otherwise affordable. The methods described are also likely to find widespread application in biomedical research.

155 pages
