Computational phenotype discovery using unsupervised feature learning over noisy, sparse, and irregular clinical data

Thomas A Lasko; Joshua C Denny; Mia A Levy

doi:10.1371/journal.pone.0066341

PLoS One. 2013 Jun 24;8(6):e66341. doi: 10.1371/journal.pone.0066341. Print 2013.

Authors

Thomas A Lasko¹, Joshua C Denny, Mia A Levy

Affiliation

¹ Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, Tennessee, USA. tom.lasko@vanderbilt.edu

Abstract

Inferring precise phenotypic patterns from population-scale clinical data is a core computational task in the development of precision, personalized medicine. The traditional approach uses supervised learning, in which an expert designates which patterns to look for (by specifying the learning task and the class labels), and where to look for them (by specifying the input variables). While appropriate for individual tasks, this approach scales poorly and misses the patterns that we don't think to look for. Unsupervised feature learning overcomes these limitations by identifying patterns (or features) that collectively form a compact and expressive representation of the source data, with no need for expert input or labeled examples. Its rising popularity is driven by new deep learning methods, which have produced high-profile successes on difficult standardized problems of object recognition in images. Here we introduce its use for phenotype discovery in clinical data. This use is challenging because the largest source of clinical data - Electronic Medical Records - typically contains noisy, sparse, and irregularly timed observations, rendering them poor substrates for deep learning methods. Our approach couples dirty clinical data to deep learning architecture via longitudinal probability densities inferred using Gaussian process regression. From episodic, longitudinal sequences of serum uric acid measurements in 4368 individuals we produced continuous phenotypic features that suggest multiple population subtypes, and that accurately distinguished (0.97 AUC) the uric-acid signatures of gout vs. acute leukemia despite not being optimized for the task. The unsupervised features were as accurate as gold-standard features engineered by an expert with complete knowledge of the domain, the classification task, and the class labels. Our findings demonstrate the potential for achieving computational phenotype discovery at population scale. 
We expect such data-driven phenotypes to expose unknown disease variants and subtypes and to provide rich targets for genetic association studies.
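
The coupling step described in the abstract, turning sparse and irregularly timed measurements into a longitudinal density with Gaussian process regression, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the squared-exponential kernel, the hyperparameter values (`length_scale`, `signal_var`, `noise_var`), the constant mean set to the sample mean, and the made-up measurement values. The sketch only shows how a posterior mean and variance on a regular time grid can be obtained from a handful of irregular observations.

```python
import numpy as np


def sq_exp_kernel(a, b, length_scale, signal_var):
    """Squared-exponential covariance between two 1-D arrays of times."""
    diff = a[:, None] - b[None, :]
    return signal_var * np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(t_obs, y_obs, t_grid,
                 length_scale=30.0, signal_var=1.0, noise_var=0.1):
    """Posterior mean and variance of a GP evaluated on t_grid.

    Assumptions (illustrative only): constant mean equal to the sample
    mean, squared-exponential kernel, hand-picked hyperparameters.
    """
    y_mean = y_obs.mean()
    # Covariance among observations, with observation noise on the diagonal.
    K = sq_exp_kernel(t_obs, t_obs, length_scale, signal_var)
    K += noise_var * np.eye(len(t_obs))
    # Covariance between grid points and observations.
    K_s = sq_exp_kernel(t_grid, t_obs, length_scale, signal_var)

    # Solve through a Cholesky factor rather than an explicit inverse.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs - y_mean))
    mean = y_mean + K_s @ alpha

    v = np.linalg.solve(L, K_s.T)
    var = signal_var - np.sum(v ** 2, axis=0)
    return mean, np.maximum(var, 0.0)  # clip tiny negative round-off


# Hypothetical sequence: days since first measurement, value in mg/dL.
t_obs = np.array([0.0, 3.0, 40.0, 45.0, 200.0, 365.0])
y_obs = np.array([8.1, 7.9, 6.5, 6.8, 5.9, 6.1])

# Regular grid, one point every 5 days over one year.
t_grid = np.linspace(0.0, 365.0, 74)
mean, var = gp_posterior(t_obs, y_obs, t_grid)
```

In this sketch the variance is small near clusters of measurements and grows back toward `signal_var` in long gaps, which is how the uncertainty caused by sparse sampling is carried forward. The regularly sampled `mean` and `var` arrays are the kind of fixed-length input a feature learner could consume, although the exact representation used in the paper is not specified in this abstract.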

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Humans
Learning*
Medical Records Systems, Computerized
Phenotype*

Grants and funding

R01 LM010685/LM/NLM NIH HHS/United States