
Bayesian Nonparametric Collaborative Topic Poisson Factorization for Electronic Health Records-Based Phenotyping

Wonsung Lee, Youngmin Lee, Heeyoung Kim, and Il-Chul Moon

Department of Industrial and Systems Engineering, KAIST
291 Daehak-ro, Yuseong-gu, Daejeon 305-701, Republic of Korea

{aporia,lym1989,heeyoungkim,icmoon}@kaist.ac.kr

Abstract

Phenotyping with electronic health records (EHR) has received much attention in recent years because it opens a new way to discover clinically meaningful insights, such as disease progression and disease subtypes, without human supervision. Despite its potential benefits, the complex nature of EHR often requires more sophisticated methodologies than traditional methods. Previous works on EHR-based phenotyping used unsupervised and supervised learning methods separately, detecting phenotypes and predicting medical risk scores independently. To improve EHR-based phenotyping by bridging these separate methods, we present Bayesian nonparametric collaborative topic Poisson factorization (BN-CTPF), which is the first nonparametric content-based Poisson factorization and the first application that jointly analyzes phenotype topics and estimates individual risk scores. BN-CTPF shows better performance in predicting the risk scores than previous matrix factorization and topic modeling methods, including Poisson factorization and its collaborative extensions. BN-CTPF also provides faceted views of the phenotype topics by patients' demographics. Finally, we demonstrate a scalable stochastic variational inference algorithm by applying BN-CTPF to a national-scale EHR dataset.

1 Introduction

Discovering phenotypes, or clinical attributes, of each individual can be a solid foundation for understanding the latent pathology of complex diseases and for protecting patients from the potential risks of those diseases. This paper presents a phenotyping method that predicts a patient's medical status from electronic health records (EHR). EHR phenotyping requires estimating the latent background of individual patients [Pathak et al., 2013], and it predicts observable medical status from the estimated latent variables to verify the validity of the estimated phenotypes. Therefore, this paper proposes a model 1) to discover latent phenotype patterns, which this paper refers to as phenotype topics or simply topics, and 2) to predict a patient's critical medical risk scores, such as comorbidity and polypharmacy.

EHR phenotyping is difficult in three respects. The first is the combination of unsupervised and supervised learning tasks. EHR phenotyping needs to extract critical latent information from the collected EHRs, which falls under the unsupervised domain [Bellazzi and Zupan, 2008]. It then uses the latent information to predict medical risks, which is a supervised learning task. The difference between the two tasks has often led researchers to pipeline two different analysis models, feeding the output of one analysis into the other in a step-wise manner. Because the two models are trained with different learning objectives, this pipelining can limit the accuracy of the combined model. Hence, an ideal model would combine the two separate models into a single model that represents the unique structure of EHR for phenotyping. The second challenge is reflecting the characteristics of the medical domain in the learning model. The model should incorporate the medical data structure, and it should anticipate and account for the noise and mixtures that arise from medical practice. The third difficulty is the scalability of the learning process. The phenotyping field has recognized the importance of large subject sizes for retrieving meaningful phenotyping results [Hripcsak and Albers, 2013].

This paper introduces a new statistical model, Bayesian Nonparametric Collaborative Topic Poisson Factorization, or BN-CTPF, for EHR phenotyping. BN-CTPF extracts the demographic latent information by relating it to the prediction of medical risks. Specifically, BN-CTPF extracts phenotype topics, which are combinations of diagnoses and medications, and then infers the correlations between pairs of topics, which we call topic associations. BN-CTPF models the topic associations as indirectly linked to the patient demographic information, and we call these indirect links topic-covariate associations. The topic-covariate associations provide multiple views of the phenotype topics, faceted by demographics. Finally, BN-CTPF predicts the medical risks by combining the topic associations and the topic-covariate associations, conditioned on a patient's demographic background. Unlike the previous pipelined approaches, these inferences are unified and scalable because they optimize a single objective function. The entire procedure and analysis flow of BN-CTPF is summarized in Figure 1. BN-CTPF is a consolidated statistical model that provides more statistically accurate predictions and more statistically coherent latent information. BN-CTPF is also able to analyze over one million EHRs gathered at the national level by using stochastic variational inference. Moreover, it is the first nonparametric model of collaborative topic Poisson factorization, and it utilizes a normalized gamma construction of hierarchical Dirichlet processes.

Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16)

2 Related Work

The first subsection reviews automated phenotyping on EHR with machine learning methodologies. The second subsection enumerates previous content-based recommendation studies, the family to which BN-CTPF conceptually belongs.

2.1 Automated EHR Phenotyping

While earlier works used knowledge-based approaches [McCarty et al., 2011], probabilistic topic models have recently been applied to unsupervised EHR phenotyping. For example, latent Dirichlet allocation [Blei et al., 2003], or LDA, generates medically coherent concepts, or topics, which can be used for prediction tasks in a subsequent model. [Saria et al., 2010] suggested a nonparametric topic model with temporal aspects and applied it to track the physiological signals of premature infants from a topical perspective. They also used a separate supervised learning model to predict the infants' risks. [Lehman et al., 2012] used hierarchical Dirichlet processes, or HDP [Teh et al., 2012], to learn topics from unstructured clinical notes, and they performed risk stratification for intensive care unit (ICU) patients with the topics. [Ghassemi et al., 2015] adopted multi-task Gaussian processes (GPs) along with topic models to summarize patients' multivariate physiological signals as well as topic proportions; the extracted topic proportions are fed into the multi-task GPs to learn the kernel hyperparameters. They found that the inferred hyperparameters were useful in predicting mortality with a separate logistic regression model. [Ghassemi et al., 2014] also used both topic modeling and SVM to predict patient mortality in a dynamic setting.

Others used approaches besides probabilistic topic models. For instance, [Lasko et al., 2013] adopted a deep learning auto-encoder and GPs to extract representative features of 4,368 patients having either gout or acute leukemia. [Tran et al., 2015] utilized restricted Boltzmann machines (RBM) to derive a new representation of medical objects, such as diseases and procedures, by mapping high-dimensional observations into a low-dimensional vector space. [Zhou et al., 2014] proposed a matrix imputation approach that remedies noisy EHRs for better phenotyping results. Recently, a tensor factorization-based approach [Ho et al., 2014], which decomposes multi-dimensional EHR observations into clinically meaningful tensors, or phenotypes, performed a separate prediction of heart failure by utilizing the obtained phenotypes.

BN-CTPF improves on the previous works in two respects. First, the step-wise fashion of the previous works limits their ability to find clinically meaningful phenotypes because the overall performance can be dominated by the chosen classifier. Furthermore, some classifiers involve onerous parameter tuning procedures that could be limited to a fixed phenotype. In contrast, BN-CTPF integrates the two models to jointly optimize the phenotype discovery and the subsequent prediction. Second, BN-CTPF estimates the associations between environmental factors and phenotypes, which have not been inferred by most unsupervised approaches, including tensor factorization. These associations are key to finding applicable research insights [Saria and Goldenberg, 2015]. For example, the existing models would not infer the relations between a patient's demographic background and the phenotypes, although these relations could be a useful source of information in medical practice.

2.2 Content-Based Recommendation

In the machine learning field, there is much literature devoted to collaborative filtering by matrix factorization for recommender systems. Under the assumption that users with similar records of events share similar traits, matrix factorization methods discover low-dimensional latent factors that capture the essential information for representing present records and predicting unobserved outcomes. Matrix factorization has been applied to a wide range of applications, e.g., recommender systems [Koren et al., 2009], document modeling [Canny, 2004], and disease risk prediction [Davis et al., 2010]. [Wang and Blei, 2011] developed collaborative topic regression, or CTR, to recommend scientific articles. CTR combines matrix factorization with topic modeling to collaboratively predict ratings and learn topics. However, the matrix factorization in CTR assumes that a rating follows the Gaussian distribution, whereas other distributions can be more appropriate in some cases, particularly when observations are sparse with implicit feedback [Gopalan et al., 2013]. Therefore, [Gopalan et al., 2014] presented collaborative topic Poisson factorization, or CTPF, which replaces the Gaussian assumption in the rating prediction and the multinomial-Dirichlet distributions in the topic modeling with the Poisson and the Poisson-gamma distributions, respectively.

While the content-based recommendation models introduced above are useful both for recommending tailored items to users and for providing interpretable reasons, three points should be considered when applying those models to EHR phenotyping. First, the models should be tailored to incorporate the EHR structure. For example, the relationship between diagnoses and medications should be treated with multiple types of observations, not a single type of observation (words) as in the topic models. Thus, we use two different types of observations, diagnoses and medications, to learn the phenotypes from both kinds of observations in EHRs. In addition, we use another type of observation, a patient's medical risk score, as a prediction target variable with Poisson factorization. Second, the model needs to consider that different medications are feasible for a single diagnosis in medical practice. To enable this modeling, we design BN-CTPF to have a mixed-membership representation for the medications as well as another mixed-membership representation for the diagnoses. The two mixed memberships are joined by the topic intensities at the patient dimension. Third, EHR phenotyping for medical risk scores needs to optimize continuous measures, e.g., mean absolute error (MAE) and root mean squared error (RMSE), because the scores are inherently continuous. However, the surveyed models were evaluated under measures for categorical results, e.g., recall and precision.

Figure 1: The entire procedure and analysis flow of BN-CTPF, in three stages: 1. EHR of patients (diagnosis $X^{(1)}$, medication $X^{(2)}$, demographics $R$, and medical risk $Y$); 2. discovery of phenotypes (topics) and prediction of medical risks by Poisson matrix factorization; 3. detailed analysis of phenotypes (topic-covariate associations and quality evaluation of phenotypes). The notation in this figure is the same as in the plate notation of the generative process of BN-CTPF in Figure 2. We omit $\epsilon$ in the Poisson matrix factorization for simplicity.

Figure 2: Plate notation for Bayesian Nonparametric Collaborative Topic Poisson Factorization, or BN-CTPF.

3 Methodology

In this section, we present a detailed description of BN-CTPF, which is a Bayesian nonparametric extension of the parametric CTPF. Because the exact posterior is intractable, we derive a mean-field stochastic variational inference algorithm to approximate it. We also provide a prediction procedure for unseen risk scores.

3.1 Model Description

BN-CTPF builds on the parametric CTPF, which models both user ratings and document-word counts by Poisson distributions. Unlike CTPF, BN-CTPF adopts a mixed-membership representation to describe medical contents, which are patients' diagnoses and medications. Each patient is realized from a mixture whose components are shared by all patients. This mixed-membership assumption allows the model to encode the heterogeneity of patients. From the perspective of topic modeling, BN-CTPF can be viewed as a collaborative extension of mixed-membership models built on HDP [Teh et al., 2012]. HDPs have been widely used to model grouped data in the Bayesian nonparametrics literature. When modeling a mixture, a model with an HDP lets the data determine how many components are needed, which means that it can adapt the model complexity as data become available.

Suppose that we have EHR containing three types of observations: $V_D$ unique diagnoses, $V_N$ unique medications, and clinical risk scores of $M$ individual patients over $P$ timesteps. We assume that each patient has records containing $N$ medications and $D$ diagnoses, but this assumption can easily be relaxed to patient-specific counts $N_m$ and $D_m$ for a patient $m$. Additionally, all patients have covariates $r_{mj}$, where $j = 1, \ldots, J$ indexes the dimensions of the covariates. These covariates reflect the patient demographics, such as age, gender, and region. For modeling convenience, we use medication words and diagnosis words to refer to the medication and diagnosis records of each patient, respectively.

Let $X^{(1)}_{md}$ denote which diagnosis is encoded in the $d$-th record out of the $D$ observed diagnoses for a patient $m$. $X^{(2)}_{mn}$ is defined in the same way for a medication record. BN-CTPF infers phenotype topics from co-occurrence patterns in both diagnoses and medications. Figure 1 represents the inferred topics by two distinct, yet coupled distributions over diagnoses and medications: a diagnosis topic $\eta_k$ and a medication topic $\lambda_k$, where $k = 1, \ldots, \infty$. Clinical risk scores are subject to prediction by the matrix factorization.

Here, we jointly model the latent phenotype topics and the risk scores. Let $Y$ be an $M$ by $P$ matrix describing the longitudinal risk scores of each patient, where $Y_{mp} \in \{1, 2, 3, \ldots\}$ is the integer risk score of a patient $m$ at a timestep $p$. BN-CTPF assumes that $Y_{mp}$ follows a Poisson distribution whose rate is the inner product of $(\pi_m + \epsilon_m)$ and $\theta_p$, where $\pi_m$ is the topic intensities of a patient $m$, $\epsilon_m$ is the topic offsets, $\theta_p$ is the topic intensities of the timestep $p$, and all of these are infinite-dimensional nonnegative vectors. The topic offsets $\epsilon_m$ are introduced to capture the inherent heterogeneity of individual patients, which is not fully explained by the topic intensities $\pi_m$.

We now describe a hierarchical Dirichlet process construction of BN-CTPF. For the top-level Dirichlet process (DP), we use stick-breaking processes [Sethuraman, 1994]:

$$\eta_k \sim G_0, \quad \lambda_k \sim H_0, \quad l_k \sim L_0, \quad w_k \sim W_0,$$
$$V_k \sim \mathrm{Beta}(1, \alpha), \quad p_k = V_k \prod_{l=1}^{k-1} (1 - V_l),$$
$$G = \sum_{k=1}^{\infty} p_k \, \delta_{(\eta_k, \lambda_k, l_k, w_k)},$$

where $G_0$, $H_0$, $L_0$, and $W_0$ are the base measures of the corresponding atoms. Also, $l_k$ is a $d$-dimensional latent location vector of topic $k$ that introduces topic associations, and $w_k$ is a weight parameter of topic $k$. For EHR with $J$ demographic factors of each patient, $w_k$ becomes a $J$-dimensional parameter. Lastly, $V_k$ is a top-level stick length, and $\alpha$ is a top-level concentration parameter. The components of an atom of $G$ are drawn as follows:

$$\eta_k \sim \mathrm{Dirichlet}_{V_D}(\eta_0), \quad \lambda_k \sim \mathrm{Dirichlet}_{V_N}(\lambda_0),$$
$$l_k \sim \mathrm{Normal}(0, \sigma_l^2 I_d), \quad w_{kj} \sim \mathrm{Normal}(0, \sigma_w^2).$$
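To make the stick-breaking weights $p_k$ concrete, the following minimal Python sketch draws them under a finite truncation. It is an illustration under stated assumptions, not the implementation used in the experiments: the function name `stick_breaking_weights` and the seeded generator are hypothetical, and the infinite sum is truncated at a level $T$ by fixing $V_T = 1$ so that the weights sum to one.

```python
import random


def stick_breaking_weights(alpha, truncation, rng):
    """Draw p_k = V_k * prod_{l<k} (1 - V_l) with V_k ~ Beta(1, alpha).

    The construction is truncated by setting the last stick length to 1,
    so the returned weights sum to one (up to floating-point error).
    """
    weights = []
    remaining = 1.0  # running product of (1 - V_l) for l < k
    for k in range(truncation):
        is_last = k == truncation - 1
        v_k = 1.0 if is_last else rng.betavariate(1.0, alpha)
        weights.append(v_k * remaining)
        remaining *= 1.0 - v_k
    return weights


# Illustrative values only: alpha = 20 and T = 200.
rng = random.Random(0)
p = stick_breaking_weights(alpha=20.0, truncation=200, rng=rng)
```

A larger $\alpha$ makes each stick length $V_k$ smaller on average, so the mass is spread over more topics before the truncation level is reached.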

For the second-level construction, we utilize a normalized gamma construction [Paisley et al., 2012] to introduce topic associations. We also combine the original construction with a different one [Kim and Oh, 2014] to introduce topic-covariate associations. The second-level construction, which describes patient-wise heterogeneity, is as follows:

$$u_m \sim \mathrm{Normal}(0, \sigma_u^2 I_d),$$
$$F_m(w_k, l_k) = l_k^T u_m + \sum_{j=1}^{J} w_{kj} r_{mj},$$
$$\pi_{mk} \sim \mathrm{Gamma}(\beta p_k, \exp\{-F_m(w_k, l_k)\}),$$
$$G_m = \sum_{k=1}^{\infty} \frac{\pi_{mk}}{\sum_{l=1}^{\infty} \pi_{ml}} \, \delta_{(\eta_k, \lambda_k)},$$

where $\beta$ is a concentration parameter.

A recently published work on correlated random measures (CorrRM) [Ranganath and Blei, 2015] introduces a unified framework that generalizes the construction processes of previous correlated random measures. BN-CTPF differs from CorrRM in that we consider additional dependency structures, named topic-covariate associations, which are not fully discussed in CorrRM. To model topic-covariate associations, we assume a Gaussian process (GP) with mean $\sum_{j=1}^{J} w_{kj} r_{mj}$, a weighted sum of the observed covariates. Unlike our model, CorrRM considers a GP with a random mean vector, which can be viewed as a function of latent covariates. While the GP framework is not introduced in our notation, the generative processes can easily be transformed into GP notation, as shown in [Paisley et al., 2012].

The $d$-dimensional vector $u_m$ is a latent location vector of a patient $m$, and $F$ controls the degree of the latent topic intensities. For example, as the two location vectors $l_k$ and $u_m$ get closer or a weight parameter $w_{kj}$ grows, the topic intensity $\pi_{mk}$ increases. The two-level construction ensures that all atoms $(\eta_k, \lambda_k, l_k, w_k)_{k=1}^{\infty}$ are shared across all patients, who exhibit them to different degrees.
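The role of $F_m(w_k, l_k)$ in the second-level construction can be sketched in a few lines of Python. The sketch assumes that the second argument of the gamma distribution is a rate, so that a larger $F_m$ yields a larger expected intensity; the function name `topic_intensities` and all numeric values are hypothetical and chosen only for illustration.

```python
import math
import random


def dot(x, y):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


def topic_intensities(p, locations, weights, u_m, r_m, beta, rng):
    """Draw pi_mk ~ Gamma(shape=beta * p_k, rate=exp(-F_mk)) for one patient.

    F_mk = l_k^T u_m + sum_j w_kj r_mj combines the topic association term
    and the topic-covariate association term.
    """
    intensities = []
    for p_k, l_k, w_k in zip(p, locations, weights):
        f_mk = dot(l_k, u_m) + dot(w_k, r_m)
        # random.gammavariate takes (shape, scale); scale = 1 / rate = exp(F_mk).
        intensities.append(rng.gammavariate(beta * p_k, math.exp(f_mk)))
    return intensities


# Toy setting: 5 topics, d = 2 latent dimensions, J = 3 covariates.
rng = random.Random(0)
p = [0.2] * 5
locations = [[0.1, -0.2], [0.3, 0.0], [-0.1, 0.4], [0.0, 0.0], [0.2, 0.2]]
weights = [[0.5, 0.0, -0.5], [0.0, 0.1, 0.0], [0.2, 0.2, 0.2],
           [-0.3, 0.0, 0.3], [0.0, 0.0, 0.0]]
u_m = [0.5, -0.5]
r_m = [1.0, 0.0, 1.0]
pi_m = topic_intensities(p, locations, weights, u_m, r_m, beta=5.0, rng=rng)
# Normalized intensities are the topic proportions that appear in G_m.
proportions = [x / sum(pi_m) for x in pi_m]
```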

Finally, we describe a generative process for 1) the observed diagnoses and medications of patients and 2) the observed individual risk scores under BN-CTPF:

1. For each topic $k = 1, \ldots, \infty$ and timestep $p = 1, \ldots, P$:
   (a) Draw $\theta_{pk} \sim \mathrm{Gamma}(a, b)$.
2. For each patient $m = 1, \ldots, M$:
   (a) For each diagnosis word $d = 1, \ldots, D$:
      i. Draw $C^{(1)}_{md} \sim \sum_{k=1}^{\infty} \frac{\pi_{mk}}{\sum_{l=1}^{\infty} \pi_{ml}} \delta_{(\eta_k)}$.
      ii. Draw $X^{(1)}_{md} \sim \mathrm{Discrete}(\eta_{C^{(1)}_{md}})$.
   (b) For each medication word $n = 1, \ldots, N$:
      i. Draw $C^{(2)}_{mn} \sim \sum_{k=1}^{\infty} \frac{\pi_{mk}}{\sum_{l=1}^{\infty} \pi_{ml}} \delta_{(\lambda_k)}$.
      ii. Draw $X^{(2)}_{mn} \sim \mathrm{Discrete}(\lambda_{C^{(2)}_{mn}})$.
   (c) For each topic $k = 1, \ldots, \infty$:
      i. Draw $\epsilon_{mk} \sim \mathrm{Gamma}(c, d)$.
   (d) For each timestep $p = 1, \ldots, P$:
      i. Draw $Y_{mp} \sim \mathrm{Poisson}\left(\sum_{k=1}^{\infty} (\pi_{mk} + \epsilon_{mk}) \theta_{pk}\right)$,

where $C^{(1)}_{md}$ and $C^{(2)}_{mn}$ are the per-diagnosis and per-medication topic indicators, respectively. Figure 2 shows the plate notation of BN-CTPF. We omit several base measures and priors, such as $G_0$, $G$, and $G_m$, for simplicity.
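The patient-level steps of the generative process can be illustrated with a short, standard-library-only Python sketch. It assumes a finite number of topics and takes the intensities, offsets, and topics as given; `sample_poisson` (Knuth's multiplication method) and `generate_patient` are hypothetical helper names, and the toy parameter values carry no clinical meaning.

```python
import math
import random


def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def generate_patient(pi_m, eps_m, eta, lam, theta, n_diag, n_med, rng):
    """Generate diagnoses, medications, and risk scores for one patient."""
    topics = range(len(pi_m))
    # Topic indicators follow the normalized intensities pi_mk / sum_l pi_ml;
    # random.choices normalizes the weights internally.
    c_diag = rng.choices(topics, weights=pi_m, k=n_diag)
    c_med = rng.choices(topics, weights=pi_m, k=n_med)
    # Each word is drawn from the discrete distribution of its assigned topic.
    diagnoses = [rng.choices(range(len(eta[k])), weights=eta[k])[0] for k in c_diag]
    medications = [rng.choices(range(len(lam[k])), weights=lam[k])[0] for k in c_med]
    # Y_mp ~ Poisson(sum_k (pi_mk + eps_mk) * theta_pk) for each timestep p.
    risks = []
    for theta_p in theta:
        rate = sum((pi + eps) * th for pi, eps, th in zip(pi_m, eps_m, theta_p))
        risks.append(sample_poisson(rate, rng))
    return diagnoses, medications, risks


# Toy setting: 3 topics, 4 diagnosis codes, 2 medication codes, 2 timesteps.
rng = random.Random(0)
pi_m = [0.6, 0.3, 0.1]
eps_m = [0.05, 0.05, 0.05]
eta = [[0.7, 0.2, 0.1, 0.0], [0.1, 0.1, 0.4, 0.4], [0.25, 0.25, 0.25, 0.25]]
lam = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
theta = [[1.0, 2.0, 0.5], [0.5, 0.5, 3.0]]
diagnoses, medications, risks = generate_patient(
    pi_m, eps_m, eta, lam, theta, n_diag=5, n_med=4, rng=rng)
```

Because the diagnosis and medication indicators share the same intensities $\pi_m$, the two kinds of words are coupled at the patient level even though they are drawn from separate topic distributions.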

3.2 Stochastic Variational Inference of BN-CTPF

In many hierarchical Bayesian models, including nonparametric models, computing an exact posterior is intractable. Therefore, we derive a stochastic variational inference (SVI) algorithm based on mean-field variational families [Hoffman et al., 2013]. To facilitate the posterior inference of BN-CTPF, as in the inference of CTPF [Gopalan et al., 2014], we incorporate two kinds of auxiliary latent variables for the risk scores $Y_{mp}$. The first kind consists of $K$ latent variables $Z^a_{mp,k} \sim \mathrm{Poisson}(\pi_{mk}\theta_{pk})$, and the second kind consists of $K$ latent variables $Z^b_{mp,k} \sim \mathrm{Poisson}(\epsilon_{mk}\theta_{pk})$. Because a sum of independent Poisson variables is again Poisson, setting $Y_{mp} = \sum_k (Z^a_{mp,k} + Z^b_{mp,k})$ preserves the Poisson distribution of $Y_{mp}$.

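Next, we posit the fully factorized variational family:

$$Q := \prod_{k=1}^{T} q(V_k)\, q(\eta_k)\, q(\lambda_k)\, q(l_k) \prod_{j=1}^{J} q(w_{kj}) \prod_{p=1}^{P} q(\theta_{pk}) \prod_{m=1}^{M} q(u_m)\, q(\pi_{mk})\, q(\epsilon_{mk})\, q(Z_{mp}) \prod_{d=1}^{D} q(C^{(1)}_{md}) \prod_{n=1}^{N} q(C^{(2)}_{mn}),$$

where $T$ is the truncation level and $Z_{mp} = (Z^a_{mp}, Z^b_{mp})$.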

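We assume the following variational distribution for each variable:

$$q(V_k)\, q(l_k)\, q(w_{kj})\, q(u_m) = \delta_{\hat{V}_k}\, \delta_{\hat{l}_k}\, \delta_{w_{kj}}\, \delta_{u_m},$$
$$q(\eta_k) = \mathrm{Dirichlet}(\eta_k \mid \gamma^{\eta}_{k,1}, \ldots, \gamma^{\eta}_{k,V_D}),$$
$$q(\lambda_k) = \mathrm{Dirichlet}(\lambda_k \mid \gamma^{\lambda}_{k,1}, \ldots, \gamma^{\lambda}_{k,V_N}),$$
$$q(\theta_{pk}) = \mathrm{Gamma}(a^{\theta}_{pk}, b^{\theta}_{pk}),$$
$$q(\pi_{mk}) = \mathrm{Gamma}(a^{\pi}_{mk}, b^{\pi}_{mk}),$$
$$q(\epsilon_{mk}) = \mathrm{Gamma}(a^{\epsilon}_{mk}, b^{\epsilon}_{mk}),$$
$$q(C^{(1)}_{md}) = \mathrm{Multinomial}(C^{(1)}_{md} \mid \phi^{(1)}_{md,1}, \ldots, \phi^{(1)}_{md,T}),$$
$$q(C^{(2)}_{mn}) = \mathrm{Multinomial}(C^{(2)}_{mn} \mid \phi^{(2)}_{mn,1}, \ldots, \phi^{(2)}_{mn,T}),$$
$$q(Z_{mp}) = \mathrm{Multinomial}(Z_{mp} \mid \phi^{(3)}_{mp,1}, \ldots, \phi^{(3)}_{mp,2T}),$$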

where these distributions are parameterized by their own variational parameters, denoted by $\Psi$. At each iteration $t$, we select 1) a set of observations, $\Omega_{B_t}$, from a subset of patients, $B_t \subset \{1, \ldots, M\}$; 2) a set of batch-specific variational parameters $\Psi_{B_t}$; and 3) a set of global variational parameters $\Psi_0$. With these three sets of information, BN-CTPF stochastically optimizes the following objective function:

$$\mathcal{L}_t(\Omega_{B_t}, \Psi_{B_t}, \Psi_0) = \frac{M}{|B_t|} \sum_{i \in B_t} E_Q[\log p(\Omega_i, \Theta_i \mid \Theta_0)] + \frac{M}{|B_t|} \sum_{i \in B_t} H[Q(\Theta_i)] + E_Q[\log p(\Theta_0)] + H[Q(\Theta_0)],$$

where $\Omega$, $\Theta$, and $H[Q]$ denote all observations, all hidden variables, and the entropy of the distribution $Q$, respectively. Additionally, $\Theta_m$ and $\Theta_0$ denote the set of batch-specific hidden variables and the set of global hidden variables, respectively.

In the local updates of SVI, at each iteration $t$, we update the variational distributions over $\Theta_m$ for $m \in B_t$ with closed-form equations until they converge, while holding the global variational parameters $\Psi_0$ fixed. In the global updates of SVI, we update the variational distributions over $\Theta_0$ by taking a gradient step multiplied by a preconditioning matrix $G_{\Psi_0}$. We use the inverse Fisher information or the inverse negative Hessian as the preconditioning matrix:

$$\Psi^{(t+1)} = \Psi^{(t)} + \rho_t \tilde{\nabla}_{\Psi} \mathcal{L},$$

where $\tilde{\nabla}_{\Psi} \mathcal{L} := G_{\Psi} \nabla_{\Psi} \mathcal{L}$ is a natural gradient with respect to $\Psi \in \Psi_0$, and $\rho_t > 0$ is a step size satisfying the convergence condition. The overall update information is summarized in Table 1.
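A global SVI update of this form can be sketched in Python for one family of parameters. The sketch uses the step-size schedule $\rho_t = (25 + t)^{-0.9}$ reported in Section 4.1 and the natural gradient of a gamma shape parameter in the form listed in Table 1 (prior plus rescaled mini-batch statistics minus the current value); the function names and the toy statistics are hypothetical, and this is not the authors' code.

```python
def step_size(t, delay=25.0, forgetting_rate=0.9):
    """Step size rho_t = (delay + t)^(-forgetting_rate)."""
    return (delay + t) ** (-forgetting_rate)


def gamma_shape_natural_gradient(current, prior, batch_stats, num_patients, batch_size):
    """Natural gradient of the form: -current + prior + (M / |B_t|) * batch statistic."""
    scale = num_patients / batch_size
    return [prior + scale * s - a for a, s in zip(current, batch_stats)]


def natural_gradient_step(params, natural_gradient, rho):
    """One stochastic update: params + rho * natural_gradient."""
    return [x + rho * g for x, g in zip(params, natural_gradient)]


# Toy update of two shape parameters from a mini-batch of 10 out of 100 patients.
a_theta = [1.0, 2.0]
batch_stats = [0.5, 1.5]
rho = step_size(t=1)
grad = gamma_shape_natural_gradient(a_theta, prior=0.3, batch_stats=batch_stats,
                                    num_patients=100, batch_size=10)
a_theta_new = natural_gradient_step(a_theta, grad, rho)
```

With a natural gradient of this form, the step is equivalent to the convex combination $(1 - \rho_t)$ times the old value plus $\rho_t$ times the mini-batch estimate, which is why such updates are described as closed-form.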

3.3 Prediction

After all the variational parameters are learned, BN-CTPF predicts individual risk scores $Y$. Specifically, given patients excluded from the training set, we predict the risk score $Y_{mp}$ of a patient $m$ at the timestep $p$ by its posterior expected Poisson parameter:

$$\hat{Y}_{mp} = E\left[(\pi_m + \epsilon_m)^T \theta_p\right]. \quad (1)$$

We can approximate the posterior distributions of $\pi_m$ and $\epsilon_m$ by the SVI for BN-CTPF from Section 3.2.
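Under the mean-field family, Eq. (1) reduces to products of gamma means, which the following Python sketch computes. It assumes that each gamma factor is parameterized by a shape and a rate, so that its mean is shape divided by rate; the function name `predict_risk` and the numeric values are hypothetical.

```python
def predict_risk(a_pi, b_pi, a_eps, b_eps, a_theta, b_theta):
    """Approximate E[(pi_m + eps_m)^T theta_p] under a factorized posterior.

    Independence under the mean-field family gives
    sum_k (E[pi_mk] + E[eps_mk]) * E[theta_pk], with E[Gamma(a, b)] = a / b
    for shape a and rate b.
    """
    total = 0.0
    for ap, bp, ae, be, at, bt in zip(a_pi, b_pi, a_eps, b_eps, a_theta, b_theta):
        total += (ap / bp + ae / be) * (at / bt)
    return total


# Toy variational parameters for a truncation level of 2 topics.
y_hat = predict_risk(a_pi=[2.0, 1.0], b_pi=[1.0, 2.0],
                     a_eps=[1.0, 1.0], b_eps=[2.0, 2.0],
                     a_theta=[3.0, 4.0], b_theta=[1.0, 2.0])
```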

4 Results

We demonstrate the applicability of BN-CTPF to EHR phenotyping with three experimental results. First, we provide error metrics to evaluate the performance of the risk score predictions by Poisson factorization. Second, we perform a quantitative evaluation of the phenotypes by computing the heldout predictive perplexity (PPX). Finally, we analyze the various inter-dependency patterns represented by topic associations and topic-covariate associations by exploring how the intensities of certain phenotypes differ as demographic factors vary.

4.1 Data Description and Experimental Design

We used the National Patient Sample (NPS) provided by the Health Insurance Review and Assessment (HIRA), a public institution of the Republic of Korea. We use the 2011 NPS dataset of both inpatients and outpatients, which is accessible after registration. The dataset includes the entire prescription records of approximately 1.1 million patients. All diagnoses and medications are encoded in the prescription records. We select the subset of patients who are 65 years or older.

Although the large amount of data promises myriad healthcare applications, the dataset lacks detailed information in some respects. For instance, the NPS dataset has no physiological signals or lab test results. Thus, we calculate Charlson's comorbidity index (CMB) [Sundararajan et al., 2004] and polypharmacy (PP) scores [Hajjar et al., 2007] and use these values as potential medical risk scores. The potential risk of CMB and PP has been a traditional subject of research in medicine [Evans et al., 2012]. The selected subset for the experiments includes 158,630 patients, 3,156,234 prescription records, 9,138 unique diagnoses, and 2,256 unique medications. The average numbers of unique medications and diagnoses per patient are 82.022 and 39.790, respectively.

We set the hyperparameters as follows: $\alpha = 20$, $\beta = 5$, $T = 200$, $d = 20$, $\sigma^2_{\cdot} = \frac{1}{250}$, $\rho_t = (25 + t)^{-0.9}$, and $|B_t| = 1{,}024$. All topic Dirichlet hyperparameters are fixed at 0.1. We set the ratio of training to heldout data to 80:20. The algorithm terminates when the fractional change in the validation probability falls below $10^{-3}$, where we set aside 1% of the training data as a validation set for convergence checking.

We introduce the following alternative models to compare performance. Specifically, to compare the error metrics, we adopt the following models: Poisson factorization,

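Table 1: A summary of update information for all variational parameters. We provide not only closed-form update equations for local variational parameters, except for $u_m$, which needs a gradient to update the parameters, but also gradients and Hessians to compute natural gradients for global variational parameters at iteration $t$ and for a random subset $B_t \subset \{1, \ldots, M\}$, where $\hat{F}_{mk} = \hat{l}_k^T u_m + \sum_{j=1}^{J} w_{kj} r_{mj}$. Natural gradients are directly provided for several variables ($\theta_{pk}$, $\eta_k$, and $\lambda_k$), which turn out to be closed-form solutions. We set a truncation level as $T$, thus $V_T := 1$ and updates for $V_k$ are defined for $k = 1, \ldots, T - 1$.

| Variable | Type | Update information |
| --- | --- | --- |
| $C^{(1)}_{md}$ | Closed-form | $\phi^{(1)}_{md,k} \propto \exp\left\{E_q[\log \eta_{k, X^{(1)}_{md}}] + E_q[\log \pi_{mk}]\right\}$ |
| $C^{(2)}_{mn}$ | Closed-form | $\phi^{(2)}_{mn,k} \propto \exp\left\{E_q[\log \lambda_{k, X^{(2)}_{mn}}] + E_q[\log \pi_{mk}]\right\}$ |
| $Z_{mp}$ | Closed-form | $\phi^{(3)}_{mp,k} \propto \begin{cases} \exp\{E_q[\log \pi_{mk}] + E_q[\log \theta_{pk}]\} & \text{for } k = 1, \ldots, T \\ \exp\{E_q[\log \epsilon_{mk}] + E_q[\log \theta_{pk}]\} & \text{for } k = T + 1, \ldots, 2T \end{cases}$ |
| $\pi_{mk}$ | Closed-form | $a^{\pi}_{mk} = \beta p_k + \sum_d \phi^{(1)}_{md,k} + \sum_n \phi^{(2)}_{mn,k} + \sum_{t=1}^{T} Y_{mt} \phi^{(3)}_{mt,k}$ |
| | | $b^{\pi}_{mk} = \frac{N_m + D_m}{\sum_k E_q[\pi_{mk}]} + \sum_t E_q[\theta_{pk}] + \exp(-\hat{F}_{mk})$ |
| $\epsilon_{mk}$ | Closed-form | $a^{\epsilon}_{mk} = c + \sum_{t=T+1}^{2T} Y_{mt} \phi^{(3)}_{mt,k}$ |
| | | $b^{\epsilon}_{mk} = d + \sum_t E_q[\theta_{pk}]$ |
| $u_m$ | Gradient | $\frac{\partial \mathcal{L}}{\partial u_m} = -\frac{1}{\sigma_u^2} u_m + \sum_k \hat{l}_k \left(-\beta p_k + \frac{E_q[\pi_{mk}]}{\exp(\hat{F}_{mk})}\right)$ |
| $\theta_{pk}$ | Natural gradient | $\partial a^{\theta}_{kt} = -a^{\theta}_{kt} + a + \frac{M}{\lvert B_t \rvert} \left\{\sum_m Y_{mt} \phi^{(3)}_{mt, k \in \{1:T\}} + \sum_m Y_{mt} \phi^{(3)}_{mt, k \in \{T+1:2T\}}\right\}$ |
| | | $\partial b^{\theta}_{kt} = -b^{\theta}_{kt} + b + \frac{M}{\lvert B_t \rvert} \sum_m \left\{E_q[\pi_{mk}] + E_q[\epsilon_{mk}]\right\}$ |
| $\eta_k$ | Natural gradient | $\partial \gamma^{\eta}_{k,l} = -\gamma^{\eta}_{k,l} + \eta_0 + \frac{M}{\lvert B_t \rvert} \sum_m \sum_d \phi^{(1)}_{md,k} \mathbb{1}\left[X^{(1)}_{md} = l\right]$ for $l = 1, \ldots, V_D$ |
| $\lambda_k$ | Natural gradient | $\partial \gamma^{\lambda}_{k,l} = -\gamma^{\lambda}_{k,l} + \lambda_0 + \frac{M}{\lvert B_t \rvert} \sum_m \sum_n \phi^{(2)}_{mn,k} \mathbb{1}\left[X^{(2)}_{mn} = l\right]$ for $l = 1, \ldots, V_N$ |
| $w_{kj}$ | Gradient, Hessian | $\frac{\partial \mathcal{L}}{\partial w_{kj}} = -\frac{w_{kj}}{\sigma_w^2} + \frac{M}{\lvert B_t \rvert} \sum_m r_{mj} \left(-\beta p_k + \frac{E_q[\pi_{mk}]}{\exp(\hat{F}_{mk})}\right)$ |
| | | $\frac{\partial^2 \mathcal{L}}{\partial w_{kj} \partial w_{kj'}} = -\mathbb{1}[j = j'] \frac{1}{\sigma_w^2} - \sum_{m \in B_t} r_{mj'} r_{mj} \left(\frac{E_q[\pi_{mk}]}{\exp(\hat{F}_{mk})}\right)$ |
| $l_k$ | Gradient, Hessian | $\frac{\partial \mathcal{L}}{\partial \hat{l}_k} = -\frac{\hat{l}_k}{\sigma_l^2} + \frac{M}{\lvert B_t \rvert} \sum_{m \in B_t} u_m \left(-\beta p_k + \frac{E_q[\pi_{mk}]}{\exp(\hat{F}_{mk})}\right)$ |
| | | $\frac{\partial^2 \mathcal{L}}{\partial \hat{l}_k \partial \hat{l}_k'} = -\frac{I_d}{\sigma_l^2} - \sum_{m \in B_t} u_m u_m^T \left(\frac{E_q[\pi_{mk}]}{\exp(\hat{F}_{mk})}\right)$ |
| $V_k$ | Gradient, Hessian | $\frac{\partial \mathcal{L}}{\partial \hat{V}_k} = -\frac{\alpha - 1}{1 - \hat{V}_k} - \frac{\beta p_k}{\hat{V}_k} \left[\frac{M}{\lvert B_t \rvert} \sum_{m \in B_t} \left\{\hat{F}_{mk} - E_q[\log \pi_{mk}]\right\} + \frac{M}{\lvert B_t \rvert} \psi(\beta p_k)\right] + \sum_{l=k+1}^{T} \frac{\beta p_l}{1 - \hat{V}_k} \left[\frac{M}{\lvert B_t \rvert} \sum_{m \in B_t} \left\{\hat{F}_{ml} - E_q[\log \pi_{ml}]\right\} + \frac{M}{\lvert B_t \rvert} \psi(\beta p_l)\right]$ |
| | | $\frac{\partial^2 \mathcal{L}}{\partial \hat{V}_k^2} = -\frac{\alpha - 1}{(1 - \hat{V}_k)^2} - \frac{p_k}{\hat{V}_k} \left[\beta^2 \lvert B_t \rvert \psi'(\beta p_k) \frac{p_k}{\hat{V}_k}\right] + \sum_{l=k+1}^{T} \frac{p_l}{1 - \hat{V}_k} \left[\beta^2 \lvert B_t \rvert \psi'(\beta p_l) \frac{p_l}{\hat{V}_l}\right]$ |
| | | $\frac{\partial^2 \mathcal{L}}{\partial \hat{V}_k \partial \hat{V}_r} = \frac{\beta p_k}{\hat{V}_k (1 - \hat{V}_r)} \left\{\sum_{m \in B_t} \left(\hat{F}_{mk} - E_q[\log \pi_{mk}]\right) + \lvert B_t \rvert \psi(\beta p_k)\right\} + \frac{\beta^2 p_k}{\hat{V}_k} \left\{\lvert B_t \rvert \psi'(\beta p_k) \frac{p_k}{1 - \hat{V}_r}\right\} - \sum_{l=k+1}^{T} \frac{\beta p_l}{(1 - \hat{V}_k)(1 - \hat{V}_r)} \left\{\sum_{m \in B_t} \left(\hat{F}_{ml} - E_q[\log \pi_{ml}]\right) + \lvert B_t \rvert \psi(\beta p_l)\right\} - \sum_{l=k+1}^{T} \frac{\beta^2 p_l}{1 - \hat{V}_k} \left\{\lvert B_t \rvert \psi'(\beta p_l) \frac{p_l}{1 - \hat{V}_r}\right\}$ (for $r < k$) |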

or PF, CTPF, and CTR. We also use variations of BN-CTPF that omit some of the modeled features, so that we can measure the effect of topic associations and topic-covariate associations. All parametric models have a latent dimension of $K = 200$. For the hyperparameter selection, we use validation datasets to tune the models to their optimal settings. Accordingly, the shape and rate parameters of the gamma distributions for the Poisson factorization-based models are fixed at 0.3, as in CTPF. The content-based recommendation models are initialized with the parameters learned from LDA [Blei et al., 2003].

PF [Gopalan et al., 2013]. PF is a simple model that factorizes risk scores without content information.

CTPF [Gopalan et al., 2014]. CTPF is one of the main components of BN-CTPF. It is a parametric model that factorizes risk scores with a single source of information, so CTPF utilizes only either medications or diagnoses.

CTR [Wang and Blei, 2011]. The original CTR does not scale to massive datasets. Thus, we fix the topics and patient-topic proportions to their LDA values; this variant is known to perform similarly to the original CTR. CTR also suffers from the single-source limitation.

HDP-PF, DILN-PF, and HDSP-PF. These are variations of BN-CTPF that exclude the corresponding parts of topic associations and topic-covariate associations. HDP-PF can be seen as a combination of HDP and PF. DILN-PF and HDSP-PF correspond to Poisson factorization extensions of [Paisley et al., 2012] and [Kim and Oh, 2014], respectively.

Figure 3: MAE and RMSE on CMB and PP predictions. (a) CMB prediction errors. (b) PP prediction errors.

Figure 4: Comparisons of predictive perplexity for phenotyping quality. (a) PPX of models on CMB. (b) PPX of models on PP.

4.2 Performances on Medical Risk Predictions

The joint modeling of BN-CTPF lets us evaluate the validity of the phenotypes by calculating MAE and RMSE while simultaneously extracting the phenotypes. We calculate MAE and RMSE on two target medical risks: comorbidity and polypharmacy. Following [Asuncion et al., 2009], we randomly partition the data of each heldout patient into two halves, and we evaluate the conditional distribution of the second half given the first half and the training data. The first half is used to estimate the local variational parameters for each patient. Given the global variational parameters learned in the training procedure, the predicted risk is given by the conditional expectation in Eq. (1).
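For reference, the two error metrics can be computed as in the following Python sketch. The function names and the toy scores are hypothetical; the sketch only restates the standard definitions of MAE and RMSE and does not reproduce any result of this paper.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """MAE: average of |y - y_hat| over all heldout entries."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """RMSE: square root of the average of (y - y_hat)^2."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


# Toy heldout risk scores and predictions.
observed = [1, 2, 3]
predicted = [1.0, 4.0, 3.0]
mae = mean_absolute_error(observed, predicted)
rmse = root_mean_squared_error(observed, predicted)
```

RMSE penalizes large individual errors more heavily than MAE, so the two metrics can rank models differently.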

Figure 3 shows that BN-CTPF outperforms the other models on MAE and RMSE, except for MAE on the CMB risk score. It should be noted that BN-CTPF outperforms every other model in RMSE. The result supports the following statements: 1) reflecting the heterogeneous characteristics of EHR is useful in predicting medical risks, and 2) utilizing the various learned associations can potentially boost the performance of EHR phenotyping. Additionally, a nonparametric model can adapt its model complexity, which might lead to better performance.

4.3 Quantitative Evaluation of Phenotypes

We utilize PPX to evaluate the phenotyping quality. Although PPX is widely used to evaluate topic models and matrix factorization, most previous work on unsupervised EHR phenotyping did not use the metric to evaluate the quality of phenotypes. We compute PPX in the same manner as in Section 4.2. More formally, we denote the training data by $\mathcal{D}$ and the heldout data by $\mathcal{X}$, and we divide the heldout data into two halves, $\mathcal{X}'$ and $\mathcal{X}''$. The per-word perplexity on the second half of the heldout data is given by $\exp\{-\log p(\mathcal{X}'' \mid \mathcal{X}')/N\}$, where $N$ is the number of observations constituting $\mathcal{X}''$. Since it is intractable to compute the exact value of the marginal probability $p(\mathcal{X}'' \mid \mathcal{X}')$, we approximate it with the variational inference algorithm described in Section 3.2. A lower perplexity indicates better generalization performance, and the metric does not rely on the KL divergence between a variational distribution and the true posterior, which is relevant to our objective function.
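The perplexity formula can be illustrated with a brief Python sketch. It assumes that the approximate log-probability of the second half decomposes into one term per heldout observation, which is an assumption of this sketch; the function name `per_word_perplexity` is hypothetical.

```python
import math


def per_word_perplexity(log_likelihoods):
    """Compute exp{-sum(log p) / N} over N heldout observations."""
    n = len(log_likelihoods)
    if n == 0:
        raise ValueError("at least one heldout observation is required")
    return math.exp(-sum(log_likelihoods) / n)


# Toy check: a model that assigns probability 1/4 to each of 10 heldout words.
ppx = per_word_perplexity([math.log(0.25)] * 10)
```

In this toy case the perplexity equals the number of equally likely choices per word, which is the usual way to read the metric.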

Figure 4 provides the PPX of several baseline models. Since each phenotype consists of diagnoses and medications, two PPX values can be calculated, one from the diagnoses and one from the medications. BN-CTPF achieved the best performance in diagnosis phenotype modeling, but it was less generalizable in medication phenotype modeling. We conjecture that the medications and the diagnoses follow different distributions, and that the Poisson distribution is more suitable for modeling the medication phenotypes than the multinomial distribution used in BN-CTPF. We note that the PPX value of a three-way Poisson tensor factorization, which is similar to earlier studies [Ho et al., 2014], is about 4,000. While this value is much larger than those of the other models, it should not be compared at the same level, since the Poisson tensor factorization has to consider the combinatorial space of diagnoses and medications, which is larger than the space considered by the other models.

4.4 Phenotypes and Demographics

Analyzing the relationship between phenotypes and patient demographics provides a deeper understanding of demographic-specific diseases and medications, which enables better healthcare policy and medical risk management. Figure 5(a) illustrates the phenotype-age and phenotype-gender associations by enumerating the expected top ten phenotype probabilities conditioned upon the demographics, where the topics are sorted by their posterior word counts. We found a strong relationship between the comorbidity of hypertension and type-II diabetes (T2DM) (phenotype 6) and elderly patients aged 65-75. An older group with age ≥ 85 is strongly related to respiratory diseases (phenotype 2). It should be noted that similar diagnosis phenotypes can have different medication patterns. For example, phenotypes 1, 6, 7, and 9 all indicate the complications of hypertension and T2DM, but their medication patterns are different. The correlations between topics, or phenotypes, in Figure 5(b) indicate which phenotypes are similar when medications and diagnoses are considered jointly. The correlation coefficients are calculated by taking the dot product of the topic locations, $l_k^\top l_{k'}$.
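The pairwise dot product of topic locations can be illustrated with a short Python sketch. The two-dimensional location vectors below are hypothetical values chosen for the example, and the function names are assumptions; the learned locations in BN-CTPF would come from the fitted model.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def topic_similarity(locations):
    """Matrix of pairwise dot products between topic location vectors."""
    return [[dot(lk, lj) for lj in locations] for lk in locations]


# Hypothetical 2-dimensional locations for three topics.
locations = [[1.0, 0.0], [0.5, 0.5], [-1.0, 0.0]]
sim = topic_similarity(locations)

for row in sim:
    print(row)
```

Topics whose locations point in similar directions receive a positive score, while topics pointing in opposite directions receive a negative one. The resulting matrix is symmetric, so only the upper triangle needs to be inspected.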


(a) Topic (phenotype)-covariate associations given demographics. Diagnosis topics at the top, and medication topics at the bottom. The top five entries of each of the ten topics, in panel order:

Diagnosis topics
1. Hypertension; Type-II diabetes; Primary gonarthrosis; Gastritis; Hypertension, benign
2. Acute bronchitis; Allergic rhinitis; Allergic contact dermatitis; Gastritis; Dyspepsia
3. Presence of intraocular lens; Dry eye syndrome; Senile cataract; Senile nuclear cataract; Other senile cataract
4. Asthma; Chronic obstructive pulmonary disease; Acute bronchitis; Allergic asthma; Pneumonia
5. Hypertension; Hypertensive heart disease without (congestive) heart failure; Allergic rhinitis; Atherosclerosis; Nonsuppurative otitis media
6. Hypertension; Type-II diabetes; Primary gonarthrosis; Gastritis; Glaucoma
7. Hypertension; Type-II diabetes; Acute gastritis; Gastritis, unspecified; Acute bronchitis
8. Hypertension; Peripheral vascular disease; Hyperlipidemia; Acute bronchitis; Dietary calcium deficiency
9. Hypertension; Acute bronchitis; Other acute gastritis; Primary gonarthrosis; Type-II diabetes
10. Gastritis; Hypertension; Dyspepsia; Acute bronchitis; Primary gonarthrosis

Medication topics
1. Aspirin; metformin HCl; rebamipide; atorvastatin (calcium); clopidogrel
2. rebamipide; Aspirin; mosapride citrate; artemisia asiatica 95% ethanol ext.; streptodornase streptokinase
3. vaccinium myrtillus ext.; atorvastatin (calcium); Aspirin; simvastatin; streptodornase streptokinase
4. doxofylline; montelukast sodium; erdosteine; acetylcysteine; micronized tiotropium bromide monohydrate
5. clopidogrel; atorvastatin (calcium); Aspirin; hydrochlorothiazide; simvastatin
6. carvedilol; Aspirin; trimetazidine HCl; clopidogrel; furosemide
7. Aspirin; rebamipide; aceclofenac; hydrochlorothiazide; artemisia asiatica 95% ethanol ext.
8. prednisolone; magnesium hydroxide; micronized tiotropium bromide monohydrate; amlodipine besylate; atenolol
9. triflusal; artemisia asiatica 95% ethanol ext.; acetyl-L-carnitine HCl; sodium valproate; atorvastatin (calcium)
10. polystyrene sulfonate calcium; amlodipine adipate; rosuvastatin calcium; clopidogrel; hydrochlorothiazide

(b) Topic associations among the top ten phenotype topics.

(c) Expected distribution of the diagnosis topic 6 (hypertension, T2DM, gonarthrosis, gastritis, and glaucoma) over regions.

Legend entries: >80th percentile; >60th percentile; >40th percentile; >20th percentile; less than or equal to 20th percentile. No label; Age: 65-75; Gender: Female; Age: 85-; Gender: Male.

Figure 5: Phenotype representations and correlations.

Lastly, we explore the regional differences in phenotype probabilities. Figure 5(c) illustrates clear differences among regions in exhibiting phenotypes, with or without conditioning on demographics.

5 Conclusion

This paper introduces BN-CTPF to extract phenotypes from EHR and to predict medical risks. BN-CTPF outperforms models that are either general-purpose or pipelined through separate analytic steps. We applied BN-CTPF to an EHR dataset with over three million prescriptions, and the result provides more accurate medical risk estimates per demographic group, together with the phenotypes that explain them.

Acknowledgments

This research was supported by the Korean ICT R&D program of MSIP/IITP (R7117-16-0219, Development of Predictive Analysis Technology on Socio-Economics using Self-Evolving Agent-Based Simulation embedded with Incremental Machine Learning). The data used for this study were obtained from HIRA-NIS-2011-0058, provided by the Health Insurance Review & Assessment Service. The results of this study are unrelated to the Ministry of Health and Welfare (MW) and HIRA. We thank the four anonymous reviewers for their valuable comments on our manuscript. We also thank Prof. Minki Kim for preprocessing the HIRA dataset.

References

[Asuncion et al., 2009] A. Asuncion, M. Welling, P. Smyth, and Y. W. Teh. On smoothing and inference for topic models. In UAI, pages 27–34. AUAI Press, 2009.

[Bellazzi and Zupan, 2008] R. Bellazzi and B. Zupan. Predictive data mining in clinical medicine: current issues and guidelines. International journal of medical informatics, 77(2):81–97, 2008.

[Blei et al., 2003] D. M Blei, A. Y Ng, and M. I Jordan. Latent dirichlet allocation. Journal of machine Learning research, 3:993–1022, 2003.

[Canny, 2004] J. Canny. Gap: a factor model for discrete data. In ACM SIGIR, pages 122–129. ACM, 2004.

[Davis et al., 2010] D. A Davis, N. V Chawla, N. A Christakis, and A.-L Barabási. Time to care: a collaborative engine for practical disease prediction. Data Mining and Knowledge Discovery, 20(3):388–415, 2010.

[Evans et al., 2012] D. C Evans, C. H Cook, J. M Christy, C. V Murphy, A. T Gerlach, D. Eiferman, D. E Lindsey, M. L Whitmill, T. J Papadimos, P. R Beery, et al. Comorbidity-polypharmacy scoring facilitates outcome prediction in older trauma patients. Journal of the American Geriatrics Society, 60(8):1465–1470, 2012.

[Ghassemi et al., 2014] M. Ghassemi, T. Naumann, F. Doshi-Velez, N. Brimmer, R. Joshi, A. Rumshisky, and P. Szolovits. Unfolding physiological state: Mortality modelling in intensive care units. In ACM SIGKDD, pages 75–84. ACM, 2014.

[Ghassemi et al., 2015] M. Ghassemi, M. AF Pimentel, T. Naumann, T. Brennan, D. A Clifton, P. Szolovits, and M. Feng. A multivariate timeseries modeling approach to severity of illness assessment and forecasting in icu with sparse, heterogeneous clinical data. In AAAI, pages 446–453, 2015.

[Gopalan et al., 2013] P. Gopalan, J. M Hofman, and D. M Blei. Scalable recommendation with poisson factorization. arXiv preprint arXiv:1311.1704, 2013.

[Gopalan et al., 2014] P. K Gopalan, L. Charlin, and D. Blei. Content-based recommendations with poisson factorization. In NIPS, pages 3176–3184, 2014.

[Hajjar et al., 2007] E. R Hajjar, A. C Cafiero, and J. T Hanlon. Polypharmacy in elderly patients. American journal of geriatric pharmacotherapy, 5(4):345–351, 2007.

[Ho et al., 2014] J. C Ho, J. Ghosh, and J. Sun. Marble: high-throughput phenotyping from electronic health records via sparse nonnegative tensor factorization. In ACM SIGKDD, pages 115–124. ACM, 2014.

[Hoffman et al., 2013] M. D Hoffman, D. M Blei, C. Wang, and J. Paisley. Stochastic variational inference. JMLR, 14(1):1303–1347, 2013.


[Hripcsak and Albers, 2013] G. Hripcsak and D. J Albers. Next-generation phenotyping of electronic health records. Journal of the American Medical Informatics Association, 20(1):117–121, 2013.

[Kim and Oh, 2014] D. Kim and A. Oh. Hierarchical dirichlet scaling process. In ICML, pages 973–981, 2014.

[Koren et al., 2009] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, (8):30–37, 2009.

[Lasko et al., 2013] T. A Lasko, J. C Denny, and M. A Levy. Computational phenotype discovery using unsupervised feature learning over noisy, sparse, and irregular clinical data. PloS one, 8(6):e66341, 2013.

[Lehman et al., 2012] L. H Lehman, M. Saeed, W. J Long, J. Lee, and R. G Mark. Risk stratification of icu patients using topic models inferred from unstructured progress notes. In AMIA. Citeseer, 2012.

[McCarty et al., 2011] C. A McCarty, R. L Chisholm, C. G Chute, I. J Kullo, G. P Jarvik, E. B Larson, R. Li, D. R Masys, M. D Ritchie, D. M Roden, et al. The emerge network: a consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC medical genomics, 4(1):13, 2011.

[Paisley et al., 2012] J. Paisley, C. Wang, D. M Blei, et al. The discrete infinite logistic normal distribution. Bayesian Analysis, 7(4):997–1034, 2012.

[Pathak et al., 2013] J. Pathak, A. N Kho, and J. C Denny. Electronic health records-driven phenotyping: challenges, recent advances, and perspectives. Journal of the American Medical Informatics Association, 20(e2):e206–e211, 2013.

[Ranganath and Blei, 2015] R. Ranganath and D. Blei. Correlated random measures. arXiv preprint arXiv:1507.00720, 2015.

[Saria and Goldenberg, 2015] S. Saria and A. Goldenberg. Subtyping: What it is and its role in precision medicine. IEEE Intelligent Systems, 30(4):70–75, 2015.

[Saria et al., 2010] S. Saria, D. Koller, and A. Penn. Learning individual and population level traits from clinical temporal data. In Proc. Neural Information Processing Systems (NIPS), Predictive Models in Personalized Medicine workshop. Citeseer, 2010.

[Sethuraman, 1994] J. Sethuraman. A constructive definition of dirichlet priors. Statistica sinica, pages 639–650, 1994.

[Sundararajan et al., 2004] V. Sundararajan, T. Henderson, C. Perry, A. Muggivan, H. Quan, and W. A Ghali. New icd-10 version of the charlson comorbidity index predicted in-hospital mortality. Journal of clinical epidemiology, 57(12):1288–1294, 2004.

[Teh et al., 2012] Y. W. Teh, M. I Jordan, M. J Beal, and D. M Blei. Hierarchical dirichlet processes. Journal of the american statistical association, 2012.

[Tran et al., 2015] T. Tran, T. D. Nguyen, D. Phung, and S. Venkatesh. Learning vector representation of medical objects via emr-driven nonnegative restricted boltzmann machines (enrbm). Journal of biomedical informatics, 54:96–105, 2015.

[Wang and Blei, 2011] C. Wang and D. M Blei. Collaborative topic modeling for recommending scientific articles. In ACM SIGKDD, pages 448–456. ACM, 2011.

[Zhou et al., 2014] J. Zhou, F. Wang, J. Hu, and J. Ye. From micro to macro: data driven phenotyping by densification of longitudinal electronic medical records. In ACM SIGKDD, pages 135–144. ACM, 2014.
