Characterizing dengue spread and severity using Internet media sources

ABSTRACT
Pakistan witnessed one of its deadliest dengue outbreaks in 2011, resulting in thousands of deaths throughout the country. Prior to the outbreak, dengue awareness was relatively low and hospitals in the country were not fully prepared to handle the epidemic, with limited knowledge about the spread of the disease in each locality. This paper presents a system that automatically characterizes the spread and severity of dengue at a fine-grained location granularity by analyzing news reports from Internet media sources. Our system leverages a range of standard data mining and machine learning techniques to arrive at an accurate dengue severity measure for any given location. Based on a detailed analysis of news reports gathered from several leading dailies in Pakistan, we demonstrate the effectiveness of our system in accurately characterizing dengue spread and severity across different locations within Pakistan.

1. INTRODUCTION
Preventing large-scale outbreaks of diseases like dengue, malaria, and typhoid constitutes an enormous public health challenge, especially in countries with limited infrastructure committed to prevention, awareness, and containment of these diseases. Under the charter of the International Health Regulations established by the World Health Organization (http://www.who.int/ihr/en/), by the end of 2012 signatory countries must have accomplished key milestones in this regard, including "assessment of their surveillance and response capacities and the development and implementation of plans of action to ensure that these core capacities are functioning." Not surprisingly, significant effort has been invested in developing state-of-the-art disease surveillance systems [18].

These systems can be distinguished from each other on the basis of three main components: data collection, analysis, and reporting. In the case of conventional government-managed public health surveillance systems, data collection is handled in a hierarchical manner. Using hospital reports, health hotline calls, door-to-door surveys, etc., medical officers in the field and healthcare providers notify local and regional health departments of incidents and suspected cases.

These reports are analyzed to determine the severity of the spread and suitably forwarded up to command centers at the state or national level (the Centers for Disease Control is an example of such an agency in the United States), which then coordinate response and prevention procedures. This chain-of-command framework has its advantages and disadvantages. While a hierarchical process reduces the occurrence of unnecessary alerts (false positives) that may cause panic, it introduces some bureaucracy in how agencies need to coordinate with each other up and down the order, thereby making it inefficient.

An alternative approach is to use a semi-automated solution, leveraging the ever-expanding ubiquity of the Internet (especially in developed countries). Examples include ProMED-Mail [16] and GPHIN [7], to name just a few. These offer some improvements over the manually operated systems described above, but still rely on human intervention in the form of experts supervising data collection, analysis, and reporting.

Some recent work has focused on taking this further to build completely automated web-based disease surveillance systems [19, 4, 11, 10, 14, 2, 17, 5]. These systems parse news articles from around the world using sources such as Google News and RSS feeds, as well as social media such as Twitter, to filter and classify articles based on the nature of the epidemic, location, and news source.

At the same time, not much work has been done on automating the analysis of longer-term trends. This is crucial from a policy planner's perspective for a number of reasons:

1. It gives policy planners better insight on whether intended measures put in place to curb the spread have worked, and how to plan for the future.

2. It serves as backfill for inadequate/incomplete surveillance reports collected in the past.

3. It is the first step towards a full-fledged automated system that supplements localized disease surveillance with deep-dive historical analysis.

Furthermore, since the study of these trends can be time-consuming, involving long man-hours spent retrieving and analyzing historical records, it makes a fully automated disease analytics system even more relevant. And while both web-based and traditional disease surveillance systems act as excellent outlets for disseminating worldwide epidemic outbreak reports, they are not ideal for localized, disease-specific analysis, necessarily trading off depth of coverage for breadth. These systems are especially suited for (near) real-time tracking of epidemics and do not offer a historical perspective on a specific disease.

In this paper, we present an analysis of a system that addresses these issues. Specifically, we focus on seasonal dengue outbreaks in Pakistan and describe a web-based system that automatically classifies and characterizes the severity of occurrences based on local online news sources. Our system is built with the intention of choosing the other side of the trade-off mentioned above, viz. focusing on in-depth localized analysis of news sources over broad-based binary classification of disease outbreaks. The main highlights of this system are:

• Spatiotemporal model: We capture historical trends in disease spread severity for a geographical region over a medium-term timespan (1-5 years). While some other systems, notably BioCaster and HealthMap, provide maps with an overlay denoting hotspots, and general macroscopic trends, ours is the first system that incorporates time as another parameter and provides finer-grained, location-specific trends.

• Disease and location specificity: Our system is unique in this regard as it focuses exclusively on dengue occurrence in Pakistan. We believe that this lends itself to a modular framework for better understanding epidemic outbreaks, since it gives policy planners options to invoke disease-specific modules as and when needed.

• Local media sources: By channeling our insights solely through the lens of local news sources based in Pakistan, we are able to customize and exploit the domain- and location-specific nature of the disease outbreak. Moreover, this isolates our analysis from externalities (such as dengue outbreaks outside Pakistan), especially when it is known that such externalities may have very little impact on seasonal epidemic recurrences like dengue.

• Severity characterization: We propose a severity scale that labels reports from not severe to severe for a given region and time period. To the best of our knowledge, this has not been studied before in the disease surveillance literature, and we believe that it is a crucial next step in the evolution of web-based disease surveillance and analysis.

We evaluated the system on 3400 dengue-related news articles extracted from six leading Pakistani newspapers over a two-year period from 2010 to 2012. Based on our analysis, we show that our system can provide a fine-grained spatiotemporal model at the city level that captures the spread and severity of a dengue outbreak. We are able to establish a strong correlation between the actual number of deaths/cases at the city level and the dengue score we infer from the news articles. This makes our system a good candidate for historical trend analysis at the city level using purely Internet media sources. To achieve these results, we need to address several challenges, including: (a) accurate extraction of dengue-related news articles; (b) identifying the location and time corresponding to every document; (c) extracting the right set of features for capturing dengue severity in a location; and (d) applying the right regression model to compute an accurate dengue index for every locality. While our system has been tested primarily on the dengue disease in Pakistan, we believe that the ideas can easily be generalized to other diseases and locations.

The rest of the paper is structured as follows. In Section 2, we go through related work pertaining to disease surveillance and the use of Internet media sources to that end. We give an overview of our system in Section 3 and argue in favor of our approach towards capturing spatiotemporal trends and disease spread severity. We highlight the main steps in data collection and the challenges associated with cleaning up unstructured data retrieved from the Web in Section 4. In Section 5, we describe the supervised model used to train and analyze the dataset, and present our evaluations and results in Section 6. We offer conclusions and possibilities for future work in Section 7.

2. RELATED WORK
As highlighted earlier in Section 1, any surveillance system has three main components: data collection, analysis, and reporting. We review related work in disease surveillance and tracking based on this classification. In terms of data collection, traditional disease surveillance has always been an integral component of public health services offered by governmental agencies. In these systems, data is gathered from actual cases reported by hospitals. Although this data is at micro-granularity, it is often limited to officials and not made public due to numerous political and social reasons. As a proxy, other fine-precision data sources such as calls to health centers [21], over-the-counter drug sales [6], school absentee records [15], and ambulance dispatch records [13] have been shown to correlate well with actual case data, and are being used for disease surveillance.

In the last three decades, the Internet has played a vital role in augmenting standard data collection approaches. With the advent of near-instant publishing mechanisms such as Twitter and Facebook, especially via cellphones, data collection happens in real time and around the clock. ProMED-Mail and GPHIN, referred to earlier, are prime examples in this regard. ProMED-Mail is a mailing list [16] that depends on a community of volunteers, universities, industrial laboratories, news agencies, etc., in addition to public health workers, to contribute reportage of incidents, while list moderators filter through these reports and send them out with annotations on the time and location of each incident. GPHIN [7] mines news articles and blog posts from around the world and sorts through 2000 articles daily using a pre-determined scoring system. A team of analysts working round the clock then analyzes these articles and correspondingly decides whether the reports warrant raising an alert. Although both significantly enhance conventional data sources and render them more effective, they still require significant human monitoring.

Completely automated web-based disease surveillance gained mainstream awareness with a study by Google in 2009 [12] that showed a strong correlation between flu-related keywords and actual ILI (influenza-like illness) outbreaks. Other sources along these lines, such as visits to health websites [1] and clicks on keyword-triggered ads [9], have also been demonstrated to be reliably accurate data sources for early epidemic detection and disease surveillance. However, all these approaches suffer from a common drawback in that the datasets are not publicly available for further analysis and continuous access; one is forced to depend on portal operators to track data and occasionally provide peripheral insights.

In this regard, content published in (micro)blogs and news websites is far more accessible and can be readily mined at will by the public health community to operate round-the-clock surveillance systems. Given Twitter's open nature and simple publishing process, it has quickly evolved into the forum of choice for real-time news dissemination. Not surprisingly, "tweets" have been analyzed for health-related topic identification and classification [3]. EpiSpider [11] and InfoVigil [8] are examples of surveillance systems which monitor disease activity using Twitter. Blogs, website articles, and news feeds have been explored for various health informatics problems [20]. However, the instantaneous nature of content being streamed on Twitter also makes it a less than effective data source for longer-term disease trend analyses.

News websites fill this void and are an important source for monitoring health-related events. BioCaster [19] mines health-related news from over 1700 RSS feeds, as well as from news aggregators such as Google News. After a basic semantic analysis to determine whether an article is relevant or not, the system runs it through more sophisticated NLP techniques, including a topic classification model, entity recognition, and rule-based event extraction, in order to obtain an actionable event. Finally, the system looks at a number of test statistics, such as weighted moving averages and cumulative sum methods, to classify the event as alert-worthy or not. As a web-based disease surveillance system, BioCaster is very comprehensive in the geographical regions it spans. Beyond the fact that our approach is specific to a single epidemic and geographical region, our system is different in two other crucial aspects. Firstly, we are more concerned with ex-post-facto event analysis; in other words, we want to know how severe a reported event was, based on text processing of the articles covering the event. Secondly, BioCaster, like other disease surveillance systems, is intended for real-time outbreak detection and not longer-term historical trend analysis. It does support analyzing macroscopic trends, although only at the continental level. Still, it is possible to envision our system as complementary to web-based surveillance systems like BioCaster.

Figure 1: System architecture

3. SYSTEM OVERVIEW
Our analysis is based on news articles published online by English-language newspaper websites in Pakistan. While the adoption of the Internet as a medium for newspapers to host articles is a relatively new phenomenon in Pakistan, it has gathered pace quickly and more newspapers are establishing a digital presence (http://www.ejc.net/magazine/article/modern_technological_trends_emerging_in_pakistani_media/). We identified six main nationwide dailies: Dawn, Tribune, The News, Nation, PakTribune, and Pakistan Today, and retrieved articles from August 2010 to April 2012.

Figure 1 shows the various steps in our system architecture. Our system design consists of four key steps. In the first step, we use a combination of crawling and search-query-based extraction techniques to gather a corpus of dengue-related articles from different newspapers in Pakistan. In the second step, we use a Support Vector Machine (SVM) based classifier to identify and filter dengue-related articles from the larger pool of extracted documents. In the third step, we extract a combination of several dengue-related features from these newspaper articles and also tag each article with its corresponding location and time of creation. We encounter challenges with respect to the extraction of time and location from news articles that are common to other Internet-based surveillance systems. We address these by using standard text processing tools such as TF-IDF based keyword extraction, checking against location whitelists, and pattern and regular-expression matching for timestamps. Finally, we consider the list of documents corresponding to a locality within a given time period to compute a dengue severity index for the specified location. For this, we use a polynomial regression model trained on our set of features extracted from each article, and map the output to a dengue severity score in the range 1-5.

Many of the learning steps in our system use a supervised approach with an appropriate training set. With the help of the supervised model, we are able to capture historical trends for dengue occurrence and severity in the form of a spatiotemporal map that depicts how severely a given region has been affected by seasonal recurrences of the epidemic.

The goal of the system is the accurate and timely quantification of the severity of a rise in dengue in an area. Knowing the severity of an outbreak helps in the proper allocation of resources for its containment. The system also has to be timely, because latency in detection can cause enormous damage to the containment effort and can actually aid the spread of an outbreak to nearby geographical regions. In the long run, the goal is to produce a strong, independent system that can replace the traditional system and is accurate enough to be extended to more diseases and more developing countries.

4. NEWS ARTICLES EXTRACTION ENGINE
In this section, we elaborate on the individual steps for extracting and processing the news articles from different sources.

4.1 Data Gathering
Extracting articles from a news website is a challenge because none of the sites provides an API for article extraction. We therefore used different extraction methodologies depending on how each site displays its content. We restricted our data sources to the most popular and authentic news websites that publish articles in English. For our study, we collected articles from six leading national English newspapers in Pakistan that publish their content online: Dawn, Tribune, The News, Nation, PakTribune, and Pakistan Today. We retrieved these articles for the time period of August 2010 to April 2012, retaining only those containing the word 'dengue' in either the article title or body. We focus on this time period for two reasons: not all six dailies had a big online presence until 2010, and moreover, this period was when the media in Pakistan was proactive in terms of highlighting the government response to the dengue epidemic.
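The paper does not detail the crawler itself, so the following is only a minimal sketch of this gathering step. The archive URL, the use of the requests and BeautifulSoup libraries, and the helper names are all illustrative assumptions; each newspaper site would in practice need its own extraction logic.

```python
# Illustrative sketch of the article-gathering step (not the paper's actual crawler).
import requests
from bs4 import BeautifulSoup

SEED_URLS = [
    "https://www.dawn.com/archive/2011-09-01",  # hypothetical archive page
    # ... one entry per daily and date range
]

def fetch_article_text(url):
    """Download a page and return its visible text, or None on failure."""
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
    except requests.RequestException:
        return None
    soup = BeautifulSoup(resp.text, "html.parser")
    # Strip scripts/styles so only readable article text remains.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)

def is_dengue_related(title, body):
    """Keep only articles mentioning 'dengue' in the title or body."""
    return "dengue" in (title or "").lower() or "dengue" in (body or "").lower()
```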

4.2 Data Cleaning
The downloaded news articles were first processed through an array of standard text processing tools to remove various webpage artifacts and extract only the relevant text corresponding to each article. Popular news articles and quotes tend to be replicated across different dailies. We use standard duplicate removal techniques based on a simple similarity matching score to remove articles that were even partially duplicated.

The next phase was classifying articles as useful or not useful for the study. From this initial set, we discarded documents that were not relevant to dengue using an SVM classifier. To this end, we calculated TF-IDF scores for each word in the document set after filtering out stopwords and picked the top five words: "health", "patients", "fever", "hospital", and "virus" (in addition to "dengue") as our feature set for training the SVM classifier. For our training set, we randomly sampled articles from the downloaded dataset and manually labeled 400 articles as relevant to dengue and 400 more as irrelevant, making sure to include articles related to public health and other diseases such as malaria in the latter set. We then ran the classifier on the document set and identified 3405 documents as relevant, which constituted our final set for evaluation.
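A minimal sketch of this relevance filter is shown below, assuming scikit-learn and the keyword features listed above. Mapping the σ = 2 RBF kernel and soft margin of 1000 reported in Section 6.3 onto SVC's gamma and C parameters is our own assumption, as are the variable names.

```python
# Sketch of the dengue-relevance classifier under the assumptions stated above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

KEYWORDS = ["dengue", "health", "patients", "fever", "hospital", "virus"]

# Restrict the TF-IDF vocabulary to the chosen keywords so each article
# becomes a small fixed-length feature vector.
vectorizer = TfidfVectorizer(vocabulary=KEYWORDS, stop_words="english")

# Assumed mapping: RBF kernel with sigma = 2 -> gamma = 1 / (2 * sigma^2) = 0.125.
classifier = SVC(kernel="rbf", gamma=0.125, C=1000)

pipeline = make_pipeline(vectorizer, classifier)

# train_texts / train_labels would hold the 800 manually labeled articles
# (400 relevant, 400 irrelevant) described above.
# pipeline.fit(train_texts, train_labels)
# relevant_mask = pipeline.predict(all_texts) == 1
```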

4.3 Location and Time Tagging
A critical component of our analysis is to identify the location-tag and the time-tag associated with every article. To obtain a time-tag, one cannot simply use the date of creation of the article as the time of the event. For example, an article published in 2012 may be evaluating the containment strategy of the 2011 dengue outbreak, and hence the number of deaths given in that article cannot be used to calibrate dengue activity in 2012. Using a set of robust pattern-matching and regular-expression-based text processing rules, we extract all dates referenced in the article, including the dateline (around 50 articles contained no date and were discarded). If an article is associated with multiple dates, then for each date we only associate a portion of the document with that date: for every date embedded in the text, we extract the textual portion within the neighborhood of the date occurrence and tag that text separately as associated with the specific date. For the rest of the unmarked text, we use the date of publishing as the associated time. While this may not be a perfect solution, we observed that this simple approximation worked fairly well in practice.
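The exact extraction rules are not given in the paper; the sketch below illustrates the time-tagging idea with two assumed date patterns and an assumed fixed-size text "neighborhood" around each date mention.

```python
# Illustrative time-tagging sketch: the regex patterns and the window size are assumptions.
import re
from datetime import datetime

DATE_PATTERNS = [
    (r"\b\d{1,2}\s+(?:January|February|March|April|May|June|July|August|"
     r"September|October|November|December)\s+\d{4}\b", "%d %B %Y"),
    (r"\b(?:January|February|March|April|May|June|July|August|September|"
     r"October|November|December)\s+\d{1,2},\s+\d{4}\b", "%B %d, %Y"),
]

def extract_dated_snippets(text, window=300):
    """Return (date, snippet) pairs; each snippet is the text surrounding a date
    mention, which is then scored separately from the rest of the article."""
    snippets = []
    for pattern, fmt in DATE_PATTERNS:
        for match in re.finditer(pattern, text):
            try:
                when = datetime.strptime(match.group(0), fmt)
            except ValueError:
                continue
            start = max(0, match.start() - window)
            end = min(len(text), match.end() + window)
            snippets.append((when.date(), text[start:end]))
    return snippets
```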

To determine the location of each article, we "geo-tagged" the document set by extracting all possible locations from the articles. These locations were extracted by matching against a comprehensive list of possible locations in Pakistan obtained from the Pakistan Post Office Department. As with time, context matters when a location is referenced in an article. A location was marked as the article's primary location if it occurred in the title or if it was the first location mentioned in the article. Articles containing multiple locations were replicated, once for each location, which allows us to treat each article as pertaining to a single location. We also retained the frequency of occurrence of each location. Figure 2 captures summary statistics for the geo-tagging procedure on our dataset.
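A corresponding sketch of the geo-tagging step is shown below; the whitelist file name and the returned structure are illustrative rather than the paper's actual implementation.

```python
# Sketch of whitelist-based geo-tagging under the assumptions stated above.
import re
from collections import Counter

def load_location_whitelist(path="pakistan_locations.txt"):
    """One location name per line, e.g. from a postal gazetteer."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def geo_tag(title, body, whitelist):
    """Return locations in order of first mention plus per-location frequencies;
    the title match (if any) or the first body mention is the primary location."""
    text = f"{title}\n{body}"
    hits = []
    for loc in whitelist:
        for match in re.finditer(r"\b" + re.escape(loc) + r"\b", text):
            hits.append((match.start(), loc))
    hits.sort()
    ordered = []
    for _, loc in hits:
        if loc not in ordered:
            ordered.append(loc)
    counts = Counter(loc for _, loc in hits)
    return ordered, counts
```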


Figure 2: Articles vs. location count

4.4 Feature Extraction
The final step of our news extraction engine is to extract a set of key features from each document that can be used for measuring the dengue spread and severity index. We extract two types of features from each document.

The first type of feature corresponds to keywords present in each document. Across all dengue documents, we determine a small set of top-ranked keywords based on the TF-IDF metric as a set of features corresponding to dengue. Given any document, we extract a feature vector comprising the document-specific TF-IDF scores of each of the top-ranked dengue-related keywords.

The second type of feature corresponds to health-related metrics that may be reported in the article. Specifically, in our regression model described in Section 5, we use two such features: the number of cases and the number of deaths (if reported) in the article. Extracting these feature values can be particularly challenging given the lack of consistency in how numbers are represented in articles. We extracted these features by first capturing sentences that contained the noun phrases most likely to be associated with numerical values, such as "cases", "patients", and "deaths". These were then categorized based on the keywords and values present. Finally, we use specific pattern-matching rules to obtain numbers from different string representations (e.g., "256", "1,234", "five"), making sure that the numbers extracted correspond to the location in context.
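The following sketch illustrates this count-extraction step; the sentence splitting, the number-word table, and the rule of keeping the largest co-occurring number are simplifying assumptions rather than the paper's exact rules.

```python
# Simplified sketch of case/death count extraction from article text.
import re

NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

def parse_number(token):
    """Convert '256', '1,234', or 'five' into an int; return None otherwise."""
    token = token.lower().replace(",", "")
    if token.isdigit():
        return int(token)
    return NUMBER_WORDS.get(token)

def extract_counts(text, keywords=("cases", "patients", "deaths")):
    """Return the largest number co-occurring with each keyword in a sentence."""
    counts = {k: 0 for k in keywords}
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        numbers = [parse_number(t) for t in re.findall(r"[\d,]+|\b\w+\b", sentence)]
        numbers = [n for n in numbers if n is not None]
        for key in keywords:
            if key in lowered and numbers:
                counts[key] = max(counts[key], max(numbers))
    return counts
```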

At the end of this step, every dengue-related document is associated with the following parameters: location, time, a feature vector of dengue-related terms, and the number of cases/deaths. This cleaned data is then used in a two-step analysis to characterize dengue severity and spread. In the first step, a term-frequency-based approach is used to characterize dengue severity. In the second step, the data is fed to the trained regression model in order to quantify the severity of the rise.

5. DENGUE SEVERITY MEASURE
In this section, we elaborate on our methodology for analyzing documents from the sanitized collection to determine a dengue severity measure on a (location, time) basis.

Broadly, we are interested in understanding how and whether simple text processing and semantic analysis of news articles drawn from within localized geographical boundaries can be used to obtain finer-grained signaling of historical trends in outbreak severity. Keeping this in mind, our methodology involves a supervised learning model trained on a small manually scored dataset using features extracted as described in Section 4, and estimating severity scores for the remainder of the dataset, parameterized on location and time.

The model comprises three phases. In the first phase, once we have cleaned up the documents and run a classifier to obtain the set of relevant articles, we extract features from each document and train a regression model. In the second phase, we run the model on the remaining set of unscored documents. In the third phase, we aggregate scores by location and time to come up with a single weighted score for each (location, time) tuple.

The following notation will be useful to characterize our methodology. We denote the entire set of articles by D. A word class W = {w1, ..., wb} is a collection of words. An article d ∈ D has the following parameters: a location ℓ and a timestamp t, and the following features: the number of cases reported c, the number of deaths reported m, and a collection of K word classes {W1, ..., WK}.

5.1 Training the model
For a document, our feature set comprises the following numerical estimates: the number of cases (we make no distinction between suspected and confirmed cases in this analysis), the number of deaths, and term-frequency counts for the word classes below (where possible, we use the root stem instead of the literal word):

1. “death”, “died”, “toll”

2. “case”, “patient”, “infected”

3. “prevent”, “aware”, “fumigate”, “campaign”, “spray”.

This provides a five-dimensional vector for each document, comprising two number-related features (cases and deaths) and three term-frequency features based on the words in the three word classes. The word classes were specifically chosen to highlight and capture three different disease severity levels.

We opt for a polynomial regression model as opposed to a simple linear regression model with the intention of capturing the information conveyed by the colocation of words such as "death" and "died" with the number of deaths extracted from the article (and similarly, "cases" and "patients" with the number of cases). To train our model, we adopt a scoring convention of 1 (not severe) to 5 (severe). In some cases, where documents were falsely classified as relevant even though they are unrelated to dengue, we assign a score of 0. For documents that have incomplete or missing data on the number of deaths and cases, either because the numbers were not present or were not extracted correctly, we adopt the imputation rule of assigning 0. Out of approximately 3400 documents, we picked 700 documents for manual scoring. From within these 700, we set aside 100 documents for testing and trained the model using the remaining 600 documents with three-fold cross-validation. We tested several different choices of feature sets, and observed the best performance for two models: one which uses all five features mentioned above, denoted M(c,m,W1,W2,W3), and one that leaves out the number of deaths as a feature, denoted M(c,W1,W2,W3). Table 1 shows normalized RMSE scores calculated for the two models, with predicted scores rounded and left as is.

                 M(c,m,W1,W2,W3)   M(c,W1,W2,W3)
No rounding          0.3063           0.2345
With rounding        0.3221           0.2433

Table 1: Normalized RMSE scores for the regression model
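A minimal sketch of how such a severity regressor could be trained with scikit-learn is shown below. The paper does not state the polynomial degree, so degree 2 is an assumption, as is the use of PolynomialFeatures with LinearRegression; X would be the five-dimensional feature matrix and y the manual scores under the 0-5 convention described above.

```python
# Sketch of the polynomial regression step under the stated assumptions.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# X: (n_documents, 5) feature matrix (cases, deaths, three word-class counts);
# y: manual severity scores; missing case/death counts are imputed with 0.
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                      LinearRegression())

def train_and_validate(X, y):
    """Three-fold cross-validation, then a final fit on all scored documents."""
    rmse = -cross_val_score(model, X, y, cv=3,
                            scoring="neg_root_mean_squared_error").mean()
    model.fit(X, y)
    return model, rmse

def predict_severity(model, X_new, rounded=True):
    """Clamp predictions to the 0-5 scale, optionally rounding to integers."""
    pred = np.clip(model.predict(X_new), 0, 5)
    return np.rint(pred) if rounded else pred
```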

5.2 Spatiotemporal estimation for severity scores
Using the model from above, we infer scores for the remaining 2800 documents. For each document d(ℓ, t; c, m, B), denoted d, we use the model to infer a score parameterized on ℓ and t, denoted s(d; ℓ, t).

Following estimation of severity scores for all documents in our dataset, we are ready to construct our spatiotemporal model. Our model is parameterized on location and time. We quantize the time period from August 2010 to April 2012 into 15-day windows, and group documents by their extracted location and by the window into which their time-tag falls. The key issue here is how we go about aggregating the scores within a window. We adopt a weighted-averaging strategy in which we weight scores by the TF-IDF scores computed for each of the feature word classes, as illustrated below.

Let K be the number of word classes (in our case, K = 3). For document d_i(ℓ, t), let w_{i,j}(ℓ, t) denote the TF-IDF score for the j-th word class in the document feature vector, shortened to w_{i,j} when the (ℓ, t) pair is clear from context. The aggregate severity score over the n documents in the window is then given by

    s_{(\ell,t)} = \frac{1}{K} \sum_{i=1}^{n} s(d_i; \ell, t) \cdot \sum_{j=1}^{K} \frac{w_{i,j}}{\sum_{i=1}^{n} w_{i,j}}

For instance, if we have 2 documents in a given (ℓ, t) pair and K = 3 word classes (say "fever", "acute", "death"), then the aggregate score is given by

    s_{(\ell,t)} = \frac{s(d_1)}{3}\left(\frac{w_{1,f}}{w_{1,f}+w_{2,f}} + \frac{w_{1,a}}{w_{1,a}+w_{2,a}} + \frac{w_{1,d}}{w_{1,d}+w_{2,d}}\right) + \frac{s(d_2)}{3}\left(\frac{w_{2,f}}{w_{1,f}+w_{2,f}} + \frac{w_{2,a}}{w_{1,a}+w_{2,a}} + \frac{w_{2,d}}{w_{1,d}+w_{2,d}}\right)
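The aggregation formula above translates directly into a short routine; the sketch below assumes each document in a (location, time) window is represented as a (score, per-class TF-IDF weights) pair, which is our own representation for illustration.

```python
# Sketch of the weighted aggregation over one (location, time) window.
def aggregate_severity(docs, num_classes=3):
    """docs: list of (score, [w_1, ..., w_K]) for one (location, time) window."""
    if not docs:
        return None
    # Column totals: sum of TF-IDF weights per word class across documents.
    totals = [sum(w[j] for _, w in docs) for j in range(num_classes)]
    aggregate = 0.0
    for score, weights in docs:
        share = sum(weights[j] / totals[j]
                    for j in range(num_classes) if totals[j] > 0)
        aggregate += score * share / num_classes
    return aggregate
```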

5.3 Estimating spread

Once we calibrate the dengue severity score on a per-locality basis, we use a simple clustering approach to determine the dengue spread around a locality. Given a location, we recursively determine all neighboring locations (the closest towns and cities) with a similar severity score and determine the geographic area covered by the locations that are part of the same cluster. The size of this geographic region gives an estimate of dengue spread around each locality.
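The paper does not spell out the clustering procedure; the sketch below illustrates one plausible reading, in which a cluster is grown over an assumed neighbor graph of nearby towns with an assumed severity-difference tolerance and per-location area lookup.

```python
# Illustrative spread-estimation sketch; neighbor graph, tolerance, and area
# table are assumptions, not the paper's actual implementation.
def estimate_spread(seed, scores, neighbors, areas, tolerance=0.5):
    """scores: location -> severity; neighbors: location -> adjacent locations;
    areas: location -> approximate area in km^2. Returns the cluster of
    similar-severity locations around `seed` and its total area."""
    cluster, frontier = {seed}, [seed]
    while frontier:
        current = frontier.pop()
        for nxt in neighbors.get(current, ()):
            if nxt not in cluster and abs(scores.get(nxt, 0) - scores[seed]) <= tolerance:
                cluster.add(nxt)
                frontier.append(nxt)
    return cluster, sum(areas.get(loc, 0) for loc in cluster)
```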

We are currently exploring more advanced mechanisms for calibrating disease spread using the concept of "spread trajectories". Using the location- and time-tagging methodology, we can overlay a chronological order of epidemic outbreak events on a map, where we can use a combination of graphical models and maximum likelihood estimation to chart out the most plausible spread trajectory.

6. EVALUATION
We present key observations and results from our evaluation of the proposed dengue severity analysis system. We chose to conduct our evaluation around the peak dengue epidemic outbreak of 2011 in Pakistan to characterize how coverage of the epidemic varied in newspapers as the epidemic peaked and receded.

6.1 Data sources
We built our document set by crawling articles from six premier English-language online news sources in Pakistan: Dawn, PakTribune, The News, Nation, Tribune, and Pakistan Today. The total time taken to crawl and compile these articles was about 15 hours, with an average crawl time of about 6 seconds per article, not accounting for occasional timeouts. Cleaning the document collection of stopwords, and of articles with either zero identifiable locations or only locations outside Pakistan, took about 5 minutes. Table 2 gives a source-wise breakdown of our document set.

Publisher         Article count
Dawn                  1552
Nation                 592
Pakistan Today        1014
PakTribune             104
The News               434
Tribune                589

Table 2: Source-wise breakdown of articles

In all, we retrieved more than 4,200 articles for the period of August 2010 to April 2012, at an average of about 210 articles per month. Note that these are only articles mentioning the word "dengue." This indicates a fairly high level of coverage and attention devoted to the epidemic as it spread through the country. Figure 3 charts the number of articles published over time. It is worth highlighting that the peak number of articles appeared in the months of August and September 2011, coinciding almost exactly with the peak outbreak of the epidemic.

6.2 Feature extraction


Figure 3: Article count over evaluation period.

Figure 4: Most frequent words in the document set.

We focus on using simple lexical analysis tools to identify potential features hidden in the document set. In contrast with content published on Twitter, news articles have more structure and order, which renders them easily amenable to text processing. We exploit this and consider simple unigrams as potential features in our system, sorting them in descending order of TF-IDF score. Figure 4 captures this, and speaks to the high correlation and co-occurrence of words from public health terminology.

6.3 Classifier accuracy
We used an SVM classifier with a Gaussian radial basis function as the kernel, with σ = 2 and the soft margin set at 1000. We evaluated many different classifiers, tuning the number of features and the choice of kernel function. We started with as many as 19 features from our feature set, and finally settled on the five features that gave the best results. Out of approximately 3800 articles featuring the word "dengue" and containing at least one location in Pakistan, the classifier labeled 3405 articles as relevant, with precision 0.8974 and recall 0.9722 on a random sample of 50 articles. While our document set displays an inherent bias in that each document contains the word dengue, it is still noteworthy that our classifier, using only unigrams as features, is able to achieve such high precision and recall values.

Figure 5: Correlation between document score and number of deaths reported using M(c,W1,W2,W3).

6.4 Regression model performance
We evaluated the polynomial regression model used to score documents for dengue severity in two ways. Out of approximately 3400 documents, we picked 700 documents for manual scoring. From within these 700, we set aside 100 documents for testing and trained the model using the remaining 600 documents with three-fold cross-validation. We tested several different choices of feature sets, and observed the best performance for two models: one which uses all five features mentioned above, denoted M(c,m,W1,W2,W3), and one that leaves out the number of deaths as a feature, denoted M(c,W1,W2,W3). The latter tests our hypothesis that the number of deaths as a feature would show high correlation with the number of terms related to death (and so would be redundant). Secondly, since we were not able to access actual case data (owing to privacy reasons, as highlighted earlier), we can evaluate M(c,W1,W2,W3) against the number of deaths reported by newspapers as a proxy to assess the performance of our model.

Table 3 shows error statistics calculated for the two models.

                 M(c,m,W1,W2,W3)   M(c,W1,W2,W3)
RMSE (raw)           0.3063           0.2345
RMSE (rounded)       0.3221           0.2433
t-test (raw)         0.0812           0.0966

Table 3: Error statistics for the regression model

On a random sample of 20 documents obtained from the unscored set, the RMSE for M(c,W1,W2,W3) with rounding was 0.21, with 65% agreement. Figure 5 summarizes the relationship between document score and the number of deaths reported when using M(c,W1,W2,W3) with rounding. Both of these illustrate that M(c,W1,W2,W3) is actually the better-performing model and confirm our original intuition about the correlation between the two features. Finally, we capture general lexical trends of articles scored at the two ends of the spectrum. Figure 6 portrays a word cloud detailing the most frequent words appearing in articles scored 1 and, analogously, Figure 7 shows a word cloud for articles scored 5 by our model.


Figure 6: Word cloud for articles scored 1 in our model.

Figure 7: Word cloud for articles scored 5.

Figure 8: Snapshot: 08/11/2011 - 08/25/2011

Figure 9: Snapshot: 09/11/2011 - 09/18/2011

Two things are immediately apparent in these two figures. First, articles labeled as severe tend to have a higher density of words that can be termed relevant to dengue than those labeled not severe. Second, articles labeled as severe appear to have a more sharply tailed frequency distribution of words when compared with those at a lower score. This yields a promising avenue for future investigation in terms of feature selection.

6.5 Spatiotemporal model
Per the ideas illustrated in Section 5, we aggregated scores by location and time to construct a parameterized model. We evaluate this model by studying spatial trends at both the national level and the city-specific level. Our spatiotemporal model is visualized in Figures 8-10. Of particular note is the fact that our model was able to precisely capture the peak of the dengue outbreak during the months of August through September 2011 (see http://www.thenews.com.pk/NewsDetail.aspx?ID=22690&title=No-end-to-Dengue-fever-in-Punjab). Since our model can extract locations and generate scores for each location, we are able to do a deep dive and analyze regions at the town level, as in Figure 11, which captures the dengue outbreak severity trend from August 11, 2011 to November 23, 2011. This matched accurately with the trend in the number of reported cases, as shown in Figure 12.


Figure 10: Snapshot: 09/25/2011 - 10/02/2011

Figure 11: Tracking dengue severity in Multan from 08/2011 to 11/2011

Figure 12: Reported cases in Multan from 08/2011 to 11/2011

These maps also serve as validation of our approach to lexical and semantic analysis of the document set, particularly with regard to the tools we used to extract locations, timestamps, and statistics from the news articles.

7. CONCLUSIONS
We first highlight some takeaway points from our evaluations in Section 6:

• A convincing argument can be made for a system that mines articles from media sources and infers the severity of a large-scale disease outbreak.

• It is possible to construct a richly detailed spatiotemporal proxy model that captures both nationwide and city-level trends in public health emergencies.

• Even simple lexical and semantic analysis tools, such as frequently observed unigrams, are good indicators of historical trends about epidemics.

• However, there is scope for considering second-order features, such as word-cloud distributions and the density of relevant words, in building upon and improving our model.

This work was in part motivated by the recent 2011 dengue outbreak in Pakistan that resulted in several thousand deaths. The lack of a nationwide integrated system that can capture the severity and spread of dengue in the country prompted us to explore this problem. The system we have described here is a first take at designing a nationwide disease outbreak surveillance system, one that can be considerably enhanced by a public health analytics component that leverages both real-time and historical patterns to come up with better detection methodologies.

Our system design and evaluation show how Internet media sources can be used as a powerful information source to accurately calibrate dengue activity. The focus was on making the surveillance model as accurate and timely as possible, and we used different error analysis methods to verify the accuracy of each step of the system. We were able to accurately quantify the severity of dengue activity over the two-year study period. There was a high correlation between the number of deaths and cases and the severity score our system computed from news articles, which justifies using our model for historical trend analysis. The model was also able to characterize dengue activity at a very fine granularity, down to the town and city level, and to characterize the spread across geographical locations at specific points in time.

As future work, one of our primary focuses will be on extending this system design to incorporate real-time media sources from the Internet and to perform analysis at a fine-grained time granularity. We plan to work towards integrating such a real-time system with traditional systems, to complement their performance, to help better track disease spread across locations, and to aid in the early detection of outbreaks.

8. REFERENCES

[1] Hulth A, Rydevik G, and Linde A, Web queries as a source for syndromic surveillance, PLoS ONE 4 (2009), no. 2, e4378.

[2] Althouse BM, Ng YY, and Cummings DAT, Prediction of dengue incidence using search query surveillance, 5 (2011), no. 8.

[3] Chew C and Eysenbach G, Pandemics in the age of Twitter: Content analysis of tweets during the 2009 H1N1 outbreak, PLoS ONE 5 (2010), no. 11, e14118.

[4] Freifeld CC, Mandl KD, Reis BY, and Brownstein JS, HealthMap: Global infectious disease monitoring through automated classification and visualization of Internet media reports, J Am Med Inform Assoc 15 (2008), no. 2, 150–157.

[5] Emily H. Chan, Vikram Sahai, Corrie Conrad, and John S. Brownstein, Using web search query data to monitor dengue epidemics: A new model for neglected tropical disease surveillance, PLoS Negl Trop Dis 5 (2011), no. 5, e1206.

[6] Das D, Metzger K, Heffernan R, Balter S, Weiss D, et al., Monitoring over-the-counter medication sales for early detection of disease outbreaks – New York City, MMWR Morb Mortal Wkly Rep 54 (2005), suppl, 41–46.

[7] Mykhalovskiy E and Weir L, The Global Public Health Intelligence Network and early warning outbreak detection: A Canadian contribution to global public health, Canadian Journal of Public Health 97 (2006), no. 1, 42–44.

[8] Eysenbach G, Infodemiology and infoveillance: Framework for an emerging set of public health informatics methods to analyze search, communication and publication behavior on the Internet, J Med Internet Res 11 (2009), no. 1, e11.

[9] Eysenbach G and Köhler C, Health-related searches on the Internet, JAMA 291 (2004), no. 24, 2946.

[10] Janaina Gomide, Adriano Veloso, Wagner Meira, Virgilio Almeida, Fabricio Benevenuto, Fernanda Ferraz, and Mauro Teixeira, Dengue surveillance based on a computational model of spatio-temporal locality of Twitter, Proceedings of ACM WebSci'11, Koblenz, Germany, June 14-17, 2011.

[11] Tolentino H, Scanning the emerging infectious diseases horizon: Visualizing ProMED emails using EpiSpider, International Society for Disease Surveillance Annual Conference, 2006.

[12] Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, and Brilliant L, Detecting influenza epidemics using search engine query data, Nature 457 (2009), no. 5, 1012–1014.

[13] Bork KH, Klein BM, Mølbak K, Trautner S, Pedersen UB, and Heegaard E, Surveillance of ambulance dispatch data as a tool for early warning, Euro Surveill. 11 (2006), no. 12, 229–233.

[14] V. Lampos and N. Cristianini, Nowcasting events from the social web with statistical learning, ACM Transactions on Intelligent Systems and Technology (2011).

[15] Besculides M, Heffernan R, Mostashari F, and Weiss D, Evaluation of school absenteeism data for early outbreak detection, New York City, BMC Public Health 5 (2005), 105.

[16] Lawrence C. Madoff, ProMED-mail: An early warning system for emerging diseases, Clinical Infectious Diseases 39 (2004), no. 2, 227–232.

[17] Lawrence C. Madoff, David N. Fisman, and Taha Kass-Hout, A new approach to monitoring dengue activity, 5 (2011), no. 5.

[18] Beatty ME, Stone A, Fitzsimons DW, Hanna JN, Lam SK, Vong S, Guzman MG, Mendez-Galvan JF, Halstead SB, Letson GW, Kuritsky J, Mahoney R, and Margolis HS, Best practices in dengue surveillance: A report from the Asia-Pacific and Americas Dengue Prevention Boards, Asia-Pacific and Americas Dengue Prevention Boards Surveillance Working Group.

[19] Collier N, Doan S, Kawazoe A, Goodwin RM, Conway M, et al., BioCaster: Detecting public health rumors with a web-based text mining system, Bioinformatics 24 (2008), 2940–2941.

[20] Khan SA, Patel CO, and Kukafka R, GODsN: Global news driven disease outbreak and surveillance, AMIA Annual Symposium Proceedings, 2006.

[21] Yih WK, Teates KS, Abrams A, Kleinman K, Kulldorff M, et al., Telephone triage service data for detection of influenza-like illness, PLoS ONE 4 (2009), e5260.
