Infectious Disease Informatics: Overview and The BioPortal Experience
Hsinchun Chen, Ph.D.
Artificial Intelligence Lab, U. of Arizona
NSF BioPortal Research Center
Acknowledgement: NSF, CIA, DHS, CDC, NCI
Dec 13, 2015
Hsinchun Chen et al., 2005 Hsinchun Chen, et al., 2010
Medical Informatics: The computational, algorithmic, database and information-centric approach to the study of medical and health care problems.
Infectious Disease Informatics: Medical informatics for infectious disease, public health, and biodefence.
IDI and Syndromic Surveillance Systems
Data sources and collection strategies: formal-informal (sequence to epi), standards, data entry and transmission, security
Data analysis and outbreak detection: syndromic classification, outbreak detection methods (temporal, spatial, spatial-temporal), multiple data streams
Data visualization, information dissemination, and alerting: GIS, temporal, sequence, text, interactive
System assessment and evaluation: algorithms, data collection, information dissemination, interface, usability
Syndromic Surveillance Data Sources in Different Stages of Developing a Disease Reaching Situational Awareness
Reproduced from Mandl et al. (2004)
Syndromic Surveillance Systems
Generation 1, paper-based: paper, fax, TEL, TEL directory, etc.
Generation 2, email-based: email, Word/Access, pager, cell phone, etc.
Generation 3, database-driven: database, standards, messaging, tabulation, GIS, graphs, text, etc.
Generation 4, search engine-based: real-time, interactive, web services, visualized, GIS, graphs, texts, sequences, contact networks, etc.
Syndromic Surveillance System Survey
Project | User population | Stakeholders
RODS | Pennsylvania, Utah, Ohio, New Jersey, Michigan, etc.; 418 facilities connected to RODS | RODS laboratory, U of Pittsburgh
STEM | N/A | IBM
ESSENCE II | 300 DoD medical facilities worldwide | DoD
EARS | Various city, county, and state public health officials in the United States and abroad | CDC
BioSense | Various city, county, and state public health officials in the United States and abroad | CDC
RSVP (Rapid Syndrome Validation Project) | Kansas, NM | Sandia NL, NM
BioPortal | NY, CA, Kansas, AZ, Taiwan | U of Arizona
Sample Systems and Data Sources Utilized
Project | Data sources | Techniques
RODS | Chief complaints (CC); OTC medication sales | Free-text Bayesian disease classification
STEM | Simulated disease data | Disease modeling and visualization, SIR
ESSENCE II | Military ambulatory visits; CC; absenteeism data | (none listed)
EARS | 911 calls; CC; absenteeism; OTC drug sales | Human-developed CC classification rules
BioSense | City/state generated geocoded clinical data | Graphing/mapping displays
RSVP | Clinical and demographic data | PDA entry and access
BioPortal | Geo-coded clinical data; genomic sequences; multilingual CC | Real-time access and visualization; Web-based hotspot analysis; sequence visualization; multilingual ontology-based CC classification
COPLINK News
• Newsweek Magazine, March 3, 2003
• ABC News, April 15, 2003
• The New York Times, November 2, 2002
Dark Web News
• Project Seeks to Track Terror Web Posts, 11/11/2007
• Researchers say tool could trace online posts to terrorists, 11/11/2007
• Mathematicians Work to Help Track Terrorist Activity, 9/14/2007
• Team from the University of Arizona identifies and tracks terrorists on the Web, 9/10/2007
BioPortal: Overview, West Nile Virus (real-time information collection, sharing, access,
visualization, and analysis, Epi data across species)
BioPortal Project Goals
Demonstrate and assess the technical feasibility and scalability of an infectious disease information sharing (across species and jurisdictions), alerting, and analysis framework.
Develop and assess advanced data mining and visualization techniques for infectious disease data analysis and predictive modeling.
Identify important technical and policy-related challenges in developing a national infectious disease information infrastructure.
Information Sharing Infrastructure Design
Figure labels: NYSDOH; CADHS; New; Adaptor (one per source); XML/HL7 Network; PHINMS Network; SSL/RSA (on both network links); Info-Sharing Infrastructure; Data Ingest Control Module; Cleansing / Normalization; Portal Data Store (MS SQL 2000)
Data Access Infrastructure Design
Figure labels: Browser (IE/Mozilla/…); SSL connection; Web Server (Tomcat 4.21 / Struts 1.2); WNV-BOT Portal; Data Search and Query; Spatial-Temporal Visualization; HAN or Personal Alert Management; Analysis / Prediction; User Access Control API (Java); Dataset Privileges Management; Access Privilege Def.; Data Store (MS SQL 2000)
Users: public health professionals, researchers, policy makers, law enforcement agencies & other users
Spatial-Temporal Visualization
Integrates four visualization techniques: GIS View, Periodic Pattern View, Timeline View, Central Time Slider
Visualizes the events in multiple dimensions to identify hidden patterns: spatial, temporal, hotspot analysis, phylogenetic tree, contact network analysis
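One way to read the difference between the Timeline View and the Periodic Pattern View is as two ways of binning the same event dates. The sketch below is illustrative only: the function names and the sample dates are hypothetical and are not taken from BioPortal.

```python
from collections import Counter
from datetime import date

def timeline_counts(event_dates):
    """Count events per (year, month): a linear timeline."""
    return Counter((d.year, d.month) for d in event_dates)

def periodic_counts(event_dates):
    """Count events per calendar month, pooling all years: a periodic pattern."""
    return Counter(d.month for d in event_dates)

# Hypothetical report dates spanning three years (not real surveillance data).
events = [
    date(2001, 5, 3), date(2001, 6, 12),
    date(2002, 5, 20), date(2002, 5, 28),
    date(2003, 6, 1), date(2003, 9, 9),
]

timeline = timeline_counts(events)   # keyed by (year, month)
periodic = periodic_counts(events)   # keyed by month only
```

Pooling by calendar month is what lets a seasonal concentration show up even when each individual year has only a few records.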
Outbreak Detection & Hotspot Analysis
A hotspot is a condition indicating some form of clustering in a spatial and temporal distribution (Rogerson & Sun 2001; Theophilides et al. 2003; Patil & Taillie 2004; Zeng et al. 2004; Chang et al. 2005)
For WNV, localized clusters of dead birds typically identify high-risk disease areas (Gotham et al. 2001); automatic detection of dead bird clusters can help predict disease outbreaks and allocate prevention/control resources effectively
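The idea of a hotspot as unusual clustering relative to a baseline can be sketched with a simple grid comparison. This is not RSVC, SaTScan, or CrimeStat; it is a deliberately naive stand-in, and the cell size, thresholds, and sample points are all hypothetical.

```python
from collections import Counter

def grid_cell(point, cell_size):
    """Map an (x, y) point to an integer grid cell."""
    x, y = point
    return (int(x // cell_size), int(y // cell_size))

def find_hotspots(baseline_pts, case_pts, cell_size=1.0, min_cases=3, min_ratio=2.0):
    """Return cells whose case count is well above what the baseline predicts."""
    base = Counter(grid_cell(p, cell_size) for p in baseline_pts)
    cases = Counter(grid_cell(p, cell_size) for p in case_pts)
    n_base, n_case = len(baseline_pts), len(case_pts)
    hotspots = []
    for cell, observed in cases.items():
        # Expected count if the case period followed the baseline distribution.
        # Add-one smoothing avoids division by zero for cells with no baseline.
        expected = (base[cell] + 1) / (n_base + 1) * n_case
        if observed >= min_cases and observed / expected >= min_ratio:
            hotspots.append(cell)
    return sorted(hotspots)

# Hypothetical coordinates: baseline period vs. case period.
baseline = [(0.5, 0.5), (0.2, 0.8), (1.5, 0.5), (2.5, 2.5)]
new_cases = [(2.1, 2.2), (2.4, 2.6), (2.7, 2.9), (2.3, 2.1), (2.8, 2.4), (0.5, 0.5)]
hotspots = find_hotspots(baseline, new_cases)
```

A fixed grid cannot find irregularly shaped clusters, which is one motivation for clustering-based approaches.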
Risk-Adjusted Support Vector Clustering (RSVC)
Estimate baseline density
Minimum sphere
Feature space
High baseline density makes two points far apart in feature space
Split into several clusters
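The statement that a high baseline density pushes points apart in feature space can be illustrated with a Gaussian kernel K(x, y) = exp(-q * ||x - y||^2), for which the squared feature-space distance is 2 - 2 * K(x, y). The sketch below assumes, purely for illustration, that the kernel parameter q is scaled linearly by the local baseline density; the slides do not give the exact adjustment used in RSVC, and all names here are hypothetical.

```python
import math

def gaussian_kernel(x, y, q):
    """K(x, y) = exp(-q * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-q * sq_dist)

def feature_space_distance(x, y, q):
    """||phi(x) - phi(y)|| using K(x,x) + K(y,y) - 2K(x,y) = 2 - 2K(x,y)."""
    return math.sqrt(2.0 - 2.0 * gaussian_kernel(x, y, q))

def risk_adjusted_q(base_q, baseline_density):
    """Assumption: q grows linearly with the estimated baseline density."""
    return base_q * baseline_density

a, b = (0.0, 0.0), (0.5, 0.0)
d_low = feature_space_distance(a, b, risk_adjusted_q(1.0, 1.0))   # low baseline density
d_high = feature_space_distance(a, b, risk_adjusted_q(1.0, 4.0))  # high baseline density
```

Under this assumption the same pair of points is farther apart in feature space where the baseline is dense, so a minimum enclosing sphere is more likely to split them into separate clusters.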
Study II: NY WNV (birds, mosquitoes, and humans)
Based on NY’s test dataset
On May 26, 2002, the first dead bird with WNV was found in NY
Baseline: March 5 to May 26, 140 records
New cases: May 26 to July 2, 224 records
User Login
Choose WNV disease data
User main page: available dataset list
Select CA dead bird, chicken and NY dead bird data
Advanced Search criteria: dataset name; positive cases; bird species; spatial / temporal (time range, county / state)
Results listed in table
Select background maps: NY / CA population, river and lakes
Start STV
Zoom in NY
Views: Timeline, Periodic Pattern, GIS, Control panel
View all 3 years of data: overall pattern
NY dead bird temporal distribution pattern: 1-year window in a 3-year span; concentrated in May / Jun
Move time slider to year 2: similar time pattern
Move time slider to year 3: similar time pattern
Zoom in
Close
Year 2001 data: spatial distribution pattern, 2-week window
Move time slider: spatial distribution pattern
Season end
Dead bird cases migrate from Long Island into upstate NY
Enable population map
Overlay population map
Dead bird cases distribute along populated areas near the Hudson River
BioPortal HotSpot Analysis: RSVC, SaTScan, and CrimeStat Integrated (first visual, real-time hotspot analysis system for disease surveillance)
West Nile virus in California
Regular STV
Hotspot Analysis-Enabled STV
Select baseline and case periods
Select algorithms
Select target geographic area
Hotspots found!
Select hotspot to highlight case points
FMD Global Surveillance: Lessons Learned
Must understand risks, and nature of changing risks, in order to develop strategies for prevention and mitigation on a global scale
Must understand the global situation in order to prepare locally
United Kingdom FMD outbreak, 2001; $12B, 50-60% of 4M farm animals (cows, pigs, sheep) slaughtered
International FMD BioPortal
Real time web-based situational awareness of FMD outbreaks worldwide through the establishment of an international information technology system.
FMDv characterization at the genomic level integrated with associated epidemiological information and modeling tools to forecast national, regional and/or international spread and the prospect of importation into the US and the rest of North America.
Web-based crisis management of resources—facilities, personnel, diagnostics, and therapeutics.
Preliminary Global FMD Dataset
Provider: UC Davis FMD Lab
Information sources: reference labs and OIE
Coverage: 28 countries globally
Dataset size: 30,000+ records, of which 6789 records are complete
Host species: Cattle, Caprine, Ovine, Bovine, Swine, NK, Elephant, Buffalo, Sheep, Camelidae, Goat
Chart labels (host species): Ovine 37%, Bovine 37%, Swine 11%, Cattle 5%, Caprine 4%, Sheep 3%, Buffaloes 3%, Camelidae 0%, Goats 0%, Elephant 0%
Regionwise Distribution of FMD Data: South America 66%, Central and South Asia 15%, Europe 14%, Middle East Asia 4%, Africa 1%
FMD Migration Visualization using BioPortal (cases in South Asia)
FMD Cases travel back and forth
between countries
International FMD News
Provider: UC Davis FMD Lab
Information sources: Google, Yahoo, and open Internet sources
Time span: Oct 4, 2004 to present (real-time messaging under development)
Data size: 460 events (6/21/05)
Coverage: 51 countries (Africa: 11, Asia: 16, Europe: 12, Americas: 12)
Chart labels: Africa 11%, Aisa 1%, Asia 15%, Australia 14%, Europe 27%, America 27%, UNDEFINED 5%
FMD Genetic Visualization
Goal: Extend STV to incorporate a 3rd dimension, phylogenetic distance
Include a phylogenetic tree.
Identify phylogenetic groups and color-code the isolate points on the map.
Leverage available NCBI tools such as BLAST.
Proof of concept: SAT 2 & 3 analysis
Data: 54 partial DNA sequence records in South Africa received from UC Davis FMD Lab (Bastos, A.D. et al. 2000, 2003)
Date range: 1978-1998
Countries covered: South Africa, Zimbabwe, Zambia, Namibia, Botswana
Phylogenetic Tree of Sample FMD Data
Identify 6 groups within 2 major families (MEGA3; based on sequence similarity)
Tree labels: Group 1, Group 2, Group 3, Group 4, Group 5, Group 6
Genetic, Spatial, and Temporal Visualization of FMD Data
Isolates’ locations color coded
Phylogenetic tree color coded
Isolates’ appearances in time
FMD Time Sequence Analysis
First family cases appeared throughout the period
Second family cases existed before 1993 and reappeared after 1997
FMD Periodic Pattern Analysis
2nd family concentrated in Feb. while 1st family spread evenly
Locations of Family 1 records
Selected only groups 1, 2, and 3 and found a spatial cluster
Locations of Family 2 records
Selected only groups 4, 5, and 6
Sparse isolate locations
BioPortal: Influenza, SARS (chief complaint syndromic surveillance,
contact network analysis and visualization)
Existing CC Classification Methods
Classification method | Systems | Authors
Keyword Match + Synonym List + Mapping Rules | DOHMH (NY City), EARS | Mikosz et al. (2004)
Weighted Keyword Match (Vector Cosine Method) + Mapping Rules | ESSENCE | Sniegoski (2004)
Naïve Bayesian | RODS | Olszewski (2003), Ivanov et al. (2002)
Bayesian Network | N/A | Chapman et al. (2004)
Syndromic Categories in Different Systems
System | # Syndromes | Details
CDC (2002) | 11 | Botulism, Hemorrhagic, Lymphadenitis, Cutaneous Lesion, Gastrointestinal, Respiratory, Neurological, Rash, Specific Infection, Fever, Severe Illness or Death
EARS | 41 | Lower Resp., Upper Resp., Neuro, Febrile, Poison, Hemorrhage, Botulinic, Rash, Fever, etc. (41 categories)
RODS | 8 | Gastrointestinal, Constitutional, Respiratory, Rash, Hemorrhagic, Botulinic, Neurological, Other
ESSENCE | 8 | Death, Gastr, Neuro, Rash, Respi, Sepsi, Unspe, Other
Overall System Design
Chief Complaints -> Stage 1: CC Standardization -> symptoms -> Stage 2: Symptom Grouping -> symptom groups -> Stage 3: Syndrome Classification -> syndromes
Figure labels (resources and tools): EMT-P; JESS; UMLS Concepts; Synonym List; UMLS Ontology; Weighted Semantic Similarity Score; Symptom Grouping Table; EARS Symptom Table; EARS Syndrome Rules
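The three stages named in this design (CC standardization, symptom grouping, syndrome classification) can be sketched as a chain of lookups. This is a toy illustration only: the tables are tiny hypothetical stand-ins, and Stage 2 here uses exact lookup rather than the UMLS-based weighted semantic similarity named on the slide.

```python
# Hypothetical stand-ins for the synonym list, symptom grouping table,
# and syndrome rules; the real resources are far larger.
SYNONYMS = {
    "sob": "shortness of breath",
    "n/v": "nausea vomiting",
    "abd pain": "abdominal pain",
}
SYMPTOM_GROUPS = {
    "shortness of breath": "dyspnea",
    "cough": "cough",
    "nausea vomiting": "vomiting",
    "abdominal pain": "abdominal pain",
}
SYNDROME_RULES = {
    "dyspnea": "RESP",
    "cough": "RESP",
    "vomiting": "GI",
    "abdominal pain": "GI",
}

def standardize(cc):
    """Stage 1: lowercase, split on commas/semicolons, expand synonyms."""
    parts = [p.strip() for p in cc.lower().replace(";", ",").split(",")]
    return [SYNONYMS.get(p, p) for p in parts if p]

def group_symptoms(symptoms):
    """Stage 2: map symptoms to symptom groups; unknown symptoms are dropped."""
    return [SYMPTOM_GROUPS[s] for s in symptoms if s in SYMPTOM_GROUPS]

def classify(groups):
    """Stage 3: map symptom groups to a sorted list of distinct syndromes."""
    return sorted({SYNDROME_RULES[g] for g in groups if g in SYNDROME_RULES})

def classify_cc(cc):
    """Run all three stages on one free-text chief complaint."""
    return classify(group_symptoms(standardize(cc)))

result = classify_cc("SOB, N/V")  # ['GI', 'RESP']
```

Keeping the stages separate means an unseen abbreviation only needs a new entry in the Stage 1 table, while the syndrome rules stay unchanged.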
Comparing BioPortal to RODS
Trained BioPortal
Syndrome TP+FN PPV Sensitivity Specificity F F2
GI 124 91.41% 94.35%*** 98.74% 0.93*** 0.93***
HEMO 30 82.86% 96.67%*** 99.38% 0.89** 0.92***
RASH 15 66.67% 66.67%** 99.49% 0.67* 0.67**
RESP 110 92.08% 84.55%**** 99.10% 0.88*** 0.87***
UPPER_RESP 43 80.43% 86.05% 99.06% 0.83 0.84
RODS
Syndrome TP+FN PPV Sensitivity Specificity F F2
GI 124 89.89% 64.52% 98.97% 0.75 0.71
HEMO 30 90.91% 66.67% 99.79%* 0.77 0.73
RASH 15 58.33% 46.67% 99.49% 0.52 0.50
RESP 110 87.84% 59.09% 98.99% 0.71 0.66
UPPER_RESP 43 N/A N/A N/A N/A N/A
* p-value < 0.1; ** p-value < 0.05; *** p-value < 0.01
Statistical tests are based on 2,500 bootstrap resamples.
Comparing BioPortal to EARS
Trained BioPortal
Syndrome TP+FN PPV Sensitivity Specificity F F2
GI 124 91.41% 94.35%*** 98.74% 0.93*** 0.93***
HEMO 30 82.86% 96.67%*** 99.38% 0.89*** 0.92***
RASH 15 66.67% 66.67%** 99.49% 0.67 0.67*
RESP 110 92.08% 84.55%**** 99.10% 0.88*** 0.87***
UPPER_RESP 43 80.4%*** 86.05%*** 99.06%*** 0.83*** 0.84***
EARS
Syndrome TP+FN PPV Sensitivity Specificity F F2
GI 124 93.75%* 72.58% 99.32%*** 0.82 0.78
HEMO 30 100.00%*** 33.33% 100.00%*** 0.50 0.43
RASH 15 70.00% 46.67% 99.70% 0.56 0.53
RESP 110 90.36% 68.18% 99.10% 0.78 0.74
UPPER_RESP 43 58.70% 62.79% 98.01% 0.61 0.61
* p-value < 0.1; ** p-value < 0.05; *** p-value < 0.01
Statistical tests are based on 2,500 bootstrap resamples.
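The F and F2 columns can be reproduced from PPV and sensitivity with a weighted harmonic mean. The slides do not define F2; the tabulated values are consistent with a weight of 2 on sensitivity, that is F2 = 3 * PPV * Sens / (2 * PPV + Sens), and the sketch below adopts that reading as an assumption. The function name is hypothetical.

```python
def f_measure(ppv, sensitivity, beta_sq=1.0):
    """Weighted harmonic mean of PPV and sensitivity.

    beta_sq = 1 gives the balanced F measure; beta_sq = 2 is the weighting
    assumed here for the F2 column (an inference from the table values).
    """
    denom = beta_sq * ppv + sensitivity
    if denom == 0:
        return 0.0
    return (1.0 + beta_sq) * ppv * sensitivity / denom

# GI row, trained BioPortal: PPV 91.41%, sensitivity 94.35%
f_gi = round(f_measure(0.9141, 0.9435), 2)           # 0.93
f2_gi = round(f_measure(0.9141, 0.9435, 2.0), 2)     # 0.93

# GI row, RODS: PPV 89.89%, sensitivity 64.52%
f_rods = round(f_measure(0.8989, 0.6452), 2)         # 0.75
f2_rods = round(f_measure(0.8989, 0.6452, 2.0), 2)   # 0.71
```

Because F2 weights sensitivity more heavily, a system with high PPV but low sensitivity (for example EARS on HEMO) scores lower on F2 than on F.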
Chinese CC Preprocessing: System Design
Chinese Chief Complaints (Raw Chinese CCs) -> Stage 0.1: Separate Chinese and English Expressions -> Chinese Expressions -> Stage 0.2: Chinese Phrase Segmentation -> Segmented Chinese Phrases -> Stage 0.3: Chinese Phrase Translation -> Translated Chinese Phrases
Figure labels (resources): Chinese Medical Phrases; Common Chinese Phrases; Mutual Info.; Chinese to English Dictionary
Other output: English Expressions
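The three preprocessing stages can be sketched end to end on a mixed-language chief complaint. This is a minimal illustration under stated assumptions: the lexicon is a three-entry hypothetical stand-in for the Chinese-to-English dictionary, and segmentation uses greedy longest-match lookup as a simpler substitute for the mutual-information approach labeled in the design.

```python
import re

# Hypothetical mini-lexicon; the real dictionary and phrase lists are much larger.
ZH_EN = {"發燒": "fever", "頭痛": "headache", "咳嗽": "cough"}

def separate(cc):
    """Stage 0.1: split a raw CC into Chinese runs and English runs."""
    chinese = re.findall(r"[\u4e00-\u9fff]+", cc)
    english = re.findall(r"[A-Za-z]+(?:\s+[A-Za-z]+)*", cc)
    return chinese, english

def segment(run, lexicon, max_len=4):
    """Stage 0.2: greedy longest-match segmentation against the lexicon."""
    phrases, i = [], 0
    while i < len(run):
        for n in range(min(max_len, len(run) - i), 0, -1):
            piece = run[i:i + n]
            # Fall back to a single character when nothing in the lexicon matches.
            if piece in lexicon or n == 1:
                phrases.append(piece)
                i += n
                break
    return phrases

def translate(phrases, lexicon):
    """Stage 0.3: dictionary lookup; unknown phrases pass through unchanged."""
    return [lexicon.get(p, p) for p in phrases]

def preprocess(cc):
    """Return translated Chinese phrases followed by the English expressions."""
    chinese, english = separate(cc)
    phrases = [p for run in chinese for p in segment(run, ZH_EN)]
    return translate(phrases, ZH_EN) + english

output = preprocess("發燒頭痛 chest pain")  # ['fever', 'headache', 'chest pain']
```

The English output of this step can then be fed to the same classification stages used for English chief complaints.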
Taiwan SARS Contact Network Visualization
Social network visualization with patients and geographical locations
Scroll bar on time dimension to see the evolution of a network
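The time scroll bar can be thought of as filtering a dated edge list before the network is drawn. The sketch below assumes a simple patient-to-location edge model; the records, identifiers, and function names are hypothetical and are not drawn from the Taiwan SARS data.

```python
from datetime import date

# Hypothetical contact records: (patient, location, date of contact).
CONTACTS = [
    ("P1", "Hospital A", date(2003, 4, 9)),
    ("P2", "Hospital A", date(2003, 4, 16)),
    ("P2", "Household B", date(2003, 4, 20)),
    ("P3", "Household B", date(2003, 4, 25)),
]

def snapshot(contacts, as_of):
    """Edges of the patient-location network observed on or before as_of."""
    return [(p, loc) for p, loc, d in contacts if d <= as_of]

def linked_patients(edges):
    """Pairs of patients that share at least one location."""
    by_location = {}
    for patient, loc in edges:
        by_location.setdefault(loc, set()).add(patient)
    pairs = set()
    for patients in by_location.values():
        ordered = sorted(patients)
        pairs.update((a, b) for i, a in enumerate(ordered) for b in ordered[i + 1:])
    return sorted(pairs)

early = linked_patients(snapshot(CONTACTS, date(2003, 4, 17)))
late = linked_patients(snapshot(CONTACTS, date(2003, 4, 30)))
```

Moving the cutoff date forward adds edges, so successive snapshots show how the network grows over time.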
Taiwan SARS Network Evolution – Hospital Outbreak
The index patient of Heping Hospital began to have symptoms.
BioPortal: Towards building integrated, real-time situation awareness for syndromic surveillance and biodefense
Syndromic Surveillance vs. BioSurveillance (Species Jumps)
Data sources and collection strategies: levels of data (sequence to epi to social media), data granularity, automated/manual, integration, information sharing (incentives and standards)
Data analysis and outbreak detection: biological/genetic modeling/analytics (hypothesis testing), exploration, hypothesis generation, sequence to time/space/event
Data visualization, information dissemination, and alerting: target users (DTRA), dissemination strategies, surveillance, prediction
System assessment and evaluation: target users (DTRA), acceptance, usability, situation awareness, decision making
BioPortal Information
Hsinchun Chen [email protected]
AI Lab project information http://ai.arizona.edu