Integrative Multi-Scale Biomedical Informatics Joel Saltz MD, PhD Director Center for Comprehensive Informatics
Squeezing Information from Spatial Datasets
• Leverage exascale data and computer resources to squeeze the most out of image, sensor or simulation data
• Run lots of different algorithms to derive same features
• Run lots of algorithms to derive complementary features
• Data models and data management infrastructure to manage data products, feature sets and results from classification and machine learning algorithms
Outline
• Integrative biomedical informatics analysis: feature sets obtained from Pathology and Radiology studies
• Techniques, tools and methodologies for derivation, management and analysis of feature sets
INTEGRATIVE BIOMEDICAL INFORMATICS ANALYSIS
• Reproducible anatomic/functional characterization at gross level (Radiology) and fine level (Pathology)
• Integration of anatomic/functional characterization with multiple types of “omic” information
• Create categories of jointly classified data to describe pathophysiology, predict prognosis, response to treatment
In Silico Center for Brain Tumor Research
Specific Aims:
1. Influence of necrosis/hypoxia on gene expression and genetic classification.
2. Molecular correlates of high resolution nuclear morphometry.
3. Gene expression profiles that predict glioma progression.
4. Molecular correlates of MRI enhancement patterns.
Integration of heterogeneous multiscale information
• Coordinated initiatives Pathology, Radiology, “omics”
• Exploit synergies between all initiatives to improve ability to forecast survival & response.
[Figure: Radiology Imaging, Pathologic Features and “Omic” Data linked to Patient Outcome]
Lee Cooper Carlos Moreno
Example: Pathology and Gene Expression Joint Predictors of Recurrence/Survival
FEATURE CHARACTERIZATION IN PATHOLOGY AND RADIOLOGY
• Role in In Silico Brain Tumor Research
• Algorithms
• Scaling Requirements
In Silico Center for Brain Tumor Research
Key Data Sets
REMBRANDT: Gene expression and genomics data set of all glioma subtypes
The Cancer Genome Atlas (TCGA): Rich “omics” set of GBM, digitized Pathology and Radiology
Pathology and Radiology Images from Henry Ford Hospital, Emory, Thomas Jefferson U, MD Anderson and others
TCGA Research Network
Digital Pathology
Neuroimaging
Progression to GBM
Anaplastic Astrocytoma (WHO grade III)
Glioblastoma (WHO grade IV)
TCGA Neuropathology Attributes
120 TCGA specimens; 3 Reviewers
Presence and Degree of:
• Microvascular hyperplasia: complex/glomeruloid, endothelial hyperplasia
• Necrosis: pseudopalisading pattern, zonal necrosis
• Inflammation: macrophages/histiocytes, lymphocytes, neutrophils
• Differentiation: small cell component, gemistocytes, oligodendroglial, multi-nucleated/giant cells, epithelial metaplasia, mesenchymal metaplasia
• Other features: perineuronal/perivascular satellitosis, entrapped gray or white matter, micro-mineralization
Distinguishing Characteristic in Gliomas
Use image analysis algorithms to segment and classify microanatomic features (Nuclei, Astrocytoma, Necrosis ...) in whole slide images
Represent the segmentation and classification in a well defined structured format that can be used to correlate the pathology with other data modalities
Nuclear Qualities
• Oligodendroglioma: round shaped with smooth, regular texture
• Astrocytoma: elongated with rough, irregular texture
Feature Extraction
TCGA Whole Slide Images
Jun Kong
Astrocytoma vs Oligodendroglioma
Overlap in genetics, gene expression, histology
Astrocytoma vs Oligodendroglioma
• Assess nuclear size (area and perimeter), shape (eccentricity, circularity, major axis, minor axis, Fourier shape descriptor and extent ratio), intensity (average, maximum, minimum, standard error) and texture (entropy, energy, skewness and kurtosis).
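A minimal sketch of how a few of the listed size, shape, intensity and texture features could be computed for one segmented nucleus. The input format (a 0/1 mask plus a grayscale image as nested lists), the function name and the exact feature definitions are assumptions made for illustration; the slides do not give the formulas actually used, and the Fourier shape descriptor and extent ratio are left out.

```python
import math

def nuclear_features(mask, intensity):
    """Size, shape, intensity and texture features for one segmented nucleus.

    mask: 2D list of 0/1 flags marking nucleus pixels (hypothetical format, non-empty).
    intensity: 2D list of grayscale values with the same shape as mask.
    """
    pixels = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    inside = set(pixels)
    area = len(pixels)
    # Perimeter: number of pixel edges that face background (4-neighbourhood).
    perimeter = sum(
        (r + dr, c + dc) not in inside
        for r, c in pixels
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
    )
    # Second central moments describe the best-fitting ellipse.
    mr = sum(r for r, _ in pixels) / area
    mc = sum(c for _, c in pixels) / area
    srr = sum((r - mr) ** 2 for r, _ in pixels) / area
    scc = sum((c - mc) ** 2 for _, c in pixels) / area
    src = sum((r - mr) * (c - mc) for r, c in pixels) / area
    spread = math.sqrt(((srr - scc) / 2) ** 2 + src ** 2)
    lam_major = (srr + scc) / 2 + spread
    lam_minor = max((srr + scc) / 2 - spread, 0.0)
    eccentricity = math.sqrt(1 - lam_minor / lam_major) if lam_major > 0 else 0.0

    values = [intensity[r][c] for r, c in pixels]
    mean = sum(values) / area
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / area)

    # Texture from the gray-level histogram of the nucleus.
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    probs = [n / area for n in counts.values()]

    def moment(k):
        # Standardized k-th central moment; 0.0 for a perfectly flat nucleus.
        return sum((v - mean) ** k for v in values) / area / std ** k if std else 0.0

    return {
        "area": area,
        "perimeter": perimeter,
        "major_axis": 4 * math.sqrt(lam_major),
        "minor_axis": 4 * math.sqrt(lam_minor),
        "eccentricity": eccentricity,
        # Pixel-edge perimeters overestimate length, so a disc scores below 1.
        "circularity": 4 * math.pi * area / perimeter ** 2,
        "mean_intensity": mean,
        "max_intensity": max(values),
        "min_intensity": min(values),
        "entropy": -sum(p * math.log2(p) for p in probs),
        "energy": sum(p * p for p in probs),
        "skewness": moment(3),
        "kurtosis": moment(4),
    }
```

Round nuclei give eccentricity near 0 and elongated nuclei give values closer to 1, which is the kind of contrast the oligodendroglioma/astrocytoma comparison relies on.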
Machine-based Classification of TCGA GBMs (J Kong)
• Whole slide scans from 14 TCGA GBMs (69 slides)
• 7 purely astrocytic in morphology; 7 with 2+ oligo component
• 399,233 nuclei analyzed for astro/oligo features
• Cases were categorized based on ratio of oligo/astro cells
TCGA Gene Expression Query: c-Met overexpression
Clustergram of selected features used in consensus clustering
[Figure: clustergram; vertical axis: Feature Indices (10 to 110)]
Nuclear Features Used to Classify GBMs
[Figure: cluster assignments (clusters 2, 1, 3, 4) over about 160 samples, and a silhouette plot with axes Silhouette Value (0 to 1) and Cluster (1 to 4)]
Consensus clustering of morphological signatures
Study includes 200 million nuclei taken from 480 slides corresponding to 167 distinct patients.
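The slides do not say which consensus clustering procedure was used. As a sketch under that caveat, one common form runs k-means on repeated random subsamples and records how often each pair of samples lands in the same cluster. The function names, the 80% subsampling fraction and the number of runs are illustrative assumptions.

```python
import random

def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def consensus_matrix(points, k, runs=50, frac=0.8, seed=0):
    """Fraction of runs in which each pair of points was clustered together.

    Only runs in which both points were drawn into the subsample are counted.
    """
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    for _ in range(runs):
        idx = rng.sample(range(n), max(k, int(frac * n)))
        labels = kmeans([points[i] for i in idx], k, rng)
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                together[i][j] += labels[a] == labels[b]
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
```

Each patient would be represented by a tuple of summary nuclear features. Entries near 1 or 0 indicate stable groupings, while intermediate values flag patients whose cluster membership depends on the subsample.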
Nuclear Features Used to Classify GBMs
Survival of morphological clusters
[Figure: survival curves for Cluster 1, Cluster 2, Cluster 3 and Cluster 4; x-axis: Days (0 to 3000), y-axis: Survival (0 to 1)]
Survival of patients by molecular tumor subtype
[Figure: survival curves for the Proneural, Neural, Classical and Mesenchymal subtypes; x-axis: Days (0 to 3000), y-axis: Survival (0 to 1)]
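Survival curves of this kind are usually drawn with the Kaplan-Meier estimator; the slides do not state the method, so the following is a sketch under that assumption. The function name and input layout are hypothetical.

```python
def kaplan_meier(durations, events):
    """Kaplan-Meier survival estimate.

    durations: follow-up time in days for each patient.
    events: True if the patient died at that time, False if censored.
    Returns [(time, survival_probability)] at each time with at least one death.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve
```

Running the estimator once per cluster (or per molecular subtype) gives one step curve per group, which can then be plotted against days as in the figures.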
Articulate Physical Interpretations of Results
Images
Multiscale Systems Biology
Employ multi-resolution methods to characterize necrosis, angiogenesis and correlate these with “omics”
[Figure: two states joined by “?”: (no enhancement, normal vessels, stable lesion) and (rim-enhancement, vascular changes, rapid progression)]
Correlation of Necrosis, Angiogenesis and “omics”
• GBMs display variable and regionally heterogeneous degrees of necrosis (asterisk) and angiogenesis
• These factors may impact gene expression profiles
Genes Correlated with Necrosis include Transcription Factors Identified as Regulators of the Mesenchymal Transition
Gene Symbol    SAM q-value (Corrected p-value)
C/EBPB         < 0.000001
C/EBPD         < 0.000001
FOSL2          < 0.000001
STAT3          0.0047
RUNX1          0.0082
Carro MS, et al. Nature 463: 318-25, 2010
• Frozen sections from 88 GBM samples marked to identify regions of necrosis and angiogenesis
• Extent of both necrosis and angiogenesis calculated as a percentage of total tissue area
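The percentage calculation in the last bullet can be sketched as follows, assuming the marked regions and the tissue area are available as binary masks of equal shape. The mask format and function name are assumptions for illustration.

```python
def percent_of_tissue(feature_mask, tissue_mask):
    """Percentage of tissue pixels that are also marked as the feature.

    Both arguments are 2D lists of 0/1 flags with the same shape.
    Feature pixels outside the tissue mask are ignored.
    """
    tissue = feature = 0
    for feature_row, tissue_row in zip(feature_mask, tissue_mask):
        for f, t in zip(feature_row, tissue_row):
            if t:
                tissue += 1
                feature += bool(f)
    return 100.0 * feature / tissue if tissue else 0.0
```

The same function serves for necrosis and for angiogenesis by passing the corresponding marked mask.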
Feature Sets in Radiology (Adam Flanders, TJU; Dan Rubin, Stanford; Lori Dodd, NCI)
• Require standardized validated feature sets to describe de novo disease.
• Fundamental obstacle to new imaging criteria as treatment biomarkers is lack of standard terminology:
  – To define a comprehensive set of imaging features of cancer
  – For reporting imaging results
  – To provide a more quantitative, reproducible basis for assessing baseline disease and treatment response
Defining Rich Set of Qualitative and Quantitative Image Biomarkers
• Community-driven ontology development project; collaboration with ASNR
• Imaging features (5 categories)
  – Location of lesion
  – Morphology of lesion margin (definition, thickness, enhancement, diffusion)
  – Morphology of lesion substance (enhancement, PS characteristics, focality/multicentricity, necrosis, cysts, midline invasion, cortical involvement, T1/FLAIR ratio)
  – Alterations in vicinity of lesion (edema, edema crossing midline, hemorrhage, pial invasion, ependymal invasion, satellites, deep WM invasion, calvarial remodeling)
  – Resection features (extent of nCE tissue, CE tissue, resected components)
Imaging Predictors of survival and molecular profiles in the TCGA Glioblastoma Data set
The TCGA glioma working group
Emory: David A Gutman (1), Lee Cooper (1), Scott N Hwang (1), Chad A Holder (1), Doris Gao (1), Carlos Moreno (1), Arun Krishnan (1), Jun Kong (1), Seena Dehkharghani (1), Joel Saltz (1), Dan Brat (1)
TJU/CBIT/NCI: Adam Flanders (3), Eric Huang (2), Robert J Clifford (2), Dina Hammoud (3), John Freymann (7), Justin Kirby (7), Carl Jaffe (6)
UVA/Northwestern: Max Wintermark (8), Manal Jilwan (8), Prashant Raghavan (8), Pat Mongkolwat (9)
Henry Ford: Lisa Scarpace (4), Tom Mikkelsen (4)
Affiliations: (1) Emory University Hospital, Atlanta, GA. (2) National Cancer Institute, Bethesda, MD. (3) Thomas Jefferson University Hospital, Philadelphia, PA. (4) Henry Ford University Hospital, Detroit, Michigan. (5) National Institute of Health, Bethesda, MD. (6) Boston University School of Medicine, Boston, MA. (7) SAIC-Frederick, Inc., Frederick, MD. (8) University of Virginia, Charlottesville, VA. (9) Northwestern University, Chicago, IL
Assumed Dependence Between Features
[Figure: graph of assumed dependencies among features F1, F3, F5, F6, F7, F9, F10, F11, F13, F14, F16, F18, F19, F20, F21, F22 and F24]
Legend: F1: Tumor Location; F3: Eloq. Brain; F5: Prop. Enhance; F6: Prop. nCET; F7: Prop. Necrosis; F9: Distribution; F10: T1/FLAIR; F11: En. Marg. Thick.; F13: Def. Non. Marg.; F14: Prop. Edema; F16: Hemorrhage; F18: Pial Invasion; F19: Ependymal; F20: Cort. Involve.; F21: Deep WM Inv.; F22: nCET Cross. Mid.; F24: Satellites
NOTE: each feature omitted from this graph is independent of every other feature.
Slide thanks to Eric Huang, NIH
Estimation Problem Size Reduction
Can ignore seven of the features: size of contingency table reduced from 2.64 × 10^12 cells to 1.34 × 10^10 cells.
Collapsibility reduces size of contingency table even further:
• Any binary feature Fj connected to only one feature on graph (i.e. given the feature Fj is connected to, Fj is independent of all other features) can also be ignored
• Eliminates need to deal with Hemorrhage (F16), Pial Invasion (F18), Cortical Involvement (F20), and Satellites (F24).
• Reduces size of contingency table to 1.68 × 10^9 cells.
• Additional analogous considerations can be used to reduce size of contingency table by more than two additional orders of magnitude
Slide thanks to Eric Huang, NIH
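The size of a full contingency table is the product of the number of levels of each feature it covers, so every feature that can be ignored divides the cell count by that feature's number of levels. A small sketch of the bookkeeping follows; the level counts are made up for illustration and are not the actual definitions behind the figures quoted on the slide.

```python
from math import prod

def table_cells(levels, ignored=()):
    """Number of cells in the full contingency table over the kept features.

    levels: dict mapping feature name to its number of levels.
    ignored: feature names dropped from the joint table.
    """
    return prod(n for name, n in levels.items() if name not in ignored)

# Hypothetical level counts, NOT the real feature definitions.
levels = {"F5": 5, "F7": 5, "F16": 2, "F18": 2, "F20": 2, "F24": 2}

full = table_cells(levels)
reduced = table_cells(levels, ignored={"F16", "F18", "F20", "F24"})
```

With these illustrative counts, dropping four binary features shrinks the table by a factor of 2^4.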
Correlative Imaging Results
• Minimal enhancing tumor (≤5%) strongly associated with Proneural classification (p=0.0006).
• >5% proportion of necrosis associated with the presence of microvascular hyperplasia in pathology slides (p=0.008).
• Greater maximum tumor dimension (T2 signal) associated with present/abundant microvascular hyperplasia (p=0.001).
< 5% Enhancement
Correlative Imaging Results
• TP53 mutant tumors had a smaller mean tumor size (p=0.002) on T2-weighted or FLAIR images.
• EGFR mutant tumors were significantly larger than TP53 mutant tumors (p=0.0005).
• High level EGFR amplification was associated with >5% enhancement and >5% proportion of necrosis (p < 0.01).
> 5% Necrosis
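The slides report p-values without naming the statistical test. For an association between two binary features (for example, enhancement above 5% versus a molecular class), Fisher's exact test is one common choice; the sketch below assumes that test and uses invented counts.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed one.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance so ties with the observed table are included.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= observed * (1 + 1e-9))

# Hypothetical counts: rows = feature present/absent, columns = class A/other.
p_value = fisher_exact_two_sided(8, 2, 3, 9)
```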
• Leverage exascale data and computer resources to squeeze the most out of image, sensor or simulation data
• Run lots of different algorithms to derive same features
• Run lots of algorithms to derive complementary features
• Data models and data management infrastructure to manage data products, feature sets and results from classification and machine learning algorithms
Squeezing Information from Spatial Datasets
Pipeline for Whole Slide Feature Characterization
• 1010 pixels for each whole slide image• 10 whole slide images per patient• 108 image features per whole slide image• 10,000 brain tumor patients• 1015 pixels• 1013 features• Hundreds of algorithms• Annotations and markups from dozens of
humans
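The totals follow directly from the per-slide and per-patient figures, as this short check shows (variable names are illustrative):

```python
pixels_per_slide = 10**10
features_per_slide = 10**8
slides_per_patient = 10
patients = 10_000

slides = slides_per_patient * patients
total_pixels = pixels_per_slide * slides      # pixels across the whole cohort
total_features = features_per_slide * slides  # features across the whole cohort
```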
PAIS Database
Implemented with IBM DB2 for large scale pathology image metadata (~million markups per slide)
Represented by a complex data model capturing multi-faceted information including markups, annotations, algorithm provenance, specimen, etc.
Support for complex relationships and spatial query: multi-level granularities, relationships between markups and annotations, spatial and nested relationships
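To illustrate the kind of spatial query described, the sketch below stores markups with bounding boxes in a relational table and retrieves those that lie inside a window for a given slide and algorithm. It uses Python's built-in sqlite3 as a stand-in for DB2 and its spatial support; the table layout, column names and sample rows are hypothetical and are not the PAIS schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE markup ("
    " id INTEGER PRIMARY KEY, slide TEXT, algorithm TEXT,"
    " xmin REAL, ymin REAL, xmax REAL, ymax REAL)"
)
conn.executemany(
    "INSERT INTO markup (slide, algorithm, xmin, ymin, xmax, ymax)"
    " VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("slide-1", "seg-v1", 10, 10, 20, 20),
        ("slide-1", "seg-v1", 500, 500, 520, 515),
        ("slide-1", "seg-v2", 12, 11, 21, 19),
        ("slide-2", "seg-v1", 15, 15, 25, 25),
    ],
)

def markups_in_window(conn, slide, algorithm, window):
    """Return ids of markups whose bounding box lies fully inside window.

    window is (x0, y0, x1, y1) in slide pixel coordinates.
    """
    x0, y0, x1, y1 = window
    cur = conn.execute(
        "SELECT id FROM markup WHERE slide = ? AND algorithm = ?"
        " AND xmin >= ? AND ymin >= ? AND xmax <= ? AND ymax <= ? ORDER BY id",
        (slide, algorithm, x0, y0, x1, y1),
    )
    return [row[0] for row in cur]

hits = markups_in_window(conn, "slide-1", "seg-v1", (0, 0, 100, 100))
```

Keeping the algorithm name on each markup is what lets results from different segmentation runs be compared over the same region.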
Data Models to Represent Feature Sets and Experimental Metadata
PAIS |pās| : Pathology Analytical Imaging Standards
• Provide semantically enabled data model to support pathology analytical imaging
• Data objects, comprehensive data types, and flexible relationships
• Object-oriented design, easily extensible
• Reuse existing standards
  – Reuse relevant classes already defined in AIM
  – Follow DICOM WG 26 metadata specifications on WSI reference
  – Specimen information in DICOM Supplement 122 and caTissue
  – Use caDSR for CDE and NCI Thesaurus for ontology concepts
Pathology Imaging GIS
Segmentation
Feature extraction
Image analysis
[Figure: PAIS domain model, UML class diagram. Classes: PAIS, Project, Collection, User, EquipmentGroup, Subject, Patient, Specimen, AnatomicEntity, ImageReference, DICOMImageReference, MicroscopyImageReference, WholeSlideImageReference, TMAImageReference, Region, Markup, GeometricShape, Surface, Field, Annotation, AnnotationReference, Observation, Calculation, Inference, Provenance. Associations carry multiplicities such as 1, 0..1 and 0..*.]
Modeling and management of markup and annotation for querying and sharing through parallel RDBMS + spatial DBMS
PAIS model; PAIS data management
On the fly data processing for algorithm validation/algorithm sensitivity studies, or discovery of preliminary results
HDFS data staging; MapReduce-based queries
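As a sketch of what a MapReduce-based query over staged markup data might look like, the following counts markups per image tile with an in-memory map and reduce step. The record layout, function names and tile size are assumptions; a real deployment would run the same two steps on a MapReduce framework over files staged in HDFS.

```python
from collections import defaultdict

def map_markup(record, tile_size):
    """Map step: emit ((slide, tile_x, tile_y), 1) for a markup centroid."""
    slide, x, y = record
    yield (slide, int(x // tile_size), int(y // tile_size)), 1

def reduce_counts(key, values):
    """Reduce step: total number of markups that fell into one tile."""
    return key, sum(values)

def run_mapreduce(records, mapper, reducer):
    """Tiny single-process driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())

# Hypothetical markup centroids: (slide id, x, y) in pixels.
records = [("s1", 10, 10), ("s1", 900, 20), ("s1", 1500, 10)]
counts = run_mapreduce(records, lambda r: map_markup(r, 1000), reduce_counts)
```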
Generation and Analysis of Imaging Features
• In-transit data processing using filter/stream systems
• Semantic Workflows
• Hierarchical pipeline design with coarse and fine grained components
• Adaptivity and Quality of Service
Same basic story in multiple domains
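The filter/stream idea can be sketched with Python generators: each filter consumes a stream of tiles and emits a stream for the next filter, so data is processed in transit without materializing intermediate results. This is only an analogy and not the DataCutter API; the filter names are hypothetical, and treating bright tiles as background is an assumption.

```python
def read_tiles(image, tile):
    """Source filter: stream (row, col, tile_pixels) chunks of a 2D image."""
    for r in range(0, len(image), tile):
        for c in range(0, len(image[0]), tile):
            yield r, c, [row[c:c + tile] for row in image[r:r + tile]]

def drop_background(stream, threshold):
    """Filter: pass on only tiles whose mean intensity is below threshold."""
    for r, c, pixels in stream:
        values = [v for row in pixels for v in row]
        if sum(values) / len(values) < threshold:
            yield r, c, pixels

def tile_mean(stream):
    """Sink filter: emit one (position, mean intensity) feature per tile."""
    for r, c, pixels in stream:
        values = [v for row in pixels for v in row]
        yield (r, c), sum(values) / len(values)

def run_pipeline(image, tile=2, threshold=200):
    """Chain the three filters; tiles flow through one at a time."""
    return list(tile_mean(drop_background(read_tiles(image, tile), threshold)))
```

Because each stage only pulls the next tile when it needs it, coarse components such as tiling and fine grained ones such as per-tile features compose in the same way.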
Classification using DataCutter Filter Stream Workflow
Slides’ Preparation
• 64990 x 59412 pixels in full resolution
• Original size: 10.8 GB; compressed size: ≈ 833 MB
[Figure: slide views at 8x and 40x]
Computerized Classification System for Grading Neuroblastoma
• Background Identification
• Image Decomposition (Multi-resolution levels)
• Image Segmentation (EMLDA)
• Feature Construction (2nd order statistics, Tonal Features)
• Feature Extraction (LDA) + Classification (Bayesian)
• Multi-resolution Layer Controller (Confidence Region)
[Figure: flowchart of the multi-resolution classifier.
TRAINING: Training Tiles, Down-sampling, Segmentation, Feature Construction, Feature Extraction, Classifier Training.
TESTING: Image Tile Initialization (I = L); Background? (Yes: Label); Create Image I(L); Segmentation; Feature Construction; Feature Extraction; Classification; Within Confidence Region? (Yes: Label; No: if I > 1, set I = I - 1 and repeat)]
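The TESTING branch of the flowchart can be read as a coarse-to-fine loop: classify the tile at the lowest resolution level L and move to a finer level only when the decision falls outside the confidence region. The sketch below assumes that reading. The confidence region is reduced to a single threshold, the label at level 1 is accepted even when confidence is low, and all names are hypothetical.

```python
def classify_multiresolution(pyramid, classify, is_background, min_confidence):
    """Coarse-to-fine classification of one image tile.

    pyramid: dict mapping level -> tile image, with levels 1..L contiguous;
             level L is the coarsest and level 1 is full resolution.
    classify: callable (tile, level) -> (label, confidence in [0, 1]).
    is_background: callable (tile) -> bool, checked once at the coarsest level.
    Returns (label, level at which the decision was made).
    """
    level = max(pyramid)
    if is_background(pyramid[level]):
        return "background", level
    while True:
        label, confidence = classify(pyramid[level], level)
        # Stop when confident enough, or when no finer level is left.
        if confidence >= min_confidence or level == 1:
            return label, level
        level -= 1
```

Most tiles are then decided cheaply at low resolution, and only ambiguous ones pay for processing at full resolution.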
Segmentation
A typical segmentation result of an image from undifferentiated class with components segmented by this method is shown. (a) Original image; (b) Partitioned image shown in color; (c) Nuclei; (d) Cytoplasm; (e) Neuropil; (f) Background component.
Semantic Workflows (Wings)
Collaborative Work with Yolanda Gil, Mary Hall
• A systematic strategy for composing application components into workflows
• Search for the most appropriate implementation of both components and workflows
• Component optimization
  – Select among implementation variants of the same computation
  – Derive integer values of optimization parameters
  – Only search promising code variants and a restricted parameter space
• Workflow optimization
  – Knowledge-rich representation of workflow properties
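Component optimization as outlined above amounts to a search over implementation variants and a restricted grid of integer parameters. A minimal sketch follows, assuming the cost of a configuration can be obtained from a supplied function (in practice a measured run time). The variant names, parameter names and cost model are invented for illustration and are not part of Wings.

```python
from itertools import product

def select_variant(variants, param_grid, cost):
    """Exhaustive search over variants and a restricted integer parameter grid.

    variants: iterable of variant names (only the promising ones).
    param_grid: dict mapping parameter name -> list of candidate integers.
    cost: callable (variant, params) -> number, lower is better.
    Returns (best_cost, best_variant, best_params).
    """
    names = sorted(param_grid)
    best = None
    for variant in variants:
        for combo in product(*(param_grid[name] for name in names)):
            params = dict(zip(names, combo))
            value = cost(variant, params)
            if best is None or value < best[0]:
                best = (value, variant, params)
    return best

def toy_cost(variant, params):
    """Stand-in cost model; a real system would time the component."""
    penalty = 0 if variant == "gpu" else 10
    return abs(params["tile"] - 256) + penalty

best = select_variant(["cpu", "gpu"], {"tile": [64, 128, 256, 512]}, toy_cost)
```

Restricting the variants and the parameter lists up front is what keeps the search small enough to run before each workflow execution.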
Adaptivity
Framework
• Description Module (Wings): Describe application workflow using semantics of workflow components
• Execution module (Pegasus, DataCutter, Condor): Maps to resources, generates and places fine grained filter/stream pipelines
• Tradeoff Module: Schedules execution based on application level QOS
Impact
Image Mining for Comparative Analysis of Expression Patterns in Tissue Microarray
(PIs: Foran and Saltz)
Build reference library of expression signatures, integrate state-of-the-art multi-spectral imaging capability and build a deployable clinical decision support system for analyzing imaged specimens. Technologies and computational tools developed during the course of the project to be tested on a Grid-enabled, virtual laboratory established among strategic sites located at CINJ, Emory, RU, UPenn, OSU, and ASU.
Funded by NIH through grant #5R01LM009239-02
David J. Foran, Ph.D.
Center for Comprehensive Informatics Integrative Biomedical Informatics Projects
• In Silico Study of Brain Tumors
• Minority Health Genomics and Translational Research Bio-Repository Database (MH-GRID)
• ACTSI Cardiovascular, Diabetes, Brain Tumor Registry
• Early Hospital Readmission
• CFAR (Center for AIDS Research) HIV/Cancer Project
• Radiation Therapy and Quantitative Imaging
• Integrative Analysis of Text and Discrete Data Related to Smoking Cessation and Asthma
• Semantic Query and Analysis of Integrative Datasets in Renal Transplant Clinical Studies (CTOT-C)
Atlanta Clinical and Translational Science Institute Federated Data Warehouse System
• Develop integrative, federated ACTSI information warehouse
  – Integrated clinical/imaging/“omic”/biomarker/tissue information should always be available
  – A virtually centralized, big Atlanta wide information warehouse that has all relevant data
  – Patients seen and information gathered at any ACTSI site, specimens sent to any affiliated core, imaging carried out at any affiliated site
  – E.g. Gene expression, SNP, virtual slide images, hematology studies and CMV serologies for kidney transplant candidates accrued into Study X or Study Y between Feb 2011 and Jan 2012 who were on the kidney transplant waiting list as of November 1, 2010.
• Development efforts
  – Security, Web Portal, Common Data Elements & Vocabularies, Identifiers, High-performance Computing middleware, Testing framework.
ACTSI-wide Federated Data Warehouse
Thanks to:
• In silico center team: Dan Brat (Science PI), Tahsin Kurc, Ashish Sharma, Tony Pan, David Gutman, Jun Kong, Sharath Cholleti, Carlos Moreno, Chad Holder, Erwin Van Meir, Daniel Rubin, Tom Mikkelsen, Adam Flanders, Joel Saltz (Director)
• caGrid Knowledge Center: Joel Saltz, Mike Caliguiri, Steve Langella co-Directors; Tahsin Kurc, Himanshu Rathod Emory leads
• caBIG In vivo imaging team: Eliot Siegel, Paul Mulhern, Adam Flanders, David Channon, Daniel Rubin, Fred Prior, Larry Tarbox and many others
• In vivo imaging Emory team: Tony Pan, Ashish Sharma, Joel Saltz
• Emory ATC Supplement team: Tim Fox, Ashish Sharma, Tony Pan, Edi Schreibmann, Paul Pantalone
• Digital Pathology R01: Foran and Saltz; Jun Kong, Sharath Cholleti, Fusheng Wang, Tony Pan, Tahsin Kurc, Ashish Sharma, David Gutman (Emory), Wenjin Chen, Vicky Chu, Jun Hu, Lin Yang, David J. Foran (Rutgers)
• NIH/in silico TCGA Imaging Group: Scott Hwang, Bob Clifford, Erich Huang, Dima Hammoud, Manal Jilwan, Prashant Raghavan, Max Wintermark, David Gutman, Carlos Moreno, Lee Cooper, John Freymann, Justin Kirby, Arun Krishnan, Seena Dehkharghani, Carl Jaffe
• ACTSI Biomedical Informatics Program: Marc Overcash, Tim Morris, Tahsin Kurc, Alexander Quarshie, Circe Tsui, Adam Davis, Sharon Mason, Andrew Post, Alfredo Tirado-Ramos
• NSF Scientific Workflow Collaboration: Vijay Kumar, Yolanda Gil, Mary Hall, Ewa Deelman, Tahsin Kurc, P. Sadayappan, Gaurang Mehta, Karan Vahi
Thanks!