FLOCK: A Density-Based Clustering Method for Automated Cell Population Identification in High-Dimensional Flow Cytometry Data and the Cell Ontology

Richard H. Scheuermann, Ph.D.

Department of Pathology and Division of Biomedical Informatics

U.T. Southwestern Medical Center, Dallas, TX

http://www.niaid.nih.gov/default.htm

TRADITIONAL FLOW ANALYSIS

Flow Cytometry (FCM)

• a.k.a. Fluorescence Activated Cell Sorting (FACS™)

• Method:

– Stain cell population with fluorescent reagents that bind to specific molecules, e.g. fluorescein-conjugated anti-CD40 antibodies

– Measure fluorescence properties of each cell using a flow cytometer

• Direct and indirect measurement of individual cell characteristics, e.g. cell size, membrane protein expression, secreted protein expression, cell cycle state, DNA ploidy, signal transduction activation

Uses of Flow Cytometry (FCM)

• Differences in cell populations between specimens

• Study of normal cell activation, differentiation and function

• Study of abnormal cell activation, differentiation and function

• Isolate cells from mixture based on their molecular characteristics

• Diagnostics - leukemia, lymphoma, myeloproliferative disorders

• Novel biomarkers

[Figure: CD15 FITC plots, normal vs. leukemia. Red - Myeloblasts; Green - Granulocytes; Light blue - Monocytes]

FCM can measure many parameters simultaneously, e.g., the BD LSR-II can produce data for up to 19 parameters for every cell in a given sample

FCM instrumentation & reagents

Traditional Flow Cytometry Analysis

• Subjective

• Time-consuming

• Doesn't handle overlapping distributions well

• Sensitive to slight differences in fluorescence intensity distributions between samples

• Requires at least one 2D plot that clearly segregates the populations in question

Goal - group together cells with similar characteristics

Traditional approach - manual gating 2D at a time
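To make "manual gating 2D at a time" concrete, here is a minimal sketch of a rectangular gate on two parameters. The function name, the column order, the intensity values and the thresholds are all hypothetical; in practice an analyst draws such gates by eye on each 2D plot, which is where the subjectivity comes from.

```python
import numpy as np

def rectangle_gate(events, x_col, y_col, x_range, y_range):
    """Return a boolean mask of events that fall inside a rectangular 2D gate."""
    x, y = events[:, x_col], events[:, y_col]
    return (x >= x_range[0]) & (x <= x_range[1]) & (y >= y_range[0]) & (y <= y_range[1])

# Hypothetical intensities for four events; columns are CD19 and CD3.
events = np.array([
    [900.0, 50.0],
    [850.0, 700.0],
    [40.0, 800.0],
    [30.0, 20.0],
])

# Hypothetical thresholds chosen by eye: CD19 high, CD3 low.
b_cell_mask = rectangle_gate(events, 0, 1, x_range=(500, 1023), y_range=(0, 200))
gated = events[b_cell_mask]
```

Each further population needs another hand-drawn gate on another pair of parameters, so the effort grows quickly with the number of dimensions.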

Improved Approach

• Identifying cell populations automatically, objectively, and quickly in multi-dimensional flow cytometry data (eliminate manual gating)

• Quantitatively compare the identified populations across different samples and across different experiments

Characteristics of FCM Data

Data sets are:

• Large (and varied) in size

– From hundreds to millions of events

• Multidimensional

– 19-parameter instrument already available

• Noisy, with outliers

– Dead cells and dirt

Populations differ in:

• Shapes

– Elongated, ellipsoid, spherical, banana shapes…

• Densities

– Some cell populations are relatively sparse even in 2D space

• Compositions

– Events that pile up on an axis can change the data distribution

• Positions

– Some are very close while others are far apart

• Sizes

– From several events to hundreds of thousands of events

FLOCK APPROACH

Grid-based Clustering Approach

• Divide n-dimensional space with hyper-grids

• Identify dense hyper-regions

• Merge neighboring dense hyper-regions to define k populations

• Determine centroids of each population

• Cluster data using k centroids to seed

2D example

Divide with hyper-grids

Find dense hyper-regions

Merge neighboring dense hyper-regions

Clustering based on region centers
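The four stages of the 2D example can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the FLOCK implementation: the function name, the equal-width binning, the fixed `density_threshold`, the "touching cells are neighbors" merge rule and the Euclidean distance are all choices made here for the example.

```python
import numpy as np
from collections import Counter, deque
from itertools import product

def grid_density_cluster(data, bins=10, density_threshold=5):
    """Grid-based density clustering sketch for an (events x dimensions) array."""
    data = np.asarray(data, dtype=float)
    n_dims = data.shape[1]

    # 1. Divide the space with a hyper-grid (assumption: equal-width bins).
    mins = data.min(axis=0)
    spans = data.max(axis=0) - mins
    spans[spans == 0] = 1.0  # guard against constant dimensions
    cell_idx = np.clip(np.floor((data - mins) / spans * bins).astype(int), 0, bins - 1)
    cell_of_event = [tuple(row) for row in cell_idx.tolist()]

    # 2. Identify dense hyper-regions (assumption: a fixed event-count threshold).
    counts = Counter(cell_of_event)
    dense = {cell for cell, k in counts.items() if k >= density_threshold}
    if not dense:
        raise ValueError("no dense hyper-region found; lower density_threshold")

    # 3. Merge neighboring dense hyper-regions with a breadth-first search.
    #    Enumerating 3**n_dims - 1 offsets is only practical in few dimensions.
    offsets = [o for o in product((-1, 0, 1), repeat=n_dims) if any(o)]
    group_of_cell = {}
    n_groups = 0
    for start in sorted(dense):
        if start in group_of_cell:
            continue
        group_of_cell[start] = n_groups
        queue = deque([start])
        while queue:
            cell = queue.popleft()
            for o in offsets:
                nb = tuple(c + d for c, d in zip(cell, o))
                if nb in dense and nb not in group_of_cell:
                    group_of_cell[nb] = n_groups
                    queue.append(nb)
        n_groups += 1

    # 4. Centroid of each group, computed from the events in its dense cells.
    event_group = np.array([group_of_cell.get(c, -1) for c in cell_of_event])
    centroids = np.array([data[event_group == g].mean(axis=0) for g in range(n_groups)])

    # 5. Single round of distance-based clustering seeded by the centroids.
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), centroids

# Two synthetic, well-separated 2D populations (made-up data).
rng = np.random.default_rng(0)
pop_a = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(500, 2))
pop_b = rng.normal(loc=(5.0, 5.0), scale=0.3, size=(500, 2))
labels, centroids = grid_density_cluster(np.vstack([pop_a, pop_b]))
```

Note that the number of populations is not supplied by the user; it falls out of how many groups of connected dense hyper-regions the grid contains.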

FLOCK v2.0 STEPS

1. File Conversion - Convert binary .fcs file into a data matrix

2. Data Cleansing - Remove boundary events (noise) in FSC and SSC dimensions

3. Data Shrinking - Collapse data toward distribution modes

4. Normalization - Z-score normalization for values in each dimension: (x_i - µ)/SD

5. Dimension Selection - Select most informative dimensions based on measures of dispersion and distortion

6. FLOCK LoD

i. Partition each dimension to generate a hyper-grid

ii. Identify dense hyper-regions in hyper-grid

iii. Merge neighboring dense hyper-regions to define hyper-region groups (n)

iv. Determine centroids for each hyper-region group

v. Use n centroids to seed single round of distance-based clustering

7. FLOCK HiD - Refine population definition based on histogram partitioning

8. Group Merging - Merge close hyper-region groups based on [distance metric]

9. Centroid Calculation - Compute centroid for each hyper-region group

10. Clustering - Cluster events to nearest centroid

11. Population statistics - Summarize population proportions, intensity levels, etc.

12. Visualization
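As an illustration of steps 2 and 4, the sketch below removes boundary events in the scatter dimensions and then applies the z-score formula (x_i - µ)/SD per dimension. The function names, the column layout, the detector limits of 0 and 1023, and the use of the population standard deviation are assumptions made for this example; the slides do not specify them.

```python
import numpy as np

def remove_boundary_events(data, scatter_cols, lower, upper):
    """Drop events whose FSC/SSC values sit on or beyond the given limits."""
    scatter = data[:, scatter_cols]
    keep = np.all((scatter > lower) & (scatter < upper), axis=1)
    return data[keep]

def zscore(data):
    """Z-score each dimension: (x_i - mean) / SD (population SD, an assumption)."""
    mu = data.mean(axis=0)
    sd = data.std(axis=0)
    sd[sd == 0] = 1.0  # leave constant dimensions at zero instead of dividing by 0
    return (data - mu) / sd

# Made-up events; columns are FSC, SSC and one fluorescence channel.
events = np.array([
    [0.0, 200.0, 10.0],      # FSC at the lower limit: removed
    [300.0, 250.0, 20.0],
    [500.0, 1023.0, 30.0],   # SSC at the upper limit: removed
    [400.0, 350.0, 40.0],
    [350.0, 300.0, 60.0],
])

cleaned = remove_boundary_events(events, scatter_cols=[0, 1], lower=0.0, upper=1023.0)
normalized = zscore(cleaned)
```

After normalization every dimension has mean 0 and standard deviation 1, so no single parameter dominates the distance calculations in the later clustering steps.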

Data

• Source: University of Rochester (Sanz)

• Normal human PBMC sample stained with:

– FITC‑IgD

– PE‑CD1c

– PE‑Alexa610‑CD24

– PE‑Cy5‑IgG

– PerCP‑Cy5.5‑CD3

– PE‑Cy7‑B220

– PacificBlue‑CD38

– PacificOrange‑Aqua dead cell staining

– APC‑CD27

– APC‑Cy7‑CD19

• 10 color; 12 parameter

• Gated on CD19+, CD3- (~67,000 events)

[Figure, panel A: 2D plots with axes CD27, IgD, B220, CD24, CD38 and IgG; labeled populations N1-3, UM1-2, UM3-4, PB, GSM, GNSM, DNM]

17 B Cell Populations in Blood

[Figure: 2D plots with axes B220, CD24, CD38 and IgG]

N1(B220+, CD38+)

N2(B220+, CD38-)

N3(B220low, CD38+)

Naïve B cells (CD27low, IgD+, IgG-)

Population characteristics

Population  Color         CD27  IgD  IgG  CD38  CD24  B220  Proportion  Putative cell type
N1          Gray          -     +    -    +     int   +     48.94%      naïve (CD38+) [Bm2?]
N2          Magenta       -     +    -    -     +     +     4.69%       naïve (CD38-)
N3          Purple        -     +    -    +     +     low   4.41%       naïve (CD38+ B220low)
UM1         Darkred       +     +    -    +     +     +     1.55%       unswitched memory (CD38+)
UM2         Salmon        +     +    -    -     +     +     0.94%       unswitched memory (CD38-) [Bm1?]
UM3         Darkblue      +     int  -    +     +     low   6.16%       IgDlow unswitched memory (CD38+)
UM4         Green         +     int  -    -     +     low   11.50%      IgDlow unswitched memory (CD38-)
GSM1        Grayishgreen  +     +    +    +     +     +     0.36%       switching memory (IgD+ IgG+ CD38+)
GSM2        Yellow        +     -    +    +     +     low   4.05%       switched memory (CD38+) [early Bm5?]
GSM3        Blue          +     -    +    -     +     low   4.40%       switched memory (CD38-) [late Bm5?]
GNSM1       Cyan          +     -    -    +     +     low   4.84%       IgD- IgG- memory
GNSM2       Darkgreen     +     -    -    -     +     low   3.84%       IgD- IgG- memory
GNSM3       Teal          +     -    -    +     +     +     1.30%       IgD- IgG- memory
GNSM4       Orange        +     -    -    -     -     low   0.51%       IgD- IgG- memory
DNSM1       Pink          -     -    +    -     -     +     0.85%       double negative memory (IgG+)
DNSM2       Darkgray      -     -    -    -     -     +     0.91%       double negative memory (IgG-)
PB          Red           high  -    -    high  -     low   0.75%       plasmablasts
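The proportions reported for the 17 populations can be rolled up into their major classes with a few lines of Python. The values are copied from the population table; grouping populations by the letter prefix of their name (N, UM, GSM, GNSM, DNSM, PB) is an assumption made for this illustration, not a step described in the slides.

```python
# Proportions (%) of the 17 populations, as listed in the population table.
proportions = {
    "N1": 48.94, "N2": 4.69, "N3": 4.41,
    "UM1": 1.55, "UM2": 0.94, "UM3": 6.16, "UM4": 11.50,
    "GSM1": 0.36, "GSM2": 4.05, "GSM3": 4.40,
    "GNSM1": 4.84, "GNSM2": 3.84, "GNSM3": 1.30, "GNSM4": 0.51,
    "DNSM1": 0.85, "DNSM2": 0.91,
    "PB": 0.75,
}

def major_class(name):
    """Strip the trailing index, e.g. 'GNSM3' -> 'GNSM' (hypothetical grouping rule)."""
    return name.rstrip("0123456789")

# Sum the proportions within each major class, rounding to two decimals.
class_totals = {}
for name, pct in proportions.items():
    cls = major_class(name)
    class_totals[cls] = round(class_totals.get(cls, 0.0) + pct, 2)

# Sanity check that the populations account for all gated events.
overall = round(sum(proportions.values()), 2)

for cls, total in class_totals.items():
    print(f"{cls}: {total:.2f}%")
print(f"Total: {overall:.2f}%")
```

A roll-up like this is a quick consistency check on automated output: the per-population proportions should sum to the whole gated sample.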

Summary Statistics

B cell component of the Cell Ontology

http://www.obofoundry.org/

Tube Marker Summary

Tube     Purpose                     FL1   FL2    FL3  FL4
Tube 26  Major PBMC subsets and FcE  CD14  CD23   CD3  CD19
Tube 27  T cell subsets              CD4   CCR3   CD8  CCR4
Tube 28  NK & T cells                CD4   CD25   CD3  CD161
Tube 29  Naïve TH                    CD4   CD25   CD3  CD45RA
Tube 30  Memory TH                   CD4   CD25   CD3  CD45RO
Tube 31  T cell subsets              CD4   CXCR3  CD8  CCR5
Tube 33  NK cells                    CD56  CXCR3  CD3  CCR5

Tube 26 - CD19 vs CD3

[Figure: CD19 vs. CD3 plot with T and B cell populations labeled]

Ontology Schematic

[Figure: Normal 2324. <APC-A>: CD27 vs. <FITC-A>: IgD; gate percentages 1.67, 17.2, 7.79, 65.7 and 9.29]

ID  Population  Color code     CD27      IgD     CD21     CD38     CD24     B220     CXCR3     Percentage (%), 2324
1   PB          red            CD27high  IgD-    CD21low  CD38+    CD24-    B220low  CXCR3low  3.11
2   CD27+       cyan           CD27+     IgD-    CD21+    CD38-    CD24+    B220+    CXCR3+    5.95
6   Memory      magenta        CD27+     IgD-    CD21+    CD38-    CD24+    B220low  CXCR3-    4.37
9               blue           CD27+     IgD-    CD21low  CD38-    CD24-    B220low  CXCR3-    1.14
4   CD27-       gray           CD27low   IgD-    CD21-    CD38-    CD24-    B220low  CXCR3-    0.91
8   memory      pink           CD27low   IgD-    CD21-    CD38low  CD24-    B220+    CXCR3-    2.28
13              darkblue       CD27low   IgD-    CD21-    CD38-    CD24-    B220+    CXCR3+    1.98
5               green          CD27-     IgD-    CD21+    CD38-    CD24low  B220low  CXCR3-    0.47
12              darkgreen      CD27-     IgDlow  CD21+    CD38-    CD24+    B220low  CXCR3-    1.01
3   unswitched  yellow         CD27+     IgDlow  CD21+    CD38-    CD24+    B220low  CXCR3-    9.12
14  memory      purple         CD27+     IgDlow  CD21-    CD38low  CD24+    B220+    CXCR3-    0.29
7   naive       darkGray       CD27+     IgD+    CD21+    CD38low  CD24low  B220+    CXCR3-    20.47
10              grayish green  CD27low   IgD+    CD21+    CD38+    CD24+    B220+    CXCR3-    3.79
11              darkred        CD27-     IgD+    CD21+    CD38-    CD24low  B220+    CXCR3-    45.09

Marker Expression

B cells from Immgen

Acknowledgments

UT Southwestern: Yu (Max) Qian, Jamie Lee, Megan Kong, Jennifer Cai, Jie Huang, Nishanth Marthandan, Diane Xiang, Young Bun Kim, Paula Guidry, Eva Sadat

Ignacio Sanz (Rochester), Chungwen Wei (Rochester), Tim Mosmann (Rochester), Adam Seegmiller (UTSW), Nitin Karandikar (UTSW), Christine Martens (Emory), Chris Ding (UTA), Alex Diehl (Jackson Labs), Martin Zand (Rochester)

Northrop Grumman: John Campbell, Liz Thompson, Jeff Wiser, Mike Attasi

Immune Tolerance Network: Dave Parrish, Keith Boyce, Tom Casale, Jeff Bluestone

Supported by NIH N01AI40076 and N01AI40041
