Analysis of adaptive immune receptors and repertoires and immuneML
Milena Pavlović, Department of Informatics, Department of Immunology
Immune receptors are natural diagnostics and therapeutics
• Immune cells use immune receptors on their surface to recognize antigens (e.g. parts of a virus or bacterium)
• An immune receptor consists of two sequences (chains) of amino acids (~110 amino acids long)
• CDR3: the most variable part of an immune receptor; 5-20 amino acids long
• An immune repertoire consists of all immune receptors in an individual: ~10^9 unique receptors per individual, with low overlap between individuals
• Immune receptors in a person are specific to the diseases the person has or has had before
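[Figure: an antigen with its epitope bound by a receptor. The receptor has a heavy chain and a light chain; each chain has a variable region (FR1, CDR1, FR2, CDR2, FR3, CDR3 (N1-D-N2), FR4) followed by a constant region (CH1...).]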
Learn the mechanism of receptor generation and antigen recognition
⇒ use it to diagnose a disease or create artificial receptors
Problem formulation or what to consider when building a diagnostic
Predict if a patient has a disease (multi-instance learning problem - only some receptors out of all receptors in the patient are disease-specific)
Predict if a receptor binds an antigen (describe antigen’s receptor sequence space)
Predict if a receptor sequence binds to antigens (use structural data from multiple antigens - are there shared rules?)
Predict antigens the receptor will bind to
Disease status and antigen binding prediction
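The multi-instance framing of disease status prediction can be made concrete with a small sketch: a repertoire is a bag of receptor sequences, and only a few of them carry the disease signal. The function names, the toy CDR3 sequences, and the threshold below are all hypothetical and chosen for illustration; this is not how immuneML or any published classifier is implemented.

```python
from typing import List, Set

def witness_rate(repertoire: List[str], disease_associated: Set[str]) -> float:
    """Fraction of receptors in the bag (repertoire) found in the disease-associated set."""
    if not repertoire:
        return 0.0
    hits = sum(1 for cdr3 in repertoire if cdr3 in disease_associated)
    return hits / len(repertoire)

def predict_disease(repertoire: List[str], disease_associated: Set[str],
                    threshold: float = 0.01) -> bool:
    """Label the whole repertoire positive if enough of its receptors are disease-associated."""
    return witness_rate(repertoire, disease_associated) >= threshold

# Toy data: made-up CDR3 sequences, not taken from any real dataset.
associated = {"CASSLGTDTQYF"}
healthy = ["CASSPGQGYEQYF", "CASSIRSSYEQYF", "CASRDNEQFF", "CASSLAPGATNEKLFF"]
diseased = healthy[:3] + ["CASSLGTDTQYF"]
```

In practice the set of disease-associated receptors is unknown and has to be learned from labeled repertoires, which is where the low signal-to-noise ratio makes the problem hard.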
Challenges when building a diagnostic
• Receptor sequence space is low-dimensional (~15) and yet receptors recognize a lot of different antigens - a lot of dependencies in the sequence
• Low signal-to-noise ratio for any given disease - very low percentage of receptors is specific to the disease (and this percentage is disease-dependent)
• A lot of confounding factors (genetics - antigen presentation and sequence diversity, age)
⇒ How to represent the (set of) sequence(s) and model the problem?
⇒ How to account for confounders?
• Approaching the challenges: build a platform that allows experimentation with different models and data representations on large immune datasets
[Figure: immuneML overview]
• Immune repertoire, paired and single chain receptor data: work with immune repertoires for disease status prediction and confounder analysis; paired chain and single chain receptor data for antigen binding prediction
• Simulation & import and preprocessing: simulation of antigen-specific sequences and diseased repertoires; import from Adaptive Biotechnologies, MiXCR, VDJdb, AIRR-compliant formats; preprocess immune data
• Comprehensive machine learning, data representation and feature recovery: nested cross-validation for hyperparameter optimization; deep learning; clustering and similarity metrics; multi-label classification; feature recovery; encodings and embeddings
• Scalable and modular platform on cloud infrastructure: public instance available at immuneml.org
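One simple way to picture the simulation of diseased repertoires is to implant a short motif into a small fraction of otherwise random sequences. The sketch below assumes uniform random amino acid sequences, a made-up motif "GLY", and a 1% implant rate; the function names are hypothetical and this is not immuneML's simulation API.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def random_sequence(rng, min_len=5, max_len=20):
    """Uniform random amino acid sequence with a CDR3-like length."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))

def implant_motif(sequence, motif, rng):
    """Overwrite a random window of the sequence with the motif (length is preserved)."""
    start = rng.randint(0, len(sequence) - len(motif))
    return sequence[:start] + motif + sequence[start + len(motif):]

def simulate_repertoire(size, motif, implant_rate, rng):
    """Random repertoire in which a fraction implant_rate of sequences carries the motif."""
    sequences = [random_sequence(rng) for _ in range(size)]
    n_implanted = round(size * implant_rate)
    for idx in rng.sample(range(size), n_implanted):
        sequences[idx] = implant_motif(sequences[idx], motif, rng)
    return sequences

rng = random.Random(0)  # fixed seed so the toy example is reproducible
diseased = simulate_repertoire(size=1000, motif="GLY", implant_rate=0.01, rng=rng)
```

Because the motif is known, a simulated dataset like this gives a ground truth against which feature recovery can be checked. The motif must not be longer than the shortest sequence.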
immuneML facilitates machine learning applications in the immune receptor domain
• immuneML provides:
• Multiple predefined workflows (hyperparameter optimization, fitting and applying a trained ML method, data splitting and simulation, exploratory analysis of data)
• ML methods for classification (disease status and antigen binding prediction)
• Different data representations
• Integration with popular tools for immune data analysis through the Galaxy framework (iReceptor, MiXCR, immuneSim, immunoProbs)
[Figure: Hyperparameter optimization workflow]
• Scenario 1: nested cross-validation with defined report points. Inner cross-validation loop: model selection; outer cross-validation loop: model assessment.
• Scenario 2: repeated holdout with random splits (Monte Carlo cross-validation). Inner holdout loop: model selection; outer repeated holdout loop: model assessment.
• In both scenarios, a grid of candidate models (M1.1 ... M3.5 in scenario 1, M1.1 ... M3.4 in scenario 2; axes: hyperparameters of the model and model selection folds / splits) is evaluated on validation (val) data, optimal models are trained, and their performance on test data (performance 1 ... performance 5) gives the task performance.
• Report points: data_reports, data_split_reports, model_reports, optimal_model_reports, performance_reports.
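The nested cross-validation scenario can be sketched in plain Python: the inner loop picks a hyperparameter, and the outer loop estimates how well the selected model generalizes. The toy 1-D data, the k-nearest-neighbour classifier, and all function names below are assumptions made for illustration; immuneML's own workflow works on encoded immune receptor data and is configured through its specification rather than written like this.

```python
import random

def k_fold_splits(n, k, rng):
    """Shuffle indices 0..n-1 and return k (train, held_out) index pairs."""
    indices = list(range(n))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train, folds[i]))
    return splits

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (1-D feature)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    positives = sum(label for _, label in nearest)
    return 1 if 2 * positives > len(nearest) else 0

def accuracy(train, held_out, k):
    correct = sum(knn_predict(train, x, k) == y for x, y in held_out)
    return correct / len(held_out)

def nested_cv(data, candidate_ks, outer_folds, inner_folds, rng):
    performances = []
    for train_idx, test_idx in k_fold_splits(len(data), outer_folds, rng):
        outer_train = [data[i] for i in train_idx]
        test = [data[i] for i in test_idx]
        # Inner loop (model selection): same splits for every candidate.
        inner = k_fold_splits(len(outer_train), inner_folds, rng)

        def selection_score(k):
            scores = [accuracy([outer_train[i] for i in tr],
                               [outer_train[i] for i in va], k)
                      for tr, va in inner]
            return sum(scores) / len(scores)

        best_k = max(candidate_ks, key=selection_score)
        # Outer loop (model assessment): refit on all outer training data.
        performances.append(accuracy(outer_train, test, best_k))
    return performances

rng = random.Random(0)
xs = [rng.random() for _ in range(60)]
data = [(x, 1 if x > 0.5 else 0) for x in xs]  # toy labels, not immune data
performances = nested_cv(data, [1, 3, 5], outer_folds=5, inner_folds=3, rng=rng)
```

The test data of each outer fold is never seen during model selection, which is what keeps the reported performance from being optimistically biased.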
immuneML facilitates machine learning applications in the immune receptor domain
• immuneML is available as:
• A web tool through integration with Galaxy
• A command line tool with a domain-specific language for analysis definition
• A Python library
definitions:
  datasets:
    d1:
      metadata: "./metadata.csv"
      format: MiXCR
  encodings:
    e1:
      type: KmerFrequency
      params: { k: 3 }
    e2:
      type: Word2Vec
      params: { vector_size: 16, context: sequence }
  ml_methods:
    log_reg1:
      type: LogisticRegression
      params: { C: 0.001 }
  reports:
    r1: { type: SequenceLengthDistribution }
  preprocessing_sequences:
    seq1:
      - filter_chain_B:
          type: DatasetChainFilter
          params: { keep_chain: A }
instructions:
  HPOptimization:
    settings:
      - preprocessing: seq1
        encoding: e1
        ml_method: log_reg1
      - preprocessing: []
        encoding: e2
        ml_method: log_reg1
    assessment:
      split_strategy: random
      split_count: 1
      training_percentage: 70
    selection:
      split_strategy: k-fold
      split_count: 5
      reports:
        data_splits: [r1]
    labels: [CD]
    dataset: d1
    strategy: GridSearch
    metrics: [accuracy, f1_micro]
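To give an idea of what a k-mer frequency representation with k: 3 looks like, the following sketch counts overlapping 3-mers in amino acid sequences and normalizes them to relative frequencies. It is a minimal illustration under assumed names (kmer_frequencies, encode_repertoire) and toy sequences; the actual KmerFrequency encoding in immuneML has more options and may normalize differently.

```python
from collections import Counter

def kmers(sequence, k=3):
    """All overlapping k-mers of a sequence (empty if the sequence is shorter than k)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def kmer_frequencies(sequence, k=3):
    """Relative frequency of each k-mer within a single receptor sequence."""
    counts = Counter(kmers(sequence, k))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()} if total else {}

def encode_repertoire(sequences, k=3):
    """Pool k-mer counts over all sequences of a repertoire, then normalize."""
    counts = Counter()
    for sequence in sequences:
        counts.update(kmers(sequence, k))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()} if total else {}

# Toy sequences, made up for illustration.
single = kmer_frequencies("CASSL")               # k-mers: CAS, ASS, SSL
pooled = encode_repertoire(["CASSL", "CASSF"])   # CAS and ASS occur twice
```

Each sequence or repertoire becomes a sparse vector indexed by k-mer, which a method such as logistic regression can then take as input.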
image credit: https://usegalaxy.org/
immuneML applications
• To show immuneML’s capabilities, we demonstrate:
• Replication of one of the largest studies for disease state prediction
• Antigen binding prediction from paired chain data
• Benchmarking results with simulated datasets and overview of models and representations
Emerson et al 2017
Use-case: predicting T1D status with immuneML
• Type 1 Diabetes (T1D) status prediction from 1600 samples (456 T1D patients, 762 first degree relatives, 76 second degree relatives and 224 controls)
• Age and genetic background are the most significant confounding factors
• Goal:
• Build a classifier to predict the disease state
• Recover distinctive features on which the classification is based
Brusko lab, University of Florida
Acknowledgements
Lonneke Scheffer, Andrei Slabodkin