
Analysis of adaptive immune receptors and repertoires and immuneML


Milena Pavlović, Department of Informatics, Department of Immunology

Immune receptors are natural diagnostics and therapeutics

• Immune cells use immune receptors on their surface to recognize antigens (e.g. parts of a virus or bacteria)

• An immune receptor consists of two sequences (chains) of amino acids (~110 amino acids long)

• CDR3: the most variable part of an immune receptor; 5-20 amino acids long

• An immune repertoire consists of all immune receptors in an individual: ~10^9 unique receptors per individual, with low overlap between individuals

• The immune receptors in a person are specific to the diseases the person has or has had
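One way to make "overlap between individuals" concrete is the Jaccard index over the sets of unique CDR3 sequences in two repertoires. The sketch below is only an illustration: the function name and the toy sequences are made up, and the slides do not say which overlap measure is meant.

```python
def repertoire_overlap(rep_a, rep_b):
    """Jaccard overlap between the unique CDR3 sequences of two repertoires."""
    unique_a, unique_b = set(rep_a), set(rep_b)
    union = unique_a | unique_b
    if not union:
        return 0.0
    return len(unique_a & unique_b) / len(union)


# Toy repertoires with made-up CDR3 amino acid sequences (illustration only).
person_1 = ["CASSLGTDTQYF", "CASSPGQGNYGYTF", "CASRDRGNTEAFF", "CASSLGTDTQYF"]
person_2 = ["CASSLGTDTQYF", "CASSQETQYF", "CAWSVGGNEQFF"]

overlap = repertoire_overlap(person_1, person_2)
print(f"Jaccard overlap: {overlap:.2f}")
```

Real repertoires contain far more sequences than this, so the shared fraction would typically be much smaller than in the toy example.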

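[Figure: an antigen's epitope bound by a receptor, and a receptor schematic with heavy and light chains. The variable region consists of FR1, CDR1, FR2, CDR2, FR3, CDR3 (N1-D-N2) and FR4; the constant region starts with CH1.]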

[Figure: an antigen's epitope bound by a receptor]

Learn the mechanism of receptor generation and antigen recognition ⇒ use it to diagnose a disease or create artificial receptors

Problem formulation or what to consider when building a diagnostic

Predict if a patient has a disease (multi-instance learning problem - only some receptors out of all receptors in the patient are disease-specific)

Predict if a receptor binds an antigen (describe antigen’s receptor sequence space)

Predict if a receptor sequence binds to antigens (use structural data from multiple antigens - are there shared rules?)

Predict antigens the receptor will bind to
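The multi-instance formulation of disease prediction can be sketched as follows: a repertoire is a bag of receptors, only a few of which carry the signal, and the label is assigned to the whole bag. This is a minimal sketch, assuming a set of disease-associated sequences is already known. The function names, the threshold and the sequences are hypothetical, not taken from the slides.

```python
def disease_associated_fraction(repertoire, associated_sequences):
    """Fraction of unique receptors in the repertoire that are in the associated set."""
    unique = set(repertoire)
    if not unique:
        return 0.0
    return len(unique & associated_sequences) / len(unique)


def predict_disease(repertoire, associated_sequences, threshold):
    """Label the whole repertoire (the bag) from the few receptors that match."""
    return disease_associated_fraction(repertoire, associated_sequences) >= threshold


# Hypothetical disease-associated sequences and toy repertoires.
associated = {"CASSLGTDTQYF", "CASRDRGNTEAFF"}
patient = ["CASSLGTDTQYF", "CASSQETQYF", "CAWSVGGNEQFF", "CASSPGQGNYGYTF"]
healthy = ["CASSQETQYF", "CAWSVGGNEQFF", "CASSPGQGNYGYTF", "CASSYSGANVLTF"]

patient_label = predict_disease(patient, associated, threshold=0.1)
healthy_label = predict_disease(healthy, associated, threshold=0.1)
print(patient_label, healthy_label)
```

In practice the associated set is itself unknown and has to be learned from labeled repertoires, which is what makes the problem hard.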

Disease status and antigen binding prediction

Challenges when building a diagnostic

• Receptor sequence space is low-dimensional (~15) and yet receptors recognize a lot of different antigens - a lot of dependencies in the sequence

• Low signal-to-noise ratio for any given disease - only a very low percentage of receptors is specific to the disease (and this percentage is disease-dependent)

• A lot of confounding factors (genetics - antigen presentation and sequence diversity, age)
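The low signal-to-noise ratio can be illustrated with a small simulation in which a short motif is implanted into a small fraction of the receptors of a "diseased" repertoire. This is a sketch under strong assumptions: sequences are drawn uniformly at random (real receptors are not), and the motif "WQY" and the 1% implant rate are arbitrary illustrative values.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_cdr3(rng, min_len=5, max_len=20):
    """Uniformly random amino acid sequence; a crude stand-in for a CDR3."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def implant_motif(sequence, motif, rng):
    """Overwrite a random window of the sequence with the motif."""
    start = rng.randint(0, len(sequence) - len(motif))
    return sequence[:start] + motif + sequence[start + len(motif):]


def simulate_repertoire(rng, size, motif=None, implant_rate=0.0):
    """Generate a repertoire; with a motif, implant it in ~implant_rate of sequences."""
    sequences = []
    for _ in range(size):
        seq = random_cdr3(rng)
        if motif is not None and rng.random() < implant_rate:
            seq = implant_motif(seq, motif, rng)
        sequences.append(seq)
    return sequences


rng = random.Random(0)
MOTIF = "WQY"  # hypothetical disease-associated motif
diseased = simulate_repertoire(rng, 10_000, motif=MOTIF, implant_rate=0.01)
healthy = simulate_repertoire(rng, 10_000)

diseased_fraction = sum(MOTIF in s for s in diseased) / len(diseased)
healthy_fraction = sum(MOTIF in s for s in healthy) / len(healthy)
print(f"motif-carrying receptors: diseased {diseased_fraction:.4f}, healthy {healthy_fraction:.4f}")
```

Even in the healthy repertoire a few sequences contain the motif by chance, so a classifier has to separate a weak implanted signal from this background.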





⇒ How to represent the (set of) sequence(s) & model the problem?

⇒ How to account for confounders?

• Approaching the challenges: build a platform that allows experimentation with different models and data representations on large immune datasets

immuneML

• Immune repertoire, paired and single chain receptor data
  • work with immune repertoires for disease status prediction and confounder analysis
  • paired chain and single chain receptor data for antigen binding prediction

• Simulation & import and preprocessing
  • simulation of antigen-specific sequences and diseased repertoires
  • import from Adaptive Biotechnologies, MiXCR, VDJdb, AIRR-compliant formats
  • preprocess immune data

• Comprehensive machine learning, data representation and feature recovery
  • nested cross-validation for hyperparameter optimization
  • deep learning
  • clustering and similarity metrics
  • multi-label classification
  • feature recovery
  • encodings and embeddings

• Scalable and modular platform on cloud infrastructure
  • public instance available at immuneml.org

immuneML facilitates machine learning applications in the immune receptor domain

• immuneML provides:

• Multiple predefined workflows (hyperparameter optimization, fitting and applying trained ML methods, data splitting and simulation, exploratory analysis of data)

• ML methods for classification (disease status and antigen binding prediction)

• Different data representations

• Integration with popular tools for immune data analysis through the Galaxy framework (iReceptor, MiXCR, immuneSim, immunoProbs)
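As an example of a data representation, a repertoire can be turned into a fixed set of features by counting overlapping k-mers in its sequences. The sketch below is a generic k-mer frequency encoding and is not immuneML's own encoder: the function name and the normalization to relative frequencies are assumptions.

```python
from collections import Counter


def kmer_frequencies(sequences, k=3):
    """Relative frequency of each overlapping k-mer across all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


# Toy repertoire with two made-up sequences.
repertoire = ["CASSLG", "CASSQE"]
features = kmer_frequencies(repertoire, k=3)
for kmer, freq in sorted(features.items()):
    print(kmer, freq)
```

The resulting dictionary can be aligned across repertoires into a feature matrix, which is the input a classifier such as logistic regression expects.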

Hyperparameter optimization workflow

[Figure: two hyperparameter optimization scenarios. Models M1-M3 differ in their hyperparameters and are trained on every split. Reports can be generated at defined points: data_reports, data_split_reports, model_reports, performance_reports, optimal_model_reports.]

Scenario 1: nested cross-validation with defined report points

• inner cross-validation loop: model selection (on the model selection cross-validation folds, using validation data)

• outer cross-validation loop: model assessment (test data; performance 1-5 combined into the task performance)

Scenario 2: repeated holdout with random splits (Monte Carlo cross-validation)

• inner holdout loop: model selection (on the model selection training / validation splits)

• outer repeated holdout loop: model assessment (train optimal models, then measure performance on test data; performances combined into the task performance)
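The nested cross-validation of Scenario 1 can be sketched in plain Python: the inner loop picks a hyperparameter setting using only the outer training data, and the outer loop measures how the selected model performs on data it has never seen. This is a generic sketch, not immuneML's implementation. The toy 1-D data, the k-nearest-neighbours model and all function names are assumptions made for illustration.

```python
import random
from statistics import mean


def k_fold_splits(n, k, rng):
    """Return k (train, held_out) pairs of positions 0..n-1 after shuffling."""
    positions = list(range(n))
    rng.shuffle(positions)
    folds = [positions[i::k] for i in range(k)]
    return [
        ([p for j, fold in enumerate(folds) if j != i for p in fold], folds[i])
        for i in range(k)
    ]


def nested_cv(X, y, hp_settings, fit, score, outer_k=5, inner_k=5, seed=0):
    rng = random.Random(seed)
    performances, selected = [], []
    for train, test in k_fold_splits(len(X), outer_k, rng):
        # Inner loop (model selection): uses the outer training data only.
        inner = k_fold_splits(len(train), inner_k, rng)

        def validation_score(hp):
            scores = []
            for fit_pos, val_pos in inner:
                fit_ids = [train[p] for p in fit_pos]
                val_ids = [train[p] for p in val_pos]
                model = fit([X[i] for i in fit_ids], [y[i] for i in fit_ids], hp)
                scores.append(score(model, [X[i] for i in val_ids], [y[i] for i in val_ids]))
            return mean(scores)

        best_hp = max(hp_settings, key=validation_score)
        # Outer loop (model assessment): retrain with the selected setting,
        # then score once on the untouched test fold.
        model = fit([X[i] for i in train], [y[i] for i in train], best_hp)
        performances.append(score(model, [X[i] for i in test], [y[i] for i in test]))
        selected.append(best_hp)
    return performances, selected


# Toy model: k-nearest neighbours on a single feature; k is the hyperparameter.
def fit_knn(X, y, k):
    return list(zip(X, y)), k


def predict_knn(model, x):
    points, k = model
    nearest = sorted(points, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def accuracy(model, X, y):
    return mean([1.0 if predict_knn(model, x) == label else 0.0 for x, label in zip(X, y)])


data_rng = random.Random(1)
X = [data_rng.random() for _ in range(60)]
y = [1 if x + data_rng.gauss(0, 0.1) > 0.5 else 0 for x in X]

performances, selected = nested_cv(X, y, [1, 5, 15], fit_knn, accuracy)
print("performance per outer fold:", performances)
print("selected hyperparameters:", selected)
print("task performance:", mean(performances))
```

Keeping the test fold out of model selection is the point of the nesting: the task performance is then an estimate for the whole selection procedure, not for one hand-picked model.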


immuneML facilitates machine learning applications in the immune receptor domain

• immuneML is available as:

• A web tool through integration with Galaxy

• A command line tool with a domain-specific language for analysis definition

• A Python library

definitions:
  datasets:
    d1:
      metadata: "./metadata.csv"
      format: MiXCR
  encodings:
    e1:
      type: KmerFrequency
      params: { k: 3 }
    e2:
      type: Word2Vec
      params: {vector_size: 16, context: sequence}
  ml_methods:
    log_reg1:
      type: LogisticRegression
      params: { C: 0.001 }
  reports:
    r1: { type: SequenceLengthDistribution }
  preprocessing_sequences:
    seq1:
      - filter_chain_B:
          type: DatasetChainFilter
          params: {keep_chain: A}
instructions:
  HPOptimization:
    settings:
      - preprocessing: seq1
        encoding: e1
        ml_method: log_reg1
      - preprocessing: []
        encoding: e2
        ml_method: log_reg1
    assessment:
      split_strategy: random
      split_count: 1
      training_percentage: 70
    selection:
      split_strategy: k-fold
      split_count: 5
      reports:
        data_splits: [r1]
    labels: [CD]
    dataset: d1
    strategy: GridSearch
    metrics: [accuracy, f1_micro]
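To see what the splitting part of the specification above asks for, the sketch below reproduces the two levels with plain Python: one random 70/30 assessment split, then 5-fold selection inside the training part, with each of the two settings fitted once per fold. It only illustrates the arithmetic. The dataset size of 100 is hypothetical, and how immuneML performs the splits internally is not shown in the slides.

```python
import random


def random_split(indices, training_percentage, rng):
    """Shuffle and cut the indices into a training and a test part."""
    shuffled = indices[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * training_percentage / 100)
    return shuffled[:cut], shuffled[cut:]


def k_fold(indices, k):
    """Return k (train, validation) pairs covering the indices."""
    folds = [indices[i::k] for i in range(k)]
    return [
        ([idx for j, fold in enumerate(folds) if j != i for idx in fold], folds[i])
        for i in range(k)
    ]


rng = random.Random(42)
examples = list(range(100))  # hypothetical dataset of 100 repertoires

# The two entries under "settings": (preprocessing, encoding, ml_method).
settings = [("seq1", "e1", "log_reg1"), (None, "e2", "log_reg1")]

# assessment: split_strategy random, split_count 1, training_percentage 70
train, test = random_split(examples, 70, rng)

# selection: split_strategy k-fold, split_count 5
selection_folds = k_fold(train, 5)

# Every setting is fitted once per selection fold.
fits = [(setting, fold_id) for setting in settings for fold_id in range(len(selection_folds))]

print(len(train), len(test))          # sizes of the assessment split
print(len(selection_folds[0][0]), len(selection_folds[0][1]))  # one selection fold
print(len(fits))                      # model fits during selection
```

With these numbers the assessment split holds out 30 examples, each selection fold validates on 14 of the remaining 70, and the selection step fits 10 models in total.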

image credit: https://usegalaxy.org/


immuneML applications

• To show immuneML’s capabilities, we demonstrate:

• Replication of one of the largest studies for disease state prediction

• Antigen binding prediction from paired chain data

• Benchmarking results with simulated datasets and overview of models and representations


Emerson et al 2017

Use-case: predicting T1D status with immuneML

• Type 1 Diabetes (T1D) status prediction from 1600 samples (456 T1D patients, 762 first degree relatives, 76 second degree relatives and 224 controls)

• Age and genetic background are the most significant confounding factors

• Goal:

• Build a classifier to predict the disease state

• Recover distinctive features on which the classification is based
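The second goal, recovering the features a classification rests on, can be illustrated by ranking sequences by how much more often they occur in diseased than in healthy repertoires. This is a deliberately simple enrichment sketch on made-up data. It is not the statistical test used in the replicated study and not immuneML's feature recovery, and the function names and the 0.5 cutoff are assumptions.

```python
from collections import Counter


def presence_rates(repertoires):
    """Fraction of repertoires in which each sequence occurs at least once."""
    if not repertoires:
        return {}
    counts = Counter()
    for repertoire in repertoires:
        counts.update(set(repertoire))
    return {seq: n / len(repertoires) for seq, n in counts.items()}


def enriched_sequences(diseased, healthy, min_difference=0.5):
    """Sequences whose presence rate is higher in diseased repertoires by the cutoff."""
    diseased_rates = presence_rates(diseased)
    healthy_rates = presence_rates(healthy)
    scored = [(rate - healthy_rates.get(seq, 0.0), seq) for seq, rate in diseased_rates.items()]
    return [seq for difference, seq in sorted(scored, reverse=True) if difference >= min_difference]


# Toy data: made-up sequences, three repertoires per group.
diseased = [
    ["CASSLGTDTQYF", "CASSQETQYF"],
    ["CASSLGTDTQYF", "CAWSVGGNEQFF"],
    ["CASSLGTDTQYF", "CASSQETQYF", "CASSYSGANVLTF"],
]
healthy = [
    ["CASSQETQYF", "CASSPGQGNYGYTF"],
    ["CASSQETQYF", "CAWSVGGNEQFF"],
    ["CASSYSGANVLTF"],
]

recovered = enriched_sequences(diseased, healthy)
print(recovered)
```

A ranking like this ignores confounders such as age and genetic background, so in a real analysis the enriched sequences would still need to be checked against those factors.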

Brusko lab, University of Florida

Acknowledgements

Lonneke Scheffer, Andrei Slabodkin
