Page 1
Inferring Ethnicity from Mitochondrial DNA Sequence
Chih Lee1, Ion Mandoiu1 and Craig E. Nelson2
[email protected] @engr.uconn.edu
[email protected]
1Department of Computer Science and Engineering
2Department of Molecular and Cell Biology
University of Connecticut
Page 2
Outline
Introduction
Methods
Results and Discussions
Conclusions
Page 4
Ethnicity in Forensics
Ethnicity information assists forensic investigators.
Investigator-assigned ethnicity: based on genetic and non-genetic markers.
Genetic information enhances inference accuracy when access to the most informative markers (e.g. skin/hair) is limited.
Autosomal markers:
  Excellent accuracy assigning samples to clades [Phi07, Shr97]
  May not survive degradation
Page 5
Mitochondrial DNA
Circular, 16,569 bp
Maternally inherited
High copy number
  Recoverable from degraded samples
Coding region SNPs define haplogroups [Beh07]
Hypervariable Region
Page 6
Hypervariable Region
High mutation rate compared to the coding region
Haplogroup inference [Beh07]
  23 groups
  96.7% accuracy rate with 1NN
Geographic origin inference [Ege04]
  SE Africa, Germany and Iceland
  66.8% accuracy rate with PCA-QDA
[Figure: map of the hypervariable region; axis ticks at 16024, 16569, 1 and 576; segments labeled HVR 1 and HVR 2]
Page 7
Ethnicity Inference from HVR
The problem:
  Given a set of HVR sequences tagged with ethnicities
  Predict the ethnicities of new HVR sequences
  A classification problem
Our contribution:
  Assess the performance of 4 classification algorithms: SVM, LDA, QDA and 1NN.
Page 9
Encoding HVR
Align to rCRS (revised Cambridge Reference Sequence) to obtain a SNP profile
Each SNP becomes a binary variable
Examples of SNP notation:
  16067T: C to T substitution
  315.1C: insertion
  523D: deletion
Missing data (untyped regions), three options:
  Assume rCRS
  Use mutation probability
  Common region
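The encoding step can be sketched as follows. This is a minimal illustration, assuming each sample's SNP profile is a list of variant strings in the notation shown on this slide; the helper name `encode_profiles` and the toy profiles are hypothetical and are not taken from the FBI database.

```python
def encode_profiles(profiles):
    """Turn SNP profiles (differences from rCRS) into binary vectors.

    Each distinct variant observed in any profile becomes one binary
    feature: 1 if the sample carries the variant, 0 if it matches rCRS.
    """
    # Fixed, sorted feature order so every vector uses the same columns.
    variants = sorted({v for profile in profiles for v in profile})
    index = {v: i for i, v in enumerate(variants)}
    vectors = []
    for profile in profiles:
        row = [0] * len(variants)
        for v in profile:
            row[index[v]] = 1
        vectors.append(row)
    return variants, vectors


# Toy profiles (hypothetical samples, variant notation as on the slide).
profiles = [
    ["16067T", "315.1C"],
    ["315.1C", "523D"],
    [],  # identical to rCRS
]
variants, vectors = encode_profiles(profiles)
```

A sample with an empty profile maps to the all-zero vector, which is how rCRS itself would be encoded under this scheme.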
Page 10
Support Vector Machines
Binary classification algorithm
Map instances to a high-dimensional space (the feature space)
Optimal separating hyperplane with maximum margin
Kernel function k(x1, x2): similarity between x1 and x2 in the feature space
Radial basis kernel: exp(-γ||x1 - x2||^2)
Software: LIBSVM [Cha01]
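The radial basis kernel can be evaluated directly from its definition. The sketch below is illustrative only: the function name `rbf_kernel`, the toy vectors and the value of γ are assumptions, and it does not reproduce how LIBSVM computes the kernel internally.

```python
import math


def rbf_kernel(x1, x2, gamma):
    """Radial basis kernel: exp(-gamma * ||x1 - x2||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-gamma * sq_dist)


# Two binary SNP vectors differing at two positions (hypothetical).
a = [1, 0, 1, 0]
b = [1, 1, 0, 0]
k_same = rbf_kernel(a, a, gamma=0.5)  # identical inputs give 1.0
k_diff = rbf_kernel(a, b, gamma=0.5)  # exp(-0.5 * 2) = exp(-1)
```

For binary vectors the squared Euclidean distance equals the number of differing positions, so on SNP profiles this kernel is a decreasing function of the Hamming distance.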
Page 11
Linear/Quadratic Discriminant Analysis
Find argmax_g P(G=g | X=x)
Assumptions:
  X | G=g ~ N_p(μ_g, Σ_g)
  P(G=g) is equal for all g
P(G=g | X=x) is proportional to P(X=x | G=g)
μ_g and Σ_g are estimated from the training data
LDA: common dispersion matrix, Σ_g = Σ for all g
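With equal priors and Gaussian class densities, maximizing the posterior amounts to comparing Gaussian log-densities. The discriminant functions below are the standard textbook form implied by these assumptions; they are not reproduced from the slides.

```latex
% QDA: class-specific dispersion matrix \Sigma_g, equal priors
\delta_g(x) = -\tfrac{1}{2}\log\lvert\Sigma_g\rvert
              - \tfrac{1}{2}(x-\mu_g)^{\top}\Sigma_g^{-1}(x-\mu_g)

% LDA: common dispersion matrix \Sigma, so the terms quadratic in x cancel
\delta_g(x) = x^{\top}\Sigma^{-1}\mu_g
              - \tfrac{1}{2}\mu_g^{\top}\Sigma^{-1}\mu_g

% Decision rule in both cases
\hat{g}(x) = \arg\max_g \delta_g(x)
```

The LDA boundary between two groups is therefore linear in x, while the QDA boundary is quadratic.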
Page 12
1-Nearest Neighbor
Assign a new sample to the dominating ethnicity among the nearest samples in the training data
Distance measure: the Hamming distance
Used by Behar et al. (2007) for haplogroup assignment
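A minimal sketch of 1NN with the Hamming distance, assuming samples are already encoded as equal-length binary vectors. The function names and the toy data are hypothetical, and ties at the minimum distance are resolved by majority label, which is one reading of "dominating ethnicity among the nearest samples".

```python
from collections import Counter


def hamming(u, v):
    """Number of positions at which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def predict_1nn(train_x, train_y, x):
    """Label x with the most common label among its nearest training samples.

    Several training samples can tie at the minimum distance; in that
    case the majority label among them is returned.
    """
    dists = [hamming(t, x) for t in train_x]
    best = min(dists)
    nearest = [y for d, y in zip(dists, train_y) if d == best]
    return Counter(nearest).most_common(1)[0][0]


# Toy training set (hypothetical binary SNP vectors and labels).
train_x = [[1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 1]]
train_y = ["A", "A", "B", "B"]
label = predict_1nn(train_x, train_y, [0, 1, 1, 0])
```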
Page 13
Principal Component Analysis
A dimension reduction technique
Used in conjunction with SVM, LDA and QDA
Denoted as: PCA-SVM, PCA-LDA and PCA-QDA
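To illustrate the dimension reduction step, the sketch below finds only the leading principal direction by power iteration on the sample covariance matrix and projects samples onto it. It is a simplified stand-in, not the procedure used in the talk: a practical pipeline would retain several components, and the function names and toy data are hypothetical.

```python
import math


def top_component(rows, iterations=100):
    """Leading principal direction of the rows via power iteration.

    Returns (mean, direction); direction is a unit vector.
    Assumes the data are not constant (non-zero covariance).
    """
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - mean[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (p x p).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0] * p  # arbitrary start; must not be orthogonal to the answer
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v


def project(row, mean, direction):
    """Coordinate of one sample along the principal direction."""
    return sum((x - m) * d for x, m, d in zip(row, mean, direction))


# Toy data lying on the line y = 2x (hypothetical).
rows = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
mean, direction = top_component(rows)
scores = [project(r, mean, direction) for r in rows]
```

The projected scores, rather than the raw binary variables, would then be passed to the classifier.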
Page 15
The FBI mtDNA Population Database
Two tables:
  forensic: typed by FBI
  published: collected from literature
Retain only Caucasian, African, Asian and Hispanic samples
Number of samples by ethnicity:

Dataset     All     Caucasian       African         Asian         Hispanic
forensic    4,426   1,674 (37.8%)   1,305 (29.5%)   761 (17.2%)   686 (15.5%)
published   3,976   2,807 (70.6%)   254 (6.4%)      915 (23%)     -
Page 16
Data Coverage and Subsets
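Variable sequence lengths
  trimmed forensic dataset (4,426 samples): 16024-16365
  trimmed published dataset (1,904 samples): 16024-16365
  full-length forensic dataset (2,540 samples): 16024-16569, 1-576
[Figure: coverage of the forensic and published tables over the hypervariable region; axis ticks at 16024, 16569, 1 and 576; segments labeled HVR 1 and HVR 2]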
Page 17
5-fold Cross-Validation (trimmed forensic)
Macro-Accuracy: average of the ethnicity-wise accuracy rates
Micro-Accuracy: ethnicity-wise accuracy rates weighted by number of samples
More accurate than Egeland et al. (2004)
Matches human experts relying on skull and large bones [Dib83, isc83]
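The two accuracy measures can be computed from true and predicted labels as sketched below. The function name and the toy labels are hypothetical and the numbers are not results from the talk.

```python
from collections import defaultdict


def macro_micro_accuracy(y_true, y_pred):
    """Return (macro, micro) accuracy.

    Macro: unweighted mean of per-class accuracy rates.
    Micro: fraction of all samples classified correctly, i.e. the
    per-class rates weighted by class size.
    """
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += (t == p)
    per_class = {c: correct[c] / total[c] for c in total}
    macro = sum(per_class.values()) / len(per_class)
    micro = sum(correct.values()) / len(y_true)
    return macro, micro


# Toy labels (hypothetical): class "A" has 8 samples, class "B" has 2.
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 8 + ["A", "B"]
macro, micro = macro_micro_accuracy(y_true, y_pred)
```

In this toy case the small class is classified poorly, so the macro-accuracy (0.75) falls below the micro-accuracy (0.9); with unbalanced ethnic groups the two measures can diverge in the same way.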
Page 18
Seq. Region Effect on Accuracy
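Different primers result in different coverage.
PCA-LDA outperforms 1NN on long sequences.
PCA-SVM is consistently the best.
[Figure: accuracy (axis ticks at 80%, 90% and 100%) by sequence region on the full-length forensic dataset; region axis ticks at 16024, 16569, 1 and 576; segments labeled HVR 1 and HVR 2]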
Page 19
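Seq. Region Effect on Accuracy
HVR 2 contains less information.
PCA-SVM is consistently the best.
[Figure: accuracy (axis ticks at 80%, 90% and 100%) by sequence region on the full-length forensic dataset; region axis ticks at 16024, 16569, 1 and 576; segments labeled HVR 1 and HVR 2]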
Page 20
Twenty 10% Windows
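Accuracy varies with region.
PCA-SVM remains the best.
1NN is as good as PCA-SVM for short regions.
[Figure: accuracy over twenty windows, each covering 10% of the region; region axis ticks at 16024, 16569, 1 and 576; segments labeled HVR 1 and HVR 2]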
Page 21
Independent Validation (1/2)
Training data: trimmed forensic dataset
Test data: trimmed published dataset
Classifier: PCA-SVM
No Hispanic samples in the test data, but samples can be misclassified as Hispanic
Asian: accuracy ~17% lower than in CV
Page 22
Independent Validation (2/2)
Composition of the Asian samples in the training data:
  China (356 profiles), Japan (163), Korea (182), Pakistan (8), and Thailand (52)
  Strong bias towards East Asia
145 misclassified Asian samples in the test data:
  10 samples of unknown country of origin
  90 samples from Kazakhstan and Kyrgyzstan
  Both countries have significant Russian populations.
  Evidence of admixture with Caucasians.
Country      # Samples   Asian        Caucasian    African    Hispanic
Kazakhstan   107         56 (52.3%)   47 (44.0%)   3 (2.8%)   1 (0.9%)
Kyrgyzstan   95          56 (58.9%)   34 (35.8%)   1 (1.1%)   4 (4.2%)
Page 23
Handling Missing Data
Mimic real-world scenario
Training: forensic dataset
Test: published dataset
rCRS and Probability are biased toward Caucasian.
Common Region is the best overall.
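The three options for untyped positions can be sketched as below. This is one plausible reading of the slide rather than the authors' implementation: each sample is assumed to have a single contiguous typed range, features are keyed by position only, and the function names, toy samples and mutation probabilities are all hypothetical.

```python
def encode(sample, positions, strategy, mutation_prob=None):
    """Encode one sample over the given feature positions.

    sample: dict with 'covered' (inclusive (start, end) range that was
            typed) and 'variants' (set of positions differing from rCRS).
    strategy for positions outside the covered range:
      'rcrs'        -> assume the rCRS base (value 0)
      'probability' -> use mutation_prob[position], e.g. a frequency
                       estimated from training data
    """
    start, end = sample["covered"]
    row = []
    for pos in positions:
        if start <= pos <= end:
            row.append(1.0 if pos in sample["variants"] else 0.0)
        elif strategy == "rcrs":
            row.append(0.0)
        elif strategy == "probability":
            row.append(mutation_prob[pos])
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return row


def common_region(samples, positions):
    """Keep only positions typed in every sample ('Common Region')."""
    lo = max(s["covered"][0] for s in samples)
    hi = min(s["covered"][1] for s in samples)
    return [p for p in positions if lo <= p <= hi]


# Hypothetical samples: the second one was typed over a shorter range.
samples = [
    {"covered": (16024, 16569), "variants": {16067, 16519}},
    {"covered": (16024, 16365), "variants": {16067}},
]
positions = [16067, 16311, 16519]
shared = common_region(samples, positions)
as_rcrs = encode(samples[1], positions, "rcrs")
as_prob = encode(samples[1], positions, "probability",
                 mutation_prob={16067: 0.1, 16311: 0.2, 16519: 0.6})
```

Filling untyped positions with the reference base or with a population-level frequency pulls short sequences toward whichever group resembles those defaults, whereas restricting to the common region discards information but introduces no such defaults.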
Page 24
Posterior Probability Calibration
PCA-SVM on published dataset with “Common Region”
Accuracy rates are slightly higher than the estimated posterior probabilities.
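A calibration check of this kind can be sketched by binning the predicted posterior probabilities and comparing each bin's mean probability with its observed accuracy. The function name, the equal-width binning and the toy predictions are assumptions, not the procedure or results from the talk.

```python
def calibration_table(confidences, correct, n_bins=5):
    """Compare mean predicted probability with observed accuracy per bin.

    confidences: predicted posterior probability of the assigned class.
    correct:     True/False flags, whether each assignment was right.
    Returns a list of (mean_confidence, accuracy, count) for each
    non-empty bin of equal width over [0, 1].
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    table = []
    for items in bins:
        if items:
            mean_conf = sum(c for c, _ in items) / len(items)
            accuracy = sum(ok for _, ok in items) / len(items)
            table.append((mean_conf, accuracy, len(items)))
    return table


# Hypothetical predictions, not results from the talk.
confidences = [0.55, 0.58, 0.9, 0.95, 0.92, 0.97]
correct = [True, False, True, True, True, False]
table = calibration_table(confidences, correct)
```

A classifier is well calibrated when accuracy tracks mean confidence in every bin; accuracy above confidence, as reported on this slide, means the estimated posteriors are slightly conservative.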
Page 25
Conclusions
SVM is the most accurate algorithm among those investigated, outperforming:
  Discriminant analysis, employed by Egeland et al. (2004)
  1NN, similar to that used by Behar et al. (2007)
Overall accuracy of 80%-90% in CV and independent testing
  Matches the accuracy of human experts relying on measurements of skull and large bones [Dib83,isc83]
  Approaches the accuracy achieved using ~60 autosomal loci [Bam04]
Page 26
Questions?
Thank you for your attention.