Forensic Science International: Genetics 16 (2015) 216–225

Shannon's equivocation for forensic Y-STR marker selection

Sabine Siegert a, Lutz Roewer b, Michael Nothnagel a,*
a Department of Statistical Genetics and Bioinformatics, Cologne Center for Genomics, University of Cologne, Weyertal 115b, 50931 Cologne, Germany
b Dept. Forensic Genetics, Institute of Legal Medicine and Forensic Sciences, Charité – Universitätsmedizin Berlin, 13353 Berlin, Germany

A R T I C L E I N F O

Article history:
Received 7 November 2014
Received in revised form 8 January 2015
Accepted 7 February 2015

Keywords:
Short tandem repeats
Shannon's equivocation
Forensic marker analysis
Allelic association
Linkage disequilibrium

A B S T R A C T

Short tandem repeat (STR) markers are widely and continuously used in forensic applications. However, past research has demonstrated substantial allelic association between STR markers on both autosomes and the X chromosome, leading to partially redundant information that these markers can provide. Here, we quantify the allelic association between Y-chromosomal STR markers that are part of established forensic panels, separately for three different continental groups. We further propose a sequential marker selection procedure that is based on Shannon’s equivocation and that accounts for allelic association between STR markers, leading to a maximal gain in independent information. In application to three real-world data sets, we demonstrate the procedure’s superior performance when compared to single-locus diversity selection strategies, resulting in the optimal marker set for a given data set in the majority of marker subsets. Noting the inferior performance of the established Y-STR marker panels in a retrospective investigation, we suggest that future forensic marker selection should be guided, besides by other technical selection criteria, by an equivocation-based approach to obtain maximally discriminatory marker sets at minimal cost.

© 2015 Elsevier Ireland Ltd. All rights reserved.


1. Introduction

Short tandem repeat (STR) markers are widely and continuously used in forensic, genealogical and population genetic analyses. STR haplotypes within the non-recombining region of the Y chromosome (NRY) offer the particular advantages of specifically targeting the male contribution to DNA stains, e.g., in rape cases, and of providing genetic information that is undiluted by recombination. Thus, Y-STR haplotypes are commonly used in the quantitative assessment of matches in forensic and kinship casework. The number of markers being used for such purposes has been repeatedly increased in the past, in order to provide more power for distinguishing between male subjects or for establishing unilateral inheritance.

Several Y-STR marker panels are or have been routinely used in forensic casework. The ‘minimal haplotype’ (MHT) comprises seven single-copy Y-STRs (DYS19, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393) and one multi-copy Y-STR (DYS385ab). It has formed the basis for many forensic applications in the past [1] and is, thus, usually included in subsequently introduced broader panels. Although DYS385ab is sometimes counted as two markers, we will refer to it as one marker and locus in the following. Currently available commercial Y-STR kits include the 11-locus PowerPlex® Y12 system (PPY12; Promega Corporation, Madison, WI; 2003), which added DYS437, DYS438, DYS439 to the MHT; the 16-locus Yfiler® kit (Yfiler; Life Technologies, Foster City, CA; 2004), which further added DYS448, DYS456, DYS458, DYS635, GATAH4 to the PPY12 set; the PowerPlex® Y23 system (PPY23; Promega Corporation, Madison, WI; 2012), which in turn added single-copy markers DYS481, DYS533, DYS549, DYS570, DYS576 and DYS643 to the Yfiler set; and the 25-locus Yfiler® Plus kit (YfilerPlus; Life Technologies, Foster City, CA; 2014; see http://www.lifetechnologies.com/de/de/home/industrial/human-identification/y-str-products.html), which added the markers DYS449, DYS460, DYS518, DYS627 and multi-copy marker DYF387S1ab to the PPY23 set and skips DYS549 and DYS643. Markers DYS449, DYS518, DYS570, DYS576, DYS627 and DYF387S1ab have been categorized as ‘rapidly mutating’ [2,3] in the past.

* Corresponding author. Tel.: +49 221 478 96847.
E-mail addresses: [email protected] (S. Siegert), [email protected] (L. Roewer), [email protected] (M. Nothnagel).

http://dx.doi.org/10.1016/j.fsigen.2015.02.001
1872-4973/© 2015 Elsevier Ireland Ltd. All rights reserved.

Discriminatory power of markers and haplotypes in forensic applications is assessed by a number of measures, including single-marker gene diversity (GD), haplotype diversity (HD), match probability (MP) and discrimination capacity (DC). A substantial amount of research for the selection of additional markers with the ability to increase haplotype discrimination has been made [4], and indeed, previous studies have already confirmed the gain in information content and discrimination power provided by PPY23 compared to other smaller-sized forensic marker panels [5].

While inclusion of more markers can lead to higher discriminatory power, it can also increase genotyping costs. Some form of marker selection is therefore necessary in order to obtain best discrimination at minimal cost. To the best of our knowledge, the selection of those markers that are included in the various commercially available Y-STR panels was primarily made in two consecutive steps. First, a pre-selection of markers was based on single-marker properties like allele frequencies and GD. In a second step, an evaluation procedure compared the marker combinations with respect to their haplotype discrimination capacity (i.e., HD, MP and DC) [6–10]. However, markers may show some allelic association (also denoted as gametic association or linkage disequilibrium). Previous studies have demonstrated substantial allelic association between STR markers on both autosomes [11–14] and the X chromosome [15–17], extending for some markers even over 10 cM in some populations. In such cases, markers will provide partially redundant information for discrimination and the above described selection method may lead to suboptimal discriminatory panels for a given number of markers, because it does not account for this redundancy. Recently, non-independence of Y-STR markers’ alleles has also been shown [18].

Here, we propose a novel marker selection approach that is based on Shannon’s equivocation, or conditional entropy [19]. It measures the gain of independent information provided by the inclusion of additional markers and, thus, accounts for inter-marker association. More specifically, we describe an iterative marker selection method based on the maximization of additional information from the newly added marker on top of the information already available from the initial haplotype formed by the previously selected markers. We compare our approach to different potential marker selection strategies, quantify allelic association patterns for commonly used forensic Y-STR markers in different continental groups, show the sub-optimality of currently used Y-chromosomal forensic kits with respect to their discriminatory power for a fixed number of markers and demonstrate in application to real data that our approach yields optimal or near-optimal results among all considered strategies. We suggest the use of our equivocation approach in the selection procedure for future forensic marker panels.

2. Material and methods

2.1. Shannon’s entropy and allelic association

Shannon’s entropy [19] is a measure that is commonly used in information theory for assessing the average unpredictability in a random variable. It is mathematically defined as the expected value of the information content of a discrete random variable,

$$H = -\sum_{s} p_s \cdot \ln(p_s), \tag{1}$$

where $p_s$ denotes the relative frequency of the s-th state of the variable. This concept can easily be applied to describe the information content at a locus or in a particular genomic region by substituting states with the respective alleles at that locus or the respective haplotypes in that region. Population genetic processes such as mutation and recombination will lead to an increased number of different haplotypes and, correspondingly, usually to an increase of entropy of the haplotype system, whereas drift will cause a depletion in the number and usually a decrease of entropy. The entropy concept also allows describing the allelic association structure between two multi-allelic loci or several bi-allelic loci by use of the normalized entropy difference (NED) [20].
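As an illustration of Eq. (1), the entropy of a locus can be estimated from observed allele counts. The following is a minimal Python sketch, not the authors' R code; the function name `shannon_entropy` and the allele calls are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(states):
    """Shannon entropy (natural logarithm) of a sequence of observed states.

    States may be alleles at one locus or haplotypes (e.g. tuples of alleles).
    Relative frequencies are estimated by simple counting.
    """
    counts = Counter(states)
    n = len(states)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


# Hypothetical allele calls at a single locus for eight samples.
alleles = [14, 14, 15, 15, 15, 16, 16, 17]
print(shannon_entropy(alleles))
```

A locus at which all samples carry the same allele has entropy zero, and a locus with k equally frequent alleles has entropy ln(k).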

For two multi-allelic loci, the NED is defined as

$$\mathrm{NED}_{ij} = 2 \cdot \frac{(H_i + H_j) - H_{ij}}{H_i + H_j}, \tag{2}$$

where $H_i$ and $H_j$ denote the entropy values for loci i and j, respectively, and $H_{ij}$ the entropy of the joint two-locus system [20]. Note that the sum $H_i + H_j$ equals the expected entropy under stochastic independence of the two loci and, thus, equals the maximum possible entropy of the two-locus system. Accordingly, NED equals zero for two loci with independent alleles and unity for loci with perfectly correlated alleles. The NED can be interpreted as a multi-allelic extension of the commonly used measure for allelic association $r^2$ for two bi-allelic markers, and its ability to appropriately describe simultaneous multi-locus allelic association and to detect regions of elevated association has been demonstrated in the past [21]. Significant allelic association between two markers can be tested for by comparing the observed haplotype frequencies against those frequencies that are expected under no allelic association. More specifically, let the two markers feature $n_1$ and $n_2$ alleles, respectively. We follow standard statistical theory [22] and use a likelihood-ratio test (LRT) with the number of degrees of freedom equalling the difference between $(n_1 \times n_2 - 1)$ and $(n_1 - 1 + n_2 - 1)$.
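The two limiting cases of Eq. (2) and the degrees of freedom of the test can be checked with a small Python sketch. It is illustrative only: the helper names are hypothetical, and returning zero for two monomorphic loci is an assumption made here to avoid a division by zero, not something stated in the text.

```python
import math
from collections import Counter


def entropy(states):
    """Shannon entropy (natural logarithm) from counted relative frequencies."""
    counts = Counter(states)
    n = len(states)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def ned(locus_i, locus_j):
    """Normalized entropy difference between two loci, following Eq. (2).

    locus_i and locus_j hold the alleles of the same samples in the same order.
    """
    h_i, h_j = entropy(locus_i), entropy(locus_j)
    h_ij = entropy(list(zip(locus_i, locus_j)))  # joint two-locus system
    if h_i + h_j == 0:
        return 0.0  # assumption: both loci monomorphic, no association defined
    return 2 * ((h_i + h_j) - h_ij) / (h_i + h_j)


def lrt_degrees_of_freedom(n1, n2):
    """Difference between (n1 * n2 - 1) and (n1 - 1 + n2 - 1)."""
    return (n1 * n2 - 1) - (n1 - 1 + n2 - 1)


# Hypothetical data: identical loci are perfectly correlated, crossed loci independent.
print(ned([1, 2, 1, 2], [1, 2, 1, 2]))
print(ned([1, 1, 2, 2], [1, 2, 1, 2]))
print(lrt_degrees_of_freedom(3, 4))
```

The degrees of freedom simplify to (n1 − 1)(n2 − 1), the familiar value for a test of independence in an n1 × n2 table.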

2.2. Marker selection by Shannon’s equivocation

We propose the use of Shannon’s equivocation, or conditional entropy [19], for the iterative selection of a maximally informative marker set for a fixed number of markers. In this way, we are able to select that marker which offers the largest gain in independent information atop of that of an initial set of m markers. Mathematically, equivocation of a newly added marker $X_{m+1}$ is defined as the average entropy of $X_{m+1}$ for each possible value of the initial m-marker haplotype $X_{[1,\ldots,m]}$, weighted by the probability of occurrence of that haplotype:

$$H\left(X_{m+1} \mid X_{[1,\ldots,m]}\right) = \sum_{h \in \{[1,\ldots,m]\}} p_h \cdot H\left(X_{m+1} \mid X_{[1,\ldots,m]} = h\right), \tag{3}$$

where $p_h$ denotes the relative frequency of the h-th haplotype from the set of all possible haplotypes $\{[1,\ldots,m]\}$ at the initial m loci. The chain rule of equivocation implies that the entropy of a new marker conditioned on an arbitrary but fixed m-marker haplotype can be simply calculated as the difference between the entropies of the new (m + 1)-marker haplotype and the initial m-marker haplotype, i.e., by

$$H\left(X_{m+1} \mid X_{[1,\ldots,m]}\right) = H\left(X_{[1,\ldots,m+1]}\right) - H\left(X_{[1,\ldots,m]}\right), \tag{4}$$

respectively [19]. In order to obtain a highly informative set of forensic markers, we suggest the following selection procedure:

1. From a given marker panel, choose the Y-STR marker showing the largest single-locus entropy.

2. From the remaining set of markers, choose that marker that yields the largest equivocation value atop of the already chosen set of markers.

3. Repeat step 2 until a specified number of markers, a forensic parameter threshold or some other optimality criterion is attained.
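The three steps above can be sketched in Python as a greedy loop that uses the entropy difference for the equivocation. This is a minimal sketch under stated assumptions, not the authors' R implementation: samples are assumed to be dictionaries mapping marker names to alleles, the stopping rule is a fixed number of markers, and the marker names A, B, C and all data are hypothetical.

```python
import math
from collections import Counter


def entropy(states):
    """Shannon entropy (natural logarithm) from counted relative frequencies."""
    counts = Counter(states)
    n = len(states)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def haplotypes(samples, markers):
    """Haplotype of each sample, restricted to the given markers."""
    return [tuple(sample[m] for m in markers) for sample in samples]


def select_by_equivocation(samples, markers, n_select):
    """Greedy selection; returns (marker, equivocation) pairs in selection order."""
    remaining = list(markers)
    chosen, result = [], []
    h_current = 0.0  # entropy of the empty haplotype
    while remaining and len(result) < n_select:
        # Equivocation of each candidate = H(chosen + candidate) - H(chosen).
        gains = {
            m: entropy(haplotypes(samples, chosen + [m])) - h_current
            for m in remaining
        }
        best = max(remaining, key=lambda m: gains[m])
        result.append((best, gains[best]))
        chosen.append(best)
        remaining.remove(best)
        h_current += gains[best]
    return result


# Hypothetical data: B is fully determined by A, whereas C is independent of A.
samples = [
    {"A": 1, "B": 1, "C": 1},
    {"A": 1, "B": 1, "C": 2},
    {"A": 2, "B": 2, "C": 1},
    {"A": 2, "B": 2, "C": 2},
    {"A": 3, "B": 3, "C": 1},
    {"A": 4, "B": 3, "C": 2},
]
for marker, gain in select_by_equivocation(samples, ["A", "B", "C"], 3):
    print(marker, round(gain, 3))
```

With an empty initial set, the first iteration reduces to step 1, the choice of the marker with the largest single-locus entropy. In the toy data, B has a larger single-locus entropy than C but is selected last, because it adds no information once A is known.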

2.3. Forensic parameters

Several diversity measures are commonly used in forensic application and may also serve as optimality criteria for a marker selection process. Single-locus gene diversity (GD) is computed as $GD = \frac{n}{n-1}\left(1 - \sum_{a} p_a^2\right)$, following Nei [23,24], where n denotes the total number of samples and $p_a$ the relative frequency of the a-th allele at the locus, respectively. The haplotype diversity (HD) is calculated analogously to GD. The match probability (MP) is computed as the sum of squared haplotype frequencies, whereas the discrimination capacity (DC) is calculated as the ratio between the number of different haplotypes and the total number of haplotypes.
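The forensic parameters defined above translate directly into code. The Python sketch below is illustrative only; the function names and the four toy haplotypes are hypothetical, and frequencies are estimated by counting.

```python
from collections import Counter


def gene_diversity(states):
    """Nei's diversity n / (n - 1) * (1 - sum of squared relative frequencies).

    Applied to alleles at one locus it yields GD; applied to haplotypes, HD.
    Requires at least two samples.
    """
    n = len(states)
    squared = sum((c / n) ** 2 for c in Counter(states).values())
    return n / (n - 1) * (1 - squared)


def match_probability(haps):
    """Sum of squared haplotype frequencies (MP)."""
    n = len(haps)
    return sum((c / n) ** 2 for c in Counter(haps).values())


def discrimination_capacity(haps):
    """Number of different haplotypes divided by the total number of haplotypes (DC)."""
    return len(set(haps)) / len(haps)


# Hypothetical two-marker haplotypes for four samples.
haps = [(14, 11), (14, 11), (15, 12), (16, 13)]
print(gene_diversity(haps), match_probability(haps), discrimination_capacity(haps))
```

For these four haplotypes, one of which occurs twice, MP = 0.5² + 2 · 0.25² = 0.375, DC = 3/4 and HD = (4/3) · (1 − 0.375).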

2.4. Alternative marker selection procedures

For comparison, we also considered alternative selection procedures for a discriminatory marker set by use of different single-locus optimality criteria. In particular, we based the selection in turn on the number of different alleles at a locus, the single-locus gene diversity and the single-locus entropy. To this end, we ranked the loci according to their respective single-locus optimality criterion values and then selected a specified number of top-ranking markers. In order to obtain the optimal possible value for the forensic parameter of interest with a given data set, we additionally obtained this optimum by performing an exhaustive search over all possible subsets of the complete PPY23 panel (see Table 1 for a list of constituting markers) for a given number of markers.
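Both the ranking-based selection and the exhaustive search can be sketched in a few lines of Python. The sketch uses `itertools.combinations` in place of the R package used by the authors, takes discrimination capacity as the parameter of interest, and relies on hypothetical marker names and data.

```python
import math
from itertools import combinations


def discrimination_capacity(samples, markers):
    """Share of distinct haplotypes among all samples for the given markers."""
    haps = [tuple(s[m] for m in markers) for s in samples]
    return len(set(haps)) / len(haps)


def allele_count(states):
    """Single-locus criterion: number of different alleles."""
    return len(set(states))


def top_ranked(samples, markers, k, score):
    """The k top-ranking markers according to a single-locus criterion."""
    ranked = sorted(
        markers, key=lambda m: score([s[m] for s in samples]), reverse=True
    )
    return ranked[:k]


def exhaustive_best(samples, markers, k):
    """Subset of k markers with the highest discrimination capacity."""
    return max(
        combinations(markers, k),
        key=lambda subset: discrimination_capacity(samples, subset),
    )


# Hypothetical data: B is redundant given A, C is independent of A.
samples = [
    {"A": 1, "B": 1, "C": 1}, {"A": 1, "B": 1, "C": 2},
    {"A": 2, "B": 2, "C": 1}, {"A": 2, "B": 2, "C": 2},
    {"A": 3, "B": 3, "C": 1}, {"A": 4, "B": 3, "C": 2},
]
ranked = top_ranked(samples, ["A", "B", "C"], 2, allele_count)
best = exhaustive_best(samples, ["A", "B", "C"], 2)
print(ranked, discrimination_capacity(samples, ranked))
print(best, discrimination_capacity(samples, best))
print(math.comb(22, 10))  # number of ten-marker subsets of 22 markers
```

In the toy data the ranking by allele number picks the redundant pair A and B, whereas the exhaustive search finds a pair that separates all six samples. The cost of the exhaustive search grows quickly with panel size: 22 markers already give 646,646 ten-marker subsets.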

2.5. Data set used for benchmarking

We based our analysis on previously published information from a global survey of Y-STR variation that has been described in detail elsewhere [5] and whose data are uploaded to the YHRD database (www.yhrd.org). In brief, a total of 19,630 male DNA samples from 129 populations in 51 countries worldwide were compiled in 84 participating laboratories. DNA samples were genotyped for 22 Y-STR markers, using the PPY23 kit and following the manufacturer’s instructions [25] with the occasional adaptation to prevailing laboratory practice. We restricted our analyses to three subsets of this survey, namely a data set comprising only individuals of both European ancestry and European residency (68 sampling populations, n = 11,501 individuals) as well as an African (6 populations, n = 445 individuals) and an Asian data set (22 populations, n = 3458 individuals) with corresponding requirements (Supplementary Table S11).

Single-locus forensic parameters were calculated for all 22 markers of the PPY23 kit, separately for each of the three continental data sets. To this end, DYS389II alleles were encoded by the difference, henceforth labelled DYS389II.I, between the total repeat number at DYS389II and the repeat number at DYS389I. As stated before, multi-copy DYS385ab haplotypes were treated as single alleles, thereby ignoring the internal order of its two component alleles. Allele and haplotype frequencies were estimated using the counting method. Haplotype-based forensic parameters were calculated for the constructed haplotypes for the full PPY23 marker set (22 loci) and subsets thereof, including those corresponding to the MHT (eight loci), PPY12 (11 loci) and Yfiler (16 loci) panels and those that resulted from the selection procedures.

All calculations were performed in the R statistical environment v.2.15.3 [26]. In order to perform the exhaustive search over all potential subsets of the complete PPY23 panel, we used R package combinat v.0.0-8 [27] to generate all possible m-marker haplotypes. A general trend in the scatter plots of single-locus features was inferred by use of natural cubic splines with equally spaced knots (6.75, 6.85, ..., 9.15). To this end, for each grid point $x_i$ the mean value of the single-locus diversity measure based on all haplotypes with an entropy value within $[x_i - 0.05, x_i + 0.05)$ was calculated.

Table 1
Single-locus diversity measures and average pairwise allelic association values for the PPY23 panel.

             Europeans               Africans                Asians
Marker        N    GD     H   NED     N    GD     H   NED     N    GD     H   NED
DYS19        16  0.64  1.29  0.06     8  0.74  1.51  0.15    12  0.69  1.39  0.03
DYS389I       9  0.56  0.99  0.06     6  0.59  1.08  0.12     8  0.66  1.17  0.04
DYS389II.I   11  0.59  1.14  0.03     8  0.68  1.32  0.11     9  0.66  1.27  0.02
DYS390       11  0.72  1.43  0.07     8  0.54  1.17  0.19    10  0.73  1.48  0.04
DYS391        8  0.54  0.88  0.05     6  0.42  0.77  0.10     8  0.41  0.78  0.02
DYS392       13  0.62  1.16  0.13     6  0.17  0.41  0.05    14  0.74  1.49  0.06
DYS393        8  0.45  0.87  0.04     5  0.65  1.19  0.10     9  0.66  1.22  0.04
DYS385ab     97  0.87  2.88  0.11    45  0.93  3.06  0.21   111  0.97  3.94  0.09
DYS438       10  0.69  1.33  0.16     6  0.43  0.80  0.11     9  0.57  1.17  0.07
DYS439       13  0.70  1.34  0.05     5  0.60  1.05  0.07     9  0.68  1.31  0.02
DYS437       11  0.64  1.10  0.11     6  0.23  0.54  0.09     6  0.49  0.83  0.03
DYS448       20  0.63  1.20  0.09     9  0.63  1.25  0.15    15  0.76  1.65  0.06
DYS456       14  0.74  1.50  0.06     7  0.50  1.02  0.07    10  0.61  1.27  0.03
DYS458       24  0.78  1.72  0.05    10  0.73  1.49  0.10    23  0.83  2.00  0.04
DYS635       15  0.67  1.40  0.09    10  0.65  1.47  0.13    16  0.77  1.69  0.04
GATAH4        9  0.60  1.09  0.04     6  0.52  0.86  0.08     9  0.63  1.20  0.03
DYS576       16  0.77  1.65  0.03    10  0.80  1.71  0.10    17  0.80  1.78  0.02
DYS481       21  0.81  2.00  0.09    12  0.83  1.94  0.12    24  0.84  2.05  0.04
DYS549       10  0.65  1.24  0.04     6  0.60  1.10  0.11     9  0.64  1.24  0.02
DYS533       11  0.59  1.16  0.07     6  0.49  0.86  0.10     9  0.63  1.18  0.03
DYS570       23  0.80  1.79  0.05    11  0.79  1.75  0.10    19  0.83  1.94  0.03
DYS643       14  0.65  1.33  0.10     9  0.76  1.66  0.19    12  0.75  1.59  0.06

DYS389II alleles were encoded by the difference, labelled DYS389II.I, between the total repeat number at DYS389II and the repeat number at DYS389I.
N: Number of distinct alleles.
GD: Gene diversity.
H: Entropy.
NED: The average normalized entropy difference (NED) was calculated as the mean of NED values of all possible 21 combined two-locus systems for each marker, respectively.

Fig. 1. Inter-marker association structure. Normalized entropy difference (NED) values for all 231 marker pairs, separately for each continental data set. Markers are sorted by their physical location. (A) Europeans, (B) Africans, (C) Asians.

3. Results

3.1. Inter-marker association structure

We initially investigated the association structure among all 22 PPY23 markers by calculating the normalized entropy difference (NED) for all 231 pairs of loci, separately for each continental data set. If the marker alleles are not completely independent but show some allelic association, this information redundancy may lead to suboptimal discriminatory marker sets in selection strategies that consider each marker separately.

The markers of the PPY23 panel showed substantial differences between the three continental groups with respect to allele number ($p = 3.5 \times 10^{-8}$, Friedman rank sum test), gene diversity ($p = 0.004$) and single-locus entropy ($p = 8.3 \times 10^{-4}$), but also in their average allelic association with the other markers ($p = 2.3 \times 10^{-8}$) (Table 1). For easier comparison, we categorized NED values into three groups, namely little association (0.0 ≤ NED < 0.1), moderate association (0.1 ≤ NED < 0.2) and substantial association (NED ≥ 0.2). In the European data set, the majority of marker pairs showed at most only little allelic association, indicated by a minimum NED value of 0.005 and a median of 0.059 for all pairs, albeit significantly different from zero for all pairs (Supplementary Table S1). Marker pairs containing either DYS389II.I or DYS576 revealed the smallest NED values (<0.1; mean = 0.03, respectively), indicating little association between the alleles of these markers and those of all others. On the other hand, 55 pairs (23.8%) had NED values equaling or exceeding 0.1 and 7 pairs (3.0%) even showed values greater than 0.2, indicating substantial association between the respective marker alleles. Notably, marker DYS438 was part of six out of those seven pairs and, thus, showed substantial association with roughly a quarter of all markers, while DYS392 was part of two pairs. The pair DYS438-DYS392 also showed the largest observed NED value (0.39) over all pairs.

Inter-marker association structures differed substantially between the three continental data sets (Fig. 1). In the Asians, none of the 231 pairs showed an NED value exceeding 0.2, while 26 pairs exceeded this threshold in the African data set; also some marker pairs showed non-significant allelic association, presumably due to the lower sample size compared to Europeans (Supplementary Table S1). Out of the latter pairs, 11 included marker DYS385ab, eight DYS390, seven DYS643, and six DYS448. The median NED values over all pairs, namely 0.031 and 0.099 in Asians and in Africans, respectively, were also somewhat different from that of the Europeans. The different continental groups, thus, did not only show varying degrees in the allelic association between particular markers, but completely different sets of markers that showed the strongest association with other markers. Correspondingly, the distribution of redundant information across the markers differed between the continental groups.

3.2. Single-locus diversity measures and their correlation

Given the continental differences in allelic association, we next investigated how strongly the various single-locus diversity measures were correlated and how well they were able to predict allelic association of the respective markers with other markers. The number of alleles, gene diversity and single-locus entropy of markers showed substantial differences in their values, but partially also in their ranking (Table 1). While, with the only exception of number of alleles in Europeans, markers DYS385ab and DYS481 ranked first and second with respect to all three diversity measures in all three continental data sets, subsequent ranks differed between the continental groups. For example, marker DYS456 appeared to be less informative in Africans and Asians compared to Europeans, whereas DYS19 and DYS643 may provide more discriminatory information in Africans and DYS635 and DYS448 in Asians, similar to previous analyses (Fig. 3 and Supplementary Table S4 in Ref. [5]). More specifically, for the latter four loci values for gene diversity and single-locus entropy were generally lower in the European data set compared to the African and the Asian one. Unsurprisingly, all three measures were strongly positively and significantly correlated (Table 2), implying that an iterative marker selection based on single-locus diversity should lead to highly similar results irrespective of the actual criterion being used.

Notably, single-locus diversity measures were not predictive of average NED values per marker in the European data set. In particular, correlation between any of the measures and the mean NED value per marker was small (Table 2), and thus, the diversity of a marker did not indicate how much redundant information this marker shared with the other markers on average. In the Asian and African data set, however, Spearman’s rank correlation estimates revealed some dependency between each single-locus diversity measure and the mean NED value per marker, although somewhat lower compared to correlations amongst the single-locus measures. It is therefore more likely that the equivocation-based selection method would lead to results quite similar to those of the other selection strategies that focused on single-locus diversity measures in those two continental data sets.

Fig. 2. Correlation between single-locus diversity measures in Europeans. Scatter plots of the single-locus entropy with (A) the number of singletons, (B) haplotype diversity, and (C) discrimination capacity, respectively, for all 646,646 ten-marker subsets in the European data set. Red line: natural cubic spline interpolation. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

3.3. Entropy-based total information assessment

As reported above, markers may provide different amounts of discriminatory information in different continental groups. We therefore quantified the information content of a complete marker panel by use of entropy. However, since entropy increases with increasing numbers of outcomes, i.e., sample size, absolute entropy values by themselves do not have much of a meaningful interpretation. Instead, we used the ratio of the observed total entropy, $H_{total}$, divided by the maximum possible entropy, i.e., $H_{max} = \ln(n)$ where n denotes the sample size. Applying this framework, the PPY23 panel was most informative in Asians (0.9954; $H_{total}$ = 8.11, $H_{max}$ = 8.15) and Europeans (0.9946; $H_{total}$ = 9.30, $H_{max}$ = 9.35), but less so in Africans (0.9566; $H_{total}$ = 5.83, $H_{max}$ = 6.10). These results are similar to previous results based on discrimination capacity (DC = 0.9743/0.9645/0.8472 for Asians/Europeans/Africans, respectively; [5]).
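The maximum possible entropy follows from the sample size alone, which a short Python sketch can illustrate. The sample sizes are those given for the three data sets in Section 2.5; the function names are hypothetical.

```python
import math


def max_entropy(n):
    """Maximum possible entropy ln(n), reached when all n haplotypes are distinct."""
    return math.log(n)


def entropy_ratio(h_total, n):
    """Observed total entropy divided by the maximum possible entropy."""
    return h_total / max_entropy(n)


# Sample sizes of the three continental data sets.
sample_sizes = {"Europeans": 11501, "Africans": 445, "Asians": 3458}
for group, n in sample_sizes.items():
    print(group, round(max_entropy(n), 2))
```

Rounded to two decimals, the three maxima agree with the $H_{max}$ values quoted above. Ratios recomputed from the rounded $H_{total}$ and $H_{max}$ values can differ slightly in the last digits from the reported ratios, which were presumably obtained from unrounded entropies.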

3.4. Iterative marker selection by equivocation

We next applied our equivocation-based marker selection approach separately to each of the three continental data sets. Marker DYS385ab showed the highest entropy in all three data sets and was therefore always chosen first. While this marker accounted for roughly a third (H = 2.88, 30.9%) of the total observed entropy of the complete PPY23 haplotype system in Europeans ($H_{total}$ = 9.30), it already explained about one half in the African (H = 3.06, 52.5% of $H_{total}$ = 5.83) as well as the Asian data set (H = 3.94, 48.5% of $H_{total}$ = 8.11).

Selection for maximal equivocation, i.e., the largest gain in information atop of that provided by the current haplotype system, led to an iterative addition of markers in different orders for the three continental data sets (Supplementary Tables S2–S4). The NED measure helps to understand and interpret the selection process. For example, marker DYS438 was among the top ten of markers with highest GD and entropy in Europeans (Table 1), but was added last to the marker set and thus, identified as the marker with the smallest increase in independent information (Supplementary Table S2). The NED measure indicated that marker DYS438 did not provide much independent information due to the allelic association with markers that had been included in the marker set earlier because of their higher and independent information content (e.g., NED between DYS438 and DYS385ab: 0.26; between DYS438 and DYS481: 0.17; Supplementary Table S1). Similarly, marker DYS643 in the African data set and markers DYS448 and DYS643 in Asians would have been good candidates for an early inclusion given their single-locus diversity measures, but were added only in the final stages by the equivocation-based approach due to their allelic association with previously included markers.

Remarkably, only a very limited number of markers were required to explain most of the total entropy for all three data sets. For example, the haplotype system formed by the first six selected markers, i.e., by about a quarter of all markers, accounted for over 90% (H = 8.38) of the total entropy in Europeans (Supplementary Table S2). With eight and ten markers, i.e., less than half of all markers, these figures reached over 95% (H = 8.93) and 97.5% (H = 9.13), respectively. In Africans, the first seven markers, i.e., less than a third of all markers, were sufficient to explain over 97.5% of the total entropy (Supplementary Table S3). Notably, markers DYS19 and DYS643 were not included in this selection despite their high single-locus GD and entropy values. Finally, only six markers were sufficient to account for over 97.5% of the total entropy in Asians (Supplementary Table S4).
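The explained-entropy percentages for Europeans follow directly from the entropy values quoted above. The short Python sketch below reproduces this arithmetic; only the three set sizes mentioned in the text are included, and the function and variable names are hypothetical.

```python
# Haplotype entropies quoted above for the European data set.
H_TOTAL_EUR = 9.30                               # full PPY23 system
CUMULATIVE_H_EUR = {6: 8.38, 8: 8.93, 10: 9.13}  # number of markers -> H


def explained_fraction(h_subset, h_total):
    """Share of the total haplotype entropy captured by a marker subset."""
    return h_subset / h_total


def smallest_set_reaching(cumulative_h, h_total, threshold):
    """Smallest listed set size whose explained fraction reaches the threshold.

    Returns None if no listed set size is sufficient.
    """
    for k in sorted(cumulative_h):
        if explained_fraction(cumulative_h[k], h_total) >= threshold:
            return k
    return None


# Yields 6, 8 and 10 markers for the 90%, 95% and 97.5% thresholds.
sizes = [smallest_set_reaching(CUMULATIVE_H_EUR, H_TOTAL_EUR, t)
         for t in (0.90, 0.95, 0.975)]
```

The resulting fractions (about 90.1%, 96.0% and 98.2%) agree with the explained-entropy row of Table 3.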

Fig. 3. Forensic parameters for different marker subsets of size 1–22. (A–C) Proportion of singletons, (D–F) discrimination capacity, (A, D) Europeans, (B, E) Africans, (C, F) Asians. Red: equivocation-based marker selection; orange: selection based on allele number; green: selection based on gene diversity; blue: selection based on single-locus entropy. Magenta circles: forensic kits MHT (8 markers), PPY12 (11), Yfiler (16) and PPY23 (22). Black dotted line: data set optimum obtained from an exhaustive search.

S. Siegert et al. / Forensic Science International: Genetics 16 (2015) 216–225 221

Table 2. Correlation of single-locus diversity measures among themselves and with pairwise allelic association values for the PPY23 panel.

| | N (a), EUR | N (a), AFR | N (a), ASI | GD (b), EUR | GD (b), AFR | GD (b), ASI | Average NED (d), EUR | Average NED (d), AFR | Average NED (d), ASI |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| N (a) | | | | | | | 0.22 (p = 0.331) | 0.43 (p = 0.043) | 0.47 (p = 0.026) |
| GD (b) | 0.76 (p = 3.6 × 10⁻⁵) | 0.75 (p = 5.2 × 10⁻⁵) | 0.91 (p = 2.7 × 10⁻⁹) | | | | 0.18 (p = 0.433) | 0.44 (p = 0.041) | 0.45 (p = 0.039) |
| H (c) | 0.83 (p = 1.7 × 10⁻⁶) | 0.82 (p = 2.6 × 10⁻⁶) | 0.96 (p = 9.8 × 10⁻¹³) | 0.96 (p = 3.6 × 10⁻⁶) | 0.99 (p = 3.0 × 10⁻⁶) | 0.97 (p = 3.5 × 10⁻⁶) | 0.16 (p = 0.485) | 0.52 (p = 0.015) | 0.38 (p = 0.080) |

Spearman’s rank correlation estimates between single-locus diversity measures and with the mean normalized entropy difference (average NED) as well as corresponding p-values are displayed for each dataset, respectively. EUR: European dataset; AFR: African dataset; ASI: Asian dataset.

(a) N: Number of distinct alleles.
(b) GD: Gene diversity.
(c) H: Entropy.
(d) Average NED: The average normalized entropy difference (NED) was calculated as the mean of NED values of all possible 21 combined two-locus systems for each marker, respectively.

In order to investigate the discriminatory features relevant for forensic applications, we then compared three specific marker sets of limited size (namely those accounting for more than 90%, 95% and 97.5% of the total entropy in the data set, respectively) obtained from our equivocation-based selection against established forensic marker sets, namely the minimal haplotype (MHT), PPY12 and Yfiler panels, with respect to commonly used forensic parameters. While none of the selected marker sets, with their much smaller number of markers, equaled the discriminatory capacity of the full PPY23 panel, they showed a better discriminatory performance than the established marker panels while using the same or even a lower number of markers (Tables 3–5). For example, ten markers were sufficient to outperform the 16-marker Yfiler panel in each of the relevant forensic parameters in Europeans (Table 3). In Africans the seven-marker haplotype, while accounting for more than 97.5% of the total PPY23 haplotype entropy, did not outperform the Yfiler panel, but nevertheless showed markedly increased discrimination power compared to the 11-marker PPY12 panel (Table 4). Similarly, in Asians the discrimination performance of the six-marker haplotype, explaining more than 97.5% of the overall PPY23 haplotype entropy, was between that of the PPY12 and Yfiler panels (Table 5).
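The forensic parameters used in this comparison can be computed from a plain list of haplotypes. The Python sketch below assumes the commonly used definitions, which are not spelled out in the text above: MP as the sum of squared haplotype frequencies, HD as n/(n − 1) · (1 − MP), and DC as the number of distinct haplotypes divided by the sample size n. The function name and the toy data are hypothetical.

```python
from collections import Counter


def forensic_parameters(haplotypes):
    """Singletons, distinct haplotypes, HD, MP and DC for a haplotype sample.

    Assumed definitions (common usage, not taken from the text above):
      MP = sum of squared haplotype frequencies
      HD = n / (n - 1) * (1 - MP)
      DC = number of distinct haplotypes / n
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two haplotypes are needed")
    counts = Counter(haplotypes)
    mp = sum((c / n) ** 2 for c in counts.values())
    return {
        "singletons": sum(1 for c in counts.values() if c == 1),
        "distinct": len(counts),
        "HD": n / (n - 1) * (1.0 - mp),
        "MP": mp,
        "DC": len(counts) / n,
    }


# Hypothetical toy sample: one haplotype seen twice, three seen once.
toy_parameters = forensic_parameters(["A", "A", "B", "C", "D"])
```

For this toy sample the sketch gives three singletons, four distinct haplotypes, MP = 0.28, HD = 0.9 and DC = 0.8.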

3.5. Comparative analysis of marker selection approaches

We finally set out to investigate empirically how well our approach is able to obtain the optimal marker set among all sets of a given size, and how other methods that are based on single-locus characteristics and that do not take the allelic association structure with previously selected markers into account compare to our approach. To this end, we repeated the marker selection, separately for each continental data set, and based it in turn on the number of alleles, the gene diversity and the single-locus entropy of the markers. In order to obtain the optimal forensic parameter value in a given data set for a given marker set size k (1 ≤ k ≤ 22), we exhaustively searched for the optimum of the respective parameter over all possible k-marker subsets out of the full 22-marker set, i.e., $\binom{22}{k}$ different sets, separately for each continental data set. We observed a monotone relationship between the k-marker haplotype entropy and each of the forensic parameters in all three continental data sets. More specifically, among the marker sets comprising a fixed number of markers, those with higher haplotype entropy also tended to have more haplotype singletons, a higher haplotype diversity and a higher discrimination capacity. As an example, Fig. 2 depicts this relationship for the 646,646 possible ten-marker subsets in the European data set. On average, a gain of one Shannon in information content significantly increased the number of singletons by 3059, HD by 0.001536 and DC by 26.30% (p < 10⁻¹⁶, respectively). Note that iterative addition of markers in any of the selection methods finally led to the PPY23 22-marker panel haplotype system and, thus, yielded identical forensic parameter values.

The equivocation-based marker selection method almost always outperformed all single-locus based selection methods with respect to any of the four considered forensic parameters in all three continental data sets, in particular in the midrange of the marker set size. Furthermore, the equivocation-based approach led to the optimum forensic parameter value for most marker subset sizes k and was close to the optimum otherwise for all three continental data sets. More specifically, the equivocation-based marker set reached the data set optimum for the DC for most k values in Europeans (Fig. 3; Supplementary Table S5), with only five exceptions (k = 2, 3, 4, 14, 15) where the equivocation-selected marker set differed from the optimal one by only a single Y-STR marker and still ranked among the top five, respectively (Supplementary Table S8). Furthermore, the equivocation-based approach yielded the data set optimum for 14 out of 21 possible k values for the number of singletons and for 12 for both HD and MP, respectively (Fig. S1; Supplementary Table S8). In Africans, the equivocation-based marker selection yielded the data set specific optimum for all but three values for k (5–7 for both HD and MP; 4, 7, and 8 for DC) and generally yielded marker sets with higher discrimination power than those obtained from single-locus diversity based selection (Fig. 3; Fig. S1; Supplementary Table S6). Details for the marker sets yielding the respective optimum values can be found in Supplementary Table S9. In Asians, we again observed a superior performance of the equivocation-based selection over single-locus selection strategies (Fig. 3; Fig. S1; Supplementary Table S7). The equivocation-based selection yielded the optimum number of singletons for 10 out of 21 possible k values, the optimum HD and MP values for 17 and the optimum DC value for 14. Notably, for six marker set sizes (k = 3, 4, 15–18) there exists a marker set with higher haplotype entropy than that of the equivocation-selected set, respectively; these marker sets provided even higher, albeit only marginally, discrimination power for the most part (Supplementary Table S10).

Table 3. Forensic parameters for three equivocation-selected and four commonly used marker panels in the European data set.

| | EUR_90.0 (a) | EUR_95.0 (b) | EUR_97.5 (c) | MHT (d) | PPY12 (e) | Yfiler (f) | PPY23 (g) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| # Markers | 6 | 8 | 10 | 8 | 11 | 16 | 22 |
| Explained entropy (h) (%) | 90.1 | 96.0 | 98.2 | 78.0 | 86.6 | 97.5 | 100.0 |
| # Singletons (%) | 5120 (44.5) | 7599 (66.1) | 9063 (78.8) | 2919 (25.4) | 4701 (40.9) | 8773 (76.3) | 10778 (93.7) |
| # Distinct haplotypes | 6725 | 8931 | 10034 | 4232 | 6098 | 9747 | 11113 |
| HD | 0.999563 | 0.999897 | 0.999958 | 0.996322 | 0.998846 | 0.999930 | 0.999993 |
| MP | 5.24 × 10⁻⁴ | 1.90 × 10⁻⁴ | 1.28 × 10⁻⁴ | 3.77 × 10⁻³ | 1.24 × 10⁻³ | 1.57 × 10⁻⁴ | 9.53 × 10⁻⁵ |
| DC | 0.5847 | 0.7765 | 0.8724 | 0.3680 | 0.5302 | 0.8475 | 0.9663 |

HD: haplotype diversity; MP: match probability; DC: discrimination capacity.
(a) EUR_90.0, equivocation-selected set of markers that explains ≥90% of the 22-loci system entropy in Europeans; includes markers DYS385ab, DYS570, DYS576, DYS458, DYS481, DYS456.
(b) EUR_95.0, equivocation-selected set of markers that explains ≥95% of the 22-loci system entropy in Europeans; includes markers DYS385ab, DYS570, DYS576, DYS458, DYS481, DYS456, DYS439, DYS549.
(c) EUR_97.5, equivocation-selected set of markers that explains ≥97.5% of the 22-loci system entropy in Europeans; includes markers DYS385ab, DYS570, DYS576, DYS458, DYS481, DYS456, DYS439, DYS549, DYS390, DYS389II.I.
(d) Minimal haplotype; includes markers DYS19, DYS385ab, DYS389I, DYS389II.I, DYS390, DYS391, DYS392, DYS393.
(e) PowerPlex® Y12; includes markers DYS19, DYS385ab, DYS389I, DYS389II.I, DYS390, DYS391, DYS392, DYS393, DYS437, DYS438, DYS439.
(f) Yfiler®; includes markers DYS19, DYS385ab, DYS389I, DYS389II.I, DYS390, DYS391, DYS392, DYS393, DYS437, DYS438, DYS439, DYS448, DYS456, DYS458, DYS635, GATAH4.
(g) PowerPlex® Y23; includes markers DYS19, DYS385ab, DYS389I, DYS389II.I, DYS390, DYS391, DYS392, DYS393, DYS437, DYS438, DYS439, DYS448, DYS456, DYS458, DYS635, GATAH4, DYS481, DYS533, DYS549, DYS570, DYS576 and DYS643.
(h) Percentage of the entropy of 22-loci system (PPY23) in Europeans.

Table 4. Forensic parameters for three equivocation-selected and four commonly used marker panels in the African data set.

| | AFR_90.0 (a) | AFR_95.0 (b) | AFR_97.5 (c) | MHT (d) | PPY12 (e) | Yfiler (f) | PPY23 (g) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| # Y-STR markers | 4 | 5 | 7 | 8 | 11 | 16 | 22 |
| Explained entropy (h) (%) | 93.28 | 95.57 | 97.62 | 90.64 | 93.31 | 98.00 | 100.0 |
| # Singletons (%) | 209 (46.97) | 258 (57.98) | 287 (64.49) | 196 (44.04) | 237 (53.26) | 305 (68.54) | 337 (75.73) |
| # Distinct haplotypes | 289 | 321 | 343 | 269 | 300 | 353 | 377 |
| HD | 0.995880 | 0.996761 | 0.997996 | 0.993785 | 0.995060 | 0.997824 | 0.998704 |
| MP | 6.36 × 10⁻³ | 5.48 × 10⁻³ | 4.25 × 10⁻³ | 8.45 × 10⁻³ | 7.18 × 10⁻³ | 4.19 × 10⁻³ | 3.54 × 10⁻³ |
| DC | 0.6494 | 0.7213 | 0.7708 | 0.6045 | 0.6742 | 0.7933 | 0.8472 |

HD: haplotype diversity; MP: match probability; DC: discrimination capacity.
(a) AFR_90.0, equivocation-selected set of markers that explains ≥90% of the 22-loci system entropy in Africans; includes markers DYS385ab, DYS481, DYS576, DYS458.
(b) AFR_95.0, equivocation-selected set of markers that explains ≥95% of the 22-loci system entropy in Africans; includes markers DYS385ab, DYS481, DYS576, DYS458, DYS570.
(c) AFR_97.5, equivocation-selected set of markers that explains ≥97.5% of the 22-loci system entropy in Africans; includes markers DYS385ab, DYS481, DYS576, DYS458, DYS570, DYS635, DYS549.
(d) Minimal haplotype; includes markers DYS19, DYS385ab, DYS389I, DYS389II.I, DYS390, DYS391, DYS392, DYS393.
(e) PowerPlex® Y12; includes markers DYS19, DYS385ab, DYS389I, DYS389II.I, DYS390, DYS391, DYS392, DYS393, DYS437, DYS438, DYS439.
(f) Yfiler®; includes markers DYS19, DYS385ab, DYS389I, DYS389II.I, DYS390, DYS391, DYS392, DYS393, DYS437, DYS438, DYS439, DYS448, DYS456, DYS458, DYS635, GATAH4.
(g) PowerPlex® Y23; includes markers DYS19, DYS385ab, DYS389I, DYS389II.I, DYS390, DYS391, DYS392, DYS393, DYS437, DYS438, DYS439, DYS448, DYS456, DYS458, DYS635, GATAH4, DYS481, DYS533, DYS549, DYS570, DYS576 and DYS643.
(h) Percentage of the entropy of 22-loci system (PPY23) in Africans.

Table 5. Forensic parameters for three equivocation-selected and four commonly used marker panels in the Asian data set.

| | ASI_90.0 (a) | ASI_95.0 (b) | ASI_97.5 (c) | MHT (d) | PPY12 (e) | Yfiler (f) | PPY23 (g) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| # Y-STR markers | 4 | 5 | 6 | 8 | 11 | 16 | 22 |
| Explained entropy (h) (%) | 93.50 | 97.23 | 98.40 | 93.37 | 96.48 | 99.23 | 100.0 |
| # Singletons (%) | 1742 (50.38) | 2524 (72.99) | 2807 (81.17) | 1990 (57.55) | 2428 (70.21) | 3058 (88.43) | 3293 (95.23) |
| # Distinct haplotypes | 2354 | 2901 | 3086 | 2449 | 2812 | 3234 | 3369 |
| HD | 0.999624 | 0.999851 | 0.999914 | 0.999485 | 0.999792 | 0.999949 | 0.999983 |
| MP | 6.65 × 10⁻⁴ | 4.38 × 10⁻⁴ | 3.75 × 10⁻⁴ | 8.04 × 10⁻⁴ | 4.97 × 10⁻⁴ | 3.40 × 10⁻⁴ | 3.07 × 10⁻⁴ |
| DC | 0.6807 | 0.8389 | 0.8924 | 0.7082 | 0.8132 | 0.9352 | 0.9743 |

HD: haplotype diversity; MP: match probability; DC: discrimination capacity.
(a) ASI_90.0, equivocation-selected set of markers that explains ≥90% of the 22-loci system entropy in Asians; includes markers DYS385ab, DYS481, DYS570, DYS576.
(b) ASI_95.0, equivocation-selected set of markers that explains ≥95% of the 22-loci system entropy in Asians; includes markers DYS385ab, DYS481, DYS570, DYS576, DYS458.
(c) ASI_97.5, equivocation-selected set of markers that explains ≥97.5% of the 22-loci system entropy in Asians; includes markers DYS385ab, DYS481, DYS570, DYS576, DYS458, DYS439.
(d) Minimal haplotype; includes markers DYS19, DYS385ab, DYS389I, DYS389II.I, DYS390, DYS391, DYS392, DYS393.
(e) PowerPlex® Y12; includes markers DYS19, DYS385ab, DYS389I, DYS389II.I, DYS390, DYS391, DYS392, DYS393, DYS437, DYS438, DYS439.
(f) Yfiler®; includes markers DYS19, DYS385ab, DYS389I, DYS389II.I, DYS390, DYS391, DYS392, DYS393, DYS437, DYS438, DYS439, DYS448, DYS456, DYS458, DYS635, GATAH4.
(g) PowerPlex® Y23; includes markers DYS19, DYS385ab, DYS389I, DYS389II.I, DYS390, DYS391, DYS392, DYS393, DYS437, DYS438, DYS439, DYS448, DYS456, DYS458, DYS635, GATAH4, DYS481, DYS533, DYS549, DYS570, DYS576 and DYS643.
(h) Percentage of the entropy of 22-loci system (PPY23) in Asians.

4. Discussion

In this study, we propose a novel approach for selecting a set of forensic markers based on Shannon's equivocation. The basic motivation for this approach is the observation that forensic STR markers can show substantial allelic association and, thus, provide partially redundant information. With the development of increasingly large marker sets and, correspondingly, smaller inter-marker distances, it is conceivable that the degree of redundancy will increase. Furthermore, since allelic association is strongly influenced by population history, patterns of redundancy will differ between populations of different ancestry. In this study, we indeed observed varying patterns of allelic association between Y-STR markers in different continental groups and, thus, different degrees of redundancy between those markers. Marker selection strategies that focus exclusively on single-marker characteristics ignore inter-marker redundancy and, thus, may lead to marker sets of suboptimal discriminatory capacity.

Our proposed equivocation-based selection approach implicitly accounts for multi-allelic association between markers by evaluating the gain in independent information. Starting with the most informative marker, i.e., the marker with the highest single-locus entropy, the approach iteratively adds that Y-STR which leads to the largest gain of independent information to the amount of information already provided by the current marker set. Through application to data sets from three different continental groups, we could show that our approach yielded forensic marker sets that provided optimal or near optimal discrimination power with limited set size. In particular, we could demonstrate a superior performance of our approach with respect to commonly used forensic parameters, compared to approaches based on single-locus characteristics, such as the number of alleles, gene diversity or single-locus entropy; moreover, our approach yielded the data set specific optimum most of the time. This implies that inclusion of markers guided by these single-locus features alone may lead to suboptimal results. The underlying reason for the suboptimal performance of the latter is their failure to consider the dependency structure between the markers. Inclusion of several markers that individually provide high discrimination capacity into a forensic set may lead to suboptimal sets if these markers are correlated. Thus, the equivocation-based selection approach will often include only a subset of correlated markers even if each alone shows high discrimination capacity.
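The iterative procedure described in this paragraph can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: haplotype frequencies are estimated by counting, the gain of a candidate marker is taken as the increase in joint haplotype entropy, and ties are broken by input order. The marker names M1–M3, the function names and the toy data are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def joint_entropy(rows, markers):
    """Entropy (natural log) of the haplotypes restricted to the given markers."""
    n = len(rows)
    counts = Counter(tuple(row[m] for m in markers) for row in rows)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def greedy_equivocation_order(rows, markers):
    """Order markers by greedy maximization of the gain in haplotype entropy."""
    selected, remaining = [], list(markers)
    while remaining:
        h_current = joint_entropy(rows, selected)  # zero for the empty set
        # Gain of a candidate m: H(selected + m) - H(selected).
        best = max(remaining,
                   key=lambda m: joint_entropy(rows, selected + [m]) - h_current)
        selected.append(best)
        remaining.remove(best)
    return selected


def best_subset_exhaustive(rows, markers, k):
    """Reference optimum: the k-marker subset with the highest haplotype entropy."""
    return max(combinations(markers, k), key=lambda s: joint_entropy(rows, s))


# Hypothetical toy data: M2 is perfectly associated with M1, M3 is independent.
toy_rows = [{"M1": a, "M2": b, "M3": c}
            for (a, b) in [(1, "x"), (2, "y"), (3, "z"), (4, "w")]
            for c in (0, 1)]
toy_markers = ["M1", "M2", "M3"]
greedy_order = greedy_equivocation_order(toy_rows, toy_markers)
```

In this toy example M2 has a higher single-locus entropy than M3 but adds nothing once M1 is included, so the greedy order is M1, M3, M2, and the two-marker optimum from the exhaustive search is (M1, M3). The exhaustive reference grows with the binomial coefficient, e.g., 646,646 subsets for ten out of 22 markers, which is the motivation for a sequential procedure.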

4.1. Population specificity

Population history strongly influences the patterns of allelic association as well as the ranks of Y-STR markers with respect to gene diversity, allele number and single-locus entropy, as already noted by Purps et al. [5]. Thus, the redundancy in discriminatory information will vary from population to population and the equivocation-based selection procedure will yield differing marker sets for a given size. For example, while DYS385ab and DYS481 were most discriminating and informative across all three continental groups, markers DYS456, DYS643 and DYS635 differed in their variability in Europeans, Africans and Asians. In any case, several markers, namely DYS385ab, DYS570, DYS576, DYS458 and DYS481, were chosen as the top five in all three continental groups. Marker selection based on equivocation should therefore be carried out separately in different populations or groups of populations of similar ancestry in order to account for this population dependency. Y-chromosomal haplogroups, defined by the presence of particular alleles at designated single-nucleotide polymorphisms, indicate a shared ancestry and could help in delineating groups of populations with similar inter-marker association patterns. Marker sets for meta-populations, i.e., sets that provide high discriminatory power for a number of populations, could be derived by the fusion of population-specific marker lists.

4.2. Future forensic marker selection

In this study, we observed that the currently used forensic marker panels MHT, PPY12 and Yfiler often provide strikingly lower discriminatory power than equally sized equivocation-selected sets in all continental data sets. This is not surprising, since discriminatory capacity has not been the only criterion for an optimal choice of a forensic marker set in the past. Optimization with respect to other criteria, such as technical reproducibility, fragment length, number of intermediate alleles and the availability of allele frequency information for many populations, also played and still play an important role. Kayser et al. published a survey of 166 forensically useful polymorphic human Y-chromosomal microsatellites [28], a number that was extended to 186 in a more recent study [2]. Forensic kits have since then been constructed using these comprehensive libraries of Y-STRs following a conservative strategy of adding one or another STR to existing panels. So far, only the so-called RM panel [3] follows a rigorously different concept combining only the top-13 most rapidly mutating markers in a panel. This concept, however, has drawbacks since many multi-local markers are not easily typed and interpreted. For forensic use, modified RM panels have thus been proposed which combine only fast-mutating single-copy markers [29]. Efforts to design forensic kits for applications in South Africa which replace less informative Europe-specific markers with STRs that have a higher diversity in Sub-Saharan Africa (see e.g., Ref. [30]) are based on small empirical frequency datasets and have not been successfully translated into practice.

Nevertheless, the suboptimal performance of seasoned forensic Y-STR kits is in part also explained by the substantial allelic association between some of the markers. Future forensic marker kit design should therefore operate on a large set of possible markers and employ an equivocation-based selection procedure to obtain a maximally informative marker set for forensic applications, thereby augmenting other technical optimization criteria. However, we acknowledge that these criteria and other considerations, such as multiplex construction, may pose obstacles to the implementation of our approach.

A generalization of our approach from Y-chromosomal to X-chromosomal and autosomal STR markers is theoretically straightforward. Demonstrated allelic association between those markers provides strong motivation to apply this framework for marker selection. However, allelic association and equivocation analysis rest on reliable haplotype information. With increasing numbers of both markers and alleles, it becomes increasingly difficult to estimate the frequencies of these haplotypes precisely, to the point of becoming impossible. Future progress in sequencing technology may provide the data basis for an application of our equivocation-based marker selection procedure to autosomal and X-chromosomal forensic panels.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.fsigen.2015.02.001.

References

[1] M. Kayser, A. Caglia, D. Corach, N. Fretwell, C. Gehrig, G. Graziosi, et al., Evaluation of Y-chromosomal STRs: a multicenter study, Int. J. Legal Med. 110 (1997) 41–49 125–133.

[2] K.N. Ballantyne, M. Goedbloed, R. Fang, O. Schaap, O. Lao, A. Wollstein, et al., Mutability of Y-chromosomal microsatellites: rates, characteristics, molecular bases, and forensic implications, Am. J. Hum. Genet. 87 (2010) 341–353.

[3] K.N. Ballantyne, A. Ralf, R. Aboukhalid, N.M. Achakzai, M.J. Anjos, Q. Ayub, et al., Toward male individualization with rapidly mutating Y-chromosomal short tandem repeats, Hum. Mutat. 35 (2014) 1021–1032.

[4] J.M. Thompson, M.M. Ewing, D.R. Rabbach, P.M. Fulmer, C.J. Sprecher, D.R. Storts, The PowerPlex® Y23 System: A New Y-STR Multiplex for Casework and Database Applications. http://www.promega.de/resources/profiles-in-dna/2012/the-powerplex-y23-system-a-new-y-str-multiplex-for-casework-and-database-applications/, 2012.

[5] J. Purps, S. Siegert, S. Willuweit, M. Nagy, C. Alves, R. Salazar, et al., A global analysis of Y-chromosomal haplotype diversity for 23 STR loci, Forensic Sci. Int. Genet. 12C (2014) 12–23.

[6] H. Rodig, L. Roewer, A. Gross, T. Richter, P. de Knijff, M. Kayser, et al., Evaluation of haplotype discrimination capacity of 35 Y-chromosomal short tandem repeat loci, Forensic Sci. Int. 174 (2008) 182–188.

[7] S.K. Lim, Y. Xue, E.J. Parkin, C. Tyler-Smith, Variation of 52 new Y-STR loci in the Y chromosome consortium worldwide panel of 76 diverse individuals, Int. J. Legal Med. 121 (2007) 124–127.

[8] M. Vermeulen, A. Wollstein, K. van der Gaag, O. Lao, Y. Xue, Q. Wang, et al., Improving global and regional resolution of male lineage differentiation by simple single-copy Y-chromosomal short tandem repeat polymorphisms, Forensic Sci. Int. Genet. 3 (2009) 205–213.

[9] M. Geppert, J. Edelmann, R. Lessig, The Y-chromosomal STRs DYS481, DYS570, DYS576 and DYS643, Legal Med. (Tokyo) 11 (Suppl. 1) (2009) S109–S110.

[10] M.E. D'Amato, L. Ehrenreich, K. Cloete, M. Benjeddou, S. Davison, Characterization of the highly discriminatory loci DYS449, DYS481, DYS518, DYS612, DYS626, DYS644 and DYS710, Forensic Sci. Int. Genet. 4 (2010) 104–110.

[11] G.A. Huttley, M.W. Smith, M. Carrington, S.J. O’Brien, A scan for linkage disequilibrium across the human genome, Genetics 152 (1999) 1711–1722.

[12] A. Johansson, V. Vavruch-Nilsson, A. Edin-Liljegren, P. Sjolander, U. Gyllensten, Linkage disequilibrium between microsatellite markers in the Swedish Sami relative to a worldwide selection of populations, Hum. Genet. 116 (2005) 105–113.

[13] K.L. Mohlke, E.M. Lange, T.T. Valle, S. Ghosh, V.L. Magnuson, K. Silander, et al., Linkage disequilibrium between microsatellite markers extends beyond 1 cM on chromosome 20 in Finns, Genome Res. 11 (2001) 1221–1226.

[14] A. Tenesa, A.F. Wright, S.A. Knott, A.D. Carothers, C. Hayward, A. Angius, et al., Extent of linkage disequilibrium in a Sardinian sub-isolate: sampling and methodological considerations, Hum. Mol. Genet. 13 (2004) 25–33.

[15] T.M. Diegoli, M.D. Coble, Development and characterization of two mini-X chromosomal short tandem repeat multiplexes, Forensic Sci. Int. Genet. 5 (2011) 415–421.

[16] D. Kling, B. Dell’Amico, P.J. Haddeland, A.O. Tillmar, Population genetic analysis of 12 X-STRs in a Somali population sample, Forensic Sci. Int. Genet. 11 (2014) e7–8.

[17] A.O. Tillmar, P. Mostad, T. Egeland, B. Lindblom, G. Holmlund, K. Montelius, Analysis of linkage and linkage disequilibrium for eight X-STR markers, Forensic Sci. Int. Genet. 3 (2008) 37–41.

[18] A. Caliebe, J. Jochens, S. Willuweit, L. Roewer, M. Krawczak, No shortcut solution to the problem of Y-STR match probability calculation, Forensic Sci. Int. Genet. 15 (2015) 69–75.

[19] C.E. Shannon, A mathematical theory of communication, Bell Syst. Tech. J. 27 (1948) 623–656.

[20] M. Nothnagel, K. Rohde, The effect of single-nucleotide polymorphism marker selection on patterns of haplotype blocks and haplotype frequency estimates, Am. J. Hum. Genet. 77 (2005) 988–998.

[21] M. Nothnagel, R. Furst, K. Rohde, Entropy as a measure for linkage disequilibrium over multilocus haplotype blocks, Hum. Hered. 54 (2002) 186–198.

[22] S.S. Wilks, Mathematical Statistics, Wiley, New York, 1962.

[23] M. Nei, F. Tajima, DNA polymorphism detectable by restriction endonucleases, Genetics 97 (1981) 145–163.

[24] M. Nei, Molecular Evolutionary Genetics, Columbia University Press, New York, 1987.

[25] Promega Corporation, Technical Manual: PowerPlex® Y23 System – Instructions for Use of Products DC2305 and DC2320. http://www.promega.com, 2012.

[26] R Core Team, R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, http://www.R-project.org/, 2013.

[27] S. Chasalow, Combinat: combinatorics utilities. R package version 0.0-8. http://CRAN.R-project.org/package=combinat, 2012.

[28] M. Kayser, R. Kittler, A. Erler, M. Hedman, A.C. Lee, A. Mohyuddin, et al., A comprehensive survey of human Y-chromosomal microsatellites, Am. J. Hum. Genet. 74 (2004) 1183–1197.

[29] U. Rogalla, M. Woźniak, J. Swobodziński, M. Derenko, B.A. Malyarchuk, I. Dambuyeva, et al., A novel multiplex assay amplifying 13 Y-STRs characterized by rapid and moderate mutation rate, Forensic Sci. Int. Genet. 15 (2015) 49–55.

[30] M.E. D'Amato, V.B. Bajic, S. Davison, Design and validation of a highly discriminatory 10-locus Y-chromosome STR multiplex system, Forensic Sci. Int. Genet. 5 (2011) 122–125.