

Forensic Science International: Genetics xxx (2014) xxx–xxx

G Model

FSIGEN-1287; No. of Pages 6

The new Y Chromosome Haplotype Reference Database§

Sascha Willuweit *, Lutz Roewer

Department of Forensic Genetics, Institute of Legal Medicine and Forensic Sciences, Charité – Universitätsmedizin, Berlin, Germany

ARTICLE INFO

Keywords: Database, Y chromosome, Haplotype, Y-STR, Frequency estimation, Metapopulation

ABSTRACT

Fourteen years after the first version of an internet-accessible worldwide reference database of Y chromosome profiles was opened, and six years after the last major relaunch, the new YHRD 4.0 repository and website have been rolled out. By November 2014 about 136k 9-locus haplotypes, among these 84k 17-locus haplotypes, 25k 23-locus haplotypes and 15k Y-SNP profiles, from 917 sampling locations in 128 countries had been submitted by more than 250 institutes and laboratories. In geographic terms, about 39% of the YHRD samples are from Europe, 32% from Asia, 16% from South America, 6% from North America, 4% from Africa and 2% from Oceania/Australia. Worldwide collaboration is the driving force for the rapid growth of the database and this, in turn, allows the evaluation and implementation of enhanced interpretation tools (variable frequency estimators, LR-based mixture and kinship analysis, Y-SNP-based ancestry assessment).

© 2014 Elsevier Ireland Ltd. All rights reserved.


1. Introduction

Y chromosome STR profiling is the most advanced method to analyze the male portion of female/male cell mixtures. In addition, a combination of Y-STRs and Y-SNPs provides valuable investigative leads on the geographical ancestry of the male trace donor. The Y Chromosome Haplotype Reference Database (YHRD) assists the interpretation of profiles consisting of such markers [1]. The analytical concept requires high-resolution and sensitive chemistry as well as large reference sample collections in which all major ancestral founder lineages of world populations are represented in appropriate numbers. Interpretation and calculation of match probabilities then follow a genealogical approach, where haplotype frequencies are reported in groups of spatially distributed populations (metapopulations) sharing a common ancestry and thus a similar pool of deep-rooting lineages. To describe the hierarchy of metapopulations we use a knowledge-based terminology, which incorporates linguistic and geographical resources (e.g. Eurasian-European-Western European instead of "Caucasian"). The further development of a terminology that could replace a frequently used but diffuse or misleading historical vocabulary, such as "Hispanics" or "Caucasian" (in the US) or "Mestizo" (in Latin America), is a major challenge. There is, however, no alternative for Y chromosomal profiles, which often (inherently) carry detailed information about geographical or linguistic ancestry.

§ Y Chromosome Haplotype Reference Database: https://yhrd.org.

* Corresponding author at: Charité – Universitätsmedizin Berlin, Institute of Legal Medicine and Forensic Sciences, Department of Forensic Genetics, Augustenburger Platz 1, 13353 Berlin, Germany. Tel.: +49 30 450525032; fax: +49 30 450525912.

E-mail address: [email protected] (S. Willuweit).

Please cite this article in press as: S. Willuweit, L. Roewer, The new Y Chromosome Haplotype Reference Database, Forensic Sci. Int. Genet. (2014), http://dx.doi.org/10.1016/j.fsigen.2014.11.024

1872-4973/© 2014 Elsevier Ireland Ltd. All rights reserved.

The number of studies on Y-STR haplotype distributions has increased enormously in the past years with no signs of slowing down. Studies cover most regions and even remote niches of the global landscape (Fig. 1). In the last two years a new generation of highly discriminative Y-STR kits with more than 20 Y-STRs (PowerPlex® Y23 System, Promega, Madison, USA and Yfiler® Plus, Life Technologies, Foster City, USA) has arrived and sparked an avalanche of new studies. For example, within only 10 months after the release of the PowerPlex Y23 kit, nearly 20k 23-locus haplotypes were collected from 129 populations in 51 countries [2]. Nearly all of these single-center and multi-center studies set up by forensic institutions are submitted to the YHRD prior to publication in peer-reviewed forensic or human genetics journals. The sheer amount of empirical material of the highest forensic quality allows testing of new mathematical models to calculate the weight of evidence of a Y-STR haplotype match by means of likelihood principles. The proposed approaches with variable estimators (surveying, coalescence, discrete Laplace) mainly concern the frequency estimation of rare haplotypes, whereas frequencies of common haplotypes can be readily estimated using counts and Clopper–Pearson confidence intervals [3–7,23]. These estimation methods, which generate results within reasonable computing time, are now implemented in the YHRD. To keep pace with the rapid developments in all three areas, i.e. data generation, chemistry and mathematics, the YHRD was completely remodeled to become more operational in everyday forensic practice. This paper describes the basic principles and the most important changes to previous versions of the database.

Fig. 1. Worldwide map of national databases at the YHRD.

2. Structure and basic principles

2.1. Technical set-up

YHRD has been developed following the behavior-driven development (BDD) principle [8] for the frontend and the test-driven development (TDD) principle [9] for the database backend. Defining test cases and test scenarios that are run automatically with each new release ensures that YHRD works as reliably and trustworthily as possible.

The web interface itself is written in Ruby [10] using the popular Ruby on Rails framework [11]. All haplotype data are managed by a proprietary in-memory database written in C99 and x86-64 assembler for the fastest possible binary pattern matching [12]. Statistical tools are implemented in R [13] and build on the clustering library "snow" by Tierney et al. [14] to allow for parallel processing.

Since the objective of the database is to assess haplotype frequencies and not to identify personal matches, we do not store any IDs of the original submissions.

2.2. Input files

As new kits include an increasing number of Y-STR loci, a fully electronic submission facility became indispensable. To fulfill this need, we implemented a flexible upload mechanism for Excel, XML and CSV files. Moreover, users are able to search the YHRD with their GeneMapper (Life Technologies Inc., Foster City, USA) export files. Both should avoid errors introduced by manual data transfer. Manual input remains possible as well.

2.3. Datasets

The YHRD consists of five separate datasets with different resolutions (see Table 1). Each dataset is a collection of haplotypes that are all defined for a certain set of Y-STR markers (see Tables 1 and 2); a dataset may therefore be part of other, larger datasets. The sets available at YHRD are currently fixed and may change only with new versions. Each result at YHRD (search, tools, etc.) is based on one particular dataset, which means that all values are relative to the size and composition of that dataset.
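The nesting of datasets can be illustrated with plain sets. The following is a minimal sketch, not the YHRD implementation: the marker names are copied from Table 1, while the function name `datasets_for` and the idea of looking up datasets by typed markers are assumptions made for illustration.

```python
# Marker sets as listed in Table 1; each dataset extends a smaller one.
MINIMAL = {"DYS19", "DYS389I", "DYS389II", "DYS390", "DYS391",
           "DYS392", "DYS393", "DYS385"}
Y12 = MINIMAL | {"DYS437", "DYS438", "DYS439"}
YFILER = Y12 | {"DYS448", "DYS456", "DYS458", "DYS635", "YGATAH4"}
Y23 = YFILER | {"DYS481", "DYS533", "DYS549", "DYS570", "DYS576", "DYS643"}
YFILER_PLUS = YFILER | {"DYS576", "DYS627", "DYS460", "DYS518", "DYS570",
                        "DYS449", "DYS481", "DYF387S1", "DYS533"}

DATASETS = {
    "Minimal": MINIMAL,
    "Promega PowerPlex Y12": Y12,
    "Applied Biosystems Yfiler": YFILER,
    "Promega PowerPlex Y23": Y23,
    "Applied Biosystems Yfiler Plus": YFILER_PLUS,
}


def datasets_for(typed_markers):
    """Return the names of all datasets whose full marker set was typed.

    Hypothetical helper: a profile can only be compared within a dataset
    if every marker defining that dataset is present in the profile.
    """
    typed = set(typed_markers)
    return [name for name, markers in DATASETS.items() if markers <= typed]


# A Yfiler profile also belongs to the two smaller datasets.
yfiler_hits = datasets_for(YFILER)
```

Note that the two largest sets both extend Yfiler but neither contains the other, so a PowerPlex Y23 profile cannot be searched in the Yfiler Plus dataset.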

2.4. Metapopulations

The term "metapopulation" was first defined by Levins [15], who stated that "a natural population occupying any considerable area will be made up of a number of . . . local populations". Hanski and Gilpin [16] use the term to describe spatially distributed population groups that are interconnected by gene flow and migration. We adapted this terminology to an assemblage of Y-STR profiles spread over a territory, which descend from one or several common ancestral lineages. Population genetic analyses on different Y chromosomal marker sets show that metapopulations are stabilized over time by cultural (e.g. a common language or marital patterns) or geographical factors. Several studies examined the range expansion of Y chromosomal lineages and (sub-)population effects in more depth than we do here [17–21]. The most comprehensive study so far on YHRD datasets [2] shows that, with an increasing number of Y-STRs included in a marker set, the genetic distances between metapopulations decrease monotonically but are not erased. The distances calculated for 7 up to 23 STRs remain significant, both on a global and on a continental level. The YHRD currently includes 33 metapopulations (see Fig. 2), which are subject to change if new samples arrive or variance analysis (e.g. by AMOVA, see Section 4.1) confirms divergence of population groups (for an example see Fig. S1).

Supplementary Fig. S1 related to this article can be found, in the online version, at doi:10.1016/j.fsigen.2014.11.024.

2.5. National databases

National databases were established to meet the demand for a local ("national" in political terms) Y-STR reference database as required by legislation. They consist of all population samples collected within the political boundaries of the country concerned. The YHRD currently includes 128 national databases with an average size of 1063 haplotypes (standard deviation s = 1112). Sizes range from about 10 haplotypes (Solomon Islands with 8 minimal haplotypes and Democratic Republic of the Congo with 13 minimal haplotypes) through about 7500 (Poland with 7478, Brazil with 7506, Germany with 7675 and United States with 7793 minimal haplotypes) to almost 18,000 (China with 17,792 minimal haplotypes).


Table 2. Distribution of identical haplotypes in the five YHRD datasets as of release 48 (November 2014). Each row gives the number of haplotypes that are present in the respective dataset exactly n times. For example, n = 1 gives the count (and relative percentage) of all singletons in each dataset.

| n | Minimal | Promega PowerPlex Y12 | Applied Biosystems Yfiler | Promega PowerPlex Y23 | Applied Biosystems Yfiler Plus |
|---|---|---|---|---|---|
| 1 | 25,215 (≈19%) | 33,510 (≈35%) | 55,946 (≈66%) | 22,685 (≈90%) | 1171 (≈90%) |
| 2 | 5869 (≈4%) | 5860 (≈6%) | 5761 (≈7%) | 914 (≈4%) | 22 (≈2%) |
| 3 | 2537 (≈2%) | 2223 (≈2%) | 1534 (≈2%) | 100 (<1%) | 2 (<1%) |
| 4 | 1347 (≈1%) | 1125 (≈1%) | 660 (≈1%) | 38 (<1%) | 2 (<1%) |
| 5 | 891 (≈1%) | 612 (≈1%) | 312 (<1%) | 8 (<1%) | 4 (<1%) |
| 6 | 615 (<1%) | 412 (<1%) | 197 (<1%) | 3 (<1%) | 1 (<1%) |
| 7 | 415 (<1%) | 272 (<1%) | 122 (<1%) | 2 (<1%) | 1 (<1%) |
| 8 | 319 (<1%) | 217 (<1%) | 81 (<1%) | 3 (<1%) | 0 |
| 9 | 269 (<1%) | 171 (<1%) | 51 (<1%) | 0 | 0 |
| 10 | 226 (<1%) | 125 (<1%) | 46 (<1%) | 1 (<1%) | 0 |
| >10 | 1714 (≈1%) | 934 (≈1%) | 229 (<1%) | 4 (<1%) | 3 (<1%) |
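A distribution like the one in Table 2 can be derived from any list of haplotypes by counting how many distinct haplotypes occur exactly n times. The sketch below uses a toy dataset; the function name `haplotype_spectrum` and the pooling key are illustrative choices, not part of the YHRD. The percentages in Table 2 appear to be relative to the dataset sizes in Table 1 (for example 25,215 / 136,184 ≈ 18.5% for the Minimal singletons), and the sketch follows that reading.

```python
from collections import Counter


def haplotype_spectrum(haplotypes, cap=10):
    """Map n -> number of distinct haplotypes seen exactly n times.

    Haplotypes seen more than `cap` times are pooled under the key ">cap",
    mirroring the last row of Table 2.
    """
    occurrences = Counter(haplotypes)  # haplotype -> copies in the dataset
    spectrum = Counter()
    for copies in occurrences.values():
        key = copies if copies <= cap else f">{cap}"
        spectrum[key] += 1
    return dict(spectrum)


# Toy dataset: letters stand in for full haplotypes (tuples work as well).
toy = ["A", "A", "B", "C", "C", "C", "D"]
spectrum = haplotype_spectrum(toy)

# Share of each class relative to the dataset size (assumed denominator).
share = {n: count / len(toy) for n, count in spectrum.items()}
```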

Table 1. Compositions of the five YHRD datasets as of release 48 (November 2014). Each row represents a separate dataset with a name and the defining markers, followed by the count of haplotypes and the number of population samples, national databases and metapopulations structuring the dataset.

| Dataset | Y-STR markers | Number of haplotypes | Number of population samples | Number of national databases | Number of metapopulations |
|---|---|---|---|---|---|
| Minimal | DYS19, DYS389I, DYS389II, DYS390, DYS391, DYS392, DYS393, DYS385 | 136,184 | 917 | 128 | 33 |
| Promega PowerPlex Y12 | Minimal + DYS437, DYS438, DYS439 | 96,735 | 659 | 115 | 31 |
| Applied Biosystems Yfiler | Promega PowerPlex Y12 + DYS448, DYS456, DYS458, DYS635, YGATAH4 | 84,256 | 572 | 107 | 31 |
| Promega PowerPlex Y23 | Applied Biosystems Yfiler + DYS481, DYS533, DYS549, DYS570, DYS576, DYS643 | 25,118 | 175 | 54 | 28 |
| Applied Biosystems Yfiler Plus | Applied Biosystems Yfiler + DYS576, DYS627, DYS460, DYS518, DYS570, DYS449, DYS481, DYF387S1, DYS533 | 1297 | 13 | 8 | 7 |


2.6. Y-SNPs and haplogroup definition

Extending Y-STR data with Y-SNP data makes it possible to cluster haplotypes and to refine existing structuring methods based on non-genetic factors. We include Y-SNP information as sets of Y-SNP marker states (ancestral/derived) and transcode them to haplogroups according to the phylogenetic "minimal Y tree" [22]. The YHRD currently includes more than 15,000 chromosomes typed for both STR and SNP markers, which can therefore be assigned to haplogroups.
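Transcoding marker states to a haplogroup amounts to walking down a tree from the root for as long as a derived state is observed. The sketch below shows the idea on a toy excerpt; the three-node tree is a drastic simplification with intermediate nodes omitted and is not the minimal Y tree of [22], and the function name is hypothetical.

```python
# Toy excerpt: haplogroup -> (parent haplogroup, defining SNP).
# Illustration only; intermediate nodes between the root and R are omitted.
TOY_TREE = {
    "R": ("Y", "M207"),
    "R1": ("R", "M173"),
    "R1b": ("R1", "M343"),
}


def assign_haplogroup(snp_states, tree=TOY_TREE, root="Y"):
    """Return the deepest haplogroup reachable from the root via derived SNPs.

    snp_states maps SNP name -> "derived" or "ancestral"; untyped SNPs are
    simply missing. Without usable information the root is returned.
    """
    current = root
    descended = True
    while descended:
        descended = False
        for group, (parent, snp) in tree.items():
            if parent == current and snp_states.get(snp) == "derived":
                current = group  # step one level down and search again
                descended = True
                break
    return current


typed = {"M207": "derived", "M173": "derived", "M343": "ancestral"}
haplogroup = assign_haplogroup(typed)
untyped = assign_haplogroup({})
```

Returning the root for profiles without Y-SNP data matches the behavior described for the ancestry view, where such haplotypes are mapped to "Y".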

3. Search the database

The search result view is made up of panes, each of which collects all available values for a certain feature (e.g. the "Worldwide" feature pane collects all information on the whole dataset). Two additional feature panes give values based on the appropriate sub-database of the chosen dataset ("National Database" and "Metapopulation"). Each pane presents two categories of values:

Observed: The actually observed frequency and the corresponding confidence interval (CI).¹

Expected: Estimated haplotype frequencies and, if applicable, the corresponding confidence interval (CI).¹ Three methods are currently implemented (see Section 3.1 for details): augmented count ((n + 1)/(N + 1)) [3], discrete Laplace [7] and the kappa method [4]. Not all approaches are necessarily present in each pane (e.g. the kappa method is absent when matches are observed).

¹ Confidence intervals are calculated using Clopper–Pearson [23].


3.1. Frequency estimation methods

Together with the actually observed haplotype frequency, an augmented count (n + 1)/(N + 1) is given, i.e. the frequency obtained when the haplotype in question is added to the dataset, where n is the number of matches and N the size of the dataset.
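The counting estimators can be written down in a few lines. The following is a minimal sketch using only the Python standard library; it solves the Clopper–Pearson bounds by bisection on the binomial distribution, which is one standard way to obtain the exact interval and is not necessarily how the YHRD computes it. All function names are illustrative.

```python
import math


def augmented_count(n, N):
    """Frequency after adding the haplotype in question to the dataset."""
    return (n + 1) / (N + 1)


def binom_cdf(k, N, p):
    """P(X <= k) for X ~ Binomial(N, p), with terms evaluated in log space."""
    if k < 0:
        return 0.0
    if k >= N or p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(N + 1) - math.lgamma(i + 1)
                    - math.lgamma(N - i + 1)
                    + i * math.log(p) + (N - i) * math.log1p(-p))
        total += math.exp(log_term)
    return min(total, 1.0)


def _solve(func, target, increasing):
    """Bisection for func(p) == target on (0, 1); func is monotone in p."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if (func(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(n, N, conf=0.95):
    """Exact two-sided confidence interval for n matches among N haplotypes."""
    alpha = 1 - conf
    # Lower bound: P(X >= n | p) = alpha/2, which increases with p.
    lower = 0.0 if n == 0 else _solve(
        lambda p: 1 - binom_cdf(n - 1, N, p), alpha / 2, True)
    # Upper bound: P(X <= n | p) = alpha/2, which decreases with p.
    upper = 1.0 if n == N else _solve(
        lambda p: binom_cdf(n, N, p), alpha / 2, False)
    return lower, upper


# No match in a toy dataset of 100 haplotypes.
lower, upper = clopper_pearson(0, 100)
```

For n = 0 the upper bound has the closed form 1 − (α/2)^(1/N), which gives a quick sanity check of the bisection.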

3.1.1. Discrete Laplace

The discrete Laplace distribution approximates properties of the Fisher–Wright model of evolution and can be used to estimate Y-STR haplotype frequencies by modeling the frequency as a function of the distance between the haplotype in question and each of the pre-calculated subpopulation centers [7]. This estimation method is only available for adequately sized metapopulations and datasets. When estimating the worldwide haplotype frequency, the average over all metapopulation estimates is given. A disproportionately high contribution of one metapopulation to the overall frequency estimator is stated explicitly.
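As a rough illustration of the approach in [7], the discrete Laplace probability mass function is f(d; p) = ((1 − p)/(1 + p)) · p^|d| for an integer allele difference d, and a haplotype frequency is a weighted sum over subpopulation centers of the per-locus products. The sketch below assumes integer alleles only; the centers, weights and dispersion parameters are made-up values, and the helper names are hypothetical.

```python
def discrete_laplace_pmf(d, p):
    """Probability of an allele difference d (integer) with dispersion 0 < p < 1."""
    return (1 - p) / (1 + p) * p ** abs(d)


def haplotype_frequency(haplotype, centers):
    """Mixture estimate over subpopulation centers.

    centers is a list of (weight, central_haplotype, dispersion_per_locus);
    the weights are assumed to sum to 1.
    """
    total = 0.0
    for weight, center, dispersions in centers:
        prob = weight
        for allele, central, p in zip(haplotype, center, dispersions):
            prob *= discrete_laplace_pmf(allele - central, p)
        total += prob
    return total


def worldwide_estimate(per_metapopulation):
    """Plain average over metapopulation estimates, as described above."""
    return sum(per_metapopulation) / len(per_metapopulation)


# Hypothetical two-locus model with two subpopulation centers.
toy_centers = [
    (0.6, (14, 13), (0.2, 0.2)),
    (0.4, (16, 12), (0.3, 0.1)),
]
freq_at_center = haplotype_frequency((14, 13), toy_centers)
freq_far_away = haplotype_frequency((20, 18), toy_centers)
```

Haplotypes close to a center receive a high estimate, and the estimate decays geometrically with every repeat step away from it.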

3.1.2. Kappa method

Derived from Ewens' work [24], Brenner's kappa (κ) method estimates the haplotype frequency by correcting the simple counting method using κ, the proportion of singletons (haplotypes that occur only once) in the dataset [4]. Simulation studies [6] showed that the kappa method provides useful frequency information in cases of rare haplotypes. It is therefore used as a frequency estimator when no haplotype match is observed.
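A small sketch of the kappa correction follows. It assumes the form (1 − κ)/n, with n the dataset size after adding the unobserved haplotype and κ the singleton proportion in that extended dataset; this is our reading of [4] and should be checked against the original before use. The function name is hypothetical.

```python
from collections import Counter


def kappa_estimate(database, haplotype):
    """Brenner-style estimate for a haplotype not observed in `database`.

    Assumed form: (1 - kappa) / n, evaluated on the database extended by
    the haplotype in question.
    """
    if haplotype in database:
        raise ValueError("kappa method applies only when no match is observed")
    extended = list(database) + [haplotype]
    n = len(extended)
    counts = Counter(extended)
    singletons = sum(1 for copies in counts.values() if copies == 1)
    kappa = singletons / n
    return (1 - kappa) / n


# Toy database of four haplotypes; "D" has not been seen before.
toy_db = ["A", "A", "B", "C"]
estimate = kappa_estimate(toy_db, "D")
naive = 1 / (len(toy_db) + 1)
```

In the toy example three of the five haplotypes in the extended dataset are singletons, so κ = 0.6 and the estimate is 0.4/5 = 0.08, well below the naive 1/5.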

3.2. Ancestry information

Fig. 2. Structure and terminology of metapopulations defined for the YHRD. Note that the second layer is not labeled to improve readability.

When the "Ancestry Information" feature is added to a result sheet for any targeted haplotype, the incremental minimal haplotype dataset (9 loci), which is present in all sampled individuals, is searched. Four different views on the distribution of matching minimal haplotypes provide data that help the user to assess the phylogenetic ancestry and the biogeographical origin of the chromosome:

• Metapopulation frequency distribution: A graphical representation of the relative frequency of the minimal haplotype for each metapopulation is given. A bolder font indicates a significantly higher relative frequency of the minimal haplotype in question.

• Y-SNPs: The composition of Y-SNPs associated with the matched minimal haplotypes is given as a pie chart. All haplotypes without Y-SNP information will be mapped to the phylogenetic root "Y".

• National database frequencies: For each match of the minimal haplotype the relative as well as the absolute frequency in the appropriate matching national database is given as a bar plot.

• Heat Map: A spatial projection of the relative minimal haplotype frequency is given as an interactive world map, with each dot representing an individual sampling site. Non-matches are colored in blue, whereas red dots are those with at least one minimal haplotype match. An underlying heat map gradient in red helps to spot sampling sites with high relative frequencies.

4. Associated tools

4.1. AMOVA/MDS

The AMOVA (pairwise FST/RST calculation) [25] has been carried over from the last version of the database. Beginning with the submission of a data file (Excel, CSV or the like) composed of one or many population sample(s), the researcher selects any population sample or whole national databases of the appropriate YHRD dataset for comparison. Note that we decided to limit the number of haplotypes involved in a calculation to 10,000 following the principle of fair use.

All selected single population samples and/or national databases are used to calculate a matrix of pairwise FST/RST together with their corresponding p-values (based on 10,000 iterations of random rearrangement). Based on that matrix, the first principal components of a two-dimensional Principal Component Analysis (PCA) are evaluated and taken as the starting configuration for a Sammon Mapping [26], a nonlinear metric multidimensional scaling method, which is then plotted as an exchangeable PDF file.

The selected algorithm will cluster population samples/national databases based on limits given by the researcher (e.g. based on an FST/RST threshold).

Fig. S1 gives an example of an AMOVA analysis of 52 population samples from China representing 17,278 individuals in the minimal haplotype dataset.
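The idea of attaching a permutation p-value to a pairwise distance can be sketched as follows. This is a deliberately simplified stand-in, not the AMOVA of [25]: it uses a mismatch-based estimator of the form 1 − (mean within-sample difference)/(mean between-sample difference) and reshuffles haplotypes between the two samples. Function names, the default of 1000 iterations (the text above uses 10,000) and the toy data are assumptions.

```python
import random
from itertools import combinations, product


def mean_mismatch_within(pop):
    """Share of haplotype pairs within one sample that differ (needs >= 2)."""
    pairs = list(combinations(pop, 2))
    return sum(a != b for a, b in pairs) / len(pairs)


def mean_mismatch_between(pop_a, pop_b):
    """Share of cross-sample haplotype pairs that differ."""
    return sum(a != b for a, b in product(pop_a, pop_b)) / (len(pop_a) * len(pop_b))


def fst(pop_a, pop_b):
    """Simplified pairwise FST; 0.0 if the samples cannot be told apart."""
    within = (mean_mismatch_within(pop_a) + mean_mismatch_within(pop_b)) / 2
    between = mean_mismatch_between(pop_a, pop_b)
    return 0.0 if between == 0 else 1 - within / between


def permutation_p_value(pop_a, pop_b, iterations=1000, seed=0):
    """Share of random rearrangements with an FST at least as large as observed."""
    rng = random.Random(seed)
    observed = fst(pop_a, pop_b)
    pooled = list(pop_a) + list(pop_b)
    hits = 0
    for _ in range(iterations):
        rng.shuffle(pooled)
        if fst(pooled[:len(pop_a)], pooled[len(pop_a):]) >= observed:
            hits += 1
    return (hits + 1) / (iterations + 1)


# Two toy samples that share no haplotype at all.
sample_a = [(14, 13)] * 5
sample_b = [(15, 12)] * 5
observed_fst = fst(sample_a, sample_b)
p_value = permutation_p_value(sample_a, sample_b)
```

An RST-type statistic would replace the 0/1 mismatch by squared differences in repeat counts; the permutation scheme stays the same.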

4.2. Mixture analysis

We implemented the LR-based Y-STR mixture approach of Wolf et al. [27] using a recursion in the number of unknown contributors and the observed haplotype frequencies given by the YHRD. The mixture tool calculates the likelihood ratio (LR) of the donorship versus the non-donorship of the suspect to the trace, taking into account the given known contributors (if any) and the number of additional unknown contributors (if any). The tool accepts manual input and electronic files (see Section 2.2).

4.3. Kinship analysis

YHRD provides a relatively simple Kinship Index (KI) calculation for two different scenarios: father-son and two-brothers. Both use the well-known formulas [28,29] for the cases of mutation and non-mutation, with the extension that the frequencies of unobserved suspect and/or known contributor haplotypes are estimated by their augmented count. The tool accepts manual input and electronic files (see Section 2.2).
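For the father-son scenario, a likelihood-ratio reading of the kinship index is KI = Pr(son's haplotype | father's haplotype) / f(son's haplotype). The sketch below is a textbook-style simplification and not necessarily identical to the formulas of [28,29]: it assumes a symmetric single-step mutation model per locus, hypothetical mutation rates, and the augmented count for unobserved haplotypes as described above.

```python
def transmission_probability(father, son, mutation_rates):
    """Probability of the son's haplotype given the father's.

    Assumes independent loci and a symmetric single-step mutation model.
    """
    prob = 1.0
    for allele_f, allele_s, mu in zip(father, son, mutation_rates):
        step = abs(allele_f - allele_s)
        if step == 0:
            prob *= 1 - mu
        elif step == 1:
            prob *= mu / 2  # gain or loss of one repeat, equally likely
        else:
            return 0.0  # multi-step changes are ignored in this toy model
    return prob


def augmented_frequency(count, database_size):
    """Augmented count (n + 1)/(N + 1)."""
    return (count + 1) / (database_size + 1)


def father_son_ki(father, son, mutation_rates, son_count, database_size):
    """Kinship index for father-son versus unrelated men."""
    if son_count > 0:
        freq = son_count / database_size
    else:
        freq = augmented_frequency(son_count, database_size)
    return transmission_probability(father, son, mutation_rates) / freq


# Hypothetical two-locus example; the rates are made-up illustration values.
rates = (0.002, 0.003)
ki_identical = father_son_ki((14, 13), (14, 13), rates, 0, 999)
ki_one_step = father_son_ki((14, 13), (15, 13), rates, 0, 999)
```

With identical haplotypes and an unobserved profile the index is close to 1/f, while a single one-step difference reduces it by roughly a factor of μ/2.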

5. Contribution and updates

The YHRD will be updated about four times a year. Submissions using standardized file formats are received and evaluated by the YHRD and accepted after passing the internal quality assessment. A unique accession number is issued for each submitted population sample.

Required sample sizes and procedures of quality assessment are published in the updated ISFG guidelines for the publication of population data [30]. Accepted studies with an accession number are published both in the YHRD and in a peer-reviewed forensic journal. On each result page the YHRD release (date and version) and the accession date are noted. This information needs to be included in any reports.

6. Availability

Published datasets can be retrieved upon request from the YHRD, provided that an outline of the research project is submitted for publication on the website.

7. Safety, security and validation

High-resolution Y-STR profiling with the current generation of Y-STR kits may approach the identification level in some cases. To protect the privacy of submitted data, we decided to encrypt all communication with the YHRD using Transport Layer Security (TLS). Additionally, the whole system is built for anonymous usage, meaning that we do not use any external analytics or logging facility and users do not have to register or log in to use the YHRD at all.

To fulfill the requirements of a court-going forensic expert system we defined almost 230 individual test cases covering every algorithm at the YHRD. Those tests ensure that the software is working within expected specifications. For example, a random Y-STR database is created (with known frequencies) and each result of a random set of searches is checked against its expectation. If any of those tests fail, the release/update will not be published. After successful review of the failure, the release/update will be published and all dataset-specific release information will be updated automatically.
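The random-database check described above can be pictured as follows. This is a minimal sketch in Python rather than the authors' test suite; `search` is a stand-in for the system under test, and all names, sizes and allele ranges are hypothetical.

```python
import random
from collections import Counter


def make_random_database(size, n_loci=3, seed=42):
    """Create a reproducible random database of toy haplotypes."""
    rng = random.Random(seed)
    return [tuple(rng.randint(10, 12) for _ in range(n_loci))
            for _ in range(size)]


def search(database, haplotype):
    """Stand-in for the system under test: count exact matches."""
    return sum(1 for entry in database if entry == haplotype)


def run_validation(size=500, n_queries=50, seed=42):
    """Check a random set of searches against independently known counts."""
    database = make_random_database(size, seed=seed)
    expected = Counter(database)  # known frequencies of the random database
    rng = random.Random(seed + 1)
    queries = [rng.choice(database) for _ in range(n_queries)]
    queries.append((0, 0, 0))  # a haplotype that cannot occur: expect 0
    return all(search(database, q) == expected[q] for q in queries)


release_ok = run_validation()
```

In a release pipeline, a `False` result would block publication until the failure has been reviewed.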

To keep track of possible incompatibilities and/or bugs, we run a bug tracking system (https://redmine.yhrd.org) along with the YHRD. This ensures that any failed or faulty behavior of the YHRD is immediately turned into an electronic test case and is therefore tested with each release/update as described above.

As each report should be documented, we place the date of the action and the date and release of the YHRD on each final result page.

8. Summary

The Y Chromosome Haplotype Reference Database (YHRD) is a computing platform that allows acquisition, distribution, evaluation and interpretation of forensic DNA datasets. Although the YHRD is dedicated solely to Y chromosome polymorphisms (Y-STRs, Y-SNPs), its experiences and features can help other online resources in the rapidly evolving field of forensic genetics. New technologies (e.g. massively parallel sequencing results) and new mathematical tools can be integrated and linked to the datasets. Since datasets grow and tools evolve, such platforms need to be updated regularly. With the current version of the YHRD we have facilitated the process of registration, accelerated the allocation of data and reduced the response time. To achieve this we have reduced the output of a typical haplotype search. The default results page presents the frequency estimates (constant and variable estimators) for the selected dataset, without loading tables of matching and neighboring haplotypes, maps and numbers in subpopulations. However, additional features can (or must) be added, namely frequency estimates in national or metapopulation databases or contextual ancestry information. All details on the datasets (e.g. contributors) or the markers (e.g. mutation rates) can be retrieved via the link "Resources". Likewise, tools for data or match interpretation can be reached via a link where calculations for typical Y-STR applications (e.g. mixture interpretation or kinship analysis) can be performed. The reduction of output and the concentration on the most important features (frequency estimators) make the new YHRD much more operational in daily forensic practice than previous versions.

It nevertheless remains a fully annotated database where all documents on the data resources and the validation processes for submitters are accessible (via the link "Resources > Database details"). This is one important prerequisite for court use of such repositories. In view of the wide application of this database we take our responsibility to provide education seriously. The website has screened tutorials, a manual, a FAQ section and lists of all relevant publications. The biennial Y-User Workshop, established in 1996 with 9 conferences in 5 countries so far, has become the most important forum for scientific discussion on the forensic application of Y chromosome markers, which eventually led to the establishment of the new YHRD [31].

Acknowledgement

We sincerely thank all contributors who supplied the data contained in this database.

References

[1] S. Willuweit, L. Roewer, International Forensic Y Chromosome User Group, Y Chromosome haplotype reference database (YHRD): update, Forensic Sci. Int. Genet. 1 (2) (2007) 83–87.

[2] J. Purps, S. Siegert, S. Willuweit, M. Nagy, C. Alves, R. Salazar, S.M. Angustia, L.H. Santos, K. Anslinger, B. Bayer, Q. Ayub, W. Wei, Y. Xue, C. Tyler-Smith, M.B. Bafalluy, B. Martínez-Jarreta, B. Egyed, B. Balitzki, S. Tschumi, D. Ballard, D.S. Court, X. Barrantes, G. Baßler, T. Wiest, B. Berger, H. Niederstätter, W. Parson, C. Davis, B. Budowle, H. Burri, U. Borer, C. Koller, E.F. Carvalho, P.M. Domingues, W.T. Chamoun, M.D. Coble, C.R. Hill, D. Corach, M. Caputo, M.E. D’Amato, S. Davison, R. Decorte, M.H. Larmuseau, C. Ottoni, O. Rickards, D. Lu, C. Jiang, T. Dobosz, A. Jonkisz, W.E. Frank, I. Furac, C. Gehrig, V. Castella, B. Grskovic, C. Haas, J. Wobst, G. Hadzic, K. Drobnic, K. Honda, Y. Hou, D. Zhou, Y. Li, S. Hu, S. Chen, U.D. Immel, R. Lessig, Z. Jakovski, T. Ilievska, A.E. Klann, C.C. García, P. de Knijff, T. Kraaijenbrink, A. Kondili, P. Miniati, M. Vouropoulou, L. Kovacevic, D. Marjanovic, I. Lindner, I. Mansour, M. Al-Azem, A.E. Andari, M. Marino, S. Furfuro, L. Locarno, P. Martín, G.M. Luque, A. Alonso, L.S. Miranda, H. Moreira, N. Mizuno, Y. Iwashima, R.S. Neto, T.L. Nogueira, R. Silva, M. Nastainczyk-Wulf, J. Edelmann, M. Kohl, S. Nie, X. Wang, B. Cheng, C. Nunez, M.M. Pancorbo, J.K. Olofsson, N. Morling, V. Onofri, A. Tagliabracci, H. Pamjav, A. Volgyi, G. Barany, R. Pawlowski, A. Maciejewska, S. Pelotti, W. Pepinski, M. Abreu-Glowacka, C. Phillips, J. Cardenas, D. Rey-Gonzalez, A. Salas, F. Brisighelli, C. Capelli, U. Toscanini, A. Piccinini, M. Piglionica, S.L. Baldassarra, R. Ploski, M. Konarzewska, E. Jastrzebska, C. Robino, A. Sajantila, J.U. Palo, E. Guevara, J. Salvador, M.C. Ungria, J.J. Rodriguez, U. Schmidt, N. Schlauderer, P. Saukko, P.M. Schneider, M. Sirker, K.J. Shin, Y.N. Oh, I. Skitsa, A. Ampati, T.G. Smith, L.S. Calvit, V. Stenzl, T. Capal, A. Tillmar, H. Nilsson, S. Turrina, D. De Leo, A. Verzeletti, V. Cortellini, J.H. Wetton, G.M. Gwynne, M.A. Jobling, M.R. Whittle, D.R. Sumita, P. Wolanska-Nowak, R.Y. Yong, M. Krawczak, M. Nothnagel, L. Roewer, A global analysis of Y-chromosomal haplotype diversity for 23 STR loci, Forensic Sci. Int. Genet. 12 (2014) 12–23.

[3] P. Gill, C. Brenner, B. Brinkmann, B. Budowle, A. Carracedo, M.A. Jobling, P. de Knijff, M. Kayser, M. Krawczak, W.R. Mayr, N. Morling, B. Olaisen, V. Pascali, M. Prinz, L. Roewer, P.M. Schneider, A. Sajantila, C. Tyler-Smith, DNA commission of the International Society of Forensic Genetics: recommendations on forensic analysis using Y-chromosome STRs, Int. J. Legal Med. 114 (6) (2001) 305–309.

[4] C.H. Brenner, Fundamental problem of forensic mathematics – the evidential value of a rare haplotype, Forensic Sci. Int. Genet. 4 (2010) 281–291.

[5] S. Willuweit, A. Caliebe, M.M. Andersen, L. Roewer, Y-STR frequency surveying method: a critical reappraisal, Forensic Sci. Int. Genet. 5 (2) (2011) 84–90.

[6] M.M. Andersen, A. Caliebe, A. Jochens, S. Willuweit, M. Krawczak, Estimating trace-suspect match probabilities for singleton Y-STR haplotypes using coalescent theory, Forensic Sci. Int. Genet. 7 (2) (2013) 264–271.

[7] M.M. Andersen, P.S. Eriksen, N. Morling, The discrete Laplace exponential family and estimation of Y-STR haplotype frequencies, J. Theor. Biol. 329 (2013) 39–51.

[8] D. Chelimsky, et al., The RSpec Book: Behaviour Driven Development with RSpec, Cucumber, and Friends, 1st ed., Pragmatic Bookshelf, Frisco, TX, USA, 2012.

[9] J.W. Grenning, Test Driven Development for Embedded C, 1st ed., Pragmatic Bookshelf, Frisco, TX, USA, 2011.

[10] ISO/IEC 30170, Information Technology—Programming Languages—Ruby, 2012.

[11] http://rubyonrails.org/documentation/.

[12] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, 1st ed., Cambridge University Press, Cambridge, United Kingdom, 1997, pp. 70–73.

[13] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2014, http://www.R-project.org/.

[14] L. Tierney, A.J. Rossini, N. Li, H. Sevcikova, Snow: Simple Network of Workstations. R Package Version 0.3-13, 2013, http://CRAN.R-project.org/package=snow.

[15] R. Levins, Some demographic and genetic consequences of environmental heterogeneity for biological control, Bull. Entomol. Soc. Am. 15 (1969) 237–240.


[16] I. Hanski, M. Gilpin, Metapopulation Biology: Ecology, Genetics, and Evolution, Academic Press, San Diego, 1997.

[17] L. Roewer, P.J. Croucher, S. Willuweit, T.T. Lu, M. Kayser, R. Lessig, P. de Knijff, M.A. Jobling, C. Tyler-Smith, M. Krawczak, Signature of recent historical events in the European Y-chromosomal STR haplotype distribution, Hum. Genet. 116 (4) (2005) 279–291.

[18] A. Diaz-Lacava, M. Walier, S. Willuweit, T.F. Wienker, R. Fimmers, M.P. Baur, L. Roewer, Geostatistical inference of main Y-STR-haplotype groups in Europe, Forensic Sci. Int. Genet. 5 (2) (2011) 91–94.

[19] M. Kayser, O. Lao, K. Anslinger, C. Augustin, G. Bargel, J. Edelmann, S. Elias, M. Heinrich, J. Henke, L. Henke, C. Hohoff, A. Illing, A. Jonkisz, P. Kuzniar, A. Lebioda, R. Lessig, S. Lewicki, A. Maciejewska, D.M. Monies, R. Pawłowski, M. Poetsch, D. Schmid, U. Schmidt, P.M. Schneider, B. Stradmann-Bellinghausen, R. Szibor, R. Wegener, M. Wozniak, M. Zoledziewska, L. Roewer, T. Dobosz, R. Ploski, Significant genetic differentiation between Poland and Germany follows present-day political borders, as revealed by Y-chromosome analysis, Hum. Genet. 117 (5) (2005) 428–443.

[20] L. Roewer, S. Willuweit, C. Kruger, M. Nagy, S. Rychkov, I. Morozowa, O. Naumova, Y. Schneider, O. Zhukova, M. Stoneking, I. Nasidze, Analysis of Y Chromosome STR haplotypes in the European part of Russia reveals high diversities but non-significant genetic distances between populations, Int. J. Legal Med. 122 (3) (2008) 219–223.

[21] L. Roewer, M. Nothnagel, L. Gusmao, V. Gomes, M. Gonzalez, D. Corach, A. Sala, E. Alechine, T. Palha, N. Santos, A. Ribeiro-Dos-Santos, M. Geppert, S. Willuweit, M. Nagy, S. Zweynert, M. Baeta, C. Nunez, B. Martínez-Jarreta, F. Gonzalez-Andrade, E. Fagundes de Carvalho, D.A. da Silva, J.J. Builes, D. Turbon, A.M. Lopez Parra, E. Arroyo-Pardo, U. Toscanini, L. Borjas, C. Barletta, E. Ewart, S. Santos, M. Krawczak, Continent-wide decoupling of Y-chromosomal genetic variation from language and geography in native South Americans, PLoS Genet. 9 (4) (2013) e1003460.

[22] M. van Oven, A. Van Geystelen, M. Kayser, R. Decorte, M.H. Larmuseau, Seeing the wood for the trees: a minimal reference phylogeny for the human Y Chromosome, Hum. Mutat. 35 (2) (2014) 187–191.

[23] C.J. Clopper, E.S. Pearson, The use of confidence or fiducial limits illustrated in the case of the binomial, Biometrika 26 (1934) 404–413.

[24] W.J. Ewens, The sampling theory of selectively neutral alleles, Theor. Popul. Biol. 3 (1972) 87–112.

[25] L. Excoffier, P. Smouse, J. Quattro, Analysis of molecular variance inferred from metric distances among DNA haplotypes: application to human mitochondrial DNA restriction data, Genetics 131 (1992) 479–491.

[26] J.W. Sammon, A nonlinear mapping for data structure analysis, IEEE Trans. Comput. 18 (1969) 401–409.

[27] A. Wolf, A. Caliebe, O. Junge, M. Krawczak, Forensic interpretation of Y-chromosomal DNA mixtures, Forensic Sci. Int. 152 (2–3) (2005) 209–213.

[28] B. Rolf, W. Keil, B. Brinkmann, L. Roewer, R. Fimmers, Paternity testing using Y-STR haplotypes: assigning a probability for paternity in cases of mutations, Int. J. Legal Med. 115 (1) (2001) 12–15.

[29] J.S. Buckleton, C.M. Triggs, S.J. Walsh, Forensic DNA evidence interpretation, 1st ed., CRC Press, Boca Raton, 2005, pp. 388–389.

[30] A. Carracedo, J.M. Butler, L. Gusmao, A. Linacre, W. Parson, L. Roewer, P.M. Schneider, Update of the guidelines for the publication of genetic population data, Forensic Sci. Int. Genet. 10 (2014) A1–A2.

[31] W. Parson, L. Roewer, Foreword. Genetic disciplines surrounding haploid DNA markers, Forensic Sci. Int. Genet. 7 (6) (2013) 567.

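Please cite this article in press as: S. Willuweit, L. Roewer, The new Y Chromosome Haplotype Reference Database, Forensic Sci. Int. Genet. (2014), http://dx.doi.org/10.1016/j.fsigen.2014.11.024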