
Practical course using the R software

Introduction to genetic data analysis in R

Thibaut Jombart

Abstract

This practical course is meant as a short introduction to genetic data analysis using R [10]. After describing the main types of data, we illustrate how to perform some basic population genetics analyses, and then go through constructing trees from genetic distances and performing standard multivariate analyses. The practical uses mostly the packages adegenet [6], ape [9] and ade4 [1, 3, 2], but others like genetics [11] and hierfstat [5] are also required.

Contents

1 Let’s start
1.1 Loading the packages
1.2 How to get information?

2 Handling the data
2.1 Data formats
2.2 Importing / exporting data
2.3 Manipulating data

3 Basic population genetics

4 Making trees
4.1 Hierarchical clustering
4.2 ape and phylogenies

5 Multivariate Analysis
5.1 ade4
5.2 Principal Component Analysis (PCA)
5.3 Principal Coordinates Analysis (PCoA)

1 Let’s start

1.1 Loading the packages

Before going further, we shall make sure that everything we need is installed on the computer. Launch R, and make sure that the version being used is at least 2.12.0 by typing:

> R.version.string

[1] "R version 2.12.0 (2010-10-15)"

The next thing to do is check that the relevant packages are installed. To load an installed package, use the library instruction; for instance:

> library(adegenet)

loads adegenet if it is installed (and issues an error otherwise). To get the version of a package, use:

> packageDescription("adegenet", fields = "Version")

[1] "1.2-9"

The adegenet version should read 1.2-8 or greater.

If a package is not installed, you can install it using install.packages. To install all the required dependencies, specify dep=TRUE. For instance, the following instruction should install adegenet with all its dependencies (it can take up to a few minutes, so don’t run it unless adegenet is not installed):

> install.packages("adegenet", dep = TRUE)

Using the previous instructions, load (and install if required) the packages adegenet, ade4, ape, genetics, and hierfstat.
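As a minimal sketch of this check (not part of the original practical), the following lines list which of the five packages are already installed. The object names pkgs, is_installed and missing_pkgs are illustrative choices; only base R functions are used.

```r
# Packages named in this practical
pkgs <- c("adegenet", "ade4", "ape", "genetics", "hierfstat")

# TRUE where the package is already in the library, FALSE otherwise
is_installed <- pkgs %in% rownames(installed.packages())
names(is_installed) <- pkgs
print(is_installed)

# Names of the packages still to be installed (empty if nothing is missing)
missing_pkgs <- pkgs[!is_installed]
# install.packages(missing_pkgs, dep = TRUE)  # uncomment to install them
```

The install.packages line is left commented out so that nothing is downloaded unless you decide to.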

1.2 How to get information?

There are several ways of getting information about R in general, or about adegenet in particular. The function help.search is used to look for help on a given topic. For instance:

> help.search("Hardy-Weinberg")

replies that there is a function HWE.test.genind in the adegenet package, and other similar functions in genetics and pegas. To get help for a given function, use ?foo where ‘foo’ is the function of interest. For instance (quotes can be removed):

> `?`(spca)

will open the manpage of the spatial principal component analysis [7]. At the end of a manpage, an ‘example’ section often shows how to use a function. This can be copied and pasted to the console, or directly executed from the console using example. For further questions concerning R, the function RSiteSearch is a powerful tool to search R’s archives (mailing lists and manpages) online using keywords.

adegenet has a few extra documentation sources. Information can be found on the website (http://adegenet.r-forge.r-project.org/), in the ‘documents’ section, including two tutorials, a manual which includes all manpages of the package, and a dedicated mailing list with searchable archives. To open the website from R, use:

> adegenetWeb()

The same can be done for tutorials, using adegenetTutorial (see the manpage to choose the tutorial to open).

You will also find a listing of the main functions of the package by typing:

> `?`(adegenet)

Note that you can also browse help pages as html pages, using:

> help.start()

To go to the adegenet page, click ‘packages’, ‘adegenet’, and ‘adegenet-package’.

Lastly, several mailing lists are available to find different kinds of information on R; to name a few:

R-help (https://stat.ethz.ch/mailman/listinfo/r-help): general questions about R

R-sig-genetics (https://stat.ethz.ch/mailman/listinfo/r-sig-genetics): genetics in R

adegenet forum (https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum): adegenet and multivariate analysis of genetic markers


2 Handling the data

2.1 Data formats

Two principal types of genetic data can be handled in R. The first one is (preferably aligned) DNA sequences, and the second one is genetic markers.

DNA sequences can be used to calibrate models of evolution and compute genetic distances, which can in turn be used for phylogenetic reconstruction or in multivariate analyses. In R, DNA sequences are best handled as DNAbin objects, in the ape package. See ?DNAbin for more details about this class. After loading ape, we load the dataset woodmouse:

> library(ape)
> data(woodmouse)
> woodmouse

15 DNA sequences in binary format stored in a matrix.

All sequences of same length: 965

Labels: No305 No304 No306 No0906S No0908S No0909S ...

Base composition:
    a     c     g     t
0.307 0.261 0.126 0.306

Use str, unclass and attributes to explore the content of the DNAbin object. Convert the dataset to a matrix of characters using as.character. What is the size of this new object (use object.size)? Compare it to the original DNAbin object. Why is this class optimal for handling DNA sequences?
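One possible way to set up this comparison is sketched below, assuming the ape package is installed. The names woodmouse_chr, size_dnabin and size_chr are illustrative; interpreting the difference is left as part of the exercise.

```r
library(ape)
data(woodmouse)

# Same sequences stored as a matrix of single-character strings
woodmouse_chr <- as.character(woodmouse)
dim(woodmouse_chr)            # individuals x sites

# Memory used by each representation, in bytes
size_dnabin <- object.size(woodmouse)
size_chr    <- object.size(woodmouse_chr)
print(size_dnabin)
print(size_chr)
```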

Genetic markers are features of DNA sequences that vary between individuals. There is not a single type of genetic marker: some are codominant (all alleles are visible), some are dominant (some alleles hide the presence of others); some have many alleles (like microsatellites), and others are most often binary (like SNPs). In general, codominant (more informative) markers are the rule, and dominant markers are becoming increasingly rare.

Both types of markers can be handled in R using the class genind implemented in adegenet. This class uses the S4 system of class definition and methods, with which R users are generally not familiar. Load the dataset nancycats:

> library(adegenet)
> data(nancycats)
> nancycats

######################## Genind object ########################

- genotypes of individuals -

S4 class: genind
@call: genind(tab = truenames(nancycats)$tab, pop = truenames(nancycats)$pop)

@tab: 237 x 108 matrix of genotypes

@ind.names: vector of 237 individual names
@loc.names: vector of 9 locus names
@loc.nall: number of alleles per locus
@loc.fac: locus factor for the 108 columns of @tab
@all.names: list of 9 components yielding allele names for each locus
@ploidy: 2
@type: codom

Optionnal contents:
@pop: factor giving the population of each individual
@pop.names: factor giving the population of each individual

@other: a list containing: xy

Usually, the slots of S4 objects can be accessed by ’@’, which replaces the $ operator used in S3 classes (e.g. in lists). For genind objects, both can be used. Documentation of the class can be accessed by typing ?’genind-class’. Using names and @ (or $), browse the content of nancycats; what information is contained in this object?

A simpler dataset can be used to understand how information is coded. Here, we create a simple dataset with two diploid individuals (in rows), and two markers (columns).

> dat <- data.frame(locus1 = c("A/T", "T/T"), locus2 = c("G/G",
+     "C/G"))
> dat

  locus1 locus2
1    A/T    G/G
2    T/T    C/G

> x <- df2genind(dat, sep = "/")
> x

######################## Genind object ########################

S4 class: genind
@call: df2genind(X = dat, sep = "/")

Optionnal contents:
@pop: - empty -
@pop.names: - empty -

@other: - empty -

Look at the @tab component of the genind object. To extract this table with original labels, use truenames. How is information stored? Does the Euclidean distance between the rows of this matrix make some biological sense? This is a canonical way of storing information for any dominant markers, irrespective of the degree of ploidy.
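To make the coding concrete without depending on a particular adegenet version, here is a hand-built sketch of such a table for the two individuals of dat. It assumes, as the practical states elsewhere, that the table holds relative allele frequencies; the matrix freq and its column names are written by hand for illustration and are not output of df2genind.

```r
# Relative allele frequencies for the two individuals of 'dat',
# typed in by hand (columns are alleles, named locus.allele)
freq <- rbind(
  "1" = c(locus1.A = 0.5, locus1.T = 0.5, locus2.C = 0.0, locus2.G = 1.0),
  "2" = c(locus1.A = 0.0, locus1.T = 1.0, locus2.C = 0.5, locus2.G = 0.5)
)

rowSums(freq)   # frequencies sum to 1 per locus, hence 2 per individual
dist(freq)      # Euclidean distance between the two genotypes
```

Each individual differs from the other by half an allele copy at four alleles, which is what the distance summarises.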

It sometimes happens that analyses are performed at a population level, rather than at an individual level. In such cases, genpop objects are used. These objects contain allele counts per population rather than relative frequencies, but are otherwise similar to genind objects. genpop objects can be obtained from genind objects using genind2genpop. Use this function to compute allele counts for the cat colonies of Nancy (data nancycats), and compare the structure of the obtained object to the genind object.

2.2 Importing / exporting data

DNA sequences with usual formats are read using read.dna from the ape package.

Genetic markers come in a wider variety of formats. import2genind can be used to import some of the most frequent, and less twisted formats. However, it is also possible to convert data from a data.frame to a genind using df2genind. Using the example above, create a data.frame with two tetraploid (i.e. 4 alleles) individuals and two loci. Note that alleles can be given any name, and do not need to be letters. Then, use df2genind to convert it to a genind object. Use truenames and genind2df to check that the data conversion has worked.
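A possible starting point for this exercise is sketched below. It assumes that df2genind accepts a ploidy argument and that genind2df accepts sep in your installed adegenet version; check ?df2genind and ?genind2df if in doubt. The genotypes in dat4 are made up for illustration.

```r
library(adegenet)

# Two tetraploid individuals (4 alleles each) typed at two loci;
# allele names are arbitrary labels
dat4 <- data.frame(locus1 = c("1/1/2/3", "2/2/2/3"),
                   locus2 = c("a/b/b/c", "a/a/a/a"))

x4 <- df2genind(dat4, sep = "/", ploidy = 4)
x4
genind2df(x4, sep = "/")   # back-conversion, to compare with dat4
```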

It also happens that genetic markers are derived from DNA sequences. In such a case, only sites that vary between individuals are retained; these are called Single Nucleotide Polymorphisms (SNPs). They can be extracted from DNAbin objects using DNAbin2genind. Use this function to extract the polymorphic sites from the woodmouse dataset. The locus names correspond to the position of the polymorphic site. Use this information to plot the distribution of polymorphism along the DNA sequence; you should arrive at something along the lines of (using density with a width of 30):

[Figure: kernel density of SNP positions, density.default(x = x, width = 30). x-axis: Position on DNA sequence; y-axis: Density of polymorphism (SNPs); tick marks along the x-axis show the individual SNP positions.]

What can we say about the distribution of SNPs on the DNA sequence?
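The plotting step can be sketched with ape alone. Note that this swaps in ape's seg.sites to locate variable sites instead of DNAbin2genind, so the positions found may differ slightly from those of the exercise (for instance in how ambiguous bases are treated); snp_pos and dens are illustrative names.

```r
library(ape)
data(woodmouse)

# Positions of sites that vary among the 15 sequences
snp_pos <- seg.sites(woodmouse)
length(snp_pos)

# Kernel density of SNP positions; 'width = 30' follows the text
dens <- density(snp_pos, width = 30)
plot(dens, xlab = "Position on DNA sequence",
     ylab = "Density of polymorphism (SNPs)")
points(snp_pos, rep(0, length(snp_pos)), pch = "|", col = "blue")
```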

2.3 Manipulating data

It is often useful to be able to manipulate genetic marker data in different ways. There is no particular manipulation for DNAbin objects, which are essentially matrices or lists. genind objects benefit from different facilities for manipulating data. First, the [ operator can be used to subset individuals and alleles; this is done like for a matrix, and is based on the @tab slot. For instance, to retain only the first 10 individuals of the nancycats dataset, use:

> nancycats[1:10, ]

######################## Genind object ########################

S4 class: genind
@call: .local(x = x, i = i, j = j, drop = drop)

Optionnal contents:
@pop: factor giving the population of each individual
@pop.names: factor giving the population of each individual

@other: a list containing: xy

Data can be split by population using seppop, and by locus using seploc. The population of each individual is obtained or set using pop. repool can be used to gather the genotypes stored in different genind objects into a single genind. Reading the documentation, interpret the following command line:

> temp <- seppop(nancycats)
> x <- repool(lapply(temp, function(e) e[sample(1:nrow(e$tab),
+     5, replace = FALSE)]))

What does the produced object contain? How can this be useful when analysing genetic data?
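The split, sample and recombine pattern behind this command can be sketched in base R on a made-up grouping factor, independently of adegenet. Here pop_toy, idx_by_pop and kept are hypothetical names, and the group sizes are invented.

```r
set.seed(1)  # only to make the random draw reproducible

# Hypothetical grouping factor: 3 populations of unequal size
pop_toy <- factor(rep(c("P01", "P02", "P03"), times = c(8, 12, 6)))

# Row indices split by population, then 5 drawn without replacement in each
idx_by_pop <- split(seq_along(pop_toy), pop_toy)
kept <- unlist(lapply(idx_by_pop, function(i) sample(i, 5, replace = FALSE)))

table(pop_toy[kept])   # 5 individuals retained per population
```

In the adegenet command, seppop plays the role of split and repool the role of unlist.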

3 Basic population genetics

This section aims at introducing some basic population genetics tools. The first one consists in testing random mating within groups by testing Hardy-Weinberg equilibrium (HWE). Using HWE.test.genind, test HWE in the cat colonies of the dataset nancycats. Since this corresponds to a lot of tests (1 per group and locus), as a first approximation, ask for returning only the p-values. Compute the number of significant departures from HWE (p.value < 0.001) per locus, and then per population (use apply), and represent the results graphically. You should arrive at (in less than 3 command lines):

[Figure: two barplots, each with y-axis "Proportion of departures from HWE". Panel "By population": colonies P01 to P17. Panel "By marker": loci fca8, fca23, fca43, fca45, fca77, fca78, fca90, fca96 and fca37.]

What are your conclusions? What can we say about the hypothesis of random mating within these cat colonies?
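The counting and plotting steps can be sketched as follows, once a matrix of p-values (populations in rows, loci in columns) is available. The matrix pval below is filled with random placeholder values, NOT real test results, purely so that the lines can be run; the row and column labels follow the figure.

```r
set.seed(42)

# Placeholder p-values (NOT real results): 17 colonies x 9 loci
pval <- matrix(runif(17 * 9)^6, nrow = 17, ncol = 9,
               dimnames = list(sprintf("P%02d", 1:17),
                               c("fca8", "fca23", "fca43", "fca45", "fca77",
                                 "fca78", "fca90", "fca96", "fca37")))

# Proportion of tests with p < 0.001, by population (rows) and by marker (columns)
by_pop <- apply(pval < 0.001, 1, mean, na.rm = TRUE)
by_loc <- apply(pval < 0.001, 2, mean, na.rm = TRUE)

par(mfrow = c(1, 2))
barplot(by_pop, main = "By population", las = 3,
        ylab = "Proportion of departures from HWE")
barplot(by_loc, main = "By marker", las = 3,
        ylab = "Proportion of departures from HWE")
```

With real data, pval would be replaced by the p-values returned by the HWE tests.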

To investigate the data further, we can measure and test the significance of the genetic differentiation of the cat colonies. Use fstat to compute the Fst of these populations. This is a measure of population differentiation, which can be roughly interpreted as the proportion of genetic variance explained by differences between groups. Values less than 0.05 are often considered as negligible.
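To give an intuition for this quantity, here is a toy calculation for one biallelic locus in two equally sized populations, using the simple heterozygosity-based definition Fst = (Ht - Hs) / Ht. This is an illustration under made-up allele frequencies, not necessarily the estimator computed by fstat.

```r
# Hypothetical frequencies of one allele in two equally sized populations
p <- c(pop1 = 0.2, pop2 = 0.8)

# Expected heterozygosity within each population, then averaged
Hs <- mean(2 * p * (1 - p))

# Expected heterozygosity in the pooled population
p_bar <- mean(p)
Ht <- 2 * p_bar * (1 - p_bar)

Fst <- (Ht - Hs) / Ht
Fst
```

Here Hs = 0.32 and Ht = 0.5, so Fst = 0.36: about a third of the expected heterozygosity is attributable to the difference between the two populations.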

Then, test the significance of the overall group differentiation using gstat.randtest, and plot the results; are there differences between groups?

4 Making trees

Genetic data are often represented using trees. In this section, we won’t tackle phylogenetic reconstruction based on maximum likelihood or parsimony, which should be covered in a practical of its own. For this, see the well-documented package phangorn. In the following, we illustrate briefly how to construct tree representations of genetic distances.

4.1 Hierarchical clustering

Several algorithms of hierarchical clustering are implemented in the function hclust. This function requires a distance matrix produced by the function dist, or a similar function (e.g. dist.dna, dist.genpop).

We will try to describe the genetic variability in the cat colonies of Nancy using clustering methods. First, we need to compute a distance matrix. Replace missing data in nancycats using na.replace, and then compute the Euclidean distances between individuals using dist on the table of relative allele frequencies ($tab of the object in which NAs have been replaced). Use hclust to compute a clustering with complete linkage, and one with single linkage. How do they compare? How would you interpret these graphs? Can you assess the number of actual genetic clusters in these data?

As a comparison, repeat these analyses with the dataset microbov, which contains genotypes of various cattle breeds in France and Africa.

[Figure: two dendrograms of the microbov individuals obtained with hclust (y-axis: Height). Panel titles: "Microbov − complete linkage" and "Microbov − single linkage".]

Species in the microbov dataset are well differentiated. How good is hierarchical clustering at displaying genetic structures?
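As a hedged illustration of the hclust workflow used in this section, the sketch below applies the same two linkage methods to the woodmouse sequences, with dist.dna from ape providing the distance matrix. It uses a different dataset from the exercise so that it only depends on ape; hc_complete and hc_single are illustrative names.

```r
library(ape)
data(woodmouse)

# Pairwise genetic distances between the 15 sequences
d <- dist.dna(woodmouse)

hc_complete <- hclust(d, method = "complete")
hc_single   <- hclust(d, method = "single")

par(mfrow = c(1, 2))
plot(hc_complete, main = "woodmouse - complete linkage", xlab = "", sub = "")
plot(hc_single,   main = "woodmouse - single linkage",   xlab = "", sub = "")
```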

4.2 ape and phylogenies

ape is the main package for phylogenetic analyses. It implements various methods for reconstructing trees based on genetic distances. It also provides nice graphical functions for plotting trees.

Use the genetic distances obtained for the nancycats data to compute a Neighbour-Joining tree (nj). Then, plot the tree, specifying that you want to plot an unrooted tree, and use tiplabels to represent colonies by coloured dots. To define one colour for each colony, use:

> myCol <- rainbow(17)[as.integer(pop(nancycats))]

Since some colours are very light, it is best to use a grey background for the plot. The final result should resemble:

[Figure: unrooted Neighbour-Joining tree of the nancycats data, with colonies shown as coloured dots at the tips.]
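The plotting steps described for the Neighbour-Joining tree can be sketched with ape on the woodmouse data. The grouping factor grp is invented for illustration (woodmouse has no colony information), and tipCol plays the role of myCol; the groups are assumed to be listed in the same order as the tips of the tree.

```r
library(ape)
data(woodmouse)

tree <- nj(dist.dna(woodmouse))

# Hypothetical grouping of the 15 tips into 3 groups, for illustration only
grp <- factor(rep(c("A", "B", "C"), each = 5))
tipCol <- rainbow(nlevels(grp))[as.integer(grp)]

par(bg = "grey")                                 # grey background
plot(tree, type = "unrooted", show.tip.label = FALSE)
tiplabels(pch = 20, col = tipCol, cex = 2)       # one coloured dot per tip
```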

For a comparison, repeat these analyses with the cattle breed data (microbov). To define one colour for each breed of the dataset, use:

> myCol <- rainbow(15)[as.integer(pop(microbov))]

You should obtain a tree like this one:

[Figure: unrooted Neighbour-Joining tree of the microbov data, with breeds shown as coloured dots at the tips.]

There are two species, Bos taurus and Bos indicus, in these cattle breeds; the species information is stored in microbov$other$spe. Use the same approach to represent the species information on the tips of the tree. Is this classification reflected by a clear-cut structure on the tree?

The quality of a tree is unfortunately rarely questioned in terms of how well the real distances are represented by the tree. The literature is even full of examples of analyses based on distances computed from trees, while the original distances that had generated the tree could have been used. A simple way of assessing how good a tree is at representing a set of distances is to compare the original distances to the reconstructed distances (sum of branch lengths between tips). The latter can be computed using cophenetic. Use this function to plot both distances (microbov data), after converting the distances into vectors using as.vector; you should obtain the following figure:

[Figure: original distances plotted against the distances reconstructed from the tree, with a red regression line.]

The red line indicates the regression of the original distances onto the reconstructed ones (function lm). Perform this regression, and use summary to compute various statistics for this model, including the R². What percentage of the variance of the original distances is represented by the tree reconstruction? Is this satisfying? Shall we always rely on a tree representation of a distance matrix?
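The comparison can be sketched on the woodmouse sequences with ape and base R only; the exercise itself uses microbov, so this is an analogue rather than the expected solution. The names d_tree, x, y and fit are illustrative.

```r
library(ape)
data(woodmouse)

d <- dist.dna(woodmouse)        # original distances
tree <- nj(d)

# Distances along the tree (sum of branch lengths between tips),
# reordered to match the labels of 'd'
d_tree <- cophenetic(tree)[labels(d), labels(d)]

x <- as.vector(as.dist(d_tree)) # reconstructed distances
y <- as.vector(d)               # original distances

fit <- lm(y ~ x)
plot(x, y, xlab = "Distances on the tree", ylab = "Original distances")
abline(fit, col = "red")
summary(fit)$r.squared
```

Reordering the cophenetic matrix by label before converting it guards against the tips being stored in a different order from the original distance matrix.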

5 Multivariate Analysis

5.1 ade4

There are a few packages implementing multivariate analyses in R. One of the most complete is ade4, with which adegenet is interfaced. It has been developed around the “duality diagram” representation [4, 2], which emphasizes that most multivariate analyses can be unified under a single general algorithm. Alternatively, some statisticians refer to this algorithm as “generalized PCA”.

In ade4, multivariate analyses all use the same underlying algorithm (as.dudi) and produce an object with the class dudi (for “duality diagram”). Methods such as Principal Component Analysis (PCA) or Principal Coordinates Analysis (PCoA) are implemented by functions where the method’s abbreviation is preceded by dudi; for instance, PCA is implemented by dudi.pca, while PCoA is implemented by dudi.pco.

Apart from a few specific methods like spatial PCA (spca, [7]) or Discriminant Analysis of Principal Components (dapc, [8]), multivariate methods are not implemented in adegenet. Procedures from ade4 are used after transforming genind or genpop objects in an appropriate way, which often consists in replacing missing data, and computing and standardizing allele frequencies before analysis.

5.2 Principal Component Analysis (PCA)

PCA can be used on standardized allele frequencies, at the individual or population level. The function scaleGen is used to standardize the data and replace possible missing data. Use it on the microbov dataset to obtain a table of centred, but not scaled, allele frequencies. Then, perform the PCA of these data using dudi.pca. Be careful to set the argument scale to FALSE to avoid erasing the scaling choice made in scaleGen.

[Figure: barplot of the eigenvalues of the PCA.]

The function dudi.pca displays a barplot of eigenvalues and asks for a number of retained principal components. In general, eigenvalues represent the amount of genetic diversity (as measured by the multivariate method being used) represented by each principal component (PC). Here, each eigenvalue is the variance of the corresponding PC.
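The statement that each eigenvalue is the variance of its principal component can be checked by hand in base R. The sketch below uses a made-up table of allele frequencies and uniform row weights 1/n, which is my understanding of the dudi.pca default; it does not call ade4, and X, Xc, eig and pc are illustrative names.

```r
set.seed(1)

# Made-up table: 20 individuals x 6 alleles, frequencies in {0, 0.5, 1}
X <- matrix(sample(c(0, 0.5, 1), 20 * 6, replace = TRUE), nrow = 20)

# Centre each column without scaling it
Xc <- scale(X, center = TRUE, scale = FALSE)

# Eigen-decomposition of the covariance matrix (uniform weights 1/n)
n <- nrow(Xc)
eig <- eigen(crossprod(Xc) / n, symmetric = TRUE)

# Principal components = projections of individuals on the eigenvectors
pc <- Xc %*% eig$vectors

# Variance (with denominator n) of each PC, next to the eigenvalues
cbind(eigenvalue = eig$values, pc_variance = colMeans(pc^2))
```

The two columns agree, and their sum equals the total variance of the centred table.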

Once you have selected the number of retained axes, your PCA should resemble:

Duality diagramm
class: pca dudi
$call: dudi.pca(df = X, center = FALSE, scale = FALSE, scannf = FALSE,
    nf = 3)

$nf: 3 axis-components saved
$rank: 341
eigen values: 1.27 0.5317 0.423 0.2853 0.2565 ...
  vector length mode    content
1 $cw    373    numeric column weights
2 $lw    704    numeric row weights
3 $eig   341    numeric eigen values

  data.frame nrow ncol content
1 $tab       704  373  modified array
2 $li        704  3    row coordinates
3 $l1        704  3    row normed scores
4 $co        373  3    column coordinates
5 $c1        373  3    column normed scores
other elements: cent norm

The printing of dudi objects is not always explicit. The essential items for the analyses we will use are:

- $eig: eigenvalues of the analysis

- $c1: principal axes, containing the loadings of the variables (alleles)

- $li: principal components, containing the scores of the individuals

The function scatter provides a simultaneous representation of individuals and variables. However, it does not provide very informative graphics when there are too many items to be plotted. To represent individuals only, use s.label, and use the argument clab to change the size of the labels; for instance:

> s.label(pcaX$li, clab = 0.75)
> add.scatter.eig(pcaX$eig, 3, 1, 2)

[Figure: s.label plot of the microbov individuals on the first two principal components (grid scale d = 1), labelled by individual name, with an inset barplot of the eigenvalues.]

This is better, but still far from perfect. In this case, it would be better to represent results by breed. This can be achieved using s.class. In addition to the coordinates (the same as in s.label), this function needs a factor grouping the observations. This can be obtained easily using pop on the object microbov. To be even fancier, you can define one color per group using rainbow (for other palettes, see ?rainbow):

> myCol <- rainbow(length(levels(pop(microbov))))

and then pass it as the col argument in s.class; the result should resemble this:

[Figure: s.class scatter plot of the PCA (d = 1), individuals grouped and colored by breed: Borgou, Zebu, Lagunaire, NDama, Somba, Aubrac, Bazadais, BlondeAquitaine, BretPieNoire, Charolais, Gascon, Limousin, MaineAnjou, Montbeliard, Salers; inset: eigenvalues]
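The grouped representation described above can be sketched on toy data. This is only an illustration: the coordinates and the factor grp below are made up and stand in for the PCA coordinates and for pop(microbov). The call falls back to a plain base plot if ade4 is not installed.

```r
# Toy stand-in for PCA coordinates: 3 groups of 20 points (hypothetical data)
set.seed(1)
grp <- factor(rep(c("A", "B", "C"), each = 20))
xy <- data.frame(x = rnorm(60, mean = 2 * as.integer(grp)), y = rnorm(60))

# One color per group, as in the text
myCol <- rainbow(nlevels(grp))

if (requireNamespace("ade4", quietly = TRUE)) {
  # s.class draws each group with its own color, star and ellipse
  ade4::s.class(xy, fac = grp, col = myCol)
} else {
  # Fallback if ade4 is not available: plain colored scatter plot
  plot(xy, col = myCol[grp], pch = 19)
}
```

With the real data, the same pattern would presumably read s.class(pcaX$li, fac = pop(microbov), col = myCol), assuming the PCA object is called pcaX.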

Represent the results for the axes 1-2, and then 2-3. How can you interpret these principal components? What is the major factor of genetic differentiation in these cattle breeds? What is the second one? What is the third one?

In PCA, the eigenvalues indicate the variance of the corresponding principal component. Verify that this is indeed the case for the first and second principal components. Note that this is also, up to a constant, the mean squared Euclidean distance between individuals. This is because (for $x \in \mathbb{R}^n$):

$$\mathrm{var}(x) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - x_j)^2}{2n(n-1)}$$

This again can be checked; assuming this analysis is called pcaX:

> pcaX$eig[1]

[1] 1.269978

> pc1 <- pcaX$li[, 1]
> n <- length(pc1)
> 0.5 * mean(dist(pc1)^2) * ((n - 1)/n)

[1] 1.269978
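The same identity can be checked without any genetic data. The sketch below uses an arbitrary random vector (an assumption made for illustration) and base R only. It separates the two conventions involved: var() divides by n - 1, whereas the check above rescales by (n - 1)/n, which corresponds to a variance computed with denominator n.

```r
set.seed(42)
x <- rnorm(25)   # any numeric vector will do
n <- length(x)

# Variance with denominator n - 1, from all pairwise squared differences
v_pairs <- sum(outer(x, x, "-")^2) / (2 * n * (n - 1))
all.equal(v_pairs, var(x))

# Variance with denominator n, obtained from the pairwise distances
v_n <- 0.5 * mean(dist(x)^2) * ((n - 1) / n)
all.equal(v_n, mean((x - mean(x))^2))
```

Both comparisons should return TRUE, up to numerical tolerance.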

We sometimes want to express eigenvalues as a percentage of the total variation in the data. What is the percentage of variance explained by the first three eigenvalues?
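As a minimal sketch of the computation, the eigenvalues below are invented for illustration and are not those of the microbov PCA. Each eigenvalue is divided by the sum of all eigenvalues, and the cumulative sum gives the share explained by the first axes.

```r
# Hypothetical eigenvalues, for illustration only
eig <- c(4, 3, 2, 1)

pct <- 100 * eig / sum(eig)   # percentage of variance per axis
cum_pct <- cumsum(pct)        # cumulative percentage

pct                           # 40 30 20 10
cum_pct[3]                    # 90: share explained by the first three axes
```

With the real analysis, eig would be replaced by pcaX$eig, assuming the PCA object is called pcaX as above.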

Allele contributions can sometimes be informative. The basic graphic for representing allele loadings is s.arrow. Use it to represent the results of the PCA; is this informative? An alternative is offered by loadingplot, which represents one axis at a time. Use it to represent the squared loadings of the analysis; how does it compare to the previous representation? The function invisibly returns some information; by adapting the arguments and saving the returned value, identify the 2% of alleles contributing most to showing the diversity within African breeds. You should find:

[1] "ETH152.197"  "INRA32.178"  "HEL13.182"   "MM12.119"    "CSRM60.093"
[6] "TGLA227.079" "TGLA126.119" "TGLA122.150"
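The idea of selecting the top 2% of contributions can be sketched in base R. Everything below is a toy stand-in: the data are random, the variable names are hypothetical, and prcomp replaces the ade4 analysis. Squared loadings on one axis sum to one, and a quantile gives the cut-off.

```r
set.seed(3)
# Toy data: 50 individuals x 100 variables standing in for alleles (hypothetical)
X <- matrix(rnorm(50 * 100), nrow = 50,
            dimnames = list(NULL, paste0("allele", 1:100)))

pca <- prcomp(X, center = TRUE, scale. = TRUE)
sq_load <- pca$rotation[, 1]^2   # squared loadings on axis 1
sum(sq_load)                     # equals 1: loadings have unit norm

# Keep variables whose squared loading exceeds the 98% quantile
thr <- quantile(sq_load, 0.98)
top <- names(sq_load)[sq_load > thr]
top
```

In the exercise itself, the equivalent cut-off would be supplied to loadingplot (see ?loadingplot for its threshold argument and the content of the returned value), on whichever axis separates the African breeds.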

5.3 Principal Coordinates Analysis (PCoA)

We have just seen that the variance is, up to a constant, equal to the mean squared distance between all pairs of observations. Therefore, maximizing the variance along an axis is exactly the same as maximizing the sum of all squared Euclidean distances along that axis. This latter maximization is the objective of Principal Coordinates Analysis (PCoA), also known as Metric Multidimensional Scaling (MDS). This method is implemented in ade4 by dudi.pco. After scaling the relative allele frequencies of the microbov dataset, perform this analysis.

Represent the principal components ($li) of this analysis and compare them to the PCs of the previous PCA. What is the correlation between them? Compare the scatter plots of both analyses; are there any differences? In practice, in which cases should we prefer PCA over PCoA, or the contrary?
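The link between the two methods can be illustrated with base R on random data. This is a sketch under stated assumptions: prcomp and cmdscale stand in for dudi.pca and dudi.pco, and the data matrix is invented. When PCoA is run on Euclidean distances between the rows of a table, its axes match the principal components of that table up to their sign.

```r
set.seed(7)
X <- scale(matrix(rnorm(40 * 6), nrow = 40))   # toy centred and scaled data

pca <- prcomp(X)                    # PCA; scores are in pca$x
pco <- cmdscale(dist(X), k = 2)     # classical MDS (PCoA) on Euclidean distances

# Axes are defined up to their sign, so compare absolute correlations
r1 <- abs(cor(pca$x[, 1], pco[, 1]))
r2 <- abs(cor(pca$x[, 2], pco[, 2]))
c(r1, r2)
```

Both correlations should be equal to 1 up to numerical precision. The two approaches differ in what they take as input: a table of variables for PCA, a distance matrix for PCoA.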

As an exercise, compute genetic similarities based on the proportion of shared alleles (propShared) on the microbov data. Transform the similarities into a distance (there is no dedicated function for this), and perform a PCoA of the resulting distance. Is this distance Euclidean? If not, it needs to be. Transform it into a Euclidean distance (using cailliez) and redo the PCoA. How do the results compare to the previous analyses?
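The steps of this exercise can be sketched in base R on a small, invented similarity matrix. Several points are assumptions made for illustration: the matrix S is hypothetical, 1 - S is only one simple way to turn similarities into distances, is_euclidean is a helper written here (not a package function), and the additive constant is obtained from cmdscale(add = TRUE) rather than from ade4.

```r
# Toy similarity matrix (hypothetical): three items sharing nothing with each
# other, and a fourth item sharing 45% with each of them
S <- matrix(0, 4, 4)
S[4, ] <- S[, 4] <- 0.45
diag(S) <- 1

d <- as.dist(1 - S)   # one simple similarity-to-distance transformation

# A distance is Euclidean if the doubly centred matrix of -d^2/2
# has no negative eigenvalue (checked up to a relative tolerance)
is_euclidean <- function(d, tol = 1e-6) {
  A <- -0.5 * as.matrix(d)^2
  n <- nrow(A)
  J <- diag(n) - matrix(1 / n, n, n)
  ev <- eigen(J %*% A %*% J, symmetric = TRUE, only.values = TRUE)$values
  min(ev) > -tol * max(abs(ev))
}

is_euclidean(d)                       # FALSE for this toy matrix

# Additive constant making the distances Euclidean
fit <- cmdscale(d, k = 2, add = TRUE)
d_euc <- d + fit$ac                   # constant added to all off-diagonal distances
is_euclidean(d_euc)                   # TRUE
```

In ade4, is.euclid and cailliez play the roles of the helper and of the additive-constant step, and dudi.pco performs the PCoA on the corrected distance.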


References

[1] D. Chessel, A.-B. Dufour, and J. Thioulouse. The ade4 package - I: One-table methods. R News, 4:5–10, 2004.

[2] S. Dray and A.-B. Dufour. The ade4 package: implementing the duality diagram for ecologists. Journal of Statistical Software, 22(4):1–20, 2007.

[3] S. Dray, A.-B. Dufour, and D. Chessel. The ade4 package - II: Two-table and K-table methods. R News, 7:47–54, 2007.

[4] Y. Escoufier. The duality diagram: a means of better practical applications. In P. Legendre and L. Legendre, editors, Development in numerical ecology, pages 139–156. NATO Advanced Institute, Serie G. Springer Verlag, Berlin, 1987.

[5] J. Goudet. HIERFSTAT, a package for R to compute and test hierarchical F-statistics. Molecular Ecology Notes, 5:184–186, 2005.

[6] T. Jombart. adegenet: a R package for the multivariate analysis of genetic markers. Bioinformatics, 24:1403–1405, 2008.

[7] T. Jombart, S. Devillard, A.-B. Dufour, and D. Pontier. Revealing cryptic spatial patterns in genetic variability by a new multivariate method. Heredity, 101:92–103, 2008.

[8] T. Jombart, S. Devillard, and F. Balloux. Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genetics, 11(1):94, 2010.

[9] E. Paradis, J. Claude, and K. Strimmer. APE: analyses of phylogenetics and evolution in R language. Bioinformatics, 20:289–290, 2004.

[10] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2009. ISBN 3-900051-07-0.

[11] G. R. Warnes. The genetics package. R News, 3(1):9–13, 2003.
