STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Yang Li Lin Liu Jan 29, 2014 1

Dec 26, 2015


Gregory Green

STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology

Yang Li, Lin Liu

Jan 29, 2014


• Unix part slides courtesy: John Brunelle

• You can check out more details at:
  – https://software.rc.fas.harvard.edu/training/intro_unix/latest/#(1)


Sign up on Odyssey

• It is very simple: go to http://rc.fas.harvard.edu/, click on Account and Access Request Forms (in the Quick Links section at the top right of the website), click on RC Account form, and fill it in as shown below. We will take care of the rest!


Basic Unix Command

• Log in:
    ssh [email protected]


Basic Unix Command

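• Upload or download files (the source comes first, the destination last, and a colon separates the remote host from the remote path):
  – Upload:
      scp dir/yourfilename username@host:dir/targetfilename
  – Download:
      scp username@host:dir/targetfilename dir/yourfilename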


CaSe SeNsItIvE

• In shell commands, abc is different from ABC


Terminology and notation

• Folders are usually referred to as directories

• Locations in the filesystem, like /n/home00/cfest350, are called paths

• The directory and file names that make up a path are always separated by forward slashes

• The top of the hierarchy is /, i.e., the root directory
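The path terminology above can be tried out directly in the shell. The sketch below takes apart the example path from the slide with the standard dirname and basename utilities; these work on the text of the path, so the directory does not need to exist on your machine. The variable names are only illustrative.

```shell
# Take apart the example path from the slide (pure string operations).
path=/n/home00/cfest350

parent=$(dirname "$path")    # everything before the last slash
name=$(basename "$path")     # the last component of the path

echo "parent directory: $parent"
echo "last component:   $name"

# The root directory / sits at the top, so it is its own parent.
root_parent=$(dirname /)
echo "parent of /:      $root_parent"
```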


Navigating the system: ls


Download and unzip files

• wget https://software.rc.fas.harvard.edu/training/examples.tar.gz

• tar xvf examples.tar.gz
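If the download is not reachable, the tar step can still be practiced offline. This is a minimal sketch that builds a small archive in a scratch directory and unpacks it again; the names demo and demo.tar.gz are made up for the illustration and are not part of the course materials.

```shell
# Work in a scratch directory so nothing in your home directory is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Build a tiny archive to practice on (a stand-in for examples.tar.gz).
mkdir demo
echo "hello" > demo/readme.txt
tar czf demo.tar.gz demo     # c = create, z = gzip, f = archive file name
rm -r demo

tar tzf demo.tar.gz          # t = list the contents without extracting
tar xzf demo.tar.gz          # x = extract into the current directory
extracted=$(cat demo/readme.txt)
echo "$extracted"
```

Adding v (as in the slide's xvf) prints each file name as it is extracted.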


What if you get confused

• man ls
  – Use the arrow keys, page up/down keys, or the SPACE bar to navigate
  – To search for a phrase of text, for example the word time, type /time and hit ENTER
    • Hit n to go to the next occurrence
    • Hit N to go to the previous occurrence
    • Hit q to quit


Kill process

• top
• kill
• killall
• Ctrl-c
• Exercise: Run the command ~/examples/bin/ticktock, and kill it once you've had enough
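As a rough illustration of kill, the sketch below starts a background process, records its process ID, and terminates it. It uses sleep as a stand-in because ticktock may not be installed on your machine; Ctrl-c does a similar job for a process running in the foreground.

```shell
# Start a long-running stand-in process in the background.
sleep 300 &
pid=$!                       # $! holds the PID of the last background job
echo "started process $pid"

kill "$pid"                  # send the default signal, SIGTERM

# Collect the exit status of the killed job.
status=0
wait "$pid" 2>/dev/null || status=$?
echo "exit status after kill: $status"   # above 128 means it was ended by a signal
```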


Copy files

• mkdir workshop
• cd workshop
• cp ~/examples/aaa .
• cp ~/examples/bbb ~/examples/ccc .
• cp aaa zzz
• rsync: a replacement for cp that can also copy files to/from remote computers
  – e.g. rsync -avz --progress mywork username@hostname:~/mywork
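The cp commands above can be rehearsed without the course files. This sketch recreates a small examples directory in a scratch area (an assumption made so the example is self-contained) and then repeats the same sequence of copies.

```shell
# Scratch area with a stand-in for ~/examples (file names follow the slide).
scratch=$(mktemp -d)
mkdir "$scratch/examples"
echo "first"  > "$scratch/examples/aaa"
echo "second" > "$scratch/examples/bbb"
echo "third"  > "$scratch/examples/ccc"

mkdir "$scratch/workshop"
cd "$scratch/workshop"

cp ../examples/aaa .                  # copy one file into the current directory (.)
cp ../examples/bbb ../examples/ccc .  # several sources, destination directory last
cp aaa zzz                            # copy a file under a new name

ls                                    # lists the four files now in workshop
copied=$(cat zzz)
```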


Moving and removing


File permissions

• The -rw-r--r-- displays the file mode bits
  – The first character is the type (- for files, d for directories, and other letters such as b, c, l, s for special files)
  – Following that are three groups of three characters, for read, write, and execute permissions for user, group, and others


• r = 4, w = 2, x = 1, rwx = 7
• chmod 755
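To see how the octal digits map onto the mode string, the following sketch sets two modes on a throwaway file and reads back the first ten characters of the ls -l output. The temporary file is only an illustration; on a real script you would run chmod 755 on the script itself.

```shell
# Create a throwaway file and inspect its mode bits.
f=$(mktemp)

chmod 644 "$f"                        # rw- r-- r--  ->  6 4 4
before=$(ls -l "$f" | cut -c1-10)

chmod 755 "$f"                        # rwx r-x r-x  ->  7 5 5
after=$(ls -l "$f" | cut -c1-10)

echo "644 gives $before"
echo "755 gives $after"
rm "$f"
```

Each digit is the sum of the permissions granted: 7 = 4 + 2 + 1 (rwx) for the user, and 5 = 4 + 1 (r-x) for group and others.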


Hidden files


File manipulation

• cat ~/examples/gpl-3.0.txt
• less ~/examples/gpl-3.0.txt
• File editors: vim/emacs/nano


More shell commands


Piping your commands

• cat ~/examples/answers.out | awk '{print $3}'
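The same pipe can be tried on inline data. The contents of answers.out are not shown in the slides, so the three sample rows below are invented for the illustration; the point is only that awk '{print $3}' keeps the third whitespace-separated column of whatever arrives on its input, and that further commands can be chained on with more pipes.

```shell
# Sample whitespace-separated rows standing in for answers.out.
third=$(printf 'q1 yes 42\nq2 no 17\nq3 yes 8\n' | awk '{print $3}')
echo "$third"

# Pipes chain: sort the third column numerically and keep the largest value.
largest=$(printf 'q1 yes 42\nq2 no 17\nq3 yes 8\n' | awk '{print $3}' | sort -n | tail -n 1)
echo "largest: $largest"
```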


Exercises

• List the last 5 files in /bin by combining the ls and tail commands with a pipe

• Count the number of lines that contain the word free in ~/examples/gpl-3.0.txt


The shell environment

• echo $PATH
• Change $PATH:
    PATH=$PATH\:/dir/path ; export PATH
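A small experiment shows what changing PATH achieves. This sketch puts a tiny script in a fresh directory, appends that directory to PATH, and then runs the script by name alone. The script name mytool_demo and its message are hypothetical, and the sketch assumes the temporary directory allows executing files.

```shell
# Put a tiny script in a fresh directory (names here are hypothetical).
tooldir=$(mktemp -d)
printf '#!/bin/sh\necho "hello from mytool_demo"\n' > "$tooldir/mytool_demo"
chmod 755 "$tooldir/mytool_demo"

# Append the directory to PATH; export makes the change visible to child processes.
PATH=$PATH:$tooldir
export PATH

found=$(command -v mytool_demo)   # where the shell now finds the command
output=$(mytool_demo)             # run it without typing the full path
echo "$found"
echo "$output"
```

The change lasts only for the current shell session; to make it permanent, the same two lines are usually added to a startup file such as ~/.bashrc.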


Submit a job

• bsub < yourscript.bsub
• yourscript.bsub:
    #!/bin/sh
    #BSUB -u linliu@harvard
    #BSUB -J hello_world
    #BSUB -o hello_world.out
    #BSUB -e hello_world.err
    #BSUB -q short_serial
    python hello_world.py > hello_world.out


Load modules

• module load dir/software
  – http://oldrcwebsite.rc.fas.harvard.edu/faq/modulelist


Final tips

• Google is extremely helpful if you want to write some shell scripts


Getting started

Where the scripts/commands are executed

Where plots and help are displayed, and packages are installed

Where the CODE is scripted

Show the variables/functions in memory


Workspace Management

• Before jumping into R, it is important to ask ourselves:
  – Where am I?
      > getwd()
  – I want to be there…
      > setwd("C://")
  – Who am I with?
      > dir()   # lists all the files in the working directory
  – Who can I count on?
      > ls()    # lists all the variables in the current session


Workspace Management (2)

Saving
  > save(x, file="name.RData")   # saves specific objects
  > save.image("name.RData")     # saves the whole workspace

Loading
  > load("name.RData")

'?function' and '??function'
  > ?    gets the documentation of the function
  > ??   finds functions related to the query


R Objects

• Almost all things in R are OBJECTS!
  – Functions, datasets, results, etc. (graphs are not)
• OBJECTS are classified by two criteria
  – MODE: How objects are stored in R
    • Character, numeric, logical, list, function…
    • To obtain the mode of an object
        > mode(object)
  – CLASS: How objects are treated by functions
    • Vector, matrix, array, data.frame, factor,…
    • To obtain the class of an object
        > class(object)


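R classes

character
  > assembly = "hg19"
  > assembly
  > class(assembly)

numeric
  > expression = 3.456
  > expression
  > class(expression)

integer
  > nbases = 3000000L   # no quotes; the L suffix makes an integer (values above 2147483647, such as 3000000000L, are stored as numeric)
  > nbases
  > class(nbases)

logical
  > completed = FALSE
  > completed
  > class(completed)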


R classes - Vector

vector
  > x = c(10,5,3,6); x[3:4]; x[1]

Computations on a vector are performed on each entry of the vector
  > y = c(log(x), x, x^2)

Vectors in an operation do not need to have the same length! (The shorter one is recycled; R warns here because 4 is not a multiple of 3.)
  > w = sqrt(x) + 2
  > z = c(pi, exp(1), sqrt(2))
  > x + z

– Logical vectors
  > aux = x < 7


R Classes - List

list
A vector of values of possibly different classes and different lengths.

Creating it.
  > x1 = 1:5
  > x2 = c(T,T,F,T,F)
  > y = list(question.number = x1, question.answer = x2)

Accessing it.
  > y; class(y)
  > y$question.answer[3]; y[[2]][3]; y[["question.answer"]][3]
  > y$question.number[which(y$question.answer == T)]


R classes - Matrix

matrix
  > x = 1:8
  > dim(x) = c(2,4)
  > y = matrix(1:8, 2, 4, byrow=F)

Operations are applied on each element
  > x*x; max(x)
  > x = matrix(1:28, ncol=4); y = 7:10   # so then x*y is…?
  > y = matrix(1:8, ncol=2)
  > y %*% t(y)


R classes - Matrix

matrix

Extracting info
  > y[1,]; y[,1]

Extending matrices
  > cbind(y, seq(101,104))
  > rbind(y, c(102,109))

Apply is a useful function!
  > apply(y, 2, mean)
  > apply(y, 1, log)


R classes – Data Frame

data.frame

Creating it.
  > policy.number = c("A00187", "A00300", "A00467", "A01226")
  > issue.age = c(74,30,68,74)
  > sex = c("F", "M", "M", "F")
  > smoke = c("S", "N", "N", "N")
  > face.amount = c(420, 1560, 960, 1190)
  > ins.df = data.frame(policy.number, issue.age, sex, smoke, face.amount)

Accessing it.
  > ins.df[1,]; ins.df[,1]                        # access first row, access first column
  > ins.df$policy.number                          # access policy number column
  > rownames(ins.df); colnames(ins.df)
  > index.smokers = which(ins.df$smoke == "S")    # row index of smokers
  > ins.df[index.smokers,]                        # access all smokers in the df (note the comma: rows, not columns)
  > ins.df$policy.number[index.smokers]           # policy number for smokers


R classes – Data Frame

data.frame

Manipulating it.
  > ins.df = rbind(ins.df, list("A01495", 62, "M", "N", 1330))   # a list keeps each column's type
  > sort.age = sort(ins.df$issue.age, index.return=T)
  > ins.df = ins.df[sort.age$ix,]
  > ins.df$visits = c(0,4,2,1,1)
  > drops = c("sex", "visits")
  > ins.df[, !(names(ins.df) %in% drops)]
  > ins.df[,"visits"] = c(0,4,2,1,1)
  > carins.df = data.frame(policy.number = c("A01495","A00232","A00187"), car.accident = c("Y","N","N"))
  > ins.merged.df = merge(ins.df, carins.df, by = "policy.number")
  > Etc…


R Classes - Factor

factor
Qualitative variables that can be included in models.
  > smoke = c("yes", "no", "yes", "no")
  > smoke.factor = as.factor(smoke)
  > smoke.factor
  > class(smoke)
  > class(smoke.factor)


Loops and Conditional Statements

if

Example
  > a = 9
  > if (a < 0) { print("Negative number") } else { print("Non-negative number") }


Loops and Conditional Statements

• for
  > z = rep(1,10)
  > for (i in 2:10) { z[i] = z[i] + exp(1)*z[i-1] }

• while
  > n = 0
  > tmp = 0
  > while (tmp < 100) { tmp = tmp + rbinom(1,10,0.5); n = n + 1 }


Functions!

• My own functions
  > function.name = function(arg1, arg2, …, argN) { Body of the function }
  > fun.plot = function(y, z) { y = log(y)*z - z^3 + z^2; plot(z, y) }
  > z = seq(-11,10)
  > y = seq(11,32)
  > fun.plot(y, z)


Functions! (2)

• The '…' argument
  – Can be used to pass arguments from one function to another
    • Without the need to specify arguments in the header

  > fun.plot = function(y, z, ...) { y = log(y)*z - z^3 + z^2; plot(z, y, ...) }
  > fun.plot(y, z, type="l", col="red")
  > fun.plot(y, z, type="l", col="red", lwd=4)


Handling data I/O

Reading from files to a data frame
  > read.csv("filename.csv")     # reads csv files into a data.frame
  > read.table("filename.txt")   # reads txt files in a table format to a data.frame

Writing from a data frame to a file
  > write(x, filename)           # writes the object x to filename
  > write.table(x, filename)     # writes the object x to filename in a table format

Note: keep in mind additional options such as header = TRUE, row.names = TRUE, col.names = TRUE, quote = TRUE, etc.


Plotting!

  > x.data = rnorm(1000)
  > y.data = x.data^3 - 10*x.data^2
  > z.data = -0.5*y.data - 90
  > plot(x.data, y.data, main="Title of the graph", xlab="x label", ylab="y label")
  > points(x.data, z.data, col="red")
  > legend(-2, 2, legend=c("Black points","Red points"), col=c("black","red"), pch=1, text.col=c("black","red"))


Plotting! (2)

You can export graphs in many formats
  – To check the formats that are available in your R installation
      > capabilities()

png
  > png("Lab2_plot.png", width=520, height=440)
  > plot(x.data, y.data, main="Title of the graph", xlab="x label", ylab="y label")
  > points(x.data, z.data, col="red")
  > legend(-2, 2, legend=c("Black points","Red points"), col=c("black","red"), pch=1, text.col=c("black","red"))
  > dev.off()

eps
  > postscript("Lab2_plot.eps", width=500, height=440)   # note: postscript() takes width and height in inches, not pixels


Simulation

Sampling
  > sample(x, replace=TRUE)   # put it back into the bag!

Distributions


Libraries!!

A collection of R functions that together perform a specialized analysis.

Install packages from CRAN
  > install.packages("PackageName")

Loading libraries
  > library(LibraryName)

Getting the documentation of a library
  > library(help=LibraryName)

Listing all the available packages
  > library()


Libraries!!

• www.bioconductor.org
  – A suite of R packages for Bioinformatics.
  – To use only Core packages
      > source("http://bioconductor.org/biocLite.R")
      > biocLite()
  – To use Core and Other packages
      > source("http://bioconductor.org/biocLite.R")
      > biocLite(c("pkg1", "pkg2", …, "pkgN"))


Exercise 1 – The empire strikes back: GOOG versus BAIDU

Plot historical stock price time series using prices from Yahoo Finance.

(a) Download and install tseries package.

(b) Include tseries package as a library in your code.

(c) Use get.hist.quote to download GOOG and BAIDU historical data.

(d) Plot both time series in the same panel and add a legend to the plot.


Exercise 2 – Challenging Challenger

On January 28, 1986, the Space Shuttle Challenger exploded in the early stages of its flight. Feynman, along with a committee, determined that the explosion was due to low temperatures and the failure of the O-ring seals on the booster rockets. The ambient temperature was 36 degrees Fahrenheit on the morning of the launch. The scientists had data (temperature, number of failures) from previous flights.


(a) Plot the number of failures versus the temperature for flights with one or more O-ring failures. Is there any evidence that temperature affects O-ring performance?

(b) Plot the number of failures versus temperature for all the flights. Is there any evidence that temperature affects O-ring performance?

(c) What's your conclusion? What do you think the scientists plotted before making the decision to fly that day? As a historical curiosity: who played a central role in discovering the causes of the failure, and how did he announce it?