STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology

Yang Li, Lin Liu

Jan 29, 2014

Post on 26-Dec-2015



• Unix part slides courtesy: John Brunelle

• You can check out more details at: https://software.rc.fas.harvard.edu/training/intro_unix/latest/#(1)


Sign up on Odyssey

• Very simple: go to http://rc.fas.harvard.edu/, click on Account and Access Request Forms (in the Quick Links section at the top right of the website), then click on RC Account form and fill it in as below. We will take care of the rest!


Basic Unix Command

• Log in:
  • ssh username@odyssey.fas.harvard.edu


Basic Unix Command

• Upload or download files:
  – Upload: scp dir/yourfilename username@host:dir/targetfilename
  – Download: scp username@host:dir/yourfilename dir/targetfilename


CaSe SeNsItIvE

• In shell commands, abc will be different from ABC
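A minimal sketch of what case sensitivity means in practice. The variable names abc and ABC are only illustrations; the same rule applies to command names and, on typical Linux file systems, to file names.

```shell
# Two variables whose names differ only in case are unrelated.
abc="lowercase"
ABC="uppercase"
echo "$abc"   # prints: lowercase
echo "$ABC"   # prints: uppercase

# String comparison is case-sensitive as well.
if [ "abc" = "ABC" ]; then
    same="yes"
else
    same="no"
fi
echo "abc equals ABC? $same"   # prints: abc equals ABC? no
```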


Terminology and notation

• Folders are usually referred to as directories

• Locations in the filesystem, like /n/home00/cfest350, are called paths

• The directory and file names that make up a path are always separated by forward slashes

• The top of the hierarchy is /, i.e. the root directory
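The terms above can be tried out with a short sketch. It uses the example path from the slide purely as text (the directory does not need to exist), together with the standard dirname and basename utilities.

```shell
# Example path from the slide; dirname/basename only manipulate the text,
# so the directory does not have to exist on this machine.
p="/n/home00/cfest350"
parent=$(dirname "$p")     # everything before the last slash
name=$(basename "$p")      # the last component of the path
echo "$parent"             # prints: /n/home00
echo "$name"               # prints: cfest350

# The root directory / is the top of the hierarchy.
root=$(cd / && pwd)
echo "$root"               # prints: /
```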


Navigating the system: ls


Download and unzip files

• wget https://software.rc.fas.harvard.edu/training/examples.tar.gz

• tar xvf examples.tar.gz
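To practice the unpacking step without network access, the sketch below builds a small stand-in archive locally and extracts it again. The names demo and demo.tar.gz are made up for illustration. GNU tar detects gzip compression automatically when extracting, so the slide's tar xvf works there; the explicit z flag is shown here as the more portable spelling.

```shell
# Work in a throwaway directory so nothing else is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Build a small stand-in archive (demo.tar.gz is a made-up name).
mkdir demo
echo "hello" > demo/readme.txt
tar czf demo.tar.gz demo     # c = create, z = gzip, f = archive file name
rm -r demo

# Unpack it again: x = extract, v = list each file as it is extracted.
tar xzvf demo.tar.gz
content=$(cat "$workdir/demo/readme.txt")
echo "$content"              # prints: hello
```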


What if you get confused

• man ls
  – Use the arrow keys, page up/down keys, or the SPACE bar to navigate
  – To search for a phrase of text, for example the word time, type /time and hit ENTER
    • Hit n to go to the next occurrence
    • Hit N to go to the previous occurrence
    • Hit q to quit


Kill process

• top
• kill
• killall
• Ctrl-c
• Exercise: Run the command ~/examples/bin/ticktock, and kill it once you've had enough
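Ctrl-c stops a program running in the foreground; kill works on any process whose ID you know. A minimal sketch, assuming a plain sleep as a stand-in for ticktock (which is only available in the course examples):

```shell
# Start a long-running stand-in process in the background.
sleep 300 &
pid=$!                       # $! holds the process ID of the last background job
echo "started sleep with PID $pid"

# Send the default TERM signal, then wait so the shell collects the process.
kill "$pid"
status=0
wait "$pid" 2>/dev/null || status=$?
echo "exit status after kill: $status"   # non-zero, because a signal ended it

# kill -0 sends no signal; it only checks whether the process still exists.
if kill -0 "$pid" 2>/dev/null; then
    alive="yes"
else
    alive="no"
fi
echo "still running? $alive"             # prints: still running? no
```

In an interactive session, top shows the PIDs of running processes, and killall takes a program name instead of a PID.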


Copy files

• mkdir workshop
• cd workshop
• cp ~/examples/aaa .
• cp ~/examples/bbb ~/examples/ccc .
• cp aaa zzz
• rsync: replacement for cp, but can also be used to copy files to/from remote computers
  – e.g. rsync -avz --progress mywork username@hostname:~/mywork
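The cp commands above can be replayed in a sandbox. This sketch creates its own stand-in files named aaa, bbb and ccc in a temporary directory rather than relying on ~/examples being present.

```shell
# Throwaway sandbox; the file names mirror the slide but are created here.
sandbox=$(mktemp -d)
mkdir "$sandbox/examples"
echo "first"  > "$sandbox/examples/aaa"
echo "second" > "$sandbox/examples/bbb"
echo "third"  > "$sandbox/examples/ccc"

mkdir "$sandbox/workshop"
cd "$sandbox/workshop"

cp ../examples/aaa .                   # one file into the current directory (.)
cp ../examples/bbb ../examples/ccc .   # several files at once
cp aaa zzz                             # copy under a new name

listing=$(echo *)
echo "$listing"                        # prints: aaa bbb ccc zzz
```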


Moving and removing
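The body of this slide did not survive in the transcript. Assuming it covered the standard mv and rm commands, a minimal sketch with made-up file names looks like this:

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
echo "data" > old_name.txt
mkdir archive

mv old_name.txt new_name.txt   # rename a file
mv new_name.txt archive/       # move it into a directory
moved=$(ls archive)
echo "$moved"                  # prints: new_name.txt

rm archive/new_name.txt        # remove a file (there is no undo)
rmdir archive                  # remove the now-empty directory
remaining=$(ls | wc -l | tr -d ' ')
echo "$remaining"              # prints: 0
```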


File permissions

• The string -rw-r--r-- displays the file mode bits
  – The first character is the type (- for files, d for directories, and other letters (b, c, l, s, etc.) for special files)
  – Following that are three groups of three characters, for read, write, and execute permissions for user, group, and others


• r = 4, w = 2, x = 1, rwx = 7
• chmod 755
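A short sketch of how the octal digits map onto the mode string. The file name hello.sh is made up; each digit is the sum of r = 4, w = 2 and x = 1 for user, group and others in turn.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
echo 'echo "it runs"' > hello.sh

# 7 = rwx for the user, 5 = r-x for the group, 5 = r-x for others.
chmod 755 hello.sh
mode=$(ls -l hello.sh | cut -c1-10)
echo "$mode"                 # prints: -rwxr-xr-x

# 6 = rw- for the user, 4 = r-- for the group, 4 = r-- for others.
chmod 644 hello.sh
mode_after=$(ls -l hello.sh | cut -c1-10)
echo "$mode_after"           # prints: -rw-r--r--
```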


Hidden files
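The body of this slide did not survive in the transcript. Assuming it covered the usual convention that names starting with a dot are hidden from a plain ls, a minimal sketch with made-up file names:

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
touch visible.txt .hidden_config

plain=$(ls)          # names starting with a dot are skipped by default
echo "$plain"        # prints: visible.txt

all=$(ls -A)         # -A includes dotfiles but omits the . and .. entries
echo "$all"
```

ls -a behaves like ls -A but also lists . (the current directory) and .. (its parent).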


File manipulation

• cat ~/examples/gpl-3.0.txt
• less ~/examples/gpl-3.0.txt
• File editors: vim/emacs/nano


More shell commands


Piping your commands

• cat ~/examples/answers.out | awk '{print $3}'
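The pipe sends the output of one command to the input of the next. Since the contents of answers.out are not shown on the slide, this sketch uses three made-up lines with whitespace-separated columns as a stand-in.

```shell
# Stand-in for answers.out: three whitespace-separated columns per line.
third=$(printf 'q1 alice 42\nq2 bob 17\nq3 carol 99\n' | awk '{print $3}')
echo "$third"
# prints:
# 42
# 17
# 99

# Pipes can be chained: sort the third column numerically, keep the largest.
largest=$(printf 'q1 alice 42\nq2 bob 17\nq3 carol 99\n' | awk '{print $3}' | sort -n | tail -n 1)
echo "$largest"      # prints: 99
```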


Exercises

• List the last 5 files in /bin by combining the ls and tail commands with a pipe

• Count the number of lines that contain the word free in ~/examples/gpl-3.0.txt
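One possible approach to both exercises, as a sketch. The second part runs on a small made-up sample file instead of gpl-3.0.txt, so the count shown applies to the sample only.

```shell
# Exercise 1: the last 5 entries of a directory listing.
last5=$(ls /bin | tail -n 5)
echo "$last5"

# Exercise 2: grep -c counts matching lines, not total occurrences.
sample=$(mktemp)
printf 'free software\nnot here\nfree as in freedom, free\n' > "$sample"
count=$(grep -c free "$sample")
echo "$count"        # prints: 2
```

Note that the pattern free also matches inside longer words such as freedom; GNU and BSD grep offer -w to match whole words only.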


The shell environment

• echo $PATH
• Change $PATH:
  • PATH=$PATH:/dir/path ; export PATH
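PATH is a colon-separated list of directories that the shell searches for commands. The sketch below appends the placeholder directory from the slide (/dir/path, which need not exist) and then prints the last entry to confirm the change.

```shell
newdir=/dir/path     # placeholder directory from the slide
PATH=$PATH:$newdir   # append, so existing commands are still found first
export PATH          # make the change visible to child processes

# Split PATH on colons and show the last entry.
last=$(echo "$PATH" | tr ':' '\n' | tail -n 1)
echo "$last"         # prints: /dir/path
```

The change lasts only for the current session; to keep it, the same lines would typically go into a shell startup file such as ~/.bashrc.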


Submit a job

• bsub < yourscript.bsub
• yourscript.bsub:

#!/bin/sh
#BSUB -u linliu@harvard
#BSUB -J hello_world
#BSUB -o hello_world.out
#BSUB -e hello_world.err
#BSUB -q short_serial
python hello_world.py > hello_world.out


Load modules

• module load dir/software
  – http://oldrcwebsite.rc.fas.harvard.edu/faq/modulelist


Final tips

• Google is extremely helpful if you want to write some shell scripts


Getting started

Where the scripts/commands are executed

Where plots/help are displayed, and packages are installed.

Where the CODE is scripted

Show the variables/functions in memory

Workspace Management

• Before jumping into R, it is important to ask ourselves:
  – Where am I?
    > getwd()
  – I want to be there…
    > setwd("C://")
  – Who am I with?
    > dir() # lists all the files in the working directory
  – Who can I count on?
    > ls() # lists all the variables in the current session

Workspace Management (2)

Saving
> save(x, file="name.RData") # saves specific objects
> save.image("name.RData") # saves the whole workspace

Loading
> load("name.RData")

'?function' and '??function'
– ? gets the documentation of the function
– ?? finds functions related to the query

R Objects

• Almost all things in R are OBJECTS!
  – Functions, datasets, results, etc… (graphs NO)
• OBJECTS are classified by two criteria
  – MODE: how objects are stored in R
    • character, numeric, logical, list, function…
    • To obtain the mode of an object
      > mode(object)
  – CLASS: how objects are treated by functions
    • vector, matrix, array, data.frame, factor…
    • To obtain the class of an object
      > class(object)

R classes

character
> assembly = "hg19"
> assembly
> class(assembly)

numeric
> expression = 3.456
> expression
> class(expression)

integer
> nbases = 3000000L # L marks an integer literal; values above 2147483647 do not fit in an integer
> nbases
> class(nbases)

logical
> completed = FALSE
> completed
> class(completed)

R classes - Vector

vector
> x=c(10,5,3,6); x[3:4]; x[1]

Computations on a vector are performed on each entry of the vector
> y=c(log(x),x,x^2)

Vectors in an operation do not need to have the same length: the shorter one is recycled (with a warning if the longer length is not a multiple of the shorter)
> w=sqrt(x)+2
> z=c(pi,exp(1),sqrt(2))
> x+z

– Logical vectors
> aux=x<7

R classes - List

list
A vector of values of possibly different classes and different lengths.

Creating it.
> x1 = 1:5
> x2 = c(T,T,F,T,F)
> y = list(question.number = x1, question.answer = x2)

Accessing it.
> y; class(y)
> y$question.answer[3]; y[[2]][3]; y[["question.answer"]][3]
> y$question.number[which(y$question.answer == T)]

R classes - Matrix

matrix
> x=1:8
> dim(x)=c(2,4)
> y=matrix(1:8,2,4,byrow=F)

Operations are applied on each element
> x*x; max(x)
> x=matrix(1:28,ncol=4); y=7:10 # so then x*y is…?
> y=matrix(1:8,ncol=2)
> y%*%t(y)

R classes - Matrix

matrix
Extracting info
> y[1,]; y[,1] # first row; first column

Extending matrices
> cbind(y,seq(101,104))
> rbind(y,c(102,109))

Apply is a useful function!
> apply(y,2,mean)
> apply(y,1,log)

R classes – Data Frame

data.frame
Creating it.
> policy.number = c("A00187", "A00300", "A00467", "A01226")
> issue.age = c(74,30,68,74)
> sex = c("F", "M", "M", "F")
> smoke = c("S","N","N","N")
> face.amount = c(420, 1560, 960, 1190)
> ins.df = data.frame(policy.number, issue.age, sex, smoke, face.amount)

Accessing it.
> ins.df[1,]; ins.df[,1] # access first row, access first column
> ins.df$policy.number # access policy number column
> rownames(ins.df); colnames(ins.df)
> index.smokers = which(ins.df$smoke == "S") # row index of smokers
> ins.df[index.smokers,] # access all smokers in the df
> ins.df$policy.number[index.smokers] # policy number for smokers

R classes – Data Frame

data.frame
Manipulating it.
> new.row = data.frame(policy.number="A01495", issue.age=62, sex="M", smoke="N", face.amount=1330)
> ins.df = rbind(ins.df, new.row) # adding a data.frame row keeps the numeric columns numeric
> sort.age = sort(ins.df$issue.age, index.return=TRUE)
> ins.df = ins.df[sort.age$ix,]
> ins.df$visits = c(0,4,2,1,1)
> drops = c("sex","visits")
> ins.df[,!(names(ins.df) %in% drops)]
> ins.df[,"visits"] = c(0,4,2,1,1)
> carins.df = data.frame(policy.number = c("A01495","A00232","A00187"), car.accident = c("Y","N","N"))
> ins.merged.df = merge(ins.df, carins.df, by = "policy.number")
> Etc…

R classes - Factor

factor
Qualitative variables that can be included in models.
> smoke = c("yes","no","yes","no")
> smoke.factor = as.factor(smoke)
> smoke.factor
> class(smoke)
> class(smoke.factor)

Loops and Conditional Statements

if
Example
> a=9
> if(a<0){ print("Negative number") } else { print("Non-negative number") }

Loops and Conditional Statements

• for
> z=rep(1,10)
> for (i in 2:10) { z[i]=z[i]+exp(1)*z[i-1] }

• while
> n=0
> tmp=0
> while(tmp<100) { tmp=tmp+rbinom(1,10,0.5); n=n+1 }

Functions!

• My own functions
> function.name=function(arg1,arg2,…,argN) { Body of the function }

> fun.plot=function(y,z){
    y=log(y)*z-z^3+z^2
    plot(z,y)
  }

> z=seq(-11,10)
> y=seq(11,32)
> fun.plot(y,z)

Functions! (2)

• The '...' argument
  – Can be used to pass arguments from one function to another
    • Without the need to specify arguments in the header

fun.plot=function(y,z,...) {
  y=log(y)*z-z^3+z^2
  plot(z,y,...)
}
fun.plot(y,z,type="l",col="red")
fun.plot(y,z,type="l",col="red",lwd=4)

Handling data I/O

Reading from files to a data frame
> read.csv("filename.csv") # reads csv files into a data.frame
> read.table("filename.txt") # reads txt files in a table format to a data.frame

Writing from a data frame to a file
> write(x,filename) # writes the object x to filename
> write.table(x,filename) # writes the object x to filename in a table format

Note: keep in mind additional options such as header = TRUE, row.names = TRUE, col.names = TRUE, quote = TRUE, etc.

Plotting!

> x.data=rnorm(1000)
> y.data=x.data^3-10*x.data^2
> z.data=-0.5*y.data-90
> plot(x.data,y.data,main="Title of the graph",xlab="x label",ylab="y label")
> points(x.data,z.data,col="red")
> legend(-2,2,legend=c("Black points","Red points"),col=c("black","red"),pch=1,text.col=c("black","red"))

Plotting! (2)

You can export graphs in many formats
– To check the formats that are available in your R installation
> capabilities()

png
> png("Lab2_plot.png",width=520,height=440)
> plot(x.data,y.data,main="Title of the graph",xlab="x label",ylab="y label")
> points(x.data,z.data,col="red")
> legend(-2,2,legend=c("Black points","Red points"),col=c("black","red"),pch=1,text.col=c("black","red"))
> dev.off()

eps
> postscript("Lab2_plot.eps",width=5,height=4.4) # postscript() takes width and height in inches, not pixels

Simulation

Sampling
> sample(x,replace=TRUE) # put it back into the bag!

Distributions

Libraries!!

Collection of R functions that together perform a specialized analysis.

Install packages from CRAN
> install.packages("PackageName")

Loading libraries
> library(LibraryName)

Getting the documentation of a library
> library(help=LibraryName)

Listing all the available packages
> library()

Libraries!!

• www.bioconductor.org
  – A suite of R packages for Bioinformatics.
  – To use only Core packages
    • > source("http://bioconductor.org/biocLite.R")
    • > biocLite()
  – To use Core and Other packages
    • > source("http://bioconductor.org/biocLite.R")
    • > biocLite(c("pkg1", "pkg2",…,"pkgN"))


Exercise 1 – The empire strikes back: GOOG versus BAIDU

Plot historical stock price time series using prices from Yahoo Finance.

(a) Download and install tseries package.

(b) Include tseries package as a library in your code.

(c) Use get.hist.quote to download GOOG and BAIDU historical data.

(d) Plot both time series in the same panel and add a legend to the plot.


Exercise 2 – Challenging Challenger

On January 28, 1986, the Space Shuttle Challenger exploded in the early stages of its flight. Feynman, along with a committee, determined that the explosion was due to low temperatures and the failure of the O-rings sealing the booster rockets. The ambient temperature was 36 degrees on the morning of the launch. The scientists had data (temperature, number of failures) from previous flights.


Exercise 2 – Challenging Challenger

(a) Plot the number of failures versus the temperature for flights with one or more O-ring failures. Is there any evidence that temperature affects O-ring performance?

(b) Plot the number of failures versus temperature for all the flights. Is there any evidence that temperature affects O-ring performance?

(c) What's your conclusion? What do you think the scientists plotted before making the decision to fly that day? Just as a historical curiosity: who played a central role in discovering the causes of the failure, and how did he announce it?
