Computers and Programming for Biologists

Computers and Programming for Biologists. What is Bioinformatics? The use of information technology to collect, analyze, and interpret biological data.

Dec 19, 2015


Computers and Programming for Biologists

What is Bioinformatics?

• The use of information technology to collect, analyze, and interpret biological data.

• An ad hoc collection of computing tools that are used by molecular biologists to manage research data.
– Computational algorithms
– Database schema
– Statistical methods
– Data visualization tools

The Human Genome Project

A Genome Revolution in Biology and Medicine

We are in the midst of a "Golden Era" of biology

The Human Genome Project has produced a huge storehouse of data that will be used to change every aspect of biological research and medicine

The revolution is about treating biology as an information science, not about specific biochemical technologies.

The job of the biologist is changing

As more biological information becomes available and laboratory equipment becomes more automated ...

– The biologist will spend more time using computers and on experimental design and data analysis (and less time doing tedious lab biochemistry)

– Biology will become a more quantitative science (think how the periodic table affected chemistry)

What are the Tools?

• Alignment

• Similarity = string matching
– Pattern search
– Hash tables and substitution matrices

• Clustering

• Genome assembly and annotation

Align by hand

GATGCCATAGAGCTGTAGTCGTACCCT <—

—> CTAGAGAGC-GTAGTCAGAGTGTCTTTGAGTTCC

Somebody should make a computer program for this kind of thing…
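
As a rough illustration of what such a program does, here is a minimal sketch of global alignment by dynamic programming (the Needleman-Wunsch approach). The function name `global_align` and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for the example, not values taken from these slides.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b; return (score, aligned_a, aligned_b).

    Scoring values are illustrative assumptions, not biological defaults.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell is the best of diagonal, up, or left.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))
```

The two sequences on this slide could be passed in as `a` and `b`. Real alignment tools use substitution matrices and more refined gap penalties, so this sketch only shows the shape of the computation.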

Global vs. Local Alignments

BLAST Algorithm
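
The hash tables listed among the tools are what make a BLAST-style search fast: short words from the database sequence are indexed once, and each word of the query is then looked up directly. The sketch below shows only that exact-match seeding idea. The function names and the word size `k=3` are assumptions for illustration; BLAST itself also scores similar "neighborhood" words with a substitution matrix and extends each seed into a longer alignment, which is not shown here.

```python
from collections import defaultdict


def build_word_index(subject, k=3):
    """Map every length-k word in subject to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(subject) - k + 1):
        index[subject[pos:pos + k]].append(pos)
    return index


def find_seeds(query, index, k=3):
    """Return (query_pos, subject_pos) pairs where a length-k word matches exactly."""
    seeds = []
    for qpos in range(len(query) - k + 1):
        word = query[qpos:qpos + k]
        # .get() avoids adding empty entries to the index for unseen words.
        for spos in index.get(word, []):
            seeds.append((qpos, spos))
    return seeds
```

Building the index costs one pass over the subject. After that, each query word is a single dictionary lookup rather than a scan of the whole sequence.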

>ZFISH9:GNL-TI fi72b02.y1
          Length = 724

 Score = 307 bits (786), Expect = 8e-82
 Identities = 145/200 (72%), Positives = 166/200 (82%), Gaps = 1/200 (0%)
 Frame = +3

Query: 45  VLLKEYRVILPVSVDEYQVGQLYSVAEASKNXXXXXXXXXXXXXXPYEK-DGEKGQYTHK 103
           +L+KE+R++LPVSV+EYQVGQLYSVAEASKN              PYEK DGEKGQYTHK
Sbjct: 123 MLIKEFRIVLPVSVEEYQVGQLYSVAEASKNETGGGDGVEVLKNEPYEKEDGEKGQYTHK 302

Query: 104 IYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITNEYMKEDFLIKIETWHKPDLG 163
           IY LQSKVP+FVR+LAP  AL IHEKAWNAYPYCRTV+TNEYMK++FLI IETWHKPDLG
Sbjct: 303 IYRLQSKVPSFVRLLAPSSALIIHEKAWNAYPYCRTVLTNEYMKDNFLIMIETWHKPDLG 482

Query: 164 TQENVHKLEPEAWKHVEAVYIDIADRSQVLSKDYKAEEDPAKFKSIKTGRGPLGPNWKQE 223
            QENVH L+ E WK VE ++IDIADRSQV +KDYK +EDPA FKS KTGRGPLGP+WK+E
Sbjct: 483 EQENVHNLDSERWKQVEVIHIDIADRSQVDTKDYKPDEDPATFKSQKTGRGPLGPDWKKE 662

Query: 224 LVNQKDCPYMCAYKLVTVKF 243
           L  ++DCP+MCAYK VTV F
Sbjct: 663 LPQKRDCPHMCAYKXVTVNF 722

Clustering (Phylogenetics)

Genome Assembly

Raw Genome Data:

UCSC

The Challenge of New Data Types

• Gene expression microarrays
– thousands of genes, imprecise measurements
– huge images, private file formats

• Proteomics
– high-throughput Mass Spec
– protein chips: protein-protein interactions

• Genotyping
– thousands of alleles, thousands of individuals

cDNA spotted microarrays

High-Throughput Genotyping

Bioinformatics: Beyond Using Websites

• You can do a lot of sophisticated bioinformatics using public websites

• But at some point you may be faced with a LOT of data - thousands of searches, annotations, etc.

• The only solution is to have your own bioinformatics computer, database, and custom programs.

• This needs more processor power and more hard drive space than a typical desktop personal computer
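
A custom program for handling many sequences often starts with something as small as a FASTA reader. The sketch below is a minimal example under stated assumptions: the function names `read_fasta` and `gc_fraction` are made up for illustration, and the parser handles only the basic format (a `>` header line followed by sequence lines).

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines.

    Sequence lines that appear before the first '>' header are discarded.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gc_fraction(seq):
    """Return the fraction of G and C characters in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)
```

Because `read_fasta` accepts any iterable of lines, an open file handle can be passed directly, for example `for name, seq in read_fasta(open(path)): ...`, so a file with thousands of records is processed one record at a time.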

Bioinformatics Requires Powerful Computers

• One definition of bioinformatics is “the use of computers to analyze biological problems.”

• As biological data sets have grown larger and biological problems have become more complex, the requirements for computing power have also grown.

• Computers that can provide this power generally use the Unix operating system, so you must learn Unix to be a computational biologist

Stable and Efficient

• Unix is very stable - computers running Unix almost never crash

• Unix is very efficient
– it gets maximum number-crunching power out of your processor (and multiple processors)
– it can smoothly manage extremely large amounts of data
– it can give a new life to otherwise obsolete Macs and PCs

• Most new bioinformatics software is created for Unix - it's easy for the programmers

Open Source Bioinformatics

• Almost all of the bioinformatics software that you need to do complex analyses is free for UNIX computers

• The Open Source software ethic is very strong among biologists
– Bioinformatics.org
– Bioperl.org
– Open-bio.org

• New algorithms generally appear first as free software (a publication requirement)

Free Software

• Linux operating system, MySQL database

• Perl - programming language

• Blast and Fasta - similarity search

• Clustal - multiple alignment

• Phylip - phylogenetics

• Phred/Phrap/Consed - sequence assembly and SNP detection

• EMBOSS - a complete sequence analysis package created by the EMBL (like GCG)

Computer Hardware is not Free

• However, you can build a powerful Linux cluster for $20-50K (depending on how much power you need)

• The real cost is for a person to manage the machines, install the software, and train scientists to use it.

• Small schools can join together or affiliate with a larger neighbor.

Do Biologists have to become Programmers?

• No, but it can give you a big advantage.

• More and more of biology is becoming computer-aided design of experiments, automated equipment, and computational analysis of the results.

• “I just want to say one word to you ... Databases”

Why teach bioinformatics in undergraduate education?

Demand for trained graduates from the biomedical industry

Bioinformatics is essential to understand current developments in all fields of biology

We need to educate an entire new generation of scientists, health care workers, etc.

Use bioinformatics to enhance the teaching of other subjects: genetics, evolution, biochemistry

Genomics in Medical Education

“The explosion of information about the new genetics will create a huge problem in health education. Most physicians in practice have had not a single hour of education in genetics and are going to be severely challenged to pick up this new technology and run with it.”

Francis Collins

Becoming a Unix Power User

• Learn more Unix commands

• Use the shell to execute simple programs

• Write scripts - automate repetitive tasks

• Download and install the latest bioinformatics software

• Drive your system manager crazy… or get your own Unix machine (Linux on an Intel machine or Mac OS X)
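
As one small example of automating a repetitive task, the sketch below counts the sequence records in every FASTA file in a directory, roughly what running `grep -c '^>'` on each file would report from the shell. The function name `count_fasta_records` and the `*.fasta` file pattern are assumptions for illustration.

```python
from pathlib import Path


def count_fasta_records(directory, pattern="*.fasta"):
    """Return {file name: number of '>' header lines} for matching files."""
    counts = {}
    for path in sorted(Path(directory).glob(pattern)):
        with path.open() as handle:
            # Each FASTA record begins with exactly one header line starting with '>'.
            counts[path.name] = sum(1 for line in handle if line.startswith(">"))
    return counts
```

Calling `count_fasta_records(".")` would summarize the current directory. The same loop structure applies to any per-file task that would otherwise be repeated by hand.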

BioPerl

• Why re-invent the wheel?

• Lots of common bioinformatics tasks have already been programmed as “modules” in Perl.
– Grab sequences from GenBank, extract e-values and annotation from Blast results, etc.

• Download from www.bioperl.org
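
To give a feel for the kind of task such modules take off your hands, here is a minimal sketch that pulls the score, e-value, and identity counts out of a text-format Blast hit summary like the one shown earlier in this deck. This is plain Python with a regular expression, not BioPerl and not its API. The function name `parse_hit_stats` and the returned dictionary keys are assumptions for illustration, and the pattern covers only this one summary layout.

```python
import re

# Matches a summary such as:
#   Score = 307 bits (786), Expect = 8e-82 Identities = 145/200 (72%)
STATS = re.compile(
    r"Score\s*=\s*(?P<bits>[\d.]+)\s*bits.*?"
    r"Expect\s*=\s*(?P<evalue>[\d.eE+-]+).*?"
    r"Identities\s*=\s*(?P<ident>\d+)/(?P<length>\d+)",
    re.DOTALL,
)


def parse_hit_stats(text):
    """Return a dict of hit statistics, or None if the summary is not found."""
    m = STATS.search(text)
    if m is None:
        return None
    evalue = m.group("evalue")
    # Assumption: some Blast text reports abbreviate tiny e-values as "e-105";
    # prepend "1" so float() can parse them.
    if evalue.lower().startswith("e"):
        evalue = "1" + evalue
    return {
        "bits": float(m.group("bits")),
        "evalue": float(evalue),
        "identities": int(m.group("ident")),
        "align_length": int(m.group("length")),
    }
```

A hand-rolled pattern like this breaks as soon as the report layout changes, which is the practical argument for using a maintained parsing module instead.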

Resources

• Notes for Lincoln Stein’s course on “Genome Informatics”: http://stein.cshl.org/genome_informatics/index.html

• BioPerl.org: http://bio.perl.org/

• PERL for biologists (Kurt Stüber): http://caliban.mpiz-koeln.mpg.de/~stueber/perl/

• “Why Biologists Want to Program Computers” by James Tisdall: http://www.oreilly.com/news/perlbio_1001.html

Resources for Bio-Computing

Stuart M. Brown, [email protected]

www.med.nyu/rcr

Bioinformatics: A Biologist's Guide to Biocomputing and the Internet

Essentials of Medical Genomics
