Dec 18, 2015
“I think you should be more explicit here in step two.”
Figure omitted for copyright reasons. A printed version can be found in:
Leung YF, Lam DSC, Pang CP. The miracle of microarray data analysis. Genome Biol. 2001 Aug 29; 2: 4021.1-4021.2.
~ Normal science consists largely of mopping-up operations. Experimentalists carry out modified versions of experiments that have been carried out many times before ~
Thomas S. Kuhn
Different kinds of microarray software
Image analysis software
Data mining software
– Statistics software
• R packages for microarray analysis
SNP analysis software
Database/ LIMS software
Public expression database
Primer design
Software for further data mining: annotation, promoter analysis & pathway reconstruction
Software not discussed today
Hardware control software
– Arrayer controlling – ArrayMaker
– Scanner controlling/ Image acquisition
Statistics on current microarray software
Category               28 Feb 2002   Jan 2001
Image analysis              17           17
Data mining                 39
R packages                  14
SNP analysis                 1
Database/ LIMS              14            4
Public database             16            8
Accessory                    8            -
Further data mining          9            -
Total                      116           29
* Extracted from http://ihome.cuhk.edu.hk/~b400559/arraysoft.html
Image analysis software
Spot recognition
Segmentation
– Foreground calculation
– Background calculation
Spot quality measures
Major image analysis software
AIDA Array, ArrayPro, ArrayVision, Dapple, F-scan, GenePix Pro 3.0.5, ImaGene 4.0, Iconoclust, Iplab, Lucidea Automated Spotfinder, Phoretix Array3, P-scan, QuantArray 3.0, ScanAlyze 2, Spot, TIGR Spotfinder, UCSF Spot
Examples of commonly used image analysis software
ScanAlyze 2 (Mike Eisen, LBNL)
GenePix Pro 3.0.5 (Axon Instruments)
QuantArray 3.0 (Packard Instrument)
ImaGene 4.0 (Biodiscovery)
Spot recognition
ArrayPro from Media Cybernetics
Automated and fast grid, subgrid and spot-finding algorithms
Segmentation
Purpose – classification between foreground and background
– Fixed circle
– Adaptive circle
– Adaptive shape
– Histogram method
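As a rough illustration of the simplest method above, fixed circle segmentation can be sketched as follows. This is a minimal sketch, not the algorithm of any package named in these slides: the function names, the toy spot window, and the choice of mean foreground minus median background are all assumptions made for the example.

```python
from statistics import mean, median

def fixed_circle_segment(image, cx, cy, radius):
    """Split the pixels of a spot window into foreground and background.

    image  : list of rows of pixel intensities (hypothetical spot window)
    cx, cy : assumed spot centre (column, row), e.g. from spot recognition
    radius : circle radius in pixels, the same for every spot on the array
    """
    foreground, background = [], []
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            inside = (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2
            (foreground if inside else background).append(value)
    return foreground, background

def spot_intensity(image, cx, cy, radius):
    """Background-corrected signal (assumed rule: mean foreground - median background)."""
    fg, bg = fixed_circle_segment(image, cx, cy, radius)
    return mean(fg) - median(bg)

# Toy 5x5 window: a bright spot (100) on a flat background (10).
window = [
    [10, 10, 10, 10, 10],
    [10, 10, 100, 10, 10],
    [10, 100, 100, 100, 10],
    [10, 10, 100, 10, 10],
    [10, 10, 10, 10, 10],
]
signal = spot_intensity(window, cx=2, cy=2, radius=1)
```

Adaptive circle and adaptive shape methods differ mainly in how the `inside` test is decided for each spot; the foreground and background calculations that follow stay the same.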
Spot quality measures
E.g. QuantArray 3.0
– Diameter
– Spot Area
– Footprint
– Circularity
– Spot Signal/Noise
– Spot Uniformity
– Background Uniformity
– Replicate Uniformity
Problem: rigorous spot quality definitions and experimental verification are lacking
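To make two of the measures above concrete, here is a minimal sketch using common textbook formulas. These are assumptions for illustration only: they are not claimed to be the definitions QuantArray uses, and the function names are hypothetical.

```python
import math
from statistics import mean, stdev

def circularity(area, perimeter):
    """4*pi*area / perimeter**2: 1.0 for a perfect circle, smaller otherwise."""
    return 4 * math.pi * area / perimeter ** 2

def signal_to_noise(foreground, background):
    """(mean foreground - mean background) / sample std. dev. of background."""
    return (mean(foreground) - mean(background)) / stdev(background)

# An ideal circular spot of radius 5 pixels versus a square spot of side 1.
r = 5.0
circle_score = circularity(math.pi * r ** 2, 2 * math.pi * r)
square_score = circularity(1.0, 4.0)

# Hypothetical pixel intensities for one spot and its local background.
snr = signal_to_noise([100, 110, 90, 100], [10, 12, 8, 10])
```

Different packages may pick different formulas for the same named measure, which is one reason scores are hard to compare between programs.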
Future Image analysis software
Rigorous definition of quality measures
Extra dye for better segmentation
Automated analysis
Data mining software
Main purposes
1. Filtering and normalization
2. Statistical inference of differentially expressed genes
3. Identification of biologically meaningful patterns, i.e. expression profile; expression fingerprint/ signature
4. Visualization
5. Other analyses such as pathway reconstruction, etc.
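Purpose 1 above, filtering and normalization, can be sketched for a two-colour array as follows. This is only one simple scheme among many: the intensity threshold, the global median centring and the function names are assumptions for the example, not the method of any package listed here.

```python
import math
from statistics import median

def filter_low_intensity(red, green, min_intensity):
    """Indices of spots whose intensity reaches min_intensity in both channels."""
    return [i for i, (r, g) in enumerate(zip(red, green))
            if r >= min_intensity and g >= min_intensity]

def log_ratios(red, green):
    """log2(red / green) for each spot."""
    return [math.log2(r / g) for r, g in zip(red, green)]

def median_normalize(ratios):
    """Global normalization: shift log ratios so that their median is zero."""
    centre = median(ratios)
    return [m - centre for m in ratios]

# Hypothetical background-corrected intensities for five spots.
red = [200, 400, 800, 50, 1600]
green = [100, 200, 400, 40, 200]

kept = filter_low_intensity(red, green, min_intensity=100)
ratios = log_ratios([red[i] for i in kept], [green[i] for i in kept])
normalized = median_normalize(ratios)
```

Here the fourth spot is dropped as too dim, and the remaining log ratios are centred so that the typical spot reads as unchanged.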
Different categories
Turnkey system
Comprehensive software
Specific analysis software
Extension/ accessory of other software
Major data mining software
AIDA Array, AMADA, ANOVA program for microarray data, ArrayMiner, arraySCOUT, ArrayStat, BRB ArrayTools, CHIPSpace, Cleaver, CIT, CLUSFAVOR, Cluster, Cyber T, DNA-arrays analysis tools, dchip, Expression Profiler, Expressionist, Freeview & FreeOView, Gene Cluster, GeneLinker Gold, GeneMaths, GeneSight, GeneSpring, Genesis, Genetraffic, J-Express, MAExplorer, Partek, R cluster, Rosetta Resolver, SAM, SpotFire Decision Site, SNOMAD, TIGR ArrayViewer, TIGR Multiple Experiment Viewer, TreeView, Xcluster, Xpression NTI
Turnkey system
Definition: A computer system that has been customized for a particular application. The term derives from the idea that the end user can just turn a key and the system is ready to go.
For microarrays, this includes everything from the OS, server software, database, client software and statistics software to the hardware itself
Examples
– Genetraffic (Iobion)
• Uses open-source software: Linux, the R statistical language, PostgreSQL and the Apache web server
– Rosetta Resolver (Rosetta Biosoftware)
• Sun Fire server and drive array, Oracle 8i, Rosetta server and client-side software
Turnkey system
Advantages
– Performance
– Security
– Supports multiple users
– Incorporates the experiment and data standards in its design
Disadvantages
– Expensive
– Not suitable for small labs
– Requires dedicated support staff
– Closed system
Comprehensive software
Definition: Software that incorporates many different analyses for different stages in a single package.
Examples
– Cluster (Mike Eisen, LBNL)
– GeneMaths (Applied Maths)
– GeneSight (Biodiscovery)
– GeneSpring (Silicon Genetics)
Comprehensive software
Cluster– Filter data
– Adjust data- normalization, log transform etc
– Clustering
– Self-Organizing Maps (SOMs)
– Principal Component Analysis (PCA)
GeneSpring– & Promoter analysis
– Gene annotation with public database information
– Scripting tools
– Access Open DataBase Connectivity (ODBC) databases
Comprehensive software
GeneMaths– & Bootstrap analysis
for clustering
– Fast clustering algorithms
– Access Open DataBase Connectivity (ODBC) database
GeneSight– & confidence analysis
for replicated data
– statistical analysis for significant genes
– Graphical data set builder
Comprehensive software
Advantages
– Standardized operation
– Generate various analyses easily
– Shorter learning curve for biologists
– Script language for automated process control
– Some brilliant ideas or analyses within particular software
– “False” sense of security?
Comprehensive software
Disadvantages
– Inflexible to the latest analysis developments
– Generate various analyses too easily
– Implicit data analysis/ statistics background and definitions
– Proprietary script language
– Data compatibility with other software
– Necessity to design and maintain your own database
– Commercial software can be expensive!
– Analyses added for marketing purposes; extra spending on unnecessary functions
– Sometimes only available on a few computing platforms
Specific analysis software
Definition: Software performing a few/ one specific analysis
Examples
– GeneCluster (Whitehead Institute Center for Genome Research)
– INCLUSive - INtegrated CLustering, Upstream Sequence retrieval and motif Sampler (Katholieke Universiteit Leuven)
– SAM – Significance Analysis of Microarrays (Stanford University)
Specific analysis software
INCLUSive - INtegrated CLustering, Upstream Sequence retrieval and motif Sampler
SAM – finding statistically significant differentially expressed genes
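The kind of per-gene score behind such an analysis can be sketched as below. This is a minimal sketch of a SAM-style relative difference, a t-like statistic with a small constant `s0` added to the denominator. The function name, the example data and the value of `s0` are assumptions for illustration; the real program also tunes `s0` from the data and judges significance by permuting sample labels, which is not shown here.

```python
import math
from statistics import mean

def relative_difference(group_a, group_b, s0=0.0):
    """(mean_a - mean_b) / (s + s0), where s is the pooled standard error.

    s0 is a small positive constant that keeps genes with tiny variance
    from producing huge scores; the value used here is purely illustrative.
    """
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = mean(group_a), mean(group_b)
    sum_sq = (sum((x - mean_a) ** 2 for x in group_a)
              + sum((x - mean_b) ** 2 for x in group_b))
    s = math.sqrt((1 / n_a + 1 / n_b) * sum_sq / (n_a + n_b - 2))
    return (mean_a - mean_b) / (s + s0)

# Hypothetical log expression values of one gene in two conditions.
treated = [5.0, 6.0, 7.0]
control = [1.0, 2.0, 3.0]

d_plain = relative_difference(treated, control)            # ordinary t-like score
d_damped = relative_difference(treated, control, s0=1.0)   # score damped by s0
```

With `s0 = 0` the score reduces to an ordinary two-sample t-statistic; a positive `s0` shrinks every score, most strongly for genes with very low variance.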
Specific analysis software
Advantages
– Better statistical background reference, usually with literature support
Disadvantages
– Non-standardized environment – Java, web, Excel, etc.
– Data compatibility problem
– Data preprocessing problem
Extension/ accessory of other software
Definition: extension of other software’s capability
Examples:
– Freeview: Visualization and Optimization of Gene Clustering Dendrograms for Cluster
– ArrayMiner: extension of GeneSpring
Statistics software
Advantages
– Highly flexible
– High-level, multivariate analyses are either standard or easily programmable
Disadvantages
– Usually command-line driven, impossible to learn intuitively (a disadvantage??)
– Requires a much better understanding of statistical data analysis to follow the steps (a disadvantage??)
R-packages
A language and environment for statistical computing and graphics.
Highly compatible with S/ S-plus
Open source under the GNU General Public Licence
Runs on many UNIX/ Linux/ Windows family and MacOS platforms
A growing number of microarray analysis packages are written in R
R-packages
Dedicated for microarray analysis
– affy
– Bioconductor
– SMA extension
– Cyber T
– GeneSOM
– Permax
– OOMAL (S-Plus)
– SMA
– YASMA
General packages
– cclust
– cluster
– mclust
– multiv
– mva
– …etc!
R-packages
SMA
– Performs intensity- and spatial-dependent normalization
– Replicated array data analysis by an empirical Bayes approach
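The idea of intensity-dependent normalization can be sketched in a few lines. Note the substitution: instead of a smooth curve fit of log ratio against average log intensity, this sketch uses a much cruder binned median, so it is an illustration of the principle and not the SMA implementation. All names and the example data are hypothetical.

```python
import math
from statistics import median

def ma_values(red, green):
    """M = log2(R/G) and A = average log2 intensity for each spot."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a

def binned_median_normalize(m, a, n_bins=2):
    """Subtract from each M the median M of the spots in the same intensity bin."""
    order = sorted(range(len(a)), key=lambda i: a[i])   # spots ordered by A
    bin_size = math.ceil(len(order) / n_bins)
    normalized = [0.0] * len(m)
    for start in range(0, len(order), bin_size):
        members = order[start:start + bin_size]
        centre = median(m[i] for i in members)
        for i in members:
            normalized[i] = m[i] - centre
    return normalized

# Hypothetical array with a dye bias that grows with intensity:
# dim spots show a 2-fold red excess, bright spots a 4-fold excess.
red = [200, 400, 800, 3200, 6400, 12800]
green = [100, 200, 400, 800, 1600, 3200]

m, a = ma_values(red, green)
normalized = binned_median_normalize(m, a, n_bins=2)
```

A single global median shift would leave a residual of half a log2 unit on every spot in this example, while the intensity-dependent version removes the bias in both the dim and the bright range.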
R-packages
Bioconductor
– Open source software project to provide infrastructure in terms of design and software to assist biologists and statisticians in analysing genomic data, with primary emphasis on inference using DNA microarrays
– Most software produced by the Bioconductor project will be in the form of R libraries
• Variation 1: provide basic infrastructure support that will help other developers produce high quality software
• Variation 2: provide innovative methodology for analyzing genomic data
– Provide some form of graphical user interface for selected libraries
– A mechanism for linking together different groups with common goals
Future Data mining software
Standardized, open-source (free) platform?
– EMBOSS - European Molecular Biology Open Software Suite
More supervised analysis packages and pathway prediction packages?
Plugin modules
– J-Express
– GeneSpring
Mutation analysis software
Chip based SNP or chromosomal aberration analysis (arrayCGH)
Various forms of protocols, e.g. primer extension, ligase chain reaction, MALDI-TOF-MS, hybridization, etc.
Result is in the form of base calling or allelic imbalance
Example – Genorama
Database
Definition: large collection of data organized especially for rapid search and retrieval
Two categories
– Within laboratory/ institute database; LIMS
– Public expression database
Standardized definition of data – Minimum Information About a Microarray Experiment (MIAME)
• Experimental design
• Array design
• Samples
• Hybridizations
• Measurements
• Normalization controls
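One way an in-house database could use the six MIAME components listed above is as a completeness check before an experiment is submitted. The sketch below assumes a plain dictionary per experiment; the key names, the example record and the function are hypothetical and are not part of the MIAME document itself.

```python
# The six MIAME components, written as hypothetical record keys.
MIAME_SECTIONS = (
    "experimental_design",
    "array_design",
    "samples",
    "hybridizations",
    "measurements",
    "normalization_controls",
)

def missing_miame_sections(record):
    """Return the MIAME sections that are absent or empty in a record."""
    return [section for section in MIAME_SECTIONS if not record.get(section)]

# Hypothetical, deliberately incomplete experiment record.
record = {
    "experimental_design": "time course, 4 time points",
    "array_design": "in-house cDNA array layout file",
    "samples": ["sample A", "sample B"],
    "hybridizations": [],
    "measurements": "raw and normalized intensities",
}

missing = missing_miame_sections(record)
```

Running such a check at data entry time is cheaper than reconstructing missing annotation when the data are later deposited in a public repository.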
Database/ LIMS software
The database within your lab/ institute
The quality of in-house data management will affect the quality of the final public data repository
Database structure may be relatively simple
Major Database/ LIMS software
AMAD, ARGUS, ArrayDB, ArrayInformatics, Clonetracker, GeNet, Genetraffic, GeneX, MAD, Maxd, NOMAD, Partisan Array LIMS, Phoretix Array2 Database, Rosetta Resolver, SMD
Public Expression Database
Necessities
– Provide raw data to validate published array results and develop new analysis tools
– Further understanding of your data
– Compare among different groups, meta-data mining
– Source for specialty array design
Different categories
– Generic
– Species specific
– Disease specific
The importance of data standardization
Major public gene expression databases
3D-GeneExpression Database
ArrayExpress
BodyMap
ChipDB
ExpressDB
Gene Expression Omnibus (GEO)
Gene Expression Database (GXD)
Gene Resource Locator
GeneX
Human Gene Expression Index (HuGE Index)
RIKEN cDNA Expression Array Database (READ)
RNA Abundance Database (RAD)
Saccharomyces Genome Database (SGD)
Stanford Microarray Database (SMD)
TissueInfo
yeast Microarray Global Viewer (yMGV)
Primer/ probe design
Array Designer
GAP (Genome-wide Automated Primer finder servers)
OligoArray
Primer3
ProbeWiz Server
Other useful software for further data mining
Data annotation
– DRAGON
– Gene Ontology
– PubGene
– Resourcerer
Promoter analysis
– AlignACE
– INCLUSive
– MEME
– Sequence Logo
Pathway reconstruction
– GenMAPP
– PathFinder
Data annotation
– Link GI to a particular name
– Literature mining to infer network
Network reconstruction
– Cluster + promoter analysis
– Statistical inference from experimental data
Some suggestions for biologists who are serious about microarray study
Communicate or even collaborate with statisticians, mathematicians and bioinformaticians
Learn a high-level statistical language, e.g. R
Learn programming, e.g. C
Learn databases, e.g. SQL
Learn Linux
Revise your statistics, probability and maybe even calculus
Lucky…?!