Biostatistics
Biostatistics. Statistics refers to the analysis and interpretation of data with a view toward objective evaluation of the reliability of the conclusions.

Dec 26, 2015

Isaac Holt
Transcript
Page 1: Biostatistics. Statistics refers to the analysis and interpretation of data with a view toward objective evaluation of the reliability of the conclusions.

Biostatistics

Biostatistics

• Statistics refers to the analysis and interpretation of data with a view toward objective evaluation of the reliability of the conclusions based on the data.

• Statistics applied to biological problems is called biostatistics / biometry.

Applications of Statistics in Bioinformatics

• Descriptive summaries
• Clinical diagnosis
• Equipment calibration
• Experimental data analysis
• Gene expression prediction
• Gene hunting
• Gene prediction
• Genetic linkage analysis
• Laboratory automation
• Nucleotide alignment
• Population studies
• Protein function prediction
• Protein structure prediction
• Quantifying uncertainty
• Quality control
• Sequence similarity

Basic Concepts - I

• Data
– Any kind of numbers
– Statistical analyses need numbers

• Statistics
– Concerned with collection, organization, and analysis of data
– Drawing inferences about a population when only a sample of the population is studied

• Summary
– Data are numbers, numbers contain information, statistics investigate and evaluate the nature and meaning of this information

Basic Concepts - II

• Sources of Data
– Routinely kept records
– Surveys
– Experiments / Research Studies

• Biostatistics
– Statistics applied to biological sciences and medicine
– Statistics including not only analytic techniques but also study design issues

Variables

• Variable: A characteristic that differs from one biological entity to another.

• Continuous variable: A variable for which there is a possible value between any other two possible values. E.g., height.

• Discrete variable: A variable that can take only certain values. E.g., number of leaves.

Accuracy & Precision

• Accuracy is the nearness of a measurement to the actual value of the variable being measured.

• Precision refers to the closeness to each other of repeated measurements of the same quantity.
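The distinction can be illustrated with a small Python sketch. It assumes that accuracy is summarized by the bias (mean measurement minus the true value) and precision by the sample standard deviation of the repeated measurements; the function name and the two sets of weighings are hypothetical.

```python
import statistics

def accuracy_and_precision(measurements, true_value):
    """Return (bias, spread) for repeated measurements of one quantity.

    bias   = mean measurement minus the true value
             (smaller magnitude means more accurate)
    spread = sample standard deviation of the measurements
             (smaller means more precise)
    """
    bias = statistics.mean(measurements) - true_value
    spread = statistics.stdev(measurements)
    return bias, spread

# Hypothetical repeated weighings of a 10.0 g reference mass.
balance_a = [10.1, 9.9, 10.0, 10.2, 9.8]         # centred on 10.0, more scatter
balance_b = [10.50, 10.51, 10.49, 10.50, 10.50]  # tightly clustered, but offset

bias_a, spread_a = accuracy_and_precision(balance_a, 10.0)
bias_b, spread_b = accuracy_and_precision(balance_b, 10.0)

print(f"A: bias={bias_a:+.3f}, spread={spread_a:.3f}")
print(f"B: bias={bias_b:+.3f}, spread={spread_b:.3f}")
```

In this made-up example, balance A is the more accurate instrument and balance B the more precise one.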

Frequency table & Frequency Distributions.

• Frequency table: involves a listing of all the observed values of the variable being studied and how many times each value is observed. Helps summarize large amounts of data.

• Frequency distribution: The distribution of the total number of observations among the various categories is called a frequency distribution.

• Represented graphically as a bar graph, histogram, Frequency polygons etc.
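A frequency table for a discrete variable can be sketched in a few lines of Python. The leaf counts below are invented for illustration, and `frequency_table` is a hypothetical helper name.

```python
from collections import Counter

def frequency_table(observations):
    """Return (value, count) pairs sorted by observed value."""
    counts = Counter(observations)
    return sorted(counts.items())

# Hypothetical data: number of leaves counted on ten seedlings.
leaf_counts = [3, 4, 4, 5, 3, 4, 6, 5, 4, 3]
table = frequency_table(leaf_counts)

# Print the table with a crude text bar for each value.
for value, count in table:
    print(f"{value:>5} {count:>5} {'#' * count}")
```

The counts always sum to the number of observations, which is a quick sanity check on any frequency table.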

Population and Samples

• Population: The entire collection of measurements about which one wishes to draw conclusions is the population / universe.

• Sample: A subset of the measurements in the population is called a sample.

• Random sampling: The selection of any member of the population in no way influences the selection of any other member, i.e., each member of the population has an equal and independent chance of being selected.

Randomness

• Data are inherently noisy and randomness is inherent in any sampling process.

• Every measurement system introduces noise (random variability) into the desired signal.

• The noise can be minimized by controlling the external environment or more often by reducing the bandwidth of the system using statistical techniques.

• By reducing the bandwidth of acceptable (good) data, it can be more readily differentiated from bad data and made more apparent and available.

E.g., analysis of intra-array spot fluorescence intensity can be used to control for contamination and other sources of variability.

Simple Random Sample

• Reason
– Sample a ‘small’ number of subjects from a population to make inference about the population
– Essence of statistical inference

• Definition
– A sample of size n drawn from a population of size N in such a way that every possible sample of size n has the same chance of being selected

• Sampling with and without replacement
– In biostatistics, most sampling is done without replacement
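The two sampling schemes can be sketched with Python's standard `random` module. The subject IDs, the helper names, and the fixed seed are assumptions made only so the example is reproducible.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct members: sampling WITHOUT replacement."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def sample_with_replacement(population, n, seed=None):
    """Draw n members, each draw independent: the same member may repeat."""
    rng = random.Random(seed)
    return rng.choices(population, k=n)

# Hypothetical population of N = 100 subject IDs, sample size n = 10.
subject_ids = list(range(1, 101))
without = simple_random_sample(subject_ids, 10, seed=1)
with_repl = sample_with_replacement(subject_ids, 10, seed=1)

print("without replacement:", without)
print("with replacement:   ", with_repl)
```

Without replacement, no subject can appear twice in the sample; with replacement, duplicates are possible.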

Interface Noise

• Much of bioinformatics work involves interfacing mechanical, biological and electronic systems, and each interface introduces noise and variability in the overall process.

E.g., translating analog fluorescence intensity to a digital signal introduces noise, decreases overall system dynamic range and adds non-linearities and variability to the gene expression data.

Similarly, the mechanical and optical-to-digital interfaces in a nucleotide sequencing machine contribute noise, errors and random variability to sequence data.

Descriptive Statistics Measures of Location

• Descriptive measure computed from sample data - statistic

• Descriptive measure computed from population data - parameter

• Most common measures of location
– Mean
– Median
– Mode
– Geometric Mean

Descriptive Statistics - Arithmetic mean

• Probably most common of the measures of central tendency

– a.k.a. ‘average’

• Definition

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

– Most appropriate for a normal distribution, although we tend to use it regardless of distribution

• Weakness
– Influenced by extreme values

• Translations
– Additive
– Multiplicative
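The mean, its sensitivity to extreme values, and the two translations can be sketched in Python. The data values are invented; the point is that adding a constant to every observation shifts the mean by that constant, and multiplying every observation by a constant multiplies the mean by it.

```python
def arithmetic_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)

data = [2.0, 4.0, 6.0, 8.0]
m = arithmetic_mean(data)

# Additive translation: every value + 10 -> mean + 10.
shifted = arithmetic_mean([x + 10 for x in data])

# Multiplicative translation: every value * 3 -> mean * 3.
scaled = arithmetic_mean([x * 3 for x in data])

# Weakness: a single extreme value pulls the mean strongly.
with_outlier = arithmetic_mean(data + [100.0])

print(m, shifted, scaled, with_outlier)
```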

Descriptive Statistics Median

• Frequently used if there are extreme values in a distribution or if the distribution is non-normal

• Definition
– The value that divides the ‘ordered array’ into two equal parts

• If there is an odd number of observations, the median is the ((n+1)/2)th observation
– ex.: median of 11 observations is the 6th observation

• If there is an even number of observations, the median is the midpoint between the middle two observations
– ex.: median of 12 observations is the midpoint between the 6th and 7th

• Comparison of mean and median indicates skewness of the distribution
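The odd and even cases can be written out directly. This is a minimal sketch of the rule stated on the slide (Python's `statistics.median` does the same thing); the right-skewed data set at the end is invented.

```python
def median(values):
    """Middle value of the ordered array (midpoint of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the ((n+1)/2)th observation, counting from 1.
        return ordered[mid]
    # Even n: midpoint between the (n/2)th and (n/2 + 1)th observations.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median(range(1, 12)))  # 11 observations: the 6th one
print(median(range(1, 13)))  # 12 observations: midpoint of the 6th and 7th

# Right-skewed data: the mean is pulled above the median.
skewed = [1, 2, 3, 4, 100]
mean_skewed = sum(skewed) / len(skewed)
print(median(skewed), mean_skewed)
```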

Descriptive Statistics Mode

• Not used very frequently in practice

• Definition
– Value that occurs most frequently in data set

• If all values different, no mode

• May be more than one mode
– Bimodal or multimodal
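A short sketch that follows the conventions above: it returns every value tied for the highest count, and an empty list when all values are different. The function name `modes` is a hypothetical choice.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values, or [] if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # all values different: no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([1, 2, 2, 3]))     # one mode
print(modes([1, 1, 2, 2, 3]))  # bimodal
print(modes([1, 2, 3]))        # no mode
```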

Descriptive Statistics Geometric mean

• Used to describe data with an extreme skewness to the right
– Ex., laboratory data: lipid measurements

• Definition
– Antilog of the mean of the log x_i
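The definition translates almost word for word into code. This sketch uses natural logarithms (any base gives the same result as long as the antilog uses the same base) and assumes all values are strictly positive; the sample values are invented.

```python
import math

def geometric_mean(values):
    """Antilog of the mean of the logs; defined only for positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires strictly positive values")
    log_mean = sum(math.log(v) for v in values) / len(values)
    return math.exp(log_mean)

# Hypothetical right-skewed measurements.
measurements = [1, 10, 100]
gm = geometric_mean(measurements)
am = sum(measurements) / len(measurements)
print(gm, am)
```

For right-skewed data such as these, the geometric mean sits well below the arithmetic mean, which is why it is the preferred summary.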

Descriptive Statistics Measures of Dispersion

• Dispersion of a set of observations is the variety exhibited by the observations
– If all values are the same, there is no dispersion
– The more the values are spread, the greater the dispersion

• Many distributions are well described by a measure of location and a measure of dispersion

• Common measures
– Range
– Quantiles
– Variance
– Standard deviation
– Coefficient of variation

Descriptive Statistics Range

• Range is the difference between the smallest and largest values in the data set
– Heavily influenced by the two most extreme values and ignores the rest of the distribution

Descriptive Statistics Variance

• Variance measures distribution of values around their mean

• Definition of sample variance

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

• Degrees of freedom
– n-1 used because if we know n-1 deviations, the nth deviation is known
– Deviations have to sum to zero
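A minimal sketch of the sample variance with the n-1 divisor, which also checks that the deviations from the mean sum to zero. The data values are invented for illustration.

```python
def deviations_from_mean(values):
    """Return each observation minus the sample mean."""
    mean = sum(values) / len(values)
    return [x - mean for x in values]

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    return sum(d * d for d in deviations_from_mean(values)) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]
deviations = deviations_from_mean(data)
s2 = sample_variance(data)

print("sum of deviations:", sum(deviations))
print("sample variance:  ", s2)
```

Because the deviations must sum to zero, only n-1 of them are free to vary, which is the reason for the n-1 divisor.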

Descriptive Statistics Standard Deviation

• Definition of sample standard deviation

$$s = \sqrt{s^2}$$

• Standard deviation in same units as mean
– Variance in units²

• Translations
– Additive
– Multiplicative
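The two translations behave differently for the standard deviation than for the mean: adding a constant leaves s unchanged, while multiplying by a constant c multiplies s by |c|. A short check with Python's `statistics` module, on invented data:

```python
import math
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

s = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)

# Additive translation: spread is unchanged.
s_shifted = statistics.stdev([x + 10 for x in data])

# Multiplicative translation: spread scales by the same factor.
s_scaled = statistics.stdev([3 * x for x in data])

print(s, s_shifted, s_scaled)
```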

Descriptive Statistics Coefficient of Variation

• Relative variation rather than absolute variation such as standard deviation

• Definition of C.V.

$$\text{C.V.} = \frac{s}{\bar{x}} \, (100)$$

• Useful in comparing variation between two distributions
– Used particularly in comparing laboratory measures to identify those determinations with more variation
– Also used in QC analyses for comparing observers
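Because the C.V. is a ratio, it does not depend on the units of measurement. The sketch below computes it for the same hypothetical assay expressed in two different units; the values and names are invented.

```python
import math
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("C.V. is undefined when the mean is zero")
    return statistics.stdev(values) / mean * 100

# Hypothetical assay results, and the same results in a unit 10x smaller.
assay = [98, 100, 102]
assay_rescaled = [x * 10 for x in assay]

cv = coefficient_of_variation(assay)
cv_rescaled = coefficient_of_variation(assay_rescaled)
print(cv, cv_rescaled)
```

The standard deviations differ by a factor of ten, but the two C.V. values agree, which is what makes the C.V. suitable for comparing determinations measured on different scales.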

Sampling and distributions

• Population mean and variance are estimated by sampling population data and drawing inferences from the sample data based in part on assumptions of how the data are distributed in the population.

• Distributions used in statistical analysis:

Discrete random variables: Binomial, Poisson and Hypergeometric distributions.

Continuous random variables: Normal distribution, Z distribution.

Eg: The analysis of discrete random variables, such as the position of a nucleotide on a given sequence may use techniques based on a binomial distribution and not techniques that assume a normal distribution.
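As an illustration of a binomial calculation, the sketch below gives the probability of seeing exactly k occurrences of one nucleotide in a window of n bases. It assumes, purely for illustration, that bases are independent and that the nucleotide of interest occurs with probability 0.25 at each position; neither assumption comes from the slides.

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Assumed model: a 10-base window, P(G) = 0.25 at each position.
n, p = 10, 0.25
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

for k, prob in enumerate(pmf):
    print(f"P({k} G's in {n} bases) = {prob:.4f}")

# Probability of 6 or more G's: a window this G-rich would be unusual.
p_six_or_more = sum(pmf[6:])
print(f"P(6 or more) = {p_six_or_more:.4f}")
```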

Hypothesis Testing

• Hypothesis testing deals with the null hypothesis and the alternate hypothesis.

• The null hypothesis is usually assumed to hold unless there is enough evidence to reject it.

E.g., in microarray work, a typical hypothesis is that two microarrays that have been subjected to the same spotting and hybridization process will produce identical gene expression fluorescence results. The degree to which this hypothesis is true can be estimated by examining the gene expression scatter plots created from data gleaned from each microarray and correlating the values mathematically.

Z score

• A statistic commonly used in alignment searches.

• It is a measure of the distance from the mean, measured in standard deviation units.

• If each sequence to be aligned is randomized and an optimal alignment is made, the result is a series of scores (S) for the alignment of two sequences with a mean (µ) and standard deviation (σ).

• The Z score Z = (S - µ) / σ
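The randomization procedure can be sketched as follows. Note the simplification: instead of an optimal alignment, this example uses a toy score (the number of identical positions in two equal-length, ungapped sequences) as a stand-in, so it shows only the Z-score logic, not a real alignment method. The sequences, the number of shuffles, and the seed are all invented.

```python
import random
import statistics

def match_score(a, b):
    """Toy stand-in for an alignment score: count of identical positions."""
    return sum(x == y for x, y in zip(a, b))

def shuffled_z_score(seq_a, seq_b, n_shuffles=200, seed=0):
    """Z score of the observed score against scores of randomized sequences."""
    rng = random.Random(seed)
    observed = match_score(seq_a, seq_b)

    random_scores = []
    for _ in range(n_shuffles):
        # Randomize both sequences; shuffling preserves their composition.
        a, b = list(seq_a), list(seq_b)
        rng.shuffle(a)
        rng.shuffle(b)
        random_scores.append(match_score(a, b))

    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)
    if sd == 0:
        raise ValueError("randomized scores have no spread; Z is undefined")
    return (observed - mean) / sd

seq = "ACGTACGTACGTACGTACGT"
z = shuffled_z_score(seq, seq)
print(f"Z = {z:.2f}")
```

Shuffling keeps each sequence's base composition and length fixed, which is how the Z score corrects for compositional bias.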

Z - score

• The advantage of a Z score over a simple percentage score is that it corrects for compositional biases in the sequence and accounts for varying length of sequences.

• Z scores assume a normal distribution, whereas alignment data don’t follow a normal distribution.

• As a result, a higher Z score is taken as the threshold of significance.

Graphical Methods Bar Graphs and Histogram

• Histogram graph of frequencies - special form of bar graph
– Can be used to visually compare frequencies
– Easier to assess magnitude of differences rather than trying to judge numbers

• Frequency polygon - similar to histogram

Summary

• In practice, descriptive statistics play a major role
– Always the first 1-2 tables/figures in a paper
– Statistician needs to know about each variable before deciding how to analyze to answer research questions

• In any analysis, 90% of the effort goes into setting up the data
– Descriptive statistics are part of that 90%

Distributions in Bioinformatics

• Binomial distributions are used for spotting stretches of DNA with unusual nucleotide sequences and pair-wise sequence comparisons.

• Normal distributions are used for modeling continuous random variables with applications such as the statistical significance of pairwise sequence comparison.

• Multinomial distributions are used for spotting stretches of DNA with unusual content, distinguishing tests for introns by composition and quantifying relative codon frequency.

Software

• Statistical software
– SAS
– SPSS
– Stata
– BMDP
– MINITAB
– Excel??

• Graphical software
– From list above
– Sigmaplot
– Harvard Graphics
– Axum
– PowerPoint??
– Excel??

Case Study - Microarray

• Microarrays offer an efficient method of gathering data that can be used to determine the expression patterns of tens of thousands of genes in only a few hours.

• Microarrays allow researchers to examine the mRNA from different tissues in normal and disease states to determine which genes and environmental conditions lead to disease.

Microarray analysis

• Analysis of the fluorescence data includes a check for microarray-to-microarray variability using a scatter plot.

• Gene expression levels are measured by adequately quantifying the fluorescence associated with each spot.

• The most common method of achieving this is to rely on simple descriptive statistics such as mean, mode and median.

Microarray analysis

• The total pixel intensity is the sum of all pixels corresponding to fluorescence in an area.

• The volume measure is the sum of signal intensity above background noise for each pixel.

• The role of statistical analysis in reading the intensity value associated with each spot is to control for variability. Inter- and intra-microarray comparisons are used to identify contamination and other sources of variability.
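The two spot measures can be sketched as below. The slide does not say exactly how background is handled, so this example assumes one common reading: the volume sums, over pixels brighter than a single background level, the amount by which each pixel exceeds that level. The pixel values, the background value, and the function names are all hypothetical.

```python
def total_intensity(pixels):
    """Sum of all pixel intensities in the spot area."""
    return sum(pixels)

def volume(pixels, background):
    """Sum of intensity above background, over pixels brighter than background.

    Assumption: a single scalar background level applies to the whole spot.
    """
    return sum(p - background for p in pixels if p > background)

# Hypothetical pixel intensities for one spot, with background level 15.
spot_pixels = [12, 15, 40, 55, 60, 14]
background_level = 15

total = total_intensity(spot_pixels)
vol = volume(spot_pixels, background_level)
print(total, vol)
```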

Microarray analysis

• The mean is the average pixel density over a spot, corresponding to the average fluorescence intensity. The advantage of measuring the mean intensity level is that it decreases the error due to variance in DNA deposition during microarray work.

• The mode is the most likely intensity value, represented by the highest peak in the fluorescence plot.

• The median is the mid-point in the intensity plot.

Microarray analysis

• A quick check for data validity is to create a scatter plot of fluorescence data from two identically treated microarrays.

• (Refer to Fig. 6-4, p. 226, Bioinformatics Computing book)

• The ideal condition is when gene expressions measured by the microarrays are identical, as indicated by data on the 45-degree identity (ID) line as in (A).

Microarray analysis

• If the amplitude of gene expression on one microarray is greater than the other, data fall off the ID line as in (B) and (C).

• The scatter plot also provides a measure of gene expression amplitude, in that the greater the distance from the origin, the greater the expression amplitude.

• For example, the gene plotted at position (C) has a greater expression amplitude than the gene at position (A).
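The scatter-plot check can also be expressed numerically. This sketch computes the Pearson correlation between two arrays' intensities (one possible way of "correlating the values mathematically"; the slides do not name a specific coefficient) and each gene's distance from the origin as a measure of expression amplitude. The intensity values are invented.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def distance_from_origin(x, y):
    """Expression amplitude of one gene on the scatter plot."""
    return math.hypot(x, y)

# Hypothetical intensities for four genes on two identically treated arrays.
array_1 = [100, 250, 400, 800]
array_2 = [105, 240, 410, 790]

r = pearson_r(array_1, array_2)
amplitudes = [distance_from_origin(x, y) for x, y in zip(array_1, array_2)]

print(f"r = {r:.4f}")
print("amplitudes:", [round(a, 1) for a in amplitudes])
```

A value of r close to 1 means the points lie near a straight line; whether that line is the 45-degree ID line still has to be checked on the plot itself, since a constant scaling between arrays also gives r close to 1.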