Introduction to Biostatistics and Bioinformatics
Exploring Data and Descriptive Statistics
Learning Objectives
Python matplotlib library to visualize data:
• Scatter plot
• Histogram
• Kernel density estimate
• Box plots
Descriptive statistics:
• Mean and median
• Standard deviation and inter quartile range
• Central limit theorem
An Example Data Set
0.022, -0.083, 0.048, -0.010, -0.125, 0.195, -0.071, -0.147, 0.033, 0.080,
0.073, 0.016, 0.148, 0.135, 0.006, -0.089, 0.165, -0.088, -0.137, 0.094
Scatter Plot
[Figure: scatter plot of the example data set; x-axis: Order of Measurement, y-axis: Measurement]
Histogram
[Figure: scatter plot (Measurement vs. Order of Measurement) and three histograms (Number of Measurements vs. Measurement) with bin sizes 0.1, 0.05 and 0.025]
Cumulative Distributions
[Figure: scatter plot (Measurement vs. Order of Measurement) and cumulative distribution (Cumulative Frequency vs. Measurement)]
Kernel Density Estimate
[Figure: scatter plot (Measurement vs. Order of Measurement) and kernel density estimate (Number of Measurements vs. Measurement)]
Original Distribution
[Figure: scatter plot (Measurement vs. Order of Measurement) with three panels plotted against Measurement: Original Distribution, Kernel Density Estimate, and Histogram (bin size = 0.05)]
More Data
[Figure: the same panels for a larger data set: scatter plot (Measurement vs. Order of Measurement), Original Distribution, Kernel Density Estimate, and Histogram (bin size = 0.05)]
Exercise 1
Download ibb2015_7_exercise1.py
(a) Draw 20 points from a normal distribution with mean=0 and standard deviation=0.1.
import numpy as np
y = 0.1*np.random.normal(size=20)
print(y)
[-0.09946073 -0.19612617 0.03442682 0.02622746 -0.28418124 -0.04245968 0.05922837 0.01199874 0.13454915 -0.07482707 -0.11688758 0.01714036 0.03280043 0.01356022 0.09128649 -0.18923468 0.14536047 -0.07764629 -0.0349553 0.04300367]
Exercise 1
(b) Make scatter plot of the 20 points.
import matplotlib.pyplot as plt
points = len(y)  # number of data points (20)
x = range(1, points+1)
fig, (ax1) = plt.subplots(1, figsize=(6,6))
ax1.scatter(x, y, color='red', lw=0, s=40)
ax1.set_xlim([0, points+1])
ax1.set_ylim([-1, 1])
fig.savefig('ibb2015_7_exercise1_scatter_points'+str(points)+'.png', dpi=300, bbox_inches='tight')
plt.close(fig)
Exercise 1
(c) Plot histograms.
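# 20, 40 and 80 bins over [-1, 1] give bin sizes of 0.1, 0.05 and 0.025
for nbins in [20, 40, 80]:
    fig, (ax1) = plt.subplots(1, figsize=(6,6))
    # density=True replaces the normed=True argument of older matplotlib versions
    ax1.hist(y, bins=nbins, histtype='step', color='black', range=[-1,1], lw=2, density=True)
    ax1.set_xlim([-1, 1])
    fig.savefig('ibb2015_7_exercise1_bin'+str(nbins)+'_points'+str(points)+'.png', dpi=300, bbox_inches='tight')
    plt.close(fig)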
Exercise 1
(d) Plot cumulative distribution.
y_cumulative = np.linspace(0, 1, points)
x_cumulative = np.sort(y)
fig, (ax1) = plt.subplots(1, figsize=(6,6))
ax1.plot(x_cumulative, y_cumulative, color='black', lw=2)
ax1.set_xlim([-1, 1])
ax1.set_ylim([0, 1])
fig.savefig('ibb2015_7_exercise1_cumulative_points'+str(points)+'.png', dpi=300, bbox_inches='tight')
plt.close(fig)
Exercise 1
(e) Plot kernel density estimate.
import scipy.stats as stats
kde_points = 1000
kde_x = np.linspace(-1, 1, kde_points)
fig, (ax1) = plt.subplots(1, figsize=(6,6))
kde_y = stats.gaussian_kde(y)
ax1.plot(kde_x, kde_y(kde_x), color='black', lw=2)
ax1.set_xlim([-1, 1])
fig.savefig('ibb2015_7_exercise1_kde_points'+str(points)+'.png', dpi=300, bbox_inches='tight')
plt.close(fig)
Comparing Measurements
Comparing Measurements – Cumulative distributions
Systematic Shifts
Exercise 2
Download ibb2015_7_exercise2.py
(a) Generate 5 data sets with 20 data points each from normal distributions with means = 0, 0, 0.1, 0.5 and 0.3 and standard deviation=0.1.
y = []
for j in range(5):
    y.append(0.1*np.random.normal(size=20))
# shift the means of data sets 3, 4 and 5
y[2] += 0.1
y[3] += 0.5
y[4] += 0.3
print(y)
Exercise 2
(b) Make scatter plots for the 5 data sets.
sixcolors=['#D4C6DF','#8968AC','#3D6570','#91732B','#963725','#4D0132']
fig, (ax1) = plt.subplots(1, figsize=(6,6))
for j in range(5):
    ax1.scatter(np.linspace(j+1-0.2, j+1+0.2, 20), y[j], color=sixcolors[6-(j+1)], lw=0, alpha=1)
ax1.set_xlim([0, 6])
ax1.set_ylim([-1, 1])
fig.savefig('ibb2015_7_exercise2_scatter_sample'+str(20),dpi=300,bbox_inches='tight')
plt.close(fig)
Correlation Between Two Variables
Data Visualization
http://blogs.nature.com/methagora/2013/07/data-visualization-points-of-view.html
Process of Statistical Analysis
Population
Random Sample
Sample Statistics
Describe
Make Inferences
Distributions
[Figure: samples of size n=3, n=10 and n=100 drawn from four example distributions: Complex, Normal, Skewed, Long tails]
Mean
Sample: $x_1, x_2, \ldots, x_n$
Mean: $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$
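As a minimal sketch of the mean defined above, the following applies the formula to the 20 values of the example data set and compares it with NumPy's built-in function. The variable names are chosen here for illustration only.

```python
import numpy as np

# The 20 measurements from the example data set.
x = np.array([
    0.022, -0.083, 0.048, -0.010, -0.125, 0.195, -0.071, -0.147, 0.033, 0.080,
    0.073, 0.016, 0.148, 0.135, 0.006, -0.089, 0.165, -0.088, -0.137, 0.094,
])

n = len(x)
mean_by_formula = x.sum() / n   # (1/n) * sum over i of x_i
mean_numpy = np.mean(x)         # NumPy's built-in equivalent

print(n, mean_by_formula, mean_numpy)
```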
Mean - Sample Size
[Figure: sample mean (y-axis, -0.2 to 0.2) versus sample size (x-axis, 0 to 100) for the normal distribution]
Mean – Sample Size
[Figure: sample mean versus sample size (up to 100) for the Complex, Normal, Skewed and Long tails distributions]
Mode, Maximum and Minimum
Sample: $x_1, x_2, \ldots, x_n$
Maximum: $\max(x_1, x_2, \ldots, x_n)$
Minimum: $\min(x_1, x_2, \ldots, x_n)$
Mode: the most common value
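A short sketch of these three quantities in Python. The maximum and minimum use the example data set; because all 20 of those values are distinct, the mode is illustrated on a small list of counts that is purely hypothetical.

```python
from collections import Counter

import numpy as np

# The 20 measurements from the example data set.
x = np.array([
    0.022, -0.083, 0.048, -0.010, -0.125, 0.195, -0.071, -0.147, 0.033, 0.080,
    0.073, 0.016, 0.148, 0.135, 0.006, -0.089, 0.165, -0.088, -0.137, 0.094,
])
x_max = np.max(x)
x_min = np.min(x)

# Hypothetical discrete data (not from the lecture), used only to show the mode.
counts = [2, 3, 3, 5, 3, 2, 7]
mode_value, mode_count = Counter(counts).most_common(1)[0]

print(x_max, x_min, mode_value, mode_count)
```

For continuous measurements the mode is usually read off a histogram or kernel density estimate instead, since exact repeats are rare.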
Median, Quartiles and Percentiles
Sample: $x_1, x_2, \ldots, x_n$
Quartiles:
$x_i < Q_1$ for 25% of the sample
$x_i < Q_2$ for 50% of the sample (median)
$x_i < Q_3$ for 75% of the sample
Percentiles:
$x_i < P_m$ for m% of the sample
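The quartiles and percentiles above can be sketched with `np.percentile`, here applied to the example data set. Note that NumPy interpolates linearly between neighboring data points by default, which is one of several common conventions and is an assumption of this sketch, not something the slides specify.

```python
import numpy as np

# The 20 measurements from the example data set.
x = np.array([
    0.022, -0.083, 0.048, -0.010, -0.125, 0.195, -0.071, -0.147, 0.033, 0.080,
    0.073, 0.016, 0.148, 0.135, 0.006, -0.089, 0.165, -0.088, -0.137, 0.094,
])

q1, q2, q3 = np.percentile(x, [25, 50, 75])  # quartiles
p90 = np.percentile(x, 90)                   # 90th percentile

# Fraction of the sample lying below each quartile.
frac_below = [np.mean(x < q) for q in (q1, q2, q3)]

print(q1, q2, q3, p90, frac_below)
```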
Median and Mean – Sample Size
[Figure: mean and median (gray) versus sample size (up to 100) for the Complex, Normal, Skewed and Long tails distributions]
Variance
Sample: $x_1, x_2, \ldots, x_n$
Mean: $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$
Variance: $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2$
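A minimal sketch of the variance formula, again on the example data set. `np.var` and `np.std` divide by n by default (`ddof=0`), matching the formula above; passing `ddof=1` divides by n-1 instead, which is the usual unbiased sample estimate.

```python
import numpy as np

# The 20 measurements from the example data set.
x = np.array([
    0.022, -0.083, 0.048, -0.010, -0.125, 0.195, -0.071, -0.147, 0.033, 0.080,
    0.073, 0.016, 0.148, 0.135, 0.006, -0.089, 0.165, -0.088, -0.137, 0.094,
])

n = len(x)
mu = x.sum() / n
var_by_formula = np.sum((x - mu) ** 2) / n   # (1/n) * sum of squared deviations
var_numpy = np.var(x)                        # same thing (ddof=0 is the default)
var_unbiased = np.var(x, ddof=1)             # divides by n-1 instead of n
sd = np.sqrt(var_by_formula)                 # standard deviation

print(var_by_formula, var_numpy, var_unbiased, sd)
```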
Variance – Sample Size
[Figure: sample variance versus sample size (up to 100) for the Complex, Normal, Skewed and Long tails distributions]
Inter Quartile Range (IQR)
Sample: $x_1, x_2, \ldots, x_n$
Quartiles:
$x_i < Q_1$ for 25% of the sample
$x_i < Q_2$ for 50% of the sample (median)
$x_i < Q_3$ for 75% of the sample
Inter Quartile Range: $\mathrm{IQR} = Q_3 - Q_1$
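A small sketch of the IQR, using simulated normal data with standard deviation 0.1 as in Exercise 1 (the seed and sample size are arbitrary choices made here). For a normal distribution the IQR is about 1.349 standard deviations, so IQR/1.349 can be compared directly with the standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)                      # fixed seed for reproducibility
y = rng.normal(loc=0.0, scale=0.1, size=100_000)    # normal data, sd = 0.1

q1, q3 = np.percentile(y, [25, 75])
iqr = q3 - q1                 # inter quartile range
sd = np.std(y)                # standard deviation
sd_from_iqr = iqr / 1.349     # IQR rescaled to the standard deviation of a normal

print(iqr, sd, sd_from_iqr)
```

Unlike the standard deviation, the IQR ignores the outer 25% of values on each side, so it is less sensitive to long tails.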
Inter Quartile Range and Standard Deviation
[Figure: standard deviation and IQR/1.349 (gray) versus sample size (up to 100) for the Complex, Normal, Skewed and Long tails distributions]
Central Limit Theorem
For many kinds of distributions, the sum of a large number of values drawn from the distribution converges to a normal distribution if:
• The values are drawn independently;
• The values are from the same distribution; and
• The distribution has a finite mean and variance.
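The theorem can be illustrated with a quick simulation. The sketch below assumes an exponential distribution as the skewed starting point (a choice made here, not one from the lecture) and compares single values with sums of 50 independent values.

```python
import numpy as np

rng = np.random.default_rng(1)   # fixed seed for reproducibility
n_values = 50                    # values added up in each sum
n_repeats = 10_000               # number of sums generated

def skewness(a):
    """Sample skewness: third central moment over the cubed standard deviation."""
    d = a - np.mean(a)
    return np.mean(d ** 3) / np.mean(d ** 2) ** 1.5

single = rng.exponential(scale=1.0, size=n_repeats)
sums = rng.exponential(scale=1.0, size=(n_repeats, n_values)).sum(axis=1)

skew_single = skewness(single)   # strongly skewed
skew_sums = skewness(sums)       # much closer to 0, as for a normal distribution

# For a normal distribution about 68% of values lie within one standard deviation.
z = (sums - sums.mean()) / sums.std()
frac_within_1sd = np.mean(np.abs(z) < 1)

print(skew_single, skew_sums, frac_within_1sd)
```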
Uncertainty in Determining the Mean
[Figure: distributions of the sample mean for the Complex, Normal, Skewed and Long tails distributions at sample sizes n=3, n=10 and n=100 (n=10, n=100 and n=1000 for the last set of panels)]
Standard Error of the Mean
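Sample: $x_1, x_2, \ldots, x_n$
Mean: $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$
Variance: $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2$
Standard Error of the Mean: $\mathrm{s.e.m.} = \frac{\sigma}{\sqrt{n}}$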
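As a rough check of the standard error of the mean, the sketch below draws many samples from a normal distribution and compares the spread of the sample means with the standard deviation divided by the square root of the sample size. The sample size, number of repeats and seed are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)   # fixed seed for reproducibility
sigma = 0.1                      # standard deviation of the distribution
n = 25                           # sample size
n_repeats = 20_000               # number of samples drawn

samples = rng.normal(0.0, sigma, size=(n_repeats, n))
sample_means = samples.mean(axis=1)

sem_empirical = np.std(sample_means)   # spread of the sample means
sem_formula = sigma / np.sqrt(n)       # sigma / sqrt(n) = 0.02

# In practice only one sample is available, so sigma is estimated from it.
sem_one_sample = np.std(samples[0]) / np.sqrt(n)

print(sem_empirical, sem_formula, sem_one_sample)
```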
Exercise 3
Download ibb2015_7_exercise3.py
(a) Generate skewed data sets.
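sample_size = 10
x_test = np.random.uniform(-1.0, 1.0, size=30*sample_size)  # candidate x values
y_test = np.random.uniform(0.0, 1.0, size=30*sample_size)   # candidate y values
y_test2 = skew(x_test, -0.1, 0.2, 10)                       # skew() is not shown on this slide
y_test2 /= max(y_test2)                                     # scale so the maximum is 1
x_test2 = x_test[y_test < y_test2]                          # keep x where (x,y) lies under the curve
x_sample = x_test2[:sample_size]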
1. Generate a pair of random numbers within the range.
2. Assign them to x and y.
3. Keep x if the point (x,y) is within the distribution.
4. Repeat 1-3 until the desired sample size is obtained.
5. The values x obtained in this way will be distributed according to the original distribution.
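The steps above can be sketched as an explicit loop. The density used here, `hypothetical_density`, is an illustrative skewed shape invented for this example; it is not the `skew()` function from the exercise script.

```python
import numpy as np

def hypothetical_density(x):
    """Illustrative skewed shape on [-1, 1]; NOT the skew() of the exercise script."""
    t = x + 1.0
    return t * np.exp(-4.0 * t)

rng = np.random.default_rng(0)        # fixed seed for reproducibility
sample_size = 10
peak = hypothetical_density(-0.75)    # maximum of the shape (at t = 1/4)

x_sample = []
while len(x_sample) < sample_size:    # step 4: repeat until enough values are kept
    x = rng.uniform(-1.0, 1.0)        # steps 1-2: random x within the range
    y = rng.uniform(0.0, peak)        #            random y between 0 and the peak
    if y < hypothetical_density(x):   # step 3: keep x if (x, y) lies under the curve
        x_sample.append(x)
x_sample = np.array(x_sample)

print(x_sample)
```

The loop makes each step visible; generating all candidate pairs at once as arrays and filtering them with a boolean mask does the same thing with fewer Python-level iterations.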
Exercise 3
(b) Calculate the mean of samples drawn from the skewed data set and the standard error of the mean, and plot the distribution of averages.
for repeat in range(1000):
    …
    average.append(np.mean(x_sample))
sem = np.std(average)  # standard error of the mean = spread of the sample means
fig, (ax1) = plt.subplots(1, figsize=(6,6))
ax1.set_title('Sample size = '+str(sample_size)+', SEM = '+str(sem))
# density=True replaces the normed=True argument of older matplotlib versions
ax1.hist(average, bins=100, histtype='step', color='red', range=[-0.5,0.5], density=True, lw=2)
ax1.set_xlim([-0.5, 0.5])
Box Plot
M. Krzywinski & N. Altman, Visualizing samples with box plots, Nature Methods 11 (2014) 119
Box Plots
[Figure: box plots of samples of size n=5, n=10 and n=100 for the Complex, Normal, Skewed and Long tails distributions]
Box Plots with All the Data Points
[Figure: box plots overlaid with all data points for samples of size n=5, n=10 and n=100 from the Complex, Normal, Skewed and Long tails distributions]
Box Plots, Scatter Plots and Bar Graphs: Normal Distribution
[Figure: box plots, scatter plots and bar graphs for the normal distribution; panels with error bars showing the standard deviation and panels with error bars showing the standard error]
Box Plots, Scatter Plots and Bar Graphs: Skewed Distribution
[Figure: box plots, scatter plots and bar graphs for the skewed distribution; panels with error bars showing the standard deviation and panels with error bars showing the standard error]
Exercise 4
Download ibb2015_7_exercise4.py and plot box plots for a skewed data set.
fig, (ax1) = plt.subplots(1, figsize=(6,6))
ax1.scatter(np.linspace(1-0.1, 1+0.1, sample_size), x_sample, facecolors='none', edgecolor=thiscolor, lw=1)
bp = ax1.boxplot(x_samples, notch=False, sym='')
plt.setp(bp['boxes'], color=thiscolor, lw=2)
plt.setp(bp['whiskers'], color=thiscolor, lw=2)
plt.setp(bp['medians'], color='black', lw=2)
plt.setp(bp['caps'], color=thiscolor, lw=2)
plt.setp(bp['fliers'], color=thiscolor, marker='o', lw=0)
fig.savefig(…)
Descriptive Statistics - Summary
• Example distributions:
  • Normal distribution
  • Skewed distribution
  • Distribution with long tails
  • Complex distribution with several peaks
• Mean, median, quartiles, percentiles
• Variance, Standard deviation, Inter Quartile Range (IQR), error bars
• Box plots, bar graphs, and scatter plots
Descriptive Statistics – Recommended Reading
http://blogs.nature.com/methagora/2013/08/giving_statistics_the_attention_it_deserves.html
Homework
Plot the ratio of the standard error of the mean and the standard deviation as a function of sample size (use sample sizes of 3, 10, 30, 100, 300, 1000) for the skewed distribution in Exercise 3. Modify ibb2015_7_exercise3.py to generate this plot and email both the script and the plot.
Next Lecture: Sequence Alignment Concepts