May 06, 2018

Page 1: Statistical analysis for exploring risk factorshome.hiroshima-u.ac.jp/.../Statisticalanalysis20170530.pdfStatistical analysis for exploring risk factors Kenichi Satoh (RIRBM) ksatoh@hiroshima-u.ac.jp

Statistical analysis for exploring risk factors

Kenichi Satoh (RIRBM)

ksatoh@hiroshima-u.ac.jp

http://home.hiroshima-u.ac.jp/ksatoh/

Statistical methods are useful for understanding and summarizing the results of clinical trials and similar studies. We will introduce the exploratory process of statistical data analysis.

1. Baby weight data

• Univariate analysis

• Bivariate analysis

• Multivariate analysis
  - Linear regression analysis
  - Path analysis

2. Decathlon data

• Multidimensional scaling method

• Hierarchical cluster analysis

Baby weight data

BabyWeight  MamWeight  MamAge  PregnancyDays  Smoking
      3087         48      28            304        0
      3229         52      24            286        1
      3204         61      33            273        1
      3346         58      30            295        0
      3579         56      21            290        0
      2325         46      26            262        0
      3159         55      30            318        1
      3589         63      37            298        0
      2969         52      25            299        1
      2819         40      22            313        0
      3191         59      34            285        1
      3346         57      28            306        0
      2444         45      30            291        1
      3662         64      21            274        0
      3241         53      29            283        0

Reference: Takamitsu Sawa (佐和隆光), 回帰分析 (Regression Analysis), Asakura Shoten (朝倉書店), p. 57, Table 4.1

Which variables are related to baby weight?

Free Statistical Software “R”

https://www.r-project.org/index.html
http://www.okadajp.org/RWiki/

Search results for “Statistics R” on Amazon

Univariate analysis

Sorting, Stem-and-leaf plot, Frequency, Histogram, Rug plot, Probability density function, Normal distribution, Boxplot, Quantile, Percentile

Sorting and stem-and-leaf plot

“2|34” means that there are two samples of the form 23XX and 24XX; the actual values are 2325 and 2444.
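The reading of “2|34” can be reproduced with a minimal sketch in Python. It assumes that the stem is the thousands digit, that the leaf is the hundreds digit obtained by truncation, and that each stem is split into a lower row (leaves 0 to 4) and an upper row (leaves 5 to 9). These display choices are assumptions made so that the first row matches the example; the software used for the slide may format the plot differently.

```python
from collections import defaultdict

# BabyWeight column of the baby weight data (grams)
baby_weight = [3087, 3229, 3204, 3346, 3579, 2325, 3159, 3589,
               2969, 2819, 3191, 3346, 2444, 3662, 3241]


def stem_and_leaf(values, stem_unit=1000, leaf_unit=100):
    """Return display rows such as '2 | 34' (stem = thousands, leaf = hundreds)."""
    rows = defaultdict(str)
    for v in sorted(values):
        stem = v // stem_unit
        leaf = (v % stem_unit) // leaf_unit   # truncated, not rounded (assumption)
        half = 0 if leaf < 5 else 1           # split each stem into two rows
        rows[(stem, half)] += str(leaf)
    return [f"{stem} | {leaves}" for (stem, _), leaves in sorted(rows.items())]


lines = stem_and_leaf(baby_weight)
for line in lines:
    print(line)
```

The first printed row is `2 | 34`, which corresponds to the two smallest weights, 2325 and 2444.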

Frequency, Histogram and Rug plot

Baby weight can be grouped into several intervals, and a table of frequency counts is created from them.
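As a rough illustration, the frequency table can be built by assigning each weight to an interval. The interval width of 500 g is an assumption for this sketch; the intervals used in the slide's histogram are not recoverable from the transcript.

```python
from collections import Counter

# BabyWeight column of the baby weight data (grams)
baby_weight = [3087, 3229, 3204, 3346, 3579, 2325, 3159, 3589,
               2969, 2819, 3191, 3346, 2444, 3662, 3241]

BIN_WIDTH = 500  # assumed interval width in grams


def frequency_table(values, width):
    """Count how many values fall in each half-open interval [lo, lo + width)."""
    counts = Counter((v // width) * width for v in values)
    return {(lo, lo + width): counts[lo] for lo in sorted(counts)}


table = frequency_table(baby_weight, BIN_WIDTH)
for (lo, hi), count in table.items():
    # A row of asterisks gives a text version of the histogram bar.
    print(f"[{lo}, {hi}): {count:2d} {'*' * count}")
```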

Calculation of mean and s.d.

Variance is sometimes defined with 1/(n-1) instead of 1/n, because dividing by n-1 gives an unbiased estimator of the population variance.
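The two definitions can be compared directly on the BabyWeight column. This is a minimal sketch using only the Python standard library; the variable names are illustrative.

```python
import math
import statistics

# BabyWeight column of the baby weight data (grams)
baby_weight = [3087, 3229, 3204, 3346, 3579, 2325, 3159, 3589,
               2969, 2819, 3191, 3346, 2444, 3662, 3241]

n = len(baby_weight)
mean = sum(baby_weight) / n
sum_sq = sum((x - mean) ** 2 for x in baby_weight)  # sum of squared deviations

sd_n = math.sqrt(sum_sq / n)               # divisor n
sd_n_minus_1 = math.sqrt(sum_sq / (n - 1)) # divisor n - 1

print(f"mean = {mean:.1f}")
print(f"s.d. with 1/n     = {sd_n:.1f}")
print(f"s.d. with 1/(n-1) = {sd_n_minus_1:.1f}")

# The standard library offers both versions as well.
assert math.isclose(sd_n, statistics.pstdev(baby_weight))
assert math.isclose(sd_n_minus_1, statistics.stdev(baby_weight))
```

With only 15 observations the two values differ noticeably (about 371 versus 384); the gap shrinks as n grows.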

Normal distribution with mean and s.d.

About 95% of a normal distribution lies within two standard deviations of the mean, i.e., in (μ-2σ, μ+2σ). Only one observation, 2325, falls outside (2379, 3913).

https://en.wikipedia.org/wiki/Normal_distribution
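The interval (2379, 3913) can be recomputed from the data. The sketch below assumes that σ is estimated with the 1/(n-1) divisor, which is the choice that reproduces the quoted endpoints; it also evaluates the exact normal probability of lying within two standard deviations.

```python
import statistics

# BabyWeight column of the baby weight data (grams)
baby_weight = [3087, 3229, 3204, 3346, 3579, 2325, 3159, 3589,
               2969, 2819, 3191, 3346, 2444, 3662, 3241]

mu = statistics.mean(baby_weight)
sigma = statistics.stdev(baby_weight)  # 1/(n-1) divisor (assumption)

lower, upper = mu - 2 * sigma, mu + 2 * sigma
outside = [x for x in baby_weight if not lower < x < upper]

# Probability that a normal variable lies within two s.d. of its mean
coverage = 2 * statistics.NormalDist().cdf(2) - 1

print(f"interval = ({lower:.0f}, {upper:.0f})")
print(f"outside  = {outside}")
print(f"coverage = {coverage:.4f}")
```

The coverage is about 0.954, which is why "two standard deviations" is the usual rule of thumb for 95%.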

Standardization

Standardization does not change the shape of the distribution; it only shifts the mean to 0 and rescales the s.d. to 1.

Quantiles and Box plot

The 1st quartile is the 25th percentile, and so on. The notch of the boxplot shows an approximate 95% confidence interval for the median; the interval becomes narrower as the sample size increases.
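A minimal sketch of the quartile calculation follows. The `inclusive` interpolation method and the notch half-width of 1.58 × IQR / √n are assumptions: the latter is the convention documented for R's `boxplot.stats`, and the transcript does not say how the slide's notch was drawn.

```python
import math
import statistics

# BabyWeight column of the baby weight data (grams)
baby_weight = [3087, 3229, 3204, 3346, 3579, 2325, 3159, 3589,
               2969, 2819, 3191, 3346, 2444, 3662, 3241]

# Quartiles = 25th, 50th and 75th percentiles
q1, median, q3 = statistics.quantiles(baby_weight, n=4, method="inclusive")
iqr = q3 - q1

# Assumed notch rule: median +/- 1.58 * IQR / sqrt(n)
half_width = 1.58 * iqr / math.sqrt(len(baby_weight))
notch = (median - half_width, median + half_width)

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}")
print(f"notch = ({notch[0]:.0f}, {notch[1]:.0f})")
```

Because the half-width is divided by √n, quadrupling the sample size with the same IQR would halve the notch.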

Bivariate analysis

Boxplot, t-test, Correlation coefficient, Simple regression, Regression line

Boxplot by groups

“Smoking” is a dummy (indicator) variable for whether the mother smoked during pregnancy (1: smoking; 0: not smoking). The medians of the two groups appear to differ.

Comparing means between two groups

The null hypothesis of the t-test is that the two groups have the same mean. When the p-value is less than 0.05, the null hypothesis is rejected and the difference in means is considered statistically significant.
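The test statistic can be sketched as follows. The sketch assumes Welch's unequal-variance form of the t-test (the default of R's `t.test`); whether the slide used this form is not stated. It stops at the statistic and its degrees of freedom, because the Python standard library has no t-distribution from which to obtain the p-value.

```python
import math
import statistics

# BabyWeight (grams) and Smoking (1: smoking, 0: not smoking) columns
baby_weight = [3087, 3229, 3204, 3346, 3579, 2325, 3159, 3589,
               2969, 2819, 3191, 3346, 2444, 3662, 3241]
smoking = [0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0]

non_smokers = [w for w, s in zip(baby_weight, smoking) if s == 0]
smokers = [w for w, s in zip(baby_weight, smoking) if s == 1]


def welch_t(a, b):
    """Welch t statistic and approximate degrees of freedom for mean(a) - mean(b)."""
    va = statistics.variance(a) / len(a)   # squared standard error of mean(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


mean_non_smokers = statistics.mean(non_smokers)
mean_smokers = statistics.mean(smokers)
t_stat, df = welch_t(non_smokers, smokers)

print(f"mean (not smoking) = {mean_non_smokers:.1f}")
print(f"mean (smoking)     = {mean_smokers:.1f}")
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

The p-value is then the two-sided tail probability of |t| under a t-distribution with the computed degrees of freedom.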

Simple regression

The regression line y = 737 + 45x (x: MamWeight, y: BabyWeight) was fitted to the scatter plot, which implies that baby weight increases by 450 g when the mother's weight increases by 10 kg. The slope is statistically significant, with p = 0.0003.
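The fitted line can be reproduced by ordinary least squares. This is a minimal sketch; the function name `fit_line` is illustrative and the p-value is not computed here.

```python
# MamWeight (kg) and BabyWeight (grams) columns of the baby weight data
mam_weight = [48, 52, 61, 58, 56, 46, 55, 63, 52, 40, 59, 57, 45, 64, 53]
baby_weight = [3087, 3229, 3204, 3346, 3579, 2325, 3159, 3589,
               2969, 2819, 3191, 3346, 2444, 3662, 3241]


def fit_line(x, y):
    """Least-squares intercept and slope of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(mam_weight, baby_weight)
print(f"y = {intercept:.0f} + {slope:.1f} x")
print(f"change in baby weight per +10 kg of mother weight: {10 * slope:.0f} g")
```

The unrounded slope is about 44.7 g per kg, which rounds to the 45 quoted on the slide.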

Estimation of regression coefficients

Statistical methods and models can be described using linear algebra.
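The slide's own formula did not survive in this transcript. For reference, the standard matrix form of the least-squares estimator for simple regression is sketched below; the symbols X, β and ε are the usual textbook notation and are an assumption about what the slide showed.

```latex
\begin{aligned}
y &= X\beta + \varepsilon, \qquad
X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}, \\
\hat{\beta} &= \operatorname*{arg\,min}_{\beta} \lVert y - X\beta \rVert^{2}
            = (X^{\top} X)^{-1} X^{\top} y .
\end{aligned}
```

The same expression covers multiple regression by adding one column to X per explanatory variable.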

Correlation coefficient

The correlation coefficient can be interpreted as the slope of the simple regression line when both variables are standardized.
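This interpretation can be checked numerically on MamWeight and BabyWeight. The sketch below is illustrative; the helper names are not from the original slides.

```python
import math
import statistics

# MamWeight (kg) and BabyWeight (grams) columns of the baby weight data
mam_weight = [48, 52, 61, 58, 56, 46, 55, 63, 52, 40, 59, 57, 45, 64, 53]
baby_weight = [3087, 3229, 3204, 3346, 3579, 2325, 3159, 3589,
               2969, 2819, 3191, 3346, 2444, 3662, 3241]


def standardize(values):
    """Shift to mean 0 and scale to s.d. 1 (1/(n-1) divisor)."""
    m, s = statistics.mean(values), statistics.stdev(values)
    return [(v - m) / s for v in values]


def slope(x, y):
    """Least-squares slope of y regressed on x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sum((a - mx) ** 2 for a in x)


def correlation(x, y):
    """Pearson correlation coefficient."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = correlation(mam_weight, baby_weight)
slope_standardized = slope(standardize(mam_weight), standardize(baby_weight))

print(f"correlation coefficient    = {r:.3f}")
print(f"slope on standardized data = {slope_standardized:.3f}")
```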

The correlation coefficient between two variables measures how well a straight line fits the data.

Its value lies within [-1, 1], and its absolute value equals 1 only if all the data lie on a straight line.

When the correlation coefficient is close to zero, there is little or no linear relation between the variables, although a nonlinear relation may still exist.

Multivariate analysis

Correlation matrix, Multiple linear regression, Path analysis

A scatter plot matrix can be summarized as a correlation coefficient matrix. The figure and the table are useful for understanding the relations among all variables at once.

Panels: scatter plot matrix, correlation matrix

Multiple linear regression

Fitted regression model:

BabyWeight = -1675 + 55*MamWeight - 23*MamAge + 9*PregnancyDays - 141*Smoking
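A sketch of how such a model can be estimated without external libraries is given below: it forms the normal equations (XᵀX)β = Xᵀy and solves them by Gaussian elimination. The `solve` helper is written for this illustration and is not the routine used for the slide; the printed coefficients can be compared with the fitted model above.

```python
# Rows: (BabyWeight, MamWeight, MamAge, PregnancyDays, Smoking)
rows = [
    (3087, 48, 28, 304, 0), (3229, 52, 24, 286, 1), (3204, 61, 33, 273, 1),
    (3346, 58, 30, 295, 0), (3579, 56, 21, 290, 0), (2325, 46, 26, 262, 0),
    (3159, 55, 30, 318, 1), (3589, 63, 37, 298, 0), (2969, 52, 25, 299, 1),
    (2819, 40, 22, 313, 0), (3191, 59, 34, 285, 1), (3346, 57, 28, 306, 0),
    (2444, 45, 30, 291, 1), (3662, 64, 21, 274, 0), (3241, 53, 29, 283, 0),
]
names = ["(Intercept)", "MamWeight", "MamAge", "PregnancyDays", "Smoking"]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    x = [0.0] * n
    for i in reversed(range(n)):                    # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


y = [r[0] for r in rows]
X = [[1.0, *r[1:]] for r in rows]                   # leading 1 = intercept column
p = len(X[0])

xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
beta = solve(xtx, xty)

fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

for name, b in zip(names, beta):
    print(f"{name:>13}: {b:10.2f}")
```

Forming XᵀX explicitly is acceptable for a table of this size; numerical libraries usually prefer a QR decomposition for better accuracy.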

Goodness of fit

Goodness of fit is measured by the multiple R-squared (R²), which is the squared correlation coefficient between the response and the fitted values.
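The definition can be checked against the familiar form 1 − SSE/SST. To keep the sketch short it uses the simple regression of BabyWeight on MamWeight rather than the full multiple regression (a simplification for illustration); the identity holds for any least-squares fit that includes an intercept.

```python
import math

# MamWeight (kg) and BabyWeight (grams) columns of the baby weight data
mam_weight = [48, 52, 61, 58, 56, 46, 55, 63, 52, 40, 59, 57, 45, 64, 53]
baby_weight = [3087, 3229, 3204, 3346, 3579, 2325, 3159, 3589,
               2969, 2819, 3191, 3346, 2444, 3662, 3241]


def mean(values):
    return sum(values) / len(values)


def correlation(x, y):
    """Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Least-squares fit of BabyWeight on MamWeight
mx, my = mean(mam_weight), mean(baby_weight)
slope = (sum((a - mx) * (b - my) for a, b in zip(mam_weight, baby_weight))
         / sum((a - mx) ** 2 for a in mam_weight))
intercept = my - slope * mx
fitted = [intercept + slope * x for x in mam_weight]

# R-squared as the squared correlation between response and fitted values
r2_from_correlation = correlation(baby_weight, fitted) ** 2

# R-squared as 1 - (residual sum of squares) / (total sum of squares)
sse = sum((y - f) ** 2 for y, f in zip(baby_weight, fitted))
sst = sum((y - my) ** 2 for y in baby_weight)
r2_from_sums = 1 - sse / sst

print(f"R^2 (squared correlation) = {r2_from_correlation:.4f}")
print(f"R^2 (1 - SSE/SST)         = {r2_from_sums:.4f}")
```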

Multiple linear regression ⊆ Path analysis ⊆ Structural Equation Model

Multiple linear regression is a special case of path analysis, which in turn is a special case of the structural equation model (SEM).

There are two paths from Smoking to BabyWeight. Smoking directly decreases BabyWeight by 197 g, and indirectly decreases it by 1.26 × 48.66 ≈ 61 g through MamWeight.
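The arithmetic of the two paths can be written out as a small sketch. The three numbers are taken from the text; the variable names and the total effect (direct plus indirect, the usual convention in path analysis) are additions for illustration, and the signs of the individual path coefficients are not recoverable from the transcript, so the code works with the size of each decrease.

```python
# Path coefficients quoted in the text (magnitudes only)
direct_decrease = 197.0          # Smoking -> BabyWeight, in grams
smoking_to_mamweight = 1.26      # Smoking -> MamWeight
mamweight_to_babyweight = 48.66  # MamWeight -> BabyWeight, grams per kg

# Indirect effect = product of the coefficients along the path
indirect_decrease = smoking_to_mamweight * mamweight_to_babyweight

# Total effect = direct effect + indirect effect
total_decrease = direct_decrease + indirect_decrease

print(f"indirect decrease = {indirect_decrease:.1f} g")
print(f"total decrease    = {total_decrease:.1f} g")
```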

Free software “Graphviz” by AT&T: http://www.graphviz.org/

Decathlon Data

Reference: Giichiro Suzuki (鈴木義一郎), 例解多変量解析 (Multivariate Analysis Illustrated by Examples), Jikkyo Shuppan (実教出版)

Top 50 decathlon records in Japan, 1995.

There are 10 variables: Race100, LongJump, ShotPut, HighJump, Race400, Hurdle110, Discus, PoleVault, Javelin, Race1500.

Which variables are related to each other?

id  Race100  LongJump  ShotPut  HighJump  Race400  Hurdle110  Discus  PoleVault  Javelin  Race1500
 1    11.28      6.91    13.38      1.93    50.25       14.7   44.66        4.9     56.1    285.31
 2    11.02       6.8    12.81      1.93    48.42      15.72   38.84        4.6    65.65    285.54
 3     10.9      7.08    12.38      1.87    48.49      14.58   35.84        4.6     56.9    280.62
 4    11.34      7.02    12.08      2.05    50.55      14.73   37.14        4.5    57.76    279.04
 5    11.24      7.03    11.57      1.96    48.89      14.89   31.34        4.7    61.96    280.77
 6    10.81      6.67    12.81      1.93    48.97      15.04   33.96        3.9    57.26    279.09
 7    11.34      6.77     12.4      1.98    50.03      14.82   33.14        4.3    51.98    275.08
 8    11.29      6.69    11.67       1.9    51.43      14.66   34.36        4.4    52.82    267.41
 9    10.83      6.83    11.67       1.8    48.35      15.11   33.68        4.3    49.54    275.69
10    11.23      6.67    12.39      1.84    50.33      15.81   37.68        4.5    54.54    280.66
11    10.85       7.2    11.37      1.75     50.4      14.88   34.46        4.3    44.66    275.24
12       11       6.9     11.5      1.85     49.8       15.5   36.22        4.5    52.52       286
13    11.38      7.04    11.68      1.91    51.33      15.65   31.66        4.6    50.62    286.59
14    11.49       6.8     10.7      1.95     49.4      15.21   32.72        3.8    54.26    268.57
15    11.11      6.99    12.76       1.7    49.54      16.32   40.24        4.3     50.4    302.23
16    11.62      6.88    12.41      1.88    52.93      16.52   37.36        4.7    52.94    291.25
17     10.9      7.22    10.48      1.85     49.4       14.8   30.16          4    51.56     291.1
18    11.35      6.82     11.8      1.94    51.42      15.04   33.72        3.8    52.26    285.85
19     11.1      6.86    11.17       1.8    49.71      15.28      30        4.4    48.84    285.65
20     11.3      6.94    10.51      1.85       51       15.6   33.84        4.3     54.2       275
21    11.42      6.66    12.25      1.75     52.7      15.87   40.84        4.2    57.26    299.58
22    11.55      6.95    10.82      1.84     51.4      14.94   33.04        4.4     54.1    318.71
23     11.2      6.35     10.8       1.7     50.8       14.8   34.64          4    56.86     274.3
24    11.39      6.98    10.33      1.94    51.52      16.31   29.82          4    52.78    277.72
25       11      6.66     10.8      1.85     50.2         16   30.38          4     52.9       281
26    11.77      6.63     10.4      1.92    50.33      15.15    33.4        3.8    47.48    279.83
27     11.1      6.56    12.09       1.7     50.1       15.3   36.44        3.6    53.98     289.4
28     10.9      6.62    11.15      1.96     50.7       15.4   32.18        3.8    52.84     313.5
29    11.15      6.37     10.8      1.72    50.26      15.05    34.2        3.9     50.4    285.95
30    11.72      6.53    11.09       1.8    49.99      16.04   31.74        3.8    57.74    274.91
31     11.2      6.76    10.13      1.85     50.5       15.8   30.92          4    51.68     282.7
32    11.17      6.64     9.01      1.84     50.6      14.86   27.46        4.9    42.42    312.73
33    11.41      6.38    10.87      1.94    51.48      15.88   33.22        4.1    46.22    289.96
34    11.33      6.71    10.93      1.75    49.58      15.89    28.7        3.5     60.5    290.13
35     11.2      6.67    10.63      1.87     52.3       16.6    30.8        4.3    53.82     289.5
36    11.57      6.43    11.35      1.75     50.3      16.32   35.48          4    47.56    280.27
37     10.8      6.67    10.81      1.83     50.1       15.5   28.18        3.6    54.64     301.4
38     11.1      6.98    11.04       1.8     50.4       15.4    31.2        3.2    44.42     271.2
39    11.21      6.31    11.12      1.78    51.65      15.39   30.44        4.2    51.94    305.83
40     11.2      6.65    10.32      1.75     50.6       15.5   30.08        3.8    48.76     276.8
41    11.32      6.81     9.65       1.7    51.42       15.7    24.5        4.4    43.94    271.88
42     10.9      6.54     9.81      1.84     50.4       15.8   29.42        3.8    47.08     286.5
43    11.68      6.78    11.63      1.87    54.16      16.83   33.54        3.3    47.36    272.44
44    11.56      6.56     9.42      1.85    51.54      15.89   28.04          4     48.1    275.55
45     11.2      7.16     9.28      1.85     52.5       16.2   26.06        4.5     44.8     293.3
46     11.2      6.55    10.48       1.8     51.4       15.3   32.88        3.7    48.46     296.7
47    11.24      6.27    10.97      1.91    51.65      15.98   30.48        3.4    49.52    288.24
48    11.06      6.91     9.94      2.15    52.47      15.53    27.5        3.1    44.42    315.23
49    11.54      6.18     10.6       1.7    50.45      16.46   34.06        3.8    55.08    285.22
50    11.21      6.52    10.16      1.81    50.56      16.16   31.96        3.8    41.38    287.47

Simple regression and correlation coefficient


Scatter plot matrices Correlation matrix

Principal Component Analysis based on Correlation Matrix

Biplot of PCA based on Individual Data

The 1st and 2nd place athletes are good at throwing. Athletes 3, 5, 9, 11, and 17 are good at running.

Adjacency Matrix based on Correlation Matrix

a = 1 if |r| > 0.3; otherwise a = 0
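The thresholding rule can be sketched as below. The 3 × 3 correlation matrix is hypothetical, made up for illustration and not computed from the decathlon data, and setting the diagonal to 0 (no self-loops in the graph) is an assumption of this sketch.

```python
THRESHOLD = 0.3  # cut-off on |r| from the rule above

names = ["ShotPut", "Discus", "Race1500"]
# Hypothetical correlation matrix (illustrative values only)
corr = [
    [1.00, 0.82, -0.10],
    [0.82, 1.00, 0.05],
    [-0.10, 0.05, 1.00],
]


def adjacency(corr, threshold):
    """a[i][j] = 1 if |r_ij| > threshold for i != j, else 0."""
    n = len(corr)
    return [[1 if i != j and abs(corr[i][j]) > threshold else 0
             for j in range(n)]
            for i in range(n)]


adj = adjacency(corr, THRESHOLD)

# Each 1 above the diagonal becomes one edge of the undirected graph.
edges = [(names[i], names[j])
         for i in range(len(names))
         for j in range(i + 1, len(names))
         if adj[i][j]]

for row in adj:
    print(row)
print(edges)
```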

Undirected Graph based on Adjacency Matrix

Distance

The distance between ShotPut and Discus can be measured by treating them as two points in 50-dimensional space. Standardization makes the variables comparable.

A distance is a function d: M × M → R that satisfies the following conditions:

i) d(x,y) ≥ 0, and d(x,y) = 0 if and only if x = y.
ii) d(x,y) = d(y,x)
iii) d(x,z) ≤ d(x,y) + d(y,z)

Relation between Correlation and Distance for Standardized Data
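The slide's formula did not survive in this transcript. One standard form of the relation, assuming each variable is standardized with the 1/(n-1) divisor, is d² = 2(n-1)(1-r), where d is the Euclidean distance between the two standardized variables and r is their correlation coefficient. The sketch below checks this numerically on the ShotPut and Discus values of the first five athletes only, a subset chosen to keep the example short.

```python
import math
import statistics

# ShotPut and Discus for athletes 1 to 5 of the decathlon table (subset)
shot_put = [13.38, 12.81, 12.38, 12.08, 11.57]
discus = [44.66, 38.84, 35.84, 37.14, 31.34]


def standardize(values):
    """Shift to mean 0 and scale to s.d. 1 (1/(n-1) divisor)."""
    m, s = statistics.mean(values), statistics.stdev(values)
    return [(v - m) / s for v in values]


zx, zy = standardize(shot_put), standardize(discus)
n = len(zx)

# Correlation coefficient expressed through the standardized values
r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)

# Euclidean distance between the two variables as points in n-dimensional space
d = math.dist(zx, zy)

# Distance predicted by the relation d^2 = 2 (n - 1) (1 - r)
d_from_r = math.sqrt(2 * (n - 1) * (1 - r))

print(f"r = {r:.3f}")
print(f"distance = {d:.3f}, from correlation = {d_from_r:.3f}")
```

A higher correlation therefore means a shorter distance, so the distance matrix carries the same information as the correlation matrix.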

Distance matrix for Standardized Data

The distance between Discus and ShotPut is 4.18, which is the smallest distance among the 10 events.

Multi-Dimensional Scaling method, which makes a map from Distances

Although the 10 original variables lie in a 10-1=9 dimensional subspace of the 50-dimensional space, the MDS method can place all variables on a two-dimensional plane at once. Needless to say, some information is lost in the dimension reduction.

Cluster analysis based on Distance

The cluster dendrogram shows the similarity among variables according to the distance matrix computed in the original 50-dimensional space.


Classification based on Cluster Analysis

Factor Analysis based on Correlation Matrix

In factor analysis, several covariates are related through common latent (unobserved) factors. When the number of factors is two, the method is similar to principal component analysis.

Classification based on Factor Analysis

Naming the latent factors helps in interpreting each group of covariates.

Visualization of Correlation matrix and its classification

The covariates are ordered so that highly correlated ones are adjacent. As a result, blocks along the diagonal reveal clusters of covariates. The color indicates the magnitude of the correlation.