Statistical analysis for exploring risk factors
Kenichi Satoh (RIRBM)
http://home.hiroshima-u.ac.jp/ksatoh/
Statistical methods are useful for understanding and summarizing the results of clinical trials and similar studies. We introduce the exploratory process of statistical data analysis.
1. Baby weight data
• Univariate analysis
• Bi-variate analysis
• Multivariate analysis
  - Linear regression analysis
  - Path analysis
2. Decathlon data
Sorting, Stem-and-leaf plot, Frequency, Histogram, Rug plot, Probability density function, Normal distribution, Boxplot, Quantile, Percentile
Univariate analysis
Sorting and stem-and-leaf plot
The row “2|34” represents two samples, 23XX and 24XX; their exact values are 2325 and 2444.
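The reading of a row such as “2|34” can be sketched in code. This is only an illustration: the function name `stem_and_leaf` and all weights other than 2325 and 2444 are hypothetical, and leaves are obtained by truncating to the hundreds digit, which is an assumption about how the plot was drawn.

```python
from collections import defaultdict

def stem_and_leaf(values, stem_unit=1000, leaf_unit=100):
    """Return stem-and-leaf rows such as '2|34' for integer data."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem = v // stem_unit                 # thousands digit, e.g. 2 for 2325
        leaf = (v % stem_unit) // leaf_unit   # hundreds digit, e.g. 3 for 2325
        rows[stem].append(str(leaf))
    return [f"{stem}|{''.join(leaves)}" for stem, leaves in sorted(rows.items())]

# Hypothetical baby weights in grams; only 2325 and 2444 appear on the slide.
weights = [2325, 2444, 3010, 3150, 3190, 3420]
for row in stem_and_leaf(weights):
    print(row)
```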
Frequency, Histogram and Rug plot
Baby weights can be grouped into several intervals to create a table of frequency counts.
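A minimal sketch of such a frequency table follows. The interval width of 500 g, the starting point of 2000 g, and the data are all hypothetical choices made for illustration.

```python
from collections import Counter

def frequency_table(values, start, width):
    """Count values in half-open intervals [lower, lower + width)."""
    counts = Counter((v - start) // width for v in values)
    table = []
    for k in range(min(counts), max(counts) + 1):
        lower = start + k * width
        table.append((lower, lower + width, counts.get(k, 0)))
    return table

# Hypothetical baby weights in grams.
weights = [2325, 2444, 3010, 3150, 3190, 3420]
for lower, upper, count in frequency_table(weights, start=2000, width=500):
    print(f"[{lower}, {upper}): {count}")
```

A histogram draws one bar per row of this table, with the bar height given by the count.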
Calculation of mean and s.d.
The variance is sometimes defined with 1/(n-1) instead of 1/n, because dividing by n-1 makes the sample variance an unbiased estimator of the population variance.
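The two definitions of variance can be compared directly. In this sketch the parameter name `ddof` (the amount subtracted from n in the divisor) and the data are assumptions for illustration.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs, ddof=0):
    """Variance with divisor n - ddof: ddof=0 gives 1/n, ddof=1 gives 1/(n-1)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - ddof)

# Hypothetical baby weights in grams.
weights = [2325, 2444, 3010, 3150, 3190, 3420]
print("mean:", mean(weights))
print("s.d. with 1/n:    ", math.sqrt(variance(weights, ddof=0)))
print("s.d. with 1/(n-1):", math.sqrt(variance(weights, ddof=1)))
```

The 1/(n-1) version is always slightly larger, and the gap shrinks as n grows.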
Normal distribution with mean and s.d.
About 95% of the distribution lies within two standard deviations of the mean, i.e., in (μ-2σ, μ+2σ). Only one observation, 2325, lies outside (2379, 3913).
https://en.wikipedia.org/wiki/Normal_distribution
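The two-s.d. rule can be checked numerically. In this sketch μ and σ are back-calculated from the interval (2379, 3913) quoted above, on the assumption that it is exactly (μ-2σ, μ+2σ); they are not stated on the slide itself.

```python
from statistics import NormalDist

lower, upper = 2379, 3913        # interval quoted on the slide
mu = (lower + upper) / 2         # midpoint of the interval
sigma = (upper - lower) / 4      # the interval is 4 sigma wide

dist = NormalDist(mu, sigma)
coverage = dist.cdf(upper) - dist.cdf(lower)   # probability inside the interval
print("mu =", mu, "sigma =", sigma)
print("coverage within two s.d.:", round(coverage, 4))
print("2325 is outside the interval:", not (lower < 2325 < upper))
```

The coverage within exactly two standard deviations is slightly above 95%; the interval that contains exactly 95% is μ ± 1.96σ.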
Standardization
Standardization does not change the shape of the distribution; it only shifts the mean to 0 and rescales the s.d. to 1.
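Standardization replaces each value x by z = (x - mean) / s.d. A minimal sketch, assuming the 1/(n-1) definition of the s.d. and hypothetical data:

```python
import math

def standardize(xs):
    """Return z-scores (x - mean) / s.d., using the 1/(n-1) standard deviation."""
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    return [(x - m) / s for x in xs]

# Hypothetical baby weights in grams.
weights = [2325, 2444, 3010, 3150, 3190, 3420]
z = standardize(weights)
print([round(v, 2) for v in z])
```

Because the transformation is linear and increasing, the ordering and relative spacing of the data are unchanged.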
Quantiles and Box plot
The 1st quartile is the 25th percentile, … The notch of a boxplot shows an approximate 95% confidence interval for the median; its length decreases as the sample size increases.
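The effect of sample size on the notch can be sketched as follows. The formula median ± 1.58 × IQR / √n is the one used by R's `boxplot.stats`; whether the slide's figure was produced that way is an assumption, and the data are hypothetical.

```python
import math
from statistics import median, quantiles

def notch_interval(xs):
    """Approximate 95% interval for the median: median +/- 1.58 * IQR / sqrt(n)."""
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")   # quartiles
    half_width = 1.58 * (q3 - q1) / math.sqrt(len(xs))
    m = median(xs)
    return m - half_width, m + half_width

small = list(range(1, 10))   # hypothetical sample, n = 9
large = small * 4            # same quartiles, n = 36
print("n = 9: ", notch_interval(small))
print("n = 36:", notch_interval(large))
```

With the quartiles unchanged, quadrupling the sample size halves the length of the notch.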
Boxplot, t-test, Correlation Coefficient, Simple Regression, Regression line
Bivariate analysis
Boxplot by groups
“Smoking” is a dummy (indicator) variable for the mother smoking during pregnancy (1: smoking; 0: not smoking). The medians of the two groups appear to differ.
Comparing means between two groups
The null hypothesis of the t-test is that there is no difference between the mean values of the two groups. When the p-value is less than 0.05, the null hypothesis is rejected and we conclude that there is a statistically significant difference in means.
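The test statistic behind this comparison can be sketched as below. The slide does not say which variant of the t-test was used; this sketch assumes Welch's version (unequal variances), and the two groups are hypothetical data.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and its approximate degrees of freedom."""
    va = variance(a) / len(a)   # squared standard error of mean(a)
    vb = variance(b) / len(b)   # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical baby weights (g) by smoking status.
non_smoking = [3100, 3350, 3200, 3500, 3050, 3400]
smoking = [2900, 3150, 2800, 3250, 3000, 2950]
t, df = welch_t(non_smoking, smoking)
print("t =", round(t, 3), "df =", round(df, 1))
```

The p-value is the two-sided tail probability of |t| under a t distribution with `df` degrees of freedom; the Python standard library has no t distribution, so that last step is left to a statistics package.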
Simple regression
The line y = 737 + 45x was fitted to the scatter plot, which implies that baby weight increases by 450 g when mother's weight increases by 10 kg. The increase is statistically significant, with p = 0.0003.
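A least-squares line and the interpretation of its slope can be sketched as follows. Only the coefficients 737 and 45 come from the slide; the function names and the data points, generated to lie exactly on that line, are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

def predict(x, intercept=737, slope=45):
    """Fitted baby weight (g) for mother weight x (kg), using the slide's line."""
    return intercept + slope * x

# Hypothetical points lying exactly on the fitted line, for checking fit_line.
mother = [45, 50, 55, 60, 65]
baby = [predict(x) for x in mother]
print(fit_line(mother, baby))
print("increase for +10 kg:", predict(60) - predict(50), "g")
```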
Estimation of regression coefficients
Statistical methods and models can be described using linear algebra.
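For reference, the standard least-squares formulation in matrix notation is given below. The symbols are assumptions, since the slide's own formula is not reproduced in the text: y is the vector of responses, X is the design matrix whose first column is all ones, and β is the vector of regression coefficients.

```latex
\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon},
\qquad
\hat{\boldsymbol{\beta}}
  = \arg\min_{\boldsymbol{\beta}} \lVert \mathbf{y} - X\boldsymbol{\beta} \rVert^{2}
  = (X^{\top} X)^{-1} X^{\top} \mathbf{y}
```

For simple regression, X has two columns (ones and x), and the estimate reduces to the familiar intercept and slope.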
Correlation coefficient
The correlation coefficient can be interpreted as the slope of the simple regression line when both variables are standardized.
The correlation coefficient between two variables measures how well a straight line fits the data.
Its value lies in [-1, 1], and its absolute value equals 1 only if all the data lie on a straight line.
When the correlation coefficient is close to zero, there is little or no linear relation between the variables.
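The link between the correlation coefficient and the regression slope of standardized variables can be verified numerically. The helper names and the data in this sketch are hypothetical.

```python
import math

def standardize(xs):
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    return [(x - m) / s for x in xs]

def correlation(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sum((x - mx) ** 2 for x in xs)

x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 1, 4, 3, 5]
print("r:", correlation(x, y))
print("slope after standardizing:", slope(standardize(x), standardize(y)))
print("points on a line:", correlation(x, [3 + 2 * v for v in x]))
```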
Correlation matrix, Multiple linear regression, Path analysis
Multivariate analysis
A scatter plot matrix can be summarized as a correlation matrix. The figure and the table are useful for understanding all pairwise relations among the variables at once.
Goodness of fit
The goodness of fit is measured by the multiple R-squared, R², which is the squared correlation coefficient between the response and the fitted values.
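The identity between R² and the squared correlation of response and fitted values can be checked numerically. To keep the sketch short it uses a single covariate rather than a multiple regression; the data and helper names are hypothetical.

```python
import math

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sum((x - mx) ** 2 for x in xs)
    return my - slope * mx, slope

def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [50, 55, 60, 65, 70]              # hypothetical mother weights (kg)
y = [2950, 3300, 3350, 3700, 3900]    # hypothetical baby weights (g)
intercept, slope = fit_line(x, y)
fitted = [intercept + slope * xi for xi in x]

mean_y = sum(y) / len(y)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # residual sum of squares
sst = sum((yi - mean_y) ** 2 for yi in y)                # total sum of squares
r_squared = 1 - sse / sst
r_squared_via_corr = correlation(y, fitted) ** 2
print(r_squared, r_squared_via_corr)
```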
Multiple linear regression ⊆ Path analysis ⊆ Structural Equation Model
Multiple linear regression is a special case of path analysis, which is itself a special case of SEM.
There are two paths from Smoking to BabyWeight. Smoking affects BabyWeight directly, decreasing it by 197 g, and indirectly through MamWeight, decreasing it by a further 1.26 × 48.66 ≈ 61 g.
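The decomposition into direct and indirect effects is simple arithmetic on the path coefficients. In this sketch the signs are assumptions inferred from the word "decreases" (smoking lowers MamWeight by 1.26 kg, and each kg of MamWeight adds 48.66 g), and the variable names are hypothetical.

```python
# Path coefficients quoted on the slide; signs are assumed as described above.
direct = -197.0            # Smoking -> BabyWeight (g)
smoking_to_mam = -1.26     # Smoking -> MamWeight (kg), assumed negative
mam_to_baby = 48.66        # MamWeight -> BabyWeight (g per kg)

# An indirect effect is the product of the coefficients along its path.
indirect = smoking_to_mam * mam_to_baby
total = direct + indirect

print("indirect effect (g):", round(indirect, 2))
print("total effect (g):   ", round(total, 2))
```

The product 1.26 × 48.66 is about 61.3, so under these assumptions the total effect of smoking is a decrease of roughly 258 g.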
Free software “Graphviz” by AT&T: http://www.graphviz.org/
Decathlon Data
Reference:
鈴木義一郎 (Giichiro Suzuki), 例解 多変量解析 (Multivariate Analysis with Worked Examples), 実教出版 (Jikkyo Shuppan).
Top 50 decathlon records in Japan, 1995.
There are 10 variables: Race100, LongJump, ShotPut, HighJump, Race400, Hurdle110, Discus, PoleVault, Javelin, Race1500.
The 1st- and 2nd-place athletes are good at throwing. Athletes 3, 5, 9, 11, and 17 are good at running.
Adjacency Matrix based on Correlation Matrix
a = 1 if |r| > 0.3; otherwise a = 0
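The thresholding rule can be sketched as follows. The correlation values are hypothetical, and setting the diagonal to 0 (no self-loops in the graph) is an assumption that the slide does not state.

```python
def adjacency_matrix(corr, threshold=0.3):
    """a[i][j] = 1 if |r_ij| > threshold, else 0; the diagonal is set to 0 (assumption)."""
    n = len(corr)
    return [[1 if i != j and abs(corr[i][j]) > threshold else 0
             for j in range(n)]
            for i in range(n)]

# Hypothetical correlations among three events.
names = ["ShotPut", "Discus", "Race1500"]
corr = [[1.00, 0.80, -0.10],
        [0.80, 1.00, -0.35],
        [-0.10, -0.35, 1.00]]

adj = adjacency_matrix(corr)
edges = [(names[i], names[j])
         for i in range(len(names)) for j in range(i + 1, len(names))
         if adj[i][j]]
print(adj)
print(edges)
```

Each pair listed in `edges` becomes one edge of the undirected graph.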
Undirected Graph based on Adjacency Matrix
Distance
The distance between ShotPut and Discus can be measured by treating them as two points in 50-dimensional space. Standardization makes the variables comparable.
A distance is a function d: M × M → R that satisfies the following conditions:
i) d(x,y) ≥ 0, and d(x,y) = 0 if and only if x = y.
ii) d(x,y) = d(y,x)
iii) d(x,z) ≤ d(x,y) + d(y,z)
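The three conditions can be checked numerically for the Euclidean distance, which is assumed here to be the distance used in these slides. The points are hypothetical.

```python
import math
from itertools import product

def euclidean(x, y):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (1.0, -2.0)]   # hypothetical points
for x, y, z in product(points, repeat=3):
    assert euclidean(x, y) >= 0                                         # i) non-negative
    assert (euclidean(x, y) == 0) == (x == y)                           # i) zero iff equal
    assert euclidean(x, y) == euclidean(y, x)                           # ii) symmetry
    assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z) + 1e-12  # iii) triangle
print("all three conditions hold for the sample points")
```

For the decathlon data, each event would be a point with 50 coordinates, one per athlete.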
Relation between Correlation and Distance for Standardized Data
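For two standardized variables, the squared Euclidean distance is determined by their correlation: d² = 2(n-1)(1-r) when the s.d. uses 1/(n-1), and d² = 2n(1-r) when it uses 1/n. The sketch below checks the first form on hypothetical data; which convention the slide used is not known.

```python
import math

def standardize(xs):
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))   # 1/(n-1) convention
    return [(x - m) / s for x in xs]

def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x = [12.1, 13.4, 11.8, 14.0, 12.9, 13.7]   # hypothetical records for one event
y = [38.0, 42.5, 36.1, 44.2, 39.9, 41.0]   # hypothetical records for another event
n = len(x)
d = euclidean(standardize(x), standardize(y))
r = correlation(x, y)
print("d^2:           ", d ** 2)
print("2(n-1)(1 - r): ", 2 * (n - 1) * (1 - r))
```

A high correlation therefore means a small distance. Under either convention, a distance of 4.18 with n = 50 corresponds to a correlation of roughly 0.82 to 0.83.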
Distance matrix for Standardized Data
The distance between Discus and ShotPut is 4.18, which is the minimum distance among the 10 events.
Multi-Dimensional Scaling method, which makes a map from Distances
Although the original variables lie in a (10-1 = 9)-dimensional subspace of the 50-dimensional space, the MDS method can place all the variables in a 2-dimensional plane at once. Needless to say, some information is lost in the dimension reduction.
Cluster analysis based on Distance
The cluster dendrogram shows the similarity among variables according to the distance matrix computed in the original 50-dimensional space.
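The merges drawn in a dendrogram can be sketched with a small agglomerative procedure. The slide does not state the linkage method; single linkage is used here only because it is short to write. Apart from the ShotPut and Discus distance of 4.18, the distance matrix is hypothetical.

```python
def single_linkage(dist, labels):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    index = {label: i for i, label in enumerate(labels)}
    clusters = [(label,) for label in labels]

    def gap(a, b):
        # Single linkage: distance between the closest members of two clusters.
        return min(dist[index[p]][index[q]] for p in a for q in b)

    merges = []
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        d, i, j = min((gap(a, b), i, j)
                      for i, a in enumerate(clusters)
                      for j, b in enumerate(clusters) if i < j)
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

labels = ["ShotPut", "Discus", "Race100", "Race400"]
dist = [[0.0, 4.18, 9.0, 9.5],     # only 4.18 comes from the slide
        [4.18, 0.0, 9.2, 9.8],
        [9.0, 9.2, 0.0, 5.0],
        [9.5, 9.8, 5.0, 0.0]]
for a, b, d in single_linkage(dist, labels):
    print(a, "+", b, "at distance", d)
```

Each merge corresponds to one junction of the dendrogram, drawn at the height given by the merge distance.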
Classification based on Cluster Analysis
Factor Analysis based on Correlation Matrix
In factor analysis, several covariates are related through common latent (unobserved) factors. When the number of factors is two, the method is similar to Principal Component Analysis.
Classification based on Factor Analysis
Latent factors can be named to help interpret the groups of covariates.
Visualization of Correlation matrix and its classification
The covariates are ordered so that highly correlated ones are adjacent. As a result, blocks along the diagonal of the matrix reveal clusters of covariates. The color shows the magnitude of the correlation.