1
Multivariate Statistical Analysis
Shyh-Kang Jeng
Department of Electrical Engineering / Graduate Institute of Communication / Graduate Institute of Networking and Multimedia
2
What Is Multivariate Analysis?
Statistical methodology to analyze data with measurements on many variables
[Figure: a process turning input into output, subject to controllable factors and uncontrollable factors]
3
Why Learn Multivariate Analysis?
An explanation of a social or physical phenomenon must be tested by gathering and analyzing data
The complexity of most phenomena requires an investigator to collect observations on many different variables
4
Application Examples
Is one product better than the other?
Which factor is the most important in determining the performance of a system?
How should the results be classified into clusters?
What are the relationships between variables?
5
Course Outline
Introduction
Matrix Algebra and Random Vectors
Sample Geometry and Random Samples
Multivariate Normal Distribution
Inference about a Mean Vector
Comparison of Several Multivariate Means
Multivariate Linear Regression Models
6
Course Outline (continued)
Principal Components
Factor Analysis and Inference for Structured Covariance Matrices
Canonical Correlation Analysis
Discrimination and Classification
Clustering, Distance Methods, and Ordination
Multidimensional Scaling*
Structural Equation Modeling*
7
Textbook and Website
R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, 5th ed., Prentice Hall, 2002. (雙葉)
http://cc.ee.ntu.edu.tw/~skjeng/MultivariateAnalysis2006.htm
8
References
J. F. Hair, Jr., B. Black, B. Babin, R. E. Anderson, and R. L. Tatham, Multivariate Data Analysis, 6th ed., Prentice Hall, 2006. (華泰)
D. C. Montgomery, Design and Analysis of Experiments, 6th ed., John Wiley, 2005. (歐亞)
D. Salsburg, 統計,改變了世界 (Chinese translation of The Lady Tasting Tea, translated by 葉偉文), 天下遠見, 2001.
9
References (continued)
張碧波, 推理統計學 (Inferential Statistics), 三民, 1976.
張輝煌 (ed. and trans.), 實驗設計與變異分析 (Experimental Design and Analysis of Variance), 建興, 1986.
10
Time Management
[Figure: 2×2 matrix with axes Emergency and Importance; quadrants labeled I, II, III, IV]
11
Some Important Laws
First things first
80–20 Law
Fast prototyping and evolution
12
Major Uses of Multivariate Analysis
Data reduction or structural simplification
Sorting and grouping
Investigation of the dependence among variables
Prediction
Hypothesis construction and testing
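13
Array of Data
\[
\mathbf{X} = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1k} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2k} & \cdots & x_{2p} \\
\vdots & \vdots & & \vdots & & \vdots \\
x_{j1} & x_{j2} & \cdots & x_{jk} & \cdots & x_{jp} \\
\vdots & \vdots & & \vdots & & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nk} & \cdots & x_{np}
\end{bmatrix}
\]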
14
Descriptive Statistics
Summary numbers to assess the information contained in data
Basic descriptive statistics
– Sample mean
– Sample variance
– Sample standard deviation
– Sample covariance
– Sample correlation coefficient
15
Sample Mean and Sample Variance
\[
\bar{x}_k = \frac{1}{n} \sum_{j=1}^{n} x_{jk}
\]
\[
s_k^2 = s_{kk} = \frac{1}{n} \sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2, \qquad k = 1, 2, \ldots, p
\]
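16
Sample Covariance and Sample Correlation Coefficient
\[
s_{ik} = \frac{1}{n} \sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k), \qquad i = 1, 2, \ldots, p; \; k = 1, 2, \ldots, p
\]
\[
r_{ik} = \frac{s_{ik}}{\sqrt{s_{ii}} \sqrt{s_{kk}}}
= \frac{\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k)}{\sqrt{\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)^2} \sqrt{\sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2}}
\]
\[
s_{ik} = s_{ki}, \qquad r_{ik} = r_{ki}
\]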
17
Standardized Values (or Standardized Scores)
Centered at zero
Unit standard deviation
Sample correlation coefficient can be regarded as a sample covariance of two standardized variables
\[
\frac{x_{jk} - \bar{x}_k}{\sqrt{s_{kk}}}
\]
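The claim that the correlation coefficient is the covariance of two standardized variables can be checked numerically. The following is a minimal sketch, assuming divisor n for variances and covariances; the data values are hypothetical and serve only as an illustration.

```python
import numpy as np

# Hypothetical data: n = 5 observations on p = 2 variables (illustrative only).
X = np.array([[2.0, 10.0],
              [4.0, 14.0],
              [6.0, 13.0],
              [8.0, 19.0],
              [10.0, 24.0]])
n = X.shape[0]

xbar = X.mean(axis=0)                    # sample means
s = ((X - xbar) ** 2).sum(axis=0) / n    # sample variances s_kk (divisor n)
Z = (X - xbar) / np.sqrt(s)              # standardized values

# Standardized columns are centered at zero with unit standard deviation.
z_mean = Z.mean(axis=0)
z_std = np.sqrt((Z ** 2).sum(axis=0) / n)

# Sample covariance of the two standardized variables (divisor n) ...
cov_z = (Z[:, 0] * Z[:, 1]).sum() / n
# ... compared with the sample correlation coefficient of the original variables.
r12 = np.corrcoef(X[:, 0], X[:, 1])[0, 1]
```

Here `cov_z` and `r12` agree up to floating-point error, because the divisor cancels in the ratio that defines the correlation.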
18
Properties of Sample Correlation Coefficient
Value is between -1 and 1
Magnitude measures the strength of the linear association
Sign indicates the direction of the association
Value remains unchanged if all $x_{ji}$'s and $x_{jk}$'s are changed to $y_{ji} = a x_{ji} + b$ and $y_{jk} = c x_{jk} + d$, respectively, provided that the constants $a$ and $c$ have the same sign
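The invariance property can be illustrated with a short numerical sketch. The simulated data, the seed, and the constants a, b, c, d below are arbitrary choices made for this illustration.

```python
import numpy as np

# Simulated data (arbitrary seed and coefficients, for illustration only).
rng = np.random.default_rng(0)
x_i = rng.normal(size=50)
x_k = 0.6 * x_i + rng.normal(size=50)

def corr(u, v):
    """Sample correlation coefficient of two 1-D arrays."""
    return np.corrcoef(u, v)[0, 1]

r = corr(x_i, x_k)
# a and c both positive: correlation unchanged.
r_same_sign = corr(2.0 * x_i + 3.0, 5.0 * x_k - 1.0)
# a and c both negative: correlation unchanged.
r_both_negative = corr(-2.0 * x_i + 3.0, -5.0 * x_k - 1.0)
# a and c of opposite sign: the sign of the correlation flips.
r_opposite_sign = corr(-2.0 * x_i + 3.0, 5.0 * x_k - 1.0)
```

When a and c have opposite signs the magnitude is preserved but the sign reverses, which is why the same-sign condition is needed.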
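19
Arrays of Basic Descriptive Statistics
\[
\bar{\mathbf{x}} = \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{bmatrix}, \qquad
\mathbf{S}_n = \begin{bmatrix}
s_{11} & s_{12} & \cdots & s_{1p} \\
s_{21} & s_{22} & \cdots & s_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
s_{p1} & s_{p2} & \cdots & s_{pp}
\end{bmatrix}, \qquad
\mathbf{R} = \begin{bmatrix}
1 & r_{12} & \cdots & r_{1p} \\
r_{21} & 1 & \cdots & r_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
r_{p1} & r_{p2} & \cdots & 1
\end{bmatrix}
\]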
20
Example
Four receipts from a university bookstore
Variable 1: dollar sales
Variable 2: number of books
\[
\mathbf{X} = \begin{bmatrix} 42 & 4 \\ 52 & 5 \\ 48 & 4 \\ 58 & 3 \end{bmatrix}
\]
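21
Arrays of Basic Descriptive Statistics
\[
\bar{\mathbf{x}} = \begin{bmatrix} 50 \\ 4 \end{bmatrix}, \qquad
\mathbf{S}_n = \begin{bmatrix} 34 & -1.5 \\ -1.5 & 0.5 \end{bmatrix}, \qquad
\mathbf{R} = \begin{bmatrix} 1 & -0.36 \\ -0.36 & 1 \end{bmatrix}
\]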
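The arrays for the bookstore example can be reproduced with a few lines of NumPy. This is a minimal sketch using the four receipts (dollar sales, number of books) and divisor n; the variable names are chosen for this illustration.

```python
import numpy as np

# Four bookstore receipts: column 1 = dollar sales, column 2 = number of books.
X = np.array([[42.0, 4.0],
              [52.0, 5.0],
              [48.0, 4.0],
              [58.0, 3.0]])
n = X.shape[0]

xbar = X.mean(axis=0)        # sample mean vector
D = X - xbar                 # deviations from the means
S_n = D.T @ D / n            # sample covariance matrix with divisor n
d = np.sqrt(np.diag(S_n))    # sample standard deviations
R = S_n / np.outer(d, d)     # sample correlation matrix

print(xbar)
print(S_n)
print(np.round(R, 2))
```

The off-diagonal entries come out negative (s12 = -1.5, r12 ≈ -0.36): receipts with higher dollar sales tend to have fewer books in this small sample. `np.cov(X, rowvar=False, bias=True)` gives the same `S_n`.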
22
Using SAS
Create New Project
– Name → Project
Insert Data
– New: Name → Data1
Change column name
– Right button, select Properties…
Enter data in the data grid
Select Data1 under Project
Analysis → Descriptive
– Summary statistics
– Correlations
23
Summary Statistics
Save
– Personal → Enterprise Guide Sample → Data → Data1.sas7bdat
Columns → Variables to assign → Analysis variables
Statistics → Mean
24
Report on Means
25
Correlations
Delete Summary statistics node
Save
– Personal → Enterprise Guide Sample → Data → Data1.sas7bdat
Columns → Variables to assign → Correlation variables
Correlations → Pearson → Covariances → Show Pearson correlations in results → Divisor for variances (Number of rows)
26
Correlations (continued)
Results → Show results → (uncheck) show statistics for each variable → (uncheck) show significance probabilities associated with correlations
27
Report on Correlations
28
Scatter Plot and Marginal Dot Diagrams
29
Scatter Plot and Marginal Dot Diagrams for Rearranged Data
30
Effect of Unusual Observations
31
Effect of Unusual Observations
\[
r_{12} = \begin{cases}
-0.39 & \text{for all 16 firms} \\
-0.56 & \text{for all firms but Dun \& Bradstreet} \\
-0.39 & \text{for all firms but Time Warner} \\
-0.50 & \text{for all firms but Dun \& Bradstreet and Time Warner}
\end{cases}
\]
32
Paper Quality Measurements
33
Lizard Size Data
*SVL: snout-vent length; HLS: hind limb span
34
3D Scatter Plots of Lizard Data
35
Female Bear Data and Growth Curves
36
Utility Data as Stars
37
Chernoff Faces over Time
Euclidean DistanceEuclidean Distance
Each coordinate contributes equally to tEach coordinate contributes equally to the distancehe distance
2222
211
2121
)()()(),(
),,,(),,,,(
pp
pp
yxyxyxQPd
yyyQxxxP
39
Statistical Distance
Weight coordinates subject to a great deal of variability less heavily than those that are not highly variable
40
Statistical Distance for Uncorrelated Data
\[
P = (x_1, x_2), \qquad O = (0, 0)
\]
\[
x_1^* = x_1 / \sqrt{s_{11}}, \qquad x_2^* = x_2 / \sqrt{s_{22}}
\]
\[
d(O, P) = \sqrt{(x_1^*)^2 + (x_2^*)^2} = \sqrt{\frac{x_1^2}{s_{11}} + \frac{x_2^2}{s_{22}}}
\]
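A small sketch may help contrast the two distances for uncorrelated coordinates. The variances and the two points below are hypothetical values picked so that the first coordinate is the more variable one.

```python
import numpy as np

def euclidean_distance(p):
    """Straight-line distance from the origin; every coordinate counts equally."""
    p = np.asarray(p, dtype=float)
    return np.sqrt(np.sum(p ** 2))

def statistical_distance(p, variances):
    """Distance from the origin after dividing each coordinate by its standard deviation."""
    p = np.asarray(p, dtype=float)
    variances = np.asarray(variances, dtype=float)
    return np.sqrt(np.sum(p ** 2 / variances))

# Hypothetical sample variances: x1 varies more than x2.
s11, s22 = 4.0, 1.0

P_a = (2.0, 0.0)   # 2 units away along the highly variable coordinate
P_b = (0.0, 2.0)   # 2 units away along the less variable coordinate

d_euclid_a = euclidean_distance(P_a)
d_euclid_b = euclidean_distance(P_b)
d_stat_a = statistical_distance(P_a, (s11, s22))
d_stat_b = statistical_distance(P_b, (s11, s22))
```

Both points are at Euclidean distance 2 from the origin, but the statistical distance is 1 for `P_a` and 2 for `P_b`, since a deviation along the more variable coordinate is weighted less heavily.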
41
Ellipse of Constant Statistical Distance for Uncorrelated Data
[Figure: ellipse of constant statistical distance $c$ centered at the origin 0, crossing the $x_1$ axis at $\pm c\sqrt{s_{11}}$ and the $x_2$ axis at $\pm c\sqrt{s_{22}}$]
42
Scatter Plot for Correlated Measurements
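43
Statistical Distance under Rotated Coordinate System
\[
O = (0, 0), \qquad P = (\tilde{x}_1, \tilde{x}_2)
\]
\[
d(O, P) = \sqrt{\frac{\tilde{x}_1^2}{\tilde{s}_{11}} + \frac{\tilde{x}_2^2}{\tilde{s}_{22}}}
\]
\[
\tilde{x}_1 = x_1 \cos\theta + x_2 \sin\theta, \qquad \tilde{x}_2 = -x_1 \sin\theta + x_2 \cos\theta
\]
\[
d(O, P) = \sqrt{a_{11} x_1^2 + 2 a_{12} x_1 x_2 + a_{22} x_2^2}
\]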
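As a worked step (a sketch, assuming the rotation $\tilde{x}_1 = x_1\cos\theta + x_2\sin\theta$, $\tilde{x}_2 = -x_1\sin\theta + x_2\cos\theta$ and sample variances $\tilde{s}_{11}$, $\tilde{s}_{22}$ measured along the rotated axes), substituting the rotation into the rotated-axis distance and collecting terms gives the coefficients explicitly:

```latex
\begin{aligned}
d^2(O,P) &= \frac{(x_1\cos\theta + x_2\sin\theta)^2}{\tilde{s}_{11}}
          + \frac{(-x_1\sin\theta + x_2\cos\theta)^2}{\tilde{s}_{22}} \\
         &= a_{11}x_1^2 + 2a_{12}x_1x_2 + a_{22}x_2^2, \\[4pt]
a_{11} &= \frac{\cos^2\theta}{\tilde{s}_{11}} + \frac{\sin^2\theta}{\tilde{s}_{22}}, \qquad
a_{22} = \frac{\sin^2\theta}{\tilde{s}_{11}} + \frac{\cos^2\theta}{\tilde{s}_{22}}, \\[4pt]
a_{12} &= \cos\theta\,\sin\theta\left(\frac{1}{\tilde{s}_{11}} - \frac{1}{\tilde{s}_{22}}\right).
\end{aligned}
```

The cross-product coefficient $a_{12}$ vanishes when $\theta = 0$ or when $\tilde{s}_{11} = \tilde{s}_{22}$, which recovers the uncorrelated-data form.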
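44
General Statistical Distance
\[
P = (x_1, x_2, \ldots, x_p), \qquad O = (0, 0, \ldots, 0), \qquad Q = (y_1, y_2, \ldots, y_p)
\]
\[
d(O, P) = \sqrt{a_{11} x_1^2 + a_{22} x_2^2 + \cdots + a_{pp} x_p^2 + 2 a_{12} x_1 x_2 + 2 a_{13} x_1 x_3 + \cdots + 2 a_{p-1,p} x_{p-1} x_p}
\]
\[
\begin{aligned}
d(P, Q) = \big[\, & a_{11} (x_1 - y_1)^2 + a_{22} (x_2 - y_2)^2 + \cdots + a_{pp} (x_p - y_p)^2 \\
& + 2 a_{12} (x_1 - y_1)(x_2 - y_2) + 2 a_{13} (x_1 - y_1)(x_3 - y_3) \\
& + \cdots + 2 a_{p-1,p} (x_{p-1} - y_{p-1})(x_p - y_p) \,\big]^{1/2}
\end{aligned}
\]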
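The general distance is a quadratic form in the coordinate differences, which makes it easy to compute. The sketch below assumes the coefficients $a_{ik}$ are collected in a symmetric positive definite matrix `A`; the particular matrix and points are hypothetical.

```python
import numpy as np

def statistical_distance(x, y, A):
    """d(P, Q) written as the quadratic form sqrt((x - y)' A (x - y))."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.sqrt(diff @ A @ diff)

def expanded_distance(x, y, A):
    """The same distance, summed term by term as in the expanded formula."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    p = diff.size
    total = sum(A[i, i] * diff[i] ** 2 for i in range(p))
    total += sum(2.0 * A[i, k] * diff[i] * diff[k]
                 for i in range(p) for k in range(i + 1, p))
    return np.sqrt(total)

# Hypothetical symmetric, positive definite coefficient matrix (p = 3).
A = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.5]])
x = [1.0, 2.0, 3.0]
y = [0.0, 1.0, 5.0]

d_quadratic = statistical_distance(x, y, A)
d_expanded = expanded_distance(x, y, A)
d_identity = statistical_distance(x, y, np.eye(3))   # reduces to Euclidean distance
```

With `A` equal to the identity matrix the statistical distance reduces to the Euclidean distance.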
45
Necessity of Statistical Distance
46
Necessary Conditions for Statistical Distance Definitions
\[
\begin{aligned}
& d(P, Q) = d(Q, P) \\
& d(P, Q) > 0 \quad \text{if } P \neq Q \\
& d(P, Q) = 0 \quad \text{if } P = Q \\
& d(P, Q) \leq d(P, R) + d(R, Q) \quad \text{(triangle inequality)}
\end{aligned}
\]
47
Reading Assignments
Textbook
– pp. 50-60
– pp. 84-97
48
Outliers
Observations with a unique combination of characteristics identifiable as distinctly different from the other observations
Impact
– Can limit the generalizability of any type of analysis
– Each outlier must be viewed in light of how representative it is of the population before it is retained or deleted
49
Sources of Outliers
Procedural error
Extraordinary event
– e.g., a hurricane in a daily rainfall analysis
Extraordinary observations
– Researcher has no explanation
Unique in their combinations of values across the variables
– Falls within the ordinary range of values on each variable
– Retain it unless proved invalid
50
Rules of Thumb for Univariate Outlier Detection
Small samples (80 or fewer observations)
– Cases with standard scores of 2.5 or greater
Larger sample sizes
– Threshold increases up to 4
Standard scores not used
– Cases falling outside the range of 2.5 versus 4 standard deviations, depending on the sample size
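The univariate rule of thumb can be sketched as follows. The function name, the flat cutoff of 4 for every sample above 80 observations, the use of the n − 1 divisor for the standard deviation, and the data values are all assumptions made for this illustration.

```python
import numpy as np

def univariate_outliers(x, small_cutoff=2.5, large_cutoff=4.0, small_n=80):
    """Flag cases whose absolute standard score reaches the cutoff.

    Assumption: a cutoff of 2.5 for n <= 80 and a flat 4.0 for larger samples;
    the rule of thumb only says the threshold rises "up to 4".
    """
    x = np.asarray(x, dtype=float)
    cutoff = small_cutoff if x.size <= small_n else large_cutoff
    z = (x - x.mean()) / x.std(ddof=1)      # standard scores
    return np.flatnonzero(np.abs(z) >= cutoff), z

# Hypothetical measurements: 19 ordinary values and one extreme value (the last).
x = [10, 11, 9, 10, 12, 11, 10, 9, 10, 11,
     10, 12, 9, 10, 11, 10, 9, 11, 10, 30]

outlier_idx, z = univariate_outliers(x)
print(outlier_idx)
```

Only the last case is flagged here; its standard score is a little above 4, while every other case stays well below 2.5.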
51
Rules of Thumb for Bivariate and Multivariate Outlier Detection
Bivariate
– Use scatterplots with confidence intervals at a specified alpha level
Multivariate
– Threshold levels for the $D^2/df$ measure should be conservative (0.005 or 0.001), resulting in values of 2.5 (small samples) versus 3 or 4 in larger samples
– $D^2$: Mahalanobis measure
– $df$: degrees of freedom
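The multivariate rule can be sketched with a direct computation of the Mahalanobis measure. Two assumptions are made here and are not stated on the slide: $df$ is taken to be the number of variables $p$, and the center and covariance matrix are estimated from the same sample (divisor n − 1). The data are hypothetical.

```python
import numpy as np

def mahalanobis_d2_over_df(X):
    """Squared Mahalanobis distance of each case from the centroid, divided by df.

    Assumptions: df equals the number of variables p, and the centroid and
    covariance matrix (divisor n - 1) are estimated from X itself.
    """
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    diff = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", diff, S_inv, diff)   # diff_i' S^{-1} diff_i
    return d2 / p

# Hypothetical data: 30 cases where x2 tracks x1 closely, plus one case (the last)
# whose values are ordinary one at a time but unusual in combination.
x1 = np.linspace(-3.0, 3.0, 30)
x2 = x1 + 0.2 * (-1.0) ** np.arange(30)
X = np.vstack([np.column_stack([x1, x2]), [[2.0, -2.0]]])

ratio = mahalanobis_d2_over_df(X)
flagged = np.flatnonzero(ratio >= 2.5)   # small-sample threshold
print(flagged)
```

The last case lies inside the ordinary range of both variables, so a univariate screen would not catch it, but it breaks the joint pattern and is the only one flagged.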
52
Outlier Description and Profiling
Generate profiles of each outlier observation
Identify the variable(s) responsible for its being an outlier
Discriminant analysis or multiple regression can be applied to identify the differences between outliers and other observations
53
Examples of Outliers
54
Example of Bivariate Outliers
55
Missing Data
Valid values on one or more variables are not available for analysis
Affects the generalizability of the results
Remedy is applied
– to keep the distribution of values as close as possible to the original
56
Missing Data Process
Systematic event that leads to missing values
– Event external to the respondent (data entry errors or data collection problems)
– Action on the part of the respondent (such as refusal to answer)
Causes some patterns and relationships underlying the missing data
57
Impact of Missing Data
Practical impact
– Reduction of the sample size available for analysis (adequate → inadequate)
Substantive perspective
– Statistical results based on data with a non-random missing data process could be biased
– e.g., individuals who did not provide their income tended to be almost exclusively in the high income bracket
58
Hypothetical Example of Missing Data
59
Practical Considerations
Complete data required
– Only 5 cases are usable (too few)
A possible remedy: eliminate $V_3$
– 12 cases have complete data
– Eliminate cases 3, 13, 15
– Total amount of missing data is reduced to 7.4% of all values
60
Substantive Impact
5 cases still have missing data, all in $V_4$
These cases are compared with those with valid $V_4$ data
– The 5 cases with missing $V_4$ data have the five lowest scores on $V_2$
– Affects any analysis in which $V_2$ and $V_4$ are both included
– e.g., mean for $V_2$ = 8.4 if cases with missing $V_4$ data are excluded, 7.8 if included
61
Dealing with Missing Data
Determine the type of missing data
Determine the extent of missing data
Diagnose the randomness of the missing data
Select the imputation method
62
Ignorable Missing Data
Expected and part of the research design
The missing data process is operating at random
Specific remedies are not needed
63
Examples of Ignorable Missing Data
Taking a sample of the population rather than gathering data from the entire population
Missing data due to the design of the data collection instrument
– e.g., respondents skip sections of questions that are not applicable
64
Missing Data Process Known to the Researchers
Can be identified due to procedural factors
– Data entry errors
– Disclosure restrictions
– Failure to complete the entire questionnaire
– Morbidity of the respondents
Little control over the process
Some remedies may be applicable
65
Missing Data Process Unknown to the Researchers
Most often related directly to the respondent
Examples
– Refusal to respond to certain questions
– Respondents have no opinion or insufficient knowledge to answer the question
Should be anticipated and minimized in the research design and data collection stages
Some remedies may be applicable
66
Assessing the Extent and Patterns of Missing Data
Tabulate
– Percentage of variables with missing data for each case
– Number of cases with missing data for each variable
Look for non-random patterns in the data
Determine the number of cases without missing data on any variable
– Sample size available for analysis if remedies are not applied
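The tabulations listed above can be sketched with NumPy, using `np.nan` to mark a missing value. The small data array is hypothetical and exists only to show the bookkeeping.

```python
import numpy as np

# Hypothetical data: 6 cases (rows) by 4 variables (columns); np.nan marks a missing value.
X = np.array([
    [1.0, 2.0,    np.nan, 4.0],
    [2.0, np.nan, np.nan, 5.0],
    [3.0, 1.0,    2.0,    6.0],
    [4.0, 3.0,    np.nan, np.nan],
    [5.0, 2.0,    1.0,    7.0],
    [6.0, 4.0,    3.0,    8.0],
])
missing = np.isnan(X)

# Percentage of variables with missing data for each case.
pct_missing_per_case = 100.0 * missing.mean(axis=1)
# Number of cases with missing data for each variable.
n_missing_per_variable = missing.sum(axis=0)
# Number of cases with no missing data on any variable
# (the sample size available if no remedy is applied).
n_complete_cases = int((~missing.any(axis=1)).sum())
# Overall percentage of missing values.
pct_missing_overall = 100.0 * missing.mean()

print(pct_missing_per_case)
print(n_missing_per_variable)
print(n_complete_cases, round(pct_missing_overall, 1))
```

In this toy array the third variable accounts for most of the missing values, which is the kind of concentration the tabulation is meant to expose.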
67
Rules of Thumb to Ignore Missing Data
Missing data under 10% for an individual case can generally be ignored, except when the missing data occur in a specific non-random fashion
The number of cases with no missing data must be sufficient for the selected analysis technique if replacement values will not be substituted (imputed) for the missing data
68
Rules of Thumb for Deletions
Variables with as little as 15% missing data are candidates for deletion
– Higher levels of missing data (20% to 30%) can often be remedied
Be sure the overall decrease in missing data is large enough to justify deleting an individual variable or case
69
Rules of Thumb for Deletions (continued)
Cases with missing data for dependent variables typically are deleted
When deleting a variable, ensure that alternative variables, hopefully highly correlated, are available to represent the intent of the original variable
70
Levels of Randomness
Missing at Random (MAR)
– e.g., missing data on gender are random for both males and females, but missing data on household income occur at a higher frequency for males than for females
Missing Completely at Random (MCAR)
– e.g., missing data on household income are randomly missing in equal proportions for both males and females
71
Modeling Approach for MAR Process
Involves maximum likelihood estimation techniques
– e.g., EM approach
72
Imputation Using Only Valid Data
Complete Case Approach
– Include only those observations with complete data
Using All-Available Data
– Use only valid data
– Imputes the distribution characteristics (e.g., means or standard deviations) or relationships (e.g., correlations) from every valid value
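The difference between the two approaches shows up clearly when estimating correlations. The sketch below assumes `np.nan` marks missing values; the function names and the data are hypothetical.

```python
import numpy as np

def complete_case_corr(X):
    """Correlation matrix from cases with no missing values on any variable."""
    complete = ~np.isnan(X).any(axis=1)
    return np.corrcoef(X[complete], rowvar=False), int(complete.sum())

def all_available_corr(X):
    """Correlation matrix where each pair of variables uses every case valid on both."""
    p = X.shape[1]
    R = np.eye(p)
    counts = np.zeros((p, p), dtype=int)
    for i in range(p):
        for k in range(p):
            both = ~np.isnan(X[:, i]) & ~np.isnan(X[:, k])
            counts[i, k] = both.sum()       # cases contributing to this pair
            if i != k:
                R[i, k] = np.corrcoef(X[both, i], X[both, k])[0, 1]
    return R, counts

# Hypothetical data: 6 cases by 3 variables with scattered missing values.
X = np.array([
    [1.0, 2.0,    np.nan],
    [2.0, np.nan, 1.0],
    [3.0, 5.0,    2.0],
    [4.0, 4.0,    np.nan],
    [5.0, 7.0,    6.0],
    [6.0, 8.0,    5.0],
])

R_complete, n_complete = complete_case_corr(X)
R_available, pair_counts = all_available_corr(X)
print(n_complete)
print(pair_counts)
```

The complete case approach keeps only 3 of the 6 cases, while the all-available approach uses 5, 4, and 3 cases for the three variable pairs. Because each entry rests on a different subset of cases, the resulting matrix is not guaranteed to be a valid correlation matrix.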
73
Imputation Using Known Replacement Values
Hot deck imputation
– Use the value from another observation in the sample that is deemed similar
Cold deck imputation
– Derive the replacement value from an external source (e.g., prior studies, other samples, etc.)
Case substitution
– Choose another nonsampled observation
74
Imputation by Calculating Replacement Values
Mean Substitution
– Use the mean value of that variable calculated from all valid responses
Regression imputation
– Predict the missing values of a variable based on its relationship to other variables in the data set
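Both calculated-replacement methods can be sketched in a few lines. The function names, the single fully observed predictor, the straight-line fit, and the data are assumptions made for this illustration.

```python
import numpy as np

def mean_substitution(y):
    """Replace missing values with the mean of all valid responses."""
    y = np.asarray(y, dtype=float).copy()
    y[np.isnan(y)] = np.nanmean(y)
    return y

def regression_imputation(y, x):
    """Replace missing y values with predictions from a straight-line fit on x.

    Assumption: x is fully observed and a simple linear relationship is adequate.
    """
    y = np.asarray(y, dtype=float).copy()
    x = np.asarray(x, dtype=float)
    observed = ~np.isnan(y)
    slope, intercept = np.polyfit(x[observed], y[observed], deg=1)
    y[~observed] = intercept + slope * x[~observed]
    return y

# Hypothetical data: y follows 2 * x, and the last y value is missing.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, np.nan]

y_mean = mean_substitution(y)
y_regression = regression_imputation(y, x)
print(y_mean[-1], y_regression[-1])
```

Mean substitution fills in 5, the average of the valid responses, while regression imputation fills in about 10 by following the relationship with `x`. Mean substitution ignores that relationship, which tends to shrink the variance and weaken the observed correlations.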
75
Summary of Imputation Using Only Valid Data
76
Summary of Imputation Using Known Replacement Values
77
Summary of Imputation by Calculating Replacement Values
78
Summary of Model-Based Methods
79
Rules of Thumb for Imputation of Missing Data
Under 10%
– Any imputation method
10% to 20%
– All-available, hot deck, case substitution, and regression for MCAR
– Model-based for MAR
Over 20%
– Regression for MCAR
– Model-based for MAR
80
Summary Statistics of Missing Data for Original Sample
81
Comparison of Four Imputation Methods
82
Comparison of Four Imputation Methods (continued)