1 Multivariate Statistical Analysis Shyh-Kang Jeng Department of Electrical Engineering/ Graduate Institute of Communication/ Graduate Institute of Networking.

1

Multivariate Statistical Analysis

Shyh-Kang Jeng
Department of Electrical Engineering / Graduate Institute of Communication / Graduate Institute of Networking and Multimedia

2

What Is Multivariate Analysis?

Statistical methodology for analyzing data with measurements on many variables

[Figure: a process with an input and an output, acted on by controllable factors and uncontrollable factors]

3

Why Learn Multivariate Analysis?

Explanation of a social or physical phenomenon must be tested by gathering and analyzing data
Complexities of most phenomena require an investigator to collect observations on many different variables

4

Application Examples

Is one product better than the other?
Which factor is the most important in determining the performance of a system?
How to classify the results into clusters?
What are the relationships between variables?

5

Course Outline

Introduction
Matrix Algebra and Random Vectors
Sample Geometry and Random Samples
Multivariate Normal Distribution
Inference about a Mean Vector
Comparison of Several Multivariate Means
Multivariate Linear Regression Models

6

Course Outline

Principal Components
Factor Analysis and Inference for Structured Covariance Matrices
Canonical Correlation Analysis
Discrimination and Classification
Clustering, Distance Methods, and Ordination
Multidimensional Scaling*
Structural Equation Modeling*

7

Textbook and Website

R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, 5th ed., Prentice Hall, 2002. (雙葉)
http://cc.ee.ntu.edu.tw/~skjeng/MultivariateAnalysis2006.htm

8

References

J. F. Hair, Jr., B. Black, B. Babin, R. E. Anderson, and R. L. Tatham, Multivariate Data Analysis, 6th ed., Prentice Hall, 2006. (華泰)
D. C. Montgomery, Design and Analysis of Experiments, 6th ed., John Wiley, 2005. (歐亞)
D. Salsburg, 統計，改變了世界 (Statistics Changed the World), translated by 葉偉文, 天下遠見, 2001.

9

References

張碧波, 推理統計學 (Inferential Statistics), 三民, 1976.
張輝煌 (ed. and trans.), 實驗設計與變異分析 (Experimental Design and Analysis of Variance), 建興, 1986.

10

Time Management

[Figure: two-by-two matrix with axes Emergency and Importance, dividing tasks into quadrants I, II, III, and IV]

11

Some Important Laws

First things first
80–20 Law
Fast prototyping and evolution

12

Major Uses of Multivariate Analysis

Data reduction or structural simplification
Sorting and grouping
Investigation of the dependence among variables
Prediction
Hypothesis construction and testing

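13

Array of Data

$$
\mathbf{X} =
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1k} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2k} & \cdots & x_{2p} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
x_{j1} & x_{j2} & \cdots & x_{jk} & \cdots & x_{jp} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nk} & \cdots & x_{np}
\end{bmatrix}
$$

Here $x_{jk}$ is the measurement of the $k$-th variable on the $j$-th item, for $n$ items and $p$ variables.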

14

Descriptive Statistics

Summary numbers to assess the information contained in data
Basic descriptive statistics
– Sample mean
– Sample variance
– Sample standard deviation
– Sample covariance
– Sample correlation coefficient

15

Sample Mean and Sample Variance

$$
\bar{x}_k = \frac{1}{n}\sum_{j=1}^{n} x_{jk}
$$

$$
s_k^2 = s_{kk} = \frac{1}{n}\sum_{j=1}^{n}\left(x_{jk}-\bar{x}_k\right)^2, \qquad k = 1, 2, \ldots, p
$$

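16

Sample Covariance and Sample Correlation Coefficient

$$
s_{ik} = \frac{1}{n}\sum_{j=1}^{n}\left(x_{ji}-\bar{x}_i\right)\left(x_{jk}-\bar{x}_k\right), \qquad i = 1, 2, \ldots, p;\; k = 1, 2, \ldots, p
$$

$$
r_{ik} = \frac{s_{ik}}{\sqrt{s_{ii}}\sqrt{s_{kk}}}
= \frac{\sum_{j=1}^{n}\left(x_{ji}-\bar{x}_i\right)\left(x_{jk}-\bar{x}_k\right)}
{\sqrt{\sum_{j=1}^{n}\left(x_{ji}-\bar{x}_i\right)^2}\,\sqrt{\sum_{j=1}^{n}\left(x_{jk}-\bar{x}_k\right)^2}}
$$

$$
s_{ik} = s_{ki}, \qquad r_{ik} = r_{ki}
$$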

17

Standardized Values (or Standardized Scores)

$$
\frac{x_{jk}-\bar{x}_k}{\sqrt{s_{kk}}}
$$

Centered at zero
Unit standard deviation
Sample correlation coefficient can be regarded as a sample covariance of two standardized variables
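The statement that the correlation coefficient is the covariance of two standardized variables can be checked numerically. The sketch below is illustrative only: the two short data lists and the helper names (`mean`, `cov_n`, `standardize`) are assumptions introduced here, and the divisor n is used for the covariance.

```python
from math import sqrt, isclose

def mean(v):
    return sum(v) / len(v)

def cov_n(u, v):
    """Sample covariance with divisor n."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

def standardize(v):
    """Standardized values (x_jk - xbar_k) / sqrt(s_kk)."""
    m, s = mean(v), sqrt(cov_n(v, v))
    return [(a - m) / s for a in v]

x1 = [42, 52, 48, 58]   # illustrative values
x2 = [4, 5, 4, 3]       # illustrative values
z1, z2 = standardize(x1), standardize(x2)

r12 = cov_n(x1, x2) / sqrt(cov_n(x1, x1) * cov_n(x2, x2))
print(isclose(mean(z1), 0.0, abs_tol=1e-12))  # standardized values are centered at zero
print(isclose(cov_n(z1, z1), 1.0))            # and have unit variance
print(isclose(cov_n(z1, z2), r12))            # their covariance equals r12
```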

18

Properties of Sample Correlation Coefficient

Value is between -1 and 1
Magnitude measures the strength of the linear association
Sign indicates the direction of the association
Value remains unchanged if all $x_{ji}$'s and $x_{jk}$'s are changed to $y_{ji} = a x_{ji} + b$ and $y_{jk} = c x_{jk} + d$, respectively, provided that the constants $a$ and $c$ have the same sign
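The invariance property can be illustrated with a small numerical check. This is a minimal sketch: the data and the constants `a`, `b`, `c`, `d` are arbitrary illustrative choices, and `corr` is a helper written for this example.

```python
from math import sqrt, isclose

def corr(u, v):
    """Sample correlation coefficient of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    suu = sum((p - mu) ** 2 for p in u)
    svv = sum((q - mv) ** 2 for q in v)
    return suv / sqrt(suu * svv)

x_i = [42, 52, 48, 58]   # illustrative values
x_k = [4, 5, 4, 3]       # illustrative values
a, b, c, d = 2.0, 10.0, 0.5, -3.0   # a and c have the same sign

y_i = [a * x + b for x in x_i]
y_k = [c * x + d for x in x_k]

r_original = corr(x_i, x_k)
r_same_sign = corr(y_i, y_k)                               # unchanged
r_opposite = corr([-a * x + b for x in x_i], y_k)          # sign flips

print(isclose(r_same_sign, r_original))
print(isclose(r_opposite, -r_original))
```

When `a` and `c` have opposite signs, the magnitude is preserved but the sign of the correlation reverses, which is why the slide adds the same-sign condition.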

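19

Arrays of Basic Descriptive Statistics

$$
\bar{\mathbf{x}} =
\begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{bmatrix},
\qquad
\mathbf{S}_n =
\begin{bmatrix}
s_{11} & s_{12} & \cdots & s_{1p} \\
s_{21} & s_{22} & \cdots & s_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
s_{p1} & s_{p2} & \cdots & s_{pp}
\end{bmatrix}
$$

$$
\mathbf{R} =
\begin{bmatrix}
1 & r_{12} & \cdots & r_{1p} \\
r_{21} & 1 & \cdots & r_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
r_{p1} & r_{p2} & \cdots & 1
\end{bmatrix}
$$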

20

Example

Four receipts from a university bookstore
Variable 1: dollar sales
Variable 2: number of books

$$
\mathbf{X} =
\begin{bmatrix}
42 & 4 \\
52 & 5 \\
48 & 4 \\
58 & 3
\end{bmatrix}
$$

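21

Arrays of Basic Descriptive Statistics

$$
\bar{\mathbf{x}} = \begin{bmatrix} 50 \\ 4 \end{bmatrix},
\qquad
\mathbf{S}_n = \begin{bmatrix} 34 & -1.5 \\ -1.5 & 0.5 \end{bmatrix},
\qquad
\mathbf{R} = \begin{bmatrix} 1 & -0.36 \\ -0.36 & 1 \end{bmatrix}
$$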
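The arrays for the bookstore example can be reproduced directly from the definitions of the sample mean, covariance (divisor n), and correlation. The following is a minimal sketch in plain Python; the variable names are choices made for this illustration.

```python
# Bookstore receipts: (dollar sales, number of books).
X = [(42, 4), (52, 5), (48, 4), (58, 3)]
n = len(X)
p = len(X[0])

# Sample mean of each variable.
xbar = [sum(row[k] for row in X) / n for k in range(p)]

# Sample covariance matrix S_n, using divisor n.
S = [[sum((row[i] - xbar[i]) * (row[k] - xbar[k]) for row in X) / n
      for k in range(p)]
     for i in range(p)]

# Sample correlation matrix R, r_ik = s_ik / (sqrt(s_ii) * sqrt(s_kk)).
R = [[S[i][k] / (S[i][i] ** 0.5 * S[k][k] ** 0.5) for k in range(p)]
     for i in range(p)]

print(xbar)
print(S)
print([[round(r, 2) for r in row] for row in R])
```

Dollar sales and number of books are negatively correlated in this small sample, which the off-diagonal entries of both arrays reflect.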

22

Using SAS

Create New Project
– Name → Project
Insert Data
– New: Name → Data1
Change column name
– Right button, select Properties…
Enter data in the data grid
Select Data1 under Project
Analysis → Descriptive
– Summary statistics
– Correlations

23

Summary Statistics

Save
– Personal → Enterprise Guide Sample → Data → Data1.sas7bdat
Columns → Variables to assign → Analysis variables
Statistics → Mean

24

Report on Means

25

Correlations

Delete Summary statistics node
Save
– Personal → Enterprise Guide Sample → Data → Data1.sas7bdat
Columns → Variables to assign → Correlation variables
Correlations → Pearson
– Covariances
– Show Pearson correlations in results
– Divisor for variances (Number of rows)

26

Correlations

Results → Show results
– (uncheck) Show statistics for each variable
– (uncheck) Show significance probabilities associated with correlations

27

Report on Correlations

28

Scatter Plot and Marginal Dot Diagrams

29

Scatter Plot and Marginal Dot Diagrams for Rearranged Data

30

Effect of Unusual Observations

31

Effect of Unusual Observations

$$
r_{12} =
\begin{cases}
0.39 & \text{for all 16 firms} \\
0.56 & \text{for all firms but Dun \& Bradstreet} \\
0.39 & \text{for all firms but Time Warner} \\
0.50 & \text{for all firms but Dun \& Bradstreet and Time Warner}
\end{cases}
$$

32

Paper Quality Measurements

33

Lizard Size Data

*SVL: snout-vent length; HLS: hind limb span

34

3D Scatter Plots of Lizard Data

35

Female Bear Data and Growth Curves

36

Utility Data as Stars

37

Chernoff Faces over Time

38

Euclidean Distance

Each coordinate contributes equally to the distance

$$
P = (x_1, x_2, \ldots, x_p), \qquad Q = (y_1, y_2, \ldots, y_p)
$$

$$
d(P, Q) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_p - y_p)^2}
$$

39

Statistical Distance

Weight coordinates subject to a great deal of variability less heavily than those that are not highly variable

40

Statistical Distance for Uncorrelated Data

$$
P = (x_1, x_2), \qquad O = (0, 0)
$$

$$
x_1^* = x_1 / \sqrt{s_{11}}, \qquad x_2^* = x_2 / \sqrt{s_{22}}
$$

$$
d(O, P) = \sqrt{(x_1^*)^2 + (x_2^*)^2} = \sqrt{\frac{x_1^2}{s_{11}} + \frac{x_2^2}{s_{22}}}
$$
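To see how the weighting changes the ranking of points, the sketch below compares Euclidean and statistical distance from the origin for two points that are equally far away in the Euclidean sense. The variances and points are hypothetical values chosen for illustration, and the function names are not from the course material.

```python
from math import sqrt

def euclidean_from_origin(point):
    """Ordinary straight-line distance from the origin."""
    return sqrt(sum(x * x for x in point))

def statistical_from_origin(point, variances):
    """d(O, P) = sqrt(x1^2/s11 + x2^2/s22 + ...) for uncorrelated coordinates."""
    return sqrt(sum(x * x / s for x, s in zip(point, variances)))

variances = [4.0, 1.0]            # hypothetical s11 and s22
p_a, p_b = (2.0, 0.0), (0.0, 2.0)

print(euclidean_from_origin(p_a), euclidean_from_origin(p_b))
print(statistical_from_origin(p_a, variances), statistical_from_origin(p_b, variances))
```

Because the first coordinate is more variable, a displacement of 2 along it counts for less than the same displacement along the second coordinate.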

41

Ellipse of Constant Statistical Distance for Uncorrelated Data

[Figure: ellipse centered at the origin 0 in the ($x_1$, $x_2$) plane, crossing the $x_1$ axis at $\pm c\sqrt{s_{11}}$ and the $x_2$ axis at $\pm c\sqrt{s_{22}}$]

42

Scatter Plot for Correlated Measurements

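43

Statistical Distance under Rotated Coordinate System

$$
O = (0, 0), \qquad P = (\tilde{x}_1, \tilde{x}_2)
$$

$$
d(O, P) = \sqrt{\frac{\tilde{x}_1^2}{\tilde{s}_{11}} + \frac{\tilde{x}_2^2}{\tilde{s}_{22}}}
$$

$$
\tilde{x}_1 = x_1 \cos\theta + x_2 \sin\theta, \qquad
\tilde{x}_2 = -x_1 \sin\theta + x_2 \cos\theta
$$

$$
d(O, P) = \sqrt{a_{11} x_1^2 + 2 a_{12} x_1 x_2 + a_{22} x_2^2}
$$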
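The step from the rotated coordinates to the quadratic form can be made explicit by substituting the rotation into the distance and collecting terms. The expressions below are a worked sketch written in terms of the rotated-axis variances $\tilde{s}_{11}$ and $\tilde{s}_{22}$; the slide itself does not list the coefficients.

```latex
\begin{aligned}
\frac{\tilde{x}_1^2}{\tilde{s}_{11}} + \frac{\tilde{x}_2^2}{\tilde{s}_{22}}
&= \frac{(x_1\cos\theta + x_2\sin\theta)^2}{\tilde{s}_{11}}
 + \frac{(-x_1\sin\theta + x_2\cos\theta)^2}{\tilde{s}_{22}} \\
&= a_{11}x_1^2 + 2a_{12}x_1x_2 + a_{22}x_2^2, \\[4pt]
a_{11} &= \frac{\cos^2\theta}{\tilde{s}_{11}} + \frac{\sin^2\theta}{\tilde{s}_{22}}, \qquad
a_{22} = \frac{\sin^2\theta}{\tilde{s}_{11}} + \frac{\cos^2\theta}{\tilde{s}_{22}}, \qquad
a_{12} = \cos\theta\sin\theta\left(\frac{1}{\tilde{s}_{11}} - \frac{1}{\tilde{s}_{22}}\right).
\end{aligned}
```

When $\theta = 0$ the cross term vanishes and the formula reduces to the uncorrelated case.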

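44

General Statistical Distance

$$
P = (x_1, x_2, \ldots, x_p), \qquad O = (0, 0, \ldots, 0), \qquad Q = (y_1, y_2, \ldots, y_p)
$$

$$
d(O, P) = \left[ a_{11} x_1^2 + a_{22} x_2^2 + \cdots + a_{pp} x_p^2
+ 2 a_{12} x_1 x_2 + 2 a_{13} x_1 x_3 + \cdots + 2 a_{p-1,p} x_{p-1} x_p \right]^{1/2}
$$

$$
\begin{aligned}
d(P, Q) = \big[\, & a_{11}(x_1 - y_1)^2 + a_{22}(x_2 - y_2)^2 + \cdots + a_{pp}(x_p - y_p)^2 \\
& + 2 a_{12}(x_1 - y_1)(x_2 - y_2) + 2 a_{13}(x_1 - y_1)(x_3 - y_3) + \cdots \\
& + 2 a_{p-1,p}(x_{p-1} - y_{p-1})(x_p - y_p) \,\big]^{1/2}
\end{aligned}
$$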
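The general distance is a quadratic form in the coordinate differences, so it can be evaluated with a double sum over a symmetric coefficient matrix. The sketch below assumes the coefficients are supplied as a symmetric positive definite matrix `A`; the matrix values and the function name are hypothetical.

```python
from math import sqrt, isclose

def general_distance(p, q, A):
    """sqrt((p - q)' A (p - q)) for a symmetric positive definite matrix A."""
    d = [a - b for a, b in zip(p, q)]
    n = len(d)
    # Summing over all (i, k) counts each off-diagonal pair twice,
    # which produces the 2 * a_ik * d_i * d_k cross terms.
    return sqrt(sum(A[i][k] * d[i] * d[k] for i in range(n) for k in range(n)))

identity = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 0.5], [0.5, 1.0]]      # hypothetical coefficients a11, a12, a22
origin = (0.0, 0.0)

d_euclid = general_distance((3.0, 4.0), origin, identity)  # Euclidean special case
d_general = general_distance((1.0, 1.0), origin, A)        # 1 + 1 + 2 * 0.5 under the root
print(d_euclid, d_general)
```

With the identity matrix the function reduces to Euclidean distance; with a diagonal matrix of reciprocal variances it reduces to the statistical distance for uncorrelated data.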

45

Necessity of Statistical Distance

46

Necessary Conditions for Statistical Distance Definitions

$$
\begin{aligned}
& d(P, Q) = d(Q, P) \\
& d(P, Q) > 0 \quad \text{if } P \neq Q \\
& d(P, Q) = 0 \quad \text{if } P = Q \\
& d(P, Q) \le d(P, R) + d(R, Q) \quad \text{(triangle inequality)}
\end{aligned}
$$

47

Reading Assignments

Textbook
– pp. 50-60
– pp. 84-97

48

Outliers

Observations with a unique combination of characteristics identifiable as distinctly different from the other observations
Impact
– Limits the generalizability of any type of analysis
– Whether an outlier is retained or deleted must be judged by how representative it is of the population

49

Sources of Outliers

Procedure error
Extraordinary event
– e.g., hurricane for daily rainfall analysis
Extraordinary observations
– Researcher has no explanation
Unique in their combinations of values across the variables
– Falls within the ordinary range of values on each variable
– Retain it unless proved invalid

50

Rules of Thumb for Univariate Outlier Detection

Small samples (80 or fewer observations)
– Cases with standard scores of 2.5 or greater
Larger sample sizes
– Threshold increases up to 4
Standard scores not used
– Cases falling more than 2.5 (small samples) to 4 (larger samples) standard deviations from the mean, depending on the sample size
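The univariate rule of thumb can be sketched as a simple standard-score filter. This is an illustration under stated assumptions: the data are made up, the standard deviation uses the n - 1 divisor from Python's `statistics.stdev`, and the default of 4.0 for samples larger than 80 is one reading of "up to 4" rather than a fixed rule.

```python
from statistics import mean, stdev

def univariate_outliers(values, threshold=None):
    """Return indices of cases whose absolute standard score meets the threshold."""
    n = len(values)
    if threshold is None:
        # 2.5 for small samples (80 or fewer); 4.0 is an assumed upper choice otherwise.
        threshold = 2.5 if n <= 80 else 4.0
    m, s = mean(values), stdev(values)
    return [j for j, x in enumerate(values) if abs(x - m) / s >= threshold]

# Hypothetical sample of 20 observations with one extreme value at the end.
data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11,
        10, 9, 11, 10, 12, 9, 10, 11, 10, 30]

flagged = univariate_outliers(data)
print(flagged)
```

Passing an explicit `threshold` lets the analyst choose an intermediate value such as 3 for moderately large samples.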

51

Rules of Thumb for Bivariate and Multivariate Outlier Detection

Bivariate
– Use scatterplots with confidence intervals at a specified alpha level
Multivariate
– Threshold levels for the $D^2/df$ measure should be conservative (0.005 or 0.001), resulting in values of 2.5 (small samples) versus 3 or 4 in larger samples
– $D^2$: Mahalanobis measure
– $df$: degrees of freedom
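For two variables the Mahalanobis measure can be computed with the closed-form inverse of a 2 by 2 covariance matrix. The sketch below is illustrative: the points are invented, the covariance uses divisor n - 1, and df is taken to be the number of variables; these are assumptions of this example rather than details given on the slide.

```python
def mahalanobis_d2_over_df(points):
    """D^2/df for bivariate points, with df equal to the number of variables (2)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Sample covariance with divisor n - 1 (an assumption of this sketch).
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    scores = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # (dx, dy) S^{-1} (dx, dy)' using the closed-form 2x2 inverse.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        scores.append(d2 / 2)
    return scores

# Hypothetical data: nine points near the line y = x and one point far from it.
points = [(1, 1.0), (2, 2.2), (3, 2.9), (4, 4.1), (5, 5.0),
          (6, 5.8), (7, 7.1), (8, 8.0), (9, 9.1), (2, 9.0)]

scores = mahalanobis_d2_over_df(points)
flagged = [j for j, v in enumerate(scores) if v >= 2.5]
print(flagged)
```

The last point is unremarkable on either variable alone but unusual in combination, which is the situation the multivariate measure is meant to detect.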

52

Outlier Description and Profiling

Generate profiles of each outlier observation
Identify the variable(s) responsible for its being an outlier
Discriminant analysis or multiple regression can be applied to identify the differences between outliers and other observations

53

Examples of Outliers

54

Example of Bivariate Outliers

55

Missing Data

Valid values on one or more variables are not available for analysis
Affects the generalizability of the results
Remedy is applied
– to maintain as closely as possible the original distribution of values

56

Missing Data Process

Systematic event that leads to missing values
– Event external to the respondent (data entry errors or data collection problems)
– Action on the part of the respondent (such as refusal to answer)
Causes some patterns and relationships underlying the missing data

57

Impact of Missing Data

Practical impact
– Reduction of the sample size available for analysis (adequate → inadequate)
Substantive perspective
– Statistical results based on data with a non-random missing data process could be biased
– e.g., individuals who did not provide their income tended to be almost exclusively in the high income bracket

58

Hypothetical Example of Missing Data

59

Practical Considerations

Complete data required
– Only 5 cases are usable (too few)
A possible remedy: Eliminate $V_3$
– 12 cases have complete data
– Eliminate cases 3, 13, 15
– Total amount of missing data is reduced to 7.4% of all values

60

Substantive Impact

5 cases still have missing data, all occurring in $V_4$
Compare these cases with those with valid $V_4$ data
– The 5 cases with missing $V_4$ data have the five lowest scores on $V_2$
– Affects any analysis in which $V_2$ and $V_4$ are both included
– e.g., mean for $V_2$ = 8.4 if cases with missing $V_4$ data are excluded, 7.8 if included

61

Dealing with Missing Data

Determine the type of missing data
Determine the extent of missing data
Diagnose the randomness of the missing data
Select the imputation method

62

Ignorable Missing Data

Expected and part of the research design
The missing data process is operating at random
Specific remedies are not needed

63

Examples of Ignorable Missing Data

Taking a sample of the population rather than gathering data from the entire population
Missing data due to the design of the data collection instrument
– e.g., respondents skip sections of questions that are not applicable

64

Missing Data Process Known to the Researchers

Can be identified due to procedural factors
– Data entry errors
– Disclosure restrictions
– Failure to complete the entire questionnaire
– Morbidity of the respondents
Little control over the process
Some remedies may be applicable

65

Missing Data Process Unknown to the Researchers

Most often related directly to the respondent
Examples
– Refusal to respond to certain questions
– Respondents have no opinion or insufficient knowledge to answer the question
Should be anticipated and minimized in the research design and data collection stages
Some remedies may be applicable

66

Assessing the Extent and Patterns of Missing Data

Tabulate
– Percentage of variables with missing data for each case
– Number of cases with missing data for each variable
Look for non-random patterns in the data
Determine the number of cases without missing data on any variable
– (sample size available for analysis if remedies are not applied)
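The tabulation step can be sketched with a few list comprehensions. The small data set below is hypothetical, with `None` marking a missing value; variable names are choices made for this illustration.

```python
# Hypothetical data: 5 cases (rows) by 3 variables (columns); None marks a missing value.
rows = [
    [1.0, 2.0, 3.0],
    [None, 2.5, 3.1],
    [1.2, None, None],
    [0.9, 2.1, 2.9],
    [1.1, 2.2, None],
]
n_cases, n_vars = len(rows), len(rows[0])

# Percentage of variables with missing data for each case.
pct_missing_per_case = [100 * sum(v is None for v in row) / n_vars for row in rows]

# Number of cases with missing data for each variable.
missing_per_variable = [sum(row[k] is None for row in rows) for k in range(n_vars)]

# Number of cases with no missing data on any variable.
complete_cases = sum(all(v is not None for v in row) for row in rows)

print(pct_missing_per_case)
print(missing_per_variable)
print(complete_cases)
```

The count of complete cases is the sample size that remains if no remedy is applied.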

67

Rules of Thumb to Ignore Missing Data

Missing data under 10% for an individual case can generally be ignored, except when the missing data occur in a specific non-random fashion
The number of cases with no missing data must be sufficient for the selected analysis technique if replacement values will not be substituted (imputed) for the missing data

68

Rules of Thumb for Deletions

Variables with as little as 15% missing data are candidates for deletion
– Higher levels of missing data (20%~30%) can often be remedied
Be sure the overall decrease in missing data is large enough to justify deleting an individual variable or case

69

Rules of Thumb for Deletions

Cases with missing data for dependent variables typically are deleted
When deleting a variable, ensure that alternative variables, hopefully highly correlated, are available to represent the intent of the original variable

70

Levels of Randomness

Missing at Random (MAR)
– e.g., missing data of gender are random for both male and female, but those of household income occur at a higher frequency for males than females
Missing Completely at Random (MCAR)
– e.g., missing data for household income were randomly missing in equal proportions for both male and female
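A first descriptive check of the distinction is to compare the missing-data rate of one variable across groups defined by another. The records below are hypothetical (gender, household income), and this comparison is only a sketch of the idea, not a formal test of randomness.

```python
# Hypothetical records: (gender, household income); None marks missing income.
records = [
    ("M", 52.0), ("M", None), ("M", None), ("M", 61.0),
    ("F", 48.0), ("F", 55.0), ("F", None), ("F", 50.0),
]

def missing_rate_by_group(records):
    """Fraction of missing values within each group."""
    counts = {}
    for group, value in records:
        total, missing = counts.get(group, (0, 0))
        counts[group] = (total + 1, missing + (value is None))
    return {g: m / t for g, (t, m) in counts.items()}

rates = missing_rate_by_group(records)
print(rates)
```

Clearly unequal rates across groups are consistent with the MAR example above (missingness related to gender), whereas roughly equal rates are what MCAR would lead one to expect.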

71

Modeling Approach for MAR Process

Involves maximum likelihood estimation techniques
– e.g., EM approach

72

Imputation Using Only Valid Data

Complete Case Approach
– Include only those observations with complete data
Using All-Available Data
– Use only valid data
– Imputes the distribution characteristics (e.g., means or standard deviations) or relationships (e.g., correlations) from every valid value
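The difference between the two approaches shows up even in a single mean. The sketch below uses a hypothetical data set with `None` for missing values; the function names are introduced for this example only.

```python
# Hypothetical data: 5 cases by 3 variables; None marks a missing value.
rows = [
    [1.0, 2.0, 3.0],
    [None, 2.5, 3.1],
    [1.2, None, None],
    [0.9, 2.1, 2.9],
    [1.1, 2.2, None],
]

def mean_complete_case(rows, k):
    """Mean of variable k using only cases with no missing values at all."""
    vals = [row[k] for row in rows if all(v is not None for v in row)]
    return sum(vals) / len(vals)

def mean_all_available(rows, k):
    """Mean of variable k using every case where variable k itself is valid."""
    vals = [row[k] for row in rows if row[k] is not None]
    return sum(vals) / len(vals)

cc_mean = mean_complete_case(rows, 0)   # uses 2 cases
aa_mean = mean_all_available(rows, 0)   # uses 4 cases
print(cc_mean, aa_mean)
```

The all-available approach keeps more data per statistic, at the cost that different statistics may be based on different subsets of cases.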

73

Imputation Using Known Replacement Values

Hot deck imputation
– Use the value from another observation in the sample that is deemed similar
Cold deck imputation
– Derive the replacement value from an external source (e.g., prior studies, other samples, etc.)
Case substitution
– Choose another nonsampled observation

74

Imputation by Calculating Replacement Values

Mean Substitution
– Use the mean value of that variable calculated from all valid responses
Regression imputation
– Predict the missing values of a variable based on its relationship to other variables in the data set
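Both calculated-replacement methods can be sketched for one variable with missing values. The data are hypothetical, the regression uses a single fully observed predictor `x` fitted by ordinary least squares, and the function names are assumptions made for this illustration.

```python
def mean_substitution(y):
    """Replace missing values with the mean of the valid responses."""
    valid = [v for v in y if v is not None]
    m = sum(valid) / len(valid)
    return [m if v is None else v for v in y]

def regression_imputation(x, y):
    """Replace missing y values with predictions from a least-squares line y = b0 + b1 * x.

    Assumes x is fully observed; the line is fitted on cases where y is valid.
    """
    pairs = [(a, b) for a, b in zip(x, y) if b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    slope = (sum((a - mx) * (b - my) for a, b in pairs)
             / sum((a - mx) ** 2 for a, _ in pairs))
    intercept = my - slope * mx
    return [intercept + slope * a if b is None else b for a, b in zip(x, y)]

x = [1.0, 2.0, 3.0, 4.0, 5.0]      # hypothetical predictor, fully observed
y = [2.0, 4.1, None, 8.2, None]    # hypothetical variable with missing values

y_mean = mean_substitution(y)
y_reg = regression_imputation(x, y)
print(y_mean)
print(y_reg)
```

Mean substitution gives every missing case the same value, while regression imputation lets the replacement follow the relationship with the other variable.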

75

Summary of Imputation Using Only Valid Data

76

Summary of Imputation Using Known Replacement Values

77

Summary of Imputation by Calculating Replacement Values

78

Summary of Model-Based Methods

79

Rules of Thumb for Imputation of Missing Data

Under 10%
– Any imputation method
10% to 20%
– All-available, hot deck, case substitution, and regression for MCAR
– Model-based for MAR
Over 20%
– Regression for MCAR
– Model-based for MAR
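These rules of thumb amount to a small lookup on the percentage of missing data and the missing data process. The function below is a hypothetical helper that encodes the list as written; how the exact boundaries at 10% and 20% are assigned is an assumption of this sketch.

```python
def suggested_imputation_methods(pct_missing, process):
    """Return the methods suggested by the rules of thumb.

    pct_missing: percentage of missing data (0 to 100).
    process: "MCAR" or "MAR".
    """
    if process not in ("MCAR", "MAR"):
        raise ValueError("process must be 'MCAR' or 'MAR'")
    if pct_missing < 10:
        return ["any imputation method"]
    if process == "MAR":
        return ["model-based"]
    if pct_missing <= 20:   # boundary handling is an assumption
        return ["all-available", "hot deck", "case substitution", "regression"]
    return ["regression"]

print(suggested_imputation_methods(5, "MAR"))
print(suggested_imputation_methods(15, "MCAR"))
print(suggested_imputation_methods(30, "MAR"))
```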

80

Summary Statistics of Missing Data for Original Sample

81

Comparison of Four Imputation Methods

82

Comparison of Four Imputation Methods
