24/6/2015 Lesson7:PrincipalComponentsAnalysis(PCA)
https://onlinecourses.science.psu.edu/stat505/book/export/html/49 1/19
Lesson 7: Principal Components Analysis (PCA)
Introduction
Sometimes data are collected on a large number of variables from a single population. As an example, consider the Places Rated dataset below.
Example: Places Rated
In the Places Rated Almanac, Boyer and Savageau rated 329 communities according to the following nine criteria:
1. Climate and Terrain
2. Housing
3. Health Care & the Environment
4. Crime
5. Transportation
6. Education
7. The Arts
8. Recreation
9. Economics
Note that within the dataset, except for housing and crime, the higher the score the better. For housing and crime, the lower the score the better. Where some communities might do better in the arts, other communities might be rated better in other areas such as having a lower crime rate and good educational opportunities.
Objective
With a large number of variables, the dispersion matrix may be too large to study and interpret properly. There would be too many pairwise correlations between the variables to consider. Graphical display of the data may also not be of particular help when the data set is very large. With 12 variables, for example, there will be more than 200 three-dimensional scatterplots to be studied!
To interpret the data in a more meaningful form, it is therefore necessary to reduce the number of variables to a few, interpretable linear combinations of the data. Each linear combination will correspond to a principal component.
(There is another very useful data reduction technique called Factor Analysis, which will be taken up in a subsequent lesson.)
Learning objectives & outcomes
Upon completion of this lesson, you should be able to do the following:
- Carry out a principal components analysis using SAS and Minitab
- Assess how many principal components should be considered in an analysis
- Interpret principal component scores. Be able to describe a subject with a high or low score
- Determine when a principal component analysis may be based on the variance-covariance matrix, and when the correlation matrix should be used
- Understand how principal component scores may be used in further analyses.
7.1 Principal Component Analysis (PCA) Procedure
Suppose that we have a random vector \(\textbf{X}\),
\(\textbf{X}=\left(\begin{array}{c}X_1\\X_2\\\vdots\\X_p\end{array}\right)\)
with population variance-covariance matrix
\(\text{var}(\textbf{X})=\Sigma=\left(\begin{array}{cccc}\sigma^2_1&\sigma_{12}&\dots&\sigma_{1p}\\\sigma_{21}&\sigma^2_2&\dots&\sigma_{2p}\\\vdots&\vdots&\ddots&\vdots\\\sigma_{p1}&\sigma_{p2}&\dots&\sigma^2_p\end{array}\right)\)
Consider the linear combinations
\(\begin{array}{lll}Y_1&=&e_{11}X_1+e_{12}X_2+\dots+e_{1p}X_p\\Y_2&=&e_{21}X_1+e_{22}X_2+\dots+e_{2p}X_p\\&&\vdots\\Y_p&=&e_{p1}X_1+e_{p2}X_2+\dots+e_{pp}X_p\end{array}\)
Each of these can be thought of as a linear regression, predicting \(Y_i\) from \(X_1, X_2, \dots, X_p\). There is no intercept, but \(e_{i1}, e_{i2}, \dots, e_{ip}\) can be viewed as regression coefficients.
Note that \(Y_i\) is a function of our random data, and so is also random. Therefore it has a population variance
\[\text{var}(Y_i)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{ik}e_{il}\sigma_{kl}=\mathbf{e}'_i\Sigma\mathbf{e}_i\]
Moreover, \(Y_i\) and \(Y_j\) will have a population covariance
\[\text{cov}(Y_i,Y_j)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{ik}e_{jl}\sigma_{kl}=\mathbf{e}'_i\Sigma\mathbf{e}_j\]
Here the coefficients \(e_{ij}\) are collected into the vector
\(\mathbf{e}_i=\left(\begin{array}{c}e_{i1}\\e_{i2}\\\vdots\\e_{ip}\end{array}\right)\)
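The double-sum and matrix forms of the variance and covariance can be checked numerically. The sketch below is only an illustration: the 3 x 3 matrix `Sigma` and the coefficient vectors `e_i` and `e_j` are made-up values, not taken from the lesson.

```python
import numpy as np

# Hypothetical variance-covariance matrix (illustrative values only).
Sigma = np.array([[4.0, 1.2, 0.5],
                  [1.2, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])
# Hypothetical coefficient vectors for two linear combinations.
e_i = np.array([0.6, 0.8, 0.0])
e_j = np.array([0.0, 0.6, 0.8])

p = Sigma.shape[0]
# Double-sum form: sum over k and l of e_ik * e_il * sigma_kl.
var_double_sum = sum(e_i[k] * e_i[l] * Sigma[k, l]
                     for k in range(p) for l in range(p))
cov_double_sum = sum(e_i[k] * e_j[l] * Sigma[k, l]
                     for k in range(p) for l in range(p))
# Matrix form: e_i' Sigma e_i and e_i' Sigma e_j.
var_matrix = e_i @ Sigma @ e_i
cov_matrix = e_i @ Sigma @ e_j

print(var_double_sum, var_matrix)
print(cov_double_sum, cov_matrix)
```

Both pairs of numbers should agree up to floating-point rounding.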
First Principal Component (PCA1): \(Y_1\)
The first principal component is the linear combination of x-variables that has maximum variance (among all linear combinations), so it accounts for as much variation in the data as possible.
Specifically, we will define coefficients \(e_{11}, e_{12}, \dots, e_{1p}\) for that component in such a way that its variance is maximized, subject to the constraint that the sum of the squared coefficients is equal to one. This constraint is required so that a unique answer may be obtained.
More formally, select \(e_{11}, e_{12}, \dots, e_{1p}\) that maximize
\[\text{var}(Y_1)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{1k}e_{1l}\sigma_{kl}=\mathbf{e}'_1\Sigma\mathbf{e}_1\]
subject to the constraint that
\[\mathbf{e}'_1\mathbf{e}_1=\sum_{j=1}^{p}e^2_{1j}=1\]
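As a rough numerical illustration of this constrained maximization, the sketch below compares the variance achieved by the leading eigenvector of a covariance matrix (the solution developed in Section 7.2) with the variance achieved by many random unit-length coefficient vectors. The matrix `Sigma` is a made-up example, and the random search is only a sanity check, not how the components are computed in practice.

```python
import numpy as np

# Hypothetical variance-covariance matrix (illustrative values only).
Sigma = np.array([[4.0, 1.2, 0.5],
                  [1.2, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])

# eigh returns eigenvalues in ascending order for a symmetric matrix.
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)
e1 = eigenvectors[:, -1]            # eigenvector of the largest eigenvalue
best_variance = e1 @ Sigma @ e1     # e_1' Sigma e_1

# Random coefficient vectors rescaled so that e'e = 1.
rng = np.random.default_rng(0)
trial = rng.normal(size=(10_000, 3))
trial /= np.linalg.norm(trial, axis=1, keepdims=True)
# e' Sigma e for every trial vector at once.
trial_variances = np.einsum("ij,jk,ik->i", trial, Sigma, trial)

print(best_variance, trial_variances.max())
```

No random unit vector should exceed the variance obtained with `e1`.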
Second Principal Component (PCA2): \(Y_2\)
The second principal component is the linear combination of x-variables that accounts for as much of the remaining variation as possible, with the constraint that the correlation between the first and second component is 0.
Select \(e_{21}, e_{22}, \dots, e_{2p}\) that maximize the variance of this new component,
\[\text{var}(Y_2)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{2k}e_{2l}\sigma_{kl}=\mathbf{e}'_2\Sigma\mathbf{e}_2\]
subject to the constraint that the sum of the squared coefficients is equal to one,
\[\mathbf{e}'_2\mathbf{e}_2=\sum_{j=1}^{p}e^2_{2j}=1\]
along with the additional constraint that these two components will be uncorrelated with one another:
\[\text{cov}(Y_1,Y_2)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{1k}e_{2l}\sigma_{kl}=\mathbf{e}'_1\Sigma\mathbf{e}_2=0\]
All subsequent principal components have this same property: they are linear combinations that account for as much of the remaining variation as possible, and they are not correlated with the other principal components.
We will do this in the same way with each additional component. For instance:
ith Principal Component (PCAi): \(Y_i\)
We select \(e_{i1}, e_{i2}, \dots, e_{ip}\) that maximize
\[\text{var}(Y_i)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{ik}e_{il}\sigma_{kl}=\mathbf{e}'_i\Sigma\mathbf{e}_i\]
subject to the constraint that the sum of the squared coefficients is equal to one, along with the additional constraint that this new component will be uncorrelated with all the previously defined components:
\(\mathbf{e}'_i\mathbf{e}_i=\sum_{j=1}^{p}e^2_{ij}=1\)
\(\text{cov}(Y_1,Y_i)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{1k}e_{il}\sigma_{kl}=\mathbf{e}'_1\Sigma\mathbf{e}_i=0\),
\(\text{cov}(Y_2,Y_i)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{2k}e_{il}\sigma_{kl}=\mathbf{e}'_2\Sigma\mathbf{e}_i=0\),
\(\vdots\)
\(\text{cov}(Y_{i-1},Y_i)=\sum_{k=1}^{p}\sum_{l=1}^{p}e_{i-1,k}e_{il}\sigma_{kl}=\mathbf{e}'_{i-1}\Sigma\mathbf{e}_i=0\)
Therefore all principal components are uncorrelated with one another.
7.2 How do we find the coefficients?
How do we find the coefficients \(e_{ij}\) for a principal component?
The solution involves the eigenvalues and eigenvectors of the variance-covariance matrix \(\Sigma\).
Solution:
We let \(\lambda_1\) through \(\lambda_p\) denote the eigenvalues of the variance-covariance matrix \(\Sigma\). These are ordered so that \(\lambda_1\) is the largest eigenvalue and \(\lambda_p\) is the smallest:
\(\lambda_1\ge\lambda_2\ge\dots\ge\lambda_p\)
We also let the vectors \(\mathbf{e}_1, \mathbf{e}_2, \dots, \mathbf{e}_p\) denote the corresponding eigenvectors. It turns out that the elements of these eigenvectors are the coefficients of our principal components.
The variance of the ith principal component is equal to the ith eigenvalue:
\(\text{var}(Y_i)=\text{var}(e_{i1}X_1+e_{i2}X_2+\dots+e_{ip}X_p)=\lambda_i\)
Moreover, the principal components are uncorrelated with one another:
\(\text{cov}(Y_i,Y_j)=0\)
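These two properties can be verified numerically. The sketch below assumes a made-up covariance matrix `Sigma` and uses NumPy's `numpy.linalg.eigh`, which returns eigenvalues in ascending order, so they are re-sorted to follow the ordering used here.

```python
import numpy as np

# Hypothetical variance-covariance matrix (illustrative values only).
Sigma = np.array([[4.0, 1.2, 0.5],
                  [1.2, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])

vals, vecs = np.linalg.eigh(Sigma)
order = np.argsort(vals)[::-1]      # indices from largest to smallest eigenvalue
lam = vals[order]                   # lambda_1 >= lambda_2 >= ... >= lambda_p
E = vecs[:, order]                  # column i holds the coefficients of component i+1

# Variance-covariance matrix of the components Y = E'X.
cov_Y = E.T @ Sigma @ E
print(np.round(cov_Y, 10))
```

The diagonal of `cov_Y` should hold the eigenvalues and the off-diagonal entries should be zero up to rounding.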
The variance-covariance matrix may be written as a function of the eigenvalues and their corresponding eigenvectors. This follows from the Spectral Decomposition Theorem, which will become useful later when we investigate topics under factor analysis.
Spectral Decomposition Theorem
The variance-covariance matrix can be written as the sum over the p eigenvalues, each multiplied by the product of the corresponding eigenvector and its transpose, as shown in the first expression below:
\[\begin{array}{lll}\Sigma&=&\sum_{i=1}^{p}\lambda_i\mathbf{e}_i\mathbf{e}_i'\\&\cong&\sum_{i=1}^{k}\lambda_i\mathbf{e}_i\mathbf{e}_i'\end{array}\]
The second expression is a useful approximation if \(\lambda_{k+1},\lambda_{k+2},\dots,\lambda_{p}\) are small. We might approximate \(\Sigma\) by
\[\sum_{i=1}^{k}\lambda_i\mathbf{e}_i\mathbf{e}_i'\]
Again, this will become more useful when we talk about factor analysis.
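The decomposition and its truncated version can be sketched as follows. The matrix `Sigma` and the helper `spectral_sum` are hypothetical, introduced only for illustration. The last line relies on a standard linear-algebra fact not stated in the lesson: the largest absolute eigenvalue (spectral norm) of the approximation error equals the first omitted eigenvalue.

```python
import numpy as np

# Hypothetical variance-covariance matrix (illustrative values only).
Sigma = np.array([[4.0, 1.2, 0.5],
                  [1.2, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])

vals, vecs = np.linalg.eigh(Sigma)
order = np.argsort(vals)[::-1]
lam, E = vals[order], vecs[:, order]

def spectral_sum(k):
    """Sum of lambda_i * e_i e_i' over the first k components."""
    return sum(lam[i] * np.outer(E[:, i], E[:, i]) for i in range(k))

full = spectral_sum(len(lam))       # all p terms reproduce Sigma
approx = spectral_sum(2)            # keep only the first k = 2 terms
error = np.linalg.norm(Sigma - approx, ord=2)

print(error, lam[2])
```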
Earlier in the course we defined the total variation of \(\textbf{X}\) as the trace of the variance-covariance matrix, or if you like, the sum of the variances of the individual variables. This is also equal to the sum of the eigenvalues, as shown below:
\(\begin{array}{lll}\text{trace}(\Sigma)&=&\sigma^2_1+\sigma^2_2+\dots+\sigma^2_p\\&=&\lambda_1+\lambda_2+\dots+\lambda_p\end{array}\)
This gives us an interpretation of the components in terms of the amount of the full variation explained by each component. The proportion of variation explained by the ith principal component is defined to be the eigenvalue for that component divided by the sum of the eigenvalues. In other words, the ith principal component explains the following proportion of the total variation:
\[\frac{\lambda_i}{\lambda_1+\lambda_2+\dots+\lambda_p}\]
A related quantity is the proportion of variation explained by the first k principal components. This is the sum of the first k eigenvalues divided by the total variation:
\[\frac{\lambda_1+\lambda_2+\dots+\lambda_k}{\lambda_1+\lambda_2+\dots+\lambda_p}\]
Naturally, if the proportion of variation explained by the first k principal components is large, then not much information is lost by considering only the first k principal components.
Why It May Be Possible to Reduce Dimensions
When we have correlations (multicollinearity) between the x-variables, the data may more or less fall on a line or plane in a lower number of dimensions. For instance, imagine a plot of two x-variables that have a nearly perfect correlation. The data points will fall close to a straight line. That line could be used as a new (one-dimensional) axis to represent the variation among data points. As another example, suppose that we have verbal, math, and total SAT scores for a sample of students. We have three variables, but really (at most) two dimensions to the data because total = verbal + math, meaning the third variable is completely determined by the first two. The reason for saying "at most" two dimensions is that if there is a strong correlation between verbal and math, then it may be possible that there is only one true dimension to the data.
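The SAT example can be mimicked with simulated scores. The numbers below are generated at random and are not real SAT data; the means, spreads and the strength of the verbal-math relationship are arbitrary assumptions. Because total = verbal + math exactly, the smallest eigenvalue of the sample covariance matrix should be zero up to rounding.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Simulated scores (assumed distributions, for illustration only).
verbal = rng.normal(500, 100, size=n)
math_score = 500 + 0.8 * (verbal - 500) + rng.normal(0, 60, size=n)
total = verbal + math_score          # completely determined by the other two

X = np.column_stack([verbal, math_score, total])
S = np.cov(X, rowvar=False)          # 3 x 3 sample variance-covariance matrix
eigenvalues = np.linalg.eigvalsh(S)[::-1]   # largest first

# Proportion of the total variation carried by each dimension.
print(eigenvalues / eigenvalues.sum())
```

The third proportion is numerically zero, so the three variables occupy at most two dimensions.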
Note
All of this is defined in terms of the population variance-covariance matrix \(\Sigma\), which is unknown. However, we may estimate \(\Sigma\) by the sample variance-covariance matrix, which is given by the standard formula:
\[\textbf{S}=\frac{1}{n-1}\sum_{i=1}^{n}(\mathbf{X}_i-\bar{\textbf{x}})(\mathbf{X}_i-\bar{\textbf{x}})'\]
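A minimal sketch of this formula, using simulated data (the data matrix `X` is an assumption for illustration), compared against NumPy's built-in `numpy.cov`:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))         # simulated data: n = 50 units, p = 4 variables

n = X.shape[0]
x_bar = X.mean(axis=0)               # vector of sample means
deviations = X - x_bar               # row i is (X_i - x_bar)'
# Sum of outer products (X_i - x_bar)(X_i - x_bar)' divided by n - 1.
S = deviations.T @ deviations / (n - 1)

print(np.round(S, 3))
```

`numpy.cov(X, rowvar=False)` uses the same n - 1 divisor by default, so the two results should match.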
Procedure
Compute the eigenvalues \(\hat{\lambda}_1,\hat{\lambda}_2,\dots,\hat{\lambda}_p\) of the sample variance-covariance matrix \(\textbf{S}\), and the corresponding eigenvectors \(\hat{\mathbf{e}}_1,\hat{\mathbf{e}}_2,\dots,\hat{\mathbf{e}}_p\).
Then we define our estimated principal components using the eigenvectors as our coefficients:
\(\begin{array}{lll}\hat{Y}_1&=&\hat{e}_{11}X_1+\hat{e}_{12}X_2+\dots+\hat{e}_{1p}X_p\\\hat{Y}_2&=&\hat{e}_{21}X_1+\hat{e}_{22}X_2+\dots+\hat{e}_{2p}X_p\\&&\vdots\\\hat{Y}_p&=&\hat{e}_{p1}X_1+\hat{e}_{p2}X_2+\dots+\hat{e}_{pp}X_p\\\end{array}\)
Generally, we only retain the first k principal components. Here we must balance two conflicting desires:
1. To obtain the simplest possible interpretation, we want k to be as small as possible. If we can explain most of the variation with just two principal components, this gives us a much simpler description of the data. However, the smaller k is, the smaller the amount of variation explained by the first k components.
2. To avoid loss of information, we want the proportion of variation explained by the first k principal components to be large, ideally as close to one as possible; i.e., we want
\[\frac{\hat{\lambda}_1+\hat{\lambda}_2+\dots+\hat{\lambda}_k}{\hat{\lambda}_1+\hat{\lambda}_2+\dots+\hat{\lambda}_p}\cong1\]
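The procedure above can be sketched in a few lines. This is not the SAS or Minitab implementation used in the lesson; the function names `pca_covariance` and `smallest_k`, the simulated data, and the 90% threshold are all assumptions made for illustration.

```python
import numpy as np

def pca_covariance(X):
    """Eigenvalues (descending), eigenvectors (as columns) and scores from S."""
    Xc = X - X.mean(axis=0)                  # center each variable
    S = np.cov(Xc, rowvar=False)             # sample variance-covariance matrix
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]
    lam, E = vals[order], vecs[:, order]
    scores = Xc @ E                          # column i holds the scores of component i+1
    return lam, E, scores

def smallest_k(lam, threshold):
    """Smallest k whose cumulative proportion of variation reaches threshold."""
    cumulative = np.cumsum(lam) / lam.sum()
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(lam))

# Simulated data: six variables driven mostly by two underlying dimensions.
rng = np.random.default_rng(2)
latent = rng.normal(size=(300, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(300, 6))

lam, E, scores = pca_covariance(X)
k = smallest_k(lam, 0.90)
print(k, np.round(np.cumsum(lam) / lam.sum(), 4))
```

The sample variance of each score column equals the corresponding eigenvalue, and the eigenvalues add up to the trace of S.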
7.3 Example: Places Rated
We will use the Places Rated Almanac data (Boyer and Savageau), which rates 329 communities according to nine criteria:
1. Climate and Terrain
2. Housing
3. Health Care & Environment
4. Crime
5. Transportation
6. Education
7. The Arts
8. Recreation
9. Economics
Notes:
The data for many of the variables are strongly skewed to the right. The log transformation was used to normalize the data.
The SAS program places.sas will implement the principal component procedures.
When you examine the output, the first thing that SAS does is give us summary information. There are 329 observations representing the 329 communities in our dataset and 9 variables. This is followed by simple statistics that report the means and standard deviations for each variable.
Below this is the variance-covariance matrix for the data. You should be able to see that the variance reported for climate is 0.01289.
What we really need to draw our attention to here is the eigenvalues of the variance-covariance matrix. In the SAS output the eigenvalues are listed in ranked order from largest to smallest. These values have been copied into Table 1 below for discussion.
Data Analysis:
Step 1: We examine the eigenvalues to determine how many principal components should be considered:
Table 1. Eigenvalues, and the proportion of variation explained by the principal components.
Component  Eigenvalue  Proportion  Cumulative
1          0.3775      0.7227      0.7227
2          0.0511      0.0977      0.8204
3          0.0279      0.0535      0.8739
4          0.0230      0.0440      0.9178
5          0.0168      0.0321      0.9500
6          0.0120      0.0229      0.9728
7          0.0085      0.0162      0.9890
8          0.0039      0.0075      0.9966
9          0.0018      0.0034      1.0000
Total      0.5225
If you add up all of these eigenvalues, you get the total variance, approximately 0.522 (the rounded eigenvalues in Table 1 sum to 0.5225).
The proportion of variation explained by each eigenvalue is given in the third column. For example, 0.3775 divided by the total variance gives the 0.7227 shown in the table; that is, about 72% of the variation is explained by this first eigenvalue. The cumulative proportion explained is obtained by adding the successive proportions of variation explained to obtain the running total. For instance, 0.7227 plus 0.0977 equals 0.8204, and so forth. Therefore, about 82% of the variation is explained by the first two eigenvalues together.
Next we need to look at successive differences between the eigenvalues. Subtracting the second eigenvalue, 0.051, from the first eigenvalue, 0.377, we get a difference of 0.326. The difference between the second and third eigenvalues is 0.0232; the next difference is 0.0049. Subsequent differences are even smaller. A sharp drop from one eigenvalue to the next may serve as another indicator of how many eigenvalues to consider.
The first three principal components explain 87% of the variation. This is an acceptably large percentage.
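The arithmetic behind Table 1 can be reproduced from the listed eigenvalues. Because the inputs below are the rounded values copied from the table, the last digit of a computed proportion may differ slightly from the printed one.

```python
import numpy as np

# Eigenvalues copied from Table 1 (rounded to four decimals).
eigenvalues = np.array([0.3775, 0.0511, 0.0279, 0.0230, 0.0168,
                        0.0120, 0.0085, 0.0039, 0.0018])

total = eigenvalues.sum()
proportion = eigenvalues / total         # share of total variation per component
cumulative = np.cumsum(proportion)       # running total of the shares
drops = -np.diff(eigenvalues)            # successive differences lambda_i - lambda_(i+1)

for i, (lam, prop, cum) in enumerate(zip(eigenvalues, proportion, cumulative), start=1):
    print(f"{i}  {lam:.4f}  {prop:.4f}  {cum:.4f}")
print("successive differences:", np.round(drops, 4))
```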
An alternative method to determine the number of principal components is to look at a scree plot. With the eigenvalues ordered from largest to smallest, a scree plot is the plot of \(\hat{\lambda}_i\) versus i. The number of components is determined at the point beyond which the remaining eigenvalues are all relatively small and of comparable size. The following plot is made in Minitab.
The scree plot for the variables without standardization (covariance matrix)
As you see, we could have stopped at the second principal component, but we continued to the third component. Relatively speaking, the contribution of the third component is small compared to that of the second component.
Step 2: Next, we will compute the principal component scores. For example, the first principal component can be computed using the elements of the first eigenvector:
\(\begin{array}{lll}\hat{Y}_1&=&0.0351\times(\text{climate})+0.0933\times(\text{housing})+0.4078\times(\text{health})\\&&+0.1004\times(\text{crime})+0.1501\times(\text{transportation})+0.0321\times(\text{education})\\&&+0.8743\times(\text{arts})+0.1590\times(\text{recreation})+0.0195\times(\text{economy})\end{array}\)
In order to complete this formula and compute the principal component for an individual community of interest, plug in that community's values for each of these variables. A fairly standard procedure is, rather than using the raw data here, to use the differences between the variables and their sample means. This is known as translation of the random variables. Translation does not affect the interpretations because the variances of the original variables are the same as those of the translated variables.
The magnitudes of the coefficients give the contributions of each variable to that component. However, the magnitudes of the coefficients also depend on the variances of the corresponding variables.
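The statement that translation leaves the variances unchanged can be checked directly. The data below are simulated (the means and standard deviations are arbitrary assumptions); the sketch compares the covariance matrix and the first-component scores before and after subtracting the sample means.

```python
import numpy as np

rng = np.random.default_rng(3)
# Simulated data with very different means (illustrative values only).
X = rng.normal(loc=[10.0, 50.0, 200.0], scale=[1.0, 5.0, 20.0], size=(100, 3))
Xc = X - X.mean(axis=0)                  # translated (centered) variables

S_raw = np.cov(X, rowvar=False)
S_centered = np.cov(Xc, rowvar=False)    # identical to S_raw

vals, vecs = np.linalg.eigh(S_raw)
e1 = vecs[:, -1]                         # first eigenvector (largest eigenvalue)

scores_raw = X @ e1                      # scores from the raw data
scores_centered = Xc @ e1                # scores from the translated data

print(scores_raw.var(ddof=1), scores_centered.var(ddof=1), scores_centered.mean())
```

The two sets of scores differ only by a constant shift, so they have the same variance; the centered scores have mean zero.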
7.4 Interpretation of the Principal Components
Step 3: To interpret each component, we must compute the correlations between the original data for each variable and each principal component.
These correlations are obtained using the correlation procedure. In the variable statement we will include the first three principal components, "prin1, prin2, and prin3", in addition to all nine of the original variables. We will use these correlations between the principal components and the original variables to interpret these principal components.
Because the variables were centered at their sample means, all principal components will have mean 0. The standard deviation is also given for each of the components; it is the square root of the corresponding eigenvalue.
More important for our current purposes are the correlations between the principal components and the original variables. These have been copied into the following table. You will also note that if you look at the principal components themselves, there is zero correlation between the components.
                    Principal Component
Variable            1        2        3
Climate             0.190    0.017    0.207
Housing             0.544*   0.020    0.204
Health              0.782*   0.605*   0.144
Crime               0.365    0.294    0.585*
Transportation      0.585*   0.085    0.234
Education           0.394    0.273    0.027
Arts                0.985*   0.126    0.111
Recreation          0.520*   0.402    0.519*
Economy             0.142    0.150    0.239
Interpretation of the principal components is based on finding which variables are most strongly correlated with each component, i.e., which of these numbers are large in magnitude, the farthest from zero in either the positive or negative direction. Which numbers we consider to be large or small is of course a subjective decision. You need to determine at what level the correlation value will be of importance. Here a correlation value above 0.5 is deemed important. These larger correlations are marked with an asterisk in the table above.
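The correlations in the table can also be recovered from quantities reported earlier, using the standard identity corr(Y_i, X_k) = e_ik * sqrt(lambda_i) / s_k. This identity is not stated in the lesson; it follows because cov(X_k, Y_i) is the kth element of S e_i = lambda_i e_i. The function name below is hypothetical, and the inputs are the rounded values printed in the lesson.

```python
import math

def component_variable_correlation(coefficient, eigenvalue, variable_variance):
    """corr(Y_i, X_k) = e_ik * sqrt(lambda_i) / s_k."""
    return coefficient * math.sqrt(eigenvalue) / math.sqrt(variable_variance)

r_climate_pc1 = component_variable_correlation(
    coefficient=0.0351,         # climate coefficient in the first eigenvector
    eigenvalue=0.3775,          # first eigenvalue from Table 1
    variable_variance=0.01289,  # variance reported for climate
)
print(round(r_climate_pc1, 3))
```

The result agrees with the climate entry for the first component (0.190) to three decimals.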
We will now interpret the principal component results with respect to the value that we have deemed significant.
First Principal Component Analysis - PCA1
The first principal component is strongly correlated with five of the original variables. The first principal component increases with increasing Arts, Health, Transportation, Housing and Recreation scores. This suggests that these five criteria vary together: if one increases, then the remaining four also tend to increase. This component can be viewed as a measure of the quality of Arts, Health, Transportation, and Recreation, and the lack of quality in Housing (recall that high values for Housing are bad). Furthermore, we see that the first principal component correlates most strongly with the Arts. In fact, based on the correlation of 0.985, we could state that this principal component is primarily a measure of the Arts. It would follow that communities with high values would tend to have a lot of arts available, in terms of theaters, orchestras, etc., whereas communities with small values would have very few of these types of opportunities.
Second Principal Component Analysis - PCA2
The second principal component increases with only one of the values, decreasing Health. This component can be viewed as a measure of how unhealthy the location is in terms of available health care, including doctors, hospitals, etc.
Third Principal Component Analysis - PCA3
The third principal component increases with increasing Crime and Recreation. This suggests that places with high crime also tend to have better recreation facilities.
To complete the analysis we often would like to produce a scatter plot of the component scores.
In looking at the program, you will see a gplot procedure at the bottom where we are plotting the second component against the first component. A similar plot can also be prepared in Minitab, but is not shown here.
Each dot in this plot represents one community. So if you were looking at the red dot out by itself to the right, you may conclude that this particular dot has a very high value for the first principal component, and we would expect this community to have high values for the Arts, Health, Housing, Transportation and Recreation. Whereas if you look at the red dot at the left of the spectrum, you would expect it to have low values for each of those variables.
The top dot in blue has a high value for the second component. So you would expect that this community would be lousy for Health. Conversely, if you were to look at the blue dot on the bottom, the corresponding community would have high values for Health.
Further analyses may include:
- Scatter plots of principal component scores. In the present context, we may wish to identify the locations of each point in the plot to see if places with high levels of a given component tend to be clustered in a particular region of the country, while sites with low levels of that component are clustered in another region of the country.
- Principal components are often treated as dependent variables for regression and analysis of variance.
7.5 Alternative: Standardize the Variables
In the previous example we looked at principal components analysis applied to the raw data. In our earlier discussion we noted that if the raw data are used, principal component analysis will tend to give more emphasis to those variables that have higher variances than to those variables that have very low variances. In effect, the results of the analysis will depend on what units of measurement are used to measure each variable. That would imply that a principal component analysis should only be used with the raw data if all variables have the same units of measure, and even in this case, only if you wish to give those variables which have higher variances more weight in the analysis.
A unique example of this type of implementation might be in an ecological setting where you are looking at counts of different species of organisms at a number of different sample sites. Here, one may want to give more weight to the more common species that are observed. By analyzing the raw data you will tend to find that more common species will also show higher variances and will be given more emphasis. If you were to do a principal component analysis on standardized counts, all species would be weighted equally regardless of how abundant they are, and hence you may find some very rare species entering in as significant contributors in the analysis. This may or may not be desirable. These types of decisions need to be made with the scientific foundation and questions in mind.
Summary
The results of principal component analysis depend on the scales at which the variables are measured. Variables with the highest sample variances will tend to be emphasized in the first few principal components. Principal component analysis using the covariance matrix should only be considered if all of the variables have the same units of measurement.
If the variables either have different units of measurement (e.g., pounds, feet, gallons, etc.), or if we wish each variable to receive equal weight in the analysis, then the variables should be standardized before a principal components analysis is carried out. Standardize each variable by subtracting its mean and dividing by its standard deviation:
\[Z_{ij}=\frac{X_{ij}-\bar{x}_j}{s_j}\]
where
- \(X_{ij}\) = Data for variable j in sample unit i
- \(\bar{x}_{j}\) = Sample mean for variable j
- \(s_j\) = Sample standard deviation for variable j
We will now perform the principal component analysis using the standardized data.
Note: the variance-covariance matrix of the standardized data is equal to the correlation matrix for the unstandardized data. Therefore, principal component analysis using the standardized data is equivalent to principal component analysis using the correlation matrix.
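The equivalence stated in this note can be confirmed on simulated data (the data matrix below is an assumption for illustration). The standard deviation uses the n - 1 divisor (`ddof=1`) so that it matches the divisor used by `numpy.cov`.

```python
import numpy as np

rng = np.random.default_rng(4)
# Simulated correlated variables on very different scales.
X = rng.normal(size=(80, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 5.0]])

# Z_ij = (X_ij - x_bar_j) / s_j
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

cov_Z = np.cov(Z, rowvar=False)      # variance-covariance matrix of the standardized data
R = np.corrcoef(X, rowvar=False)     # correlation matrix of the unstandardized data

print(np.round(cov_Z, 4))
print(np.round(R, 4))
```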
Principal Component Analysis Procedure
The principal components are calculated by first obtaining the eigenvalues of the sample correlation matrix \(\textbf{R}\),
\(\hat{\lambda}_1,\hat{\lambda}_2,\dots,\hat{\lambda}_p\)
together with the corresponding eigenvectors
\(\mathbf{\hat{e}}_1,\mathbf{\hat{e}}_2,\dots,\mathbf{\hat{e}}_p\)
Then the estimated principal component scores are calculated using formulas similar to before, but instead of using the raw data we use the standardized data in the formulas below:
\(\begin{array}{lll}\hat{Y}_1&=&\hat{e}_{11}Z_1+\hat{e}_{12}Z_2+\dots+\hat{e}_{1p}Z_p\\\hat{Y}_2&=&\hat{e}_{21}Z_1+\hat{e}_{22}Z_2+\dots+\hat{e}_{2p}Z_p\\&&\vdots\\\hat{Y}_p&=&\hat{e}_{p1}Z_1+\hat{e}_{p2}Z_2+\dots+\hat{e}_{pp}Z_p\\\end{array}\)
The rest of the procedure and the interpretations are as discussed before.
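A minimal sketch of the correlation-matrix version of the procedure, on simulated data with deliberately different scales (the data and variable names are assumptions, not the Places Rated data):

```python
import numpy as np

rng = np.random.default_rng(5)
shared = rng.normal(size=(120, 1))                   # common factor to induce correlation
scales = np.array([1.0, 10.0, 100.0, 1000.0])        # very different units
X = (rng.normal(size=(120, 4)) + shared) * scales

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)     # standardized data
R = np.corrcoef(X, rowvar=False)                     # sample correlation matrix

vals, vecs = np.linalg.eigh(R)
order = np.argsort(vals)[::-1]
lam_hat, E_hat = vals[order], vecs[:, order]

scores = Z @ E_hat                                   # scores from standardized variables
print(np.round(lam_hat, 4), lam_hat.sum())
```

Since each standardized variable has variance one, the eigenvalues add up to the number of variables, here 4.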
7.6 Example: Places Rated after Standardization
The previous analysis is repeated after standardizing the variables.
The SAS program places1.sas will implement the principal component procedures using the standardized data.
The output begins with descriptive information, including the means and standard deviations for the individual variables.
This is followed by the correlation matrix for the data. For example, the correlation between the housing and climate data was only 0.273. No hypothesis tests are presented for whether these correlations are equal to zero. We will use this correlation matrix instead to obtain our eigenvalues and eigenvectors.
We need to focus on the eigenvalues of the correlation matrix that correspond to each of the principal components. In this case, the total variation of the standardized variables is equal to p, the number of variables. After standardization each variable has variance equal to one, and the total variation is the sum of these variances; in this case the total variation is 9.
The eigenvalues of the correlation matrix are given in the second column of the table below. Note also the proportion of variation explained by each of the principal components, as well as the cumulative proportion of the variation explained.
Step 1
Examine the eigenvalues to determine how many principal components should be considered:
Component  Eigenvalue  Proportion  Cumulative
1          3.2978      0.3664      0.3664
2          1.2136      0.1348      0.5013
3          1.1055      0.1228      0.6241
4          0.9073      0.1008      0.7249
5          0.8606      0.0956      0.8205
6          0.5622      0.0625      0.8830
7          0.4838      0.0538      0.9368
8          0.3181      0.0353      0.9721
9          0.2511      0.0279      1.0000
The first principal component explains about 37% of the variation. Furthermore, the first four principal components explain 72%, while the first five principal components explain 82% of the variation. Compare these proportions with those obtained using non-standardized variables. This analysis is going to require a larger number of components to explain the same amount of variation as the original analysis using the variance-covariance matrix. This is not unusual.
In most cases, the required cutoff is pre-specified; i.e., how much of the variation is to be explained is pre-determined. For instance, I might state that I would be satisfied if I could explain 70% of the variation. If we do this, then we select as many components as necessary to reach 70% of the variation. This would be one approach. This type of judgment is arbitrary and hard to make if you are not experienced with these types of analysis. The goal, to some extent, also depends on the type of problem at hand.
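Applying the 70% rule to the eigenvalues in the table above can be sketched as follows. The eigenvalues are copied from the table; the variable names and the 0.70 target are the only other inputs.

```python
import numpy as np

# Eigenvalues of the correlation matrix, copied from the table above.
eigenvalues = np.array([3.2978, 1.2136, 1.1055, 0.9073, 0.8606,
                        0.5622, 0.4838, 0.3181, 0.2511])

cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
target = 0.70
# Index of the first cumulative proportion that reaches the target, plus one.
k = int(np.argmax(cumulative >= target)) + 1

print(k, round(float(cumulative[k - 1]), 4))
```

With these values the rule selects four components, which together explain about 72% of the variation.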
Another approach would be to plot the differences between the ordered values and look for a break or a sharp drop. The only sharp drop that is noticeable in this case is after the first component. One might, based on this, select only one component. However, one component is probably too few, particularly because we have only explained 37% of the variation. Consider the scree plot based on the standardized variables.
The scree plot for standardized variables (correlation matrix)
Step 2
Next, we can compute the principal component scores using the eigenvectors. This is a formula for the first principal component:
\(\begin{array}{lll}\hat{Y}_1&=&0.158\times Z_{\text{climate}}+0.384\times Z_{\text{housing}}+0.410\times Z_{\text{health}}\\&&+0.259\times Z_{\text{crime}}+0.375\times Z_{\text{transportation}}+0.274\times Z_{\text{education}}\\&&+0.474\times Z_{\text{arts}}+0.353\times Z_{\text{recreation}}+0.164\times Z_{\text{economy}}\end{array}\)
And remember, this is now a function not of the raw data but of the standardized data.
The magnitudes of the coefficients give the contributions of each variable to that component. Since the data have been standardized, they do not depend on the variances of the corresponding variables.
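As a quick check on these coefficients, the sketch below verifies that their squares add up to approximately one (the constraint from Section 7.1; the small gap is due to rounding to three decimals) and then evaluates the score for one community. The standardized values in `z` are hypothetical, invented only to show the calculation, and are not taken from the data set.

```python
import numpy as np

variables = ["climate", "housing", "health", "crime", "transportation",
             "education", "arts", "recreation", "economy"]
# Coefficients of the first principal component, copied from the formula above.
e1 = np.array([0.158, 0.384, 0.410, 0.259, 0.375, 0.274, 0.474, 0.353, 0.164])

sum_of_squares = float(e1 @ e1)      # should be close to 1

# Hypothetical standardized values for one community (illustration only).
z = np.array([0.5, -0.2, 1.0, 0.3, 0.8, 0.0, 1.5, 0.4, -0.1])
score = float(e1 @ z)                # first principal component score

print(round(sum_of_squares, 4), round(score, 4))
# Contribution of each variable to this community's score.
for name, contribution in zip(variables, e1 * z):
    print(f"{name:15s} {contribution: .4f}")
```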
Step 3
Next, we can look at the coefficients for the principal components. In this case, since the data are standardized, the relative magnitudes of the coefficients within a column can be directly assessed. Each column here corresponds with a column in the output of the program labeled Eigenvectors.
                    Principal Component
Variable            1       2       3       4       5
Climate             0.158   0.069   0.800   0.377   0.041
Housing             0.384   0.139   0.080   0.197   0.580
Health              0.410   0.372   0.019   0.113   0.030
Crime               0.259   0.474   0.128   0.042   0.692
Transportation      0.375   0.141   0.141   0.430   0.191
Education           0.274   0.452   0.241   0.457   0.224
Arts                0.474   0.104   0.011   0.147   0.012
Recreation          0.353   0.292   0.042   0.404   0.306
Economy             0.164   0.540   0.507   0.476   0.037
Interpretation of the principal components is based on finding which variables are most strongly correlated with each component. In other words, we need to decide which numbers are large within each column. In the first column we will decide that Health and Arts are large. This is very arbitrary. Other variables might have also been included as part of this first principal component.
Component Summaries
First Principal Component Analysis - PCA1
The first principal component is a measure of the quality of Health and the Arts, and to some extent Housing, Transportation and Recreation. Health increases with increasing values in the Arts. If any of these variables goes up, so do the remaining ones. They are all positively related, as they all have positive signs.
Second Principal Component Analysis - PCA2
The second principal component is a measure of the severity of crime, the quality of the economy, and the lack of quality in education. Crime and Economy increase with decreasing Education. Here we can see that cities with high levels of crime and good economies also tend to have poor educational systems.
Third Principal Component Analysis - PCA3
The third principal component is a measure of the quality of the climate and poorness of the economy. Climate increases with decreasing Economy. The inclusion of economy within this component will add a bit of redundancy within our results. This component is primarily a measure of climate, and to a lesser extent the economy.
Fourth Principal Component Analysis - PCA4
The fourth principal component is a measure of the quality of education and the economy and the poorness of the transportation network and recreational opportunities. Education and Economy increase with decreasing Transportation and Recreation.
Fifth Principal Component Analysis - PCA5
The fifth principal component is a measure of the severity of crime and the quality of housing. Crime increases with decreasing housing.
7.7 Once the Components Have Been Calculated
One can interpret these component by component. One method of deciding how many components to include is to keep only those that give unambiguous results, i.e., where no variable appears in two different columns as a significant contribution.
Note that the primary purpose of this analysis is descriptive; it is not hypothesis testing! So your decision in many respects needs to be made based on what provides you with a good, concise description of the data.
We have to make a decision as to what is an important correlation, not necessarily from a statistical hypothesis testing perspective, but from, in this case, an urban-sociological perspective. You have to decide what is important in the context of the problem at hand. This decision may differ from discipline to discipline. In some disciplines, such as sociology and ecology, the data tend to be inherently 'noisy', and in this case you would expect 'messier' interpretations. If you are working in a discipline such as engineering, where everything has to be precise, you might put higher demands on the analysis and want to have very high correlations. Principal components analysis is mostly used in sociological and ecological types of applications, as well as in marketing research.
As before, you can plot the principal components against one another and explore where the data for certain observations lie.
Sometimes the principal component scores will be used as explanatory variables in a regression. In regression settings you might have a very large number of potential explanatory variables to work with, and you may not have much of an idea as to which ones are important. What you might do is perform a principal components analysis first and then perform a regression predicting the variable of interest from the principal components themselves. The nice thing about this analysis is that the estimated regression coefficients will not depend on one another, since the components are uncorrelated with one another. In this case you can actually say how much of the variation in the variable of interest is explained by each of the individual components. This is something that you cannot normally do in multiple regression.
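A small simulation can illustrate this property. Everything below is an assumption for illustration: the predictors, the response `y` and its coefficients are simulated, and ordinary least squares is done with `numpy.linalg.lstsq`. Because the component scores are uncorrelated, the coefficient of each component is the same whether it is fitted alone or together with the others, and the squared correlations with the response add up to the overall R-squared.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 200, 4
# Simulated correlated predictors and a response (illustrative only).
X = rng.normal(size=(n, p)) + rng.normal(size=(n, 1))
y = X @ np.array([1.0, -0.5, 0.0, 2.0]) + rng.normal(size=n)

Xc = X - X.mean(axis=0)
yc = y - y.mean()
vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
scores = Xc @ vecs[:, ::-1]              # component scores, largest variance first

# Regression of y on all component scores at once (no intercept needed after centering).
beta_all, *_ = np.linalg.lstsq(scores, yc, rcond=None)
# Regression of y on each component separately.
beta_single = np.array([(scores[:, j] @ yc) / (scores[:, j] @ scores[:, j])
                        for j in range(p)])

# Share of the variation in y explained by each component, and in total.
r2_each = np.array([np.corrcoef(scores[:, j], yc)[0, 1] ** 2 for j in range(p)])
fitted = scores @ beta_all
r2_total = 1 - np.sum((yc - fitted) ** 2) / np.sum(yc ** 2)

print(np.round(beta_all, 4), np.round(beta_single, 4))
print(np.round(r2_each, 4), round(float(r2_total), 4))
```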
One of the problems that we have with this analysis is that, because of all of the numbers involved, the analysis is not as 'clean' as one would like. For example, in looking at the second and third components, the economy is considered to be significant for both of those components. As you can see, this will lead to an ambiguous interpretation in our analysis.
An alternative method of data reduction is Factor Analysis, where factor rotations are used to reduce the complexity and obtain a cleaner interpretation of the data.
7.8 Summary
In this lesson we learned about:
- The definition of a principal components analysis
- How to interpret the principal components
- How to select the number of principal components to be considered
- How to choose between doing the analysis based on the variance-covariance matrix or the correlation matrix.
Look for this lesson's homework problems that will give you a chance to put what you have learned to use...