Ellipsoidal representations about correlations (Towards general correlation theory)
Toshiyuki Shimono, [email protected]
KAKENHI* Symposium (* Grant-in-Aid for Scientific Research)
University of Tsukuba, 2011-11-8
A fundamental theory in statistics, possibly applicable to data mining, machine learning, and epistemology. My principia mathematica, 2nd version.
Transcript
Page 1: Ellipsoidal Representations about correlations (2011-11, Tsukuba, Kakenhi-Symposium)

Ellipsoidal representations about correlations
(Towards general correlation theory)

Toshiyuki Shimono, [email protected]

KAKENHI* Symposium (* Grant-in-Aid for Scientific Research)

University of Tsukuba, 2011-11-8

Page 2:

My profile

• My jobs mainly involve building algorithms over data in large amounts, such as:
  o web access logs
  o newspaper articles
  o POS (Point of Sales) data
  o tags of millions of pictures
  o links among billions of pages
  o psychology test results of a human-resources company
  o data produced for recommendation engines
  o data produced by an original search engine

• This presentation touches on those above.

Page 3:

Background

1. Paradoxes of real-world data:
   o Even elaborate regression analysis mostly gives ρ < 0.7.
     (This is when the observations are not very accurate; 0.7 is an arbitrary threshold.)
     -> So how should we deal with them?
   o Data accuracy seems unimportant for seeing ρ when ρ < 0.7.
     -> Details shown later.

2. My tentative answer:
   o Correlations are very important, so we need interpretation methods.
   o Ellipsoids will give you insights.

3. Then we will:
   o understand the real world, which is dominated by weak correlations,
   o and hopefully find new rules and findings across broad science.

Page 4:

Main contents

§1. What is ρ?
  o Shape of the ellipse/ellipsoid
  o Mysterious robustness

§2. Geometry of regression
  o Similarity ratio of ellipses
  o Graduated rulers
  o Linear scalar fields

Page 5:

§1. What is ρ? (ρ: the correlation coefficient)

It was developed by Karl Pearson from a similar but slightly different idea introduced by Francis Galton in the 1880s. 

(quoted from en.wikipedia.org)

Page 6:

The shapes of correlation ellipses (1)

Each panel of the left figure shows a 2-dimensional Gaussian distribution, with ρ changing from -1 to +1 in steps of 0.1 (5000 points are plotted for each).
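A minimal numpy sketch (the function name is my own) of how each panel's point cloud can be generated, using the standard construction y = ρ·u + √(1-ρ²)·v for independent standard normals u, v:

```python
import numpy as np

def sample_correlated_gaussian(rho, n=5000, seed=0):
    """Draw n standardized 2-D Gaussian points with correlation rho."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(n)
    v = rng.standard_normal(n)
    x = u
    y = rho * u + np.sqrt(1.0 - rho ** 2) * v  # corr(x, y) = rho
    return x, y

# One point cloud per panel, rho running from -1 to +1 in steps of 0.1:
for rho in np.linspace(-1.0, 1.0, 21):
    x, y = sample_correlated_gaussian(rho)
    # each (x, y) cloud would be scatter-plotted as one panel of the figure
```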

Page 7:

The shapes of correlation ellipses (2) 

The ellipse is inscribed in the unit square, touching it at the 4 points ±(1, ρ) and ±(ρ, 1).

The density function of the standardized 2-dim Gaussian distribution is
  f(x, y) = 1 / (2π √(1-ρ²)) · exp( -(x² - 2ρxy + y²) / (2(1-ρ²)) ).

Note: for higher dimensions, see the correlation ellipsoid on page 11.

Page 8:

The shapes of correlation ellipses (3)

When you draw the ellipses above:
1. draw an ellipse with half-axes of √(1+ρ) and √(1-ρ),
2. rotate it 45 degrees,
3. apply parallel shifts and axial rescalings.

• Displacement and axial rescaling are allowed. (Rotation, or rescaling along any other direction, is prohibited.)
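The three steps can be checked numerically. A sketch (assuming numpy; the function name is mine) that builds the boundary exactly as described, and confirms it stays inside the unit square and touches it near (±1, ±ρ):

```python
import numpy as np

def correlation_ellipse(rho, num=400):
    """Boundary of the standardized correlation ellipse
    x^2 - 2*rho*x*y + y^2 = 1 - rho^2, drawn by the steps above
    (no parallel shift or axial rescaling applied here)."""
    t = np.linspace(0.0, 2.0 * np.pi, num)
    # step 1: ellipse with half-axes sqrt(1 + rho) and sqrt(1 - rho)
    p = np.vstack([np.sqrt(1.0 + rho) * np.cos(t),
                   np.sqrt(1.0 - rho) * np.sin(t)])
    # step 2: rotate it 45 degrees
    c = s = np.sqrt(0.5)
    return np.array([[c, -s], [s, c]]) @ p  # 2 x num boundary points

xy = correlation_ellipse(0.6)
print(np.max(np.abs(xy)))  # stays within the unit square (max coordinate is 1)
```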

Page 9:

The shapes of correlation ellipses (4) [Baseball example]

The 6 teams of the Central League played 130 games in each of the past 31 years. Each dot below corresponds to one team in one year (N = 186 = 6 × 31).

x: total score gained (G), y: -rank; ρ = 0.419

x: total score lost (L), y: -rank; ρ = -0.471

x: total score gained, y: total score lost; ρ = 0.423

x: -rank predicted from both G & L, y: -rank; ρ = 0.828

(The prediction is obtained through multiple regression analysis.)

Page 10:

The shapes of correlations (5) SKIP

Page 11:

Correlation ellipsoid (higher dimension)

For the 3-dim case, the probability ellipsoid touches the unit cube at the 6 points ±(ρ_i1, ρ_i2, ρ_i3) for i = 1, 2, 3. (For k dimensions, the hyper-ellipsoid touches the unit hyper-cube at the 2k points ±(ρ_i1, ρ_i2, ..., ρ_ik) for i = 1, 2, ..., k.)

The ρ-matrix herein is

  1    0.3  0.5
  0.3  1    0.7
  0.5  0.7  1

and the figure shows the x, y, z axes with the six tangent points
±(1, 0.3, 0.5), ±(0.3, 1, 0.7), ±(0.5, 0.7, 1).
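This tangency is easy to verify for the ρ-matrix above: each column p of R lies both on the ellipsoid {x : x^T R^(-1) x = 1} (since p^T R^(-1) p = R_ii = 1) and on a face of the unit cube. A small numpy check:

```python
import numpy as np

R = np.array([[1.0, 0.3, 0.5],   # the rho-matrix from the slide
              [0.3, 1.0, 0.7],
              [0.5, 0.7, 1.0]])
Rinv = np.linalg.inv(R)

for i in range(3):
    p = R[:, i]                              # i-th tangent point
    assert abs(p @ Rinv @ p - 1.0) < 1e-9    # lies on the ellipsoid
    assert abs(p[i] - 1.0) < 1e-12           # lies on the cube face x_i = 1

print(R.T)  # each row is a tangent point (and its negation is another)
```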

Page 12:

The mysterious robustness (1) 

ρ[X:Y] and ρ[f(X):g(Y)] seem to differ only a little from each other
• when f and g are both increasing functions,
• unless X, Y, f(X), or g(Y) contains outlier(s).

(Sampling fluctuations of ρ are much larger than the effect caused by non-linearity, as well as by the error ε.)

* A function f(·) is increasing iff f(x) ≦ f(y) holds for any x ≦ y.

Page 13:

The mysterious robustness (2)

ρ[X:Y] = 0.557 (original)
ρ[X²:Y] = 0.519 (X squared)
ρ[X:Y²] = 0.536 (Y squared)
ρ[Xrank:Yrank] = 0.537 (X and Y rank-transformed)
ρ[X:log(Y)] = 0.539 (Y log-transformed)
ρ[X(5):Y(5)] = 0.507 (X and Y discretized into 5 levels)
ρ[X(7):Y(7)] = 0.524 (X and Y discretized into 7 levels)

• These deformations have only a small effect on ρ,
• while even N = 200 ≫ 1 causes larger fluctuations of ρ.

Even N = 200 gives the sampled correlation rather large fluctuations, whereas the X marks from the deformation experiments concentrate.

(x, y) = (u, 0.5u + 0.707v), with (u, v) drawn uniformly from a square.
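This deformation experiment can be replicated in a few lines (a sketch; the transform choices follow the slide where possible, but the log transform is skipped since y can be arbitrarily close to 0 here):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
u, v = rng.random(n), rng.random(n)   # uniform on the unit square
x, y = u, 0.5 * u + 0.707 * v         # the slide's construction

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

def ranks(a):
    return np.argsort(np.argsort(a))  # rank transform

base = corr(x, y)
variants = {
    "X squared": corr(x ** 2, y),
    "ranked":    corr(ranks(x), ranks(y)),
    "5 levels":  corr(np.digitize(x, np.quantile(x, [.2, .4, .6, .8])), y),
}
for name, r in variants.items():
    print(f"{name}: {r:.3f}  (original {base:.3f})")
```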

Page 14:

The mysterious robustness (3)

Sampled ρ values fluctuate according to the sample size, N = 30 (blue) or N = 300 (red). The deformation effect of f(·) is smaller than these fluctuations.

Page 15:

Where does the champion come from?

If the ρ of the game is not close to 1, the truly strongest player usually cannot win. The winner is approximately ρ times as strong as the truly strongest player (if the results and the potential abilities form a 2-dim 0-centered Gaussian).

The champion of a game is often not the true champion.
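The claim can be checked by simulation: if (ability, result) is a zero-centered bivariate Gaussian with correlation ρ, the winner's expected ability is about ρ times that of the truly strongest player. A numpy sketch (the parameters are arbitrary):

```python
import numpy as np

def winner_strength_ratio(rho, players=200, games=4000, seed=7):
    """Mean ability of the game winner, relative to the mean ability
    of the truly strongest player, over many simulated games."""
    rng = np.random.default_rng(seed)
    ability = rng.standard_normal((games, players))
    noise = rng.standard_normal((games, players))
    result = rho * ability + np.sqrt(1.0 - rho ** 2) * noise
    winners = ability[np.arange(games), np.argmax(result, axis=1)]
    return winners.mean() / ability.max(axis=1).mean()

print(winner_strength_ratio(0.5))  # roughly 0.5, as the slide suggests
```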

Page 16:

Summary of `§1. What is ρ?'

• ρ is recognizable as an ellipse.
• A ρ-matrix is recognizable as an ellipsoid.
• ρ seems robust against axial deformations unless outliers exist.
• The ρ of a game is suggested by its champions.

Page 17:

§2. Geometry of Regression

The figures herein show the possible region where (x, y, z) = (ρ[Y:Z], ρ[Z:X], ρ[X:Y]) can exist.
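A triple (x, y, z) is realizable exactly when the corresponding 3×3 correlation matrix is positive semidefinite; with each |value| ≦ 1, this reduces to 1 + 2xyz - x² - y² - z² ≧ 0. A small checker (the function name is mine):

```python
import numpy as np

def is_feasible(x, y, z):
    """Can (x, y, z) = (rho[Y:Z], rho[Z:X], rho[X:Y]) coexist?
    Yes iff the correlation matrix of (X, Y, Z) is positive semidefinite."""
    R = np.array([[1.0, z, y],
                  [z, 1.0, x],
                  [y, x, 1.0]])
    return bool(np.all(np.linalg.eigvalsh(R) >= -1e-12))

print(is_feasible(0.9, 0.9, 0.9))   # True: det = 0.028 >= 0
print(is_feasible(0.9, -0.9, 0.9))  # False: an impossible combination
```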

Page 18:

Multiple-ρ is the similarity ratio of ellipses

(When X is k-dimensional, the hyper-ellipsoid is determined by the k×k matrix whose elements are ρ[Xi:Xj], and the inner point is the k-dimensional vector whose elements are ρ[Xi:Y].)

[ Formulation of MRA ]

[ Multiple-ρ ]

The multiple-ρ (≦ 1) is the similarity ratio of the two ellipses.
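For two predictors the multiple-ρ has the standard closed form √(r^T Rxx^(-1) r). Plugging in the baseball correlations from page 9 reproduces the multiple-ρ of 0.828 quoted on page 22 (a sketch; the variable names are mine):

```python
import numpy as np

# Baseball example (page 9): X1 = score gained, X2 = score lost, Y = -rank.
r12 = 0.423                        # rho[X1:X2]
r_y = np.array([0.419, -0.471])    # (rho[Y:X1], rho[Y:X2])

Rxx = np.array([[1.0, r12], [r12, 1.0]])
multiple_rho = float(np.sqrt(r_y @ np.linalg.solve(Rxx, r_y)))
print(round(multiple_rho, 3))      # about 0.829, matching the slide's 0.828
```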

Page 19:

Examples : Multiple-ρ from the ellipses

Many interesting phenomena would be systematically explained.

Page 20:

Partial-ρ is read by a ruler in the ellipse

The red ruler,
• parallel to the corresponding axis,
• passing through (r1, r2),
• fully extended inside the ellipse,
• graduated linearly over ±1,

reads the partial-ρ.

The partial correlation r1' comes from the idea of the correlation between X1 and Y with X2 fixed.

r1' = 0.75 in this case. r2' is read likewise by turning the ruler to the vertical direction.
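The ruler reading agrees with the standard partial-correlation formula. A sketch applying it to the baseball correlations from page 9 (the r1' = 0.75 above belongs to the figure's own (r1, r2), not to this data):

```python
import numpy as np

def partial_rho(r_xy, r_xz, r_yz):
    """Partial correlation of X and Y with Z held fixed (standard formula)."""
    return (r_xy - r_xz * r_yz) / np.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))

# rho[Y:X1 | X2] with rho[Y:X1] = 0.419, rho[Y:X2] = -0.471, rho[X1:X2] = 0.423:
print(round(partial_rho(0.419, -0.471, 0.423), 3))  # about 0.773
```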

Page 21:

Standardized partial regression coefficients

• The ai are called the partial regression coefficients.
• Assume X1, X2, Y are standardized.

Make a scalar field inside the ellipse:
• 1 on the plus-side boundary point of the k-th axis,
• 0 on the boundary points of the other axes,
• interpolate the assigned values linearly.

Then ak is read off as the field value at (r1, r2).

Note:
• Extension to higher dimensions is easy.
• The boundary point on each facet is unique.
• This pictorialization may be useful for SEM (Structural Equation Modeling).
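For standardized variables, the coefficients a solve the normal equations Rxx · a = r_y. A sketch using the baseball correlations from page 9 (a numerical cross-check of the scalar-field reading, not taken from the slides):

```python
import numpy as np

Rxx = np.array([[1.0, 0.423],      # rho[X1:X2] from the baseball data
                [0.423, 1.0]])
r_y = np.array([0.419, -0.471])    # (rho[Y:X1], rho[Y:X2])

a = np.linalg.solve(Rxx, r_y)      # standardized partial regression coefficients
print(np.round(a, 3))              # close to [0.753, -0.790]
```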

Page 22:

The elliptical depiction for the baseball example (this page was added after the symposium)

Red: the multiple-ρ (0.828); Blue: the two partial-ρ; Magenta: the partial regression coefficients.

Each value corresponds to the ratio of the length of the bold part to the whole same-colored line section.

X1: annual total score gained; X2: annual total score lost; Y: zero minus annual ranking.

(ρ[Y:X1], ρ[Y:X2]) = (0.419, -0.471) is plotted inside the ellipse slanted with ρ[X1:X2] = 0.423.

-> The meaning of the numbers becomes clearer.

Page 23:

Summary and findings of  §2 Geometry of regression

• Multiple-ρ is the similarity ratio of two ellipses/ellipsoids.
• Partial-ρ is read with a graduated ruler in the ellipse/ellipsoid.
• Each regression coefficient is given by the scalar field.

So far, the numbers derived from MRA (Multiple Regression Analysis) have often been said to be hard to grasp. But this situation can be changed.

Page 24:

Summary as a whole

[ Main results ] Using the ellipse or hyper-ellipsoid,
• any correlation matrix is wholly pictorialized,
• multiple regression is translated into geometric quotients.

[ Sub results ]
• ρ seems quite robust against axial deformations unless outliers exist.
• (Spherical trigonometry may give you insights.) <- Not covered today.

[ Next steps ]
• treat parameter/sampling perturbations
• systematize interesting statistical phenomena
• produce further new theories
• give new twists to other research areas
• make useful applications to real-world cases
• organize a new logic system for this ambiguous world

Page 25:

Refs
1. 岩波数学辞典 (Encyclopedic Dictionary of Mathematics), The Mathematical Society of Japan
2. R, http://www.r-project.org/
3. 共分散構造分析 [事例編] (Covariance Structure Analysis [Case Studies])

The author sincerely welcomes any related literature.

Page 26:

Background of this presentation SKIP

1. We make judgements from related things in daily or social life, but the real world is noisy and filled with exceptions.
   e.g. "Do better posture and mental concentration cause better performance?"

2. Real-world data causes paradoxes:
   o Any elaborate regression analysis mostly gives ρ < 0.7; how should we deal with this?
   o Data accuracy is not important when ρ < 0.7 (details shown later).
   o Why does subjective sense work in the real world?

3. Geometric interpretations of multiple regression analysis may be useful,
   o taking in any correlation matrix wholly,
   o geometrically, using ellipsoids,
   to observe and analyze the background phenomena in detail.

4. Then we will understand the weak correlations that dominate our world.

Page 27:

A primitive question SKIP

Question: Why (and how) is data analysis important?

My answer: It gives you inspiration and updates your recognition of the real world. Knowing the numbers μ, σ, ρ, ranking, VaR* behind the phenomena you have met is crucially important for your next action in daily, social, or business life!
(* average, standard deviation, correlation coefficient, rank order, Value at Risk)

And so an interpretation of these numbers is necessary. (And I provide you that of ρ today!)

Page 28:

Main ideas in more detail SKIP

Using the ellipse or hyper-ellipsoid,
• 2nd-order moments are completely imaginable in a picture,
• the numbers from multiple regression are also imaginable.

1. (Pearson's) correlation coefficient
• a basic of statistics (as you know)
• may change greatly when outliers are contained
• however, changes only a little under `monotone' maps
• depicted as a 'correlation ellipse'

2. Multiple regression analysis
• (spherical-surface interpretation)
• ellipse interpretation

Page 29:
Page 30:

Main ideas  SKIP

1. What is the correlation coefficient, after all?
2. Geometric interpretations of Multiple Regression Analysis.

Page 31:

The mysterious robustness (3) SKIP

Front figures: x is the original sampled correlation; y is the correlation after 3-level discretization. Back figures: samples of 100.

Page 32:
Page 33:

Summary of `§1. What is ρ?' (REDUNDANT)

• A correlation ρ is recognizable as an ellipse.
• A correlation matrix is also recognizable as an ellipsoid.
• ρ seems robust against axial deformations unless outliers exist.
• You can guess the `ρ' of a game from its champions.

Page 34:
Page 35:

When partial-ρ is zero. (SKIP)

The condition partial-ρ = 0 ⇔
• The inner angle of the spherical triangle is 90 degrees.
• The two `hyper-planes' cross at 90 degrees at the `hyper-axis'. The axis corresponds to the fixed variables, and each of the planes contains one of the two variables.
• On the ellipse/ellipsoid, the characteristic point is at the midpoint of the ruler.

Page 36:

Multiple-ρ is the similarity ratio of ellipses (REDUNDANT)

(When X is k-dimensional, the hyper-ellipsoid is determined by the k×k matrix whose elements are ρ[Xi:Xj], and the inner point is the k-dimensional vector whose elements are ρ[Xi:Y].)

[ Formulation of MRA ]

[ Multiple-ρ ]

For an arbitrary number of variables, you calculate: the inverse of the correlation matrix → the reciprocal of each diagonal element → 1 minus each of them → the square root of each. Each result is the multiple-ρ of the corresponding variable from the remaining variables.

The multiple-ρ (≦ 1) is the similarity ratio of the two ellipses.
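The recipe above can be written directly (a sketch; the function name is mine) and cross-checked against the closed form √(r^T Rxx^(-1) r) for a single variable:

```python
import numpy as np

def multiple_rhos(R):
    """Invert the correlation matrix, take the reciprocal of each diagonal
    element, subtract from 1, take the square root: entry i is the
    multiple-rho of variable i against all the others."""
    d = np.diag(np.linalg.inv(R))
    return np.sqrt(1.0 - 1.0 / d)

R = np.array([[1.0, 0.3, 0.5],     # the rho-matrix from page 11
              [0.3, 1.0, 0.7],
              [0.5, 0.7, 1.0]])
print(np.round(multiple_rhos(R), 3))
```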

Page 37:

Summary and findings of §2 Geometry of regression (REDUNDANT)

• Multiple-ρ is the similarity ratio of two ellipses/ellipsoids.
• Partial-ρ is read with a graduated ruler in the ellipse/ellipsoid.
• Each regression coefficient is given by the scalar field.
• (Spherical trigonometry)

So far, the numbers derived from MRA have often been said to be hard to grasp. But this situation can be changed.

Page 38:

Introduction (this page was added after the symposium)

There is a Japanese word `kaizen', which means improvement.

The real world is, however, so ambiguous that it is often hard to know whether a given kaizen action will have a positive effect.

Sometimes your action may have a negative effect, or zero effect in an averaged sense, even if you believe it is a good one. Assume a situation where you can control a variable to produce some effect on an outcome variable (the number of control variables will increase in what follows).

The author's hypothetical proposition is that the correlation coefficient indeed plays an important role. One reason is that, when the correlation is positive, your rational action is simply to increase the value of the control variable. And it seems very reasonable that you should select a variable strongly correlated with the output variable.

The problems still existing today are as follows:
- The meaning of a correlation value is not yet well understood.
- The meaning of multiple regression analysis is also not yet well understood (although, when the correlation is weak, the reasonable choice of analysis is multiple regression or one of its elaborate derivatives).

The author found that correlation is very robust against any `axial deformation' unless the variables contain outliers; rather, the sampled correlation coefficient fluctuates much more in many cases when N is less than 1000. The author also found geometrical backgrounds of the correlations in multiple regression analysis (perhaps R. A. Fisher already knew this, but nobody around the author did), which produce many insights.

(The robustness is not fully analyzed at this moment; only some pieces of analysis and numerical examples exist. The geometrical background has been analyzed on its basic points, so the author is considering investigating parameter perturbations further.)

This page may need intensive proofreading by the author.