SVD Recommender: exact line search along the gradient.

Given a relationship matrix, r, between 2 entities, u and m (e.g., the Netflix "Ratings" matrix: r_{u,m} = the rating user u gives to movie m), the SVD Recommender uses the ratings in r to train 2 smaller matrices, a user-feature matrix, U, and a feature-movie matrix, M. Once U and M are trained, SVD quickly predicts the rating u would give to m as the dot product p_{u,m} = U_u o M_m. Starting with a few features, the vector U_u = the extents to which user u "likes" each feature; M_m = the level at which each feature characterizes movie m. SVD trains U and M using gradient descent minimization of sse (sum of squared errors): sse(F) = Σ_{(u,m)∈r} (U_u o M_m − r_{u,m})².

Notation: F = (U_u, M_m) feature vector; p_{u,m} = U_u o M_m prediction; r_{u,m} rating; e_{u,m} = r_{u,m} − p_{u,m} error; G = Gradient(sse), i.e.,
G = ( Σ_{(u_1,n)∈r} e_{u_1,n} M_n, ..., Σ_{(u_L,n)∈r} e_{u_L,n} M_n, ..., Σ_{(v,m_1)∈r} e_{v,m_1} U_v, ..., Σ_{(v,m_L)∈r} e_{v,m_L} U_v ).

Moving along the gradient line, F(t) ≡ (U_u + tG_u, M_m + tG_m) and p_{u,m}(t) ≡ (U_u + tG_u) o (M_m + tG_m), so
sse(t) = Σ_{(u,m)∈r} ( p_{u,m}(t) − r_{u,m} )² = Σ_{(u,m)∈r} ( U_u M_m + t(G_m U_u + G_u M_m) + t² G_u G_m − r_{u,m} )².

Abbreviating, per training pair, e ≡ U_u M_m − r_{u,m} (= −e_{u,m}), h ≡ G_m U_u + G_u M_m, g ≡ G_u G_m, the best step t satisfies
0 = d(sse)/dt = Σ_{(u,m)∈r} 2( e + th + t²g )( h + 2tg ) = Σ_{(u,m)∈r} 2( eh + t(2eg + h²) + 3t²hg + 2t³g² ),
i.e., the cubic at³ + bt² + ct + d = 0 with
a = 2 Σ_{(u,m)∈r} g²,  b = 3 Σ_{(u,m)∈r} hg,  c = Σ_{(u,m)∈r} (2eg + h²),  d = Σ_{(u,m)∈r} eh.

Solving at³ + bt² + ct + d = 0 by the cubic root formula:
t = ( q + [q² + (r − p²)³]^{1/2} )^{1/3} + ( q − [q² + (r − p²)³]^{1/2} )^{1/3} + p, where p = −b/(3a), q = p³ + (bc − 3ad)/(6a²), r = c/(3a);
expanded: t = [ −b³/27a³ + bc/6a² − d/2a + {(−b³/27a³ + bc/6a² − d/2a)² + (c/3a − b²/9a²)³}^{1/2} ]^{1/3} + [ −b³/27a³ + bc/6a² − d/2a − {(−b³/27a³ + bc/6a² − d/2a)² + (c/3a − b²/9a²)³}^{1/2} ]^{1/3} − b/3a.

Worked example: two ratings, r_{u,m} = 2 and r_{v,n} = 4, with every entry of U and M initialized to 1 (so every prediction p = 1, and the errors are 1 and 3). The cubic coefficients come out a = 164, b = −168, c = −16, d = 20, giving p = 0.3414, q = −0.004, r = −0.032, and the formula's output (t ≈ 0.0365) does not zero the derivative. Something is wrong with the cubic root formula!
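To make the derivation concrete, here is a minimal numpy sketch (my assumptions: a single feature, so U_u, M_m, G_u, G_m are scalars indexed by user/movie, and `ratings` is a hypothetical list of (u, m, r) training triples) that accumulates a, b, c, d and then picks the sse-minimizing real root:

import numpy as np

def line_search_cubic(U, M, GU, GM, ratings):
    # Coefficients of d(sse)/dt = a*t^3 + b*t^2 + c*t + d along F(t) = F + t*G.
    a = b = c = d = 0.0
    for u, m, r in ratings:
        e = U[u] * M[m] - r               # constant-in-t term of the error
        h = GM[m] * U[u] + GU[u] * M[m]   # linear-in-t term
        g = GU[u] * GM[m]                 # quadratic-in-t term
        a += 2 * g * g
        b += 3 * h * g
        c += 2 * e * g + h * h
        d += e * h
    return a, b, c, d

def best_step(U, M, GU, GM, ratings):
    ts = np.roots(line_search_cubic(U, M, GU, GM, ratings))
    ts = ts[np.isreal(ts)].real           # real critical points of sse(t)
    sse = lambda t: sum(((U[u] + t * GU[u]) * (M[m] + t * GM[m]) - r) ** 2
                        for u, m, r in ratings)
    return min(ts, key=sse)               # the sse-minimizing root

Note that np.roots handles the three-real-roots case that breaks the hand formula below.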
There are 3 zeros of the derivative function: -0.337, 0.361 and 1.000. The function values there are 0.047, 18.47 and 4.0. See the attachment and the chart: the blue curve is the derivative and the red curve is the function.
Given: 0 = ax³ + bx² + cx + d.
Step 1: Divide through by a: 0 = x³ + ex² + fx + g, where e = b/a, f = c/a, g = d/a.
Step 2: Do the horizontal shift x = z − e/3, which removes the square term, leaving 0 = z³ + pz + q, where p = f − e²/3 and q = 2e³/27 − ef/3 + g.
Step 3: Introduce u and t with u·t = (p/3)³ and u − t = q, so t = −½q ± ½√(q² + 4p³/27) and u = ½q ± ½√(q² + 4p³/27). Then 0 = z³ + pz + q has a root at z = ∛t − ∛u. Both u and t carry a '±' sign; it doesn't matter which you pick to plug into the above equation, as long as you pick the same sign for each.
a = 164, b = −168, c = −16, d = 20; p = 0.3414, q = −0.004, r = −0.032. (The spreadsheet's intermediate columns, −0.007 and −0.001 with cube roots −0.198 and −0.106, sum with p to the t ≈ 0.0365 above, which is not a zero of the derivative.)
Check at the known root x = 1 (z = x + e/3): e = −1.02, f = −0.09, g = 0.121, so p = −0.447, q = 0.0090, and q² + 4p³/27 = −0.013. The square root of a negative number errors out, so t = ERR. That is the bug: when q² + 4p³/27 < 0 the cubic has three distinct real roots (exactly our case, per the chart above), and the real-arithmetic formula would require complex cube roots; a trigonometric form must be used instead.
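The fix is easy to script. Below is a Python sketch of the three steps above, with a trigonometric branch (a standard identity, not from the original notes) added for the negative-discriminant case; on a = 164, b = −168, c = −16, d = 20 it returns the three zeros charted above (1.000, 0.361, −0.337):

import math

def cbrt(x):                      # real cube root (handles negative x)
    return math.copysign(abs(x) ** (1.0 / 3.0), x)

def solve_cubic(a, b, c, d):
    # All real roots of a*x^3 + b*x^2 + c*x + d = 0.
    e, f, g = b / a, c / a, d / a            # Step 1: divide through by a
    p = f - e * e / 3                        # Step 2: depress with x = z - e/3
    q = 2 * e ** 3 / 27 - e * f / 3 + g
    disc = q * q + 4 * p ** 3 / 27
    if disc >= 0:                            # one real root: Cardano is safe
        s = math.sqrt(disc)
        t = (-q + s) / 2                     # u - t = q, u*t = (p/3)^3
        u = ( q + s) / 2
        zs = [cbrt(t) - cbrt(u)]
    else:                                    # three real roots: sqrt(disc) is
        r = 2 * math.sqrt(-p / 3)            # imaginary, so use the trig form
        phi = math.acos(3 * q / (2 * p) * math.sqrt(-3 / p))
        zs = [r * math.cos(phi / 3 - 2 * math.pi * k / 3) for k in range(3)]
    return [z - e / 3 for z in zs]           # undo the horizontal shift

print(solve_cubic(164, -168, -16, 20))       # -> ~[1.000, 0.361, -0.337]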
Since calculus isn't working (to find the min mse along F(t) = F + tG), will this type of incremental search be efficient enough? Maybe so! In all dimensions the mse(t) equation is quartic (degree 4), so the general shape is as below (where any subset of the local extremes can coalesce).
[Spreadsheet: the ratings matrix (movies m=1..8 as rows, users a..t as columns), the feature vector fv row, LRATE = 0.001, omse = 0.1952090, and the line-search macro \a (transcribed and explained below).]
A22: +A2-A$10*$U2  /* error for u=a, m=1 */
A30: +A10+$L*(A$22*$U$2+A$24*$U$4+A$26*$U$6+A$29*$U$9)  /* updates f(u=a) */
U29: +U9+$L*(($A29*$A$30+$K29*$K$30+$N29*$N$30+$P29*$P$30)/4)  /* updates f(m=8) */
AB30: +U29  /* copies the f(m=8) feature update into the new feature vector, nfv */
W22: @COUNT(A22..T22)  /* counts the number of actual ratings (users) for m=1 */
X22: @SUM(W22..W29)  /* adds the rating counts for all 8 movies = training count */
AD30: @SUM(SE)/X22  /* averages the se's, giving the mse */
[Spreadsheet: the working-error cells and the new feature vector (nfv) row; L = 0.001, mse = 0.1952063.]
/rvnfv~fv~  copies nfv into fv (converting nfv to values).
{goto}L~{edit}+.005~  increments L by .005.
/XImse<omse-.00001~/xg\a~  if mse is still decreasing, recalculates mse with the new L.
.001~  resets L = .001 for the next round.
{goto}se~/rvfv~{end}{down}{down}~  "value copies" fv to the output list.
/xg\a~  starts over with the next round.
Notes: In 2 rounds mse is as low as Funk gets it in 2000 rounds. After 5 rounds mse is lower than ever before (and appears to be bottoming out).
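For reference, the macro's stepping logic transliterated to Python (a sketch; `mse_at` is a hypothetical stand-in for the spreadsheet's recalculation of mse at step length t):

def search_step(mse_at, t0=0.001, dt=0.005, eps=0.00001):
    # Grow the step length t while mse keeps dropping by more than eps,
    # mirroring {goto}L~{edit}+.005~ and /XImse<omse-.00001~.
    t, mse = t0, mse_at(t0)
    while True:
        nxt = mse_at(t + dt)          # try a bigger step
        if nxt >= mse - eps:          # stop once the improvement flattens
            return t, mse             # keep the last improving step
        t, mse = t + dt, nxt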
I know I shouldn't hardcode parameters! Experiments should be done to optimize this line search (e.g., with some binary search for a low mse).
Since we have the resulting individual square errors for each training pair, we could run this, then mask the pairs with se(u,m) > Threshold.
Then do it again after masking out those that have already achieved a low se. But what do I do with the two resulting feature vectors? Do I treat it like a two feature SVD or do I use some linear combo of the resulting predictions of the two (or it could be more than two)? We need to test out which works best (or other modifications) on Netflix data.
Maybe on those test pairs for which the training row and column have some high errors, we apply the second feature vector instead of the first? Maybe we invoke CkNN for test pairs in this case (or use all 3 and a linear combo?)
This is powerful! We need to optimize the calculations using pTrees!!!
pTrees in MapReduce: MapReduce and Hadoop are key-value approaches to organizing and managing BigData.
pTree Text Mining: capture the reading sequence of a text corpus, not just its term-frequency matrix (lossless capture).
Secure pTreeBases: This involves anonymizing the identities of the individual pTrees and randomly padding them to mask their initial bit positions.
pTree Algorithmic Tools: An expanded algorithmic tool set is being developed to include quadratic tools and even higher degree tools.
pTree Alternative Algorithm Implementation: Implementing pTree algorithms in hardware (e.g., FPGAs) should result in orders of magnitude performance increases?
pTree O/S Infrastructure: Computers and Operating Systems are designed to do logical operations (AND, OR...) rapidly. Exploit this for pTree processing speed.
pTree Recommender: This includes Singular Value Decomposition (SVD) recommenders, pTree Near Neighbor Recommenders, and pTree ARM Recommenders.
Research Summary: We data mine big data (big data ≡ trillions of rows and, sometimes, thousands of columns, which can complicate data mining trillions of rows). How do we do it? We structure the data table as [compressed] vertical bit columns (called "predicate Trees" or "pTrees") and process those pTrees horizontally, because processing across thousands of column structures is orders of magnitude faster than processing down trillions of row structures. As a result, some tasks that might have taken forever can be done in a humanly acceptable amount of time.
What is data mining? Largely it is classification (assigning a class label to a row based on a training table of previously classified rows).
Clustering and Association Rule Mining (ARM) are important areas of data mining also, and they are related to classification.
The purpose of clustering is usually to create [or improve] a training table. It is also used for anomaly detection, a huge area in data mining.
ARM is used to data mine more complex data (relationship matrixes between two entities, not just single-entity training tables). Recommenders recommend products to customers based on their previous purchases or rents (or based on their ratings of items).
To make a decision, we typically search our memory for similar situations (near neighbor cases) and base our decision on the decisions we (or an expert) made in those similar cases. We do what worked before (for us or for others). I.e., we let near neighbor cases vote. But which neighbors vote? "The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information" is one of the most highly cited papers in psychology, written by cognitive psychologist George A. Miller of Princeton University's Department of Psychology and published in Psychological Review. It argues that the number of objects an average human can hold in working memory is 7 ± 2 (called Miller's Law). Classification provides a better 7.
DPP-KM: 1. Check gaps in DPP_{p,d}(y) (over grids of p and d?). 1.1 Check distances at any sparse extremes. 2. After several rounds of 1, apply k-means to the resulting clusters (when k seems to be determined).
DPP-DA: 2. Check gaps in DPP_{p,d}(y) (grids of p and d?) against the density of the subcluster. 2.1 Check distances at sparse extremes against subcluster density. 2.2 Apply other methods once the dot product projection ceases to be effective.
DPP-SD: 3. Check gaps in DPP_{p,d}(y) (over a p-grid and a d-grid) and SD_p(y) (over a p-grid). 3.1 Check sparse-end distances against subcluster density. (DPP_{p,d} and SD_p share construction steps!)
The Square Distance Functional (SD): Check gaps in SD_p(y) ≡ (y−p)o(y−p) (parameterized over a p∈Rⁿ grid).
Calculate yoy, yop, yoq concurrently? Then the constant multiples 2*yop and (1/|p−q|)*yop concurrently. Then add/subtract. Calculate DPP_{pq}(y)², then subtract it from SD_p(y).
FAUST clustering (the unsupervised part of FAUST)
This class of partitioning or clustering methods relies on choosing a dot product projection so that if we find a gap in the F-values, we know that the 2 sets of points mapping to opposite sides of that gap are at least as far apart as the gap width.
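A minimal sketch of the gap test shared by all of these functionals (assumed input: F_values holds the functional's value for every point; by the property above, the parts returned are at least min_gap apart in space):

import numpy as np

def gap_partition(F_values, min_gap):
    # Sort the F-values; any consecutive difference >= min_gap is a gap.
    order = np.argsort(F_values)
    vals = np.asarray(F_values)[order]
    cuts = np.where(np.diff(vals) >= min_gap)[0]
    return np.split(order, cuts + 1)       # index arrays, one per cluster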
The Coordinate Projection Functionals (e_j): Check gaps in e_j(y) ≡ yoe_j = y_j.
The Dot Product Projection (DPP): Check for gaps in DPP_d(y) or DPP_{pq}(y) ≡ (y−p)o(p−q)/|p−q| (parameterized over a grid of d = (p−q)/|p−q| ∈ Sphereⁿ).
The Dot Product Radius (DPR): Check gaps in DPR_{pq}(y) ≡ √( SD_p(y) − DPP_{pq}(y)² ).
FAUST DPP CLUSTER on IRIS with DPP(y) = (y−p)o(q−p)/|q−p|, where p is the min (n) corner and q is the max (x) corner of the circumscribing rectangle (midpoints or the average (a) are used also).
IRIS: 150 irises (rows), 4 columns (Petal Length, Petal Width, Sepal Length, Sepal Width). The first 50 are Setosa (s), the next 50 are Versicolor (e), and the next 50 are Virginica (i) irises.
Here we project onto lines through the corners and edge midpoints of the coordinate-oriented circumscribing rectangle. It would, of course, get better results if we chose p and q to maximize gaps. Next we consider maximizing the STD of the F-values to ensure strong gaps (a heuristic method).
"Gap Hill Climbing": mathematical analysis1. To increase gap size, we hill climb the standard deviation of the functional, F (hoping that a "rotation" of d toward a higher StDev would increase the likelihood that gaps would be larger since more dispersion allows for more and/or larger gaps. This is very heuristic but it works.
2. We are more interested in growing the largest gap(s) of interest (or the largest thinning). To do this we could do the following:
F-slices are hyperplanes (assuming F = DPP_d), so it would make sense to try to "re-orient" d so that the gap grows. Instead of taking the "improved" p and q to be the means of the two entire n-dimensional half-spaces cut off by the gap (or thinning), take p and q to be the means of the F-slices ((n−1)-dimensional hyperplanes) defining the gap or thinning. This is easy, since our method produces the pTree mask of each F-slice ordered by increasing F-value (in fact, it is the sequence of F-values and the sequence of counts of points giving those values that we use to find large gaps in the first place).
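A sketch of that re-orientation step (assumed inputs: data table X, projections F = Xod, the gap's bounding F-values lo and hi, and a slice width):

import numpy as np

def slice_means(X, F, lo, hi, width):
    # p, q = means of the F-slices that bound the gap [lo, hi], rather
    # than the means of the two entire half-spaces.
    p = X[(F > lo - width) & (F <= lo)].mean(axis=0)   # slice just below the gap
    q = X[(F >= hi) & (F < hi + width)].mean(axis=0)   # slice just above the gap
    return p, q                                        # next d = (q - p)/|q - p|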
[Figure: a 2-D point set with two projection lines: d1 with its d1-gap, and d2 (drawn through p and q) with its d2-gap.]
The d2-gap is much larger than the d1-gap. It is still not the optimal gap, though. Would it be better to use a weighted mean (weighted by the distance from the gap, that is, by the d-barrel radius (from the center of the gap) on which each point lies)?
[Figure: the same 2-D point set after the weighted-mean re-orientation: lines d1 and d2 through p and q, showing the d1-gap and the larger d2-gap.]
In this example it seems to make for a larger gap, but what weightings should be used? (e.g., 1/radius²?) (Zero weighting after the first gap is identical to the previous.) Also, we really want to identify the support-vector pair of the gap (the pair, one from each side, which are closest together) as p and q (in this case 9 and a, but we were just lucky to draw our vector through them). We could check the d-barrel radius of just these gap-slice pairs and select the closest pair as p and q???
Maximizing the Variance. Given any table, X(X₁, ..., Xₙ), and any unit vector, d, in n-space, let
V(d) ≡ Variance(Xod) = mean((Xod)²) − (mean(Xod))²
= (1/N) Σ_{i=1..N} ( Σ_{j=1..n} x_{i,j}d_j )² − ( Σ_{j=1..n} X̄_j d_j )²
= (1/N) Σ_i ( Σ_j x_{i,j}²d_j² + 2 Σ_{j<k} x_{i,j}x_{i,k}d_jd_k ) − ( Σ_j X̄_j²d_j² + 2 Σ_{j<k} X̄_jX̄_k d_jd_k )
= Σ_{j=1..n} ( mean(X_j²) − X̄_j² ) d_j² + 2 Σ_{j<k} ( mean(X_jX_k) − X̄_jX̄_k ) d_jd_k.

Here Xod = F_d(X) = DPP_d(X), the column of values (x₁od, ..., x_Nod). Writing a_{jk} ≡ mean(X_jX_k) − X̄_jX̄_k (so A is the covariance matrix of X), we can separate out the diagonal or not:
V(d) = Σ_j a_{jj}d_j² + Σ_{j≠k} a_{jk}d_jd_k = Σ_{j,k} a_{jk}d_jd_k = dᵀ o A o d, subject to Σ_{i=1..n} d_i² = 1.

∇V(d) ≡ Gradient(V) = 2Aod; component-wise, (∇V)_k = 2a_{kk}d_k + 2 Σ_{j≠k} a_{kj}d_j.

Given a starting unit vector d₀, one can hill-climb to locally maximize the variance, V, as follows: d₁ ≡ ∇V(d₀)/|∇V(d₀)|; d₂ ≡ ∇V(d₁)/|∇V(d₁)|; ...

Theorem 1: there exists a k ∈ {1,...,n} such that d = e_k will hill-climb V to its global maximum.
Theorem 2 (working on it): let d = e_k s.t. a_{kk} is a maximal diagonal element of A; then d = e_k will hill-climb V to its global maximum.
How do we use this theory? For Dot-Product-gap-based Clustering, we can hill-climb V, starting from the e_k above (a_kk maximal), to a d that gives us the globally maximum variance. Heuristically, higher variance means more prominent gaps.
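Read computationally, the normalized hill-climb d ← ∇V(d)/|∇V(d)| = Ad/|Ad| is just power iteration on the covariance matrix A, so it converges (for almost every start, including the theorems' e_k) to the dominant eigenvector of A, i.e., the direction of globally maximal variance. A sketch:

import numpy as np

def hillclimb_variance(X, iters=50):
    # V(d) = d'Ad and grad V = 2Ad, so the normalized hill-climb step is
    # d <- Ad/|Ad|: power iteration on the covariance matrix A.
    A = np.cov(X, rowvar=False, bias=True)   # a_jk = mean(XjXk) - mean(Xj)mean(Xk)
    d = np.eye(A.shape[1])[np.argmax(np.diag(A))]   # Theorem 2's start: d = e_k
    for _ in range(iters):
        d = A @ d
        d /= np.linalg.norm(d)               # stay on the unit sphere
    return d, d @ A @ d                      # direction and its variance V(d)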
For Dot-Product-Gap-based Classification, we can start with X = the C×n table of the Training Set Class Means, rows M₁, M₂, ..., M_C, where M_k ≡ MeanVectorOfClass_k. Then X̄_i = mean_k(M_{k,i}) and mean(X_iX_j) = ( M_{1,i}M_{1,j} + ... + M_{C,i}M_{C,j} )/C.
These computations are O(C) (C = number of classes) and are essentially instantaneous. Once we have the matrix A, we can hill-climb to obtain a d that maximizes the variance of the dot-product projections of the class means.
Build a Decision tree. 1. Each round, find the d that maximizes the variance of the dot-product projections of the class means. 2. Each round, apply DI (see next slide).
FAUST DI: K-class training set, TK, and a given d (e.g., from D ≡ Mean_TK → Med_TK):
Let m_i ≡ mean(C_i), indexed so that dom₁ ≤ dom₂ ≤ ... ≤ dom_K; Mn_i ≡ min{doC_i}; Mx_i ≡ max{doC_i}; Mn_{>i} ≡ min_{j>i}{Mn_j}; Mx_{<i} ≡ max_{j<i}{Mx_j}.
Definite_i = ( Mx_{<i}, Mn_{>i} ). Indefinite_{i,i+1} = [ Mn_{>i}, Mx_{<i+1} ]. Then recurse on each Indefinite set.
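A sketch of the DI interval bookkeeping (assumed input: one array of F = doC_i values per class, with classes already ordered by dom_i):

import numpy as np

def definite_indefinite(class_F):
    Mn = [f.min() for f in class_F]            # Mn_i = min{d o C_i}
    Mx = [f.max() for f in class_F]            # Mx_i = max{d o C_i}
    K, out = len(class_F), {}
    for i in range(K):
        mx_lt = max(Mx[:i], default=-np.inf)   # Mx_<i
        mn_gt = min(Mn[i+1:], default=np.inf)  # Mn_>i
        out[('definite', i)] = (mx_lt, mn_gt)  # only class i lands strictly here
        if i < K - 1:
            mx_lt1 = max(Mx[:i+1])             # Mx_<(i+1)
            if mn_gt <= mx_lt1:                # nonempty overlap: recurse on it
                out[('indefinite', i, i+1)] = (mn_gt, mx_lt1)
    return out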
For IRIS, 15 records were extracted from each class for testing; the rest are the Training Set, TK. D = Mean_s → Mean_e.
Class means: s-Mean 50.49 34.74 14.74 2.43; e-Mean 63.50 30.00 44.00 13.50; i-Mean 61.00 31.50 55.50 21.50.
Definite_i (Mx_{<i}, Mn_{>i}): s (i=1): (−1, 25); e (i=2): (10, 37); i (i=3): (48, 128).
Indefinite_{i,i+1} [Mn_{>i}, Mx_{<i+1}]: se: [25, 10] empty; ei: [37, 48].
1ST ROUND, D = Mean_s → Mean_e: F < 18 → setosa (35 seto); 18 < F < 37 → versicolor (15 vers); 37 ≤ F ≤ 48 → IndefiniteSet2 (20 vers, 10 virg); 48 < F → virginica (25 virg).
IndefSet2 ROUND, D = Mean_e → Mean_i: F < 7 → versicolor (17 vers, 0 virg); 7 ≤ F ≤ 10 → IndefSet3 (3 vers, 5 virg); 10 < F → virginica (0 vers, 5 virg).
IndefSet3 ROUND, D = Mean_e → Mean_i: F < 3 → versicolor (2 vers, 0 virg); 3 ≤ F ≤ 7 → IndefSet4 (2 vers, 1 virg), where we will assign 0 ≤ F ≤ 7 → versicolor; 7 < F → virginica (0 vers, 3 virg).
IndefSet2 ROUND, D = Mean_e → Mean_i: F < 20 → versicolor (15 vers, 0 virg); 20 < F → virginica (0 vers, 1 virg).
Option-1: The sequence of D's is Mean(Class_k) → Mean(Class_{k+1}), k=1,... (and Mean could be replaced by VOM or ?).
Option-2: The sequence of D's is Mean(Class_k) → Mean(∪_{h=k+1..n}Class_h), k=1,... (and Mean could be replaced by VOM or ?).
Option-3: D sequence: Mean(Class_k) → Mean(∪_{h not used yet}Class_h), where k is the class with max count in the subcluster (VoM instead?).
Option-2': D sequence: Mean(Class_k) → Mean(∪_{h=k+1..n}Class_h) (VOM?), where k is the class with max count in the subcluster.
Option-4: D sequence: always pick the means pair which are furthest separated from each other.
Option-5: D: start with Median-to-Mean of the IndefiniteSet, then the means pair corresponding to the max separation of F(mean_i), F(mean_j).
Option-6: D: always use Median-to-Mean of the IndefiniteSet, IS (initially, IS = X).
FAUST MVDI on IRIS: 15 records from each class for testing (virg39 was removed as an outlier).
On the 127-sample SatLog TestSet: 4 errors, or 96.8% accuracy. Speed? With horizontal data, DTI is applied one unclassified sample at a time (per execution thread). With this pTree decision tree, we take the entire TestSet (a PTreeSet), create the various dot-product SPTSs (one for each inode), and create the cut SPTS masks. These masks mask the results for the entire TestSet.
E.g., let D = the vector connecting the class means and d = D/|D|; then P_{Xod > a} = P_{Σ_i d_iX_i > a}.
FAUST-Oblique: create a table, TBL(class_i, class_j, medoid_vector_i, medoid_vector_j). Notes: if we just pick the one class which, when paired with r, gives the max gap, then we can use max_gap or max_std_Int_pt instead of max_gap_midpt. Then we need std_j (or variance_j) in TBL.
FAUST Oblique Classifier formula: P_{XoD > a}, X any set of vectors, D = an oblique vector (note: if D = e_i, this is P_{X_i > a}).
To separate r from v: D = (m_v − m_r) and a = (m_v + m_r)/2 o d, the midpoint of D projected onto d, giving the mask P_{(m_v−m_r)oX > (m_r+m_v)/2od}.
For classes r and b: P_{(m_b−m_r)oX > (m_r+m_b)/2od} masks the vectors that make a shadow on the m_r side of the midpoint.
AND the 2 pTree masks.
[Figure: r, v, and b point clouds with their means m_r, m_v, m_b.]
"outermost = "furthest from means (their projs of D-line); best rankK points, best std points, etc. "medoid-to-mediod" close to optimal provided classes are convex.
Best cutpoint? mean, vector_of_medians, outmost, outmost_non-outlier?
In higher dimensions the same holds (if the classes cluster "convexly", FAUST{div,oblique_gap} finds them).
[Figure: 1-D projections of the g, r, b class points onto the D-line.]
For classes r and v: D = m_r → m_v, and the mask is P_{(m_r−m_v)/|m_r−m_v| o X < a}.
[Figure: r and v point clouds with means m_R, m_V and the cutpoint a on the D-line.]
FAUST Oblique: P_R = P_{Xod < a}, where D ≡ m_R → m_V is the oblique vector and d = D/|D|.
Separate class_R from class_V using the midpoint-of-means (mom) method: calculate a by viewing m_R, m_V as vectors (m_R ≡ the vector from the origin to pt_m_R): a = ( m_R + (m_V − m_R)/2 ) o d = (m_R + m_V)/2 o d. (The very same formula works when D = m_V → m_R, i.e., when it points to the left.)
Training ≡ choosing the "cut-hyper-plane" (CHP), which is always an (n−1)-dimensional hyperplane (which cuts space in two). Classifying is one horizontal program (AND/OR) across the pTrees to get a mask pTree for each entire class (bulk classification). Improve accuracy? E.g., by considering the dispersion within classes when placing the CHP: 1. use the vector of medians, vom, to represent each class rather than m_V, where vom_V ≡ ( median{v₁ | v∈V}, median{v₂ | v∈V}, ... ); 2. project each class onto the d-line (e.g., the R class below), then calculate the std (one horizontal formula per class, using Md's method), then use the std ratio to place the CHP (no longer at the midpoint between m_r [vom_r] and m_v [vom_v]).
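A sketch of this training step (mom by default, with the vom and std-ratio variants above as options; R and V are the two classes' row sets):

import numpy as np

def train_chp(R, V, use_vom=False, std_ratio=False):
    # Cut-hyper-plane for classes R and V (rows = points).
    cR = np.median(R, axis=0) if use_vom else R.mean(axis=0)   # m_R or vom_R
    cV = np.median(V, axis=0) if use_vom else V.mean(axis=0)   # m_V or vom_V
    d = (cV - cR) / np.linalg.norm(cV - cR)                    # d = D/|D|
    if std_ratio:                       # place the cut by the stds' ratio
        sR, sV = (R @ d).std(), (V @ d).std()
        a = cR @ d + ((cV - cR) @ d) * sR / (sR + sV)
    else:                               # mom: midpoint of means, projected
        a = ((cR + cV) / 2) @ d
    return d, a

def mask_R(X, d, a):
    return X @ d < a                    # bulk classification: one mask per class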
[Figure: the R and V classes projected onto the d-line, with vom_R, vom_V, and the stds of the projection distances from the origin along the d-line; axes dim 1, dim 2.]
[Table: the 15-point 2-D dataset z1..z15 (= zf), plotted on the x–y grid.]
L1(x,y) Count Array
z1 1 2 1 1 1 1 2 1 1 1 1 1 1
z2 1 3 1 1 1 2 1 1 1 1 1 1
z3 1 3 1 1 1 1 1 2 1 1 1 1
z4 1 2 1 1 1 1 1 2 1 2 1 1
z5 1 3 2 1 1 1 2 1 1 1 1
z6 1 2 3 2 4 1 2
z7 1 2 1 1 1 1 2 4 1 1
z8 1 2 1 1 1 2 4 1 2
z9 1 2 1 1 3 2 1 3 1
z10 1 2 2 2 1 2 2 2 1
z11 1 1 2 1 1 1 2 1 2 2 1
z12 1 1 1 1 1 1 1 2 1 1 1 2 1
z13 1 1 2 1 1 1 1 3 3 1
z14 1 1 1 1 1 1 1 2 1 1 1 2 1
z15 1 1 1 1 2 1 1 1 2 3 1
L1(x,y) Value Array
z1 0 2 4 5 10 13 14 15 16 17 18 19 20
z2 0 2 3 8 11 12 13 14 15 16 17 18
z3 0 2 3 8 11 12 13 14 15 16 17 18
z4 0 2 3 4 6 9 11 12 13 14 15 16
z5 0 3 5 8 9 10 11 12 13 14 15
z6 0 5 6 7 8 9 10
z7 0 2 5 8 11 12 13 14 15 16
z8 0 2 3 6 9 11 12 13 14
z9 0 2 3 6 11 12 13 14 16
z10 0 3 5 8 9 10 11 13 15
z11 0 2 3 4 7 8 11 12 13 15 17
z12 0 1 2 3 6 8 9 11 13 14 15 17 19
z13 0 2 3 5 8 11 13 14 16 18
z14 0 1 2 3 7 9 10 12 14 15 16 18 20
z15 0 4 5 6 7 8 9 10 11 13 15
After having subclustered with linear gap analysis, it would make sense to run this round-gap algorithm out only 2 steps to determine if there are any singleton, gap>2 subclusters (anomalies) which were not found by the previous linear analysis.
[Table: the same 15-point 2-D dataset with the mean, M, marked.]
This just confirms z6 as an anomaly or outlier, since it was already declared so during the linear gap analysis.
Confirms zf as an anomaly or outlier, since it was already declared so during the linear gap analysis.
[Table: the same 15-point 2-D dataset with the red, blue, and green subcluster masks marked.]
AND each red with each blue with each green, to get the subcluster masks (12 ANDs)
For any FAUST clustering method, we proceed in one of 2 ways: gap analysis of the projections onto a unit vector, d, and/or gap analysis of the distances from a point, f (and another point, g, usually):
f, g, d, SpS(xod) require no processing (gap-finding is the only cost).
MCR(fg) adds the cost of SpS((x-f)o(x-f)) and SpS((x-g)o(x-g)).
[Figure: the circumscribing coordinate rectangle of X, with MinVect nv = (nv1,nv2,nv3), MaxVect Xv = (Xv1,Xv2,Xv3), the remaining corners (nv1,nv2,Xv3), (nv1,Xv2,Xv3), (nv1,Xv2,nv3), (Xv1,nv2,Xv3), (Xv1,Xv2,nv3), (Xv1,nv2,nv3), and the midline endpoints f1,g1, f2,g2, f3,g3 on the x, y, z face pairs.]
FAUST Clustering Methods: MCR (Using Midlines of circumscribing Coordinate Rectangle)
Define a sequence f_k, g_k → d_k: given d, f ≡ MinPt(xod) and g ≡ MaxPt(xod); given f and g, d_k ≡ (f−g)/|f−g|.
f_k ≡ ( (nv₁+Xv₁)/2, ..., nv_k, ..., (nvₙ+Xvₙ)/2 )
MCR(dfg) on Iris150: do SpS(xod) linear gap analysis (since it is processing-free).
d1: none. d2: none.
d3: 0 10 set23 ... 1 19 set45 | 0 30 ver49 ... 0 69 vir19 (yielding SubClus1 and SubClus2).
SubClus1, d4: 1 6 set44 | 0 18 vir39. Leaves exactly the 50 setosa.
SubClus2, d4: none. Leaves 50 ver and 49 vir.
(Look for outliers in SubClus1 and SubClus2.) Sequence through the {f, g} pairs: SpS((x−f)o(x−f)), SpS((x−g)o(x−g)) round-gap analysis.
On what's left:
SubClus1: f1 none; g1 none; f2 none; g2 none; f3 none; g3 none; f4 none; g4 none.
SubClus2: f1 none; g1 none; f2: 1 41 vir23 | 0 47 vir18 | 0 47 vir32; g2 none; f3 none; g3 none; f4 none; g4 none.
[Figure: the 4-cube of corners 0000..1111 (nv = 0000, Xv = 1111), with the midline endpoints f1 = 0½½½, g1 = 1½½½; f2 = ½0½½, g2 = ½1½½; f3 = ½½0½, g3 = ½½1½; f4 = ½½½0, g4 = ½½½1.]
g_k ≡ ( (nv₁+Xv₁)/2, ..., Xv_k, ..., (nvₙ+Xvₙ)/2 ); d_k = e_k, and SpS(xod_k) = X_k.
SubClus1, d4: 1 6 set44 | 0 18 vir39. Leaves exactly the 50 setosa as SubCluster1.
SubClus2, d4: 0 0 t4 | 1 0 t24 | 0 10 ver18 ... 1 25 vir45 | 0 40 b4 | 0 40 b24. Leaves the 49 virginica (vir39 declared an outlier) and the 50 versicolor as SubCluster2.
MCR(d) performs well on this dataset.
Accuracy: we can't expect a clustering method to separate versicolor from virginica because there is no gap between them. This method does separate off setosa perfectly and finds all 30 added outliers (subclusters of size 1 or 2). It finds the virginica outlier, vir39, which is the most prominent intra-class outlier (distance 29.6 from the other virginica irises, whereas no other iris is more than 9.1 from its classmates). Speed: d_k = e_k, so there is zero calculation cost for the d's; SpS(xod_k) = SpS(xoe_k) = SpS(X_k), so there is zero calculation cost for it as well. The only cost is the loading of the dataset PTreeSet(X) (we use one column, SpS(X_k), at a time), and that loading is required for any method. So MCR(d) is optimal with respect to speed!
Declare subclusters of size 1 or 2 to be outliers. Create the full pairwise distance table for any subcluster of size ≤ 10 and declare any point an outlier if its column values (other than the zero diagonal value) all exceed the threshold (which is 4).
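A numpy sketch of that rule (S = the small subcluster's points, threshold 4 as above):

import numpy as np

def small_subcluster_outliers(S, threshold=4.0):
    # Full pairwise distance table for a small subcluster S (rows = points).
    D = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)                     # ignore the zero diagonal
    return np.where(D.min(axis=1) > threshold)[0]   # all-far points are outliers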
1. Choose f₀ (high outlier potential? e.g., furthest from the mean, M?).
2. Do f₀-round-gap analysis (+ subcluster analysis?).
3. Let f₁ be s.t. no x is further away from f₀ (in some direction) (all d₁ dot products ≥ 0).
4. Do f₁-round-gap analysis (+ subcluster analysis?).
5. Do d₁-linear-gap analysis, d₁ ≡ (f₀−f₁)/|f₀−f₁|.
6. Let f₂ be s.t. no x is further away (in some direction) from the d₁-line than f₂.
7. Do f₂-round-gap analysis.
8. Do d₂-linear-gap analysis, d₂ ≡ ( f₀−f₂ − ((f₀−f₂)od₁)d₁ ) / len... (sketch below).
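A numpy sketch of steps 1-8 (the round/linear gap analyses themselves are elided; the truncated denominator in step 8 is taken, as an assumption, to be the usual Gram-Schmidt normalization):

import numpy as np

def fd_sequence(X):
    M = X.mean(axis=0)
    f0 = X[np.argmax(np.linalg.norm(X - M, axis=1))]   # 1. furthest from the mean
    # 2. f0-round-gap analysis on SpS((x-f0)o(x-f0)) would go here
    f1 = X[np.argmax(np.linalg.norm(X - f0, axis=1))]  # 3. furthest from f0
    d1 = (f0 - f1) / np.linalg.norm(f0 - f1)           # 5. linear gaps on SpS(xod1)
    off = (X - f0) - np.outer((X - f0) @ d1, d1)       # component off the d1-line
    f2 = X[np.argmax(np.linalg.norm(off, axis=1))]     # 6. furthest from the d1-line
    v = (f0 - f2) - ((f0 - f2) @ d1) * d1              # 8. Gram-Schmidt step
    d2 = v / np.linalg.norm(v)
    return f0, f1, f2, d1, d2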
FM(fgd) (Furthest-from-the-Medoid); FMO (FM using a Gram-Schmidt Orthonormal basis). X ⊆ Rⁿ. Calculate M = MeanVector(X) directly, using only the residualized 1-counts of the basic pTrees of X. (And BTW, use residualized STD calculations to guide the choice of good gap-width thresholds, which define what an outlier is going to be and also determine when we divide into sub-clusters.)
f4=t4, g4=vir1, Ln>4: none. This ends the process. We found all (and only) added anomalies, but missed t34, t14, t4, t1, t3, b1, b3.
SbCl_2.1, g1=ver39, Rn>4: 1 0 vir39 | 0 7 set21. Note: what remains in SubClus2.1 is exactly the 50 setosa. But we wouldn't know that, so we continue to look for outliers and subclusters. SbCl_2.1, g1=set19, Rn>4: none.
Finally, we would classify within SubCluster1 using the means of another training set (with FAUST Classify). We would also classify SubCluster2.1 and SubCluster2.2, and we would find SubCluster2.1 to be all Setosa and SubCluster2.2 to be all Versicolor (as we did before). In SubCluster1 we would separate Versicolor from Virginica perfectly (as we did before).
If this is typical (though concluding from one example is definitely "over-fitting"), then we have to conclude that Mark's round gap analysis is more productive than linear dot-product projection gap analysis!
FFG (Furthest-to-Furthest): compute SpS((M−x)o(M−x)) to get f₁ (expensive? grab any point?, a corner point?), then compute SpS((x−f₁)o(x−f₁)) for f₁-round-gap analysis.
Then compute SpS(xod₁) to get g₁, whose projection is furthest from that of f₁ (for d₁ linear gap analysis). (Too expensive? since g_k-round-gap analysis and linear analysis contributed very little! But we need it to get f₂, etc. Are there other, cheaper ways to get a good f₂?) We also need SpS((x−g₁)o(x−g₁)) for g₁-round-gap analysis (too expensive!).
We could FAUST Classify each outlier (if so desired) to find out which class they are outliers from. However, what about the rogue outliers I added? What would we expect? They are not represented in the training set, so what would happen to them? My thinking: the others are real iris samples, so we should not really do the outlier analysis and subsequent classification on the original 150.
We already know (assuming the "other training set" has the same means as these 150 do) that we can separate Setosa, Versicolor, and Virginica perfectly using FAUST Classify.
For speed of text mining (and of other high-dimension data mining), we might do additional dimension reduction (after stemming content words). A simple way is to use the STD of the column of numbers generated by the functional (e.g., X_k, SpS((x−M)o(x−M)), SpS((x−f)o(x−f)), SpS(xod), etc.). The STDs of the columns, X_k, can be precomputed up front, once and for all. STDs of projection and square-distance functionals must be computed after they are generated (could be done upon capture too). Good functionals produce many large gaps. On Iris150 and Iris150+Out30, I find that the precomputed STD is a good indicator of that. A text mining scheme might be:
1. Capture the text as a PTreeSET (after stemming the content words) and store the mean, median, and STD of every column (content word stem).
2. Throw out low-STD columns (see the sketch below).
4'. Use a weighted sum of "importance" and STD? (If the STD is low, there can't be many large gaps.)
A possible Attribute Selection algorithm:
1. Peel from X the outliers, using CRM-lin, CRC-lin, possibly M-rnd, fM-rnd, fg-rnd (X_in = X − X_out).
2. Calculate the widths of each X_in-circumscribing-rectangle edge, crew_k.
4. Look for wide gaps top down (or, very simply, order by STD).
4'. Divide crew_k into count{x_k | x∈X_in} (but that doesn't account for duplicates).
4''. Look for a preponderance of wide thin-gaps top down.
4'''. Look for high projection-interval count dispersion (STD).
Notes: 1. Maybe an inlier sub-cluster needs to occur in more than one functional projection to be declared an inlier sub-cluster? 2. The STD of a functional projection appears to be a good indicator of the quality of its gap analysis.
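Step 2 of the scheme, sketched (X is the stemmed term matrix; a column whose STD is below the threshold can't host many large gaps, so it is dropped):

import numpy as np

def prune_columns(X, std_threshold):
    stds = X.std(axis=0)            # per-column STDs (precomputable up front)
    keep = stds >= std_threshold    # low STD -> few/no large gaps possible
    return X[:, keep], keep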
For FAUST Cluster-d (pick d, then f = MinPt(xod) and g = MaxPt(xod)), a full grid of unit vectors (all directions, equally spaced) may be needed. Such a grid could be constructed using angles θ₁, ..., θ_m, each equi-width partitioned on [0,180), with the formulas:
d = e₁ ∏_{k=n..2} cosθ_k + e₂ sinθ₂ ∏_{k=n..3} cosθ_k + e₃ sinθ₃ ∏_{k=n..4} cosθ_k + ... + e_n sinθ_n, where the θ's start at 0 and increment by Δ.
So, d_{i₁..iₙ} = Σ_{j=1..n} [ e_j sin(θ_{i_{j−1}}) ∏_{k=n..j+1} cos(θ_{i_k}) ]; θ_{i₀} ≡ 0, Δ divides 180 (e.g., 90, 45, 22.5, ...).
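A sketch of the grid construction (n−1 nested angles, each stepped by Δ over [0,180); each row of the result is a unit vector built from the explicit product-of-cosines formula above):

import numpy as np
from itertools import product

def direction_grid(n, delta_deg):
    # Angles theta_2..theta_n, each equi-width stepped over [0, 180).
    thetas = np.deg2rad(np.arange(0.0, 180.0, delta_deg))
    dirs = []
    for angs in product(thetas, repeat=n - 1):
        angs = np.array(angs)
        d = np.empty(n)
        d[0] = np.prod(np.cos(angs))                   # e1 * prod_k cos(theta_k)
        for j in range(1, n):
            d[j] = np.sin(angs[j - 1]) * np.prod(np.cos(angs[j:]))
        dirs.append(d)
    return np.array(dirs)                              # rows are unit vectors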
CRMSTD(dfg): eliminate all columns with STD < threshold. d3: 0 10 set23 ... (50 set + vir39) | 1 19 set25 | 0 30 ver49 ... (50 ver, 49 vir) | 0 69 vir19.