Transcript
1. Cluster Analysis Using RapidMiner and SAS 9.3
2. Agenda
o The Data
o Some preliminary treatments
o Checking for outliers
o Manual outlier checking for a given confidence level
o Filtering outliers
o Data without outliers
o Selecting attributes for clusters
o Setting up clusters
o Reading the clusters
o Using SAS for clustering
o Dendrogram
o Depicting Tree using SAS
o Conclusion
3. The Data
o Number of observations: 97
o 3 numeric variables: birth rate per thousand, death rate per thousand, infant mortality rate per thousand
o 1 polynominal variable: Country
o Data obtained from the UN Demographic Yearbook 1990
4. Some preliminary treatments
o Checking for outliers using RapidMiner
5. Some preliminary treatments
o Manual checking for outliers at a given confidence level
o For Birth (95%, taken as mu +/- 2 sigma):
o mu - 2(sigma) = 27.384 - 2(12.978) = 1.428
o mu + 2(sigma) = 27.384 + 2(12.978) = 53.340
o Hence, no outliers in Birth
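
The arithmetic on this slide can be reproduced with a short Python sketch. The mean and standard deviation are the Birth figures quoted above; the helper names `two_sigma_bounds` and `flag_outliers` are hypothetical and not part of RapidMiner or SAS.

```python
# Summary statistics for Birth, as quoted on the slide.
mu = 27.384
sigma = 12.978


def two_sigma_bounds(mean, sd, k=2.0):
    """Return (lower, upper) = mean -/+ k * sd."""
    return mean - k * sd, mean + k * sd


def flag_outliers(values, mean, sd, k=2.0):
    """Return the values that fall outside mean +/- k * sd."""
    lower, upper = two_sigma_bounds(mean, sd, k)
    return [v for v in values if v < lower or v > upper]


lower, upper = two_sigma_bounds(mu, sigma)
print(f"lower = {lower:.3f}, upper = {upper:.3f}")
# prints: lower = 1.428, upper = 53.340
```

Passing the actual Birth column to `flag_outliers` would return an empty list if, as the slide concludes, no value lies outside this interval.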
6. Filtering outliers
o 10 outliers recorded
7. Data without outliers
o Filter examples
o Parameter string: outlier=true
o Invert filter
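
The logic of this step (match rows where outlier=true, then invert the match so only non-outliers remain) might be sketched in Python as follows. The rows are hypothetical placeholders, not the deck's data, and `filter_examples` is an illustrative helper rather than RapidMiner's API.

```python
# Hypothetical rows standing in for the example set after outlier detection.
rows = [
    {"id": "row_a", "outlier": False},
    {"id": "row_b", "outlier": True},
    {"id": "row_c", "outlier": False},
]


def is_outlier(example):
    """Condition equivalent to the parameter string outlier=true."""
    return example["outlier"] is True


def filter_examples(examples, condition, invert=False):
    """Keep examples matching the condition; with invert=True keep the rest."""
    return [ex for ex in examples if condition(ex) != invert]


# Inverting the filter keeps only the rows that are NOT flagged as outliers.
clean = filter_examples(rows, is_outlier, invert=True)
print([ex["id"] for ex in clean])
```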
8. Selecting attributes for clusters
o Clusters on polynominal variables make no sense
o Remove Country from attribute list
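
A minimal Python sketch of the same idea: drop the nominal Country column so that only the numeric attributes are passed to the clustering step. The attribute names follow the SAS variable names used later in the deck; the row values and the helper `select_attributes` are hypothetical.

```python
# Hypothetical rows; the numbers are placeholders, not the deck's data.
rows = [
    {"Country": "Country A", "Birth": 20.0, "Death": 9.0, "InfantDeath": 30.0},
    {"Country": "Country B", "Birth": 40.0, "Death": 15.0, "InfantDeath": 90.0},
]


def select_attributes(examples, exclude=("Country",)):
    """Return copies of the examples without the excluded attributes."""
    return [
        {name: value for name, value in ex.items() if name not in exclude}
        for ex in examples
    ]


numeric_rows = select_attributes(rows)
print(numeric_rows[0])
```

Country stays available in the original rows, so it can still be used afterwards to label the members of each cluster.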
9. Setting up clusters
o K=3
o Join both nodes to get cluster model information
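
The slide does not name the clustering operator. Assuming it is k-means with K=3, the procedure can be sketched in plain Python as below. The data points are hypothetical (Birth, Death, InfantDeath) triples, and the deterministic farthest-point initialization is a choice made here for reproducibility, not something taken from the deck.

```python
import math


def farthest_point_init(points, k):
    """Pick the first point, then repeatedly the point farthest from all picks."""
    centroids = [points[0]]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(nxt)
    return centroids


def kmeans(points, k, max_iter=100):
    """Plain k-means: assign each point to its nearest centroid, then update."""
    centroids = farthest_point_init(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids


# Hypothetical (Birth, Death, InfantDeath) rows with Country already removed.
points = [
    (10.0, 9.0, 8.0), (12.0, 10.0, 9.0), (11.0, 8.0, 7.0),
    (28.0, 9.0, 40.0), (30.0, 8.0, 45.0), (29.0, 10.0, 42.0),
    (45.0, 18.0, 120.0), (48.0, 20.0, 130.0), (46.0, 19.0, 125.0),
]
labels, centroids = kmeans(points, k=3)
print(labels)
print(centroids)
```

The cluster numbers produced by such a sketch are arbitrary labels, so the low, moderate, and high groups have to be identified by inspecting the centroids.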
10. Reading the Clusters
o Cluster 1: Low values of each numeric variable
o Cluster 2: High values of each numeric variable
o Cluster 0: Moderate values of each numeric variable
11. Reading the Clusters
o Scatter plot: Birth and Death against Infant Death Rate
o Point size: Infant Death Rate
12. Using SAS for clustering
o Using canonical variables for standardization of variables to mean 0 and standard deviation 1
o Spherical within-cluster covariance matrix

    proc aceclus data=Poverty out=Ace p=.03 noprint;
       var Birth Death InfantDeath;
    run;

    proc cluster data=Ace outtree=Tree method=ward ccc pseudo print=15;
       var can1 can2 can3;
       id Country;
    run;
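
To illustrate what method=ward does, here is a naive Python sketch of agglomerative clustering with Ward's criterion: at each step it merges the pair of clusters whose union gives the smallest increase in within-cluster sum of squares. It is not PROC CLUSTER's implementation, it omits the ACECLUS transformation and the ccc and pseudo statistics, and the two-dimensional points are hypothetical stand-ins for canonical scores.

```python
def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_cluster(points, n_clusters):
    """Merge singleton clusters greedily until n_clusters remain."""
    clusters = [[p] for p in points]
    merge_costs = []
    while len(clusters) > n_clusters:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        # Choose the pair with the smallest Ward cost.
        i, j = min(pairs, key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        merge_costs.append(ward_cost(clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merge_costs


# Hypothetical 2-D points forming three tight pairs.
points = [
    (0.0, 0.0), (0.0, 1.0),
    (5.0, 5.0), (5.0, 6.0),
    (10.0, 0.0), (10.0, 1.0),
]
clusters, merge_costs = ward_cluster(points, n_clusters=3)
print(clusters)
print(merge_costs)
```

The recorded merge costs play the role of the heights in a dendrogram: a large jump between consecutive merges suggests a natural place to cut the tree.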
13. Using SAS for clustering
o The first 2 canonical variables account for about 93% of the total variation
14. Dendrogram
15. Tree depiction
o Plot can1 and can2 against cluster
o Shows a plot similar to the RapidMiner output
16. Conclusion
o Cluster 1: Mostly developed European nations, USA, UK, Singapore, USSR, etc. Efficient allocation of public goods; lower crime rates; abortion legalized.
o Cluster 2: Afghanistan, Pakistan, Iran, and mostly underprivileged African nations. Low GDP; abortion not legal; high crime rates, prevalent wars and terrorist activities; poor health standards, high poverty levels.
o Cluster 0: India, Mexico, South Africa, Saudi Arabia, etc. Emerging nations; increasing growth rates; controlled negative externalities; focus on literacy and employment.