Koffi Pierre Yao, MIT, 12/30/2015

An algorithm for model-free, parameter-less clustering of
stochastically distributed data

This report documents the performance of a density-based clustering algorithm designed to
bypass the need for the data scientist to tune clustering parameters. Most clustering algorithms
available in Python's sklearn require the tuning of parameters such as the number of clusters,
the neighborhood size, or the branching factor, which demand domain knowledge of the data
or excellent data science skills. Although data science skills are indispensable for good data
analysis, a clustering algorithm requiring minimal prior knowledge of the data structure is
undeniably useful.

Figure 1: A comparison of the clustering algorithms in scikit-learn. Information available at
http://scikit-learn.org/stable/modules/clustering.html

Herein, I report the design of a model-free, parameter-free clustering algorithm.

Principle: A point density score is calculated for each point i as:

    density[i] = sum over j = 1..n of exp(-d(x[i], x[j]))

where d(x[i], x[j]) is the distance between points x[i] and x[j]. Farther points have
exponentially diminished influence on the density score of a given point. This approach is
expected to concentrate density at the location of the cluster center, where points are expected
to be densest if the process generating the data points is reasonably stochastic around the true
cluster center. Note that the distance measure does not need to be Euclidean.
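The density score density[i] = sum_j exp(-d(x[i], x[j])) can be sketched in a few lines. The following is an illustrative stand-alone implementation using a naive O(n^2) loop and a Euclidean distance (an assumption here; the actual module accepts any p-norm or a custom distance function):

```python
import numpy as np

def density_scores(data):
    """Density score of each point: sum of exp(-distance) to all other points."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    scores = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i != j:
                # Farther points contribute exponentially less.
                scores[i] += np.exp(-np.linalg.norm(data[i] - data[j]))
    return scores

# Two neighboring points and one outlier: the neighbors score much higher.
pts = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
scores = density_scores(pts)
```

With stochastically scattered clusters, the maxima of this score surface sit near the true cluster centers, which is what the gradient-ascent search below exploits.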
dcluster.data: All data input during the clustering training.
dcluster.sample: Indices (within dcluster.data) of the sample data used by the algorithm to find
the cluster centers.
dcluster.center_points: Coordinates of the cluster centers after training.
dcluster.attributions: Cluster assignments of all points in dcluster.data, arranged as <point
coordinate, cluster center index in dcluster.center_points>. It is None if attributions is set to 'no'
at initialization.
dcluster.dist_type: The distance function type used. It is either the value of p in a p-norm
distance measure or 'custom' if a custom distance function is to be used.
dcluster.kwargs: A dictionary record of the argument, value pairs used at initialization.
dcluster.assign(row_arranged_data): Outputs the cluster assignment of a new point. Output is
arranged as a numpy array [data point coordinates, cluster number].
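As an illustration of the output format described for dcluster.assign, a minimal nearest-center assignment could look like the sketch below. This is a hypothetical stand-alone re-implementation with a Euclidean distance and made-up center coordinates, not the module's method:

```python
import numpy as np

def assign(row_arranged_data, center_points):
    """Attach each new point to its nearest center (Euclidean distance)."""
    points = np.asarray(row_arranged_data, dtype=float)
    centers = np.asarray(center_points, dtype=float)
    labels = [np.argmin(np.linalg.norm(centers - p, axis=1)) for p in points]
    # Output arranged as [data point coordinates, cluster number]
    return np.column_stack([points, labels])

# Hypothetical trained centers, for illustration only.
out = assign([[5, -10], [6, 7]], center_points=[[5, -9], [6, 6]])
```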
Conclusion:
I report a density-based clustering algorithm for parameter-less clustering of stochastically
distributed clustered data. This algorithm shows much-improved clustering performance
without parameter tuning compared to the least-parameterized DBSCAN clustering module
available in sklearn.
The code and example are presented below.
C:\Users\Pierre\OneDrive\PYTHON DENSITY CLUSTERING\dcsm2.py    Friday, January 1, 2016 10:52 PM

# UNSUPERVISED CLUSTER CENTERS IDENTIFICATION USING NON-ITERATIVE
# DENSITY ANALYSIS AND NON-ANALYTICAL GRADIENT DESCENT
# WRITTEN BY KOFFI PIERRE CLAVER YAO
# DATE: 12/28/2015
# CURRENT ADDRESS: MASSACHUSETTS INSTITUTE OF TECHNOLOGY

import numpy as np
from math import exp
import copy

class DCSM:
    ''' This module attempts to cluster data based on the density of
    points. The 'data' is assumed to be arranged with points as rows of
    the data table, as a numpy array. Works best with Gaussian-distributed
    data. This algorithm computes the density matrix and can therefore be
    O(n^2) memory intensive; <<<<Random Sampling>>>> is used to optimize
    the algorithm performance.

    **kwargs:
    redundant_center_discard = True/False: if the algorithm converges on
        several centers, should close centers be removed?
    time_show = True/False: show the time needed to do the fitting or not.
    attributions = 'yes'/'no': center assignments. Any string not 'no' is
        considered 'yes'.
    centers_proximity_ratio = <float number>: how close the centers may
        be, as a ratio of the maximum separation.
    data = the data that will be fitted.
    sampling = 'y'/'n': should the algorithm use sampling to search for
        the centers? Default is 'yes' when the number of data points
        exceeds 500.
    dist_func = custom distance function handle. Only called if
        dist_type = 'custom'.

    EXAMPLE USAGE:
    import dcsm2 as DCSM
    dcluster = DCSM.DCSM(data = data, dist_type = 4,
                         redundant_center_discard = True,
                         attributions = 'yes', sampling = 'n')
    center_points, attributions = dcluster.find_medoids()
    dcluster.assign([[5, -10], [6, 7]])

    # attributions: assignments to centers formatted in columns:
    #     [point_index_in_data, index_of_center_in_center_points]
    # center_indices: indices of the cluster centers in the data set
    # center_points: actual coordinates of the centers of the data set

    data = ...x1...
           ...x2...
           ...x3...
            .
            .
            .
           ...xn...
    '''
    # DENSITY CLUSTERING ################################################
    def find_medoids(self):
        self.num_pts, self.num_feature = self.data.shape
        # Number of iterations that the algorithm will use to find the
        # centers. I find that more than one iteration will degrade the
        # result instead of improving it.
        it = self.kwargs.get('num_iter', 1)
        answer = 'n'
        self.center_indices = []
        center_score = []
        size = self.num_pts
        if (self.num_pts > 500):
            # Stochastic sampling search of centers using random
            # subgroups. Defining the number of points in the sample data
            # to be used for finding cluster centers. The sample should
            # be sufficiently large; the formula below is heuristic (500
            # points for 2D data containing more than 500 data points).
            size = (self.num_feature-1) * int(min(self.num_pts/5, 500))
        for i in xrange(it):
            self.__sample__(self.num_pts, size)
            dens, pair_distance = self.sample_data_density()
            temp_score = self.stoch_search_local_maxima(dens, pair_distance, 20)
            temp = np.where(temp_score > 0)[0]
            # Remapping to the original indices in data
            for i in temp:
                org_loc = self.sample[i]
                if org_loc not in self.center_indices:
    # This function calculates the distance of X1 and X2 per the
    # specified order of the norm. It doesn't take the 1/p power.
    # The only acceptable kwarg is dist_func = <function handle>,
    # used when the distance function is a custom function of two
    # data points.
        if (self.dist_type == 'custom'):
    # Calculates the density matrix of the sample data based on the
    # specified distance type; returns the distance between pairs of
    # data points and the density at each point.
        cdata = self.data[self.sample, :]
        num_pts, ignore = cdata.shape
        data_density = [0]*num_pts
        pair_distance = np.zeros((num_pts, num_pts))
        for i in xrange(num_pts-1):
            for j in xrange(i+1, num_pts):
                d = self.distance(cdata[i], cdata[j])
                pair_distance[i, j] = d
                pair_distance[j, i] = d
                d = exp(-d)
                data_density[i] = data_density[i] + d
                data_density[j] = data_density[j] + d
        del ignore
        data_density = np.array(data_density)
        return data_density, pair_distance
    #####################################################################
    def n_nearest_neighbor(self, point_index, n, dist_from_point):
        # n = how many nearest neighbors to seek
        sort_indices = sorted(range(len(dist_from_point)),
                              key=lambda k: dist_from_point[k])
        # The first index will be the point itself, because a point is
        # closest to itself.
        return sort_indices[1:n+1]
    # This function returns the local maximum of the density surface by
    # implementing a point-to-point numerical gradient ascent.
        count_no_jumps = 0
        jumpedto = self.jump_up_to(start, dens, pair_distance, n)
        while count_no_jumps < 2:
            # Use the n nearest neighbors to decide where to ascend.
            dist_from_point = pair_distance[start, :]
            neighbors_indices = self.n_nearest_neighbor(start, n, dist_from_point)
            grad = float('-inf')
            neighb_jumpto = start
            for neighb in neighbors_indices:
                temp = (dens[neighb] - dens[start])
                # If a neighbor is found that has a locally higher density
                # than the current point, jump to that neighbor. We select
                # the neighbor with the highest density compared to the
                # current point.
                if (temp < 0):
                    continue
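The hill-climbing idea used by the gradient ascent can be sketched stand-alone as follows. This is an illustrative simplification (it assumes a precomputed density vector and pairwise-distance matrix, and omits the module's no-jump counter), not the class method itself:

```python
import numpy as np

def hill_climb(start, dens, pair_distance, n):
    """Jump to the densest of the n nearest neighbors until none is denser."""
    current = start
    while True:
        order = np.argsort(pair_distance[current])
        neighbors = order[1:n + 1]          # skip index 0: the point itself
        best = max(neighbors, key=lambda j: dens[j])
        if dens[best] <= dens[current]:
            return current                  # local maximum of the density surface
        current = best                      # ascend toward the denser neighbor

# Toy 1-D example: five collinear points with the density peaking at index 2.
pts = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
pair_distance = np.abs(pts[:, None] - pts[None, :])
dens = np.array([0.1, 0.5, 1.0, 0.5, 0.1])
peak = hill_climb(0, dens, pair_distance, n=2)
```

Because the density strictly increases at every jump, the walk cannot cycle and always terminates on a local maximum of the density surface.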
    # This function returns a numpy array of how many times the search
    # terminated on a particular point, as indicated by its index.
        num_pts = len(dens)
        winners = [0]*num_pts
        for i in xrange(num_pts//10):
            # Performing a stochastic search using 10% as much data as in
            # the original; this choice could be computationally intensive.
            start = np.random.choice(np.array(range(num_pts)))
            loc_max = self.numerical_grad_descent(start, dens, pair_distance, n)
            winners[loc_max] += 1
    # This function assigns a cluster to each point in data based on
    # proximity to the cluster centers stored in center_indices.
        num_points, ignore = self.data.shape
        points_indices = range(num_points)
        centers = [0]*num_points
        for i in xrange(num_points):
            dist = float('inf')
            k = 0
            for c in self.center_indices:
    # This function discards a center if it is too close to another
    # (as a ratio of the maximum distance between centers) and has
    # a lower convergence score than the other.
        ind_to_rm = list()
        l = len(self.center_indices)
        couples = [[i, j] for i in xrange(l) for j in xrange(i+1, l)]
        inter_center_dist = [self.distance(self.data[self.center_indices[i]],
                                           self.data[self.center_indices[j]])
                             for i in xrange(l) for j in xrange(i+1, l)]
        if len(inter_center_dist) < 2:
            print 'ONLY ONE CLUSTER CENTER FOUND, CAN\'T DISCARD ANY \n'
        else:
            m = max(inter_center_dist)
            temp = [couples[t] for t in range(len(inter_center_dist))
                    if inter_center_dist[t] < m*ratio]
            for el in temp:
                if center_score[el[0]] <= center_score[el[1]]:
                    if (el[1] not in ind_to_rm): ind_to_rm.append(el[1])
                    else: ind_to_rm.append(el[0])
                else:
                    if (el[0] not in ind_to_rm): ind_to_rm.append(el[0])
                    else: ind_to_rm.append(el[1])