GPU CUDA Parallel Hierarchical Clustering Cluster Update Algorithm
Post on 03-Mar-2015
GPU Parallel Hierarchical Clustering using CUDA
by Marco Janc
Clustering GPU Kernel
Each thread processes one cluster
◦ calculates its new value with its neighbor, which is defined by the relation to the maximum cluster
Input
◦ DDM_(i-1)
◦ Length of DDM_(i-1)
◦ Row and column indices array
◦ Maximum cluster index
Output
◦ DDM_i: its length is one triangular number smaller than the length of DDM_(i-1)
◦ Cluster linear index references: each new cluster consists of two input cluster values, whose minimum cluster index is saved as a reference.
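The length relation above can be checked with a short host-side sketch. It assumes the DDM stores only the strict lower triangle, so n clusters need n(n-1)/2 entries; `ddmLength` is a hypothetical helper name and does not appear in the slides.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: the DDM holds one entry per unordered cluster pair
// (strict lower triangle), i.e. n * (n - 1) / 2 entries for n clusters.
inline std::uint64_t ddmLength(std::uint64_t clusterCount) {
    assert(clusterCount >= 1);
    // n * (n - 1) is always even, so the integer division is exact.
    return clusterCount * (clusterCount - 1) / 2;
}
```

Under this assumption, 5 clusters give a DDM of length 10, and one merge step later 4 clusters give length 6, which are the two lengths shown on the example slides.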
Clustering GPU Kernel Example
Legend
◦ maximum cluster
◦ cluster row or column is equal to the maximum cluster row or column
◦ cluster row and column are not equal to the maximum cluster row or column; indicates a direct copy of the cluster
◦ linIdx: calculates the linear index of the given row and column value with the formula: linIdx = (row - 1) * row * 0.5f + column
◦ thread calculates the final value of those two clusters that are in a row- or column-wise relation with the maximum cluster
◦ theoretically unnecessary, duplicate identical calculation due to parallelism
◦ "-" in the output reference index array: indicates the maximum cluster reference, which is not present in the new DDM and is set to 0; the host knows the maximum cluster index and catches it while building the cluster tree
◦ CPU host tree node: parent, left child / row, right child / column
Clustering GPU Kernel Example (1)
Input Data
length: 10
maximum cluster index: 3
cl\cl  0      1      2      3
1      0.938
2      0.000  0.000
3      1.057  0.539  0.000
4      0.662  0.347  0.274  0.000
linearized: 0.938 0 0 1.057 0.539 0 0.662 0.347 0.274 0.000
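The linIdx formula from the legend can be written with integer arithmetic only, which avoids the 0.5f multiplication. The following is a minimal sketch, assuming the matrix is stored as its strict lower triangle, row by row; `linearize` is a hypothetical helper added only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Integer form of linIdx = (row - 1) * row * 0.5f + column.
// (row - 1) * row is always even, so no rounding is involved.
inline std::uint64_t linIdx(std::uint64_t row, std::uint64_t column) {
    assert(row > column);  // only the strict lower triangle is stored
    return (row - 1) * row / 2 + column;
}

// Hypothetical helper: flattens the strict lower triangle of a square
// matrix into the linear layout addressed by linIdx.
inline std::vector<float> linearize(const std::vector<std::vector<float>>& m) {
    const std::size_t n = m.size();
    std::vector<float> out(n == 0 ? 0 : n * (n - 1) / 2);
    for (std::size_t row = 1; row < n; ++row)
        for (std::size_t column = 0; column < row; ++column)
            out[linIdx(row, column)] = m[row][column];
    return out;
}
```

With this layout, the entry at row 3, column 0 has linear index 3 and the entry at row 4, column 3 has linear index 9, the last slot of a length-10 array.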
Clustering GPU Kernel Example (2)
Linearized document-document matrix
Row and column indices array
◦ Calculating the row index in the kernel is vulnerable to low floating-point precision, since it includes a square root.
◦ The column index can easily be calculated from a given row index.
Thread Index 1 0 2 3 6 5 8 9 4 7
update row indices (min)
Row Index 2 1 2 3 3 4 4 4 3 4
New row Index 2 1 2 2 3 4 4 4 1 4
update column indices (min)
Column Index 0 0 1 2 0 0 2 3 1 1
New column Index 0 0 1 0 0 0 2 0 0 1
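The remark about the square root can be illustrated on the host side. The sketch below is an assumption about how a row index array could be precomputed, not the author's code: `triMatRow` inverts the linear index with a square root and then corrects the estimate with exact integer comparisons, and `triMatCol` derives the column from a known row. Both names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Row r of linear index idx satisfies r*(r-1)/2 <= idx < r*(r+1)/2.
// The square root only provides a first guess; the two loops repair
// any off-by-one caused by floating-point rounding.
// Assumption: idx is small enough that row * (row + 1) fits in 64 bits.
inline std::uint64_t triMatRow(std::uint64_t idx) {
    const double guess = (1.0 + std::sqrt(1.0 + 8.0 * static_cast<double>(idx))) / 2.0;
    std::uint64_t row = static_cast<std::uint64_t>(guess);
    if (row == 0) row = 1;
    while (row * (row - 1) / 2 > idx) --row;
    while (row * (row + 1) / 2 <= idx) ++row;
    return row;
}

// Given the row, the column needs only integer arithmetic.
inline std::uint64_t triMatCol(std::uint64_t idx, std::uint64_t row) {
    assert(row >= 1 && row * (row - 1) / 2 <= idx);
    return idx - row * (row - 1) / 2;
}
```

Storing the rows once in an array, as the slides do, means the kernel only has to evaluate the cheap column formula.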
Clustering GPU Kernel Example (3)
Get linear index (1)
Note: two clusters are merged at their minimum index
Thread Index 1 0 2 3 6 5 8 9 4 7
New row Index 2 1 2 2 3 4 4 4 1 4
New column Index 0 0 1 0 0 0 2 0 0 1
Output linear index (linIdx) 1 0 2 1 - 3 5 3 0 4
Clustering GPU Kernel Example (4)
Get linear index (2)
Thread Index 1 0 2 3 6 5 8 9 4 7
Upd new row Index 2 1 2 2 3 3 3 3 1 3
Upd new col. Index 0 0 1 0 0 0 2 0 0 1
Output linear index 1 0 2 1 - 3 5 3 0 4
Note:
I. If the new row is greater than or equal to the maximum row, it is decreased by 1.
II. If I. applies, the cluster is not copied (orange), and the new column is greater than or equal to the maximum column, it is decreased by 1.
Clustering GPU Kernel Example (5)
calculate cluster values
Input DDM_0 0 0.938 0 0 1.057 0.662 0 0.347 0.539 0.274
Output DDM_1 (avg) 0 0.738 0 0 0.505 0.274
Clustering GPU Kernel Example (6)
calculate reference indices
Thread Index 1 0 2 3 6 5 8 9 4 7
Output linear index 1 0 2 1 - 3 5 3 0 4
output ref indices (min) 1 0 2 8 6 7
Note:
• since clusters are merged at their minimum, we only need the minimum of the cluster index and its neighbor index
Clustering GPU Kernel Example (7)
calculate maximum cluster reference position to decrease CPU cluster tree building costs
thread index 1 0 2 3 6 5 8 9 4 7
row index 2 1 2 3 3 4 4 4 3 4
column index 0 0 1 2 0 0 2 3 1 1
linear output index 1 0 2 1 - 3 5 3 0 4
copy flag 0 0 0 2 2 2
Note:
• clusters are merged at their minimum cluster
• check which child of the cluster equals which child of the maximum cluster
• 1 indicates left (row) child
• 2 indicates right (column) child
• 0 indicates none (direct copy)
Clustering GPU Kernel Example (8)
Output
document-document matrix
Input, length: 10
cl\cl  0      1      2      3
1      0.938
2      0.000  0.000
3      1.057  0.539  0.000
4      0.662  0.347  0.274  0.000
linearized: 0.938 0 0 1.057 0.539 0 0.662 0.347 0.274 0.000
Output, length: 6
cl\cl  0      1      2
1      0.738
2      0.000  0.000
3      0.505  0.274  0.000
linearized: 0.738 0.000 0.000 0.505 0.274 0.000
Update binary CPU Cluster-Tree
update cluster-nodes with GPU-calculated values, references and max position values
value 0.938 0 0 1.057 0.539 0 0.662 0.347 0.274 0.000
row column 1 0 | 2 0 | 2 1 | 3 0 | 3 1 | 3 2 | 4 0 | 4 1 | 4 2 | 4 3
[Figure: CPU cluster tree nodes before and after the update; the maximum cluster node (value 1.05, children 3 and 0) appears three times]
max pos 0 0 0 2 2 2
orig ref 1 0 2 8 6 7
index 0 1 2 3 4 5 8 7 6 9
Note:
• a new CPU thread iterates asynchronously over the new values, takes the cluster defined by the index in “orig ref”, and adds the maximum cluster at the position defined by “max pos”
Algorithm (CUDA) - parameters
Calculate new values, references and max pos (1)
/*
* @param inValues_g const float* input cluster value array
* @param inIdxRow_g const unsigned int* input row indices array
* @param inCount_s unsigned long long int number of elements [triangular number]
* @param inMaxIdx_s unsigned long long int index of the maximum cluster
* @param outValues_g float* output value array
* @param outLinIdxRef_g unsigned long long int* output original linear cluster references array
* @param outMaxPos_g unsigned int* output new cluster maximum position
* 0 = no relation
* 1 = left (row) child is maximum cluster left or right child
* 2 = right (column) child is maximum cluster left or right child
*
* _g = global memory; _s = shared memory
*/
__global__ void calcClusterNewValuesRefMaxPos(const float* inValues_g,
const unsigned int* inIdxRow_g,
const unsigned long long int inCount_s,
const unsigned long long int inMaxIdx_s,
float* outValues_g,
unsigned long long int* outLinIdxRef_g,
unsigned int* outMaxPos_g)
{
//... see next slides
}
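The slides do not show the launch configuration. As a rough host-side sketch, one way to pick a grid with at least one thread per DDM element, together with the block and thread flattening that the kernel body on the following slide applies, might look like this. `Dim3`, `chooseGrid`, `flatThreadId` and the `maxDim` limit are all assumptions made for illustration; `Dim3` merely stands in for CUDA's `dim3` so that the code compiles as plain C++.

```cpp
#include <cassert>
#include <cstdint>

struct Dim3 { std::uint64_t x, y, z; };  // stand-in for CUDA's dim3

// Picks a grid (x filled first, then y, then z) that provides at least
// `count` threads. Assumption: maxDim is the per-dimension block limit
// of the target device; z is not clamped in this sketch.
inline Dim3 chooseGrid(std::uint64_t count, std::uint64_t blockSize, std::uint64_t maxDim) {
    assert(blockSize > 0 && maxDim > 0);
    std::uint64_t blocks = (count + blockSize - 1) / blockSize;
    if (blocks == 0) blocks = 1;
    Dim3 grid{1, 1, 1};
    grid.x = blocks < maxDim ? blocks : maxDim;
    const std::uint64_t rest = (blocks + grid.x - 1) / grid.x;
    grid.y = rest < maxDim ? rest : maxDim;
    grid.z = (rest + grid.y - 1) / grid.y;
    return grid;
}

// Same flattening as the kernel: blockId from a 3D grid, then a
// one-dimensional thread offset inside the block.
inline std::uint64_t flatThreadId(Dim3 blockIdx, Dim3 gridDim,
                                  std::uint64_t blockDimX, std::uint64_t threadIdxX) {
    const std::uint64_t blockId = blockIdx.y * gridDim.x + blockIdx.x
                                + gridDim.x * gridDim.y * blockIdx.z;
    return blockId * blockDimX + threadIdxX;
}
```

Because such a grid can contain more threads than elements, the kernel's early return for `tId >= inCount_s` is what keeps surplus threads from touching memory.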
Algorithm (CUDA) – initialize cluster objects
Calculate new values, references and max pos (2)
const unsigned long long int blockId = blockIdx.y * gridDim.x + blockIdx.x
+ gridDim.x * gridDim.y * blockIdx.z;
const unsigned long long int tId = blockId * blockDim.x + threadIdx.x;
//maximum cluster is ignored
if(tId >= inCount_s || tId == inMaxIdx_s)
return;
//get maximum cluster / read row indices to calculate column indices
Idx2D clusterMax = Idx2D(inIdxRow_g[inMaxIdx_s], 0);
clusterMax.column = getTriMatCol(inMaxIdx_s, clusterMax.row);
//get cluster of this thread
ElFloat2D cluster = ElFloat2D(inIdxRow_g[tId], 0, inValues_g[tId]);
cluster.column = getTriMatCol(tId, cluster.row);
//relative cluster, init with cluster
ElFloat2D clusterRel = ElFloat2D(cluster.row, cluster.column, cluster.value);
//0 = direct copy of cluster, no relative
//1 = cluster max will be merged left, 2 = right
unsigned int copy = 0;
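The kernel relies on `Idx2D`, `ElFloat2D` and `getTriMatCol`, whose definitions are not part of the slides. The sketch below is only a guess at what minimal versions could look like, written as plain C++; in device code the functions would additionally need `__device__` qualifiers. The signed `int` fields and the `loadCluster` helper are assumptions.

```cpp
#include <cassert>

// Hypothetical 2D index into the strict lower triangle (row > column).
struct Idx2D {
    int row;
    int column;
    Idx2D(int r, int c) : row(r), column(c) {}
    void set(int r, int c) { row = r; column = c; }
};

// Hypothetical matrix element: a 2D index plus its value.
struct ElFloat2D : Idx2D {
    float value;
    ElFloat2D(int r, int c, float v) : Idx2D(r, c), value(v) {}
};

// Column of a linear index once the row is known: subtract the linear
// index at which that row starts.
inline int getTriMatCol(unsigned long long linIdx, int row) {
    assert(row >= 1);
    const unsigned long long rowStart =
        static_cast<unsigned long long>(row) * static_cast<unsigned long long>(row - 1) / 2;
    assert(linIdx >= rowStart);
    return static_cast<int>(linIdx - rowStart);
}

// Mirrors the two-step initialization on the slide: read the row from
// the row index array, then derive the column.
inline ElFloat2D loadCluster(const float* values, const unsigned int* rows,
                             unsigned long long idx) {
    ElFloat2D cluster(static_cast<int>(rows[idx]), 0, values[idx]);
    cluster.column = getTriMatCol(idx, cluster.row);
    return cluster;
}
```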
Algorithm (CUDA) – find neighbor / relative cluster and save max pos
Calculate new values, references and max pos (3)
//find relative cluster
if(cluster.row == clusterMax.column)
{
copy = 1;
clusterRel.set(clusterMax.row, cluster.column);
}
else if(cluster.row == clusterMax.row)
{
copy = 1;
if(clusterMax.column > cluster.column)
clusterRel.set(clusterMax.column, cluster.column);
else
clusterRel.set(cluster.column, clusterMax.column);
}
else if(cluster.column == clusterMax.column)
{
copy = 2;
if(cluster.row > clusterMax.row)
clusterRel.set(cluster.row, clusterMax.row);
else
clusterRel.set(clusterMax.row, cluster.row);
}
else if(cluster.column == clusterMax.row)
{
copy = 2;
clusterRel.set(cluster.row, clusterMax.column);
}
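To make the four branches easier to test in isolation, the same case distinction can be mirrored in a small host-side function. This is a sketch that follows the branch order of the slide; the `Relative` struct and the function name `findRelative` are hypothetical, and the maximum cluster itself is excluded, as the kernel returns early for it.

```cpp
#include <cassert>

struct Relative {
    int row;
    int column;
    unsigned copy;  // 0 = direct copy, 1 = row matched, 2 = column matched
};

// For a cluster (row, column) and the maximum cluster (maxRow, maxColumn),
// returns the element that pairs the cluster's other index with the
// maximum cluster's other index.
inline Relative findRelative(int row, int column, int maxRow, int maxColumn) {
    assert(row > column && maxRow > maxColumn);
    assert(!(row == maxRow && column == maxColumn));  // max cluster is skipped
    if (row == maxColumn)
        return {maxRow, column, 1u};
    if (row == maxRow) {
        if (maxColumn > column) return {maxColumn, column, 1u};
        return {column, maxColumn, 1u};
    }
    if (column == maxColumn) {
        if (row > maxRow) return {row, maxRow, 2u};
        return {maxRow, row, 2u};
    }
    if (column == maxRow)
        return {row, maxColumn, 2u};
    return {row, column, 0u};  // unrelated: the cluster is its own relative
}
```

For example, with the maximum cluster at row 3, column 0, the cluster at (4, 3) is paired with (4, 0), while (4, 1) shares no index with the maximum cluster and is a direct copy.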
Algorithm (CUDA) – calculate new value and neighbor minimum index
Calculate new values, references and max pos (4)
//merge neighbors at their minimum index and calculate new value
if(copy != 0)
{
clusterRel.value = inValues_g[getMatLinIdx(clusterRel.row, clusterRel.column)];
cluster.row = min(cluster.row, clusterRel.row);
cluster.column = min(cluster.column, clusterRel.column);
cluster.value = 0.5f * (cluster.value + clusterRel.value); //average-linkage
}
//update row and column indices by decreasing them by one
//if they are larger than their cluster max counterparts
if(cluster.row >= clusterMax.row)
{
cluster.row--;
//merged (non-copied) clusters don't need a column decrease
if(copy == 0 && cluster.column > clusterMax.row)
cluster.column = max(0, cluster.column - 1);
}
//get minimum reference index
const unsigned long long int minRefIdx = min(tId, getMatLinIdx(clusterRel.row,
clusterRel.column));
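The merge and index shift from this slide can be exercised on the host with a small stand-alone function. The sketch assumes signed `int` indices (which the `max(0, ...)` call in the kernel suggests, but the slides do not confirm); `Cell` and `mergeAndShift` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>

struct Cell {
    int row;
    int column;
    float value;
};

// Merges a cluster with its relative at their minimum indices, averages
// the two values, then shifts indices down to account for the removed
// row/column of the maximum cluster.
inline Cell mergeAndShift(Cell cluster, const Cell& relative, unsigned copy, int maxRow) {
    if (copy != 0) {
        cluster.row = std::min(cluster.row, relative.row);
        cluster.column = std::min(cluster.column, relative.column);
        cluster.value = 0.5f * (cluster.value + relative.value);
    }
    if (cluster.row >= maxRow) {
        --cluster.row;
        // Only direct copies can have a column beyond the removed index.
        if (copy == 0 && cluster.column > maxRow)
            cluster.column = std::max(0, cluster.column - 1);
    }
    return cluster;
}
```

With illustrative values (not taken from the slides), merging (4, 3) with (4, 0) for a maximum row of 3 yields (3, 0) with the mean of both values, while a direct copy at (5, 4) moves to (4, 3) unchanged in value.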
Algorithm (CUDA) – output new data
Calculate new values, references and max pos (5)
//output at minimum index
if(minRefIdx == tId)
{
//Get output linear index
const unsigned long long int outLinIdx = getMatLinIdx(cluster.row, cluster.column);
outValues_g[outLinIdx] = cluster.value;
outLinIdxRef_g[outLinIdx] = minRefIdx;
outMaxPos_g[outLinIdx] = copy;
}
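Two threads compute every merged cluster, but only the one with the smaller linear index writes it. A serial sketch of that write guard, useful for checking kernel output on the host, could look as follows. The `ThreadResult` and `Output` structs and the function `writeOutputs` are hypothetical, and the per-thread results are assumed to have been computed beforehand.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ThreadResult {
    std::uint64_t tId;     // linear index of the thread's own cluster
    std::uint64_t relIdx;  // linear index of its relative (== tId for a copy)
    std::uint64_t outIdx;  // linear index in the new DDM
    float value;           // merged or copied value
    unsigned copy;         // 0, 1 or 2 as in the kernel
};

struct Output {
    std::vector<float> values;
    std::vector<std::uint64_t> refs;
    std::vector<unsigned> maxPos;
};

// Applies the kernel's rule: a result is written only if the thread's
// own index is the minimum of itself and its relative.
inline Output writeOutputs(const std::vector<ThreadResult>& results, std::size_t outLength) {
    Output out{std::vector<float>(outLength, 0.0f),
               std::vector<std::uint64_t>(outLength, 0),
               std::vector<unsigned>(outLength, 0u)};
    for (const ThreadResult& r : results) {
        assert(r.outIdx < outLength);
        const std::uint64_t minRefIdx = std::min(r.tId, r.relIdx);
        if (minRefIdx != r.tId) continue;  // the partner thread writes instead
        out.values[r.outIdx] = r.value;
        out.refs[r.outIdx] = minRefIdx;
        out.maxPos[r.outIdx] = r.copy;
    }
    return out;
}
```

In the sketch, threads 6 and 9 would form such a pair: both target the same output slot, and only thread 6 stores its result and becomes the reference.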
Algorithm (Java) – update cpu cluster treenodes
Update binary CPU Cluster-Tree
//original cluster list
ArrayList<Cluster> clusters;
//new cluster list with size one triangular number smaller than size of original
ArrayList<Cluster> clustersNew = new ArrayList<Cluster>((int) lengthOutput);
//cluster max
Cluster clusterMax = clusters.get(clusterMaxIndex);
//new cluster values; original references, new maximum cluster position
float[] newClusterValues; long[] clusterLinIdxRefs; int[] clusterMaxPos;
//iterate over all clusters (Java arrays and lists are indexed by int)
for(int i = 0; i < lengthOutput; i++)
{
//the reference is stored as long, so it is cast for the list lookup
Cluster cluster = clusters.get((int) clusterLinIdxRefs[i]);
if(clusterMaxPos[i] != 0) //0 indicates direct copy
{
cluster.setValue(newClusterValues[i]);
if(clusterMaxPos[i] == 1) //1 indicates cluster max will be left
cluster.setCluster1(clusterMax);
else //2 indicates cluster max will be right
cluster.setCluster2(clusterMax);
}
clustersNew.add(cluster);
}
this.clusters = clustersNew;