Biclustering Bioinformatics Data Sets: A Possibilistic Approach
Francesco Masulli
Dept. of Computer and Information Sciences, University of Genova, Italy
EMFCSC, Erice, 20/4/2007

Outline: Introduction; Biclustering; Possibilistic Biclustering algorithm; Results & Conclusions
Nowadays, in the post-genomic era, many Bioinformatics data sets are available (most of them released in the public domain on the Internet).
The information embedded in most of them has not yet been completely exploited, due to the lack of accurate machine learning tools and/or their limited diffusion in the Bioinformatics community.
Most Bioinformatics data sets come from DNA microarray experiments and are normally given as a rectangular m × n matrix X, where each row represents a feature (e.g., a gene) and each column represents a data sample or condition (e.g., a patient):

$$X = (x_{ij})_{m \times n} \qquad (1)$$

where the value $x_{ij}$ is the expression of the i-th gene in the j-th condition.
The analysis of microarray data sets can give valuable information on the biological relevance of genes and on the correlations between them [Madei, 2004].
BIOINFORMATICS DATA SETS
Major Machine Learning tasks

Clustering (Unsupervised): Given a set of samples, partition them into groups containing similar samples according to some similarity criteria (CLASS DISCOVERING).
Classification (Supervised): Find the classes of the test data set using the known classification of the training data set (CLASS PREDICTION).
Feature Selection (Dimensionality reduction): Select a subset of features responsible for creating the condition corresponding to the class (GENE SELECTION, BIOMARKER SELECTION).
Outlier Detection: Detect data samples that are not good representatives of any of the classes, and disregard them while performing data analysis.
How to identify genes with similar behavior with respect to different conditions?

This is an instance of the problem of biclustering (also known as co-clustering, two-way clustering, ...) [Cheng & Church, 2000; Hartigan, 1972; Kung et al, 2005; Turner et al, 2005].
The algorithm of Cheng & Church [2000] constructs one bicluster at a time using a statistical criterion: a low mean squared residue (the variance of the set of all elements in the bicluster, plus the mean row variance and the mean column variance).
Once a bicluster is created, its entries are replaced by random numbers, and the procedure is repeated iteratively.
Drawback: The masking procedure results in a phenomenon of random interference, affecting the subsequent discovery of large-sized biclusters [Yang et al., 2003].
Let $x_{ij}$ be the expression level of the i-th gene in the j-th condition.
A bicluster is defined as a subset of the m × n data matrix X, i.e., a bicluster is a pair (g, c), where g ⊂ {1, ..., m} is a subset of genes and c ⊂ {1, ..., n} is a subset of conditions [Cheng & Church, 2000; Hartigan, 1972; Kung et al, 2005; Turner et al, 2005].
We are interested in the largest biclusters from DNA microarray data that do not exceed an assigned homogeneity constraint [Cheng & Church, 2000], as they can supply relevant biological information.
The size (or volume) n of a bicluster is usually defined as the number of cells in the gene expression matrix X belonging to it, that is, the product of the cardinalities $n_g = |g|$ and $n_c = |c|$:

$$n = n_g \cdot n_c \qquad (2)$$
Normalized square residual:

$$d^2_{ij} = \frac{(x_{ij} + x_{IJ} - x_{iJ} - x_{Ij})^2}{n} \qquad (3)$$

where the elements $x_{IJ}$, $x_{iJ}$ and $x_{Ij}$ are respectively the bicluster mean, the row mean and the column mean of X for the selected genes and conditions.
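The quantities of eqs. 2 and 3 can be sketched in plain Python for a crisp bicluster. The function name `bicluster_residue` is hypothetical, and two points are assumptions read off the verbal definitions rather than formulas shown here: the three means are taken as plain averages over the selected genes and conditions, and the overall residual G is taken as the sum of the $d^2_{ij}$ over the bicluster.

```python
def bicluster_residue(X, genes, conds):
    """Return (n, d2, G) for the crisp bicluster (genes, conds) of matrix X.

    X is a list of rows (genes); each row lists the values per condition.
    d2[(i, j)] is the normalized square residual of eq. 3.
    G is taken here as the sum of d2 over the bicluster (an assumption).
    """
    n_g, n_c = len(genes), len(conds)
    n = n_g * n_c  # eq. 2
    # Row means x_iJ (over selected conditions) and column means x_Ij
    # (over selected genes); both are assumed to be plain averages.
    row_mean = {i: sum(X[i][j] for j in conds) / n_c for i in genes}
    col_mean = {j: sum(X[i][j] for i in genes) / n_g for j in conds}
    all_mean = sum(X[i][j] for i in genes for j in conds) / n  # x_IJ
    d2 = {
        (i, j): (X[i][j] + all_mean - row_mean[i] - col_mean[j]) ** 2 / n
        for i in genes
        for j in conds
    }
    return n, d2, sum(d2.values())


# Toy data: the first three columns follow a row-effect + column-effect
# pattern, the fourth column does not.
X = [
    [1.0, 2.0, 3.0, 9.0],
    [2.0, 3.0, 4.0, 0.0],
    [5.0, 6.0, 7.0, 4.0],
]
n, d2, G = bicluster_residue(X, genes=[0, 1, 2], conds=[0, 1, 2])
n_all, _, G_all = bicluster_residue(X, genes=[0, 1, 2], conds=[0, 1, 2, 3])
```

On this toy matrix the 3 × 3 bicluster is perfectly additive, so its residual is zero up to rounding, while adding the fourth column enlarges n at the cost of a strictly positive residual.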
G measures the bicluster homogeneity, i.e., the difference between the actual value of an element $x_{ij}$ and its expected value as predicted from the corresponding row mean, column mean, and bicluster mean.

OUR AIM: maximizing the bicluster cardinality n and at the same time minimizing the residual G (an NP-complete task [Peete, 2003]) using the Possibilistic Clustering Paradigm.
POSSIBILISTIC BICLUSTERING
Approaches to clustering Bioinformatics data sets

Data clustering is a routine step in biological data analysis, and a basic tool in Bioinformatics [Golub, et al., 1999; P. Tamayo, et al., 1999; Azuaje, 2003].

Main approaches:
Hierarchical Clustering [Eisen et al., 1998; Orengo et al., 2003]
Partitional (or Central) Clustering: including C-Means [Duda & Hart, 1973], Self Organizing Map [Kohonen, 2001], Fuzzy C-Means [Bezdek, 1981], Deterministic Annealing [Rose et al, 1990], Alternating Cluster Estimation [Runkler, 1999], etc.
POSSIBILISTIC BICLUSTERING
Probabilistic constraint in central clustering

Let $X = \{x_1, \ldots, x_r\}$ be a set of unlabeled data points, $Y = \{y_1, \ldots, y_s\}$ a set of cluster centers (or prototypes) and $U = [u_{pq}]$ the fuzzy membership matrix, with p indexing clusters and q indexing points.

Often, central clustering algorithms impose a probabilistic constraint on memberships, according to which the sum of the membership values of a point in all the clusters must be equal to one:

$$\sum_{p=1}^{s} u_{pq} = 1 \quad \forall q \qquad (8)$$
POSSIBILISTIC BICLUSTERING
From Probabilistic to Possibilistic Clustering

Probabilistic constraint $\sum_{p=1}^{s} u_{pq} = 1$:

PROS: a competitive constraint, allowing the unsupervised learning algorithms to find the barycenter of clusters.
CONS: membership to clusters (a) is not interpretable as a degree of typicality and (b) can give sensitivity to outliers.

[Figure: panels (a) and (b)]
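The two drawbacks can be illustrated numerically. The sketch below is only an illustration under stated assumptions: it uses a Fuzzy C-Means-style membership with fuzzifier m = 2 as the probabilistic example (no specific probabilistic algorithm is fixed here), and contrasts it with the exponential possibilistic membership of eq. 13 with a hypothetical scale `beta = 1`. The function names are made up for the example.

```python
import math


def probabilistic_memberships(x, centers, m=2.0):
    """FCM-style memberships of a 1-D point x; they sum to one over clusters."""
    d2 = [(x - y) ** 2 for y in centers]
    if any(d == 0.0 for d in d2):
        # The point coincides with a center: share full membership there.
        hits = [1.0 if d == 0.0 else 0.0 for d in d2]
        return [h / sum(hits) for h in hits]
    inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
    total = sum(inv)
    return [v / total for v in inv]


def possibilistic_memberships(x, centers, beta=1.0):
    """Typicality u = exp(-E / beta) for each cluster; no sum constraint."""
    return [math.exp(-((x - y) ** 2) / beta) for y in centers]


centers = [-1.0, 1.0]
typical = -1.1   # a point close to the first center
outlier = 100.0  # a point far from both centers

prob_outlier = probabilistic_memberships(outlier, centers)
poss_outlier = possibilistic_memberships(outlier, centers)
poss_typical = possibilistic_memberships(typical, centers)
```

Under the probabilistic constraint the outlier still receives memberships close to 0.5 in both clusters, so it pulls on both barycenters; the possibilistic memberships of the same point are practically zero, while the typical point keeps a typicality close to one for its own cluster.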
PCM objective function [Krishnapuram & Keller, 1996]:

$$J_m(U, Y) = \sum_{p=1}^{s} \sum_{q=1}^{r} u_{pq} E_{pq} + \sum_{p=1}^{s} \beta_p \sum_{q=1}^{r} (u_{pq} \log u_{pq} - u_{pq}) \qquad (12)$$

where:
$E_{pq} = \| x_q - y_p \|^2$ is the squared Euclidean distance;
$\beta_p$ is a scale parameter depending on the average size of the p-th cluster.

Thanks to the penalty term, points with a high degree of typicality have high $u_{pq}$ values, and points that are not very representative have low $u_{pq}$ values in all the clusters.
Note that if $\beta_p \to 0$ ∀p the penalty term vanishes, and the trivial solution $u_{pq} = 0$ ∀p, q results, as no probabilistic constraint is assumed.
The pair (U, Y) minimizes $J_m$ under the possibilistic constraints 9-11 only if:

$$u_{pq} = e^{-E_{pq}/\beta_p} \quad \forall p, q \qquad (13)$$

and

$$y_p = \frac{\sum_{q=1}^{r} x_q u_{pq}}{\sum_{q=1}^{r} u_{pq}} \quad \forall p \qquad (14)$$

Eqs. 13 and 14 are solved by Picard iteration.
PCM is a membership refinement algorithm, with membership to clusters read as a cluster typicality degree (the centroids can be initialized using, e.g., Fuzzy C-Means).
It has a high outlier rejection capability, as PCM makes the membership of outliers very low.
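The Picard iteration of eqs. 13 and 14 can be sketched as follows. This is a minimal illustration in plain Python, not the implementation used for the results: the function name `pcm`, the stopping rule on the largest center shift, and the choice to keep the scales `betas` fixed during the iteration are all assumptions.

```python
import math


def pcm(points, centers, betas, eps=1e-6, max_iter=100):
    """Possibilistic C-Means by Picard iteration of eqs. 13 and 14.

    points:  list of r data vectors (lists of floats)
    centers: list of s initial centers (e.g. from Fuzzy C-Means)
    betas:   list of s scale parameters, kept fixed here (assumption)
    Returns (U, centers); U[p][q] is the typicality of point q for cluster p.
    """
    centers = [list(y) for y in centers]
    dim = len(points[0])
    U = []
    for _ in range(max_iter):
        # Eq. 13: u_pq = exp(-||x_q - y_p||^2 / beta_p)
        U = [
            [math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / beta)
             for x in points]
            for y, beta in zip(centers, betas)
        ]
        # Eq. 14: y_p = sum_q u_pq x_q / sum_q u_pq
        new_centers = []
        for u_p in U:
            total = sum(u_p)
            new_centers.append(
                [sum(u * x[k] for u, x in zip(u_p, points)) / total
                 for k in range(dim)]
            )
        # Stop when no center coordinate moves by more than eps.
        shift = max(abs(a - b)
                    for y, z in zip(centers, new_centers)
                    for a, b in zip(y, z))
        centers = new_centers
        if shift < eps:
            break
    return U, centers


points = [
    [0.0, 0.0], [0.2, 0.0], [0.0, 0.2],   # first group
    [5.0, 5.0], [5.2, 5.0], [5.0, 5.2],   # second group
    [20.0, -20.0],                        # outlier
]
U, Y = pcm(points, centers=[[0.5, 0.5], [4.5, 4.5]], betas=[1.0, 1.0])
```

In this toy run the outlier ends up with a negligible typicality for both clusters, so it has practically no influence on the centers.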
For each bicluster we assign two vectors of membership, one for the rows and one for the columns, denoted respectively a and b.
In a crisp sets framework, row i and column j can either belong to the bicluster ($a_i = 1$ and $b_j = 1$) or not ($a_i = 0$ or $b_j = 0$).
An element $x_{ij}$ of X belongs to the bicluster if both $a_i = 1$ and $b_j = 1$, i.e., its membership $u_{ij}$ to the bicluster is:

$$u_{ij} = \mathrm{and}(a_i, b_j) \qquad (16)$$

The cardinality of the bicluster is then defined as:

$$n = \sum_i \sum_j u_{ij} \qquad (17)$$
We allow the memberships $u_{ij}$, $a_i$ and $b_j$ to take values in the interval [0, 1].
The membership $u_{ij}$ of an element $x_{ij}$ of X to the bicluster can be obtained by the aggregation of row and column memberships, using, e.g., a fuzzy t-norm such as the product:

$$u_{ij} = a_i b_j \quad \text{(product)} \qquad (18)$$

or an averaging operator:

$$u_{ij} = \frac{a_i + b_j}{2} \quad \text{(average)} \qquad (19)$$

The fuzzy cardinality of the bicluster is defined as the sum of the memberships $u_{ij}$ for all i and j, as in eq. 17.
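The two aggregators and the fuzzy cardinality can be compared on a small example. The helper names `memberships` and `cardinality` are hypothetical and the vectors a and b are made-up values chosen only for illustration.

```python
def memberships(a, b, aggregator="average"):
    """Element memberships u_ij from row memberships a and column memberships b."""
    if aggregator == "product":   # eq. 18
        return [[ai * bj for bj in b] for ai in a]
    if aggregator == "average":   # eq. 19
        return [[(ai + bj) / 2.0 for bj in b] for ai in a]
    raise ValueError("unknown aggregator: " + aggregator)


def cardinality(u):
    """Fuzzy cardinality n: the sum of u_ij over all i and j (eq. 17)."""
    return sum(sum(row) for row in u)


a = [1.0, 0.5, 0.0]  # row (gene) memberships
b = [1.0, 0.0]       # column (condition) memberships
n_product = cardinality(memberships(a, b, "product"))
n_average = cardinality(memberships(a, b, "average"))
```

With crisp 0/1 memberships the product coincides with the `and` of eq. 16, whereas the average assigns 0.5 to a cell when only its row or only its column belongs to the bicluster, so the two aggregators generally give different cardinalities.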
Possibilistic Biclustering Problem: maximizing the bicluster cardinality n and minimizing the fuzzy residual G under the fuzzy possibilistic paradigm.

To this aim we make the following assumptions:
we treat one bicluster at a time;
the fuzzy memberships $a_i$ and $b_j$ are interpreted as typicality degrees of gene i and condition j with respect to the bicluster;
we compute the membership $u_{ij}$ using the average aggregator (eq. 19).
All those requirements are fulfilled by minimizing the following functional $J_B$ with respect to a and b:

$$J_B = \sum_{ij} \left( \frac{a_i + b_j}{2} \right) d^2_{ij} + \lambda \sum_i (a_i \ln a_i - a_i) + \mu \sum_j (b_j \ln b_j - b_j) \qquad (23)$$

The first term is the fuzzy mean square residual G, while the other two are penalization terms.

The parameters λ and µ control the size of the bicluster. Their values can be estimated by simple statistics over the training set, and then hand-tuned to incorporate possible a-priori knowledge and to obtain the expected results.
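As a worked step, the stationary points of eq. 23 can be derived in closed form if the residuals $d^2_{ij}$ are treated as constants while differentiating. That simplification is an assumption of this sketch (the residuals themselves depend on the memberships through the means), and it uses $\frac{d}{da}(a \ln a - a) = \ln a$:

```latex
\frac{\partial J_B}{\partial a_i}
  = \sum_j \frac{d^2_{ij}}{2} + \lambda \ln a_i = 0
  \;\Longrightarrow\;
  a_i = \exp\!\left( -\frac{\sum_j d^2_{ij}}{2\lambda} \right),
\qquad
\frac{\partial J_B}{\partial b_j}
  = \sum_i \frac{d^2_{ij}}{2} + \mu \ln b_j = 0
  \;\Longrightarrow\;
  b_j = \exp\!\left( -\frac{\sum_i d^2_{ij}}{2\mu} \right).
```

Under this reading, a larger λ or µ pushes the corresponding memberships toward 1 and so enlarges the bicluster, which matches the role of the two parameters described above.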
The initialization of the memberships can be made:
randomly;
using some a priori information about relevant genes and conditions;
using the results already obtained from another biclustering algorithm (in this case PBC will work as a refinement algorithm).

ε controls the convergence of the algorithm.

After convergence of the algorithm, the memberships a and b can be defuzzified by applying an α-cut, i.e., by comparing them with a threshold.
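The overall loop (initialize, iterate until the change falls below ε, then apply an α-cut) can be sketched as below. This is a hedged reconstruction, not the PBC algorithm as published: the update rules are the ones obtained by setting the derivatives of eq. 23 to zero with $d^2_{ij}$ held fixed, the means are assumed to be membership-weighted averages, the stopping rule compares the largest membership change with `eps`, and the names `pbc` and `alpha_cut` as well as the toy data are hypothetical.

```python
import math


def pbc(X, a, b, lam, mu, eps=1e-4, max_iter=200):
    """One-bicluster possibilistic biclustering sketch (see assumptions above)."""
    rows, cols = range(len(X)), range(len(X[0]))
    a, b = list(a), list(b)
    for _ in range(max_iter):
        u = [[(a[i] + b[j]) / 2.0 for j in cols] for i in rows]  # eq. 19
        card = sum(map(sum, u))                                  # eq. 17
        # Membership-weighted bicluster, row and column means (assumed form).
        x_IJ = sum(u[i][j] * X[i][j] for i in rows for j in cols) / card
        x_iJ = [sum(u[i][j] * X[i][j] for j in cols) / sum(u[i]) for i in rows]
        x_Ij = [sum(u[i][j] * X[i][j] for i in rows) / sum(u[i][j] for i in rows)
                for j in cols]
        # Eq. 3, with n taken as the fuzzy cardinality.
        d2 = [[(X[i][j] + x_IJ - x_iJ[i] - x_Ij[j]) ** 2 / card for j in cols]
              for i in rows]
        # Stationary points of eq. 23 with d2 held fixed.
        new_a = [math.exp(-sum(d2[i]) / (2.0 * lam)) for i in rows]
        new_b = [math.exp(-sum(d2[i][j] for i in rows) / (2.0 * mu)) for j in cols]
        change = max(max(abs(p - q) for p, q in zip(new_a, a)),
                     max(abs(p - q) for p, q in zip(new_b, b)))
        a, b = new_a, new_b
        if change < eps:  # eps controls convergence
            break
    return a, b


def alpha_cut(memberships, alpha=0.5):
    """Defuzzify: indices whose membership exceeds the threshold alpha."""
    return [k for k, v in enumerate(memberships) if v > alpha]


# Toy data: rows 0-2 share an additive pattern, row 3 does not.
X = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 3.0, 4.0, 5.0],
    [3.0, 4.0, 5.0, 6.0],
    [9.0, 0.0, 7.0, 1.0],
]
a, b = pbc(X, a=[1.0] * 4, b=[1.0] * 4, lam=1.0, mu=1.0)
selected_genes = alpha_cut(a, alpha=0.5)
```

Starting from uniform memberships, the row that breaks the additive pattern receives a low typicality and is dropped by the α-cut, while the three coherent rows are retained. Changing `lam` and `mu` changes how many rows and columns survive the cut.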
PBC is only slightly sensitive to the initialization of the memberships, but strongly sensitive to the parameters λ and µ. PBC can find biclusters of a desired size just by tuning the parameters λ and µ (results averaged over 20 runs).
Method                   avg. G   avg. n   avg. n_g   avg. n_c   Largest n
DBF [Zhang et al 2004]   115      1627     188        11         4000
FLOC [Yang et al 2003]   188      1826     195        12.8       2000
Cheng-Church [2000]      204      1577     167        12         4485
The Possibilistic Biclustering (PBC) algorithm extends the possibilistic clustering paradigm to the solution of the biclustering problem.

The membership $u_{ij}$ of an element $x_{ij}$ of X to the bicluster is obtained by aggregating the memberships (typicality) of its row (gene) and column (condition) with respect to the bicluster.

The quality (residual G) of the large biclusters obtained is better than that of other biclustering methods.

Further studies:
biological validation of the obtained results;
automatic selection of the parameters λ and µ;
other aggregators for obtaining $u_{ij}$.