CHAPTER 2
LITERATURE SURVEY
2.1 SURVEY ON TEXT MINING
Eighty percent of the information in the world is currently
stored in unstructured textual format (Kalogeratos and Likas, 2011).
Although techniques such as Natural Language Processing (NLP) can
accomplish limited text analysis, there are currently no computer
programs available to analyse and interpret text for diverse information
extraction needs. Text mining is therefore a dynamic and emerging area.
The world is fast becoming information intensive, with specialized
information being collected into very large data sets; one example is the
extraction of information from Chinese handwritten documents (Koo and
Cho, 2012).
The Internet, for example, contains a vast amount of online text
documents, which change and grow rapidly. It is nearly impossible to
manually organize such vast and rapidly evolving data. The necessity to
extract useful and relevant information from such large data sets (Chen et
al., 2010) has led to an important need to develop computationally
efficient text mining algorithms. An example problem is to automatically
assign natural language text documents to predefined sets of categories
based on their content.
Other examples of problems involving large data sets include
searching for targeted information in scientific citation databases such as
those of the Institute of Electrical and Electronics Engineers (IEEE), the
Association for Computing Machinery (ACM) and Elsevier's Scopus
(SCOPUS); filtering and categorizing web pages by topic (Dimopoulos
et al., 2010); and routing relevant email to the appropriate addresses. A
particular problem of interest here is that of classifying documents into a
set of user-defined categories based on their content. As the document
size increases, the dimension of the hyperspace in which text classification
is done becomes enormous, resulting in high computational cost (Luo et
al., 2009).
However, the dimensionality can be reduced through feature
extraction algorithms. In topic summarization (Forestier et al., 2010), the
resulting summaries have been reported to be superior, in terms of content
coverage, coherence and consistency, to those derived from existing
summarization methods when evaluated against human-composed
reference summaries.
Text mining is the automatic and semi-automatic extraction of
implicit, previously unknown, and potentially useful information and
patterns, from a large amount of unstructured textual data, such as
natural-language texts. In text mining, each document is represented as a
vector whose dimension is approximately the number of distinct
keywords in the collection, which can be very large. One of the main
challenges in text mining is to classify textual data with such high
dimensionality (Song et al., 2013).
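As a minimal sketch of this vector representation (the three toy documents and all variable names below are illustrative assumptions, not taken from the cited works), a document-term matrix can be built with scikit-learn:

    # Illustrative sketch: documents as term-frequency vectors.
    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "text mining extracts patterns from text",
        "clustering groups similar documents",
        "feature extraction reduces dimensionality",
    ]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)          # document-term matrix, one row per document
    print(vectorizer.get_feature_names_out())   # the distinct keywords (one dimension each)
    print(X.toarray())                          # term frequencies per document

Even on this toy corpus the vector dimension equals the number of distinct keywords, which illustrates why real collections quickly become very high dimensional.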
In addition to high dimensionality, text-mining algorithms
must also deal with word ambiguities such as pronouns and synonyms, as
well as noisy data, spelling mistakes, abbreviations, acronyms and
improperly structured text. Text mining algorithms are of two types:
supervised learning and unsupervised learning. In addition to supervised
and unsupervised learning, a meta-learning approach has also been
applied to optimization (Kordik et al., 2010).
Supervised learning (Zhiding et al., 2010) is a technique in
which the algorithm uses predictor and target attribute value pairs to
learn the relation between the predictor and the target values. The training
data consist of pairs of predictor and target values, with each predictor
value tagged with a target value. If the algorithm predicts a categorical
value for the target attribute, it is called classification. Class is an example
of a categorical variable; positive and negative can be two values of the
categorical variable class. Categorical values do not have partial
ordering. If the algorithm predicts a numerical value, it is called
regression. Numerical values have partial ordering.
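As an illustrative sketch only (the toy predictor/target pairs and the particular models are assumptions, not drawn from the cited work), the same kind of training pairs can feed either a classifier or a regressor:

    # Illustrative: the same predictor values with categorical vs. numerical targets.
    from sklearn.linear_model import LogisticRegression, LinearRegression

    X = [[1.0], [2.0], [3.0], [4.0]]                            # predictor values
    y_class = ["negative", "negative", "positive", "positive"]  # categorical targets
    y_reg = [1.1, 1.9, 3.2, 3.9]                                # numerical targets

    clf = LogisticRegression().fit(X, y_class)   # classification: predicts a category
    reg = LinearRegression().fit(X, y_reg)       # regression: predicts a number

    print(clf.predict([[2.5]]))
    print(reg.predict([[2.5]]))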
The use of the traditional k-means type algorithm is limited to
numeric data. Ahmad and Dey (2007) present a clustering algorithm
based on the k-means paradigm that works well for data with mixed
numeric and categorical features. The authors proposed a new cost
function and distance measure based on the co-occurrence of values. The
measures also take into account the significance of an attribute towards
the clustering process (Birant and Kut, 2007).
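The following is only a generic sketch of a mixed numeric/categorical distance of the kind such algorithms build on; it is not the cost function of Ahmad and Dey (2007), whose measure is learned from co-occurrence statistics, and the attribute layout and weight are invented for illustration:

    # Generic mixed-type distance: squared differences for numeric attributes,
    # a fixed mismatch penalty for categorical attributes (illustrative only).
    def mixed_distance(a, b, numeric_idx, categorical_idx, cat_weight=1.0):
        num_part = sum((a[i] - b[i]) ** 2 for i in numeric_idx)
        cat_part = sum(cat_weight for i in categorical_idx if a[i] != b[i])
        return num_part + cat_part

    # Example records: (age, income, colour, city)
    x = (25, 50000.0, "red", "Chennai")
    y = (30, 48000.0, "blue", "Chennai")
    print(mixed_distance(x, y, numeric_idx=[0, 1], categorical_idx=[2, 3]))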
Unsupervised learning (Ilin, 2012) is a technique in which the
algorithm uses only the predictor attribute values. There is no target
attribute value and the learning task is to gain some understanding of
relevant structural patterns in the data. Each row in a data set represents a
point in n-dimensional space, and unsupervised learning algorithms
investigate the relationships between these points. Examples of
unsupervised learning are clustering, density estimation and feature
extraction.
Text collections contain millions of unique terms, which makes
the text-mining process difficult. Therefore, feature extraction is used
when applying machine learning methods. A feature is a combination of
attributes (keywords) which captures important characteristics of the
data. A feature extraction method creates a new set of features, far
smaller than the number of original attributes, by decomposing the
original data. It therefore enhances the speed of supervised learning.
Zha et al. (2001) combined k-means with spectral analysis, and
Kotsiantis et al. (2004) extended the k-means algorithm to improve its
performance. Spatial mining was carried out by Ng and Han (1994). Jain
et al. (1999) improved this concept by introducing Principal Component
Analysis, which has been adopted for analysis in image processing.
Unsupervised algorithms like Principal Components Analysis
(PCA), singular value decomposition and Nonnegative Matrix
Factorization (NMF) involve factoring the document-word matrix, based
on different constraints, for feature extraction (Ghosh et al., 2011).
Nonnegative matrix factorization is a new unsupervised algorithm for
efficient feature extraction from text documents. NMF is a feature
extraction algorithm that decomposes text data by creating a user-defined
number of features, giving a reduced representation of the original text
data. It decomposes the text data matrix ‘Amn’ (with one column per
document) into the product of a basis matrix ‘Wmk’ and a coefficient
matrix ‘Hkn’, that is, Amn ≈ Wmk Hkn, where ‘k’ is the number of features.
Each document of a text collection can be represented as a
linear combination of basis text document vectors or “feature” vectors. A
document ‘Doc1’ (the first column of the matrix) can be constructed as a
linear combination of the basis vectors ‘W1’, ‘W2’, …, ‘Wk’ with the
corresponding coefficients ‘h11’, ‘h21’, …, ‘hk1’ from the matrix ‘Hkn’. Thus,
once the model is built and the feature vectors are constructed, any
document can be represented in terms of ‘k’ coefficients, resulting in a
reduced dimensionality (Yan et al., 2011) from ‘m’ to ‘k’. In this example,
document ‘Doc1’ is a linear combination of the feature vectors ‘W1’, ‘W2’,
‘W3’, …, ‘W10’ and its corresponding weights.
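A minimal sketch of this kind of NMF-based feature extraction with scikit-learn is given below; the toy corpus, the choice of k = 2 and all variable names are assumptions for illustration. Note that scikit-learn treats documents as rows, so the roles of the factors are transposed relative to the column-per-document convention used above.

    # Illustrative sketch: NMF feature extraction on a tiny toy corpus.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import NMF

    docs = [
        "text mining finds patterns in documents",
        "clustering groups similar text documents",
        "support vector machines classify documents",
    ]

    A = TfidfVectorizer().fit_transform(docs)                    # term weights, one row per document
    model = NMF(n_components=2, init="nndsvda", random_state=0)  # k = 2 user-defined features
    W = model.fit_transform(A)     # document-by-feature coefficients
    H = model.components_          # feature-by-term basis vectors

    print(W.shape, H.shape)        # each document is now described by k = 2 coefficients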
The NMF decomposition is non-unique; the matrices ‘W’ and
‘H’ depend on the NMF algorithm employed and the error measure used
to check convergence. Some of the NMF algorithm types are the
multiplicative update algorithm, the gradient descent algorithm and the
alternating least squares algorithm. The NMF algorithm iteratively
updates the factorization based on a given objective function. The
general objective function is to minimize the Euclidean distance between
each column of the matrix and its approximation. Xu and Wunsch (2010)
proved that the multiplicative update rules achieve monotonic convergence.
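A compact sketch of the standard multiplicative updates for the Euclidean objective is given below; this is the generic textbook form with invented toy data, not an implementation taken from the cited works.

    # Sketch of NMF multiplicative updates for minimizing ||A - W H||^2.
    import numpy as np

    def nmf_multiplicative(A, k, n_iter=200, eps=1e-9):
        m, n = A.shape
        rng = np.random.default_rng(0)
        W = rng.random((m, k))
        H = rng.random((k, n))
        for _ in range(n_iter):
            H *= (W.T @ A) / (W.T @ W @ H + eps)   # update coefficient matrix
            W *= (A @ H.T) / (W @ H @ H.T + eps)   # update basis matrix
            # each step does not increase ||A - W H||, hence monotonic convergence
        return W, H

    A = np.random.default_rng(1).random((6, 4))    # toy nonnegative data matrix
    W, H = nmf_multiplicative(A, k=2)
    print(np.linalg.norm(A - W @ H))               # reconstruction error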
Clearly, the accuracy of the approximation depends on the
value of ‘k’, which is the number of feature vectors. In this work, ‘k’ is
user defined. A systematic study has been carried out to investigate the
influence of k on the accuracy of the model.
In text documents, two important aspects are the term weight
and the similarity measure (Zhang et al., 2012). In text mining each
document is represented as a vector. The elements in the vector reflect
the frequency of terms in documents: each word is a dimension and
documents are vectors. Each word in a document has a weight. These
weights can be of two types: local and global weights. If local weights are
used, term weights are normally expressed as term frequencies (TF).
If global weights are used, the Inverse Document Frequency
(IDF) value gives the weight of a term. Better term weighting is possible
by multiplying ‘tf’ values with ‘IDF’ values, thereby considering both
local and global information. The total weight of a term is therefore
‘term weight = tf * IDF’. This is commonly referred to as ‘tf * IDF’ weighting.
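A small sketch of this weighting is shown below; the toy corpus is invented, and scikit-learn uses a smoothed IDF variant, so the exact values differ slightly from the plain tf * IDF formula.

    # Illustrative tf * IDF weighting on a toy corpus.
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "text mining of large text collections",
        "clustering of document collections",
        "text classification with support vector machines",
    ]

    vec = TfidfVectorizer()             # computes tf * IDF per term per document
    X = vec.fit_transform(docs)
    print(vec.get_feature_names_out())
    print(X.toarray().round(2))         # rows: documents, columns: tf * IDF weights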
Different from previous document clustering methods based
on latent semantic indexing or NMF, Locality Preserving Indexing (LPI),
studied by Agrafiotis and Xu (2002) and Cai et al. (2005), tries to
discover both the geometric and the discriminating structures of the
document space. In LPI, information retrieval is supported by a rough-set
filtering method based on a support vector machine.
This was further modified by Cai et al. (2011), in which the
authors used NMF for text categorization. NMF can only be performed
in the original feature space of the data points, and it gives acceptable
results compared with existing systems.
In LPI, the documents can be projected into a lower
dimensional semantic space in which the documents related to the same
semantics are close to each other. Cai et al. (2011) further modified LPI
as Locally Consistent Concept Factorization (LCCF) by using the graph
Laplacian to smooth the document-to-concept mapping. The LCCF can
extract concepts with respect to the intrinsic manifold structure, and thus
documents associated with the same concept can be well clustered. These
methods improve the performance of the algorithm, but they are limited
by the large number of epochs and repeated iterations they require.
The divide-and-merge approach (Cheng et al., 2006) and the
metric learning model (Lebanon, 2006) have been proposed in the
literature, but both have performance limitations due to the large number
of epochs and repeated iterations they require. The divide-and-merge
methodology clusters a set of objects by combining a top-down “divide”
phase with a bottom-up “merge” phase. In contrast, previous algorithms
use either top-down or bottom-up methods to construct a hierarchical
clustering or produce a flat clustering using local search (e.g., k-means).
Divide and merge has been used by many researchers; Cheng et al.
(2006), for example, proposed a spectral algorithm for the divide phase.
Sentiment analysis or opinion mining aims to use automated tools to
detect subjective information such as opinions, attitudes and feelings
expressed in text.
If two documents describe similar topics, employing nearly the
same keywords, these texts are similar and their similarity measure
should be high. Usually the dot product represents the similarity of the
documents. To normalize the dot product, it can be divided by the
Euclidean norms (lengths) of the two document vectors (He et al., 2011).
This ratio gives the cosine of the angle between the vectors, with values
between ‘0’ and ‘1’. This is called cosine similarity.
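A short sketch of cosine similarity between two term-frequency vectors (the vectors themselves are invented for illustration):

    # Cosine similarity between two toy term-frequency vectors.
    import numpy as np

    doc1 = np.array([2, 1, 0, 3], dtype=float)   # term frequencies of document 1
    doc2 = np.array([1, 1, 1, 2], dtype=float)   # term frequencies of document 2

    cosine = doc1 @ doc2 / (np.linalg.norm(doc1) * np.linalg.norm(doc2))
    print(cosine)   # close to 1 when the documents use nearly the same keywords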
Soft margin classification - If the training set is linearly
separable, the classification is called hard margin classification. If the
training set is not linearly separable, slack variables ‘ξi ≥ 0’, i = 1, …, n,
can be added to allow some misclassification of difficult or noisy
examples. This procedure is called soft margin classification (Wang et al., 2012).
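For reference, the usual soft-margin objective (a standard textbook formulation, not quoted from the cited work) is

    minimize (1/2) ||w||² + C Σi ξi
    subject to yi (w · xi + b) ≥ 1 − ξi, ξi ≥ 0, i = 1, …, n

where ‘C’ controls the trade-off between a wide margin and the amount of misclassification that is tolerated.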
Non-linear classifiers (Charu et al., 2012) - The slack variable
approach is not a very efficient technique for classifying non-separable
classes in the input space: soft margin classification alone is not
sufficient because the data are not linearly separable. Non-linear
classifiers require a feature map ‘Φ’, which is a function that maps the
input data patterns into a higher dimensional space. For example, a two-
dimensional input space may contain two non-separable classes, shown
as circles and triangles.
The input data space is then mapped to a three-dimensional
feature space using a feature map ‘Φ’. In the feature space, a support
vector machine can find a linear classifier that separates these classes
easily by a hyperplane. For 100-dimensional data, the number of second-
order features is already about 5,000, so the feature map approach inflates
the input representation. It is not scalable unless a small subset of features
is used. The explicit computation of the feature map ‘Φ’ can be avoided if
the learning algorithm depends only on inner products; the Support
Vector Machine (SVM) decision function (Mu et al., 2012) has always
been expressed in terms of dot products.
Kernel functions - Kernel functions are used to map the
input space to a feature space, instead of an explicit feature map ‘Φ’,
when the operations on the data are always dot products (Wu, 2012). In
this way the complexity of calculating ‘Φ’ can be reduced. The main
optimization function of the SVM can be re-written in the dual form,
where the data appear only as inner products between data points. A
kernel ‘K’ is a function that returns the inner product of two data points.
Computing the kernel ‘K’ is equivalent to mapping data patterns into a
higher dimensional space and then taking the dot product there.
Using this kernel approach, the SVM exploits information
about the inner product between data points in feature space. Kernels map
data points into a feature space where they are more likely to be linearly
separable. In order to classify non-separable classes, the kernel technique
is a better approach. The SVM performs a nonlinear mapping of the input
vector from the input space into a higher dimensional Hilbert space,
where the mapping is determined by the kernel function. Two typical
kernel functions are 1) the polynomial kernel, where ‘d’ is the degree and
‘C’ is a constant, and 2) the Gaussian kernel, where ‘σ’ is the bandwidth
of the Gaussian curve.
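Written out explicitly (standard textbook forms, not reproduced from the cited sources), these two kernels are

    Polynomial kernel:  K(x, y) = (x · y + C)^d
    Gaussian kernel:    K(x, y) = exp(−||x − y||² / (2σ²))

where ‘x’ and ‘y’ are input vectors.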
Many methods for local optimization are based on the notion
of a direction of a local descent at a given point. A local improvement of
a point in hand can be made using this direction. As a rule, modern
methods for global optimization do not use directions of global descent
for global improvement of the point in hand. From this point of view,
a Global OPtimization (GOP) algorithm based on a dynamical systems
approach is an unusual method. A hybrid GOP was proposed by Ali and
Babak (2010), whose structure is similar to that used in local
optimization: a new iterate is obtained as an improvement on the previous
one along a certain direction. In contrast with local methods, this
direction is a direction of global descent and, for more diversification, it
is combined with Tabu search.
Multi-class and Multi-target problems - Text classification is
usually a multi-target problem. Each document can be in multiple
categories, exactly one category or no category. Examples of multi-target
problems in medical diagnosis are that a disease may belong to multiple
categories and that a gene can have multiple functions. A multi-target
problem is the same as building K independent binary problems, where
K is the number of targets.
Each binary problem sets the rows belonging to its own target to one
value and all the other rows to the opposite class. In a multi-target case a
document can belong to more than one class with high probability. For
example, suppose that a given document can belong to one of ‘4’ classes:
Circle, Square, Triangle and Diamond. In this case, ‘4’ independent
binary problems are needed. After a model is built, when a new document
arrives, the mining process uses its ‘4’ binary models and determines that
the document belongs to one or more of the ‘4’ classes.
A document, in a multi-target problem (Wang et al., 2011),
belongs to more than one class. If a document belongs only to a single
class, it would be a multi-class problem. Each binary problem is built
using all the data.
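A small sketch of this one-binary-problem-per-target construction with scikit-learn is shown below; the four class names, the toy texts and the pipeline choices are illustrative assumptions.

    # Illustrative multi-target text classification: one binary problem per class.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.pipeline import make_pipeline

    docs = [
        "round shapes and circles",
        "squares and rectangles with corners",
        "triangles have three sides",
        "a diamond is a rotated square",
    ]
    labels = [["Circle"], ["Square"], ["Triangle"], ["Diamond", "Square"]]

    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(labels)                  # one binary column per class

    clf = make_pipeline(TfidfVectorizer(),
                        OneVsRestClassifier(LogisticRegression()))
    clf.fit(docs, Y)                               # builds one binary model per class
    print(mlb.classes_)
    print(clf.predict(["shapes with corners like squares"]))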
2.2 REVIEWS ON DATA CLUSTERING
Clustering, or cluster analysis, is a set of methodologies for
the classification of samples into a number of groups, such that samples
in one group are similar to one another and samples belonging to different
groups are dissimilar. The input to clustering is a set of samples together
with a measure of the similarity or dissimilarity between the given
samples. The output of clustering is a number of groups or clusters,
presented as graphs (Scarselli et al., 2009), histograms or other computer
output showing the group number, as in Figure 2.1.
Clustering is a well-established technique for data
interpretation. It usually requires prior information, e.g., about the
statistical distribution of the data or the number of clusters to detect.
“Clustering” attempts to identify natural clusters in a data set. It does this
by partitioning the entities in the data such that each partition consists of
entities that are close (or similar), according to some distance (similarity)
function based on entity attributes (Luhr and Lazarescu, 2009).
Conversely, entities in different partitions are relatively far apart
(dissimilar).
Existing clustering algorithms such as K-means, Partitioning
Around Medoids (PAM), Clustering Large Applications based on
RANdomized Search (CLARANS) and Density-Based Spatial Clustering
of Applications with Noise (DBSCAN) are designed to find clusters
that fit some static models. For example, K-means, PAM and CLARANS
assume that clusters are hyper-ellipsoidal or hyper-spherical and are of
similar sizes. DBSCAN assumes that all points of a cluster are
density reachable and points belonging to different clusters are not.
However, all these algorithms can break down if the choice of
parameters in the static model is incorrect with respect to the data set
being clustered, or if the model does not capture the characteristics of the
clusters (e.g., size or shape). Because the objective is to discern structure
in the data, the results of a clustering are then examined by a domain
expert to see if the groups suggest something.
For example, crop production data from an agricultural region
may be clustered according to various combinations of factors, including
soil type, cumulative rainfall, average low temperature, solar radiation,
availability of irrigation, strain of seed used and type of fertilizer applied.
Interpretation by a domain expert is needed to determine whether a
discerned pattern, such as a propensity for high yields to be associated
with heavy applications of fertilizer, is meaningful, because other factors
may actually be responsible (e.g., if the fertilizer is water soluble and
rainfall has been heavy).
Figure 2.1: Cluster analysis process: (a) initial data, (b) output in three
clusters, (c) output in four clusters
Many clustering algorithms that work well with traditional
data deteriorate when executed on geospatial data (which often are
characterized by a high number of attributes or dimensions), resulting in
increased running times or poor-quality clusters. For this reason, recent
research has centred on the development of clustering methods for large,
high-dimensional data sets, particularly techniques that execute in linear
time as a function of input size or that require only one or two passes
through the data. Recently developed spatial clustering methods that
seem particularly appropriate for geospatial data include partitioning,
hierarchical, density-based, grid-based, model-based and constraint-based
methods.
Hierarchical methods build clusters through top-down (by
splitting) or bottom-up (through aggregation) methods. Density-based
methods define clusters as regions of space with a relatively large
number of spatial objects; unlike other methods, these can find
arbitrarily-shaped clusters. Grid-based methods divide space into a raster
tessellation and cluster objects based on this structure. Model-based
methods find the best fit of the data relative to specific functional
forms. Constraint-based methods can capture spatial restrictions on
clusters or the relationships that define these clusters.
An input to a cluster analysis can be described as an ordered
pair (X, s) or (X, d), where ‘X’ is a set of descriptions of samples and ‘s’
and ‘d’ are measures of similarity or dissimilarity (distance) between
samples, respectively, as in equations (2.1) and (2.2). Output from the
clustering system is a partition A = {G1, G2, …, GN}, where Gk, k = 1, …,
N, is a crisp subset of ‘X’ such that:
G1 ∪ G2 ∪ … ∪ GN = X (2.1)
Gi ∩ Gj = Ø, i ≠ j (2.2)
G1, G2, …, GN are the clusters.
Most clustering algorithms are based on the following four
popular approaches:
(1) Partitioning methods
(2) Hierarchical clustering
(3) Iterative square-error partitioned clustering
(4) Density based clustering
• Partitioning methods: Given a database of ‘n’ objects or data
tuples, a partitioning method constructs ‘k’ (k ≤ n) partitions of
the data, where each partition represents a cluster. That is, it
classifies the data into ‘k’ groups, which together satisfy the
following requirements:
• Each group must contain at least one object
• Each object must belong to exactly one group
Notice that the second requirement can be relaxed in some
fuzzy partitioning techniques (Tang et al., 2010). Such a
partitioning method creates an initial partitioning. It then uses
an iterative relocation technique that attempts to improve the
partitioning by moving objects from one group to another.
Representative algorithms include k-means, k-medoids and