Data Mining Tool for Effective Classification and Retrieval of Relevant User Data Using Fuzzy and BSO
1 Antony Rosewelt & 2 Arokia Renjit
1 Department of CSE, Stella Mary's College of Engineering, Nagercoil, India
2 Department of CSE, Jeppiaar Engineering College, Chennai, India
Abstract – Data mining techniques have recently been used as tools for retrieving data and information from large data sources such as data warehouses, repositories and the World Wide Web. Huge volumes of user data are stored in cloud repositories, and the relevant information is stored and maintained on the Internet. The efficiency of a data mining tool can be judged by the volume of relevant data or information that it successfully retrieves from the source. Moreover, classification plays a major role in identifying the right data or information and categorizing it for retrieval, storage and maintenance. For this purpose, we propose a new data mining tool that retrieves data effectively by using pre-processing and classification. We introduce a new semantic based data pre-processing technique and propose a new classification algorithm that uses fuzzy rules and the Bees Swarm Optimization based Information Retrieval algorithm. In addition, the relevant data and web pages are grouped by using the existing k-means clustering algorithm. During the retrieval process, the inter and intra coupling relationships between the data are analysed by using the existing semantic model. Here, common terms are used for identifying the intra-relationship between the data, and the partial order relation is used for identifying the inter-relationship between the data. Finally, the proposed mining tool has been evaluated by using the well-known repositories Web-docs and Wiki-links and the user feedback collected by Amazon from its users.
Keywords - Information retrieval, Data mining, Bees Swarm Optimization algorithm, Fuzzy rules, Clustering, Classification, coupling inter-relationship, coupling intra-relationship.
International Journal of Pure and Applied Mathematics, Volume 119, No. 16, 2018, 1239-1256. ISSN: 1314-3395 (on-line version). url: http://www.acadpubl.eu/hub/ Special Issue
1. INTRODUCTION
With the rapid development of the internet and its related data, data and information retrieval
tools play a crucial role in the extraction of relevant data. Current internet users
depend heavily on these tools to retrieve the information or data they need to improve
their knowledge and business, owing to the large volume of data available. Conventional
information retrieval methods work on keywords, such as positive and negative keywords,
which are useful for identifying related terms. Even so, the existing information retrieval
methods do not fully satisfy internet users because of semantic challenges such as polysemy
and synonymy. Researchers and academicians refer to these challenges as the vocabulary or
word mismatch problem (Furnas et al., 1987).
Researchers have made considerable efforts to address the word mismatch issue, for example
through query expansion methods and the lattice based information retrieval approach for the
query transmission. Query expansion generates a new query by augmenting the original query
with new attributes of the same meaning, where the attributes are additional keywords
extracted from a dictionary such as WordNet or from relevance feedback (Carpineto and
Romano, 2012). Alternatively, extra keywords from the original data sources can be used for
expanding the query. The lattice based information retrieval technique can both refine and
expand the query, and it supports navigational search by using the specificity and generality
relations of the lattice (Carpineto and Romano, 2005).
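As an illustration of dictionary based query expansion, the following minimal Java sketch adds synonyms to the terms of a query. It is not taken from the paper: the class QueryExpansion, the method expand and the in-memory synonym map are hypothetical stand-ins for a real dictionary such as WordNet, whose API is not used here.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: expands a keyword query with synonyms from a small in-memory dictionary.
final class QueryExpansion {
    // Returns the original terms followed by their synonyms; duplicates are dropped.
    static Set<String> expand(List<String> queryTerms, Map<String, List<String>> synonyms) {
        Set<String> expanded = new LinkedHashSet<>(queryTerms);
        for (String term : queryTerms) {
            // Terms without a dictionary entry contribute no extra keywords.
            expanded.addAll(synonyms.getOrDefault(term, List.of()));
        }
        return expanded;
    }
}
```

For example, expanding the query terms "car" and "price" with a dictionary that maps "car" to "automobile" and "vehicle" yields the terms car, price, automobile and vehicle, so a document that only mentions "automobile" can still be matched.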
Fuzzy logic is used to overcome uncertainty issues through the development of formal
concept analysis. The standard uncertainty issues are data vagueness and implicit information
in the queries and in the related documents from which the relevant data is retrieved. Many
fuzzy logic and lattice based techniques that use formal concept analysis have been proposed
for handling these issues (Poelmans et al., 2014; Kumar et al., 2015). Many existing methods
adopt the partial order relation of the concepts available on the web, in databases and in
repositories to compute the inter and intra relationships between the concepts, the related
web documents and the data available in repositories, and they return the related web
documents or data for a given user query. However, all of these methods neglect the semantic
data between the concepts, such as the common objects and attributes of the concepts. As a
result, the data coupling relationship between the concepts, which consists of the common
objects, the common attributes and the partial order relationship of the concepts, is
neglected. Moreover, the coupling relationship has been shown to be of significant value for
improving existing analysis and learning
tasks such as data clustering, data classification, recommendation systems, queries and outlier
term detection (Pang et al., 2016).
In this work, a new data mining tool is proposed for retrieving data effectively by using
data pre-processing and data classification. A new semantic approach is proposed for
effective data pre-processing. In addition, a new classification algorithm is proposed for
effective data classification using fuzzy rules and the existing Bees Swarm Optimization
based Information Retrieval method. Moreover, the existing k-means clustering algorithm is
used for grouping the data effectively based on the relevancy score. Relevancy is treated in
this work as the inter and intra coupling relationships between the data, which are analysed
by using the existing semantic model. The partial order relationship is used for identifying
the relationship level between the data. Finally, the proposed data mining tool has been
tested with data collected from the well-known repositories Web-docs and Wiki-links and
with user feedback.
The rest of this paper is organized as follows: Section 2 discusses the existing data mining
tools developed by researchers in this direction. Section 3 explains the overall architecture
of the proposed system. Section 4 describes in detail the proposed semantic based data
pre-processing, clustering and data classification process. Section 5 gives the conclusion and
future works in this direction.
2. LITERATURE SURVEY
Many works on semantic based data pre-processing, information retrieval, data clustering and
data classification have been carried out by various researchers (Arokia Renjit and
Shanmuganathan, 2010 and 2011). Among them, Youcef et al. (2018) explored the advances of
data mining techniques for solving the basic document retrieval problem. In their technique,
they discovered useful data by using data mining techniques and used that knowledge for
exploring the full documents efficiently. They investigated different techniques, such as data
pre-processing, clustering and Bees Swarm Optimization, for exploring the clustered and
grouped documents in depth. Their approach improved the quality of the retrieved relevant
documents reasonably in less time. Shufeng et al. (2018) introduced a new framework based on
lattices and coupling relationship analysis. Their framework employs formal concepts, which are
extracted by using fuzzy formal concept analysis, for representing the queries and the
documents. They also apply coupling relationship analysis, both intra-concept and
inter-concept, to rank the web documents.
Fabricio et al. (2015) investigated the use of a bi-clustering method for capturing local
coherence across subsets of the records and the available fields. They addressed the
dimensionality problem, reduced the redundancy of correlated features and improved the
separability and the classification accuracy.
Thiago et al. (2018) developed a new supervised classifier that applies the limiting
probabilities of random walk theory on underlying networks constructed from the input
labelled data. They also demonstrated, through examples, that their classifier combines low
and high level attributes.
Fuji (2009) explored a few main areas of advanced information retrieval. The work
concentrates on cross lingual, multimedia and semantic based information retrieval. Cross
lingual information retrieval deals with posing queries in one language and retrieving the
related web documents in various languages. Semantic based information retrieval goes beyond
the surface level of the data or the related information by using the concepts represented in
the web documents and in the user queries to improve the retrieval process.
Antonio et al. (2018) took as their starting point a new strategy based on the clustering
process. They improved the performance by addressing the major issues related to the records
located near the cluster boundaries by enlarging the size, and they also considered the use
of Deep Neural Networks for learning a suitable representation for the classification task.
They achieved reasonable classification accuracy over eight different datasets.
Mao et al. (2013) addressed problems such as finding the relevant documents, complicated
language use, language ambiguity and inaccurate results. They developed a new semantic based
content mapping technique for the information retrieval model. Their model employs standard
semantic features and an ontological structure for constructing a new content map, and it
improved the accuracy of the relevant document or data retrieval results.
Olga et al. (2017) described an effective method called PolaritySim for determining word
level contextual polarity that uses readily available consumer rated reviews as the only
external resource.
Preben et al. (2005) investigated the expressions of collaborative activities within
information searching and data retrieval processes. They presented empirical results from a
real-life information setting within the domain. Moreover, they categorised these activities
and related them to the various stages of the information searching and retrieval processes.
Finally, they introduced an information retrieval model that is improved in its collaborative
aspects.
Rabia et al. (2006) employed new algorithms for ranking documents automatically. They
merged the information retrieval results of multiple systems by using various data fusion
algorithms, treated the top-ranked documents as relevant, and used these relevant documents
for evaluating and ranking the methods. Moreover, they introduced a new approach for
selecting the information retrieval systems to be used for effective data fusion. Finally, the
authors showed that their method performs better than the existing automatic ranking
techniques.
Goran et al. (2014) presented new methods for retrieving documents and summarizing multiple
documents. Their method measures the similarity between the queries and the web documents
by combining graph kernels on event graphs. Their model achieved better clustering
performance and relevant multi-document summarization.
Antonio et al. (2010) proposed a new algorithm for refining the ontologies used in relevant
information retrieval tasks, with preliminary positive results. Andrea et al. (2012)
presented their experience in using X.MAS, a generic multi-agent architecture aimed at
retrieving relevant information, filtering data and reorganizing the information based on
user requests. Tatiana et al. (2013) described in detail the basic theories of human
development that are used to explain the specifics of young users, such as their cognitive
skills, fine motor skills, knowledge, memory and emotional states, in so far as they differ
from those of adults.
Sairamesh et al. (2015) proposed a new algorithm to infer user interests based on the user
queries and the fast profile logs and to provide relevant information based on user
personalization. Moreover, they introduced a new classifier for classifying
the data and applied a new ranking algorithm for categorizing the relevant data.
Kulunchakov et al. (2017) proposed a novel approach for constructing new ranking algorithms
for effective retrieval of relevant information or data. Mehrbakhsh et al. (2018) proposed a
new recommendation system based on ontology and dimensionality reduction techniques for
alleviating the sparsity and scalability problems.
Obada et al. (2017) proposed a novel method that uses fuzzy logic for developing tasks,
user profiles and documents to model the user's relevant information searching behaviour.
The relevance feedback is calculated in their work by using a linear regression model that
predicts the web document relevancy from implicit relevance indicators. Moreover, fuzzy rule
based summarisation was used for integrating the profiles. The overall performance of their
method was evaluated by using the precision and recall metrics, which showed significant
improvements in relevant information retrieval based on the user queries.
3. SYSTEM ARCHITECTURE
The overall architecture of the system proposed in this work for analysing web data and
documents is shown in Figure 1. It consists of six modules: web documents/feedback data, a
user interface module, an intelligent data mining tool, a rule manager, a rule base and
results.
Figure 1. System Architecture (blocks: Web Documents/Feedback data; User Interface Module; Intelligent Data Mining Tool with Data Pre-processing, Document Clustering and Data Classification; Rule Manager; Rule Base; Result)
The web documents and feedback data module comprises a large volume of web documents and
feedback data available on the Amazon website and in cloud repositories. This collection of
web documents and web data, such as feedback data, is taken as the input dataset in this
work. The user interface module collects the necessary web documents and web data from
business websites such as Amazon. The Intelligent Data Mining tool consists of three sub
modules: data pre-processing, clustering and data classification. The data pre-processing sub
module takes care of removing noisy data, null data and meaningless data. The clustering sub
module is responsible for grouping the relevant data or web documents by using the existing
k-means clustering algorithm. The classification sub module is responsible for categorizing
the data or documents effectively by using intelligent fuzzy rules. The rule manager manages
the fuzzy rules and interacts with the data mining tool and the rule base; it stores the fuzzy
rules in the knowledge base and retrieves them from it. The rule base stores all kinds of
fuzzy rules that are useful for categorizing the feedback data and for classifying the web
documents. The proposed model refers to the rule base built around user queries. The result
module holds the documents or feedback data returned for the user query.
4. PROPOSED WORK
In this section, we discuss in detail the proposed model, which combines data
pre-processing, clustering and classification. A new semantic based data pre-processing
technique is proposed for identifying the original and useful data for the analysis.
Moreover, an existing clustering algorithm is used for quickly grouping the relevant data or
web documents for further analysis. In addition, a new classifier is proposed for effective
data or document classification. This section is organized into subsections on semantic based
data pre-processing, K-Means clustering, Fuzzy Rule and BSO based Classification, and the
resulting intelligent data mining tool.
4.1 Semantic based Data Pre-processing
The main aim of data pre-processing is to enhance the capability of the existing data mining
tools that extract the relevant data or documents, which are used later in this work by the
proposed fuzzy rule and BSO based classifier. It removes unnecessary data, such as null
values, from the input dataset. Moreover, it checks the semantic data or content available in
the dataset. The proposed data pre-processing phase
is responsible for tokenizing the input content, checking the grammar and checking whether
the content is semantically correct.
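The null removal and tokenization parts of this phase can be sketched in Java as follows. This is only a minimal illustration under stated assumptions: the class PreprocessSketch and its methods are hypothetical names, treating the literal string "null" as a null value is an assumption, and the grammar and semantic checks of the proposed technique are not covered.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the cleaning and tokenizing steps only.
final class PreprocessSketch {
    // Drops null records, blank records and records that hold the literal text "null".
    static List<String> clean(List<String> records) {
        List<String> kept = new ArrayList<>();
        for (String record : records) {
            if (record == null) {
                continue;
            }
            String trimmed = record.trim();
            if (trimmed.isEmpty() || trimmed.equalsIgnoreCase("null")) {
                continue;
            }
            kept.add(trimmed);
        }
        return kept;
    }

    // Splits one record into lower-case tokens made of letters and digits.
    static List<String> tokenize(String record) {
        List<String> tokens = new ArrayList<>();
        for (String token : record.toLowerCase().split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

A feedback record such as "Good product, fast!" is kept by clean and tokenized into good, product and fast, while null, blank and "NULL" records are discarded before any later phase sees them.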
4.2 K-Means Clustering
The K-means algorithm is used in this work for grouping 'n' points into 'k' subsets Sub_j,
where each subset (cluster) contains ns_j data points. First, the data points are assigned
randomly to the k clusters and the centroid of each cluster is calculated. Then, each data
point is assigned to the cluster whose centroid is closest to it, and the centroids are
recalculated. These steps are repeated until no data point is reassigned to a different
cluster. In this work, the K-means clustering algorithm is adapted in two steps. In the first
step, a weightage is assigned to every document from the weightages of its individual words.
In the second step, the term frequency and the relevancy score are calculated from the word
weightages and the number of occurrences of each word in a document.
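The two adaptation steps can be illustrated with a small Java sketch. The paper does not state the scoring formula, so the rule used here (the sum of word weightage times term frequency) is an assumption, and the class RelevancyScore and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: term frequency and a weighted relevancy score for one document.
final class RelevancyScore {
    // Counts how often each lower-cased word occurs in the document.
    static Map<String, Integer> termFrequency(String document) {
        Map<String, Integer> frequency = new HashMap<>();
        for (String token : document.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                frequency.merge(token, 1, Integer::sum);
            }
        }
        return frequency;
    }

    // Assumed scoring rule: sum of (word weightage x term frequency); unweighted words count as 0.
    static double score(String document, Map<String, Double> weights) {
        double total = 0.0;
        for (Map.Entry<String, Integer> entry : termFrequency(document).entrySet()) {
            total += weights.getOrDefault(entry.getKey(), 0.0) * entry.getValue();
        }
        return total;
    }
}
```

With the weightages good = 2.0 and bad = 1.0, the document "Good good bad item" scores 2.0 x 2 + 1.0 x 1 = 5.0, because "item" carries no weightage.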
4.3 Fuzzy Rule and BSO based Classification
In this subsection, a new fuzzy rule and BSO based classifier is explained. It is
incorporated into the proposed intelligent data mining tool, which is developed for effective
retrieval of relevant information from repositories. The proposed classifier combines the
existing BSO based classification algorithm developed by [1] with the rules needed for making
effective decisions during the retrieval process on web data.
4.4 Intelligent Data Mining Tool for Information Retrieval
In this work, a new intelligent data mining tool has been designed for retrieving relevant
data from large databases and cloud repositories. It uses semantic based data pre-processing,
clustering and classification techniques for effective data retrieval, and it has three
phases, one for each of these activities.
Input: Web Documents
Output: Relevant Documents
Phase 1: Pre-processing
Step 1: Read the documents one by one from the list DL = (d1, d2, d3, …, dn)
Step 2: Read the first line from the document Di and check its tokens against the
standard metadata.
Step 3: Apply the parser to the line (sentence) 'l'.
Step 4: Call the syntax analyser for grammar checking.
Step 5: If the line does not have any grammatical error then
    Apply LSA (DLi, S, l)
Else
    Correct the grammatical errors and Go to Step 5.
Step 6: Create a semantic network for the line by calling the procedure semantic_network()
Step 7: Compare, node by node for the whole sentence, the semantic structure developed for
the line with the semantic metadata of the line.
Step 8: If the line matches the metadata semantically then
Step 9: If the end of the data is not reached then
    Display the semantic analysis results
Else
    Go to Step 13.
Step 10: Apply the procedure Pragmatic_Analysis()
Step 11: If the anaphora is resolved then Go to Step 9
Step 12: Else check the file status
Step 13: If EOF then Stop
Step 14: Else Go to Step 1.
Phase 2: Clustering
Step 1: Set the number of clusters 'k'
Step 2: Select the 'k' initial center points for the 'k' groups.
Step 3: Assign weightages to all the words in the document, as a word representation, by
using the expert guidelines stored in the database.
Step 4: Read the first document from the set of documents
Step 5: Find the cosine similarity of the words that belong to the first group and store it in m.
Step 6: Check the cosine similarity of the words of each document
Step 7: If the cosine similarity of the words is less than m then consider it as the minimum
cosine value of the whole data.
Step 8: If none of the document words changes the average score of a group then
    Stop the process and exit
Else
    Find the new center point of each group in the cluster.
Step 9: Return the clustered document set.
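The clustering phase can be sketched in Java as a k-means loop that compares documents with cosine similarity. This is a minimal sketch and not the authors' implementation: the class CosineKMeans is a hypothetical name, documents are assumed to be already converted into equal-length weighted term vectors, the initial centers are assumed to be supplied by the caller, and the stopping rule is the standard one (stop when no document changes its group).

```java
import java.util.Arrays;

// Hypothetical sketch of k-means over document vectors using cosine similarity.
final class CosineKMeans {
    // Cosine similarity of two equal-length vectors; 0 when either vector is all zeros.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    // Returns the cluster index of every document vector.
    static int[] cluster(double[][] docs, double[][] initialCenters, int maxIterations) {
        int k = initialCenters.length;
        int dim = docs[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = initialCenters[c].clone();
        }
        int[] assignment = new int[docs.length];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            // Assign each document to the center it is most similar to.
            for (int d = 0; d < docs.length; d++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (cosine(docs[d], centers[c]) > cosine(docs[d], centers[best])) {
                        best = c;
                    }
                }
                if (assignment[d] != best) {
                    assignment[d] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // No document moved, so the grouping is stable.
            }
            // Recompute each center as the mean of its members; empty groups keep their center.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int d = 0; d < docs.length; d++) {
                counts[assignment[d]]++;
                for (int i = 0; i < dim; i++) {
                    sums[assignment[d]][i] += docs[d][i];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    for (int i = 0; i < dim; i++) {
                        centers[c][i] = sums[c][i] / counts[c];
                    }
                }
            }
        }
        return assignment;
    }
}
```

Cosine similarity is used here because the steps above compare documents by it; a distance based k-means would instead pick the center with the smallest Euclidean distance.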
Phase 3: Classification
Step 1: Accept the request from the user's query
Step 2: Apply the existing classifier, the Bees Swarm Optimization based Information
Retrieval algorithm, to the clustered documents in the document list.
Step 3: Provide the relevant content or data to the user and apply the fuzzy rules.
Step 4: Map the semantic fuzzy rules to the nodes of the newly constructed semantic tree.
Step 5: If the nodes and the rules match then
    Produce all the relevant contents.
End if
Step 6: Call the procedure for retrieving the exact contents.
The proposed data mining tool performs three actions: semantic based pre-processing,
k-means clustering for grouping the data and documents, and fuzzy rule based BSO-IR for
effectively retrieving relevant data from the databases.
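To make the role of the fuzzy rules concrete, the Java sketch below labels a document from a relevancy score in [0, 1]. It is an illustration only: the paper does not list its rules or membership functions, so the three fuzzy sets, their shapes, the labels and the class FuzzyRelevance are all assumptions, and the BSO-IR step that would produce the score is not implemented.

```java
// Hypothetical sketch: three assumed fuzzy sets over a relevancy score in [0, 1].
final class FuzzyRelevance {
    // Clamps a membership value into [0, 1].
    static double clamp(double value) {
        return Math.max(0.0, Math.min(1.0, value));
    }

    // Membership in "low": 1 at score 0, falling linearly to 0 at score 0.5.
    static double low(double score) {
        return clamp((0.5 - score) / 0.5);
    }

    // Membership in "medium": triangular, peaking at score 0.5.
    static double medium(double score) {
        return clamp(1.0 - Math.abs(score - 0.5) / 0.5);
    }

    // Membership in "high": 0 at score 0.5, rising linearly to 1 at score 1.
    static double high(double score) {
        return clamp((score - 0.5) / 0.5);
    }

    // Assumed rule: the label of the fuzzy set with the strongest membership wins.
    static String classify(double score) {
        double l = low(score), m = medium(score), h = high(score);
        if (h >= m && h >= l) {
            return "relevant";
        }
        if (m >= l) {
            return "partially relevant";
        }
        return "not relevant";
    }
}
```

Under these assumed sets, a score of 0.9 is labelled relevant, 0.5 partially relevant and 0.1 not relevant; a rule base as described in Section 3 would replace the single rule in classify.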
5. RESULTS AND DISCUSSION
This section describes in detail the test bed used to evaluate the proposed data mining tool
for retrieving relevant data or documents from the web or repositories. Well-known
performance metrics are used to measure the performance of the tool. The experiments were
conducted using web documents that contain product reviews as feedback about a product or
company, and a CSV file that contains user feedback about Amazon products. The data mining
tool was implemented in Java. The prediction accuracy over the documents or user data was
calculated by using the precision, recall and F-measure metrics, which are defined below: