Contents • Introduction • Problem Definition • Literature Review • Objective • Methodology & Algorithmic Strategy • Proposed System • System Architecture • Modules • Technology • Application • References
Introduction
• Data Mining Domain.
• What is Text categorization?
• What is Text Reduction?
• Advantages of Text categorization
• Automated Text Categorization
Introduction (Contd…)
Fig: Working of Text Classifier (an unknown input and the dataset are fed to the text classifier, which assigns one of the output classes 1 to N)
Problem Definition
• Text categorization is the process of classifying unknown text into predefined classes.
• Until now, classification has largely been done manually, and existing automated classification does not give good efficiency.
• To improve the efficiency of automated text categorization, we propose a modified Naïve Bayes algorithm that overcomes the disadvantages of the existing system.
Objectives
• Preprocess the 20 Newsgroups dataset.
• Perform text reduction.
• Generate frequencies based on the modified algorithm.
• Classify unknown input using the proposed algorithm.
• Compare the existing and proposed systems.
Literature Review
1. Title: Toward Optimal Feature Selection in Naive Bayes for Text Categorization
Authors: Bo Tang (Student Member, IEEE), Steven Kay (Fellow, IEEE), and Haibo He (Senior Member, IEEE)
Published in: IEEE, Feb 2016
Description: Automated feature selection is important for text categorization to reduce the feature size and speed up the learning process of classifiers. The paper presents a novel and efficient feature selection framework based on information theory, which aims to rank features by their discriminative capacity for classification.
Disadvantages: Accuracy based on word count is very low. It could be improved by using another classification algorithm such as Naïve Bayes.
2. Title: Comparative Study Of Classification Algorithm For Text Based Categorization
Authors: Omkar Ardhapure, Gayatri Patil, Disha Udani, Kamlesh Jetha
Published in: IJRET, Feb 2016
Description: Text categorization is a data mining process that assigns predefined categories to free-text documents using machine learning techniques. Any document in the form of text, image, music, etc. can be classified using categorization techniques.
Disadvantages: Accuracy is not compared on a large amount of data. In data mining, more data gives more reliable results.
3. Title: Text Classification by Combining Text Classifiers to Improve the Efficiency of Classification
Authors: Aaditya Jain, Jyoti Mandowara
Published in: IJCA, April 2016
Description: The basic working of a web crawler is presented in this paper.
Disadvantages: Nothing is said about how pages can be ranked using algorithms. Only the working of text classification is given.
4. Title: Arabic Text Categorization using k-nearest neighbour, Decision Trees (C4.5) and Rocchio Classifier: A Comparative Study
Authors: Adel Hamdan Mohammad, Omar Al-Momani and Tariq Alwada’n
Published in: IJCET, April 2016
Description: The authors note that much research on text classification addresses the English language, while few researchers address text classification on Arabic datasets. This research applies three well-known classification algorithms: k-nearest neighbour (k-NN), C4.5 and Rocchio.
Disadvantages: Accuracy is not compared on a large amount of data. In data mining, more data gives more reliable results.
5. Title: Text Categorization on Multiple Languages Based On Classification Technique
Authors: Kapila Rani, Satvika
Published in: IJCSIT, March 2016
Description: The authors note that the availability of a constantly increasing amount of textual data in various Indian regional languages in electronic form has accelerated. Hence, classifying text documents based on language is essential.
Disadvantages: Not very informative regarding search engine optimization.
Methodology & Algorithmic Strategy
Methodology: Pre-processing Dataset (Apply Reduction)
The first step is to reduce the text of the 20 Newsgroups dataset using a text reduction technique.
Fig: Pre-processing flow (Dataset, Preprocessor Method, Reduction Procedure, Reduced Dataset)
Methodology: Generating Frequencies
We generate the following frequencies for the dataset.
• Word count = number of times a word occurs in a file.
• Term Frequency (TF) = occurrences of the word in a file / total number of words in that file
• Inverse Document Frequency (IDF) = total number of documents / number of documents in which the term occurs
• Normalized Term Frequency (NTF) = ∑ TF / number of occurrences
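The frequency definitions above can be sketched in Java as follows. This is a minimal illustration, not the project's actual code: the class and method names are hypothetical, IDF is kept as the plain ratio given on the slide (no logarithm), and NTF is computed under one assumed reading of the slide's formula, namely the sum of a word's TF over all files divided by the number of files in which the word occurs.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; names are illustrative only.
final class FrequencySketch {

    /** Word count: number of times each word occurs in one file. */
    static Map<String, Integer> wordCount(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    /** TF = occurrences of the word in the file / total words in the file. */
    static double termFrequency(String word, List<String> words) {
        if (words.isEmpty()) {
            return 0.0;
        }
        int occurrences = wordCount(words).getOrDefault(word, 0);
        return (double) occurrences / words.size();
    }

    /** Number of files that contain the word at least once. */
    static int documentFrequency(String word, List<List<String>> docs) {
        int n = 0;
        for (List<String> d : docs) {
            if (d.contains(word)) {
                n++;
            }
        }
        return n;
    }

    /** IDF as the plain ratio from the slide: total files / files containing the word. */
    static double inverseDocumentFrequency(String word, List<List<String>> docs) {
        int df = documentFrequency(word, docs);
        return df == 0 ? 0.0 : (double) docs.size() / df;
    }

    /** NTF (assumed reading): sum of TF over all files / files containing the word. */
    static double normalizedTermFrequency(String word, List<List<String>> docs) {
        int df = documentFrequency(word, docs);
        if (df == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (List<String> d : docs) {
            sum += termFrequency(word, d);
        }
        return sum / df;
    }
}
```

Each file is passed as a list of already preprocessed words. Words that occur in no file get an IDF and NTF of 0 here, which is a choice made for this sketch to avoid division by zero.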
Methodology: Classification
Fig: An unknown input and the dataset are fed to the text classifier, which assigns one of the output classes 1 to N
Proposed Work
• The system will provide automatic text categorization using the Modified Naïve Bayes algorithm.
• Text reduction and feature selection are done during dataset preprocessing.
• Dataset for simulation: 20 Newsgroups
• A comparison of the existing and proposed systems will be provided.
Proposed Architecture
Fig: Unknown news and the news dataset are fed to the Modified NB algorithm, which assigns one of the output classes 1 to N
Modules
1. Dataset preprocessing
2. Text Categorization
3. Modified NB Implementation
4. Comparative Study
Module - Dataset Preprocessing
1. Read each file one by one.
2. Perform preprocessing:
a. Remove stop words.
b. Remove special symbols.
c. Remove unwanted spaces.
3. Calculate word count, Term Frequency, Normalized Term Frequency and Inverse Document Frequency.
4. Insert the data in the DB as word, docid, classid, wordcount, TF, IDF, NTF.
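The preprocessing in step 2 could look roughly like the following Java sketch. The class name, the short stop-word list, the lower-casing of the text and the restriction to ASCII letters and digits are all assumptions made for illustration; the slides do not specify them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class; names are illustrative only.
final class PreprocessSketch {

    // Hypothetical stop-word list; a real system would load a fuller one.
    static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("a", "an", "the", "is", "of", "and", "to", "in"));

    /** Returns the words of the text after the three cleaning steps. */
    static List<String> preprocess(String text) {
        // Lower-casing is an assumption, not a step named on the slides.
        String cleaned = text.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9\\s]", " ") // b. remove special symbols (ASCII only)
                .trim()
                .replaceAll("\\s+", " ");        // c. remove unwanted spaces

        List<String> tokens = new ArrayList<>();
        if (cleaned.isEmpty()) {
            return tokens;
        }
        for (String token : cleaned.split(" ")) {
            if (!STOP_WORDS.contains(token)) {   // a. remove stop words
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

The resulting word list is what the frequency calculations in step 3 would consume. Stop words are filtered last here so that symbols attached to a stop word (for example "the,") do not prevent it from being recognized.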
Module – Text Categorization
1. Perform text categorization on an unknown file.
a. Perform preprocessing on the input file: remove stop words, special symbols and unwanted spaces.
2. Generate the decision matrix, with one row per news class and one column per word of the input file.
              Word 1   Word 2   Word N
News class 1
News class 2
News class N
Module – Text Categorization (Contd…)
1. For the k-NN algorithm, use the word count as the cell value.
              Word 1                      Word 2                      Word N
News class 1  Wordcount word1 – class 1   Wordcount word2 – class 1   Wordcount wordN – class 1
News class 2
News class N
Module – Text Categorization (Contd…)
1. For the Naïve Bayes algorithm, use TF * IDF as the cell value.
              Word 1                   Word 2                   Word N
News class 1  TF*IDF word1 – class 1   TF*IDF word2 – class 1   TF*IDF wordN – class 1
News class 2
News class N
Module – Text Categorization (Contd…)
1. For the Modified Naïve Bayes algorithm, use NTF * IDF as the cell value.
              Word 1                    Word 2                    Word N
News class 1  NTF*IDF word1 – class 1   NTF*IDF word2 – class 1   NTF*IDF wordN – class 1
News class 2
News class N
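Filling the decision matrix can be sketched in Java as below. This is a minimal sketch under stated assumptions: the class name and the in-memory `Map` per class are hypothetical stand-ins for the values stored in the DB, and a word that was never seen in a class is given the value 0. The same code serves all three algorithms, because only the stored weight differs (word count, TF * IDF, or NTF * IDF).

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper class; names are illustrative only.
final class DecisionMatrixSketch {

    /**
     * Builds a matrix with one row per class and one column per input word.
     * weightsPerClass.get(c) maps a word to its stored weight for class c
     * (word count, TF*IDF or NTF*IDF, depending on the algorithm).
     */
    static double[][] build(List<String> inputWords,
                            List<Map<String, Double>> weightsPerClass) {
        double[][] matrix = new double[weightsPerClass.size()][inputWords.size()];
        for (int c = 0; c < weightsPerClass.size(); c++) {
            Map<String, Double> weights = weightsPerClass.get(c);
            for (int w = 0; w < inputWords.size(); w++) {
                // Words never seen in this class contribute 0 (an assumption).
                matrix[c][w] = weights.getOrDefault(inputWords.get(w), 0.0);
            }
        }
        return matrix;
    }
}
```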
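Module – Text Categorization (Contd…)
2. Perform row-wise addition: sum each class row of the decision matrix to get one frequency per class (Class 1 to Class N). The summed values are word counts for k-NN, TF * IDF for Naïve Bayes, and NTF * IDF for Modified Naïve Bayes.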
Module – Text Categorization (Contd…)
3. Take the maximum of the array of class frequencies (Class 1 to Class N) and select its index as the predicted class. The array holds summed word counts for k-NN, TF * IDF for Naïve Bayes, and NTF * IDF for Modified Naïve Bayes.
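The row-wise addition and the maximum selection can be sketched together in Java. The class and method names are hypothetical, and resolving ties to the lowest class index is an assumption, since the slides do not say how ties are handled.

```java
// Hypothetical helper class; names are illustrative only.
final class PredictSketch {

    /** Row-wise addition: one summed frequency per class. */
    static double[] rowSums(double[][] matrix) {
        double[] sums = new double[matrix.length];
        for (int c = 0; c < matrix.length; c++) {
            for (double value : matrix[c]) {
                sums[c] += value;
            }
        }
        return sums;
    }

    /** Index of the largest value; ties go to the lowest index (an assumption). */
    static int argMax(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("at least one class is required");
        }
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) {
                best = i;
            }
        }
        return best;
    }

    /** Predicted class = index of the row with the largest sum. */
    static int predict(double[][] decisionMatrix) {
        return argMax(rowSums(decisionMatrix));
    }
}
```

For a decision matrix whose rows sum to 4, 2 and 7, `predict` returns index 2, that is, the third class.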
Technology
1. Front end: Java 8
2. Back end: MySQL
Applications
1. Antivirus System
2. Disease Detection Systems.
Requirements
Software Requirements
1. Eclipse or NetBeans for Java development
2. Apache Tomcat 8
Expected Hardware Requirements
1. Intel Core i3 processor
2. 4GB RAM
3. 500 GB HDD
References
1. Bo Tang, Student Member, IEEE, Steven Kay, Fellow, IEEE, and Haibo He, Senior Member, IEEE, “Toward Optimal Feature Selection in Naive Bayes for Text Categorization”, Dependable and Secure Computing, submitted to IEEE Transactions on Knowledge and Data Engineering, 09 February 2016.
2. Omkar Ardhapure, Gayatri Patil, Disha Udani, Kamlesh Jetha, “Comparative Study Of Classification Algorithm For Text Based Categorization”, in IJRET: International Journal of Research in Engineering and Technology, Volume: 05 Issue: 02 | Feb-2016.
3. Aaditya Jain, Jyoti Mandowara, “Text Classification by Combining Text Classifiers to Improve the Efficiency of Classification”, International Journal of Computer Application (2250-1797) Volume 6– No.2, March- April 2016.
4. Adel Hamdan Mohammad, Omar Al-Momani and Tariq Alwada’n, “Arabic Text Categorization using k-nearest neighbour, Decision Trees (C4.5) and Rocchio Classifier: A Comparative Study” in International Journal of Current Engineering and Technology, E-ISSN 2277 – 4106, P-ISSN 2347 – 5161 , Vol.6, No.2 (April 2016).
5. Kapila Rani, Satvika, “Text Categorization on Multiple Languages Based On Classification Technique”, (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 7, March 2016, 1578-1581.
6. T. Joachims, “Text categorization with support vector machines: Learning with many relevant features,” in ECML, 1998.
7. W. Lam, M. Ruiz, and P. Srinivasan, “Automatic text categorization and its application to text retrieval,” IEEE Transactions on Knowledge and Data Engineering, vol. 11, no. 6, pp. 865–879, 1999.
8. F. Sebastiani, “Machine learning in automated text categorization,” ACM computing surveys (CSUR), vol. 34, no. 1, pp. 1–47, 2002.
9. H. Al-Mubaid, S. Umair et al., “A new text categorization technique using distributional clustering and learning logic,” IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 9, pp. 1156–1165, 2006.
10. Y. Aphinyanaphongs, L. D. Fu, Z. Li, E. R. Peskin, E. Efstathiadis, C. F. Aliferis, and A. Statnikov, “A comprehensive empirical comparison of modern supervised classification and feature selection methods for text categorization,” Journal of the Association for Information Science and Technology, vol. 65, no. 10, pp. 1964–1987, 2014.
Thank You