Text categorization

Contents

• Introduction

• Problem Definition

• Literature Review

• Objective

• Methodology & Algorithmic Strategy

• Proposed System

• System Architecture

• Modules

• Technology

• Application

• References

Introduction

• Data Mining Domain.

• What is Text categorization?

• What is Text Reduction?

• Advantages of Text categorization

• Automated Text Categorization

Fig: Working of Text Classifier. An unknown input and the dataset feed the text classifier, which assigns the input to one of the output classes (Class 1, Class 2, …, Class N).

Problem Definition

• Text categorization is the process of classifying unknown text into predefined classes.

• Until now, classification has largely been done manually, and existing automated classification has not been efficient enough.

• To improve the efficiency of automated text categorization, we propose a modified Naïve Bayes algorithm that overcomes the disadvantages of the existing system.

Objectives

• Preprocess the 20 Newsgroups dataset.

• Perform text reduction.

• Generate frequencies based on the modified algorithm.

• Classify unknown input using the proposed algorithm.

• Compare the existing and proposed systems.

Literature Review

1. Toward Optimal Feature Selection in Naive Bayes for Text Categorization

Authors: Bo Tang (Student Member, IEEE), Steven Kay (Fellow, IEEE), and Haibo He (Senior Member, IEEE)

Published: IEEE, Feb 2016

Description: Automated feature selection is important for text categorization because it reduces the feature size and speeds up the learning process of classifiers. The paper presents a novel and efficient feature selection framework based on information theory, which aims to rank features by their discriminative capacity for classification.

Disadvantages: Accuracy based on word count is very low. It could be improved by using other classification algorithms such as Naïve Bayes.

2. Comparative Study Of Classification Algorithm For Text Based Categorization

Authors: Omkar Ardhapure, Gayatri Patil, Disha Udani, Kamlesh Jetha

Published: IJRET, Feb 2016

Description: Text categorization is a data mining process that assigns predefined categories to free-text documents using machine learning techniques. Any document in the form of text, image, music, etc. can be classified using categorization techniques.

Disadvantages: Accuracy is not evaluated on a large amount of data. In data mining, more data yields more reliable results.

3. Text Classification by Combining Text Classifiers to Improve the Efficiency of Classification

Authors: Aaditya Jain, Jyoti Mandowara

Published: IJCA, April 2016

Description: The paper presents the basic working of a web crawler.

Disadvantages: Nothing is said about how pages can be ranked by an algorithm. Only the working of text classification is described.

4. Arabic Text Categorization using k-nearest neighbour, Decision Trees (C4.5) and Rocchio Classifier: A Comparative Study

Authors: Adel Hamdan Mohammad, Omar Al-Momani and Tariq Alwada’n

Published: IJCET, April 2016

Description: The authors note that much research addresses text classification for English, while few researchers discuss text classification on Arabic datasets. This research applies three well-known classification algorithms: k-nearest neighbour (k-NN), C4.5 and Rocchio.

Disadvantages: Accuracy is not evaluated on a large amount of data. In data mining, more data yields more reliable results.

5. Text Categorization on Multiple Languages Based On Classification Technique

Authors: Kapila Rani, Satvika

Published: IJCSIT, March 2016

Description: The authors note that the amount of textual data available in electronic form in various Indian regional languages is constantly increasing. Hence, classifying text documents by language is essential.

Disadvantages: Not very informative regarding search engine optimization.

Methodology & Algorithmic Strategy

Methodology: Pre-processing Dataset (Apply Reduction)

The first step is to reduce the 20 Newsgroups dataset using a text reduction technique.

Fig: Dataset → Preprocessor Method → Reduction Procedure → Reduced Dataset

Methodology: Generating Frequencies

The following frequencies are generated from the dataset:

• Word count = number of times a word occurs in a file.

• Term Frequency (TF) = occurrences of the word / total number of words, computed for each file.

• Inverse Document Frequency (IDF) = total number of documents / number of documents in which the term occurs.

• Normalized Term Frequency (NTF) = ∑ TF / number of occurrences.
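The first three frequencies can be sketched in Java as follows. This is only an illustration under stated assumptions: the class name `FrequencySketch`, its method names and the two toy documents are hypothetical, each document is assumed to be already tokenized into a list of words, and IDF is taken as the plain ratio given above (no logarithm). NTF is left out because its exact normalization is not spelled out here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of the proposed system's code.
class FrequencySketch {

    // Word count: number of times each word occurs in one file.
    static Map<String, Integer> wordCount(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    // TF = occurrences of the word in the file / total words in the file.
    static double termFrequency(String word, List<String> words) {
        if (words.isEmpty()) {
            return 0.0;
        }
        int occurrences = wordCount(words).getOrDefault(word, 0);
        return (double) occurrences / words.size();
    }

    // IDF as a plain ratio: total documents / documents containing the word.
    // Returning 0.0 for a word that occurs nowhere is an assumption of this sketch.
    static double inverseDocumentFrequency(String word, List<List<String>> docs) {
        int containing = 0;
        for (List<String> doc : docs) {
            if (doc.contains(word)) {
                containing++;
            }
        }
        return containing == 0 ? 0.0 : (double) docs.size() / containing;
    }

    public static void main(String[] args) {
        // Two toy documents, invented for illustration only.
        List<String> doc1 = Arrays.asList("space", "shuttle", "launch", "space");
        List<String> doc2 = Arrays.asList("hockey", "game", "tonight");
        List<List<String>> docs = Arrays.asList(doc1, doc2);

        System.out.println(wordCount(doc1).get("space"));            // 2
        System.out.println(termFrequency("space", doc1));             // 0.5
        System.out.println(inverseDocumentFrequency("space", docs));  // 2.0
    }
}
```

In the toy data, "space" occurs twice among four words in the first document and in one of two documents overall, which gives the values shown in the comments.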

Methodology: Classification

Fig: An unknown input and the dataset feed the text classifier, which assigns the input to one of the output classes (Class 1, Class 2, …, Class N).

Proposed Work

• The system will provide automatic text categorization using the Modified Naïve Bayes algorithm.

• Text reduction and feature selection are performed as dataset preprocessing.

• Dataset for simulation: 20 Newsgroups.

• A comparison of the existing and proposed systems will be provided.

Proposed Architecture

Fig: Unknown news and the news dataset feed the Modified NB algorithm, which assigns the news item to one of the output classes (Class 1, Class 2, …, Class N).

Modules

1. Dataset preprocessing

2. Text Categorization

3. Modified NB Implementation

4. Comparative Study

Module - Dataset Preprocessing

1. Read the files one by one.

2. Perform preprocessing:

a. Remove stop words.

b. Remove special symbols.

c. Remove unwanted spaces.

3. Calculate the word count, Term Frequency (TF), Normalized Term Frequency (NTF) and Inverse Document Frequency (IDF).

4. Insert the data into the database as word, docid, classid, wordcount, TF, IDF, NTF.
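The preprocessing in step 2 might look like the sketch below. The class name `PreprocessSketch`, the `preprocess` method and the tiny stop-word list are all hypothetical. Lowercasing the text and treating every character other than a-z, 0-9 and whitespace as a special symbol are assumptions of this sketch, not something the module description specifies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class, not part of the proposed system's code.
class PreprocessSketch {

    // Tiny illustrative stop-word list (an assumption; a real list would be much longer).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "and", "to", "in"));

    static List<String> preprocess(String text) {
        // Lowercase (an assumption), then replace special symbols with a space.
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9\\s]", " ");

        List<String> tokens = new ArrayList<>();
        // Splitting on runs of whitespace also discards unwanted spaces.
        for (String token : cleaned.trim().split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [launch, shuttle, delayed]
        System.out.println(preprocess("The  launch of the shuttle, is delayed!!"));
    }
}
```

The returned token list is the form on which the word count, TF and IDF of step 3 would then be computed.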

Module – Text Categorization

1. Perform text categorization on an unknown file.

a. Preprocess the input file: remove stop words, special symbols and unwanted spaces.

2. Generate the decision matrix, with one row per news class (News class 1, News class 2, …, News class N) and one column per word (Word 1, Word 2, …, Word N). The value stored in each cell depends on the algorithm:

a. For the k-NN algorithm, use the word count of the word in the class.

b. For the Naïve Bayes algorithm, use TF * IDF of the word in the class.

c. For the Modified Naïve Bayes algorithm, use NTF * IDF of the word in the class.

3. Perform row-wise addition to obtain one summed frequency per class (Class 1, …, Class N): word count for k-NN, TF * IDF for Naïve Bayes, and NTF * IDF for Modified Naïve Bayes.

4. Take the maximum of the resulting array and select its index as the predicted class.
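The row-wise addition and maximum selection can be sketched as below. The class `DecisionMatrixSketch`, its method names and the numbers in the example matrix are hypothetical. The cells are assumed to be already filled with whichever score the chosen algorithm uses (word count, TF * IDF or NTF * IDF), and breaking ties in favour of the lowest class index is an assumption of this sketch.

```java
// Hypothetical helper class, not part of the proposed system's code.
class DecisionMatrixSketch {

    // Row-wise addition: one summed score per class (row).
    static double[] rowSums(double[][] matrix) {
        double[] sums = new double[matrix.length];
        for (int c = 0; c < matrix.length; c++) {
            for (double cell : matrix[c]) {
                sums[c] += cell;
            }
        }
        return sums;
    }

    // Index of the largest value; ties go to the lowest index (an assumption).
    static int argMax(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no classes to choose from");
        }
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) {
                best = i;
            }
        }
        return best;
    }

    // Predicted class = index of the row with the largest sum.
    static int predictClass(double[][] matrix) {
        return argMax(rowSums(matrix));
    }

    public static void main(String[] args) {
        // Rows are classes, columns are words of the unknown file; values are invented.
        double[][] matrix = {
            {0.10, 0.00, 0.05},
            {0.30, 0.20, 0.10},
            {0.00, 0.15, 0.05}
        };
        System.out.println(predictClass(matrix)); // 1 (zero-based index of the second class)
    }
}
```

Because only the cell values differ between k-NN, Naïve Bayes and Modified Naïve Bayes in this scheme, the same two functions would serve all three when comparing them.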

Technology

1. Front End: Java 8

2. Back End: MySQL

Applications

1. Antivirus systems

2. Disease detection systems

Requirements

Software Requirements

1. Eclipse or NetBeans for Java development

2. Apache Tomcat 8

Expected Hardware Requirements

1. Intel Core i3 processor

2. 4GB RAM

3. 500 GB HDD

References

1. Bo Tang, Steven Kay, and Haibo He, “Toward Optimal Feature Selection in Naive Bayes for Text Categorization,” submitted to IEEE Transactions on Knowledge and Data Engineering, 9 February 2016.

2. Omkar Ardhapure, Gayatri Patil, Disha Udani, and Kamlesh Jetha, “Comparative Study Of Classification Algorithm For Text Based Categorization,” IJRET: International Journal of Research in Engineering and Technology, vol. 05, no. 02, Feb. 2016.

3. Aaditya Jain and Jyoti Mandowara, “Text Classification by Combining Text Classifiers to Improve the Efficiency of Classification,” International Journal of Computer Application (2250-1797), vol. 6, no. 2, March-April 2016.

4. Adel Hamdan Mohammad, Omar Al-Momani, and Tariq Alwada’n, “Arabic Text Categorization using k-nearest neighbour, Decision Trees (C4.5) and Rocchio Classifier: A Comparative Study,” International Journal of Current Engineering and Technology, E-ISSN 2277-4106, P-ISSN 2347-5161, vol. 6, no. 2, April 2016.

5. Kapila Rani and Satvika, “Text Categorization on Multiple Languages Based On Classification Technique,” (IJCSIT) International Journal of Computer Science and Information Technologies, vol. 7, pp. 1578–1581, March 2016.

6. T. Joachims, “Text categorization with support vector machines: Learning with many relevant features,” in ECML, 1998.

7. W. Lam, M. Ruiz, and P. Srinivasan, “Automatic text categorization and its application to text retrieval,” IEEE Transactions on Knowledge and Data Engineering, vol. 11, no. 6, pp. 865–879, 1999.

8. F. Sebastiani, “Machine learning in automated text categorization,” ACM computing surveys (CSUR), vol. 34, no. 1, pp. 1–47, 2002.

9. H. Al-Mubaid, S. Umair et al., “A new text categorization technique using distributional clustering and learning logic,” IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 9, pp. 1156–1165, 2006.

10. Y. Aphinyanaphongs, L. D. Fu, Z. Li, E. R. Peskin, E. Efstathiadis, C. F. Aliferis, and A. Statnikov, “A comprehensive empirical comparison of modern supervised classification and feature selection methods for text categorization,” Journal of the Association for Information Science and Technology, vol. 65, no. 10, pp. 1964–1987, 2014.

Thank You