Text categorization

Mar 20, 2017

Shubham Pahune
Page 1: Text categorization

Contents

• Introduction
• Problem Definition
• Literature Review
• Objective
• Methodology & Algorithmic Strategy
• Proposed System
• System Architecture
• Modules
• Technology
• Application
• References

Page 2: Text categorization

Introduction

Page 3: Text categorization

Introduction

• Data Mining Domain.

• What is Text categorization?

• What is Text Reduction?

• Advantages of Text categorization

• Automated Text Categorization

Page 4: Text categorization

Introduction (Contd…)

Fig: Working of Text Classifier. The unknown input and the dataset feed the text classifier, which assigns the input to one of the output classes 1 to N.

Page 5: Text categorization

Problem Definition

Page 6: Text categorization

Problem Definition

• Text categorization is the process of classifying unknown text into classes.

• Until now, classification in such systems has been done manually, and automated classification has not delivered better efficiency.

• To improve the efficiency of automated text categorization, we propose a modified Naïve Bayes algorithm that overcomes the disadvantages of the existing system.

Page 7: Text categorization

Objectives

Page 8: Text categorization

Objectives

• Preprocess the 20 newsgroup dataset.

• Perform text reduction.

• Generate frequencies based on the modified algorithm.

• Classify unknown input using the proposed algorithm.

• Compare the existing and proposed systems.

Page 9: Text categorization

Literature Review

Page 10: Text categorization

Literature Review

Sr No.: 1

Title: Toward Optimal Feature Selection in Naive Bayes for Text Categorization

Authors: Bo Tang, Student Member, IEEE, Steven Kay, Fellow, IEEE, and Haibo He, Senior Member, IEEE

Published: IEEE, Feb 2016

Description: Automated feature selection is important for text categorization because it reduces the feature size and speeds up the learning process of classifiers. The paper presents a novel and efficient feature selection framework based on information theory, which aims to rank features by their discriminative capacity for classification.

Disadvantages: Accuracy based on word count is very low. It can be improved by using other classification algorithms such as Naïve Bayes.

Page 11: Text categorization

Literature Review

Sr No.: 2

Title: Comparative Study Of Classification Algorithm For Text Based Categorization

Authors: Omkar Ardhapure, Gayatri Patil, Disha Udani, Kamlesh Jetha

Published: IJRET, Feb 2016

Description: Text categorization is a data mining process that assigns predefined categories to free-text documents using machine learning techniques. Any document in the form of text, image, music, etc. can be classified using some categorization technique.

Disadvantages: Accuracy is not compared on a large amount of data. In data mining, more data gives more reliable results.

Page 12: Text categorization

Literature Review

Sr No.: 3

Title: Text Classification by Combining Text Classifiers to Improve the Efficiency of Classification

Authors: Aaditya Jain, Jyoti Mandowara

Published: IJCA, April 2016

Description: The basic working of a web crawler is presented in this paper.

Disadvantages: The paper does not explain how pages can be ranked using algorithms. Only the working of text classification is described.

Page 13: Text categorization

Literature Review

Sr No.: 4

Title: Arabic Text Categorization using k-nearest neighbour, Decision Trees (C4.5) and Rocchio Classifier: A Comparative Study

Authors: Adel Hamdan Mohammad, Omar Al-Momani and Tariq Alwada’n

Published: IJCET, April 2016

Description: The authors note that much research addresses text classification for the English language, while only a few researchers discuss text classification using Arabic data sets. This research applies three well-known classification algorithms: k-nearest neighbour (k-NN), C4.5 and Rocchio.

Disadvantages: Accuracy is not compared on a large amount of data. In data mining, more data gives more reliable results.

Page 14: Text categorization

Literature Review

Sr No.: 5

Title: Text Categorization on Multiple Languages Based On Classification Technique

Authors: Kapila Rani, Satvika

Published: IJCSIT, March 2016

Description: The authors observe that the amount of textual data available in electronic form in various Indian regional languages is growing at an accelerating rate. Hence, classification of text documents based on language is essential.

Disadvantages: The paper is not very informative regarding search engine optimization.

Page 15: Text categorization

Methodology & Algorithmic Strategy

Page 16: Text categorization

Methodology

Pre-processing Dataset (Apply Reduction)

The first step is to reduce the 20 newsgroup dataset using a text reduction technique.

[Diagram: the dataset passes through the preprocessor method and the reduction procedure to produce the reduced dataset.]

Page 17: Text categorization

Methodology

Generating Frequencies

We have to generate the following frequencies from the dataset.

• Wordcount = number of times a word occurs in a file.

• Term Frequency (TF) = occurrences of the word in a file / total number of words in that file (computed for each file).

• Inverse Document Frequency (IDF) = total number of documents / number of documents in which the term occurs.

• Normalized Term Frequency (NTF) = ∑ TF / number of occurrences.
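The frequency definitions above can be sketched in Java roughly as follows. This is only an illustrative sketch, not the deck's implementation: the class and method names are hypothetical, documents are assumed to be already tokenized into word lists, and IDF is computed as the plain ratio given on the slide (no logarithm). The slide's NTF formula is ambiguous, so the sketch assumes NTF is the sum of a word's TF over all documents divided by the number of documents in which the word occurs.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch of the slide's frequency definitions; all names are hypothetical. */
final class FrequencySketch {

    /** Wordcount: number of times each word occurs in one file (given as a token list). */
    static Map<String, Integer> wordCount(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    /** TF = occurrences of the word in the file / total number of words in the file. */
    static double termFrequency(String word, List<String> tokens) {
        if (tokens.isEmpty()) {
            return 0.0;
        }
        int count = wordCount(tokens).getOrDefault(word, 0);
        return (double) count / tokens.size();
    }

    /** Number of documents in which the word occurs at least once. */
    static int documentFrequency(String word, List<List<String>> docs) {
        int n = 0;
        for (List<String> doc : docs) {
            if (doc.contains(word)) {
                n++;
            }
        }
        return n;
    }

    /** IDF as stated on the slide: total documents / documents containing the term (no log). */
    static double inverseDocumentFrequency(String word, List<List<String>> docs) {
        int df = documentFrequency(word, docs);
        return df == 0 ? 0.0 : (double) docs.size() / df;
    }

    /**
     * ASSUMPTION (the slide is ambiguous): NTF = sum of the word's TF over all
     * documents / number of documents in which the word occurs.
     */
    static double normalizedTermFrequency(String word, List<List<String>> docs) {
        int df = documentFrequency(word, docs);
        if (df == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (List<String> doc : docs) {
            sum += termFrequency(word, doc);
        }
        return sum / df;
    }
}
```

A word that appears in no document gets an IDF and NTF of 0 here to avoid division by zero; that convention is a choice of this sketch, not something the slides specify.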

Page 18: Text categorization

Methodology

Classification

[Diagram: the unknown input and the dataset feed the text classifier, which assigns the input to one of the output classes 1 to N.]

Page 19: Text categorization

Proposed Work

Page 20: Text categorization

Proposed Work

• The system will provide automatic text categorization using a Modified Naïve Bayes algorithm.

• Text reduction and feature selection are performed for dataset preprocessing.

• Dataset for simulation: 20 newsgroup.

• A comparison of the existing and proposed systems will be provided.

Page 21: Text categorization

Proposed Architecture

Page 22: Text categorization

Proposed Architecture

[Diagram: the unknown news item and the news dataset feed the Modified NB algorithm, which assigns the news item to one of the output classes 1 to N.]

Page 23: Text categorization

Modules

Page 24: Text categorization

Modules

1. Dataset preprocessing

2. Text Categorization

3. Modified NB Implementation

4. Comparative Study

Page 25: Text categorization

Module - Dataset Preprocessing

1. Read each and every file one by one

2. Perform Preprocessing

a. Remove stop words.

b. Remove Special Symbols

c. Remove Unwanted Spaces.

3. Calculate word count, Term Frequency (TF), Normalized Term Frequency (NTF) and Inverse Document Frequency (IDF).

4. Insert the data in the DB as word, docid, classid, wordcount, TF, IDF, NTF.
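The cleaning steps in item 2 of this module could look roughly like the sketch below. It is a minimal illustration under stated assumptions: the class name, the method name and the tiny stop-word list are hypothetical, and lower-casing the text is an extra step added here (the slides do not mention it) so that stop-word matching is case-insensitive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative preprocessing sketch; names and the stop-word list are hypothetical. */
final class PreprocessSketch {

    // Tiny example stop-word list; a real system would load a much fuller one.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "the", "is", "are", "of", "to", "in", "on", "for"));

    /** Returns the cleaned tokens of one file's text. */
    static List<String> preprocess(String text) {
        // ASSUMPTION: lower-case first so "The" and "the" are treated alike.
        String lower = text.toLowerCase(Locale.ROOT);
        // 2b. Remove special symbols: anything that is not a letter, digit or whitespace.
        String cleaned = lower.replaceAll("[^\\p{L}\\p{Nd}\\s]", " ");
        List<String> tokens = new ArrayList<>();
        // 2c. Remove unwanted spaces: trim and split on runs of whitespace.
        for (String token : cleaned.trim().split("\\s+")) {
            // 2a. Remove stop words (and skip the empty token produced by empty input).
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

For example, `PreprocessSketch.preprocess("The  price of   OIL, is rising!!")` yields the tokens `price`, `oil`, `rising`, which can then be counted and stored as described in steps 3 and 4.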

Page 26: Text categorization

Module – Text Categorization

1. Perform text categorization on an unknown file.

a. Preprocess the input file: remove stop words, special symbols and unwanted spaces.

2. Generate the decision matrix, with one row per news class (News class 1 to News class N) and one column per word (Word 1 to Word N).

Page 27: Text categorization

Module – Text Categorization

1. For the k-NN algorithm, use wordcount: each cell of the decision matrix holds the wordcount of a word (Word 1 to Word N) in a news class (News class 1 to News class N), for example "Wordcount word1 – class 1".

Page 28: Text categorization

Module – Text Categorization

1. For the Naïve Bayes algorithm, use TF * IDF: each cell of the decision matrix holds the TF * IDF value of a word (Word 1 to Word N) in a news class (News class 1 to News class N), for example "TF*IDF word1 – class 1".

Page 29: Text categorization

Module – Text Categorization

1. For the Modified Naïve Bayes algorithm, use NTF * IDF: each cell of the decision matrix holds the NTF * IDF value of a word (Word 1 to Word N) in a news class (News class 1 to News class N), for example "NTF*IDF word1 – class 1".
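Filling the decision matrix is the same for all three algorithms; only the stored weight differs (wordcount, TF * IDF or NTF * IDF). The sketch below illustrates that idea. It assumes the weights have already been aggregated per class into an in-memory lookup table, which the slides do not describe, and the class name, method name and parameters are hypothetical.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Illustrative decision-matrix sketch; all names are hypothetical. */
final class DecisionMatrixSketch {

    /**
     * Builds matrix[c][w] = weight of input word w in class c.
     *
     * @param classes    class labels, one matrix row each
     * @param inputWords preprocessed words of the unknown file, one column each
     * @param weights    ASSUMED lookup: class label -> (word -> weight), where the
     *                   weight is wordcount, TF*IDF or NTF*IDF depending on the algorithm
     */
    static double[][] build(List<String> classes,
                            List<String> inputWords,
                            Map<String, Map<String, Double>> weights) {
        double[][] matrix = new double[classes.size()][inputWords.size()];
        for (int c = 0; c < classes.size(); c++) {
            Map<String, Double> perWord = weights.getOrDefault(
                    classes.get(c), Collections.<String, Double>emptyMap());
            for (int w = 0; w < inputWords.size(); w++) {
                // Words never seen in this class contribute a weight of 0.
                matrix[c][w] = perWord.getOrDefault(inputWords.get(w), 0.0);
            }
        }
        return matrix;
    }
}
```

Keeping the weight table as a parameter means the same routine serves the k-NN, Naïve Bayes and Modified Naïve Bayes variants; the caller decides which stored column to load into it.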

Page 30: Text categorization

Module – Text Categorization

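2. Perform row-wise addition: add up the values in each class row of the decision matrix to obtain one frequency total per class (Class 1 to Class N).

The values added are the word count for k-NN, TF * IDF for Naïve Bayes, and NTF * IDF for Modified Naïve Bayes.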

Page 31: Text categorization

Module – Text Categorization

3. Take the maximum of the array of per-class totals (Class 1 to Class N) and select its index as the predicted class.

The totals are word counts for k-NN, TF * IDF for Naïve Bayes, and NTF * IDF for Modified Naïve Bayes.
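The last two steps, row-wise addition followed by taking the index of the maximum, can be sketched as below. The class and method names are hypothetical, the matrix is assumed to have one row per class, and breaking ties in favour of the first class is a choice of this sketch rather than something the slides state.

```java
/** Illustrative scoring sketch for the row-sum and maximum steps; names are hypothetical. */
final class ClassScoringSketch {

    /** Row-wise addition: one total per class row of the decision matrix. */
    static double[] rowSums(double[][] matrix) {
        double[] totals = new double[matrix.length];
        for (int c = 0; c < matrix.length; c++) {
            double sum = 0.0;
            for (double value : matrix[c]) {
                sum += value;
            }
            totals[c] = sum;
        }
        return totals;
    }

    /** Index of the largest total; on ties the first (lowest) index wins. */
    static int argMax(double[] totals) {
        if (totals.length == 0) {
            throw new IllegalArgumentException("at least one class is required");
        }
        int best = 0;
        for (int i = 1; i < totals.length; i++) {
            if (totals[i] > totals[best]) {
                best = i;
            }
        }
        return best;
    }

    /** Predicted class index = index of the row with the largest sum. */
    static int predict(double[][] matrix) {
        return argMax(rowSums(matrix));
    }
}
```

With a matrix of `{{2, 0, 0}, {1, 4, 0}}` the row totals are 2 and 5, so `predict` returns index 1, the second class.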

Page 32: Text categorization

Technology

Page 33: Text categorization

Technology

1. Front end: Java 8

2. Back end: MySQL

Page 34: Text categorization

Applications

Page 35: Text categorization

Applications

1. Antivirus System

2. Disease Detection Systems.

Page 36: Text categorization

Requirements

Page 37: Text categorization

Software Requirements

1. Eclipse or Netbeans for Java Development

2. Apache Tomcat 8

Page 38: Text categorization

Expected Hardware Requirements

1. I3 Processor

2. 4GB RAM

3. 500 GB HDD

Page 39: Text categorization

References

Page 40: Text categorization

References

1. Bo Tang, Student Member, IEEE, Steven Kay, Fellow, IEEE, and Haibo He, Senior Member, IEEE, “Toward Optimal Feature Selection in Naive Bayes for Text Categorization”, Dependable and Secure Computing, submitted to IEEE Transactions on Knowledge and Data Engineering, 09 February 2016.

2. Omkar Ardhapure, Gayatri Patil, Disha Udani, Kamlesh Jetha, “Comparative Study Of Classification Algorithm For Text Based Categorization”, in IJRET: International Journal of Research in Engineering and Technology, Volume: 05 Issue: 02 | Feb-2016.

3. Aaditya Jain, Jyoti Mandowara, “Text Classification by Combining Text Classifiers to Improve the Efficiency of Classification”, International Journal of Computer Application (2250-1797) Volume 6– No.2, March- April 2016.

4. Adel Hamdan Mohammad, Omar Al-Momani and Tariq Alwada’n, “Arabic Text Categorization using k-nearest neighbour, Decision Trees (C4.5) and Rocchio Classifier: A Comparative Study” in International Journal of Current Engineering and Technology, E-ISSN 2277 – 4106, P-ISSN 2347 – 5161 , Vol.6, No.2 (April 2016).

5. Kapila Rani, Satvika, “Text Categorization on Multiple Languages Based On Classification Technique”, (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 7, March 2016, 1578-1581.

Page 41: Text categorization

References

6. T. Joachims, “Text categorization with support vector machines: Learning with many relevant features,” in ECML, 1998.

7. W. Lam, M. Ruiz, and P. Srinivasan, “Automatic text categorization and its application to text retrieval,” IEEE Transactions on Knowledge and Data Engineering, vol. 11, no. 6, pp. 865–879, 1999.

8. F. Sebastiani, “Machine learning in automated text categorization,” ACM computing surveys (CSUR), vol. 34, no. 1, pp. 1–47, 2002.

9. H. Al-Mubaid, S. Umair et al., “A new text categorization technique using distributional clustering and learning logic,” IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 9, pp. 1156–1165, 2006.

10. Y. Aphinyanaphongs, L. D. Fu, Z. Li, E. R. Peskin, E. Efstathiadis, C. F. Aliferis, and A. Statnikov, “A comprehensive empirical comparison of modern supervised classification and feature selection methods for text categorization,” Journal of the Association for Information Science and Technology, vol. 65, no. 10, pp. 1964–1987, 2014.

Page 42: Text categorization

Thank You