A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm (Harun Uğuz, KBS 2011). Presenter: YU-TING LU, Intelligent Database Systems
Feb 24, 2016
Intelligent Database Systems Lab
Presenter : YU-TING LU
Authors : Harun Uğuz
Knowledge-Based Systems (KBS), 2011
A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm
Outlines
• Motivation
• Objectives
• Methodology
• Experiments
• Conclusions
• Comments
Motivation
• A major problem of text categorization is its large number of features.
• Most of these features are irrelevant or noisy and can mislead the classifier.
Objectives
• A two-stage approach combining feature selection and feature extraction is used to improve the performance of text categorization.
Methodology
Methodology – pre-processing
– Removal of stop-words: a, an, and, because, can, do, every, the…
– Stemming: computer, computing, computation, computes → comput
– Term weighting
– Pruning of the words: prune the words that appear fewer than two times in the documents.
[Figure: document-term matrix; rows: documents, columns: terms of the document collection]
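The pre-processing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation. The stop-word list contains only the examples from the slide, pruning is assumed to use each word's total count over the whole collection, and raw term counts stand in for the term weights, whose scheme the slide does not state. Stemming is omitted. The names `tokenize` and `build_document_term_matrix` are hypothetical.

```python
from collections import Counter

# Illustrative stop-word list built from the slide's examples (not a complete list).
STOP_WORDS = {"a", "an", "and", "because", "can", "do", "every", "the"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop-words."""
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [w for w in words if w.isalpha() and w not in STOP_WORDS]


def build_document_term_matrix(docs, min_count=2):
    """Return (vocabulary, matrix) of raw term counts after pruning rare words."""
    tokenized = [tokenize(d) for d in docs]
    totals = Counter(t for tokens in tokenized for t in tokens)
    # Assumption: prune words whose total count in the collection is below min_count.
    vocab = sorted(t for t, c in totals.items() if c >= min_count)
    index = {t: j for j, t in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, tokens in enumerate(tokenized):
        for t in tokens:
            if t in index:
                matrix[i][index[t]] += 1
    return vocab, matrix


docs = [
    "The grain trade can grow.",
    "Every trade report covers grain and crude.",
    "Crude prices rise because supply is low.",
]
vocab, dtm = build_document_term_matrix(docs)
```

Each row of `dtm` corresponds to a document and each column to a surviving term, which matches the layout of the document-term matrix used in the experiments.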
Methodology – feature ranking with information gain
• Each term within the text is ranked in decreasing order of its importance for the classification using the information gain (IG) method.
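As a rough illustration of IG ranking, the sketch below scores each term by the reduction in class entropy obtained from knowing whether the term is present in a document. It uses the standard entropy-based definition of information gain. The function names, the toy matrix, and the class labels are assumptions made for this example only.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(term_present, labels):
    """IG of a binary term-presence feature with respect to the class labels."""
    classes = sorted(set(labels))

    def class_counts(rows):
        return [sum(1 for i in rows if labels[i] == c) for c in classes]

    n = len(labels)
    with_term = [i for i in range(n) if term_present[i]]
    without_term = [i for i in range(n) if not term_present[i]]
    h_class = entropy(class_counts(range(n)))
    h_conditional = (
        (len(with_term) / n) * entropy(class_counts(with_term))
        + (len(without_term) / n) * entropy(class_counts(without_term))
    )
    return h_class - h_conditional


def rank_terms(matrix, vocab, labels):
    """Return (term, IG) pairs sorted in decreasing order of IG."""
    scores = []
    for j, term in enumerate(vocab):
        present = [row[j] > 0 for row in matrix]
        scores.append((term, information_gain(present, labels)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy data (hypothetical): 4 documents, 3 terms, 2 classes.
vocab = ["wheat", "oil", "market"]
matrix = [[2, 0, 1], [1, 0, 1], [0, 3, 1], [0, 1, 1]]
labels = ["grain", "grain", "crude", "crude"]
ranking = rank_terms(matrix, vocab, labels)
```

In this toy data "market" occurs in every document, so it carries no information about the class and ends up last in the ranking.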
Methodology – dimension reduction methods
• Principal component analysis
• Genetic algorithm for feature selection
– Individual's encoding: 11011001100111011110 (1 = feature selected, 0 = feature not selected)
– Fitness function
– Selection
– Crossover
– Mutation (controlled by the mutation probability p_m)
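The genetic algorithm steps listed above can be sketched as follows. This is an illustrative skeleton under stated assumptions, not the configuration used in the paper. Binary tournament selection, one-point crossover, the parameter values, and the toy fitness function are all choices made for this example. In practice the fitness would be derived from classification performance on the selected features.

```python
import random


def run_ga(fitness, n_features, pop_size=20, generations=30,
           crossover_rate=0.9, mutation_rate=0.05, seed=0):
    """Evolve a bit string in which bit i == 1 means feature i is selected."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)

    for _ in range(generations):
        def select():
            # Selection (assumed scheme): binary tournament.
            a, b = rng.sample(population, 2)
            return a if fitness(a) >= fitness(b) else b

        next_population = []
        while len(next_population) < pop_size:
            parent1, parent2 = select(), select()
            child1, child2 = parent1[:], parent2[:]
            # Crossover (assumed scheme): one-point.
            if rng.random() <= crossover_rate:
                point = rng.randint(1, n_features - 1)
                child1 = parent1[:point] + parent2[point:]
                child2 = parent2[:point] + parent1[point:]
            for child in (child1, child2):
                # Mutation: flip a bit when a random draw is <= p_m.
                for i in range(n_features):
                    if rng.random() <= mutation_rate:
                        child[i] = 1 - child[i]
                next_population.append(child)
        population = next_population[:pop_size]
        best = max(population + [best], key=fitness)
    return best


# Hypothetical fitness: reward three "relevant" features, penalise subset size.
RELEVANT = {0, 3, 5}


def toy_fitness(bits):
    selected = [i for i, b in enumerate(bits) if b]
    return len(RELEVANT.intersection(selected)) - 0.1 * len(selected)


best = run_ga(toy_fitness, n_features=10)
```

The returned list uses the same encoding as the bit string on the slide: each position is one candidate feature, and a 1 keeps that feature.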
Methodology – text categorization methods
• KNN classifier
• C4.5 decision tree classifier
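To make the KNN classifier concrete, here is a minimal sketch that labels a query document by majority vote among its k most similar training documents. Cosine similarity and k = 3 are assumptions for this example, since the slide does not state the distance measure or the value of k. The C4.5 decision tree is not sketched here.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_predict(train_vectors, train_labels, query, k=3):
    """Return the majority label among the k most similar training vectors."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine_similarity(pair[0], query),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): rows are documents, columns are term weights.
train_vectors = [[2, 0, 1], [1, 0, 0], [0, 3, 1], [0, 1, 0]]
train_labels = ["grain", "grain", "crude", "crude"]
prediction = knn_predict(train_vectors, train_labels, [1, 0, 1], k=3)
```

Because every prediction compares the query against all training documents over all features, reducing the number of features directly lowers the cost of each comparison.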
Methodology – evaluation of the performance
• Precision
• Recall
• F-measure
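The three measures can be computed per category from the counts of true positives, false positives, and false negatives. The sketch below assumes the balanced F-measure, F = 2PR / (P + R). The function name and the toy labels are illustrative only.

```python
def precision_recall_f(true_labels, predicted_labels, positive):
    """Precision, recall and balanced F-measure for one category."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy example (hypothetical labels): evaluate the "earn" category.
true_labels = ["earn", "earn", "earn", "trade", "trade", "grain"]
predicted_labels = ["earn", "earn", "trade", "trade", "earn", "grain"]
p, r, f = precision_recall_f(true_labels, predicted_labels, "earn")
```

Here two "earn" documents are found, one is missed, and one "trade" document is wrongly labelled "earn", so precision, recall, and F-measure all equal 2/3.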
Experiments – datasets
– Reuters-21578 dataset

Category name    Number of documents
Earn             3743
Acquisition      2179
Money-fx         633
Crude            561
Grain            542
Trade            500

– Classic3 dataset

Category name    Number of documents
CRANFIELD        1398
MEDLINE          1033
CISI             1460
Experiments – Reuters-21578
• A document-term matrix with a dimension of 8158 × 7542 (8158 documents × 7542 terms) is acquired at the end of pre-processing.
Experiments – Reuters-21578
Experiments – Classic3
• A document-term matrix with a dimension of 3891 × 6679 (3891 documents × 6679 terms) is acquired at the end of pre-processing.
Experiments – Classic3
Conclusions
• Text categorization with the C4.5 decision tree and KNN algorithms performs better with the fewer features selected via IG-PCA and IG-GA than with the features selected via IG alone.
• Two-stage feature selection methods can improve the performance of text categorization.
Comments
• Advantages
- understand the basic methods
• Applications
- text categorization