Text Classification and Images by Carl Sable
Jan 16, 2016
Overview
• Text Classification.
– Involves assigning text documents to one or more groups (classes).
– Techniques can be applied to image captions to classify corresponding images.
• Various methods, evaluation techniques, and related issues will be discussed.
• Some discussion of other research involving image captions.
Text Classification Tasks
• Text Categorization (TC) - Assign text documents to existing, well-defined categories.
• Information Retrieval (IR) - Retrieve text documents which match user query.
• Clustering - Group text documents into clusters of similar documents.
• Text Filtering - Retrieve documents which match a user profile.
Text Categorization
• Classify each test document by assigning category labels.
– M-ary categorization assumes M labels per document.
– Binary categorization requires yes/no decision for every document/category pair.
• Most techniques require training.
– Parametric vs non-parametric.
– Batch vs on-line.
Early Work
• The Federalist papers.
– Published anonymously between 1787 and 1788.
– Authorship of 12 papers in dispute (either Hamilton or Madison).
• Mosteller and Wallace, 1963.
– Compared rate per thousand words of high frequency words.
– Collected very strong evidence in favor of Madison.
Rocchio
• All documents and categories represented by word vectors.
• TF*IDF weights for words.
– Term frequency is number of times word appears in document or category.
– Inverse document frequency relates to scarcity of word over entire training collection.
• Similarity computed for all document/category pairs.
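A minimal sketch of the scheme above, assuming raw counts for term frequency, log(N/df) for inverse document frequency, and cosine similarity. The slides do not fix any of these choices, and the function names and toy training data are illustrative only.

```python
import math
from collections import Counter

def tfidf_vector(tokens, idf):
    """Weight each known word by (count in tokens) * idf."""
    counts = Counter(tokens)
    return {w: c * idf[w] for w, c in counts.items() if w in idf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def train_rocchio(training_docs):
    """training_docs: list of (tokens, category). Returns (idf, category vectors)."""
    n = len(training_docs)
    df = Counter()
    for tokens, _ in training_docs:
        df.update(set(tokens))  # document frequency: count each word once per document
    idf = {w: math.log(n / d) for w, d in df.items()}
    # Assumption: a category vector is the TF*IDF vector of all its training text pooled.
    pooled = {}
    for tokens, cat in training_docs:
        pooled.setdefault(cat, []).extend(tokens)
    return idf, {cat: tfidf_vector(toks, idf) for cat, toks in pooled.items()}

def classify(tokens, idf, category_vectors):
    """Pick the category whose vector is most similar to the document vector."""
    doc = tfidf_vector(tokens, idf)
    return max(category_vectors, key=lambda c: cosine(doc, category_vectors[c]))

training = [
    ("stocks fell on wall street".split(), "business"),
    ("the market rallied as stocks rose".split(), "business"),
    ("the team won the final game".split(), "sports"),
    ("the coach praised the team".split(), "sports"),
]
idf, category_vectors = train_rocchio(training)
label = classify("stocks and the market".split(), idf, category_vectors)
```

Words unseen in training (here "and") are dropped, and a word that occurs in every training document gets weight zero under this IDF variant.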
Naïve Bayes
• Estimates probabilities of categories given a document.
• Uses joint probabilities of words and categories (Bayes’ rule).
• Assumes words are independent of each other.
• Can incorporate a priori probabilities of categories.
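The bullets above can be sketched as a multinomial Naïve Bayes classifier. This is only one common variant: the add-one (Laplace) smoothing, the log-space scoring, and all names and toy data below are assumptions, not details taken from the slides.

```python
import math
from collections import Counter, defaultdict

def train_nb(training_docs):
    """training_docs: list of (tokens, category)."""
    doc_counts = Counter()              # documents per category (for priors)
    word_counts = defaultdict(Counter)  # word occurrences per category
    vocab = set()
    for tokens, cat in training_docs:
        doc_counts[cat] += 1
        word_counts[cat].update(tokens)
        vocab.update(tokens)
    return doc_counts, word_counts, vocab

def log_scores(tokens, model):
    """Unnormalized log P(category | document) for every category."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for cat in doc_counts:
        score = math.log(doc_counts[cat] / total_docs)  # a priori probability
        total_words = sum(word_counts[cat].values())
        for w in tokens:
            if w not in vocab:
                continue  # ignore words never seen in training
            # Independence assumption: per-word log-likelihoods simply add up.
            score += math.log((word_counts[cat][w] + 1) / (total_words + len(vocab)))
        scores[cat] = score
    return scores

def classify_nb(tokens, model):
    scores = log_scores(tokens, model)
    return max(scores, key=scores.get)

model = train_nb([
    ("senate passed the bill".split(), "politics"),
    ("the president signed the bill".split(), "politics"),
    ("the team won the match".split(), "sports"),
    ("the striker scored a goal".split(), "sports"),
])
label = classify_nb("senate vote".split(), model)
```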
Other Common Methods
• K-Nearest Neighbor (kNN) - Use k closest training documents to predict category.
• Decision Trees (DTree) - Construct classification trees based on training data.
• Neural Networks (NNet) - Learn non-linear mapping from input words to categories.
• Expert Systems - Use manually constructed, domain-specific, application-specific rules.
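As an illustration of the kNN bullet above, here is a small sketch that votes among the k most similar training documents. Jaccard overlap of word sets stands in for the similarity measure, which is an assumption made for brevity; TF*IDF cosine similarity would be an equally reasonable choice. Names and data are hypothetical.

```python
from collections import Counter

def jaccard(a, b):
    """Overlap between two token lists, treated as sets of words."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def knn_classify(tokens, training_docs, k=3):
    """Majority vote among the k training documents most similar to tokens."""
    ranked = sorted(training_docs, key=lambda d: jaccard(tokens, d[0]), reverse=True)
    votes = Counter(cat for _, cat in ranked[:k])
    return votes.most_common(1)[0][0]

training = [
    ("heavy rain expected today".split(), "weather"),
    ("strong wind and rain".split(), "weather"),
    ("sunny skies expected".split(), "weather"),
    ("stocks rose today".split(), "finance"),
    ("bank shares fell".split(), "finance"),
    ("investors sold shares".split(), "finance"),
]
label = knn_classify("rain and wind expected".split(), training, k=3)
```

Unlike Rocchio or Naïve Bayes, nothing is learned ahead of time here; all the work happens when a test document arrives.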
Advanced Techniques
• Support Vector Machines (SVMs).
– Use Structural Risk Minimization principle.
– Find hypothesis which minimizes “true error”.
• Widrow-Hoff and EG - Update weight vector based on each training example.
• Maximum Entropy - Derive constraints expressing characteristics of training data.
• Boosting - Combine weak hypotheses to produce highly accurate classification rule.
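To make the on-line flavor of Widrow-Hoff and EG concrete, the sketch below shows one update step of each for a linear classifier with squared-error loss. The learning rate eta, the dense list representation, and the renormalization of EG weights to sum to 1 follow the usual textbook presentation and are assumptions here, not details from the slides.

```python
import math

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def widrow_hoff_step(w, x, y, eta):
    """Additive update: step against the gradient of the squared error (w.x - y)^2."""
    err = dot(w, x) - y
    return [wi - 2 * eta * err * xi for wi, xi in zip(w, x)]

def eg_step(w, x, y, eta):
    """Exponentiated-gradient update: multiplicative, then renormalized to sum to 1."""
    err = dot(w, x) - y
    unnorm = [wi * math.exp(-2 * eta * err * xi) for wi, xi in zip(w, x)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# One training example: feature vector x with target label y.
wh = widrow_hoff_step([0.0, 0.0], [1.0, 0.0], 1.0, eta=0.25)
eg = eg_step([0.5, 0.5], [1.0, 0.0], 1.0, eta=0.5)
```

Both rules see one training example at a time, so they fit the "on-line" side of the batch vs on-line distinction made earlier.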
Common Test Corpora
• Reuters - Collection of newswire stories from 1987 to 1991, labeled with categories.
• TREC-AP - Newswire stories from 1988 to 1990, labeled with categories.
• OHSUMED - Medline articles from 1987 to 1991, MeSH categories assigned.
• UseNet newsgroups.
• WebKB - Web pages gathered from university CS departments.
Other Issues to Consider
• Which words to use (feature selection).
• Normalization.
• Use of lexical databases.
– Longman Dictionary of Contemporary English (LDOCE), WordNet, English Verb Classes and Alternations (EVCA).
– May cause problems due to lexical ambiguity.
• High cost of manual labels.
Categorizing Images
• Some previous research on content-based image categorization, very little on text-based image categorization!
• WebSEEk.
– Categorizes images and videos based on key-terms extracted from URL, alt text, hyperlinks, and directory names.
– Semi-automated key-term dictionary maps key-terms to subject(s) from a taxonomy.
Evaluation Metrics
• Per Category Measures:
– Simple accuracy or error measures can be misleading.
– Precision, recall, and fallout.
– F-measure, average precision, and break-even point (BEP) combine precision and recall.
• Macro-averaging vs Micro-averaging.
• Should choose metric ahead of time (maybe)!
Contingency table:

                 Yes is correct   No is correct
Assigned YES           a                b
Assigned NO            c                d

p = a / (a + b)      (precision)
r = a / (a + c)      (recall)
f = b / (b + d)      (fallout)
Acc = (a + d) / n    (accuracy)
Err = (b + c) / n    (error)
where n = a + b + c + d.
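The measures defined from the contingency table can be computed as in the sketch below, which also contrasts macro- and micro-averaging of precision. The F-measure is taken with equal weight on precision and recall (F1), undefined ratios are reported as 0.0, and the two example tables are made up; all three are assumptions for illustration.

```python
def category_metrics(a, b, c, d):
    """Measures for one category from its contingency-table cells a, b, c, d."""
    n = a + b + c + d
    p = a / (a + b) if a + b else 0.0
    r = a / (a + c) if a + c else 0.0
    return {
        "precision": p,
        "recall": r,
        "fallout": b / (b + d) if b + d else 0.0,
        "accuracy": (a + d) / n,
        "error": (b + c) / n,
        "f1": 2 * p * r / (p + r) if p + r else 0.0,
    }

def macro_precision(tables):
    """Average the per-category precisions: every category counts equally."""
    return sum(category_metrics(*t)["precision"] for t in tables) / len(tables)

def micro_precision(tables):
    """Pool the cells over all categories first: every decision counts equally."""
    a = sum(t[0] for t in tables)
    b = sum(t[1] for t in tables)
    return a / (a + b) if a + b else 0.0

# Hypothetical (a, b, c, d) tables: one common category, one rare category.
tables = [(90, 10, 10, 890), (1, 9, 4, 986)]
rare = category_metrics(*tables[1])
```

For the rare category, accuracy is 0.987 even though precision is 0.1 and recall is 0.2, which is the sense in which simple accuracy can mislead. Macro-averaged precision over the two tables is 0.5, while micro-averaged precision is 91/110 (about 0.83) because the common category dominates the pooled counts.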
Some Results and Analysis
• Comparisons.
– SVM, kNN, AdaBoost, WH, and EG all showed very impressive performance.
– Naïve Bayes and Rocchio tended to show relatively poor performance.
• Rocchio possibly could have done better.
– Should be using probabilistic Rocchio.
– Works best if categories are mutually exclusive.
– May perform at its best when only 2 categories.
Information Retrieval
• User inputs query, system should retrieve all relevant documents.
• Simple technique: keyword search.
• Other techniques rely on word vectors.
– TF*IDF commonly used for weights.
– Can compute similarity between query vector and document vectors.
• Evaluation - Similar to text categorization, treat relevant documents as single category.
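The "simple technique" above, keyword search, might look like the following sketch: an inverted index from words to document ids, with a query matching only documents that contain every query word. The AND semantics, whitespace tokenization, and sample documents are assumptions chosen for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """docs: dict of doc_id -> text. Returns word -> set of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for w in text.lower().split():
            index[w].add(doc_id)
    return index

def keyword_search(index, query):
    """Return ids of documents containing every word of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for t in terms[1:]:
        result &= index.get(t, set())
    return result

docs = {
    1: "President signs bill",
    2: "Team wins final",
    3: "President visits team",
}
index = build_index(docs)
hits = keyword_search(index, "president team")
```

Keyword search returns an unranked set; the word-vector techniques instead score every document against the query and sort by similarity.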
Relevance Feedback
• After initial retrieval, user makes relevance judgements for retrieved documents.
• New round of retrieval based on feedback.
• Similar to text categorization with two categories: relevant vs non-relevant.
• Rocchio algorithm originally created for this task.
• Naïve Bayes very successful.
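One common textbook form of Rocchio relevance feedback moves the query vector toward the centroid of the documents judged relevant and away from the centroid of those judged non-relevant. The sketch below assumes that form; the weights alpha, beta, and gamma, the clipping of negative weights, and the sample vectors are illustrative assumptions rather than values from the slides.

```python
from collections import defaultdict

def rocchio_feedback(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """All vectors are dicts of word -> weight. Returns the modified query vector."""
    new_q = defaultdict(float)
    for w, x in query.items():
        new_q[w] += alpha * x
    for doc in relevant:
        for w, x in doc.items():
            new_q[w] += beta * x / len(relevant)      # add centroid of relevant docs
    for doc in nonrelevant:
        for w, x in doc.items():
            new_q[w] -= gamma * x / len(nonrelevant)  # subtract centroid of non-relevant docs
    # Assumption: words whose weight ends up zero or negative are dropped.
    return {w: x for w, x in new_q.items() if x > 0}

new_query = rocchio_feedback(
    query={"jaguar": 1.0},
    relevant=[{"jaguar": 1.0, "cat": 1.0}],
    nonrelevant=[{"jaguar": 1.0, "car": 1.0}],
)
```

Here the new query gains the word "cat" from the relevant document, and "car" is excluded because it only occurs in the non-relevant one.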
Possible Improvements
• Lexical databases sometimes used for query expansion.
• Word sense disambiguation.
– Expand query with correct senses.
– Used on documents to prevent retrieval based on false matches.
• Notion of semantic similarity.
Retrieval of Captioned Images
• Typical properties of image captions:
– Shorter than documents in typical IR tasks.
– Subject noun phrase usually denotes most significant object in picture.
– In news domain, first sentence generally describes image, rest is background.
• Different types of queries.
• Many techniques from general IR not applicable.
Related Research
• Smeaton.
– Automatically derived Hierarchical Concept Graphs (HCGs) based on WordNet IS-A links.
– Computed semantic similarity between nouns.
– Some success improving image retrieval.
• Guglielmo and Rowe.
– Used logical form records to capture meaning of queries and captions for comparison.
– System significantly beat keyword search.
Other Text Classification Tasks
• Clustering documents.
– Create groups with similar attributes.
– Various methods and algorithms exist.
– Hierarchical vs non-hierarchical.
– Each group has centroid.
– Can aid in Information Retrieval.
• Text Filtering.
– Filter articles of potential interest for a user.
– Uses many of the same methods as TC and IR.
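As one example of a non-hierarchical clustering method in which each group has a centroid, the sketch below runs k-means on small dense document vectors. The slides do not name a specific algorithm, so k-means, the squared Euclidean distance, the caller-supplied starting centroids, and the toy vectors are all assumptions.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            groups[nearest].append(p)
        # Keep the old centroid if a group ends up empty.
        centroids = [centroid(g) if g else c for g, c in zip(groups, centroids)]
    return centroids, groups

# Hypothetical two-word document vectors.
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
centroids, groups = kmeans(points, centroids=[[1.0, 0.0], [0.0, 1.0]])
```

A retrieval system could then compare a query against the centroids first and search only the closest groups, which is one way clustering can aid Information Retrieval.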
Processing Image Captions
• The Correspondence Problem - How to correlate visual information with words.
– Visual semantics.
– Symbolic representation of visual data.
• Srihari.
– Piction - System that automatically identifies human faces in captioned newspaper photos.
– Integrates NLP module which parses captions with IU module that detects objects.
Final Observations
• Previous Work.
– General text categorization studied extensively.
– Some research on text-based image retrieval.
– Very little research involving text-based image categorization.
• Image captions contain information unlikely to be extracted from just images.
• High potential exists for significant research involving text-based image categorization.