Text Mining Maurice Masih 13030141093 03/24/22 1
Topics of Discussion
• Introduction
• Text Mining Comparison with Other Mining
• Text Mining Process
• How an Algorithm Is Derived for Text Mining
• Text Analysis for Google Sheets
• Conclusion
04/15/23 2
Introduction
• Text mining is the process of deriving high-quality information from text.
– The information is non-trivial.
– The source is unstructured text.
• It is also called text data mining or text analytics.
Need
• Biotech industry
– 80% of biological knowledge exists only in research papers (unstructured text).
– If a scientist manually reads 50 research papers per week and only 10% of them are useful, he or she gains only 5 useful papers per week.
Text Mining Comparison with…
Text Mining
Information Retrieval
Web Mining
Data Mining
Statistics
Computational Linguistics & Natural Language Processing
Text Mining Process
1. Text
• Document Clustering
• Text Characteristics
2. Text Preprocessing
• Text Cleanup
• Tokenization
3. Text Transformation
• Text Representation
• Feature Selection
4. Attribute Selection
• Reduce Dimensionality
• Remove Irrelevant Attributes
5. Data Mining / Pattern Discovery
• Structured Database
• Application Dependent
• Classic Data Mining Techniques
6. Interpretation / Evaluation
• Terminate or Iterate
1. Text
• Document clustering
– Large volume of textual data.
– No clear picture of which documents suit the application.
– A common technique is k-means clustering.
• Text characteristics
– Dependency
– Ambiguity
– Noisy data
– Unstructured data
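To make the k-means idea concrete, here is a minimal sketch in plain Python. It assumes each document is turned into a simple word-count vector and that the starting centroids are chosen by hand. The function names, the toy documents, and the choice of Euclidean distance are illustrative assumptions, not part of the original slides.

```python
from collections import Counter
import math

def vectorize(docs):
    """Turn each document into a word-count vector over a shared vocabulary."""
    counts = [Counter(d.lower().split()) for d in docs]
    vocab = sorted(set(w for c in counts for w in c))
    return [[c[w] for w in vocab] for c in counts]

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def k_means(vectors, initial_centroids, max_iter=100):
    """Assign each vector to its nearest centroid until assignments stop changing."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda j: distance(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # converged: no document changed cluster
        labels = new_labels
        for j in range(len(centroids)):
            members = [v for v, l in zip(vectors, labels) if l == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean(members)
    return labels

# Hypothetical toy corpus: two biology-like and two sport-like documents.
docs = [
    "gene protein cell",
    "protein cell dna",
    "match goal team",
    "team goal score",
]
vectors = vectorize(docs)
# Assumption: seed the two clusters with the first and third documents.
labels = k_means(vectors, initial_centroids=[vectors[0], vectors[2]])
print(labels)
```

With these hand-picked seeds the first two documents end up in one cluster and the last two in the other. In practice the seeds are usually chosen randomly and the result depends on them.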
2. Text Preprocessing
• Text cleanup
– Remove ads from web pages.
– Convert from binary formats.
– Normalize text.
– Deal with tables, figures, and formulas.
• Tokenization
– Splitting a string of characters into a set of tokens.
– Need to deal with issues such as apostrophes and hyphens.
– Need to deal with tenses, parts of speech, etc.
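The cleanup and tokenization steps can be sketched with regular expressions. This is only an illustration under simple assumptions: cleanup here means stripping HTML-like tags, lowercasing, and collapsing whitespace, and the tokenizer keeps apostrophes and hyphens that occur inside a word. The helper names `clean_text` and `tokenize` are hypothetical.

```python
import re

def clean_text(raw):
    """Normalize text: strip HTML-like tags, collapse whitespace, lowercase."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", no_tags).strip().lower()

def tokenize(text):
    """Split text into tokens, keeping word-internal apostrophes and hyphens."""
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text)

raw = "<p>The state-of-the-art  model doesn't   fail.</p>"
tokens = tokenize(clean_text(raw))
print(tokens)
```

Keeping "doesn't" and "state-of-the-art" as single tokens is one possible policy. Other applications may prefer to split them, and handling tenses or parts of speech would need stemming or tagging on top of this.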
3. Text Transformation
• Text representation
– A text document is represented by the words (features) it contains and their occurrences.
– Bag of Words
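A bag-of-words representation can be sketched as a document-term count matrix. The example below assumes whitespace tokenization and a vocabulary sorted alphabetically. The function name `bag_of_words` and the two sample sentences are illustrative only.

```python
from collections import Counter

def bag_of_words(docs):
    """Return a sorted vocabulary and one word-count row per document."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocabulary = sorted(set(word for c in counts for word in c))
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix

docs = [
    "text mining finds patterns in text",
    "data mining finds patterns in data",
]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row records how often each vocabulary word occurs in one document. Word order is discarded, which is what the name "bag" refers to.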
4. Attribute Selection
• Reduction of dimensionality
– Learners have difficulty addressing tasks with high dimensionality.
– Scarcity of resources and feasibility issues also call for a further cutback of attributes.
• Irrelevant features
– Not all features help!
– e.g., the existence of a noun in a news article is unlikely to help classify it as “politics” or “sport”.
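One simple way to cut back attributes is to filter by document frequency. The sketch below drops words that appear in too few documents or in nearly all of them, since neither kind helps separate classes. The thresholds `min_df` and `max_df_ratio`, the function name, and the toy matrix are assumptions chosen for illustration. The slides do not prescribe a specific selection method.

```python
def select_features(vocabulary, matrix, min_df=2, max_df_ratio=0.9):
    """Keep columns whose document frequency lies within the given bounds."""
    n_docs = len(matrix)
    keep = []
    for j, _word in enumerate(vocabulary):
        # Document frequency: number of documents containing the word.
        df = sum(1 for row in matrix if row[j] > 0)
        if df >= min_df and df / n_docs <= max_df_ratio:
            keep.append(j)
    reduced_vocab = [vocabulary[j] for j in keep]
    reduced_matrix = [[row[j] for j in keep] for row in matrix]
    return reduced_vocab, reduced_matrix

# Hypothetical counts for four news articles (two politics, two sport).
vocabulary = ["election", "goal", "rare", "the", "vote"]
matrix = [
    [1, 0, 0, 3, 1],
    [2, 0, 1, 2, 1],
    [0, 2, 0, 4, 0],
    [0, 1, 0, 1, 0],
]
reduced_vocab, reduced_matrix = select_features(vocabulary, matrix)
print(reduced_vocab)
```

Here "the" is removed because it occurs in every article, and "rare" is removed because it occurs in only one. The remaining columns are the ones that differ between the two topics.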
5. Data Mining / Pattern Discovery
• The text mining process merges with the traditional data mining process.
• Classic data mining techniques are used on the structured database that resulted from the previous stages.
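Once documents are rows of numbers, any classic technique can be applied. As one possible illustration, the sketch below runs a 1-nearest-neighbor classifier with cosine similarity on a small labeled table. The technique, the column meanings, and the labels are assumptions picked for the example, not the method used in the slides.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest_neighbor(train_rows, train_labels, query):
    """Return the label of the training row most similar to the query."""
    best = max(range(len(train_rows)), key=lambda i: cosine(train_rows[i], query))
    return train_labels[best]

# Hypothetical structured table; columns: election, goal, vote.
train_rows = [[1, 0, 1], [2, 0, 1], [0, 2, 0], [0, 1, 0]]
train_labels = ["politics", "politics", "sport", "sport"]

sport_guess = nearest_neighbor(train_rows, train_labels, [0, 3, 0])
politics_guess = nearest_neighbor(train_rows, train_labels, [1, 0, 2])
print(sport_guess, politics_guess)
```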
6. Interpretation & Evaluation
• What to do next?
– Terminate
– Iterate
Text Analysis for Google Sheets
• Perform sentiment analysis.
• Extract mentions of entities and concepts.
• Summarize long chunks of text.
• Detect the language of a document.
• Find the best hashtags.
• Extract the full text of an article, as well as its author name, embedded media, etc.
Conclusion
Text mining generally consists of the analysis of (multiple) text documents by extracting key phrases, concepts, matches etc. and the preparation of the text processed in that manner for further analyses with numeric data mining techniques.