Text Mining Maurice Masih 13030141093 03/24/22 1
Transcript
Page 1: Text mining

Text Mining

Maurice Masih 13030141093

04/15/23 1

Page 2: Text mining

Topic of Discussion

• Introduction
• Text mining comparison with other mining
• Text Mining Process
• How an Algorithm Is Derived for Text Mining
• Text Analysis for Google Sheets
• Conclusion

Page 3: Text mining

Introduction

• Text mining is the process of deriving high-quality information from text.
– Non-trivial information
– Unstructured text

• It is also called text data mining or text analytics.

Need

Bio Tech Industry

80% of biological knowledge exists only in research papers (unstructured).

If a scientist manually reads 50 research papers per week and only 10% of the content is useful, then he or she effectively covers only 5 research papers per week.

Page 4: Text mining

Text mining Comparison with…

Text Mining

Information Retrieval

Web Mining

Data Mining

Statistics

Computational Linguistics & Natural Language Processing

Page 5: Text mining

Text Mining Process

1. Text
• Document clustering
• Text characteristics

2. Text Preprocessing
• Text cleanup
• Tokenization

3. Text Transformation
• Text representation
• Feature selection

4. Attribute Selection
• Reduce dimensionality
• Remove irrelevant attributes

5. Data Mining / Pattern Discovery
• Structured database
• Application dependent
• Classic data mining techniques

6. Interpretation / Evaluation
• Terminate or iterate

Page 6: Text mining

1. Text

Document clustering
• Large volume of textual data.
• No clear picture of which documents suit the application.
• A common technique is k-means clustering.
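As a rough illustration of k-means clustering applied to documents, the sketch below turns each document into a word-count vector and groups the vectors into k clusters. The sample documents, the helper names (`vectorize`, `distance`, `kmeans`) and the deterministic farthest-point initialisation are assumptions made for this example; they are not taken from the slides.

```python
import math
import re
from collections import Counter

def vectorize(docs):
    """Turn each document into a word-count vector over a shared vocabulary."""
    counts = [Counter(re.findall(r"[a-z]+", d.lower())) for d in docs]
    vocab = sorted(set().union(*counts))
    return [[c[w] for w in vocab] for c in counts]

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, max_iter=20):
    """Return one cluster label per vector."""
    # Assumption of this sketch: deterministic farthest-point initialisation
    # instead of the usual random choice of starting centroids.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(distance(v, c) for c in centroids))
        )
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: distance(v, centroids[j])) for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical sample documents: two about sport, two about finance.
docs = [
    "the team won the football match",
    "the football team lost the match",
    "stock prices fell on the market",
    "the market saw stock prices rise",
]
labels = kmeans(vectorize(docs), k=2)
```

With these four sample documents, the two football sentences end up in one cluster and the two market sentences in the other.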

Text Characteristics
• Dependency
• Ambiguity
• Noisy data
• Unstructured data

Page 7: Text mining

2. Text Preprocessing

Text Cleanup
• Remove ads from pages
• Convert from binary formats
• Normalize text
• Deal with tables, figures and formulas
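A minimal sketch of the "normalize text" part of cleanup, assuming the input is a string that may still contain HTML-like tags and irregular whitespace. The function name `normalize` and the sample string are hypothetical. Removing ads or converting binary formats is application specific and is not attempted here.

```python
import re
import unicodedata

def normalize(text):
    """Apply a few common normalization steps to raw text."""
    # Fold compatibility characters (e.g. non-breaking spaces) to plain forms.
    text = unicodedata.normalize("NFKC", text)
    # Replace HTML-like tags with a space (a crude stand-in for real HTML parsing).
    text = re.sub(r"<[^>]+>", " ", text)
    # Collapse runs of whitespace, including newlines, into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()

cleaned = normalize("<p>Text\u00a0 Mining</p>\n is   FUN")
# cleaned == "text mining is fun"
```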

Tokenization
• Splitting a string of characters into a set of tokens.
• Need to deal with issues such as apostrophes and hyphens.
• Need to deal with tenses, parts of speech, etc.
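The apostrophe and hyphen issue can be shown with a small regular-expression tokenizer. This is only one possible policy (keep internal apostrophes and hyphens inside a token); the pattern and the name `tokenize` are assumptions for illustration. Tenses and parts of speech would need stemming or tagging, which this sketch does not cover.

```python
import re

# One token = a run of letters/digits, optionally joined by internal
# apostrophes or hyphens, so "don't" and "state-of-the-art" stay whole.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = tokenize("Don't split state-of-the-art terms, please!")
# tokens == ["don't", "split", "state-of-the-art", "terms", "please"]
```

A different application might prefer to split on hyphens or drop apostrophes; the right choice depends on what the later stages expect.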

Page 8: Text mining

3. Text transformation

Text Representation
• A text document is represented by the words (features) it contains and their occurrences.

Bag of Words
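The bag-of-words representation can be sketched as follows: every document becomes a row of word counts over a shared vocabulary, and word order is discarded. The function name `bag_of_words` and the two sample sentences are made up for this example, and tokenization is simplified to whitespace splitting.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count matrix) for whitespace-tokenized documents."""
    counts = [Counter(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*counts))
    # One row per document, one column per vocabulary word.
    matrix = [[c[w] for w in vocab] for c in counts]
    return vocab, matrix

vocab, matrix = bag_of_words(["the cat sat", "the cat saw the dog"])
# vocab  == ['cat', 'dog', 'sat', 'saw', 'the']
# matrix == [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```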

Page 9: Text mining

3. Text transformation (contd.)

Page 10: Text mining

4. Attribute Selection

Reduction of dimensionality
• Learners have difficulty addressing tasks with high dimensionality.
• Scarcity of resources and feasibility issues also call for a further cutback of attributes.

Irrelevant features
• Not all features help!
• e.g., the existence of a noun in a news article is unlikely to help classify it as “politics” or “sport”.
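One simple way to cut back attributes, shown here only as an illustration and not necessarily the method the slides intend, is to filter terms by document frequency: terms that occur in almost every document (like the "contains a noun" feature above) or in almost none carry little information for classification. The thresholds `min_df` and `max_df_ratio` and the sample token lists are hypothetical.

```python
from collections import Counter

def select_by_document_frequency(docs_tokens, min_df=2, max_df_ratio=0.9):
    """Keep terms found in at least min_df documents and in at most
    max_df_ratio of all documents."""
    n_docs = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))  # count each term once per document
    return sorted(
        term for term, n in df.items()
        if n >= min_df and n / n_docs <= max_df_ratio
    )

# Hypothetical tokenized news snippets.
docs_tokens = [
    ["the", "election", "vote", "result"],
    ["the", "match", "goal", "result"],
    ["the", "election", "campaign"],
    ["the", "match", "score"],
]
kept = select_by_document_frequency(docs_tokens)
# kept == ['election', 'match', 'result']
```

Here "the" is dropped because it appears in every document, and the terms that occur only once are dropped as too rare.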

Page 11: Text mining

5. Data Mining / Pattern Discovery

• The text mining process merges with the traditional data mining process.
• Classic data mining techniques are used on the structured database that resulted from the previous stages.
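To make this stage concrete, the sketch below applies one classic technique, nearest-neighbour classification with cosine similarity, to a tiny structured table of term counts. The technique is chosen purely for illustration; the column meanings, rows, labels and function names are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest_label(query, rows, labels):
    """Return the label of the stored row most similar to the query."""
    best = max(range(len(rows)), key=lambda i: cosine(query, rows[i]))
    return labels[best]

# Hypothetical structured output of the earlier stages.
# Columns: counts of "election", "match", "result".
rows = [[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 1, 0]]
labels = ["politics", "sport", "politics", "sport"]

predicted = nearest_label([2, 0, 1], rows, labels)
# predicted == "politics"
```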

6. Interpretation & Evaluation

What to do next?
• Terminate
• Iterate

Page 12: Text mining

How an Algorithm Is Derived for Text Mining

Page 13: Text mining

Text Analysis for Google Sheets

• Perform sentiment analysis.
• Extract mentions of entities and concepts.
• Summarize long chunks of text.
• Detect the language of a document.
• Find the best hashtags.
• Extract the full text of an article, as well as its author name, embedded media, etc.

Page 14: Text mining

Conclusion

Text mining generally consists of the analysis of (multiple) text documents by extracting key phrases, concepts, matches etc. and the preparation of the text processed in that manner for further analyses with numeric data mining techniques.


