Science: Text and Language Dr Andy Evans

Science: Text and Language Dr Andy Evans. Text analysis Processing of text. Natural language processing and statistics.

Dec 24, 2015


Colleen Webster
Page 1: Science: Text and Language Dr Andy Evans. Text analysis Processing of text. Natural language processing and statistics.

Science: Text and Language

Dr Andy Evans

Page 2:

Text analysis

Processing of text.

Natural language processing and statistics.

Page 3:

Processing text: Regex

Java Regular Expressions: java.util.regex

Regular expressions: powerful search, compare (and replace) tools.

(Other types of regex include direct replace options – in Java regex these are separate methods.)

Page 4:

Regex

Standard Java:

if ((email.indexOf("@") > 0) &&
    (email.endsWith(".org"))) {
    return true;
}

Regex version:

if (email.matches("[A-Za-z]+@[A-Za-z]+\\.org")) return true;
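The two checks are not equivalent: the regex version is stricter about what may appear around the "@". A minimal sketch of the difference follows; the class name, method names and sample addresses are made up for illustration, and neither check is a complete email validator.

```java
class EmailCheck {
    // Plain String methods: an '@' somewhere after the first character,
    // and a ".org" ending. Anything else is allowed in between.
    static boolean looseCheck(String email) {
        return email.indexOf("@") > 0 && email.endsWith(".org");
    }

    // Regex version: letters, one '@', letters, then a literal ".org".
    static boolean regexCheck(String email) {
        return email.matches("[A-Za-z]+@[A-Za-z]+\\.org");
    }

    public static void main(String[] args) {
        // Hypothetical sample addresses.
        String[] samples = {"andy@example.org", "a@@b.org", "andy@example.com"};
        for (String s : samples) {
            System.out.println(s + "  loose=" + looseCheck(s) + "  regex=" + regexCheck(s));
        }
    }
}
```

Here "a@@b.org" passes the plain String check but fails the regex, because the regex allows only letters on either side of a single "@".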

Page 5:

Example components

[abc]            a, b, or c (simple class)
[^abc]           Any character except a, b, or c (negation)
[a-zA-Z]         a through z, or A through Z, inclusive (range)
[a-d[m-p]]       a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]     d, e, or f (intersection)
[a-z&&[^bc]]     a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]    a through z, and not m through p: [a-lq-z] (subtraction)

.                Any character (may or may not match line terminators)
\d               A digit: [0-9]
\D               A non-digit: [^0-9]
\s               A whitespace character: [ \t\n\x0B\f\r]
\S               A non-whitespace character: [^\s]
\w               A word character: [a-zA-Z_0-9]
\W               A non-word character: [^\w]

?                Once or not at all
*                Zero or more times
+                One or more times
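A few of these components can be tried directly with String.matches, which tests whether the whole string matches the pattern. This is a small sketch; the class name, helper method and test strings (such as "colou?r") are invented for illustration.

```java
class RegexComponents {
    // True if the WHOLE input matches the regex (String.matches semantics).
    static boolean matches(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(matches("[abc]", "b"));                    // simple class
        System.out.println(matches("[^abc]", "b"));                   // negation
        System.out.println(matches("[a-z&&[^bc]]", "b"));             // subtraction excludes b
        System.out.println(matches("[a-z&&[^bc]]", "d"));             // but still allows d
        System.out.println(matches("\\d+", "2015"));                  // one or more digits
        System.out.println(matches("\\w+\\s\\w+", "text analysis"));  // word, space, word
        System.out.println(matches("colou?r", "color"));              // ? = once or not at all
        System.out.println(matches("ab*", "a"));                      // * = zero or more
        System.out.println(matches("ab+", "a"));                      // + = one or more
    }
}
```

Note that inside a Java string literal each backslash has to be doubled, so the regex \d is written "\\d".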

Page 6:

Matching

Find all words that start with a number.

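import java.util.regex.Matcher;
import java.util.regex.Pattern;

// \b = word boundary, \d = a digit, \w* = the rest of the word
Pattern p = Pattern.compile("\\b\\d\\w*");
Matcher m = p.matcher(stringToSearch);
while (m.find()) {
    String temp = m.group();
    System.out.println(temp);
}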
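As a self-contained sketch of the find/group loop, the version below collects the matches into a list. It assumes a "word" means a run of \w characters and uses the pattern \b\d\w* (word boundary, digit, rest of the word); the class name, method name and sample sentence are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DigitWords {
    // Compile once and reuse: \b word boundary, \d leading digit, \w* rest of word.
    private static final Pattern STARTS_WITH_DIGIT = Pattern.compile("\\b\\d\\w*");

    // Returns every word in the text whose first character is a digit.
    static List<String> find(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = STARTS_WITH_DIGIT.matcher(text);
        while (m.find()) {          // advance to the next match, if any
            found.add(m.group());   // the text of the current match
        }
        return found;
    }

    public static void main(String[] args) {
        // "B12" is skipped: there is no word boundary between 'B' and '1'.
        System.out.println(find("The 3rd talk is on 24 Dec, room B12."));
        // prints [3rd, 24]
    }
}
```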

Page 7:

Replacing

replaceFirst(String regex, String replacement)

replaceAll(String regex, String replacement)
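Both are methods of java.lang.String and return a new string, leaving the original unchanged. A short sketch follows, with made-up class name, method names and sample text; the last method uses $1 and $2 in the replacement to refer back to captured groups.

```java
class ReplaceDemo {
    // Replace only the first run of digits.
    static String maskFirstNumber(String s) {
        return s.replaceFirst("\\d+", "#");
    }

    // Replace every run of digits.
    static String maskAllNumbers(String s) {
        return s.replaceAll("\\d+", "#");
    }

    // Swap two space-separated words using capture groups.
    static String swapWords(String s) {
        return s.replaceAll("(\\w+) (\\w+)", "$2 $1");
    }

    public static void main(String[] args) {
        String text = "Room 12, floor 3";
        System.out.println(maskFirstNumber(text)); // Room #, floor 3
        System.out.println(maskAllNumbers(text));  // Room #, floor #
        System.out.println(swapWords("Andy Evans")); // Evans Andy
    }
}
```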

Page 8:

Regex

A good start is the tutorial at: http://docs.oracle.com/javase/tutorial/essential/regex/

Also Mehran Habibi’s Java Regular Expressions.

Page 9:

Natural Language Processing

A large part is Part of Speech (POS) Tagging: marking up text into nouns, verbs, etc., usually based on the location in the text and other context rules.

Taggers often formulate these rules using machine learning (of various kinds), training the program on corpora of marked-up text.

Used for:
Text understanding.
Knowledge capture and use.
Text forensics.
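To make the idea of a context rule concrete, here is a deliberately crude toy tagger. Everything in it is an assumption for illustration: the rules are hand-written rather than learned from a corpus, the tag names (DET, TO, NOUN, VERB, UNK) are made up, and it is not how any real tagger or library works. It only shows that the same word can get a different tag depending on what precedes it.

```java
import java.util.ArrayList;
import java.util.List;

class ToyTagger {
    // Tags each word using only the word itself and the word before it.
    static List<String> tag(String sentence) {
        String[] words = sentence.toLowerCase().split("\\s+");
        List<String> tagged = new ArrayList<>();
        String previous = "";
        for (String word : words) {
            String tag;
            if (word.equals("the") || word.equals("a")) {
                tag = "DET";                 // determiner
            } else if (word.equals("to")) {
                tag = "TO";
            } else if (previous.equals("the") || previous.equals("a")) {
                tag = "NOUN";                // crude rule: word after a determiner
            } else if (previous.equals("to")) {
                tag = "VERB";                // crude rule: word after "to"
            } else {
                tag = "UNK";                 // no rule applies
            }
            tagged.add(word + "/" + tag);
            previous = word;
        }
        return tagged;
    }

    public static void main(String[] args) {
        System.out.println(tag("the run"));        // [the/DET, run/NOUN]
        System.out.println(tag("I like to run"));  // [i/UNK, like/UNK, to/TO, run/VERB]
    }
}
```

The rules fail quickly on real text ("the big dog" would tag "big" as a noun), which is one reason such rules are usually learned from marked-up corpora rather than written by hand.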

Page 10:

NLP Libraries

Popular are:

Natural Language Toolkit (NLTK; Python): http://www.nltk.org/

OpenNLP (Java): http://opennlp.apache.org/index.html

Page 11:

OpenNLP

Sentence recognition and tokenising.
Name extraction (including placenames).
POS Tagging.
Text classification.

For clear examples, see the manual at: http://opennlp.apache.org/documentation.html
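To give a feel for what sentence recognition and tokenising produce, the sketch below uses only the standard library, not OpenNLP: java.text.BreakIterator for sentence boundaries and a simple regex for tokens. It is a rough stand-in under those assumptions (OpenNLP uses trained models and will handle abbreviations and similar cases better); the class and method names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SentenceSplit {
    // A token is either a run of word characters or a single punctuation mark.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    // Splits text into sentences using the JDK's rule-based sentence iterator.
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end).trim());
        }
        return out;
    }

    // Splits one sentence into word and punctuation tokens.
    static List<String> tokens(String sentence) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(sentence);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : sentences("Text analysis is fun. It uses regex!")) {
            System.out.println(s + " -> " + tokens(s));
        }
    }
}
```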

Page 12:

Other info

Other than the Numerical Recipes books, the other classic texts are the volumes of Donald E. Knuth’s The Art of Computer Programming:

Fundamental Algorithms
Seminumerical Algorithms
Sorting and Searching
Combinatorial Algorithms

But at this stage, you’re better off getting…

Page 13:

Other info

Michael T. Goodrich and Roberto Tamassia’s Data Structures and Algorithms in Java.

Basic Java, arrays and lists.
Recursion in algorithms.
Key mathematical algorithms.
Algorithm analysis.
Data storage structures (stacks, queues, hashtables, binary trees, etc.)
Search and sort.
Text processing.
Graph/network analysis.
Memory management.