Science: Text and Language Dr Andy Evans
Processing text: Regex
Java Regular Expressions: java.util.regex
Regular expressions: powerful search, compare (and replace) tools.
(Other types of regex include direct replace options; in Java regex these are separate methods.)
Regex
Standard Java:
if ((email.indexOf("@") > 0) &&
    (email.endsWith(".org"))) {
    return true;
}
Regex version:
if (email.matches("[A-Za-z]+@[A-Za-z]+\\.org")) return true;
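The two checks are not equivalent, which a small side-by-side sketch can make visible. The class name, method names and sample addresses below are invented for illustration; only indexOf, endsWith and matches come from the slide.

```java
public class EmailCheck {

    // Plain String methods: an '@' somewhere after the first character, and a ".org" ending.
    static boolean plainCheck(String email) {
        return email.indexOf("@") > 0 && email.endsWith(".org");
    }

    // Regex: letters, '@', letters, then a literal ".org".
    // String.matches() only succeeds if the whole string fits the pattern.
    static boolean regexCheck(String email) {
        return email.matches("[A-Za-z]+@[A-Za-z]+\\.org");
    }

    public static void main(String[] args) {
        // Hypothetical sample inputs.
        String[] samples = {"andy@example.org", "a@@b.org", "andy@example.com"};
        for (String s : samples) {
            System.out.println(s + " plain=" + plainCheck(s) + " regex=" + regexCheck(s));
        }
    }
}
```

The regex version is stricter: it rejects "a@@b.org", which the plain version accepts. Neither is a full email validator; both only illustrate the technique.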
Example components
[abc]          a, b, or c (simple class)
[^abc]         Any character except a, b, or c (negation)
[a-zA-Z]       a through z, or A through Z, inclusive (range)
[a-d[m-p]]     a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]   d, e, or f (intersection)
[a-z&&[^bc]]   a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]  a through z, and not m through p: [a-lq-z] (subtraction)
.              Any character (may or may not match line terminators)
\d             A digit: [0-9]
\D             A non-digit: [^0-9]
\s             A whitespace character: [ \t\n\x0B\f\r]
\S             A non-whitespace character: [^\s]
\w             A word character: [a-zA-Z_0-9]
\W             A non-word character: [^\w]
?              Once or not at all
*              Zero or more times
+              One or more times
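A few of these components can be tried out directly with String.matches. This is a minimal sketch; the class name, the helper method and the test strings are invented for illustration. Note that each backslash is doubled inside a Java string literal.

```java
public class ComponentDemo {

    // True when the whole input fits the regex (String.matches tests the entire string).
    static boolean whole(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(whole("[a-d[m-p]]", "n"));      // union: n is in m-p
        System.out.println(whole("[a-z&&[^bc]]", "b"));    // subtraction: b is excluded
        System.out.println(whole("\\d+", "2024"));         // one or more digits
        System.out.println(whole("\\d*", ""));             // zero or more digits
        System.out.println(whole("colou?r", "color"));     // u once or not at all
        System.out.println(whole("\\w+\\s\\w+", "two words")); // word, whitespace, word
    }
}
```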
Matching
Find all words that start with a number.
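// \b = word boundary, \d = a leading digit, \w* = the rest of the word
Pattern p = Pattern.compile("\\b\\d\\w*");
Matcher m = p.matcher(stringToSearch);
// find() moves to the next match; group() returns the matched text
while (m.find()) {
    String temp = m.group();
    System.out.println(temp);
}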
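For reference, a self-contained version of the find loop might look like the sketch below. The pattern is one possible reading of "words that start with a number" (a word boundary, a digit, then any further word characters); the class name, method name and sample sentence are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumberWords {

    // Assumed pattern: word boundary, one digit, then zero or more word characters.
    private static final Pattern STARTS_WITH_DIGIT = Pattern.compile("\\b\\d\\w*");

    // Collects every match in the text, in order of appearance.
    static List<String> find(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = STARTS_WITH_DIGIT.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical sample sentence.
        System.out.println(find("Meet at 9am in room 42b, not b42."));
    }
}
```

Compiling the Pattern once and reusing it avoids re-parsing the expression on every search, which matters when the same pattern is applied to many strings.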
Replacing
replaceFirst(String regex, String replacement)
replaceAll(String regex, String replacement)
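Both are methods of String. A minimal sketch of how they differ follows; the class name, method names and sample strings are invented for illustration. In the replacement string, $1, $2 and so on refer back to bracketed groups in the regex.

```java
public class ReplaceDemo {

    // Replaces only the first run of digits.
    static String first(String s) {
        return s.replaceFirst("\\d+", "#");
    }

    // Replaces every run of digits.
    static String all(String s) {
        return s.replaceAll("\\d+", "#");
    }

    // Swaps the two captured groups either side of an '@'.
    static String swap(String s) {
        return s.replaceAll("(\\w+)@(\\w+)", "$2@$1");
    }

    public static void main(String[] args) {
        System.out.println(first("room 12, floor 3"));
        System.out.println(all("room 12, floor 3"));
        System.out.println(swap("andy@leeds"));
    }
}
```

Strings are immutable, so both methods return a new String and leave the original unchanged.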
Regex
A good start is the tutorial at: http://docs.oracle.com/javase/tutorial/essential/regex/
Also Mehran Habibi’s Java Regular Expressions.
Natural Language Processing
A large part is Part of Speech (POS) Tagging: marking up text into nouns, verbs, etc., usually based on the location in the text and other context rules.
These rules are often formulated using machine learning (of various kinds), training the program on corpora of marked-up text.
Used for:
Text understanding.
Knowledge capture and use.
Text forensics.
NLP Libraries
Popular are:
Natural Language Toolkit (NLTK; Python): http://www.nltk.org/
OpenNLP (Java): http://opennlp.apache.org/index.html
OpenNLP
Sentence recognition and tokenising.
Name extraction (including placenames).
POS Tagging.
Text classification.
For clear examples, see the manual at: http://opennlp.apache.org/documentation.html
Other info
Other than the Numerical Recipes books, the other classic texts are Donald E. Knuth’s The Art of Computer Programming:
Fundamental Algorithms
Seminumerical Algorithms
Sorting and Searching
Combinatorial Algorithms
But at this stage, you’re better off getting…
Other info
Michael T. Goodrich and Roberto Tamassia’s Data Structures and Algorithms in Java.
Basic Java, arrays and lists.
Recursion in algorithms.
Key mathematical algorithms.
Algorithm analysis.
Data storage structures (stacks, queues, hashtables, binary trees, etc.)
Search and sort.
Text processing.
Graph/network analysis.
Memory management.