More Xkwic and Tgrep
LING 5200 Computational Corpus Linguistics
Martha Palmer
March 2, 2006
LING 5200, 2006, based on Kevin Cohen’s LING 5200
Resources – Laura is bugging me to make a CU Corpora page… Like this
http://www.stanford.edu/dept/linguistics/corpora/cas-home.html
TGREP http://www.stanford.edu/dept/linguistics/corpora/cas-tut-tgrep.html
Searching with pos tags and !
[word = "[tT]he" & !( pos = "DT" ) ]; wsj
[ !(word = "water" | pos = "NN")];
[ !(word = "water") & !( pos = "NN")];
[ word != "water" & pos != "NN" ];
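By De Morgan's laws, the three negated queries above select the same tokens. A minimal Python sketch of that equivalence, assuming tokens are plain (word, pos) pairs; the token list is invented for illustration and this is not CQP itself:

```python
# Hypothetical (word, pos) tokens, not taken from any corpus.
tokens = [("water", "NN"), ("water", "VB"), ("rain", "NN"), ("the", "DT")]

def q1(word, pos):
    # [ !(word = "water" | pos = "NN") ]
    return not (word == "water" or pos == "NN")

def q2(word, pos):
    # [ !(word = "water") & !(pos = "NN") ]
    return (not word == "water") and (not pos == "NN")

def q3(word, pos):
    # [ word != "water" & pos != "NN" ]
    return word != "water" and pos != "NN"

# Tokens that are neither the word "water" nor tagged NN.
matches = [t for t in tokens if q1(*t)]
print(matches)
```

Only tokens that fail both conditions survive, so a token like ("water", "VB") is excluded even though its tag is not NN.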
Operator precedence
The precedence of the (logical) operators is defined by the following list: if operator x is listed before operator y, operator x takes precedence over y. Operators are evaluated left to right.
=, !=, !, &, |
[ ! word = "water" & ! pos = "NN" ];
disambiguates as
[ !(word = "water") & !( pos = "NN")];
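As an analogy only: Python's own operators (comparison, then not, then and, then or) happen to follow the same ordering as the list above, so its parser can illustrate the disambiguation. This sketch says nothing about how CQP is implemented; the variable names are made up.

```python
import ast

# The query without parentheses, written with Python operators.
unparenthesized = 'not word == "water" and not pos == "NN"'
# The reading given on the slide.
slide_reading = '(not (word == "water")) and (not (pos == "NN"))'
# A different grouping, for contrast.
other_reading = 'not (word == "water" and not pos == "NN")'

def tree(expr):
    # Parentheses leave no trace in the syntax tree, only grouping does.
    return ast.dump(ast.parse(expr, mode="eval"))

same_as_slide = tree(unparenthesized) == tree(slide_reading)
same_as_other = tree(unparenthesized) == tree(other_reading)
print(same_as_slide, same_as_other)
```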
Searching sequences with | and ?
"Bill" [pos = "NP"];
[pos = "NP"] [pos = "NP"] [pos = "NP"];
([pos = "NP"] [pos = "NP"]) | ([pos = "NP"] "of" [pos = "NP"]);
([pos = "NP"] "of"? [pos = "NP"]);
Note: First match applies
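The optional-token query can be approximated with an ordinary regular expression over tagged text. A rough sketch, assuming a word/POS encoding and an invented example string; it mimics the idea of "of"? and is not a CQP implementation:

```python
import re

# Hypothetical tagged text in word/POS form; "NP" is the proper-noun tag
# used in the queries above.
tagged = "Bank/NP of/IN America/NP hired/VBD Bill/NP Clinton/NP"

# Rough analogue of ([pos = "NP"] "of"? [pos = "NP"]):
# a proper noun, an optional "of", then another proper noun.
pattern = re.compile(r"(\S+)/NP (?:of/\S+ )?(\S+)/NP")

# Each hit is the pair of proper nouns, with or without "of" between them.
pairs = pattern.findall(tagged)
print(pairs)
```

One pattern with an optional element covers both branches of the | query.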
Corpus Position: wild cards and contexts
"give" []* "up";
"give" []{0,5} "up";
"give" []* "up" within 7;
"Clinton" expand to 5;
"Clinton" expand left to 5;
"Clinton" expand right to 5;
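What a bounded gap and a context expansion compute can be sketched over a plain token list. The function names find_gapped and expand are made up for this illustration, and stopping at the nearest "up" is an assumption about the matching strategy:

```python
def find_gapped(tokens, first, last, max_gap):
    """Return (start, end) index pairs where `first` is followed by `last`
    with at most `max_gap` tokens in between (nearest `last` only)."""
    spans = []
    for i, tok in enumerate(tokens):
        if tok != first:
            continue
        # Candidates for `last` lie 1 .. max_gap + 1 positions to the right.
        for j in range(i + 1, min(i + max_gap + 2, len(tokens))):
            if tokens[j] == last:
                spans.append((i, j))
                break
    return spans

def expand(span, left, right, n_tokens):
    """Widen a span by `left`/`right` tokens, clipped to the text."""
    start, end = span
    return (max(0, start - left), min(n_tokens - 1, end + right))

# Invented example sentence.
tokens = "they will not give the whole plan up without a fight".split()
spans = find_gapped(tokens, "give", "up", 5)  # like "give" []{0,5} "up"
print(spans, expand(spans[0], 2, 2, len(tokens)))
```

Here "give" and "up" are three tokens apart, so a maximum gap of 5 matches while a maximum gap of 2 does not.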
Assignments and Intersect
Q1 = "rain";
Q2 = [pos="NN"];
intersect Q1 Q2;
Q1 = [pos = "JJ"] [pos = "NN"];
Q2 = "acid" "rain";
intersect Q1 Q2;
[word = "acid" & pos = "JJ"] [word = "rain" & pos = "NN"]
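Intersecting two named query results amounts to intersecting their sets of match positions, which is why the last query above gives the same hits. A small sketch with invented tokens and a hypothetical helper match_pairs:

```python
# Hypothetical tagged tokens; tags follow the slide (JJ adjective, NN noun).
tokens = [("acid", "JJ"), ("rain", "NN"), ("heavy", "JJ"), ("rain", "NN"),
          ("acid", "NN"), ("rain", "VB")]

def match_pairs(tokens, test_first, test_second):
    """Set of (start, end) spans where two adjacent tokens pass the tests."""
    return {
        (i, i + 1)
        for i in range(len(tokens) - 1)
        if test_first(tokens[i]) and test_second(tokens[i + 1])
    }

# Q1 = [pos = "JJ"] [pos = "NN"];
q1 = match_pairs(tokens, lambda t: t[1] == "JJ", lambda t: t[1] == "NN")
# Q2 = "acid" "rain";
q2 = match_pairs(tokens, lambda t: t[0] == "acid", lambda t: t[0] == "rain")
# intersect Q1 Q2;
both = q1 & q2
# The single combined query with & inside each token.
combined = match_pairs(
    tokens,
    lambda t: t == ("acid", "JJ"),
    lambda t: t == ("rain", "NN"),
)
print(both, combined)
```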
Structural restrictions
"give" []* "up" within s;
("gain" []* "profit") | ("profit" []* "gain") within 3 s;
("gain" []* "profit") | ("profit" []* "gain") within article;
"Clinton" expand left to 2 s;
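The effect of "within s" is that a match may not cross a sentence boundary. A minimal sketch, assuming the text is already split into sentences; the helper name and the example sentences are invented:

```python
def within_sentence(sentences, first, last):
    """Return (sentence_index, start, end) for `first` ... `last` matches
    that stay inside a single sentence."""
    hits = []
    for s_idx, sent in enumerate(sentences):
        for i, tok in enumerate(sent):
            if tok != first:
                continue
            # Look only as far as the end of the current sentence.
            for j in range(i + 1, len(sent)):
                if sent[j] == last:
                    hits.append((s_idx, i, j))
                    break
    return hits

sentences = [
    "do not give it up".split(),
    "I give".split(),
    "up the hill we went".split(),
]
hits = within_sentence(sentences, "give", "up")
print(hits)
```

The "give" at the end of the second sentence and the "up" at the start of the third are adjacent in the running text but are not reported, because they sit in different sentences.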
Defining structural restrictions
Nounphrase = [pos = "DT"] [pos = "JJ"] [pos = "NN"];
Nounphrase;
[pos = "JJ"]
Go back to select
For fun
<s> [pos = "V.*"][pos = "PN.*"] </s>
<s> []* [pos = "V.*"][pos = "PN.*"] </s>
( [pos = "V.*"] [pos = "PN.*"]) within s
Not a question, not beginning of sentence…
less is more
less <filename>
cat ??/* | less
Switches
SPACE – next screenful
b – previous screenful
/<reg exp pattern> – search for pattern, e.g. /RNR
?<reg exp pattern> – search backwards for pattern
q – quit
Searching for a word
tgrep Halloween – what happens?
Why don’t you have to specify a file?
babel> grep tgrep .cshrc
# tgrep stuff
#setenv TGREP_CORPUS /corpora/treebank2/tbl_075/tgrepabl/brwn_cmb.crp
setenv TGREP_CORPUS /corpora/treebank2/tgrepabl/wsj_mrg.crp
Count results:
tgrep research | wc -l
cat ??/* | grep Halloween | wc -l
Tgrep Switches
-a Match on all patterns in a sentence
-w Return the whole sentence
-n Put the entire string on one line
-t Print only the terminals
Viewing it in sentential context
tgrep -wn Halloween | more
tgrep -wn research | more (20,865 hits)
Can also use less
Viewing it in sentential context
tgrep -wn research | more
Searching by POS
tgrep NNS | more
Another way to do your sanity check
See more data?
tgrep NNS | grep . | more
Sentential context (again)
tgrep -wn NNS | more
Searching by syntactic constituent
tgrep NP | more
Single-line outputs
tgrep -n NP | more
Viewing tree-like output
tgrep -w NP | head -20
Searching for relations between nodes
tgrep 'NP < CC' | head -16
tgrep -g (whole language)
A < B – A immediately dominates B
A > B – A is immediately dominated by B
A << B – A dominates B
A >> B – A is dominated by B
A . B – A immediately precedes B
A .. B – A precedes B
A<<,B – B is the leftmost descendant of A
A<<‘B – B is the rightmost descendant of A
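The two dominance relations can be made concrete on a small tree. A sketch assuming trees are nested tuples of the form (label, child, ...); the representation, the helper names, and the example parse are all invented here and are unrelated to how tgrep stores its corpus:

```python
# A tree is (label, child, child, ...); a leaf is a plain string.
def label(node):
    return node if isinstance(node, str) else node[0]

def children(node):
    return [] if isinstance(node, str) else list(node[1:])

def subtrees(node):
    """Yield the node and every node below it, top-down, left to right."""
    yield node
    for child in children(node):
        yield from subtrees(child)

def immediately_dominates(tree, a, b):
    """A < B: some node labelled `a` has a child labelled `b`."""
    return any(
        label(n) == a and any(label(c) == b for c in children(n))
        for n in subtrees(tree)
    )

def dominates(tree, a, b):
    """A << B: some node labelled `a` has a descendant labelled `b`."""
    for n in subtrees(tree):
        if label(n) != a:
            continue
        for child in children(n):
            if any(label(d) == b for d in subtrees(child)):
                return True
    return False

# Hypothetical parse of "cats and dogs sleep".
tree = ("S",
        ("NP", ("NNS", "cats"), ("CC", "and"), ("NNS", "dogs")),
        ("VP", ("VBP", "sleep")))

print(immediately_dominates(tree, "NP", "CC"), dominates(tree, "S", "CC"))
```

In this tree S dominates CC without immediately dominating it, which is the difference between << and <.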
Alternation
Node names can be ORed, e.g.
tgrep 'Clinton|Gore' | head
Character classes
Regular expressions
tgrep '/[Cc]hild/' | egrep . | head
Working towards that weird example…
tgrep '/[Pp]resident/' | head
Combining alternation and a regular expression
tgrep 'Clinton|Gore|/[Pp]resident/' | head
Searching for a transitive verb
tgrep -w 'VP << like < NP << DT' | more
Verbs + Particles
tgrep -w 'VP << kick' > kick
tgrep 'VP << /kick.*/ <2 PRT' kick
tgrep 'VP <1 VB <2 PRT' kick
tgrep -nw 'VP <1 /VB.*/ <2 PRT' kick
tgrep 'VP <1 (VB < kick) <2 PRT' kick
tgrep 'VP <1 (/VB.*/ < kick) <2 PRT' kick
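The last pattern asks for a VP whose first child is a verb node directly over "kick" and whose second child is PRT. A sketch of that check on a nested-tuple tree; the tree format, helper names, and example parses are invented, and treating /VB.*/ as a match on the whole label is an assumption:

```python
import re

# A tree is (label, child, child, ...); a leaf is a plain string.
def label(node):
    return node if isinstance(node, str) else node[0]

def children(node):
    return [] if isinstance(node, str) else list(node[1:])

def subtrees(node):
    yield node
    for child in children(node):
        yield from subtrees(child)

def verb_particle_vps(tree, verb):
    """VPs whose first child is a /VB.*/ node directly over `verb`
    and whose second child is PRT."""
    found = []
    for n in subtrees(tree):
        kids = children(n)
        if label(n) != "VP" or len(kids) < 2:
            continue
        first, second = kids[0], kids[1]
        is_verb = re.fullmatch(r"VB.*", label(first)) and verb in [
            label(c) for c in children(first)
        ]
        if is_verb and label(second) == "PRT":
            found.append(n)
    return found

# Hypothetical parses: "will kick off the season" and "will kick the ball".
with_particle = ("VP", ("MD", "will"),
                 ("VP", ("VB", "kick"), ("PRT", ("RP", "off")),
                  ("NP", ("DT", "the"), ("NN", "season"))))
without_particle = ("VP", ("MD", "will"),
                    ("VP", ("VB", "kick"),
                     ("NP", ("DT", "the"), ("NN", "ball"))))

hits = verb_particle_vps(with_particle, "kick")
print(len(hits), len(verb_particle_vps(without_particle, "kick")))
```

Only the inner VP of the first tree qualifies; the outer VP starts with MD, and the second tree has an NP where the particle would be.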