Distill structured data from unstructured and semi-structured text
Exploit the extracted data in your applications
For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
Richard Stallman, founder of the Free Software Foundation, countered saying…
Name              Title    Organization
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  Founder  Free Soft..
Cascading Grammars By Example
[Figure: one sample document shown at successive stages of annotation; filler text omitted]
– Raw text: … sed John Smith at 555-1212 hendrerit …
– Name or Phone recognized on its own: … <Name> at 555-1212 … and … John Smith at <Phone> …
– Both recognized: … sed <Name> at <Phone> hendrerit …
– After the rule below: … sed <PersonPhone>, hendrerit …
Name Token[~ "at"] Phone → PersonPhone
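The rule above can be sketched in plain Java to make the cascade concrete. This is only an illustration, not CPSL or any real engine: the `Annotation` class, the `applyPersonPhoneRule` method, and the reading of `Token[~ "at"]` as "exactly the token at between the two annotations" are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

class CascadeSketch {
    // Hypothetical annotation type: a labelled character range [begin, end) in the text.
    static final class Annotation {
        final String type;
        final int begin;
        final int end;

        Annotation(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    // Applies  Name Token[~ "at"] Phone -> PersonPhone  to the previous level's output.
    // `lower` must be sorted by begin offset. A match needs a Name directly followed
    // by a Phone, with exactly the token "at" between them.
    static List<Annotation> applyPersonPhoneRule(String text, List<Annotation> lower) {
        List<Annotation> out = new ArrayList<>();
        for (int i = 0; i + 1 < lower.size(); i++) {
            Annotation left = lower.get(i);
            Annotation right = lower.get(i + 1);
            boolean typesMatch = left.type.equals("Name") && right.type.equals("Phone");
            if (!typesMatch || right.begin < left.end) {
                continue; // wrong types, or the two annotations overlap
            }
            String between = text.substring(left.end, right.begin).trim();
            if (between.equals("at")) {
                out.add(new Annotation("PersonPhone", left.begin, right.end));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "sed John Smith at 555-1212 hendrerit";
        int nameBegin = text.indexOf("John Smith");
        int phoneBegin = text.indexOf("555-1212");
        // Output of the lower level: one Name and one Phone annotation.
        List<Annotation> lower = new ArrayList<>();
        lower.add(new Annotation("Name", nameBegin, nameBegin + "John Smith".length()));
        lower.add(new Annotation("Phone", phoneBegin, phoneBegin + "555-1212".length()));
        for (Annotation a : applyPersonPhoneRule(text, lower)) {
            System.out.println(a.type + ": " + text.substring(a.begin, a.end));
        }
    }
}
```

Each level consumes only the annotations produced by the level below it, which is what makes the grammar "cascading": the PersonPhone rule never looks at raw characters except to check the token between its two inputs.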
Common Pattern Specification Language (CPSL)
[Figure: the same cascade drawn as three levels (Level 0, Level 1, Level 2), with the rule Name Token[~ "at"] Phone → PersonPhone turning <Name> at <Phone> into <PersonPhone>]
CPSL
– A standard language for specifying cascading grammars
– Created in 1998
Several known implementations
– TextPro: reference implementation of CPSL by Doug Appelt
– JAPE (Java Annotation Pattern Engine)
• Part of the GATE NLP framework
• Under active consideration for commercial use by several companies
[Sample document, filler text omitted] We went to a OTIS concert last Thursday. … John Pipe plays guitar … Marco Benevento on the Hammond organ. … It was SO MUCH FUN! …
Performance
create view PersonPhone as
select P.name as person, N.number as phone
from Person P, PhoneNumber N, Sentence S
where Follows(P.name, N.number, 0, 30)
  and Contains(S.sentence, P.name)
  and Contains(S.sentence, N.number)
  and ContainsRegex(/\b(phone|at)\b/, SpanBetween(P.name, N.number));
– Higher-priority rules in a level dominate lower-priority ones
– Complex interactions between rules
– Not enough information available in low-level rules
[Figure: two sentences with overlapping Person and Instrument candidate annotations]
– John Pipe plays the guitar: Person dominates Instrument
– Marco Benevento on the Hammond organ: Instrument dominates Person
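A small Java sketch may help show why a single fixed priority cannot handle both sentences. It is a toy model, not how CPSL engines resolve conflicts: the `Span` class, the `resolve` method, and the candidate spans (including treating "Pipe" and "Hammond" as dictionary hits) are assumptions based on my reading of the figure.

```java
import java.util.ArrayList;
import java.util.List;

class PrioritySketch {
    // Hypothetical candidate annotation: a typed character range [begin, end).
    static final class Span {
        final String type;
        final int begin;
        final int end;

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }

        boolean overlaps(Span other) {
            return begin < other.end && other.begin < end;
        }
    }

    // Builds a span of the given type over the first occurrence of phrase in text.
    static Span find(String type, String text, String phrase) {
        int begin = text.indexOf(phrase);
        return new Span(type, begin, begin + phrase.length());
    }

    // Fixed-priority resolution: a candidate is dropped when it overlaps a
    // candidate of the dominant type and is not of that type itself.
    static List<Span> resolve(List<Span> candidates, String dominantType) {
        List<Span> kept = new ArrayList<>();
        for (Span c : candidates) {
            boolean dominated = false;
            for (Span o : candidates) {
                if (o != c && o.type.equals(dominantType)
                        && !c.type.equals(dominantType) && c.overlaps(o)) {
                    dominated = true;
                    break;
                }
            }
            if (!dominated) {
                kept.add(c);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String s = "Marco Benevento on the Hammond organ";
        List<Span> candidates = new ArrayList<>();
        candidates.add(find("Person", s, "Marco Benevento"));
        candidates.add(find("Person", s, "Hammond"));
        candidates.add(find("Instrument", s, "Hammond organ"));
        // With Person dominant, the correct Instrument span is lost.
        for (Span k : resolve(candidates, "Person")) {
            System.out.println(k.type + ": " + s.substring(k.begin, k.end));
        }
    }
}
```

With Person as the dominant type, "John Pipe" survives but "Hammond organ" is discarded in favour of the spurious Person "Hammond"; with Instrument dominant the opposite happens. Whichever order is fixed for the level, one of the two sentences comes out wrong.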
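import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hand-coded Java version of the PersonPhone view. Span, Pair, text, and the
// Person, PhoneNum, and Sentence collections come from earlier extraction steps.
// The method name extractPersonPhone is illustrative; the slide does not show it.
Set<Pair<Span>> extractPersonPhone() {
  Pattern R = Pattern.compile("\\b(phone|at)\\b");  // compiled once, outside the loops
  Set<Pair<Span>> PersonPhoneCandidate = new HashSet<Pair<Span>>();
  for (Span P : Person) {
    for (Span N : PhoneNum) {
      if (Follows(P, N, 0, 30)) {
        String textBetween = text.substring(P.end, N.begin);
        if (R.matcher(textBetween).find()) {
          PersonPhoneCandidate.add(new Pair<Span>(P, N));
        }
      }
    }
  }
  // Keep only the candidates that lie inside a sentence.
  Set<Pair<Span>> PersonPhone = new HashSet<Pair<Span>>();
  for (Pair<Span> C : PersonPhoneCandidate) {
    for (Span S : Sentence) {
      if (S.contains(C)) {
        PersonPhone.add(C);
      }
    }
  }
  return PersonPhone;
}

// True if second begins between min and max characters after first ends.
boolean Follows(Span first, Span second, int min, int max) {
  int firstEnd = first.end;
  int secondBegin = second.begin;
  int distance = (secondBegin - firstEnd);
  return distance >= min && distance <= max;
}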
Names appear in widely varying contexts
– Mr. Dabrowski received a Bachelor degree…
– Dr. Jean L. Rouleau Dean of Medicine University…
– …met Peter and Katie Lawton who have…
– …lives in Riverdale, NY, with his wife Marie-Jeanne. He has two married sons, James and Michael.
– The Honorable Carol Boyd Hallett - Of Counsel…
– Kimberly Purdy Lloyd received a Bachelor of Science degree from the University of Texas…
Additional Challenges
– Avoiding person names inside or overlapping with other entities
• Organization, Address
– Lists of person names
• Attendees Ida White, Bridget McBean, Volker Hauck
Currently supports names from > 8 countries, including Israel
USAddress has well-defined pattern
– <StreetAddress> <SecondaryUnit>? <City> <State> <Zipcode>?
– 1515 Pioneer Drive Harrison, AR 72601
– 3607 Church Street, Suite 300 · Cincinnati, Ohio 45244
– 101 S. Webster Street . PO Box 7921 . Madison, Wisconsin 53707-7921
Challenges
– Multiple parts to the Address
– Some parts are optional (e.g., Secondary Unit, Zipcode)
– <City> cannot be identified using Dictionary due to resource restrictions
– Handling ambiguous abbreviations
• Ms, MA: in state names
• Dr., Row: street suffixes
Currently supports U.S. and German addresses
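The USAddress pattern above can be approximated with a single regular expression, which shows how the optional parts fit together. This is a rough sketch, not the annotator described in these slides: the class name, the group names, and the tiny suffix, unit, and state lists are assumptions chosen so that the three sample addresses match. The city is matched by shape (capitalized words before a comma) because no city dictionary is assumed.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UsAddressSketch {
    // Tiny illustrative dictionaries; a real annotator would need far larger ones.
    private static final String SUFFIX = "(?:Street|Drive|Avenue|Road)";
    private static final String STATE = "(?:AR|Ohio|Wisconsin)";
    // Separators seen in the sample addresses: space, comma, period, middle dot.
    private static final String SEP = "[ ,.\\u00B7]+";

    // <StreetAddress> <SecondaryUnit>? <City> <State> <Zipcode>?
    private static final Pattern ADDRESS = Pattern.compile(
        "(?<street>\\d+(?: [A-Z][A-Za-z.]*)+? " + SUFFIX + ")"
        + "(?:" + SEP + "(?<unit>(?:Suite|Apt|PO Box) \\d+))?"
        + SEP
        + "(?<city>[A-Z][a-z]+(?: [A-Z][a-z]+)*), "
        + "(?<state>" + STATE + ")"
        + "(?: (?<zip>\\d{5}(?:-\\d{4})?))?");

    // Returns the address parts of the first match; optional parts that are
    // absent map to null. Returns an empty Optional when nothing matches.
    static Optional<Map<String, String>> parse(String text) {
        Matcher m = ADDRESS.matcher(text);
        if (!m.find()) {
            return Optional.empty();
        }
        Map<String, String> parts = new LinkedHashMap<>();
        for (String name : new String[] {"street", "unit", "city", "state", "zip"}) {
            parts.put(name, m.group(name));
        }
        return Optional.of(parts);
    }

    public static void main(String[] args) {
        System.out.println(parse("1515 Pioneer Drive Harrison, AR 72601"));
        System.out.println(parse("101 S. Webster Street . PO Box 7921 . Madison, Wisconsin 53707-7921"));
    }
}
```

The sketch also hints at the listed challenges: the state and suffix lists are where ambiguous abbreviations such as "Dr." would cause false matches, and the shape-based city group will accept any capitalized phrase that happens to precede a state name.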
– is a graduate of Hofstra University
– He joined Interactive Data in 2003
– President of Foley & Lardnear LLP
– Received her B.S in English from University of Wisconsin
– The bill at the Savoy Hotel
Additional Challenges
– Long organization names (Q: where is the begin & end?)
• The Chartered Institute of Public Finance and Accountancy
– May contain list of person names
• Squar, Milner, Peterson, Miranda & Williamson, LLP
• John Ortiz, James & James Ltd
– Adjacent organization names
• University of Michigan Ross School of Business
– Multiple representation for the same organization & its subdivisions
• Enron, Enron Corp., Enron Corporation, Enron Metals & Commodity Corp.