Grammar-based treebank – a happy marriage of empiricism and theory?

Alexandr Rosen

Institute of Theoretical and Computational Linguistics
Faculty of Arts, Charles University, Prague

Grammar and Corpora 2012, 4th International Conference
Czech Academy of Sciences, Prague, 28–30 November 2012

A. Rosen (CU Prague) Grammar-based treebank: empiricism/theory Grammar & Corpora 2012 1 / 84


The bottom line (or two)

A corpus is an approximation of language use, a grammar is an approximation of language system.

The empirical and the theoretical sides of linguistics meet in the annotation of a corpus.


Outline of the talk

1 Why treebanks, why grammars?

2 Treebanks

3 Grammars

4 The grammar–treebank relationship

5 Czech treebanking

6 Architecture

7 Examples

8 Input processing

9 Conclusions and plans


Why treebanks, why grammars?


Why treebanks?

Treebank . . . a text corpus annotated (at least) with syntactic structure

= why corpora?

= why annotation?

= why syntax?

?= why grammars?


Why treebanks? (cont’d)

Explicit markup of syntactic relations (constituents, heads/dependents)

Easier to identify semantic relations (predicates and arguments)

Simplifies some queries

Simplifies extraction of lexical properties (valency)

Support for grammar development

Training data for NLP applications


Why grammars? 1/2

“Every time I fire a linguist, system performance goes up.”

Fred Jelinek, 1980s

But maybe we don’t care about system performance?

Moreover:
  No longer a wise strategy for NLP
  Empirical and symbolic methods can be combined
  ‘Deep’ linguistics needed for long-term success


Why grammars? 2/2

“We should probably all spend more time on the linguistic annotation of actual data rather than on writing grammar rules, based primarily on introspection.”

Erhard Hinrichs, 1990s

But what kind of annotation?
“A sentence has as many structures as there are theories.” [Haider(1993)]


Treebanks

First treebank: Lancaster-Leeds Treebank, early 1980s, 45 KW, later SUSANNE, due to Geoffrey Sampson

First major project: Penn Treebank, release 0.5 in 1992, now 3 MW

Now according to Wiki: 74 treebanks in about 40 languages

The 11th International Workshop on Treebanks and Linguistic Theories starts today, approx. 20 contributions each year


Treebanks differ in:

Size

Linguistic background

Format

Level of detail

Depth of analysis

Ways they are built

Also spoken, parallel, historical, ... treebanks


Treebanks around the world *)

63 treebanks, 36 languages, sizes up to 1.5 billion words

Also spoken (8), historical (7), parallel (4)

Mostly stochastically parsed and manually corrected

15 parsed by a symbolic grammar (LFG, HPSG, DCG) and manually disambiguated

39 PS-based annotation, 20 dependency-based annotation

15 available with multiple annotation formats – Penn Treebank: PS, P/A, dependency, LFG, HPSG, CCG, LTAG, PDT

20 with on-line search interface

*) The speaker’s time permitting!


More examples of treebanks

Prague Dependency Treebank – Czech: 1.5 MW
Tiger – German: 0.9 MW
LASSY – Dutch: 1500 MW
LinGO Redwoods – English: 45 KS
BulTreeBank – Bulgarian: 250 KW
INESS Treebanking Infrastructure – various: [Rosén et al.(2012)]
Składnica – Polish Constituency Treebank: 8 KS
...

Example trees (figures not reproduced):

PDT – analytical layer

PDT – tectogrammatical layer

Tiger

Old Church Slavonic (INESS)

Polish (Składnica)

Grammars


About grammars

Treebank grammars [Charniak(1996)]

Probabilistic grammars directly projected from treebanks

“a paradigm shift from the manually constructed, a priori fixed linguistic grammars” [Prescher et al.(2006)]

Annotation manuals

Symbolic (rule-based) grammars


The paradigm shift

Analytical, linguistic × empirical, data-driven

Analytical = analysis of linguistic competence

Poor coverage → discontinue ‘deep’ processing?


Anyone need grammars? (Stephan Oepen, TLT2) 1/2

The Ultimate Grammar
  Coverage of arbitrary data, cross-domain and cross-genre
  Adequate grammatical analyses in all cases
  Inclusion of semantics
  Fully declarative
  Same grammar for both parsing and generation
  High-efficiency processing tools

BUT:
  No generally accepted linguistic theory
  Long, tedious, error-prone engineering process
  Few experts


Anyone need grammars? (Stephan Oepen, TLT2) 2/2

The Final Treebank
  Representative data for ‘all’ of the language, domains, and genres
  Full annotation with (at least) syntactic and semantic information
  Utterly coherent
  Free of errors
  Fully documented
  Freely available

BUT:
  No generally accepted annotation standard
  Long, tedious, error-prone annotation process


The answer: grammars and treebanks should go together

Treebank annotation is where a grammar and a treebank can meet

Treebank annotation is also where multiple theories can meet and complement each other

Grammar and treebank are like two sides of a coin:
  competence × performance
  system × use
  langue × parole
  theoretical × empirical


The grammar–treebank relationship


Treebank – grammar/theory relations

A treebank is useful ...
  As a source and testbed for grammar/theory development [Hajičová & Sgall(2006)]
  As training data for treebank grammars and other NLP tools

A grammar/theory is useful ...
  To guide the design of an annotation scheme
  To control annotation consistency
  To generate treebank annotations


Linking lexicon and treebank

Theoretically motivated design
Start: independently compiled list of entries
Incremental development

Examples:

PDT-VALLEX [Hajič et al.(2003)]
FrameNet [Baker et al.(1998)]
PropBank [Palmer et al.(2005)]
TüBa-D/Z Valency Lexicon [Hinrichs & Telljohann(2009)]
...


Linking grammar and treebank

Grammar development should be supported by an annotated corpus
Automatic annotation by symbolic grammars requires a fully adequate grammar, ideally based on a corpus
Vicious circle? A possible answer: Incremental development of both the grammar and the treebank

Examples:

LinGO Redwoods [Oepen et al.(2002)]
Norgram [Rosén et al.(2006)]
BulTreeBank [Simov et al.(2002)]
Składnica [Świdziński & Woliński(2010)]
...


Rarely a single correct parse of a sentence

Symbolic grammars have limited access to context and world knowledge

They produce many parses due to morphosyntactic and structural ambiguities

Solutions
  Stochastic disambiguation
  Stochastic ranking
  Manual selection, preferably interactive, based on discriminants
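
Discriminant-based selection can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each parse is reduced to a set of (governor, function, dependent) triples. The toy sentence, the labels and the function names `discriminants` and `decide` are hypothetical and are not taken from any of the tools mentioned in the talk. A discriminant is then any property that holds in some, but not all, of the remaining parses, and each annotator decision filters the parse set.

```python
from typing import FrozenSet, List, Set, Tuple

# A parse is reduced to a set of (governor, function, dependent) triples.
Prop = Tuple[str, str, str]
Parse = FrozenSet[Prop]


def discriminants(parses: List[Parse]) -> Set[Prop]:
    """Return properties found in some, but not all, of the given parses."""
    if not parses:
        return set()
    union = set().union(*parses)
    common = set(parses[0]).intersection(*parses[1:])
    return union - common


def decide(parses: List[Parse], prop: Prop, accept: bool) -> List[Parse]:
    """Keep the parses that contain (accept) or lack (reject) the property."""
    return [p for p in parses if (prop in p) == accept]


# Invented toy input: two attachments of a prepositional phrase.
parses = [
    frozenset({("saw", "Obj", "man"), ("saw", "Advb", "with")}),
    frozenset({("saw", "Obj", "man"), ("man", "Attr", "with")}),
]

# The annotator accepts the attributive attachment.
remaining = decide(parses, ("man", "Attr", "with"), accept=True)
print(sorted(discriminants(parses)))
print(len(remaining), discriminants(remaining))
```

Once no discriminants remain, the surviving parses no longer differ in any recorded property, so the annotator never has to inspect whole trees.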


Never 100% coverage

A parsed corpus generated by a symbolic grammar will never reach 100% coverage of real-world data (LinGO: about 80%)

Reasons are fundamental: competence × performance

Some examples:
  anacoluthon
  contamination
  attraction
  zeugma
  some cases of extraction


Examples of suboptimal syntax

(1) Kdo  přijde  pozdě,  nic      mu   nedají.
    who  comes   late    nothing  him  not-give
    Who comes late won’t get anything. (intended)

(2) Včera      jsem  viděl  a    mluvil  s     tím   člověkem.
    yesterday  AUX   saw    and  spoke   with  that  man
    I saw and spoke to that person yesterday.

(3) Nebo  já  Gazda  nevím,    jak  diktuje.
    or    I   Gazda  not-know  how  dictates
    Or I don’t know how Gazda dictates. (intended; due to Jan Klaška)


Beyond grammar

How to find negative evidence in standard corpora?

Except for non-words, negative evidence is not easy to find in a corpus of written language

Much of ‘suboptimal’ language use in spoken and learner corpora

Grammar useful to detect ungrammaticality

A treebank of suboptimal German [Kepser et al.(2004)]

Phenomena-oriented corpus [Oliva(2008)]


Can we build a grammar-based treebank that includes real language?

Possible solutions?

A combination of stochastic + symbolic methods

Two grammars: positive and negative [Oliva & Petkevič(1998)]

Competence + performance grammar [Kempen & Harbusch(2001)]


Czech treebanking


The treebank of Czech

Prague Dependency Treebank

Dependency syntax, close to the Prague theory of Functional Generative Description [Sgall et al.(1986)]

3 annotation levels: morphology, surface syntax, deep syntax

PDT 0.5 – 1998, 0.5 MW

PDT 1 – 2000, 1.5 MW

PDT 2 – 2004, deep syntax

PDT 2.5 – 2011, multi-word units, clause segmentation


Time to scale up?

1.5 MW still too few for investigating less frequent forms and phenomena

Could offer more annotation formats

Could support inherent syntactic ambiguities

(4) Přinesl  bednu  ze    sklepa.
    brought  box    from  cellar
    He brought a box from the cellar

(5) krajíc  chleba  s     máslem
    slice   bread   with  butter
    a buttered slice of bread


A treebank for every taste

Theory-Supporting Treebank [Nivre(2003)]

Theory-neutral annotation contains too little information or too many compromises to be really useful

Theory-specific may shut out people from other research traditions

Conversion? But the source annotation often lacks information to support a completely accurate conversion.

Possible conversions as a requirement in the design of treebank annotation schemes. Different kinds of (theory-specific) annotation should be supported by an underlying internal representation.


A treebank for every taste

Multi-Representational Treebank [Xia et al.(2009)]

Definitional differences between phrase structure and dependency structure: convertible if designed properly.

Preferential differences – the same in both: empty categories; labels to edges; ordered or unordered trees.


Can a single core annotation be viewed in different ways?

Theory-specific representations have different appearances but share a large part of content: constituency/dependency, morphosyntactic categories, even the spirit of analyses of many phenomena

A treebank offering different views of a sufficiently expressive annotation scheme is a realistic goal

Additional benefit: relating linguistic theories


A larger treebank with customizable visualization?

Short-term goals:

Syntactic annotation of the Czech National Corpus (1.3 billion words) using a stochastic parser, followed by a rule-based correction module

Robust and expressive core annotation format, potentially underspecified

Customizable query, visualization and export interface, offering multiple options to view syntactic structure

Accessible to lay users and satisfying experts at the same time


Long-term goals:

Development of a corpus-based grammar

Options for queries, visualization and export: ready-made, tailored to specific theories, or defined by the user

Development of the correction module


The tasks of the grammar

Checking consistency

Adding more information on top of existing annotation

Assisting the treebank user

Helping to convert the data into other formats more easily

Helping to distinguish grammatical from suboptimal/ungrammatical forms and structures


Grammar design and development

Constraint-based: all is possible except when stipulated otherwise

Hand-crafted but verified against the corpus data

Incremental development, based on conversion rules

Underspecification, partial parses to cope with suboptimal/ungrammatical forms and constructions

Performance grammar as a mediator with the real-world language,similar to negative grammar?


Architecture


Syntactic structure

Internal skeleton structures: constituency-based, with a combination of binary and flat branching

Interpretable as constituency or dependency trees, according to users’ specification, visualized with an arbitrary amount of detail, not necessarily by tree graphs

Surface and deep structure encoded within a single structure: constituents are labelled as syntactic functions, including head as a special function

Heads are further specified as deep or surface
  Deep head: deep syntactic governor: bylo by se to povedlo
  Surface head: can be identical to the deep head or different: auxiliary, prepositions, subordinate conjunctions, numerals


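Dependency tree:
[PRED šel [AUXV byl] [AUXV by] [AUXP do [ADV lesa]]]

Constituency tree with surface heads (SHD) and deep heads (DHD):
[S [SHD by] [DHD [SHD byl] [DHD [HD šel] [ADVB [SHD do] [DHD lesa]]]]]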


Three levels

Word order and syntactic structure as distinct dimensions, each sentence is represented at three inter-linked levels:

graphemics (orthographic words, contractions)

morphology (syntactic words, including haplologized items)

syntax (trees, no nodes for pro-dropped subjects)


Annotation of syntactic phenomena

Agreement of various types

Compound periphrastic verbal forms (passives, conditional structures, future...)

Grammatical co-reference (grammatical control, relative/reflexive pronouns, predicative complements)

Multi-word units (collocations)


Expressive power
  Expressive enough to accommodate analyses of arbitrary granularity
  Ambiguous or undecidable phenomena represented by underspecification and distributive disjunction
  Annotation of any kind can be missing, a sentence may be a mere list of words

Specifications
  Annotation must be licensed by a formal grammar. Words and constituents have their appropriate (potentially underspecified) sets of features
  Lexicons are used to index forms, syntactic words and compound forms
  Customizable visualizations are enabled by formal definitions


Links within a tree
  Agreement
  Compound (multi-word) verbal predicates
  Grammatical coreference
  ...


Syntactic structure
  each nonterminal node is assigned a construction type and a syntactic function
  each terminal node is assigned a syntactic function


Hierarchy of construction types
  Headed
  UnHeaded
    Coord – coordination
    Adord – adordination
    Unspec – unspecified (for collocations and other)

Function for UnHeaded structures:
  Memb – a member


Syntactic functions for Headed
  SurfHead – surface head: auxiliary být/bývat, prepositions, subordinate conjunctions, numerals in quantified expressions: pět dětí
  DeepHead – in case it differs from SurfHead (head nouns in PPs, autosemantic verbs in analytical predicates...)
  Head – both SurfHead and DeepHead


Other syntactic functions for Headed
  Subj – subject
  Attr – attribute
  Obj-Advb
    Obj
    Advb
  VbAttr – predicative complement
  ReflTant – reflexive element (si, se) for inherent reflexives
  Deagent – deagentive reflexive
  Apos – apposition
  InDep – independent syntactic element (parenthesis, vocative syntactic noun...)


Examples


Treating contractions

(6) Ty   by     ses                 byl       ušpinil.
    you  would  REFL+AUX(2nd,sg)    be(pple)  get-dirty(pple)
    ‘You would have got dirty.’


Ty by ses byl ušpinil.

(7) [S [SUBJ ty] [HEAD [SURFHEAD by s] [DEEPHEAD [SURFHEAD byl] [DEEPHEAD [HEAD ušpinil] [OBJ se]]]]]


(8) Surface dependency structure derived from (7)

    [by+s [SUBJ ty] [byl [ušpinil [OBJ se]]]]

(9) Deep dependency structure derived from (7)

    [bys,byl,ušpinil [SUBJ ty] [OBJ se]]
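
How (8) and (9) could be derived from (7) can be sketched in Python. This is a minimal illustration under stated assumptions, not the project's actual conversion code. The `Node` class, the `HEAD_FUNCS` table and the function names are hypothetical, and the tree is the structure of (7) as read off the flattened figure. The idea is that the surface view follows HEAD/SURFHEAD daughters to find each governor, while the deep view follows HEAD/DEEPHEAD daughters and merges surface heads (function words) into the deep head's node.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    func: str                       # syntactic function of this constituent
    word: Optional[str] = None      # filled in for terminal nodes only
    children: List["Node"] = field(default_factory=list)


# Functions that count as the head daughter in each view.
HEAD_FUNCS = {"surface": ("HEAD", "SURFHEAD"), "deep": ("HEAD", "DEEPHEAD")}


def head_daughter(node: Node, mode: str) -> Node:
    """Return the first daughter whose function marks it as head in this view."""
    return next(c for c in node.children if c.func in HEAD_FUNCS[mode])


def lexical_head(node: Node, mode: str) -> str:
    """Follow head daughters down to a terminal and return its word."""
    while node.word is None:
        node = head_daughter(node, mode)
    return node.word


def dependencies(root: Node, mode: str) -> Tuple[List[Tuple[str, str, str]], List[str]]:
    """Return (governor, function, dependent) edges and absorbed function words."""
    edges: List[Tuple[str, str, str]] = []
    absorbed: List[str] = []

    def walk(node: Node) -> None:
        if node.word is not None:
            return
        head = head_daughter(node, mode)
        governor = lexical_head(node, mode)
        for child in node.children:
            if child is head:
                pass  # the head daughter is not its own dependent
            elif mode == "deep" and child.func == "SURFHEAD":
                # Function words are merged into the deep head's node.
                absorbed.append(lexical_head(child, mode))
            else:
                edges.append((governor, child.func, lexical_head(child, mode)))
            walk(child)

    walk(root)
    return edges, absorbed


# Structure (7); the root label "S" is a category, used here only as a name.
tree = Node("S", children=[
    Node("SUBJ", "ty"),
    Node("HEAD", children=[
        Node("SURFHEAD", "by+s"),
        Node("DEEPHEAD", children=[
            Node("SURFHEAD", "byl"),
            Node("DEEPHEAD", children=[
                Node("HEAD", "ušpinil"),
                Node("OBJ", "se"),
            ]),
        ]),
    ]),
])

surface_edges, _ = dependencies(tree, "surface")
deep_edges, function_words = dependencies(tree, "deep")
print(surface_edges)
print(deep_edges, function_words)
```

In the surface view the chain by+s, byl, ušpinil is kept as three nodes, as in (8). In the deep view by+s and byl end up in `function_words` and only the SUBJ and OBJ edges of ušpinil remain, as in (9).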


Subject/object ambiguity

Reflexive passive:

(10) Zařízení(Nom/Gen)  se    využívá.
     device             REFL  uses
     ‘The device is being used.’

[S [SUBJ zařízení] [HEAD [DEAGENT se] [HEAD využívá]]]

[S [HEAD využívá] [DEAGENT se] [OBJ zařízení]]


Another type of subject/object ambiguity

(11) Zdravotnictví             musí  zachránit  stát.
     health-service(nom/acc)   must  save       state(nom/acc)

Two different readings:

#1 Health service must save the State.
#2 Health service must be saved by the government.


#1 [S [SUBJ zdravotnictví] [HEAD [HEAD musí] [OBJ [HEAD zachránit] [OBJ stát]]]]

#2 [S [SUBJ stát] [HEAD [HEAD musí] [OBJ [HEAD zachránit] [OBJ zdravotnictví]]]]
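
The two readings of (11) differ only in which noun is the subject and which is the object, and the two choices depend on each other. One way to store such a pair compactly is a distributive (named) disjunction, where alternatives that share a name must be resolved together. The Python sketch below is an assumption about how this could look, not the annotation format of the treebank. It flattens the trees to one function label per word, and the disjunction name `d1` and the function `expand` are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple, Union

# A function label is either fixed, or one slot of a named disjunction:
# (disjunction name, alternatives in a fixed order).
Label = Union[str, Tuple[str, Tuple[str, ...]]]

annotation: Dict[str, Label] = {
    "zdravotnictví": ("d1", ("SUBJ", "OBJ")),
    "musí": "HEAD",
    "zachránit": "HEAD",
    "stát": ("d1", ("OBJ", "SUBJ")),
}


def expand(annotation: Dict[str, Label]) -> List[Dict[str, str]]:
    """Enumerate readings; slots sharing a name pick the same alternative."""
    sizes: Dict[str, int] = {}
    for label in annotation.values():
        if isinstance(label, tuple):
            name, alternatives = label
            sizes[name] = len(alternatives)
    names = sorted(sizes)
    readings = []
    # One index is chosen per disjunction name, not per slot.
    for choice in product(*(range(sizes[n]) for n in names)):
        pick = dict(zip(names, choice))
        readings.append({
            word: label[1][pick[label[0]]] if isinstance(label, tuple) else label
            for word, label in annotation.items()
        })
    return readings


readings = expand(annotation)
for reading in readings:
    print(reading)
```

Because both nouns share the disjunction `d1`, the expansion yields the two readings of (11) rather than the four combinations that two independent disjunctions would allow.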


Input processing


Processing of the input text:

Automatic correction of the output of a stochastic parser

Conversion of the corrected parse + modifications:
  phenomena that require arbitrary decisions in a dependency tree: constructions with function words, coordinated constructions, lists
  disjunction accounting for structural ambiguities expressed by PDT’s “combined functions” AttrAdv, ObjAdv


Syntactic tree in the PDT and the new format

(12) Most,   který  byl  v   havarijním  stavu,  by      měl          sloužit  dalších  třicet  let.
     bridge  which  was  in  emergency   state   should  have(modal)  serve    next     thirty  years

‘The bridge, which was ramshackle, should serve for another thirty years.’


Correction module

30 correction rules so far

For more frequent errors which can be reliably corrected

Such as noun in accusative as subject
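
A rule of this kind might look roughly like the following Python sketch. Everything in it is an assumption made for illustration: the `Token` fields, the labels `Subj` and `Obj`, the choice of `Obj` as the repaired label, and the example tokens are hypothetical, and the actual correction rules may also change attachments. The sketch only fires when the form is unambiguously accusative, so that forms ambiguous between nominative and accusative are left to other means.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, List


@dataclass(frozen=True)
class Token:
    form: str
    pos: str                 # part of speech, e.g. "N"
    cases: FrozenSet[str]    # morphological cases the form can express
    func: str                # syntactic function assigned by the parser


def fix_accusative_subject(tokens: List[Token]) -> List[Token]:
    """Relabel nouns that are unambiguously accusative but tagged as Subj."""
    fixed = []
    for tok in tokens:
        if tok.pos == "N" and tok.func == "Subj" and tok.cases == frozenset({"acc"}):
            tok = replace(tok, func="Obj")  # assumed repair: subject -> object
        fixed.append(tok)
    return fixed


parsed = [
    Token("bednu", "N", frozenset({"acc"}), "Subj"),        # hypothetical parser error
    Token("stát", "N", frozenset({"nom", "acc"}), "Subj"),  # ambiguous form, left alone
]
corrected = fix_accusative_subject(parsed)
print([(t.form, t.func) for t in corrected])
```

Restricting the rule to unambiguous forms is one way to keep it in the class of errors "which can be reliably corrected".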


Success rate of the correction modules

             Rules  Dependency  Label  Total
Clauses          6        1688    774   1744
NP               8         819   2066   2625
PP               9         834   7160   7722
Other            5         412   1390   1802
Total (ppm)               3753  11390  13893
Total (%)                0.38%  1.14%  1.39%
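
The two total rows can be checked against the per-category counts with a few lines of Python. The sketch assumes that "ppm" means corrections per million tokens, which the slide does not state explicitly; the variable names are illustrative.

```python
# Counts per category as given in the table: (dependency, label, total).
counts = {
    "Clauses": (1688, 774, 1744),
    "NP": (819, 2066, 2625),
    "PP": (834, 7160, 7722),
    "Other": (412, 1390, 1802),
}

# Column sums, to be compared with the "Total (ppm)" row.
column_totals = tuple(sum(column) for column in zip(*counts.values()))

# Assuming "ppm" means corrections per million tokens, convert to percent.
percentages = tuple(round(total / 1_000_000 * 100, 2) for total in column_totals)

print(column_totals)
print(percentages)
```

Under that assumption the column sums and rounded percentages agree with the "Total (ppm)" and "Total (%)" rows.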


Conclusions and plans


Conclusions and plans 1/2

Results
  Conversion rules
  Correction module
  200M corpus parsed and corrected
  Beta version of a viewer with three representation modes

Further work
  Manually tagged and parsed subcorpus will provide better data to train the parser
  More parsing errors will be detected and corrected
  More modes of viewing the syntactic structure
  Grammar development


Conclusions and plans 2/2

Empiricism and theory meet in the corpus annotation

Competence grammar to fully license the annotation of grammatical forms and constructions

Underspecification and partial parses for the rest

Performance grammar to close the gap between the real language and the annotation provided by the competence grammar


Based on the work of:

Milena Hnátková, Petr Jäger, Tomáš Jelínek, Vladimír Petkevič, Hana Skoumalová and myself


Supported by:

The Grant Agency of the Czech Republic

Project no. GAČR P406/10/0434

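[Figure: phrase-structure tree with the nodes S, VP, V, NP, PP, P, NP, Det, N over the sentence "Thank you for your attention!", shown alongside the Czech "Děkuji vám za pozornost!"]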

A. Rosen (CU Prague) Grammar-based treebank: empiricism/theory Grammar & Corpora 2012 76 / 84

Page 77: Grammar-based treebank a happy marriage of empiricism and ...

Conclusions and plans

References I

Baker, C. F., Fillmore, C. J., & Lowe, J. B. (1998).The Berkeley FrameNet project.In 36th Meeting of the Association for Computational Linguisticsand 17th International Conference on Computational Linguistics(COLING-ACL’98), pages 86–90, Montréal.

Charniak, E. & Charniak, E. (1996).Tree-bank grammars.In In Proceedings of the Thirteenth National Conference onArtificial Intelligence, pages 1031–1036.

Haider, H. (1993).Deutsche Syntax – Generativ.Narr, Tübingen.

A. Rosen (CU Prague) Grammar-based treebank: empiricism/theory Grammar & Corpora 2012 77 / 84

Page 78: Grammar-based treebank a happy marriage of empiricism and ...

Conclusions and plans

References II

Hajic, J., Panevová, J., Urešová, Z., Bémová, A., & Pajas, P.(2003).PDT-VALLEX: Creating a large-coverage valency lexicon fortreebank annotation.In Proceedings of The Second Workshop on Treebanks andLinguistic Theories, pages 57–68. Växjö University Press.

Hajicová, E. & Sgall, P. (2006).Corpus annotation as a test of a linguistic theory.In Proceedings of LREC 2006, pages 879–884.

Hinrichs, E. W. & Telljohann, H. (2009).Constructing a valence lexicon for a treebank of German.In Proceedings of the Seventh International Workshop onTreebanks and Linguistic Theories, page 41–52.

A. Rosen (CU Prague) Grammar-based treebank: empiricism/theory Grammar & Corpora 2012 78 / 84

Page 79: Grammar-based treebank a happy marriage of empiricism and ...

Conclusions and plans

References III

Kempen, G. & Harbusch, K. (2001).Performance grammar: a declarative definition.In M. Theune, A. Nijholt, and H. Hondorp, editors, CLIN, volume 45of Language and Computers – Studies in Practical Linguistics,pages 148–162. Rodopi.

Kepser, S., Steiner, I., & Sternefeld, W. (2004).Annotating and querying a treebank of suboptimal structures.In In Proceedings of the 3rd Workshop on Treebanks andLinguistic Theories (TLT2004), pages 63–74.

Nivre, J. (2003).Theory-supporting treebanks.In Proceedings of the Second Workshop on Treebanks andLinguistic Theories.

A. Rosen (CU Prague) Grammar-based treebank: empiricism/theory Grammar & Corpora 2012 79 / 84

Page 80: Grammar-based treebank a happy marriage of empiricism and ...

Conclusions and plans

References IV

Oepen, S., Flickinger, D., Toutanova, K., & Manning, C. D. (2002).LinGO Redwoods: A rich and dynamic treebank for HPSG.In Proceedings of the Workshop on Treebanks and LinguisticTheories, September 20-21 (TLT02), Sozopol, Bulgaria.

Oliva, K. (2008).Phenomena-oriented corpora: a manifesto.In F. Štícha and M. Fried, editors, Grammar & Corpora =Gramatika a korpus 2007. Sborník príspevku ze stejnojmennékonference 25.-27. 9. 2007, Liblice= Selected contributions fromthe conference Grammar and Corpora, Sept. 25-27, 2007, Liblice,pages 77–104, Praha. Academia.

A. Rosen (CU Prague) Grammar-based treebank: empiricism/theory Grammar & Corpora 2012 80 / 84

Page 81: Grammar-based treebank a happy marriage of empiricism and ...

Conclusions and plans

References V

Oliva, K. & Petkevic, V. (1998).Phenomena-based description of dependency-syntax: A survey ofideas and formalization.In E. Hajicová and B. Hladká, editors, Issues of Valency andMeaning – Studies in Honour of Jarmila Panevová. CharlesUniversity Press, Praha.

Palmer, M., Gildea, D., & Kingsbury, P. (2005).The proposition bank: An annotated corpus of semantic roles.Computational Linguistics, 31(1), 71–106.

Prescher, D., Scha, R., Sima’an, K., & Zollmann, A. (2006).What are treebank grammars?In BNAIC’06: BeNeLux conference on Artificial Intelligence 2006,Namur, Belgium.

A. Rosen (CU Prague) Grammar-based treebank: empiricism/theory Grammar & Corpora 2012 81 / 84

Page 82: Grammar-based treebank a happy marriage of empiricism and ...

Conclusions and plans

References VI

Rosén, V., de Smedt, K., & Meurer, P. (2006).Towards a toolkit linking treebanking to grammar development.In Proceedings of the 5th International Workshop on Treebanksand Linguistic Theories (TLT’05), Prague, Czech Republic.

Rosén, V., Smedt, K. D., Meurer, P., & Dyvik, H. (2012).An open infrastructure for advanced treebanking.In J. Hajic, K. D. Smedt, M. Tadic, and A. Branco, editors,Proceedings of the META-RESEARCH Workshop on AdvancedTreebanking, LREC 2012, pages 22–29, Istanbul, Turkey. ELRA,European Language Resources Association.

A. Rosen (CU Prague) Grammar-based treebank: empiricism/theory Grammar & Corpora 2012 82 / 84

Page 83: Grammar-based treebank a happy marriage of empiricism and ...

Conclusions and plans

References VII

Sgall, P., Hajicová, E., & Panevová, J. (1986).The Meaning of the Sentence in its Semantic and PragmaticAspects.Reidel and Academia, Dordrecht and Praha.Editor: Jacob Mey.

Simov, K., Osenova, P., Kolkovska, S., Balabanova, E., Doikoff, D.,Ivanova, K., & Alexander Simov, M. K. (2002).Building a linguistically interpreted corpus of Bulgarian: theBulTreeBank.In Proceedings of LREC 2002, pages 1729–1736, Canary Islands,Spain.

A. Rosen (CU Prague) Grammar-based treebank: empiricism/theory Grammar & Corpora 2012 83 / 84

Page 84: Grammar-based treebank a happy marriage of empiricism and ...

Conclusions and plans

References VIII

Xia, F., Rambow, O., Bhatt, R., Palmer, M., & Sharma, D. M.(2009).Towards a multi-representational treebank.In F. Van Eynde, A. Frank, G. van Noord, and K. De Smedt, editors,Proceedings of the 7th International Workshop on Treebanks andLinguistic Theories (TLT7), pages 127–133, Utrecht. LOT.

Swidzinski, M. & Wolinski, M. (2010).Towards a bank of constituent parse trees for Polish.In Proceedings of the 13th International Conference on Text,Speech and Dialogue, TSD’10, pages 197–204, Berlin,Heidelberg. Springer-Verlag.

A. Rosen (CU Prague) Grammar-based treebank: empiricism/theory Grammar & Corpora 2012 84 / 84