CINTIL TreeBank Handbook: Design options for the representation of syntactic constituency
António Branco, João Silva, Francisco Costa and Sérgio Castro
University of Lisbon
January 2011
Contents
1 Introduction
1.1 Concordancer
2 Constituency relations
2.1 constituency in a nutshell
2.2 minimal constituents
2.3 syntactic predication
2.4 head
2.4.1 personal pronouns
2.4.2 clitic pronouns
A treebank is an annotated corpus. It is a data set consisting of a collection of individual written utterances associated with the representation of their linguistic structure, which can be set to capture different degrees of linguistic information.
CINTIL Treebank is a corpus of Portuguese utterances annotated with the representation of constituency relations. It is being developed and maintained at the University of Lisbon.
This document aims at supporting the utilization and exploitation of the CINTIL Treebank. It presents its major design options concerning the representation of syntactic relations.
The adopted design options were informed by advanced linguistic theorizing. The reader is referred to the literature for a thorough discussion and justification of them.
For the source of the utterances in this corpus, for its composition and for the annotation methodology used, see (Barreto et al., 2006).
The CINTIL Treebank has two versions. There is a reference version for human users, and there is a variant for training probabilistic parsers. Where the latter differs from the reference version, this is indicated below by text between square brackets starting with "ProbParser:".
Each example graph displayed below is associated with its identifier in the corpus. The sentences can be retrieved in the concordancer by means of these identifiers.
2 Constituency relations
2.1 constituency in a nutshell
In a sequence of lexemes w1 w2 w3, if the subsequence w1 w2 has a higher level of aggregation than the subsequences w1 w2 w3 or w2 w3, the sequence w1 w2 is considered to form a constituent of w1 w2 w3, of which w1 and w2 are themselves constituents.
The contrasting levels of aggregation are determined through the application of empirical tests which rely on grammatical intuitions or judgments on syntactic well-formedness. These empirical tests are based on judiciously designed minimal pairs of sequences. To test a putative constituent, these minimal pairs are constructed, for instance, by means of the insertion of a parenthetical element inside it, by displacing it to a non-canonical word order in the sentence, by replacing it by an anaphoric expression, or by coordinating it with other known constituents, etc.
A constituent is represented by enclosing the relevant sequence in square brackets (e.g. [w1 w2] w3), or, in an alternative but equivalent notation, by forming a tree of depth one whose leaves are w1 and w2 and whose top node stands for the whole constituent.
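The equivalence between the bracket notation and the tree notation can be illustrated with a minimal sketch. The helper name `parse_brackets` and the encoding of trees as nested Python lists are assumptions made for illustration only; they are not part of the treebank or of its tools.

```python
def parse_brackets(text):
    """Turn a bracketed string such as '[w1 w2] w3' into nested lists.

    Each pair of square brackets becomes one inner list (one constituent);
    the lexemes are the leaves. Hypothetical helper, for illustration only.
    """
    tokens = text.replace("[", " [ ").replace("]", " ] ").split()
    stack = [[]]                      # stack[0] collects the top-level items
    for tok in tokens:
        if tok == "[":
            stack.append([])          # open a new constituent
        elif tok == "]":
            if len(stack) == 1:
                raise ValueError("unbalanced ']'")
            done = stack.pop()
            stack[-1].append(done)    # attach the constituent to its parent
        else:
            stack[-1].append(tok)     # a leaf (lexeme)
    if len(stack) != 1:
        raise ValueError("unbalanced '['")
    return stack[0]


tree = parse_brackets("[w1 w2] w3")
print(tree)
```

In the result, the inner list stands for the top node of the constituent formed by w1 and w2, while w3 remains outside it.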
A syntactic category is a set of constituents with identical syntactic distribution, that is, constituents whose replacement by each other preserves the syntactic well-formedness of the larger expressions they are constituents of (provided some other key grammatical relations, such as morphological agreement, subcategorization, etc., are not affected by that replacement).
2.3 syntactic predication
The constituency relations are intertwined with other grammatical relations, determining and being determined by them. Syntactic predication is one such relation of interest.
A syntactic predication is organized around a predicate and its complements, possibly extended with modifiers and specifiers.
A syntactic predicate of category X is a special constituent (termed head) of its phrase, of category XP. In that constituency tree, the path from X to XP only contains (zero or more) intermediate nodes of category X'. That node XP, as well as the intermediate nodes X', are said to be projected by that head X.
In the treebank, in general, for major categories, a head X is represented as projecting an XP when this is a constituent having complements or modifiers of X as subconstituents (see also section "7.3 Comparatives").
Given their specific or ambivalent nature in categorial terms, this schema is adapted for the following items:
2.4.1 personal pronouns
A personal pronoun has category PRS. It is the head of an NP.
2.4.2 clitic pronouns
A clitic pronoun has category CL. It is the head of an NP.
2.4.3 participles
A past participle has category V. It is the head of an AP in attributive and predicative constructions.
Given their specific nature, verbal predicates may also have an external complement, not occurring inside the VP they project (see also section "5.1 Sentences").
A specifier of an NP projected by a head N is a constituent of that NP, immediately dominated by it or by an intermediate N', provided all other dominating N's are also dominated specifiers (see also section "5.2 Nominals" for more details on NPs).
Given the key semantic function of specifiers, it is considered that NPs without a phonetically realized specifier (bare NPs) still undergo some process of specification. As a result, the NP node of bare NPs has a unary branch to the immediately dominated node.
The exception is to be found in Proper Names that modify a common noun,
3 Non-constituency relations
Trees are aimed at depicting constituency relations. In the CINTIL Treebank, they are further decorated with information tags that are also relevant for two types of grammatical relations of a non-constituency nature, namely grammatical dependency relations and semantic role relations.
Such information tags encode, respectively, grammatical functions and semantic functions of the corresponding nodes. They are displayed according to the pattern Z-GF-SF, where Z is a constituency category, GF is a grammatical function, and SF is a semantic function (e.g. NP-SJ-ARG1).
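As an illustration of the Z-GF-SF pattern, the following minimal sketch splits a node label into its three parts. The names `NodeTag` and `parse_tag` are hypothetical, and the handling of labels with fewer than three parts (missing parts become `None`) is an assumption, since the handbook only gives the full pattern.

```python
from typing import NamedTuple, Optional


class NodeTag(NamedTuple):
    category: str          # Z: constituency category, e.g. "NP"
    gf: Optional[str]      # GF: grammatical function, e.g. "SJ"
    sf: Optional[str]      # SF: semantic function, e.g. "ARG1"


def parse_tag(label):
    """Split a label following the Z-GF-SF pattern into its parts.

    Assumption: a label may carry fewer than three parts, in which case
    the missing ones are returned as None.
    """
    # Normalise the typographic hyphen (U+2010) sometimes found in
    # text copied from PDF files to a plain ASCII hyphen.
    parts = label.replace("\u2010", "-").split("-")
    if not 1 <= len(parts) <= 3 or not all(parts):
        raise ValueError(f"unexpected label: {label!r}")
    parts += [None] * (3 - len(parts))
    return NodeTag(*parts)


print(parse_tag("NP-SJ-ARG1"))
```

A named tuple keeps the three parts accessible by name (`tag.gf`, `tag.sf`) while remaining comparable to a plain tuple.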
A grammatical function results from an abstraction over complements and modifiers of different predicates. It makes it possible to group complements, or modifiers, with similar syntactic constraints on their realization, such as category, case, agreement, canonical word order, inflection paradigm, etc.
A semantic function, or semantic role, is also an abstraction over complements and modifiers of various syntactic predicates, but along a different, semantic, dimension. It makes it possible to group complements, or modifiers, according to similar semantic constraints on their denotation, that is, in terms of the similar contribution that the extra-linguistic elements they may denote bring to the characterization of the event being described. Given that semantic roles are much more elusive than grammatical functions, and following common practice with respect to the creation of PropBanks (see also section 3.2 below on the CINTIL PropBank), the option here was to adopt a set of roles for complements that primarily serves to semantically distinguish complements of the same predicate from each other.
The possible values of grammatical functions are listed in section 4.3 and those of semantic functions in section 4.4.
3.1 CINTIL DepBank
Grammatical functions are a necessary but not sufficient element to characterize grammatical dependencies. Grammatical dependency relations can be depicted as graphs whose nodes are words and whose directed arcs establish a connection from a lexeme to its subordinate lexemes.
An arc represents the dependency of the subordinate item on the head. These dependencies can be of a number of different types, which are mostly the grammatical functions, and the arcs are decorated with the corresponding tags.
Corpora annotated with grammatical dependency graphs are known as Dependency Banks. The CINTIL Treebank is aligned to a dependency bank, the CINTIL DepBank. The bridging elements are the grammatical function tags decorating the nodes, in the treebank, and the arcs, in the dependency bank.
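A dependency graph of this kind can be sketched as a list of labelled arcs running from a head to its subordinate items. The three words and the arcs below are a hypothetical toy analysis, not taken from the corpus, and the function names are assumptions; only the tags SJ and DO are tags that appear in this handbook.

```python
from collections import defaultdict

# Hypothetical arcs: (head, dependent, grammatical function).
# Toy analysis of the words "jogadores viram isso"; not corpus data.
ARCS = [
    ("viram", "jogadores", "SJ"),   # subject of the verb
    ("viram", "isso", "DO"),        # direct object of the verb
]


def build_graph(arcs):
    """Index the arcs by head so that dependents can be looked up."""
    graph = defaultdict(list)
    for head, dependent, gf in arcs:
        graph[head].append((dependent, gf))
    return dict(graph)


def dependents(graph, head, gf=None):
    """Return the dependents of a head, optionally filtered by arc tag."""
    return [d for d, f in graph.get(head, []) if gf is None or f == gf]


graph = build_graph(ARCS)
print(dependents(graph, "viram"))
print(dependents(graph, "viram", gf="SJ"))
```

In a real sentence the nodes would have to be keyed by word position rather than by word form, since the same form may occur more than once.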
3.2 CINTIL PropBank
Treebanks encoding constituency relations which are extended to also encode semantic functions, or semantic roles, of elements of syntactic predications have been termed PropBanks in the literature. Given that the nodes of the CINTIL Treebank are decorated with semantic functions, this annotated corpus can also be taken as being the CINTIL PropBank.
It is worth noting that in so-called PropBanks, the tag on a given constituent signals a semantic relation between that constituent and a predicator in the utterance. Hence, that relation, being signaled over a single constituent, is not fully identified in an explicit way, as one of its terms is not indicated.
Nonetheless, the relevant predicate is usually the closest predicate in the tree, which belongs to the same minimal predication as the tagged constituent.
The cases where this does not hold are typically cases of complex predicates, formed by means of several chained verbs, e.g. modals, auxiliaries and raising verbs. In such cases the tag used to code the semantic function is suffixed with "cp" (standing for "complex predicate") in order to help the search and concordancing of the treebank (for more details see section 4.4 below).
ARG11 Argument 1 of subordinating predicator and Argument 1 of the subordinate clause (semantic function of Subjects of so-called Subject Control predicators)
ARG21 Argument 2 of subordinating predicator and Argument 1 of the subordinate clause (semantic function of Subjects in so-called Direct Object Control predicators)
ARG31 Argument 3 of subordinating predicator and Argument 1 of the subordinate clause (semantic function of Subjects in so-called Indirect Object Control predicators)
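The composite argument tags listed above, together with the "cp" suffix used for complex predicates, can be decoded mechanically. The sketch below is a hypothetical helper (`decode_role` and the dictionary keys are assumptions); it only covers tags of the shapes ARGn, ARGncp, ARG11, ARG21 and ARG31, and returns `None` for any other semantic function tag.

```python
import re

# ARG + argument number, optionally followed by "1" (Argument 1 of the
# subordinate clause, as in ARG11/ARG21/ARG31) and by "cp".
ROLE = re.compile(r"^ARG(?P<n>[1-9])(?P<lower>1)?(?P<cp>cp)?$")


def decode_role(tag):
    """Decode an ARG-style semantic function tag into a small dict."""
    m = ROLE.match(tag)
    if m is None:
        return None                    # not an ARG-style tag
    return {
        "argument": int(m.group("n")),
        # True when the constituent is also Argument 1 of the
        # subordinate clause (control predicators).
        "control": m.group("lower") is not None,
        # True when the role is assigned within a complex predicate.
        "complex_predicate": m.group("cp") is not None,
    }


print(decode_role("ARG21"))
print(decode_role("ARG1cp"))
```

Such a decoder can help, for instance, when filtering concordance results for control constructions only.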
Quantifiers floating to a post-verbal position, as in Os jogadores viram todos isso, are in adjunction to a projection of the verb. Those floating to an immediately post-nominal position, as in Os jogadores todos viram isso, are in adjunction to their NP (see example #Id:b092/5911, in section 6.1 below).
6 Phonetically null items
Phonetically null items mark positions in the tree related to other positions in the tree (in the case of traces), or mark elided elements whose context is rich enough to support the recovery of their interpretation (in the case of null subjects or null heads).
[ProbParser: Phonetically null items are removed from the tree and represented by means of an appropriate tree configuration or an appropriate relabelling of the relevant nodes.]
6.1 null subjects
Null subjects are marked by *NULL* and are immediately dominated by the S node projected by a V or a VP:
6.3 traces
Traces of constituents are marked by *GAP* followed by _n, where n is a natural number. The category of the "displaced" node is coindexed with the trace and is thus also followed by _n:
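The coindexation between a trace and its "displaced" node can be recovered by matching the _n suffixes. The sketch below is a minimal illustration under two assumptions: the labels are given as a flat list of strings, and the index is the final suffix of each label. The function name `link_traces` and the sample labels are hypothetical.

```python
import re
from collections import defaultdict

# A label ending in _n, where n is a natural number.
INDEXED = re.compile(r"^(?P<base>.+)_(?P<n>\d+)$")


def link_traces(labels):
    """Pair each coindexed node with the positions of its traces.

    Returns {n: (position of displaced node, [positions of *GAP*_n])}.
    """
    fillers = {}
    gaps = defaultdict(list)
    for pos, label in enumerate(labels):
        m = INDEXED.match(label)
        if m is None:
            continue                      # label carries no index
        n = int(m.group("n"))
        if m.group("base") == "*GAP*":
            gaps[n].append(pos)
        else:
            fillers[n] = pos
    return {n: (fillers[n], gaps[n]) for n in fillers if n in gaps}


# Hypothetical sequence of labels, in left-to-right order.
labels = ["NP_1", "S", "VP", "V", "*GAP*_1"]
print(link_traces(labels))
```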
7.1 relatives
A modifying relative clause is dominated by N'. It is of category CP, with two immediate constituents, an XP projected by a relative pronoun REL and a clause S. It has grammatical function M and semantic role PRED:
For the representation of the phonetically null trace, in correspondence to the relativizer XP, see section 6.3 on Traces (see also section "8 Long-distance relations").
7.2 adjectives: predicative and attributive
In predicative constructions, the Subject is ARG1 of the copula verb, and in the corresponding logical form it shows up as ARG1 of the adjective.
The exception happens with adjectives like maior, menor, melhor, pior, which also express the comparison, in which case the comparative construction is built around the adjective and the CONJP phrase.
#Id:e000481/64969
The adverbial of degree (e.g. mais, menos, tão) is a sister of the adjective, dominated by an A' node. It is superficially tagged as A-M-M, that is, as a modifier, but note that in logical form the adjective shows up as the ARG1 of this adverb.
The CONJP phrase is a sister node of that A' node. It is projected by one of the conjunction expressions for comparatives: que, de que, de_ o que, como, quanto. It is a complement of the adverbial of degree. Hence this adverbial happens not to project an ADVP. This phrase is tagged as CONJP-C-ARG2, indicating that it is the complement and ARG2 of the adverb.
The CONJP may be absent from the comparative construction. In such a case, though it can be semantically recovered from the context, no phonetically null item is inserted in the tree to mark it.
7.7 gerunds
When in complex predicate constructions, preceded by an auxiliary verb, a gerund projects a VP. Otherwise, a gerund projects an adverbial sentence with a null subject:
In a complex predicate, formed by any sequence of auxiliary, raising or modal verbs, its Subject is marked as NP-SJ-ARGncp, signaling that it is the Subject of the topmost verb (viz. -SJ-) and the ARGn of some verb further down in the complex predicate:
#Id:b134/8372
7.9 control verbs
Subject control verbs (e.g. querer) select for a Subject NP-SJ-ARG11, signaling that it is the subject both of the control verb and of the clause occurring as direct object of the latter:
#Id:b001/34
Object control verbs (e.g. obrigar) select for a Direct Object NP-DO-ARG21, signaling that it is both the object of the control verb and the subject in the clause occurring as the other internal argument of the control verb.
#Id:e000660/79129
Indirect object control verbs (e.g. pedir) select for an Indirect Object PP-IO-ARG31, signaling that it is both the indirect object of the control verb and the subject in the clause occurring as the other internal argument of the control verb.
7.10 "tough" constructions
In "tough" constructions, the sentential complement of the adjective, introduced by the preposition de, and projected by an inflected infinitive, has a phonetically null object marked with *TOUGH*:
#Id:a012/591
For more details, see also section 7.6 on infinitives and 6.4 on "tough" null objects.
8 Long-distance relations
Long-distance relations are established between a constituent and a right downwards position in the tree where this constituent typically occurs in (declarative) counterparts with canonical SVO word order. Constructions with long-distance relations include topicalizations, interrogatives and relatives.
The canonical position is marked by a phonetically null item *GAP*, which is coindexed with the constituent with which it establishes a long-distance relation.
[ProbParser: The long-distance dependency is represented by decorating every node in the path inside the tree connecting the node immediately dominating the putative gap and the sister node of the "displaced" constituent. The nodes in that path are decorated by concatenating to their category tags a slash "/" followed by the triple CAT-GF-SR of that "displaced" constituent, where CAT is its category, GF is its grammatical function, and SR is its semantic role.]
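The slash decoration described for the ProbParser variant can be sketched as a simple relabelling of the nodes on the path. This is an illustration under assumptions: the path is given as a list of bare category tags, the function name `slash_decorate` is hypothetical, and the sample path and triple are invented for the example.

```python
def slash_decorate(path_categories, displaced):
    """Append '/CAT-GF-SR' to every category tag on the path.

    path_categories: category tags of the nodes on the path, from the
        sister of the "displaced" constituent down to the node that
        immediately dominates the putative gap (assumed bare tags).
    displaced: (CAT, GF, SR) triple of the "displaced" constituent.
    """
    cat, gf, sr = displaced
    suffix = "/" + "-".join((cat, gf, sr))
    return [tag + suffix for tag in path_categories]


# Hypothetical path and displaced constituent, for illustration only.
print(slash_decorate(["S", "VP"], ("NP", "DO", "ARG2")))
```

Since every node on the path carries the same suffix, the gap itself no longer needs to be present in the tree.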
In a predication supported by the transitive counterpart of a possible anticausative verb, the Subject is ARG1, as with other transitive verbs. As expected, in its passive alternation, the Subject is ARG2.
10 Tokenization
10.1 sentence splitting
Sentences are split at the expected points. Worth mentioning is the case of utterances involving a colon ":", which are split into two separate entry sentences in the treebank, one preceding the colon and the other following it.
10.2 non verbal utterances
Titles of newspaper articles, stretches around colons, etc. are cases of possible utterances in the corpus which are not projected by a corresponding verbal head. In any case, every entry utterance in the corpus is dominated by an S node.
10.3 contractions
Contractions are expanded. The first element of an expanded contraction is marked with an "_" (underscore) symbol, for instance do → |de_|o|.
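The expansion of contractions can be sketched with a lookup table. Only the entry for do comes from this handbook; the other entries are common Portuguese contractions added as an assumption for illustration, and the function names are hypothetical. The table is deliberately tiny and does not attempt to handle ambiguous forms.

```python
# Contracted form -> expanded tokens; the first element carries "_".
# Only "do" is taken from the handbook; the rest is illustrative.
CONTRACTIONS = {
    "do": ("de_", "o"),
    "da": ("de_", "a"),
    "no": ("em_", "o"),
    "na": ("em_", "a"),
}


def expand_contractions(tokens):
    """Replace each known contraction by its two expanded tokens."""
    out = []
    for tok in tokens:
        out.extend(CONTRACTIONS.get(tok, (tok,)))
    return out


def render(tokens):
    """Display tokens between vertical bars, as in the handbook."""
    return "|" + "|".join(tokens) + "|"


print(render(expand_contractions(["do"])))
```

The lookup is case-sensitive; handling sentence-initial capitalised forms would need an extra rule.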
10.4 clitics
Clitics are detached from the verb. The detached clitic is marked with a "-" (hyphen) symbol, as for instance dá-se-lho → |dá|-se|-lhe|-o|
When in mesoclisis, a "-CL-" mark is used to signal the original position of the detached clitic: afirmar-se-ia → |afirmar-CL-ia|-se|
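The detachment of clitics can be sketched as follows, assuming the hyphenated written forms dá-se-lho and afirmar-se-ia as input. The clitic inventory and the table of contracted clitics are small illustrative assumptions, and `detach_clitics` is a hypothetical name; a real tokenizer needs a fuller inventory and must handle forms with more than one clitic in mesoclisis.

```python
# Illustrative, incomplete inventories (assumptions).
CLITICS = {"me", "te", "se", "lhe", "lhes", "o", "a", "os", "as", "nos", "vos"}
CONTRACTED = {"lho": ["lhe", "o"], "lha": ["lhe", "a"]}


def _is_clitic(part):
    return part in CLITICS or part in CONTRACTED


def _expand(part):
    return CONTRACTED.get(part, [part])


def detach_clitics(word):
    """Split a hyphenated verb form into verb and hyphen-marked clitics."""
    parts = word.split("-")
    if len(parts) == 1:
        return [word]
    verb, rest = parts[0], parts[1:]
    if all(_is_clitic(p) for p in rest):
        # Enclisis: every hyphenated part after the verb is a clitic.
        out = [verb]
        for p in rest:
            out.extend("-" + c for c in _expand(p))
        return out
    if len(rest) >= 2 and not _is_clitic(rest[-1]) and all(
        _is_clitic(p) for p in rest[:-1]
    ):
        # Mesoclisis: clitic(s) sit between the stem and the verb ending;
        # "-CL-" marks their original position.
        out = [verb + "-CL-" + rest[-1]]
        for p in rest[:-1]:
            out.extend("-" + c for c in _expand(p))
        return out
    return [word]          # e.g. an ordinary hyphenated compound


print("|" + "|".join(detach_clitics("dá-se-lho")) + "|")
print("|" + "|".join(detach_clitics("afirmar-se-ia")) + "|")
```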
11.1 Proper names
Multi-word proper names form a flat constituent where every word is a sister of each other, is of category N, and is dominated by a single common N' node. This head projects an NP except when it is a modifier of a common noun:
12.2 comma
Commas separating left periphery constituents are right adjoined to these constituents.
Commas surrounding appositions are topmost constituents of the appositive constituent.
#id:b029/1761
Commas with coordinative value are represented like lexical coordinative conjunctions are (for further details, see section 7.4 on coordination):
#id:a001/30
Commas surrounding parentheticals are adjoined to the surrounded constituent. With several parentheticals in sequence, the first one is surrounded, and each of the following ones has a single comma at its right:
12.3 quotation marks
Quotation marks surrounding constituents are adjoined to them. When they surround a lexical item of category X, they are dominated by an X' node together with X:
13 References
Barreto, Florbela, António Branco, Eduardo Ferreira, Amália Mendes, Maria Fernanda Nascimento, Filipe Nunes and João Silva, 2006, "Open Resources and Tools for the Shallow Processing of Portuguese", Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC2006), Genoa, Italy.
Branco, António, Sérgio Castro, João Silva and Francisco Costa, 2011, CINTIL DepBank Handbook: Design options for the representation of grammatical dependencies.