Corpus tools for a quantitative study of object positions in written language Eckhard Bick

Dec 21, 2015

Transcript
Page 1: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

Corpus tools

for a quantitative study of object positions in written language

Eckhard Bick

Page 2: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

The advantage of using a corpus rather than introspection

• empirical, reproducible: Falsifiable science

• objective, neutral: The corpus is always (mostly) right, no interference from the test person's respect for textbooks

• definable observation space: Diachronics, genre, text type

• statistics: Observe linguistic tendencies (%) as opposed to (speaker-dependent) “stable” systems, quantify ?, ??, *, **

• context: All cases count, no “blind spots”

Page 3: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

The Portuguese example

• Portuguese object pronouns need an “attractor” (negation, subject) in order to allow pre-verbal position

• More so in Portugal than in Brazil or Mozambique

• Diachronic fluctuation, sociolect / speaker status

• Introspection yields normative results

• Corpus yields true(r) results (NURC, Tycho Brahe, Folha vs. Público ....)

Page 4: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

How to enrich a corpus

• Meta-information: Source, time-stamp etc.

• Grammatical annotation: Part of speech (PoS), inflexion, syntactic function, syntactic structure, semantics ...

• Manual vs. automatic annotation

Page 5: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

e.g. Korpus90 and Korpus2000

• mixed text, ca. 20 (28) million words each

• sentence-randomized “quote” corpus

• compiled by DSL (www.dsl.dk)

• grammatically annotated by VISL (visl.sdu.dk)

– a) automatically with the DanGram parser

– b) 1% manually revised (Arboretum treebank)

Page 6: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

Other Danish corpora:

• Europarl (28M): true parallel multilingual

• Bergenholtz (3M), Parole (0.25M)

• Wikipedia etc.: The internet as corpus

• Specialised: BySoc, Folketing, e-mail ...

Page 7: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

How to annotate

• All annotation is theory dependent, but some schemes less so than others. The higher the annotation level, the more theory dependent

• double role of corpora: (a) as goal, (b) as (gold-standard annotated) data for machine learning: rule-based systems for boot-strapping

• PoS (tagging): needs a lexicon (“real” or corpus-based)
(a) probabilistic: HMM baseline, DTT, TnT, Brill etc., F ca. 97+%
(b) rule-based:
--- PoS disambiguation as a “side-effect” of syntax (PSG etc.)
--- PoS disambiguation as primary method (CG), F ca. 99%

• Syntax (parsing): function focus vs. form focus
(a) primarily probabilistic: PCFG (constituent), MALT-parser (dependency F 90% after PoS)
(b) primarily rule-based: HPSG, LFG (constituent trees), CG (syn. function F 96%, shallow dependency)

Page 8: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

Constraint Grammar

A methodological rather than descriptive paradigm (Karlsson 1995)

Token-based assignment and contextual disambiguation of tag-encoded grammatical information

Grammars need lexicon/analyzer-based input and consist of thousands of MAP, SUBSTITUTE, REMOVE and SELECT rules.

The VISL project (SDU) uses Constraint Grammar parsers to add form and function tags to word tokens in corpora or running text

Form: e.g. N = noun, P = plural, GEN = genitive

Syntactic function: e.g. @SUBJ = subject, @ACC = direct object

Syntactic form: e.g. dependency markers (@SUBJ>, @<SUBJ), numbered dependency (e.g. #5->3) or secondary constituent trees

Page 9: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

Running CG-annotation

1. Da (When) [da] KS @SUB #5
2. den (the) [den] ART UTR S DEF @>N #4
3. gamle (old) [gammel] ADJ nG S DEF NOM @>N #4
4. sælger (salesman) [sælger] N UTR S IDF NOM @SUBJ> #5
5. kørte (drove) [køre] <mv> V IMPF AKT @FS-ADVL> #11
6. hjem (home) [hjem] ADV DIR @<SA #5
7. i (in) [i] PRP @<ADVL #5
8. sin (his) [sin] <poss> <refl> DET UTR S @>N #9
9. bil (car) [bil] N UTR S IDF NOM @P< #7
10. , #5
11. kunne (could) [kunne] <aux> V IMPF AKT @FAUX #0
12. han (he) [han] PERS UTR 3S NOM @<SUBJ #11
13. se (see) [se] <mv> V INF AKT @AUX< #11
14. mange (many) [mange] <quant> DET nG P NOM @>N #15
15. rådyr (deer) [rådyr] N NEU P IDF NOM @<ACC #13
16. på (in) [på] PRP @<OA #13
17. de (the) [den] ART nG P DEF @>N #19
18. våde (wet) [våd] ADJ nG P nD NOM @>N #19
19. marker (fields) [mark] N UTR P IDF NOM @P< #16
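The annotation above can be read mechanically. The following is a minimal sketch, assuming one token per line in the layout `id. word (gloss) [lemma] TAGS @FUNCTION #head`; the names `Token`, `parse_cg` and `object_positions` are hypothetical helpers for illustration and not part of any VISL tool. It extracts direct objects (@ACC functions) and reports whether each stands left or right of its dependency head.

```python
import re
from dataclasses import dataclass

# id. word (gloss) [lemma] tags... #head ; gloss and lemma are optional (e.g. punctuation)
LINE = re.compile(
    r"^(\d+)\.\s+(\S+)\s+(?:\((.*?)\)\s*)?(?:\[(.*?)\]\s*)?(.*?)\s*#(\d+)$"
)

@dataclass
class Token:
    idx: int
    word: str
    lemma: str
    tags: list
    function: str  # first tag starting with "@", or "" if none
    head: int      # id of the dependency head, 0 for the root

def parse_cg(text):
    """Turn the line-based annotation into a list of Token objects."""
    tokens = []
    for line in text.strip().splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue  # skip lines that do not follow the assumed layout
        idx, word, _gloss, lemma, rest, head = m.groups()
        tags = rest.split()
        function = next((t for t in tags if t.startswith("@")), "")
        tokens.append(Token(int(idx), word, lemma or "", tags, function, int(head)))
    return tokens

def object_positions(tokens):
    """List (object, head word, side) for every token with an ACC function."""
    by_id = {t.idx: t for t in tokens}
    result = []
    for t in tokens:
        if "ACC" in t.function and t.head in by_id:
            side = "right" if t.idx > t.head else "left"
            result.append((t.word, by_id[t.head].word, side))
    return result

SAMPLE = """
11. kunne (could) [kunne] <aux> V IMPF AKT @FAUX #0
12. han (he) [han] PERS UTR 3S NOM @<SUBJ #11
13. se (see) [se] <mv> V INF AKT @AUX< #11
14. mange (many) [mange] <quant> DET nG P NOM @>N #15
15. rådyr (deer) [rådyr] N NEU P IDF NOM @<ACC #13
"""

tokens = parse_cg(SAMPLE)
print(object_positions(tokens))
```

Counting the `side` values over a whole corpus would give the kind of fronted vs. post-verbal object statistics discussed later in the slides.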

Page 10: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

DanGram

[Diagram: DanGram processing chain. Components: Raw text, Preprocessing, Morphological analysis, CG-disambiguation (PoS/morph), CG-syntax, NER and case roles, PSG grammar, Dependency grammar, Treebanks, CG corpora. Lexical resources: Inflexion lexicon (100.000 lexemes), Valency potential, Semantic prototypes.]

Page 11: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

CG results for Danish: PoS (~ 99% accuracy)

Class    recall  precision  F-score    Class  recall  precision  F-score
N        99.5    99.1       99.2       ART    99.3    99.3       99.3
PROP     100     100        100        DET    97.1    98.5       97.7
V PR     99.2    99.2       99.2       PERS   99.4    99.4       99.3
V IMPF   100     97.2       98.8       INDP   98.2    100        99.2
V INF    98.1    99.0       98.5       NUM    100     100        100
V PCP1   100     100        100        ADJ    96.8    94.4       95.5
V PCP2   94.9    97.4       96.1       ADV    95.8    98.0       96.8
INFM     100     100        100        PRP    99.4    99.1       99.2
KS       96.6    95.0       95.7       KC     100     99.1       99.5

Page 12: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

CG-result for Danish: Syntactic function (~95% accuracy)

Class recall precision F-score Class recall precision F-score

@SUBJ> 96.7 95.2 95.9 @>N 97.3 98.2 97.7

@<SUBJ 90.1 96.8 93.3 @N< 90.9 96.1 93.4

@F-SUBJ> 86.6 86.6 86.6 @APP* 100 87.5 93.3

@F-<SUBJ 100 100 100 @N<PRED 100 80.0 88.8

@<ACC 94.6 95.3 94.9 @>A 88.6 95.9 92.1

@ACC>* 88.8 88.8 88.8 @A< 89.4 94.4 91.8

@<DAT* 100 75.0 85.7 @P< 98.1 98.1 98.1

@<PIV 93.5 87.8 90.5 @FS-<SUBJ* 77.7 77.7 77.7

@<SC 92.0 84.3 87.9 @FS-<ACC 100 72.7 84.1

@<OC* 83.3 100 90.8 @FS-ACC> 100 91.6 95.6

@<SA 83.3 86.9 85.0 @FS-<ADVL 90.3 96.5 93.2

@<OA* 100 75.0 86.7 @FS-ADVL> 84.6 78.5 81.4

@<ADVL 93.2 90.6 91.8 @FS-P< 90.9 100 95.2

@ADVL> 96.9 93.2 95.0 @ICL-<SUBJ* 100 100 100

@KOMP<* 100 100 100 @ICL-P< 96.1 100 98.0

@P< 98.1 98.1 98.1
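The three columns in the tables above are related. As a minimal sketch, assuming the F-score is the balanced harmonic mean of recall and precision (F1), which the slides do not state explicitly, three rows of the syntactic function table can be recomputed as follows; the function name `f_score` is illustrative.

```python
def f_score(recall, precision):
    """Balanced F-score (harmonic mean); assumes equal weight for recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# (recall, precision) pairs copied from the syntactic function table
rows = {
    "@SUBJ>": (96.7, 95.2),
    "@<SUBJ": (90.1, 96.8),
    "@<ACC": (94.6, 95.3),
}

for label, (recall, precision) in rows.items():
    print(label, round(f_score(recall, precision), 1))
```

For these three rows the recomputed values agree with the published F-score column after rounding to one decimal.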

Page 13: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

Corpus Eye

• internet-based, http://corp.hum.sdu.dk/, using CQP (Corpus Query Processor)

• menu-based category searches in context

• multi-token constituents, regular expressions and quantifiers

• sorting and quantification

• grammatically annotated corpora for 8 Germanic and Romance languages (about 1 billion words), mostly from the written language domain.

Page 14: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

The interface

Page 15: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

Simple text searches: e.g. compounds

Page 16: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

Menu based category search

Page 17: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

Output: "raw" concordance

Page 18: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

Sorting and statistics

Page 19: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

"invandrer" (immigrant), adjective context:

Page 20: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

"udlænding" (foreigner), adjective context:

Page 21: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

"flygtning" (refugee), adjective context:

Page 22: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

imperatives animal expressions

Page 23: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

The case for treebanks

• A treebank is a corpus annotated with full syntactic structure, attaching tokens to each other (dependency grammar) or to interconnected non-terminal nodes (constituent grammar)

• Treebanks contain more syntactic detail than tagged corpora

• Treebanks make it possible to train or evaluate automatic systems of analysis

• Treebanks allow searches for complex units and their relations, rather than individual tokens or their features. For instance, the sequence of NPs with certain functions can be queried directly, or conditioned on their being daughters of an embedded clause (subclause).

• Treebanks exist for a large number of languages (cf. CoNLL-X shared task), e.g. Negra/TIGER (German), Penn (English), Mamba (Swedish), Cast3LB (Spanish), PDT (Czech) ....

• The largest VISL treebank is the double-format Arboretum treebank for Danish, annotated in both dependency and constituent grammar

Page 24: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

Indented PSG-notation

STA:fcl
fA:fcl
=SUB:conj-s('da') Da (When)
=S:np
==DN:art('den' UTR S DEF) den (the)
==DN:adj('gammel' nG S DEF NOM) gamle (old)
==H:n('sælger' UTR S IDF NOM) sælger (salesman)
=P:v-fin('køre' IMPF AKT) kørte (drove)
=As:adv('hjem' DIR) hjem (home)
=fA:pp
==H:prp('in') i (in)
==DP:np
===DN:pron-poss('sin' <refl> UTR S) sin (his)
===H:n('bil' UTR S IDF NOM) bil (car)
P:vp-
=Vaux:v-fin('kunne' IMPF AKT) kunne (could)
S:pron-pers('han' UTR 3S NOM) han (he)
-P:vp
=Vm:v-inf('se' AKT) se (see)
Od:np
=DN:pron-indef('mange' <quant> nG P NOM) mange (many)
=H:n('rådyr' NEU P IDF NOM) rådyr (deer)
Ao:pp
=H:prp('på') på (in)
=DP:np
==DN:art('den' nG P DEF) de (the)
==DN:adj('våd' nG nD NOM) våde (wet)
==H:n('mark' UTR P IDF NOM) marker (fields)

FUNCTION:form
EDGES:nodes/terminals
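The indented notation encodes tree depth by the number of leading `=` signs, with each line of the form FUNCTION:form. A minimal sketch of how such lines could be turned into a nested tree follows; it assumes one node per line, and the names `Node`, `parse_psg` and `words` are hypothetical helpers for illustration, not part of the VISL software. Discontinuity markers such as `P:vp-` and `-P:vp` are not given any special treatment here.

```python
import re
from dataclasses import dataclass, field

# (=*) depth, FUNCTION:form, optional ('lemma' TAGS), optional word form
LINE = re.compile(r"^(=*)([^:=\s]+):([^\s(]+)(?:\((.*?)\))?\s*(\S+)?")

@dataclass
class Node:
    function: str
    form: str
    depth: int
    word: str = ""
    children: list = field(default_factory=list)

def parse_psg(text):
    """Build a tree from indented FUNCTION:form lines; returns an artificial root."""
    root = Node("ROOT", "", -1)
    stack = [root]
    for raw in text.strip().splitlines():
        m = LINE.match(raw.strip())
        if not m:
            continue
        depth = len(m.group(1))
        node = Node(m.group(2), m.group(3), depth, m.group(5) or "")
        # climb back up until the top of the stack is shallower than this node
        while stack[-1].depth >= depth:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return root

def words(node):
    """Collect the terminal word forms below a node, in surface order."""
    found = [node.word] if node.word else []
    for child in node.children:
        found.extend(words(child))
    return found

SAMPLE = """
fA:fcl
=SUB:conj-s('da') Da (When)
=S:np
==DN:art('den' UTR S DEF) den (the)
==DN:adj('gammel' nG S DEF NOM) gamle (old)
==H:n('sælger' UTR S IDF NOM) sælger (salesman)
=P:v-fin('køre' IMPF AKT) kørte (drove)
=As:adv('hjem' DIR) hjem (home)
"""

tree = parse_psg(SAMPLE)
clause = tree.children[0]
print([child.function for child in clause.children])
print(words(clause.children[1]))
```

With a tree in this shape, a query such as "all S:np daughters of an fcl" becomes a simple recursive walk over `children`.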

Page 25: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

Search results as syntactic tree structures

Page 26: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

Syntactic functions in Korpus2000: clause level

[Bar chart, frequency axis 0 to 3000. Categories: SUBJ, F/S-SUBJ, ACC, DAT, PIV, SC/SA, OC/OA, ADVL, PRED. Series: <, >, FS, ICL]

Page 27: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

Syntactic functions in Korpus2000: group level

[Bar chart, frequency axis 0 to 5000. Categories: >N, N<; >A, <A; P<, >P. Series: <, >, FS, ICL]

Page 28: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

Syntactic functions in Korpus2000: special functions

[Bar chart, frequency axis 0 to 1400. Categories: >>P, N<PRED, KOMP<, ADVL, SUB, AUX<, INFM]

Page 29: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

Semantic restrictions for objects: Does semantic class play a role for object positions?

forflytte <Hprof>_2 (human professional)
forfægte <pp>_3 (thought product)
forfølge <ac>_8 <Hprof>_6 <H>_4 .... (activities and people)
forføre <H>_3 (people)
forgylde <H>_4 <Hprof>_3 (people)
forhale <act-c>_3 <act>_3 (actions and activities)
forhandle <ac>_17 <sem-r>_9 <conv>_8 .... (countable abstracts, "readables", agreements)
forhaste <pp>_3 <sem>_3 (thought products)
forhindre <act>_35 <Hprof>_23 <ac>_18 <act>_18 <H>_17 <HH>_14 <event>_9
forhøje <ac>_13 <mon>_7 <mon-c>_5 ... (abstracts and amounts of money)
forkaste <pp>_5 <Hprof>_4 <ac>_3 <conv>_3 .. (thought products, professionals, agreements)
forklare <ac>_39 <act-c>_7 <act>_6 ... (abstracts and actions)
forkorte <per>_4 (periods)

Searches based on semantic prototype annotation: verb semantics vs. object semantics

han så aldrig igen @ADVL filmen <sem>+@ACC (he never again saw the film)
han så aldrig vennen <H>+@ACC igen @ADVL (he never saw the friend again)

Page 30: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

Direct/accusative objects in Danish

form type                  fronted (ACC>)       right of main verb (<ACC)
finite clause (FS)         5.2 % (quotes!)      12.8 %
non-finite clause (ICL)    0.0 % (1 case)       5.3 %
nouns (N)                  0.3 % (checked)      53.8 %
proper nouns (PROP)        0.0 % (12 cases)     3.4 %
relative pronouns          1.9 %                -
interrogative pronouns     0.5 %                - (4 adverbs)
personal pronouns          1.0 %                12.0 %
others                     0.4 %                4.4 %
all                        9.3 %                91.7 %

7.1 % in 1.1 million words from Korpus2000

Page 31: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

Fronted nominal objects

Subtype                  n     frequency   definition
interrogative            79    29.0 %      at se, hvilken interesse kineserne skulle have
topic                    74    27.2 %      Denne interesse overførte han på virksomheden
                                           De problemer har jeg slet ikke.
focus                    55    20.2 %      Blot 6-7 kr. vil sparekassen se som betaling
                                           Sin spillefilmsdebut fik han i 1962 med ...
fronted in verb chain    43    15.8 %      ... få tyvekosterne bragt hjem
                                           ... får man billeder at se gratis
                                           ... at lære de nødvendige redskaber at kende
raised                   12    4.4 %       Den slags er vi jo nogle stykker der kan lide
fixed                    7     2.6 %       Hvad udvalget af værker angår, har ...
vp-internal              2     0.7 %       ... at min søn ingen huller havde
                                           ... hun har ingen kage bagt
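The frequency column can be recomputed from the raw counts. A minimal sketch, assuming the percentages are shares of the summed counts of the seven subtypes (the slide does not state the base explicitly); the helper name `shares` is illustrative.

```python
# Raw counts of fronted nominal objects per subtype, copied from the table
counts = {
    "interrogative": 79,
    "topic": 74,
    "focus": 55,
    "fronted in verb chain": 43,
    "raised": 12,
    "fixed": 7,
    "vp-internal": 2,
}

def shares(counts):
    """Return each subtype's percentage of the total, rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

total = sum(counts.values())
print(total)
for label, pct in shares(counts).items():
    print(f"{label}: {pct} %")
```

Under this assumption the counts sum to 272 and the recomputed percentages agree with the frequency column.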

Page 32: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

Pronoun ellipsis in relative clauses

              der           som           zero          all: 938
              n     %       n     %       n     %       n     %
SUBJ          421   44.9    175   18.7    (15)  (1.6)   611   65.1
  raised      -     -       3     0.3     -     -       3     0.3
  det-focus   33    3.5     10    1.1     -     -       43    4.6
ACC           -     -       34    3.6     37    3.9     71    7.6
  raised      -     -       7     0.7     2     0.2     9     1.0
  det-focus   -     -       -     -       6     0.6     6     0.6
>>P           4     0.4     16    1.7     12    1.3     32    3.4
  raised      -     -       7     0.7     1     0.1     8     0.9
  det-focus   -     -       -     -       5     0.5     5     0.5
DAT, CS, OC   -     -       5     0.5     -     -       5     0.5
sum           458   48.8    257   27.4    78    8.3     793   84.5

              hvor          når, da       zero
ADVL-adv      111   11.8    10    1.1     10    1.1     131   14.0
cumulative                                88    9.4     924   98.5

              hvorPRP       PRP+hvilken
P< (ADVL)     7     0.7     1     0.1                   8     0.9

              hvis          at            hvilket
>N, SUB, S<   1     0.1     4     0.1     1     0.1     6     0.6

total                                                   938   100.0

Page 33: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

Preadverbial placement of definite np-objects -- adverb types
(VFIN) (N DEF @<ACC) (ADV @<ADVL) 46

Page 34: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

Preadverbial placement of definite np-objects -- adverb types
(VFIN) (ADV @<ADVL) (N DEF @<ACC)

-> only main clauses?, ->same adverbs as intra-vp?
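The two search patterns on these slides are sequences of slots, each slot listing tags that a token must carry. A minimal sketch of such a search over an annotated token list follows; the tag sets on the example sentences are hand-assigned for illustration (not parser output), and `find_pattern` is a hypothetical helper, not the CorpusEye implementation.

```python
def find_pattern(tokens, pattern):
    """Return start indices where consecutive tokens carry all tags of each slot."""
    hits = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        # a slot matches when its required tags are a subset of the token's tags
        if all(required <= tags for (_, tags), required in zip(window, pattern)):
            hits.append(start)
    return hits

# The two patterns from the slides
OBJ_BEFORE_ADV = [{"VFIN"}, {"N", "DEF", "@<ACC"}, {"ADV", "@<ADVL"}]
ADV_BEFORE_OBJ = [{"VFIN"}, {"ADV", "@<ADVL"}, {"N", "DEF", "@<ACC"}]

# "han så vennen igen" and "han så aldrig vennen igen", tagged by hand
sentence_a = [
    ("han", {"PERS", "@SUBJ>"}),
    ("så", {"V", "VFIN"}),
    ("vennen", {"N", "DEF", "@<ACC"}),
    ("igen", {"ADV", "@<ADVL"}),
]
sentence_b = [
    ("han", {"PERS", "@SUBJ>"}),
    ("så", {"V", "VFIN"}),
    ("aldrig", {"ADV", "@<ADVL"}),
    ("vennen", {"N", "DEF", "@<ACC"}),
    ("igen", {"ADV", "@<ADVL"}),
]

print(find_pattern(sentence_a, OBJ_BEFORE_ADV))
print(find_pattern(sentence_b, ADV_BEFORE_OBJ))
```

Tallying the adverb lemma in the matched ADV slot for each pattern would give the adverb-type distributions compared on the following slide.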

Page 35: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

Relative frequency of adverb types

after def-np obj.

before def-np obj.

Page 36: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

German, for comparison

(np-def @ACC) (adv) (AUX<)

(np-def @ACC) (adv) (>AUX)

(adv) (np-def @ACC) (AUX<)

Page 37: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

Candidates for adverbs with an influence on object position: vp-inserted adverbs and their position specificity

red = attitudinal adverbs
blue = conjunctional adverbs

Page 38: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

Pre-positioned adverbs in preposition-governed infinitives

(PRP) (ADV) (INF @ICL-P<)

red = focus adverbs
blue = time adverbs
green = inflected adverbs

Page 39: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

Post-positioned "light objects"
VFIN (ADV @<ADVL) (PERS @<ACC)

either 1./2. person (speech!) or special cases ...

Page 40: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

Cross language perspective

• VISL uses a uniform descriptive system, with consistent form and function categories, for 27 languages, handling special cases at the subcategory level

• CorpusEye offers 2 large CG-annotated multi-language corpora, allowing a certain degree of statistical standardisation (genre, lexicon etc.) across languages

– 1. Europarl parallel corpus (da, de, en, es, fr, it, pt)

– 2. Wikipedia corpus (da, de, en, eo, es, fr, it, pt)

• The annotation (e.g. np-types), the search system (e.g. different statistics) and the language inventory (e.g. se) can all be expanded in a project-driven way

Page 41: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

Cross SL category distribution

GER = Germanic average, ROM = Romance average, Red = high values, Blue = low values

Notables: Sentence length, inflexion vs. aux chains, subjunctive and conditional, ROM-adj vs. GER-v, ROM-coord., DK vs. ES, xx-French (shorter than even GER), politeness vocative

                                            da     sv     de     en     nl     GER    xx/fr  es     it     pt     ROM    fi     el
words per sentence                          25.5   25.1   25.3   25.7   23.1   24.9   27.8   32.1   32.9   33.2   32.7   25.3   31.0
finite subclauses                           3.81   3.75   3.47   3.47   3.30   3.56   3.16   4.04   3.68   3.52   3.75   3.00   3.72
relative clauses                            1.95   2.05   1.68   1.70   1.58   1.79   1.72   2.16   2.10   2.07   2.11   1.50   2.09
direct object clauses                       1.11   1.04   1.02   1.03   0.95   1.03   0.85   1.10   0.90   0.81   0.94   0.78   0.94
adverbial clauses                           0.63   0.54   0.67   0.61   0.63   0.62   0.52   0.70   0.63   0.55   0.63   0.57   0.62
participial adverbial subclauses (log-5)    2.92   2.15   3.20   4.35   4.52   3.43   3.96   3.82   4.09   4.71   4.21   3.31   4.78
auxiliary chain parts                       3.46   3.35   3.34   3.36   3.13   3.33   2.89   2.98   2.99   2.52   2.83   3.02   2.77
passive pcp2                                0.47   0.45   0.42   0.45   0.44   0.45   0.41   0.33   0.34   0.39   0.35   0.44   0.39
active pcp2                                 1.17   1.14   1.15   1.33   1.07   1.17   1.12   1.22   1.20   0.95   1.12   1.04   1.17
infinitive                                  1.43   1.38   1.39   1.21   1.25   1.33   0.99   1.12   1.11   0.93   1.05   1.20   0.89
subjunctive/vfin                            4.99   5.58   4.76   4.53   4.40   4.85   4.19   4.76   4.26   4.79   4.60   5.55   4.35
conditional                                 0.56   0.56   0.56   0.62   0.43   0.55   0.43   0.49   0.43   0.40   0.44   0.56   0.39
vocative                                    0.04   0.04   0.06   0.05   0.06   0.05   0.05   0.06   0.07   0.04   0.06   0.05   0.05
attributive                                 6.70   6.98   7.02   7.01   7.29   7.00   7.26   7.37   7.64   8.13   7.71   7.65   7.62
common nouns                                20.90  21.26  21.00  21.33  21.35  21.2   22.07  21.37  21.09  22.14  21.5   22.66  21.71
finite verbs                                8.94   8.59   8.48   8.29   8.49   8.56   7.57   8.18   7.78   7.23   7.73   7.83   7.86
coordinating conjunction                    2.67   2.48   2.80   2.68   2.56   2.64   2.74   3.20   3.16   3.28   3.21   2.40   3.20
subordinating conjunct.                     2.33   2.16   2.22   2.17   2.13   2.20   1.84   2.35   2.01   1.87   2.08   1.88   2.06
demonstrative                               1.96   2.14   2.34   2.17   2.24   2.17   1.99   2.17   1.98   2.02   2.06   1.82   1.81

Page 42: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

VISL
http://visl.sdu.dk

[email protected]

****************************

parsers: http://beta.visl.sdu.dk

corpus search: http://corp.hum.sdu.dk

teaching: http://visl.sdu.dk

Page 43: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

Text typology: Passive constructions

• Passive frequency as a style marker for officialese, level of abstraction etc.?

• 3.1% all passives, 2.3% finite forms incl. active participle, 5.9 infinitives

• s-passive or blive-passive

• lexeme-specific passive norms?

Page 44: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

• (a) Børnene flokkedes omkring ismaskinen. *Børnene blev flokket.
Lexicalised s-passive ("slås", "synes")

• (b) Løgene svitses. Løgene bliver svitset.
High Spas/akt, high Spas/Bpas

• (c) Aktieudbytte beskattes med 25%. Aktieudbytte bliver beskattet med 25%.
High Spas/akt, neutral Spas/Bpas

• (d) Minimælk fås kun fra Arla. *Minimælk bliver fået.
Low Spas/akt, high Spas/Bpas

• (e) Der arbejdes på en løsning. Der bliver arbejdet. *Den bliver arbejdet.
Blive-passive only with formal subject.

• (f1) Bøgerne er solgt d. 10. oktober (=er blevet). *Bøgerne er solgte d. 10. oktober.
(f2) Tallene er vist (=vises) med rød skrift. *Tallene er viste med rød skrift.
Være-passive either as s-passive or as blive-passive

Page 45: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

Text typology: Passive constructions

Page 46: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

Lexicographic work

e.g. lexemes occurring in particular syntactic sequences:
@SUBJ> (subject) @MV (main verb) @<ACC (object)
”hest” (horse) ”æde” (eat) ”hø” (hay)

annotation with semantic prototypes:
21 aflyse <occ> (organised events)
19 aflyse <act-c> (countable actions and activities)
4 aflyse <ac> (countable abstracts)
4 aflyse <act> (actions and activities)
4 aflyse <sem-l> (musical works etc.)
3 aflyse <event> (occurrences)
3 aflyse <sit> (situations)

Page 47: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

Treebanks

Page 48: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

ID button = tree link

Page 49: Corpus tools for a quantitative study of object positions in written language Eckhard Bick.

Verb complementation: * < (/P:/ < /spist?er?/ $.. /Od/)