Non-standard in literature project: Methodology User Guide

Michael Percillier

September 15, 2015

Contents

1 Text selection
  1.1 Representativeness
  1.2 Genre
2 Scanning, OCR and proofreading
  2.1 Scanning
  2.2 OCR
  2.3 Proofreading
    2.3.1 Normalising quotation marks
    2.3.2 Marking page turns
    2.3.3 Actual proofreading
3 Annotation
  3.1 Basic automated annotation
    3.1.1 Preparing a proofread text for basic automated annotation
    3.1.2 The basic automated annotation process
  3.2 Manual annotation
    3.2.1 First time launch of XmlCat
    3.2.2 XmlCat user interface
    3.2.3 Opening and saving a corpus file
    3.2.4 Annotating non-standard features
    3.2.5 Annotating meta-features
    3.2.6 Inter-rater Reliability Test
A Select figures from the project

List of Figures

3.1 Screenshot of XmlCat window on first launch
3.2 Screenshot of successfully loaded tag set file
3.3 The XmlCat function buttons
3.4 Text display options in XmlCat
3.5 Line spacing options in XmlCat
3.6 Selection of a non-standard feature with the mouse cursor
3.7 First level of the tagging window in XmlCat
3.8 Fully expanded tagging window in XmlCat
3.9 View of inserted tag in XmlCat
3.10 Editor window in XmlCat
3.11 Overview of relevant sounds and their SAMPA symbols
A.1 Feature density
A.2 Proportional feature profiles
A.3 Presence of A-rated eWAVE features in Singaporean texts
A.4 Feature density of characters in select West African texts
A.5 Feature density VS character importance in The Orange and the Green
A.6 Character profiles in The Adventures of Holden Heng
A.7 Meta features per region, normalised for 10,000 words

Chapter 1

Text selection

In this chapter, the criteria for text selection will be explained.

1.1 Representativeness

In order to represent a literary landscape, we prefer to include a large number of text excerpts by many authors rather than a few texts in their entirety. Our corpus is to be divided into several subcorpora corresponding to world regions, i.e. Caribbean, South Asia, Southeast Asia, West Africa etc. For each subcorpus, the aim is to collect around 100,000 words. In order to make a diachronic analysis possible, attempts should be made to obtain comparable word counts for each decade. As such, decades with only one or two texts, or even coverage gaps, should be avoided. Furthermore, an author should not be represented by a single text if possible.
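To keep an eye on the balance between decades, the word counts of the selected excerpts can be tallied per decade. The following is a minimal sketch in Python 3 (not one of the project scripts); the function name words_per_decade and the sample figures are made up for illustration.

```python
from collections import defaultdict


def words_per_decade(texts):
    """Sum word counts per decade from (year, word_count) pairs."""
    totals = defaultdict(int)
    for year, word_count in texts:
        decade = (year // 10) * 10  # e.g. 1964 -> 1960
        totals[decade] += word_count
    return dict(sorted(totals.items()))


# Hypothetical figures for one subcorpus, not project data.
sample = [(1964, 4200), (1968, 3900), (1975, 5100), (1991, 800)]
for decade, total in words_per_decade(sample).items():
    print(f"{decade}s: {total} words")
```

Decades that are missing from the output, or that have a much lower total than the others, point to the coverage gaps mentioned above.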

1.2 Genre

We focus on prose texts only, as poetry may be influenced by rhyme and metre in addition to nonstandard features, and dramatic texts are written to be performed orally, which means that features such as accent may be found in stage directions rather than in the dialogue itself. This does not imply that these genres do not deserve to be included, but simply that we focus on prose texts for the time being.

For prose texts, we can include novels and short stories. Due to their length, novels should not be included in their entirety. Instead, we select an excerpt, which may be one chapter or several. If several chapters are selected, they should be contiguous, i.e. there should be no missing chapters in between. Short stories may be selected in their entirety.

As our project makes use of copyrighted material, we have to abide by the rules of Fair Use so as to proceed without having to ask permission from every single copyright holder. To be on the safe side, avoid selections that make up more than 10% of the total work. For short stories, make sure that the story takes up less than 10% of the short story collection or anthology in which it is printed.

Chapter 2

Scanning, OCR and proofreading

2.1 Scanning

I recommend scanning in greyscale and using a resolution of at least 300 ppi (pixels per inch). Any lower resolution will make it difficult for the OCR (Optical Character Recognition) software to accurately recognise characters. It is usually simpler to save all scanned pages into a single PDF document.

2.2 OCR

Once the selected text is scanned, you can process the file in your OCR software. Recommended OCR software includes Readiris, VueScan (Professional Edition), and Adobe Acrobat (versions 9 and later). Free alternatives include Simple OCR and FreeOCR. Microsoft OneNote also has OCR capabilities (right-click on an imported picture/PDF and select “Copy Text From Picture”).

The OCR software converts the scanned image into text. The output should be saved in a plain text file (.txt) using the UTF-8 character encoding. This is important as characters using diacritics (e.g. as in fiancée) will otherwise be lost, and the post-processing scripts expect text files in UTF-8. The output file should be saved in the folder Project shared/Scans. The file name should ideally follow the pattern author_title.txt.
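If you are unsure whether your OCR software really saved the file as UTF-8, a quick check is possible. The sketch below is written for Python 3 and is not one of the project scripts; the function names and the example file name are illustrative assumptions.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the raw bytes can be decoded as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def file_is_utf8(path):
    """Read a file in binary mode and test its encoding."""
    with open(path, "rb") as handle:
        return is_valid_utf8(handle.read())


# Example call with a hypothetical file name:
# print(file_is_utf8("author_title.txt"))
```

A file saved in a legacy encoding such as Latin-1 will typically fail this check as soon as it contains an accented character.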

2.3 Proofreading

While OCR software can be quite accurate, it never achieves 100% accuracy and its output needs to be checked. Before and during the proofreading process, we undertake two additional steps:

• Normalising quotation marks

• Marking page turns

2.3.1 Normalising quotation marks

Printed texts may use different sets of quotation marks to indicate direct discourse, either “double quotation marks” or ‘single quotation marks’. We harmonise our texts to use “double quotation marks” only, one reason being consistency, the other being that direct character discourse will be annotated automatically in a subsequent step (double quotation marks are reliable, whereas single quotation marks use the same character as the apostrophe).

Should your text use single quotation marks for direct character discourse, it is necessary to run it through the Python script quote_fix.py located in Project shared/Scans. In order to run the script, you will need to have Python 2.7 installed. The easiest way is to open the file using Python Launcher. A console window will appear and you will be prompted to enter the name of the text file (including the .txt ending) and hit Enter. If you do not feel confident in running the Python script, I can happily take over this step for you. The output file will be the previous file name followed by -quotefixed.txt. The version with normalised quotation marks is the one to be proofread.
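To illustrate why single quotation marks are harder to handle than double ones, here is a rough sketch of the kind of heuristic such a conversion needs. It is not the code of quote_fix.py; it is written for Python 3, the function name normalise_quotes is hypothetical, and the use of straight double quotes in the output is an assumption.

```python
SINGLE_QUOTES = "'\u2018\u2019"  # straight, left curly, right curly
CLOSERS = ".,;:!?)]"


def normalise_quotes(line):
    """Turn paired single quotation marks into double ones.

    Heuristic: an opening mark follows whitespace (or the line start)
    and precedes a non-space character; a closing mark is only
    accepted while a quotation is open and must be followed by
    whitespace or punctuation. Apostrophes inside words stay as-is.
    """
    out = []
    inside = False
    for i, ch in enumerate(line):
        if ch in SINGLE_QUOTES:
            prev = line[i - 1] if i > 0 else " "
            nxt = line[i + 1] if i + 1 < len(line) else " "
            opens = (not inside and ch != "\u2019"
                     and (prev.isspace() or prev in "([")
                     and not nxt.isspace())
            closes = (inside and ch != "\u2018"
                      and not prev.isspace()
                      and (nxt.isspace() or nxt in CLOSERS))
            if opens or closes:
                out.append('"')
                inside = not inside
                continue
        out.append(ch)
    return "".join(out)


print(normalise_quotes("'I can't go,' she said."))
```

Word-final apostrophes inside a quotation (e.g. goin') and word-initial ones (e.g. 'tis) will still fool this heuristic, which is one more reason why the converted text has to be proofread.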

2.3.2 Marking page turns

While proofreading, mark the beginning of each new page with the following tag: <page id='1'/> (replacing 1 with the relevant page number in the printed volume). This will make it easier to find a specific passage in the printed volume. It is important to use single quotes rather than double quotes, as double quotes will be used to place character discourse tags automatically in a subsequent step. If a word is split across pages, do not put the tag in the middle of the word, but place it before the first full word of the page. The page tag should be inline, meaning that you should not interrupt a paragraph with a line break.
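Page tags that accidentally use double quotes are easy to overlook. As a small aid, the following Python 3 sketch (not a project script) lists the page numbers found in a text and reports whether any page tag uses double quotes. The helper names are hypothetical, and the patterns assume purely numeric page numbers written exactly in the form given above.

```python
import re

# Assumed tag shape: <page id='12'/> with single quotes and digits only.
PAGE_TAG = re.compile(r"<page id='(\d+)'/>")
DOUBLE_QUOTED = re.compile(r'<page id="[^"]*"/>')


def page_numbers(text):
    """Return the page numbers of all correctly written page tags."""
    return [int(number) for number in PAGE_TAG.findall(text)]


def has_double_quoted_page_tag(text):
    """Return True if a page tag uses double quotes by mistake."""
    return DOUBLE_QUOTED.search(text) is not None


sample = "the end of a page <page id='12'/>and the start of the next"
print(page_numbers(sample), has_double_quoted_page_tag(sample))
```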

2.3.3 Actual proofreading

Many errors in OCR are due to the fact that print editions use roman/serif typefaces, which means that some characters are narrower than others, and adjacent character combinations may look like a different character. In order to avoid making the same errors as the OCR software, I advise using a monospaced typeface, i.e. one where each character occupies the same width. Additionally, certain characters may be hard to distinguish, so a typeface having more distinct characters is helpful. Console typefaces (traditionally used for programming) are the best choice, as they are monospaced and have more distinct shapes for characters that are easily misread. Serif typefaces (the ones that are likely to be encountered by the OCR software as they are often used in print) such as Times New Roman or Garamond for example make the character sequences <rn> and <cl> look like <m> and <d> respectively. Additionally, the character groups <0, O> (zero, uppercase o) and <1, l, I> (one, lowercase l, uppercase i) look very much alike in serif typefaces, but are easier to distinguish in monospace console typefaces. A contrast between serif typefaces and recommended monospaced typefaces is given below.

• Serif:

  – rnm cld 0O 1lI (Times New Roman)
  – rnm cld 0O 1lI (Garamond)

• Monospace:

  – rnm cld 0O 1lI (Consolas)
  – rnm cld 0O 1lI (Monaco)
  – rnm cld 0O 1lI (Andale Mono)
  – rnm cld 0O 1lI (Menlo)
  – rnm cld 0O 1lI (DPCustomMono2)

The latter, DPCustomMono2, is admittedly ugly but specifically designed for proofreading (the font can be found at Project shared/DPCustomMono2.ttf).

To proofread your text, use a text editor (e.g. TextEdit, TextWrangler on Mac, or NotePad, Notepad++, Notepad2 on Windows) rather than a word processor (e.g. Microsoft Word, Apple Pages, OpenOffice, LibreOffice etc.).

The difference between hyphens and dashes is not easy for the OCR software to recognise and is often treated wrongly (and hyphens are used instead of dashes in certain editions to begin with). To disambiguate the two in our corpus, we use a single - (hyphen/minus sign) for a hyphen, and a double -- for a dash.
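Where the OCR output does contain genuine dash characters, they can be converted in one go. The following Python 3 sketch is not a project script, and the function name normalise_dashes is hypothetical; it only covers the en dash and the em dash.

```python
DASHES = ("\u2013", "\u2014")  # en dash, em dash


def normalise_dashes(text):
    """Replace en and em dashes with the double hyphen used in the corpus."""
    for dash in DASHES:
        text = text.replace(dash, "--")
    return text


print(normalise_dashes("Well\u2014no, not my mother-in-law."))
```

Dashes that the OCR software has already rendered as single hyphens cannot be recovered this way and have to be corrected by hand while proofreading.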

If your text selection contains more than one chapter, write “CHAPTER 1” (in capital letters followed by the chapter number) in a separate line followed by a blank line at the beginning of the chapter.

Passages in italics cannot be recognised as such by the OCR software. When you spot italics, surround them with underscores, like _this_. A subsequent script will place italic tags at the appropriate positions.

Should you notice any obvious typos, e.g. He could never he sure, you may correct them by placing a correction tag like this: He could never <correction orig='he'>be</correction> sure. Be sure to use single quotes for the orig attribute.

Finally, the proofread text file contains the “pure” text and should not try to imitate the formatting of the original print version. The only formatting unit we maintain is the paragraph. A paragraph in the printed volume corresponds to a line in the digitised text. This means that you should only use line breaks between paragraphs. More specifically, the delimiter between paragraphs should be a blank line.
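Stray line breaks inside a paragraph are a common leftover from OCR. A simple way to find them is to split the text at blank lines and flag every block that still contains a line break. The sketch below is written for Python 3 and is not a project script; both function names are hypothetical.

```python
def split_paragraphs(text):
    """Split a proofread text into paragraphs at blank lines."""
    blocks = text.replace("\r\n", "\n").split("\n\n")
    return [block.strip("\n") for block in blocks if block.strip()]


def broken_paragraphs(text):
    """Return the paragraphs that still contain a line break."""
    return [p for p in split_paragraphs(text) if "\n" in p]


sample = "First paragraph.\n\nSecond para-\ngraph, broken.\n\nThird."
for paragraph in broken_paragraphs(sample):
    print(repr(paragraph))
```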

Once the text is entirely proofread, change the filename so that it is prefixed with _CHECKED_. This will mark the text as proofread.

Chapter 3

Annotation

3.1 Basic automated annotation

3.1.1 Preparing a proofread text for basic automated annotation

The proofread text now needs to be put into XML format. In the folder Project shared/Corpus/Checked Scans, there is a file called blank.xml. Open this file with a text editor and save a copy in the aforementioned folder with the filename following the pattern author_title.xml. Once done, fill in the meta-information about the work in the file header, i.e.:

• Region (e.g. Europe)

• Country (e.g. Ireland)

• Author (in the format Surname, Name)

• Title

• Year (in numbers, e.g. 1964)

• Genre (Novel or Short Story)

• Source (list volume title and editor if published as part of a collection or anthology, or write NA if the text is published as a monograph)

• Publisher

• ISBN

• E-book (No if you had to scan and proofread the text)


• Pages (in the format x-y followed by (Chapters a-b) or (Total))

The information should be provided between the opening and closing tags, so if for example you want to define the region as Europe, the relevant line should look like this: <region>Europe</region>

The information for region, country, author, title, year and pages is absolutely necessary to estimate the current size of the corpus and should therefore be entered now. The remaining information can be entered later.

While the opening and closing <header> tags contain the file header, the opening and closing <body> tags are meant to contain the actual text. Simply copy and paste the content of your previously proofread file into the blank line between these two <body> tags. Once saved, the file is ready for basic automatic annotation.

To verify whether you did everything correctly, you can open the file in a web browser. If there are no error messages and the text is displayed, you produced a valid XML file.
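The same check can also be scripted. The following Python 3 sketch (not a project script) parses the XML and lists required header fields that are still empty. Only <header>, <body> and <region> are element names confirmed in this guide; the root element <text> and the other field names (country, author, title, year, pages) are assumptions that should be compared with blank.xml.

```python
import xml.etree.ElementTree as ET

# Assumed element names; check them against blank.xml.
REQUIRED = ("region", "country", "author", "title", "year", "pages")


def missing_header_fields(xml_text):
    """Return the required header fields that are absent or empty.

    Raises xml.etree.ElementTree.ParseError if the XML is not well-formed.
    """
    root = ET.fromstring(xml_text)
    header = root.find(".//header")
    if header is None:
        return list(REQUIRED)
    missing = []
    for name in REQUIRED:
        field = header.find(name)
        if field is None or not (field.text or "").strip():
            missing.append(name)
    return missing


# Hypothetical file content for illustration only.
sample = """<text>
<header>
<region>Europe</region>
<country>Ireland</country>
<author>Surname, Name</author>
<title>Example Title</title>
<year>1964</year>
<pages></pages>
</header>
<body>
A first paragraph.
</body>
</text>"""
print(missing_header_fields(sample))  # prints ['pages']
```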

3.1.2 The basic automated annotation process

The XML file is run through the Python script basic_format_annotation.py, which does the following:

• give each paragraph a paragraph tag with a running number, in the format <p id='1'>

• place character tags where “quotation marks” are present

• place italic tags where underscores were placed during the proofreading process

• place chapter tags in case chapter markers were placed in the proofreading process

You will need to have Python 2.7 installed to perform this task yourself. Opening the script file with Python Launcher will prompt you to enter the name of the file (including the .xml ending). Alternatively, I can take over this step for you.

The output of the script is a file with the same name as the input file followed by -basic.xml. This file can now be moved to the folder Project shared/Corpus/Tag-ready, as it is ready for manual annotation.
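To give an idea of what these steps amount to, here is a heavily simplified sketch in Python 3. It is not the code of basic_format_annotation.py. The <p id='1'> format and the <character> element are taken from this guide, but the attribute spelling discourse='direct', the element name <italic> and the use of straight double quotes in the input are assumptions. Chapter markers, page tags and XML escaping are left out.

```python
import re

CHARACTER_OPEN = "<character discourse='direct'>"  # assumed attribute spelling


def annotate_paragraph(paragraph, number):
    """Add character, italic and paragraph tags to one paragraph."""
    # Wrap the content of each quoted span; the quotation marks
    # themselves stay outside the character tags.
    paragraph = re.sub(
        r'"([^"]*)"',
        lambda m: '"' + CHARACTER_OPEN + m.group(1) + '</character>"',
        paragraph,
    )
    # Turn _underscored_ spans into italic tags (element name assumed).
    paragraph = re.sub(r"_([^_]+)_", r"<italic>\1</italic>", paragraph)
    return f"<p id='{number}'>{paragraph}</p>"


def annotate_body(text):
    """Number and annotate all paragraphs separated by blank lines."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return "\n".join(
        annotate_paragraph(p, i) for i, p in enumerate(paragraphs, start=1)
    )


print(annotate_body('He said "come _now_" quietly.\n\nShe left.'))
```

The sketch also shows why page and correction tags must use single quotes: a double quote inside such a tag would be mistaken for the start of character discourse.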

Opening the file in a web browser should display the following basic annotation:

• Chapters in a black frame with the chapter number written in red

• Character discourse framed in red


Figure 3.1: Screenshot of XmlCat window on first launch

• Italic passages in italics

• Paragraphs preceded by ¶ and the paragraph number in grey

• Page beginning marked by page number surrounded by in grey

3.2 Manual annotation

The manual annotation process is done with an XML editor I developed specifically for our project: XmlCat (short for XML Corpus Annotation Tool).

3.2.1 First time launch of XmlCat

In the folder Project shared, there are two ZIP files called XmlCat(Mac).zip and XmlCat(Win).zip. Copy the file relevant to your operating system to a local folder on your hard drive (meaning not anywhere in the shared project folder) and unzip it.

On Windows, a folder called XmlCat(Win) will be extracted, which you can drag to a location of your choice, e.g. in your Program Files folder. Inside the XmlCat(Win) folder, double-click XmlCat.exe to launch XmlCat.

On Mac, an application file is extracted. I recommend dragging it to your Applications folder and, optionally, dragging it to your Dock for easy access. Double-clicking the application icon, or single-clicking it in your Dock, will launch XmlCat.

After first launching the application, you will notice that no tag set is currently selected, as seen in Figure 3.1 on page 11.

You will need to select a tag set file to annotate your text. To do this, click on the Change button and select the file nonstandard.yaml located in Project shared/Tagsets. Instead of the red No tag set selected warning, the path to the tag set file you selected should now be displayed in black, as shown in Figure 3.2 on page 12.

Figure 3.2: Screenshot of successfully loaded tag set file

Figure 3.3: The XmlCat function buttons

No need to worry, selecting the tag set file only needs to be done on the first launch, as XmlCat will remember your selection on the next launch.

IMPORTANT: do not move, delete or modify the tag set file!

3.2.2 XmlCat user interface

Function buttons

The tool bar contains six function buttons, shown in Figure 3.3 on page 12. Their functions are:

• Open

• Save

• Tag

• Hide/Show

• Remove

• Edit

Alternatively, the buttons’ functions can be triggered by the following keyboard shortcuts:

• Mac: Cmd O (Open), Cmd S (Save), Cmd T (Tag), Cmd I (Hide/Show), Cmd R (Remove), Cmd E (Edit)

• Windows: Ctrl O (Open), Ctrl S (Save), Ctrl T (Tag), Ctrl I (Hide/Show), Ctrl R (Remove), Ctrl E (Edit)

Figure 3.4: Text display options in XmlCat

Figure 3.5: Line spacing options in XmlCat

Text display

The font and size of the displayed text can be adjusted to your liking using the drop-down menu just under the function buttons, as shown in Figure 3.4 on page 13. Click the Apply button for changes to take effect. These settings will be saved between sessions.

Line spacing can also be adjusted by clicking any of the three buttons to the right of the font display options. You can choose between tight, normal, and wide line spacing, as shown in Figure 3.5 on page 13.

3.2.3 Opening and saving a corpus file

The Project shared/Corpus folder contains four sub-folders:

• Project shared/Corpus/Checked scans: contains scanned and proofread texts

• Project shared/Corpus/Tag-ready: contains files ready to be annotated

• Project shared/Corpus/Tagging: contains files currently being annotated

• Project shared/Corpus/Final: contains entirely annotated files

To annotate a file, click the Open button (or use the keyboard shortcut), then select your file from the Tag-ready folder.

After starting work on a file, save your progress regularly in the Tagging folder by clicking the Save button. The file you originally opened in the Tag-ready folder should remain unchanged in order to have annotated and unannotated versions of our texts. When resuming work on a file, open the version in progress in the Tagging folder rather than the file in the Tag-ready folder. Once the file is entirely annotated, it can be dragged to the Final folder.

Figure 3.6: Selection of a non-standard feature with the mouse cursor

Figure 3.7: First level of the tagging window in XmlCat

Once a text is entirely annotated, its unannotated version in the Tag-ready folder should be prefixed with _TAGGED_ to distinguish annotated texts from those that still require annotating.

3.2.4 Annotating non-standard features

Once you open a file, its content will be displayed in the application’s main window. Once you spot a non-standard feature, select it with your mouse cursor. An example is given in Figure 3.6 on page 14.

We assume that the vowel <e> in ‘set’ corresponds to the <i> of ‘sit’ in standard English. Once the feature is selected, we click the Tag button (or use the keyboard shortcut). A new window will pop up, as shown in Figure 3.7 on page 14.

Figure 3.8: Fully expanded tagging window in XmlCat

Figure 3.9: View of inserted tag in XmlCat

This is the tagging window. We now click on the appropriate radio button. Doing so will reveal the features in the clicked category. Keep clicking until you arrive at the final level, shown in Figure 3.8 on page 15. (NB: if you click on a wrong button, the process can be reset by clicking on a radio button at the very first level on the left.)

At the final level, certain attributes have to be entered manually, in our particular example the attributes observed and standard. Rather than use IPA symbols, we use SAMPA symbols, which is quicker and less problematic. In our example, we enter E and I. Once we click the Confirm button, the tagging window will close and the XML tags will be inserted into the text automatically, as shown in Figure 3.9 on page 15.

When annotating, I recommend having the text opened in both XmlCat and a web browser, preferably Mozilla Firefox, Google Chrome, or Apple Safari. The browser display hides the tags, thus making the text easier to read, while using the tags to highlight the tagged passages. Having both views side by side makes the tagging process easier. Once you have saved your progress in XmlCat, you’ll have to hit the refresh button in your browser for the updated version to be displayed.

Figure 3.10: Editor window in XmlCat

Editing tags

To correct a tag (in case you’ve made a mistake or changed your mind) or insert a comment in it, select the opening tag (i.e. the one without a ‘/’) and click the Edit button (or use the keyboard shortcut). The editor window will appear, as shown in Figure 3.10 on page 16.

The drop-down menus can be used to change the feature. Attributes can be changed or removed. A note attribute can be added to mark some additional information, e.g. to mark instances of good qualitative examples or uncertainties in annotation that need to be checked later. In our example, we’ll pretend we are unsure about our annotation, tick the note checkbox and write CHECK as an attribute. When we click the Update button, the tag will contain our changes.

In the editor window, tags can be changed except for the base level. To change the base level, it is simpler to remove the entire tag and start afresh. Select the entire tagged span of text, including the opening and closing tags, and click the Remove button.

Guidelines for tagging

The tag set is split up into 5 main categories for describing non-standard features, each having an associated colour for highlighting in the corpus and in graphs (cf. Figures A.1 and A.2 in the appendix on page 26 for results derived from this classification):

• phonology (differences in spelling that reflect non-standard pronunciation)

• grammar (morphological and syntactic features)

• lexical (English lexemes with a non-standard meaning, lexical innovations)

• code (use of local languages other than English)

• spelling (differences in spelling that have no bearing on pronunciation, i.e. eye dialect)

Phonology. For phonological features, we distinguish three sub-groups: features affecting vowels, consonants, or entire syllables. The features that apply to all 3 groups are deletion and insertion. For vowels, the features diphthongisation and monophthongisation refer to a standard monophthong realised as a diphthong and a standard diphthong realised as a monophthong respectively, while the features diphthongShift and monophthongShift refer to standard diphthongs/monophthongs realised as a different diphthong/monophthong. As previously mentioned, SAMPA symbols should be used instead of IPA symbols. An overview of relevant sounds and their SAMPA symbols is given in Figure 3.11 on page 18.

Grammar. Grammatical features are also split into sub-groups: clause, NP, predicate, preposition, negation, and adverb. For the attributes that have to be entered manually, we use the standard abbreviations from the Leipzig Glossing Rules as often as possible (the rules can be found under Project shared/Literature/LGR.08.02.05.pdf). Some of the most commonly used abbreviations are listed below. Additions not part of the standard Leipzig Glossing Rules abbreviations are marked with a †.

Figure 3.11: Overview of relevant sounds and their SAMPA symbols

• NP:
  – Case:
    * SBJ (Subject case, e.g. I, we, he, she)
    * GEN (Genitive case, e.g. neighbour’s)
    * POSS (Possessive case, e.g. my, our, his, her)
    * OBJ (Object case, e.g. me, us, him, her)
    * DEM (Demonstrative, e.g. these, those)
  – Gender:
    * F (Feminine, e.g. she)
    * M (Masculine, e.g. he)
    * N (Neuter, e.g. it)
    * †N/A (All or none)
  – Number:
    * SG (Singular)
    * PL (Plural)

• Predicate:
  – Aspect:
    * PROG (Progressive)
    * PRF (Perfect)
    * †HAB (Habitual)
  – Finiteness:
    * †FIN (Finite)
    * †NFIN (Non-finite)
  – Person:
    * 1, 2, 3 (First, second, third person)
    * SG (Singular)
    * PL (Plural)
    * †BASE (Base form, e.g. think as opposed to thinks)
  – Tense:
    * PRS (Present)
    * PST (Past)
    * FUT (Future)
  – Voice:
    * †ACT (Active)
    * †MP (Mediopassive)
    * PASS (Passive)

• Clause:
  – Make description as general as possible, e.g.:
  – observed="S-A-V-O"
  – standard="S-V-O-A"
  – NB: see ‘Deletion’ below for a list of POS abbreviations

• Deletion:
  – ART (Article)
  – AUX (Auxiliary)
  – COP (Copula)
  – DET (Determiner)
  – †LEX (Lexical verb)
  – †NOUN
  – †PREP (Preposition)
  – †PRON (Pronoun)
  – †V (Verb)
  – Format:
    * POS_LEMMA_instance
    * POS_PERSONNUMBER(.GENDER)_instance
  – E.g.:
    * COP_BE_is
    * PRON_1SG_I
    * PRON_3SG.N_it
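The two deletion formats above are plain strings joined by underscores, so they are easy to assemble consistently. The following Python 3 sketch only illustrates the formats; the helper names person_number and deletion_gloss are hypothetical and not part of XmlCat or the project scripts.

```python
def person_number(person, number, gender=None):
    """Build the PERSONNUMBER(.GENDER) part, e.g. 3SG.N or 1SG."""
    label = f"{person}{number}"
    return f"{label}.{gender}" if gender else label


def deletion_gloss(pos, key, instance):
    """Join the part of speech, the lemma or person label, and the
    deleted word form with underscores."""
    return "_".join((pos, key, instance))


print(deletion_gloss("COP", "BE", "is"))
print(deletion_gloss("PRON", person_number(1, "SG"), "I"))
print(deletion_gloss("PRON", person_number(3, "SG", "N"), "it"))
```

The three calls reproduce the three examples from the list.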

Lexical & code. For the categories ‘code’ and ‘lexical’, the abbreviation HON is used for honorific terms of address, e.g. ‘mother’ used to address a lady older than oneself in certain cultures. The attribute would then be meaning="HON_mother".

For discourse markers, use the DM abbreviation. If you know the function of the discourse marker, write it after the abbreviation with an underscore in between, e.g. meaning="DM_disapproval", or use meaning="DM_CHECK" if you need to verify the discourse marker’s meaning.

Tricky cases. In certain instances, we observe not only lack of standard past tense marking, but also the missing 3rd person {-s}, e.g. in He just die. Rather than assume a two-step process, i.e. the shift from past to present and the additional lack of morphological marking, which would require two tags and would not be salient in persons other than 3SG, we recommend treating such cases as a general lack of tense marking (both past and present), and treat the observed verb form as a case of base form. Our example should therefore be tagged as <grammar feature="predicate_tense" observed="BASE_DIE_die" standard="PST_DIE_died">.

For deletion features, it is obviously not possible to select the deleted feature in the text. Instead, we select the preceding and following phonemes (for phonological features) or morphemes (for grammatical features). Should one of these need tagging as well, only the otherwise untagged one should be selected (tags can contain one another, but can’t cross boundaries; think of boxes of different sizes as a simple yet effective analogy).

If the observed feature retains the function of the standard feature (e.g. in case of a different past tense form rather than lack of past tense marking), the feature names and glosses will remain the same. For example, the past tense form tellt instead of told would be marked as <grammar feature="predicate_tense" observed="PST_TELL_tellt" standard="PST_TELL_told">.

eWAVE comparison

In order to compare the features observed in literary texts to the features of actual varieties, we draw on eWAVE, the Electronic World Atlas of Varieties of English, available at http://ewave-atlas.org. As eWAVE covers only grammatical features, this comparison is limited to the “grammar” category. In addition to “observed” and “standard” attributes, “grammar” tags also have an “ewave” attribute, in which the corresponding eWAVE feature number should be entered.

The feature numbers have to be looked up on the eWAVE website. Clicking on the “Features” tab in the green ribbon at the top of the website will lead to the list of features. You can find the corresponding feature by browsing through various groups in the “Area” column, e.g. “Pronouns”, “Noun Phrase”, “Tense & Aspect” etc., or by using the search field in the “Feature name” column. For example, looking up the feature number for copula deletion can be done by typing “copula” in the search field, which will reveal that features 176, 177 and 178 are possible options.

You may keep track of the features you have encountered or are likely to encounter by saving a list of features sorted by their frequency in your variety. You can achieve this by clicking the “Varieties” tab, then selecting your variety from the world map. Once there, you can sort features by the column “Value”. Clicking on the down-pointing triangle should place features ranked as “A - feature is pervasive or obligatory” first.

It should be noted that due to the high number of varieties covered in eWAVE, feature descriptions tend to be ‘variety-agnostic’ and sometimes difficult to interpret as a result. For example, “yesterday he walk” would correspond to feature 132 (“Zero past tense forms of regular verbs”), while there is no corresponding feature called “Zero past tense forms of irregular verbs”. Instead, a feature such as in “yesterday he run” would have to be marked with eWAVE number 129 (“Levelling of past tense/past participle verb forms: unmarked forms”).

In case a feature is not present in the eWAVE inventory, enter E (for “Extra”) as the “ewave” attribute.
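To make the shape of a complete “grammar” tag concrete, the following Python 3 sketch assembles an opening tag from its attributes and falls back to E when no eWAVE number is given. XmlCat inserts these tags for you; the function name grammar_open_tag and the attribute order are assumptions made for illustration.

```python
from xml.sax.saxutils import quoteattr


def grammar_open_tag(feature, observed, standard, ewave="E"):
    """Build an opening grammar tag; ewave defaults to E ("Extra")."""
    attributes = (
        ("feature", feature),
        ("observed", observed),
        ("standard", standard),
        ("ewave", str(ewave)),
    )
    rendered = " ".join(
        f"{name}={quoteattr(value)}" for name, value in attributes
    )
    return f"<grammar {rendered}>"


# "yesterday he walk": base form instead of past tense, eWAVE feature 132.
print(grammar_open_tag("predicate_tense", "BASE_WALK_walk",
                       "PST_WALK_walked", 132))
```

The glosses in the example combine the base-form convention from the tricky cases above with the eWAVE number given for “yesterday he walk”.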

See Figure A.3 in the appendix on page 27 for an example of a comparison with eWAVE.

3.2.5 Annotating meta-features

The following textual and meta-linguistic features are also annotated:

• Character information

• Narrator information

• Meta-textual marking of non-standard features

Character information

Identifying characters and their discourse is important for tracking character profiles. The bulk of character tags was placed automatically. However, character information needs to be completed for these tags. To do so, select the opening character tag and click the Edit button (NB: and not the Tag button), then complete the following information:

• Discourse: this has already been marked as direct automatically

• Medium: the options are speech, writing, and thought

• Name: the character’s name (or other description if no name is revealed); if the character also happens to be the text’s homodiegetic narrator, write Narrator-Character(Name)

Instances of free (in)direct discourse cannot be marked automatically. These spans of discourse should therefore be selected and tagged like a feature using the character category.

For instances where “quotation marks” do not mark character discourse, the character tags should be removed. The entire span, including the opening and closing character tags (but not the quotation marks surrounding them), should be selected. Clicking the Remove button will get rid of the character tags.

For examples of results obtained from assigning character tags, see Figures A.4, A.5 and A.6 in the appendix on pages 27 and 28.

Narrator information

Narrator information contains the following attributes:

• Identity: the narrator’s name or available description; use Unnamed in case of an intradiegetic narrator whose identity is not revealed, and NA in case of an extradiegetic narrator

• Level: intradiegetic if the narrator exists in the world depicted in the text, or extradiegetic if the narrator is outside the text’s sphere of existence

• Participation: homodiegetic if the narrator participates in the narrated events, otherwise heterodiegetic

• Person: 1st, 3rd etc.

For further details on the terms used, consult the excerpts from Rimmon-Kenan (1983) and Prince (2003) in Project shared/Literature/Narratology.

Some texts may have only one narrator tag surrounding the entire text. Others may change the narrator properties or switch between narrators.

Meta-linguistic features

The following attributes for special marking of non-standard features are used:

• italics (feature is marked in italics)

• quotes (feature is marked in quotation marks)

• translation (a translation is provided)

• explanation (an explanation is provided)

• footnote (the feature is marked with a footnote)

• language (a comment is made about the language, or which language is spoken)

For results derived from the annotation of meta-linguistic features, see Figure A.7 in the appendix on page 29.

3.2.6 Inter-rater Reliability Test

Because texts from different regions will be annotated by different people, any distinct patterns we observe may not be inherent to the texts themselves, but may instead result from annotators marking the data differently. To mitigate this effect as much as possible, we should harmonise the annotation process. For this reason, all of us should annotate a common sample text to check that the deviations are not too large.

In the shared folder, a folder named Inter-Rater Reliability contains the sample text, which is the first chapter of Huckleberry Finn stored in a file called TEST_twain_huckfinn-chapter1.xml. The text contains slightly more than 30 features that we should all annotate. They are marked with a <rating> tag, which appears as cyan highlights when the file is opened in a web browser.

The inter-rater reliability test consists of the annotation of the sample text by all members of the project and the subsequent comparison of the tagged versions via a Fleiss’ Kappa test. If the differences are too large, we will have to discuss the problematic cases and repeat the test until the results are good enough. To annotate the sample chapter, open it with XmlCat and tag it as described above. Save your version in the Tests folder under an XML file name containing your name. The only differences from the regular annotation process are that the <rating> tags have to be removed before placing the actual tags, and that you should not add any <character> tags to character passages.

The test is also an opportunity to familiarise yourself with the annotation method before handling your actual texts. Do not hesitate to ask questions if need be. Also, if you notice that certain features in your variety cannot be described using the tag set, do let me know so that we can discuss an update of the tag set (which should ideally be done before you start working on your texts, otherwise we will have to “retrofit” all previously annotated texts).
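To make the comparison step more concrete, here is a minimal sketch of how Fleiss' kappa can be computed from the annotators' decisions. It assumes that each marked feature has already been reduced to one category label per annotator; the function name `fleiss_kappa`, the input layout and the example labels are illustrative assumptions, not the project's actual comparison script.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of one label per rater.

    Every item must have been rated by the same number of raters (at least 2).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of rater pairs that agree on this item.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        agreement_sum += agreeing_pairs / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_items
    # Agreement expected by chance, from the overall category proportions.
    total_ratings = n_items * n_raters
    expected = sum((t / total_ratings) ** 2 for t in category_totals.values())
    if expected == 1.0:
        return 1.0  # only one category was ever used; agreement is trivial
    return (observed - expected) / (1 - expected)


# Made-up example: three annotators, four marked features.
sample = [
    ["grammar", "grammar", "grammar"],
    ["lexis", "lexis", "grammar"],
    ["spelling", "spelling", "spelling"],
    ["phonology", "spelling", "phonology"],
]
print(round(fleiss_kappa(sample), 3))
```

A value of 1 indicates complete agreement, and values at or below 0 indicate agreement no better than chance. What counts as "good enough" for the project is not fixed by this sketch.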


Appendix A

Select figures from the project


Figure A.1: Feature density

[Figure: one panel per region (Europe, Southeast Asia, West Africa). Y-axis: features normalised to 10,000 words (0 to 1000). Legend: code, grammar, lexis, phonology, spelling.]
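The normalisation used for the feature density figures can be sketched as follows. The function name `per_10k_words` and the example counts are hypothetical and serve only to show the arithmetic.

```python
def per_10k_words(feature_count, word_count):
    """Normalise a raw feature count to a rate per 10,000 words."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return feature_count * 10_000 / word_count


# Made-up example: 150 annotated features in a text of 60,000 words.
print(per_10k_words(150, 60_000))
```

Normalising in this way makes texts of different lengths comparable.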

Figure A.2: Proportional feature profiles

[Figure: one panel per region (Europe, Southeast Asia, West Africa). Y-axis: features in % (0 to 100). Legend: code, grammar, lexical, phonology, spelling.]


Figure A.3: Presence of A-rated eWAVE features in Singaporean texts

[Figure: presence of each feature per text. Texts: Everyday will be like Sunday, Gloria, Making Coffee, Poisson Ivy, The Adventures of Holden Heng. Features: as listed below.]

No number distinction in reflexives

Object pronoun drop

Subject pronoun drop: referential pronouns

Subject pronoun drop: dummy pronouns

Regularization of plural formation: phonological regularization

Plural marking generally optional: for nouns with human referents

Plural marking generally optional: for nouns with non-human referents

Use of zero article where StE has definite article

Use of zero article where StE has indefinite article

Ever as marker of experiential perfect

Perfect marker already

Finish-derived completive markers

Loosening of sequence of tenses rule

Go-based future markers

Zero past tense forms of regular verbs

Give passive: NP1 (patient) + give + NP2 (agent) + V

Invariant non-concord tags

Invariant tag can or not?

Variant forms of dummy subject there in existential clauses

Deletion of auxiliary be: before progressive

Deletion of copula be: before NPs

Deletion of copula be: before AdjPs

Deletion of copula be: before locatives

Deletion of auxiliary have

Postposed one as sole relativizer

Existentials with forms of get

No subordination; chaining construction linking two main verbs (motion and activity)

Omission of StE prepositions

Inverted word order in indirect questions

No inversion/no auxiliaries in wh-questions

No inversion/no auxiliaries in main clause yes/no questions

Presence of subject in imperatives

Figure A.4: Feature density of characters in select West African texts

[Figure: one panel per text (Half of a Yellow Sun, Lokotown, People of the City). X-axis: the characters of each text. Y-axis: feature density per 100 words (0 to 60).]


Figure A.5: Feature density vs. character importance in The Orange and the Green

[Figure: The Orange and the Green (Scotland), one labelled point per character. X-axis: Character Proportion in % (5 to 15). Y-axis: Character Feature Density in % (0 to 60). Spearman's rank correlation: rho = −0.6442, p = 0.0238 (*).]

Figure A.6: Character profiles in The Adventures of Holden Heng

[Figure: two panels, both titled The Adventures of Holden Heng (Singapore). First panel: one labelled point per character; X-axis: Character Proportion in % (0 to 30); Y-axis: Character Feature Density in % (0 to 20). Second panel: the same ten characters, numbered 1 to 10, on an axis from 0 to 100.]


Figure A.7: Meta features per region, normalised for 10,000 words

[Figure: regions Europe, Southeast Asia and West Africa. Categories: comment, explanation, footnote, italics, language, quotes, translation, TOTAL. X-axis: 0 to 15.]
