Resources for Using Corpus Linguistics in ELT Kenji Kitao Doshisha University Kyoto, Japan S. Kathleen Kitao Doshisha Women’s College Kyoto, Japan
Dec 22, 2015
I. Presentation
  A. Corpus linguistics and corpus-related resources
  B. Online resources for corpus linguistics
    1. Types of resources
    2. Examples of resources
  C. Using corpus-related resources for language teaching
II. Application
  A. Assigned tasks
  B. Free exploration
Presentation
Definitions
  Corpus (Latin for “body”)
    A text or collection of texts
    Now generally used to refer to machine-readable texts
  Corpus linguistics
    The use of empirical data from a corpus to study language usage and to find patterns of language usage by analyzing actual language use
Requirements
A corpus
  Can be a single text or a large collection of texts
  Larger corpora provide more reliable results if the purpose is to make generalizations about language use
  Balanced corpora
    A variety of genres, including academic writing, newspapers, fiction, and spoken language
  Specialized corpora
    Examples
      Academic writing
      Texts by learners of English, sometimes with a specific native language
    Teachers can develop their own corpora
      Newspaper articles
      Learners’ texts
Corpus analysis tool(s)
  Types
    Tools with specific corpora
    Tools that can be used with any text or collection of texts
  General
    Word, Excel, etc.
  Specialized
    Count words
    Find examples of specific words or parts of speech
    Analyze word frequencies
    Evaluate readability
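As an illustration of what a specialized tool does when it counts words and analyzes word frequencies, here is a minimal sketch in Python. It is not taken from any of the tools in this presentation; the function names and the simple tokenization rule (lowercased letters, with an optional internal apostrophe) are assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its word tokens (e.g., "don't" stays whole)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def word_frequencies(text):
    """Return a Counter mapping each word type to its number of occurrences."""
    return Counter(tokenize(text))

sample = "The cat sat on the mat. The mat was flat."
freqs = word_frequencies(sample)
print("tokens:", sum(freqs.values()))   # running words in the text
print("types:", len(freqs))             # different words in the text
print(freqs.most_common(3))             # most frequent words first
```

A teacher could run the same function on a reading passage to see which words students will meet most often.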
Online Corpora
  Free to all users
  Available for a fee or for purchase
  Available only to restricted users
In this presentation, we will introduce only resources that are free.
Using Corpus Linguistics for Language Teaching
  Technology has become widespread and accessible
  Larger, more powerful computers that can analyze large amounts of data quickly are available
  Many corpus-related resources have become available
  Language teachers and learners can use corpora
Corpus-related Internet resources
  1. General resources on corpus linguistics
  2. Vocabulary frequency lists and frequency level checkers
  3. Online corpora, concordancers and other text-analysis software
  4. E-texts
  5. Information about using corpus linguistics for language teaching
Resources for Corpus Linguistics
http://www.cis.doshisha.ac.jp/kkitao/library/resource/corpus/corpus.htm
1. General resources on corpus linguistics
  Web sites that help orient users to corpora and to what is available online for teachers to use in the classroom or in preparing materials
The Compleat Lexical Tutor
http://www.lextutor.ca/
  Resources for data-driven learning, including concordancers for various corpora and for texts that users enter themselves
  Tutorials, resources for teachers, resources for research
Bookmarks for Corpus Linguists
http://devoted.to/corpora/
  Extensive annotated list of links related to corpus linguistics, including
    software tools
    frequency lists
    papers and articles
    English and non-English corpora
2. Vocabulary frequency lists, frequency level checkers, and n-gram extractors
Frequency lists
  Words used most frequently in English and thus words that are most useful for students to know
  Often divided into sublists
Specialized word lists
  Academic Word List
  http://www.nottingham.ac.uk/~alzsh3/acvocab/index.htm
    List includes 570 headwords with their word families
    Site includes an explanation of the word lists, the words in each sublist, suggestions for using the list, and a gapmaker that can be used to produce gap-filling exercises
  5000 Vocabulary List for Visiting Scholars in the USA
  http://www.paulnoll.com/Books/5000-Words/index.html
    This is a list of the 5000 words determined by the Chinese Academy of Sciences for scholars who need to go abroad for research or advanced studies in the USA. They are listed in alphabetical order and have sample sentences and examples. There are an additional three thousand words.
Frequency-level checkers
  Produce a list of words at each level of difficulty
  Help a teacher understand how difficult the vocabulary in a reading passage is and which words students at different levels of proficiency might need to learn
N-gram finders
  Find groups of n words
JACET 8000 Word List
http://www01.tcp-ip.or.jp/~shin/j8web/j8web.cgi
  On this web page, you can enter a text and get a list of the words that appear in the text at each of the eight levels of the JACET list. You also get statistics about what percentage of the words (both types and tokens) occur at each of the eight levels.
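The idea behind a frequency-level checker can be sketched in a few lines of Python. This is only an illustration, not the JACET tool's actual method: the two tiny word lists in LEVELS are hypothetical stand-ins for real level lists, and the function name and output format are assumptions.

```python
import re
from collections import Counter

# Hypothetical miniature word lists; a real checker would load the full
# published list for each level.
LEVELS = {
    1: {"the", "a", "is", "in", "of", "and", "students", "read"},
    2: {"library", "article"},
}

def level_profile(text, levels=LEVELS):
    """Return {level: (percent of tokens, percent of types, words found)}.

    Words that are in none of the lists are grouped under "off-list".
    """
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(tokens)

    def level_of(word):
        for level, words in levels.items():
            if word in words:
                return level
        return "off-list"

    by_level = {}
    for word, n in counts.items():
        by_level.setdefault(level_of(word), []).append((word, n))

    profile = {}
    for level, items in by_level.items():
        token_pct = 100 * sum(n for _, n in items) / len(tokens)
        type_pct = 100 * len(items) / len(counts)
        profile[level] = (token_pct, type_pct, sorted(w for w, _ in items))
    return profile

text = "The students read the article in the library and the corpus."
for level, (token_pct, type_pct, words) in level_profile(text).items():
    print(level, f"{token_pct:.1f}% of tokens", f"{type_pct:.1f}% of types", words)
```

Words reported as "off-list" are the ones a teacher might check first when judging whether a passage is too difficult.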
N-gram finders
  Online text analysis tool
  http://www.online-utility.org/text/analyzer.jsp
    Finds the most frequent groups of 2 and 3 words, and produces a list of all the words, their occurrences, and their percentages
  Advanced Search – Explore N-grams from the BNC
  http://pie.usna.edu/explore.html
    Produces lists of n-grams, based on the number of words and occurrences you specify
  N-gram phrase extractor
  http://www.er.uqam.ca/nobel/r21270/cgi-bin/tuples/u_extract.html
    Produces a KWIC list of n-grams
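What an n-gram finder does can be shown with a minimal Python sketch. The function name and tokenization are assumptions for the example, and, as a simplification, n-grams are allowed to run across sentence boundaries; the online tools listed here may treat this differently.

```python
import re
from collections import Counter

def ngrams(text, n=2):
    """Count every sequence of n consecutive words in the text."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )

text = "On the other hand, on the one hand"
print(ngrams(text, 2).most_common(3))   # most frequent 2-word groups
print(ngrams(text, 3).most_common(3))   # most frequent 3-word groups
```

Setting a minimum number of occurrences, as some of the tools allow, would amount to keeping only the entries whose count is at or above that number.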
3. Online corpora, concordancers, and other text-analysis software
Concordancers
  A type of software for searching corpora
  Produces a list of key words in context (KWIC), that is, search terms with the words that come before and after them
  May be able to search for parts of speech, e.g., take, followed by a preposition
  May be able to search for two words that are not next to each other
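A key-word-in-context list is easy to picture with a small example. The Python sketch below is a minimal illustration, not the method of any particular concordancer; the function name, the bracket formatting, and the fixed character width for the context are assumptions.

```python
import re

def kwic(text, keyword, width=30):
    """Return one line per occurrence of keyword (whole word, any case),
    with `width` characters of context on each side and the key word aligned."""
    flat = " ".join(text.split())   # collapse line breaks so context reads cleanly
    pattern = r"\b" + re.escape(keyword) + r"\b"
    lines = []
    for m in re.finditer(pattern, flat, re.IGNORECASE):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        lines.append(f"{left:>{width}} [{m.group()}] {right}")
    return lines

text = "It began to rain. The rain fell all day, and heavy rain is expected."
for line in kwic(text, "rain", width=20):
    print(line)
```

Because the key word is aligned in one column, the words that come immediately before and after it are easy to scan.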
Corpora (or parts of corpora) may have spoken language, written language, American English, British English, academic English, and so on.
Specialized corpora include:
  parallel corpora, which have the same texts in different languages (to compare the same passages in different languages)
  learner corpora, which have students’ writing or speaking (to help identify learners’ problems or to study characteristics of their writing)
Examples of concordancers
Turbo Lingo
http://www.staff.amu.edu.pl/~sipkadan/lingo.htm
  Can enter a text or URL and get a KWIC list, average sentence length, word frequency list, and other analyses
VIEW (Variation in English Words and Phrases)
http://view.byu.edu/
  Concordancing tool for the British National Corpus, the Corpus of Contemporary American English, and a Time magazine corpus, plus non-English corpora
  A powerful concordancing tool
  Has a useful tutorial
    Click on what you want to do to see samples of searches
    For example, if you want to learn to use wildcards, click on that word, and you will see several examples. You choose the type of search you want to do, and the search is automatically filled in. You can revise it based on what you want to do.
Types of searches
  Search by exact word, exact phrase, wildcard, or part of speech
    For example, mysterious
  Use ? or * as a wildcard
    For example, * point *
  Search for an exact word plus a part of speech
    For example, white [n*]
  Compare usage of semantically related words
    {sheer/total} [n*]
  Search for surrounding words
    Nouns that follow the verb “wrap”
  Limit the search to one register
    Adjectives in tabloid newspapers
  Compare usage between registers, e.g., news and speaking
    we [verb] that: ACAD vs SPOKEN
  Find words with similar, more general, or more specific meanings
    Similar words to “small”
    More general than “shriek”
    More specific than “woman”
BNCweb
  To log in, go to: http://bncweb.lancs.ac.uk/bncwebSignup/
  For information, go to: http://bncweb.info
On BNCweb, you can do simple searches, and you can restrict your search to written or spoken texts or to a particular type of text.
Form your own subcorpora.
Make frequency lists based on criteria you specify
For example, make a frequency list of all adverbs that end in –ly in spoken texts.
Look at your query history and save queries to use again.
See your results in a sentence view or a KWIC view.
Get a list of collocates, with statistics about their frequency.
Get information about what type of texts the search term was found in.
Online concordancer
http://www.lextutor.ca/concordancers/concord_e.html
  Can search a variety of corpora, including the Brown Corpus, the British National Corpus (written and spoken), a learner corpus, etc.
  Produces a KWIC list for a given word and a list of collocates and their frequency
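A list of collocates is, at its simplest, a count of the words that appear near the search word. The Python sketch below illustrates that idea under stated assumptions: the function name and the window size (`span`) are made up for the example, and it reports raw counts only, not the statistics that corpus tools calculate.

```python
import re
from collections import Counter

def collocates(text, node, span=2):
    """Count the words found within `span` words to the left or right of `node`."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    node = node.lower()
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

text = "Heavy rain fell. Light rain fell later. Heavy rain again."
print(collocates(text, "rain", span=1).most_common())
```

With a large corpus, the most frequent items in such a list suggest which adjectives or verbs typically go with the search word.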
WebCorp
http://www.webcorp.org.uk/
  Uses the Internet as a corpus and produces KWIC lists as well as providing other information
Comparing two texts
Text Lex Compare
http://www.lextutor.ca/text_lex_compare/
  Allows users to enter two texts and get lists of:
    Words unique to the first text
    Words shared by the two texts
    Words unique to the second text
  Useful for helping teachers find the new words in a new text
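The three lists that a text comparison produces correspond to simple set operations. The Python sketch below is an illustration of that idea, not the Text Lex Compare implementation; the function names and the dictionary keys are assumptions.

```python
import re

def word_types(text):
    """Return the set of different words (types) in a text, lowercased."""
    return set(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))

def compare_texts(first, second):
    """Return the words unique to each text and the words they share."""
    a, b = word_types(first), word_types(second)
    return {
        "unique_to_first": sorted(a - b),
        "shared": sorted(a & b),
        "unique_to_second": sorted(b - a),
    }

old_text = "The cat sat."
new_text = "The dog sat down."
print(compare_texts(old_text, new_text))
```

If the first text is everything the class has already read, the words unique to the second text are candidates for pre-teaching.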
Specialized corpora (a few examples)
  Spoken English
    Corpus swb (American English telephone conversations)
    http://www.ldc.upenn.edu/cgi-bin/lol/swb/speechcorpus?&corpus=swb
  Technical English
    e-Xplore Technical English
    https://learn.sz.htwk-leipzig.de/wc/main.php
  Parallel corpora
    CRATER Multilingual Aligned Annotated Corpus
    http://www.comp.lancs.ac.uk/linguistics/crater/corpus.html
  Academic English
    Michigan Corpus of Academic Spoken English
    http://quod.lib.umich.edu/m/micase/
    Some large corpora also have sub-corpora of academic English
Online software to assess readability
Tests of document readability and suggestions on how to improve readability
http://www.online-utility.org/english/readability_test_and_improve.jsp
  Can analyze texts of any length (some online text analysis programs have limits)
  Can enter the text directly or enter a URL, e.g., http://www.cis.doshisha.ac.jp/kkitao/Japan/shimoda/s1.htm
  Provides statistics:
    Number of characters
    Number of words
    Number of sentences
    Number of syllables/word
    Number of words/sentence
  Calculates readability indexes, including
    Gunning Fog Index
    Coleman-Liau Index
    Flesch-Kincaid Grade Level
    Flesch Reading Ease
  Lists sentences that might be rewritten to improve readability.
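To show how two of these indexes are built from the statistics above, here is a minimal Python sketch using the standard Flesch Reading Ease and Flesch-Kincaid Grade Level formulas. The syllable counter is a rough heuristic assumed for the example (it counts vowel groups), so the scores will not exactly match those of the online tool, whose counting method is not described here.

```python
import re

def count_syllables(word):
    """Rough estimate: count vowel groups, ignoring a final silent e."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def readability(text):
    """Return basic statistics and two Flesch indexes for a non-empty text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)    # words per sentence
    spw = syllables / len(words)         # syllables per word
    return {
        "words": len(words),
        "sentences": len(sentences),
        "words_per_sentence": wps,
        "syllables_per_word": spw,
        "flesch_reading_ease": 206.835 - 1.015 * wps - 84.6 * spw,
        "flesch_kincaid_grade": 0.39 * wps + 11.8 * spw - 15.59,
    }

for name, value in readability("The cat sat on the mat. It was happy.").items():
    print(name, round(value, 2))
```

Both indexes rise or fall with the same two measures, sentence length and word length in syllables, which is why shortening long sentences is a common suggestion for improving readability.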
4. E-texts
  In some cases, teachers or students may want to develop their own corpora. There are large numbers of e-texts available.
Project Gutenberg
http://www.gutenberg.org/wiki/Main_Page
  Large collection of downloadable fiction and non-fiction
Internet Public Library: Online Texts
http://www.ipl.org/div/subject/browse/hum60.60.00/
  A large number of online texts on a wide variety of subjects
Drew’s Script-o-Rama
http://www.script-o-rama.com/oldindex.shtml
  A website with a large number of scripts of movies and TV programs
American Rhetoric Online Speech Bank
http://www.americanrhetoric.com/speechbank.htm
  A website with a large collection of speeches
5. Information about using corpus linguistics for language teaching
  Corpus-related websites specifically for language teachers
Learner corpora and SLA Research
http://leo.meikai.ac.jp/%7Etono/
  Links to learner corpora made up of language produced by speakers of various languages, links to useful tools, a bibliography, and so on
Corpus linguistics: What it is and how it can be applied to teaching
http://iteslj.org/Articles/Krieger-Corpus.html
  An article about corpus linguistics and how it can be used in the language classroom
Classroom Application
Two types of uses of corpus-related resources
  “Low contact” uses: the teacher uses resources to help in teaching, e.g., to find the difficult words in a reading passage; students do not actually see the corpus
  “High contact” uses: students use the corpora themselves to learn about language, e.g., to find out which adjectives collocate with “rain”
“Data-driven learning” is a high contact use of corpus-related resources.
  Using corpora to deduce rules of grammar or usage, e.g., to determine if a word’s connotation is positive or negative
Advantages of data-driven learning
  Focus on authentic language
  Encouragement of students to deduce
  Real, exploratory activities rather than drills
  A learner-centered activity
Web sites with suggestions for data-driven learning activities
  How to use concordances in teaching English: Some suggestions
  http://www.nsknet.or.jp/%7Epeterr-s/concordancing/usingconcs.html
  Data-Driven Learning (DDL): the idea
  http://www.ecml.at/projects/voll/rationale_and_help/booklets/resources/menu_booklet_ddl.htm
    An explanation of DDL, with examples
Activities
  Use a corpus to check grammar
    http://www.lextutor.ca/grammar_tester/
    Use the concordancer in the bottom frame to check the grammar of the sample sentences in the top half
  Use a concordancer to make a gap-filler or a quiz
    http://www.lextutor.ca/multi_conc/
    http://www.nottingham.ac.uk/~alzsh3/acvocab/awlgapmaker.htm
  Find examples of a word and group them according to meaning
    http://www.lextutor.ca/concordancers/concord_e.html
    Examples: party, run
  Use the results of a KWIC search to determine how synonyms are used differently
    http://www.lextutor.ca/concordancers/concord_e.html
    Examples:
      travel, journey, trip, voyage, tour
      confident, fearless, pushy, upbeat, self-reliant
  Enter a text on the Academic Word List web page to make a gap-filling activity
    http://www.nottingham.ac.uk/~alzsh3/acvocab/awlgapmaker.htm
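For teachers who want to see how a gap-filling activity can be generated, the following Python sketch replaces chosen words in a text with blanks and keeps an answer key. It is an illustration only, not the gapmaker's actual method: the function name and the short target list are hypothetical, and it matches exact word forms rather than whole word families.

```python
import re

def make_gap_fill(text, target_words, blank="______"):
    """Replace each whole-word occurrence of a target word with a blank.

    Returns the gapped text and the removed words in order (the answer key).
    """
    targets = {w.lower() for w in target_words}
    answers = []

    def replace(match):
        word = match.group()
        if word.lower() in targets:
            answers.append(word)
            return blank
        return word

    gapped = re.sub(r"[A-Za-z]+(?:'[A-Za-z]+)?", replace, text)
    return gapped, answers

passage = "Researchers analyze data. The analysis is significant."
gapped, answers = make_gap_fill(passage, ["analyze", "data", "significant"])
print(gapped)
print(answers)
```

The target list could come from a word list such as the Academic Word List, so that the blanks fall on the vocabulary the class is studying.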
Resources for Corpus Linguistics
http://www.cis.doshisha.ac.jp/kkitao/library/resource/corpus/corpus.htm
Thank you