RESEARCH DESIGN & CORPUS COMPILATION
Jan 17, 2016
• Corpus design is an intrinsic and fundamental part of the analysis.
• It is guided by the RQ and affects the results.
• Design criteria are interpretative and must be explicit (why you chose the texts you did, how and why you organised them in the way you did)
• Different purposes = different corpora.
Corpus design
• What?
• Which?
• When?
• Where?
• How?
• Why?
What
• Choosing discourse type(s)
  • there are epistemological considerations
  • and there are practical considerations
• General vs. topical
  • epistemological considerations
  • practical considerations
Which
• Choosing variables
• You need to hold at least one variable constant, or your corpora are not really comparable
• e.g. same time period, different newspapers
• or same kind of newspaper, different time periods
Comparison
• You are comparing and looking for patterns
• One occurrence of anything is not enough; a pattern is:
• a) a figure that emerges from a homogeneous background by means of differentiation, and
• b) the accumulation of similar things
• c) recurring regularities of form
Comparative analysis
• Looking at:
• DIFFERENCE
• SIMILARITY
• across corpora
• within corpora
Parameters across corpora
• mode (written vs. spoken)
• discourse type (e.g. factual vs. fiction)
• time (diachronic studies)
• variety (e.g. British English vs. American English)
• geography (e.g. national vs. local newspapers)
• political tendency (e.g. Democrats’ vs. Republicans’ speeches)
• individual (e.g. George Eliot vs. Thomas Hardy)
• ...
Parameters within corpora
• sub-corpora
• (e.g. headlines vs. articles; news vs. comment)
• Specific lexical items
• (e.g. moral vs. ethic; boy vs. girl; immigrant vs. asylum seeker vs. refugee ...)
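Comparing specific lexical items across sub-corpora of different sizes usually means comparing relative rather than raw frequencies. The following is a minimal sketch, assuming a very simple tokenizer and two tiny invented sub-corpora (`headlines` and `articles` are hypothetical toy data, and the function names are illustrative only):

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def per_million(texts, items):
    """Return the frequency of each item per million tokens in a list of texts."""
    counts = Counter()
    total = 0
    for text in texts:
        tokens = tokenize(text)
        total += len(tokens)
        counts.update(t for t in tokens if t in items)
    if total == 0:
        return {item: 0.0 for item in items}
    return {item: counts[item] * 1_000_000 / total for item in items}

# Hypothetical toy data standing in for two sub-corpora.
headlines = ["Refugee crisis deepens", "Immigrant numbers rise"]
articles = ["The refugee said the camp was full.", "A refugee and an immigrant spoke."]

items = {"immigrant", "refugee"}
freq_headlines = per_million(headlines, items)
freq_articles = per_million(articles, items)
```

Normalising per million tokens makes the two sub-corpora comparable even when one is much larger than the other. Multi-word items such as "asylum seeker" would need a different matching step, which this sketch does not cover.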
Collections of texts – not one text
• Integral output of a source-unit
• (e.g. a whole edition of a newspaper)
• The corpus of works by one author (not a single text)
Topic based corpus
• Search-term(s) based collection
• You gather texts by searching a database for all the texts containing the search-term(s)
• identifying the list of search items to ensure the coverage of the topic is as complete as possible.
Time based
• Historical linguistics
• diachronic change/stability of language
• modern diachronic analysis
• See the MD-CADS edition of Corpora for examples (Partington 2010)
Research questions
• All the choices we make in the corpus design and data collection phase
• (e.g. what to collect, how to collect it, from which platform, in which format, etc.)
• all depend on the RQ!
Practical considerations
• availability
• access
• collection
• speed
• storage
• format
RQ example 1
• 1. How are Muslims represented in the British press?
• What are the appropriate search terms?
• muslim*, moslem*, islam* ...?
• Consider synonyms and near-synonyms, alternative spellings etc.
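Search terms with a trailing wildcard, such as those listed for RQ example 1, can be tried out on a few texts before querying a database. This is a minimal sketch, assuming that `*` means "any further word characters"; the function names and the sample sentences are hypothetical, and a real newspaper database may interpret wildcards differently:

```python
import re

def compile_search_terms(terms):
    """Turn search terms with a trailing * wildcard into one case-insensitive regex."""
    patterns = []
    for term in terms:
        if term.endswith("*"):
            # Assumption: * stands for zero or more further word characters.
            patterns.append(re.escape(term[:-1]) + r"\w*")
        else:
            patterns.append(re.escape(term))
    return re.compile(r"\b(?:" + "|".join(patterns) + r")\b", re.IGNORECASE)

def matching_texts(texts, terms):
    """Keep only the texts that contain at least one of the search terms."""
    regex = compile_search_terms(terms)
    return [t for t in texts if regex.search(t)]

# Hypothetical sample sentences, invented for illustration.
sample = [
    "Muslims in Britain celebrate Eid.",
    "The Moslem community was consulted.",
    "Islamic finance is growing.",
    "The budget was announced today.",
]
selected = matching_texts(sample, ["muslim*", "moslem*", "islam*"])
```

Wildcards can over-match: under this interpretation `islam*` would also retrieve a text mentioning Islamabad, so it is worth checking a sample of the hits before fixing the list of search terms.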
RQ example 2
• 2. How is religion represented in the British press?
• How many terms do I need to add?
• How many terms can I add?
RQ example 3
• 3. How much attention does the British press give to religion?
• A search-term based corpus will not tell you.
• How will you find out?
• How will you delimit the work? (by limiting and defining the RQ a bit more, e.g. by defining a time period or the type of newspapers under consideration)
Storage
• FOLDERS
• folders and file names (a repository of information, a sort of level 0 of mark-up)
• FILES become our definition for what is a text
• unit of analysis
Best practice
• Distribute information between FOLDER and FILE according to the structure of your corpus (and to your RQ)
• Avoid having more than 2 or 3 levels of folders
• Keep names short but dense with information
• Example 1: Do newspapers use the same language at a distance of 20 years?
• Which among British broadsheets has changed the most?
Storage for example 1
• CORPUS
  • year 1
    • Newspaper1
      • y1_n1_f1 y1_n1_f2 y1_n1_f3 y1_n1_f4 ...
    • N2
    • N3
  • year 2
    • N3
    • N1
    • N2
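File names such as y1_n1_f1 carry the year, the newspaper and the file number, so they can be read back as metadata at analysis time. This is a minimal sketch, assuming a `.txt` extension and folder names like `year1/N1`, neither of which is specified in the slides; `parse_name` is a hypothetical helper:

```python
import re
from pathlib import PurePosixPath

# Pattern for names of the form y<year>_n<newspaper>_f<file>, e.g. y1_n1_f1.
NAME_PATTERN = re.compile(r"^y(?P<year>\d+)_n(?P<newspaper>\d+)_f(?P<file>\d+)$")

def parse_name(path):
    """Read year, newspaper and file number from a name such as y1_n1_f1.txt."""
    stem = PurePosixPath(path).stem  # file name without folders or extension
    match = NAME_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"unexpected file name: {path}")
    return {key: int(value) for key, value in match.groupdict().items()}

# Hypothetical paths; the folder names and extension are assumptions.
paths = ["CORPUS/year1/N1/y1_n1_f1.txt", "CORPUS/year2/N3/y2_n3_f4.txt"]
metadata = [parse_name(p) for p in paths]
```

Because the file name repeats the information held in the folder structure, a file that is moved or copied out of its folder can still be identified.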
• Example 2: How are science and religion represented in political discourse?
• Solution 1: Science corpus, Religion corpus
• Solution 2: Democrat corpus, Republican corpus
How much?
• The bigger the better
• BUT
• also the size depends on the purpose!
• I ask for a minimum of 100,000 words
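A quick way to check a corpus against the 100,000-word minimum is to sum a rough word count over all its files. This is a minimal sketch, assuming plain-text `.txt` files in UTF-8 and counting whitespace-separated tokens, which will differ slightly from the count given by a concordancer; the function names are illustrative only:

```python
from pathlib import Path

MINIMUM_WORDS = 100_000  # the minimum size asked for in this course

def count_words(text):
    """Count whitespace-separated tokens, a rough stand-in for a word count."""
    return len(text.split())

def corpus_size(folder, pattern="*.txt"):
    """Sum the word counts of all matching files under folder (searched recursively)."""
    return sum(
        count_words(path.read_text(encoding="utf-8"))
        for path in Path(folder).rglob(pattern)
    )

def meets_minimum(n_words, minimum=MINIMUM_WORDS):
    """Report whether a word count reaches the required minimum."""
    return n_words >= minimum
```

For example, `meets_minimum(corpus_size("CORPUS"))` would report whether a folder named CORPUS (a hypothetical path) is large enough.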
• The transformation of texts into textual resources is a process of interpretation and therefore compilers have the responsibility typically associated with an editor.
• The questions we ask (and those we do not ask) affect the answers we can get, so it is important to keep track of our expectations and choices and the reasons behind them.
Epistemological reflexivity: you need to ask yourself
• How has the research question defined and limited what can be ‘found’?
• How has the design of the study and the method of analysis ‘constructed’ the data and the findings?
• How could the research question have been investigated differently?
• To what extent would this have given rise to a different understanding of the phenomenon under investigation?
Reflexivity is an unavoidable aspect of research:
• epistemological reflexivity encourages us to reflect upon the assumptions (about the world, about knowledge) that we have made in the course of the research,
• and it helps us to think about the implications of such assumptions for the research and its findings (Nightingale and Cromby, 1999: 228).
• Principles of accountability
• Replicability
• These principles are important in research, and you need to learn to ask yourself how your research follows them
• We will be looking at all these issues again and in more detail
Exam
• The exam includes:
• a first draft, consisting of the abstract and corpus description, presented to the group
• a final draft, consisting of the abstract, a copy of your corpus and its description, sent to me
• a presentation on the day of the exam
• Don’t forget that you need proof of B2 competence in English to be able to register for the exam.