Introduction to CHILDES and TalkBank Brian MacWhinney CMU - Psychology, Modern Languages, Language Technologies Institute
Mar 31, 2015
The goal of TalkBank
The core idea
Human communication is a single unified process.
However, patterns in communication are analyzed by 20 different fields.
The time scales of these processes vary from milliseconds to centuries.
But all of these processes must have their ultimate effect in the Moment.
We can capture the Moment on video.
Principles
Data-sharing, Informed Consent
Multimedia
Open Access, Web Access, Commentary
Specified Format
Interoperability
Community integration
Availability
http://childes.psy.cmu.edu
http://talkbank.org
programs, manuals, fonts, morphologies, CA conventions, video production guides, XML Schema, links to other programs
data can be either downloaded or played back over the web
Current target areas
1. CHILDES
2. PhonBank
3. BilingualBank
4. AphasiaBank
5. CABank
6. ClassBank
CHILDES
Child Language Data Exchange System
Founded in 1984 in Concord MA
Director: Brian MacWhinney [email protected]
Programmers: Leonid Spektor, Franklin Chen
3000 Members
130 corpora
Over 3200 published articles
CHILDES and TalkBank
              CHILDES       TalkBank
Age           23 years      7 years
Words         44 million    8 + 55 million
Media         750 GB        450 GB
Languages     32            18
Publications  3200+         89
Users         3000+         500
Practical Considerations
Learning CLAN takes about a week
Transcription is slow, perhaps a 15:1 ratio of transcription time to recording time. Blitzscribe, LENA, etc. probably will not work
Currently available data may not be perfect for a given issue
Corpora may need enhancement through MOR or the Coder's Editor
Tools from the Web
Data: childes.psy.cmu.edu/data
CLAN: childes.psy.cmu.edu/clan
Manuals: childes.psy.cmu.edu/manuals
Morphosyntax: childes.psy.cmu.edu/morgrams
Phon: childes.psy.cmu.edu/phon
Tutorial videos: talkbank.org/training
Digital video: talkbank.org/dv
CA Methods: talkbank.org/CABank
Why no handout?
“Overviews” link has this PPT presentation
CHILDES is now fully electronic. No more paper.
Available Methods
Microanalysis - CA, phonetics, ethology
Microgenetic analysis - CA, code-switching (NEXT)
Group and treatment comparisons - Genesee
Error analysis - YipMatthews
Diffusion analysis - in preschools
Longitudinal studies - growth curves
Modeling - neural nets, dynamic systems, evolutionary models
CLAN Tools
Transcribing
Editing
Counts -- FREQ, KWAL
Analyses: MOR, GRASP, PHON
• Interoperability -- ELAN, Praat, SFS, EXMARaLDA, CLAPI, PHON
CA marks in Unicode
Transcripts linked to media
Ground Rules
• Ethical use, informed consent
• Levels of permission
• Respect for dignity of participants
• Respect for contributors
• Requirement to cite sources
• Requirement to contribute data
Info-CHILDES and Membership
• Archived at LinguistList
• Info-CHIBolts for nuts and bolts
• Membership list
• IASCL Membership
Getting Set Up
• Download CLAN from Programs link
Windows issues
• You can work in c:\childes
• But your administrator may have this locked, so you may need shortcuts.
• Windows IPA is difficult.
• Windows compression may produce .wmf
Downloading Manuals
CHAT, CLAN
Getting Started
• Open CLAN Manual to Chapter 2
• Double-click application
• Control-D to open Commands Window
• Set Working Directory to c:\childes\clan\lib\samples
Should look like this:
On Windows, the path will be c:\childes\clan\lib\samples
Run FREQ
• freq sample.cha
• Hit RUN or carriage return
• In output, does “want” occur 3 times?
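To make concrete what a FREQ-style count does, here is a minimal Python sketch. It is not CLAN's implementation: it assumes only that main speaker tiers start with `*` and that headers and dependent tiers start with `@` or `%`. The transcript is an invented toy example, not the contents of sample.cha, and the function name `freq` is hypothetical.

```python
import re
from collections import Counter


def freq(chat_text):
    """Count word tokens on main speaker tiers (lines starting with '*')."""
    counts = Counter()
    for line in chat_text.splitlines():
        if not line.startswith("*"):
            continue  # skip @ headers and % dependent tiers
        # Drop the speaker code (for example "*CHI:") and keep the utterance.
        _, _, utterance = line.partition(":")
        for token in re.findall(r"[a-z']+", utterance.lower()):
            counts[token] += 1
    return counts


# Invented toy transcript (assumption: not taken from any CHILDES corpus).
toy = """@Begin
*CHI:\tmore cookie .
*MOT:\tyou like the cookie ?
%com:\tcookie is on the table
*CHI:\tmore .
@End"""

counts = freq(toy)
for word, n in sorted(counts.items()):
    print(n, word)
```

The real FREQ command has many options for including and excluding tiers and word forms; the CLAN manual is the authority on its behavior.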
Interface Features
• Help
• CLAN
• Files In
• Recall
• Set MOR, Lib, Output directories
Files In
Building Commands
• mlu +t*CHI +f sample.cha
• mlu *.cha
• Wildcards
• File output
• *.cha
Changing Directories
• Set Working to: ne32
combo +t*MOT +s"is^*ing" *.cha
• Set Working to: samples
kwal +sbunny +w2 -w2 0042.cha
• Triple click on output line to go back to source file
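The two searches above can be approximated in Python as a rough illustration. This is a sketch under stated assumptions, not how COMBO or KWAL work internally: the window here is counted in transcript lines, the pattern "is followed directly by a word ending in -ing" is rendered as a regular expression, and the function names and toy transcript are invented.

```python
import re


def kwal(lines, keyword, before=2, after=2):
    """Return (index, context_lines) for each main-tier line containing keyword."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = []
    for i, line in enumerate(lines):
        if line.startswith("*") and pattern.search(line):
            start = max(0, i - before)
            hits.append((i, lines[start:i + after + 1]))
    return hits


def is_then_ing(lines, speaker="MOT"):
    """Find the speaker's utterances where 'is' is directly followed by an -ing word."""
    pattern = re.compile(r"\bis\s+\w*ing\b", re.IGNORECASE)
    prefix = "*" + speaker + ":"
    return [ln for ln in lines if ln.startswith(prefix) and pattern.search(ln)]


# Invented toy transcript (assumption: not from the ne32 or samples folders).
toy = [
    "*MOT:\tlook at the bunny .",
    "*CHI:\tbunny !",
    "*MOT:\tthe bunny is hopping .",
    "*CHI:\thop hop .",
    "*MOT:\tis it sleeping ?",
]

hits = kwal(toy, "bunny")
matches = is_then_ing(toy)
print(len(hits), "keyword hits;", len(matches), "is + -ing matches")
```

Only the third line matches the second search, because in the last line another word intervenes between "is" and the -ing form.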
GEM
• Set Working to: Workshop
• GEM +s* pau001.cha
• Open output, play audio
Exercises - Chapter 8
• MLU50 – mlu +t*CHI +z50u +f *.cha
• MLU5 – maxwd +t*CHI +g1 +c5 +dl 68.cha | mlu > 68.ml5.cex
• TTR – freq +t*CHI +s"*-%%" +f *.cha
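The three measures in these exercises can be sketched in Python to show the arithmetic behind them. This is a simplification: it counts whitespace-separated words, whereas CLAN's MLU is normally computed over morphemes and applies its own exclusion rules. All function names and the toy transcript are invented for illustration.

```python
def words(utterance):
    """Split an utterance into lowercase word tokens, ignoring bare punctuation."""
    return [t.lower() for t in utterance.split() if any(c.isalpha() for c in t)]


def speaker_utterances(chat_text, speaker="CHI"):
    """Collect the utterances on one speaker's main tier."""
    prefix = "*" + speaker + ":"
    return [ln[len(prefix):].strip()
            for ln in chat_text.splitlines() if ln.startswith(prefix)]


def mlu_words(utterances, limit=None):
    """Mean utterance length in words over the first `limit` utterances."""
    selected = utterances[:limit] if limit else utterances
    lengths = [len(words(u)) for u in selected if words(u)]
    return sum(lengths) / len(lengths) if lengths else 0.0


def mlu_longest(utterances, n=5):
    """Mean length in words of the n longest utterances."""
    lengths = sorted((len(words(u)) for u in utterances), reverse=True)[:n]
    return sum(lengths) / len(lengths) if lengths else 0.0


def ttr(utterances):
    """Type-token ratio: distinct words divided by total words."""
    tokens = [w for u in utterances for w in words(u)]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Invented toy transcript (assumption: not from any CHILDES corpus).
toy = """*CHI:\tmore juice .
*MOT:\tyou want more juice ?
*CHI:\twant more juice .
*CHI:\tjuice ."""

utts = speaker_utterances(toy)
print(mlu_words(utts, limit=50), mlu_longest(utts, n=2), ttr(utts))
```

With real data, the values from CLAN will differ from a word-based count like this one, so treat the sketch as a way to read the commands rather than a replacement for them.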
Batch File
• maxwd +t*CHI +g1 +c5 +dl 14.cha | mlu > 14.ml5.cex
• maxwd +t*CHI +g1 +c5 +dl 55.cha | mlu > 55.ml5.cex
• maxwd +t*CHI +g1 +c5 +dl 66.cha | mlu > 66.ml5.cex
• maxwd +t*CHI +g1 +c5 +dl 68.cha | mlu > 68.ml5.cex
• maxwd +t*CHI +g1 +c5 +dl 98.cha | mlu > 98.ml5.cex
• batch batch.cex
• Or just run by highlighting in Commands (Windows)
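Because the batch lines differ only in the file number, they can be generated rather than typed. The following Python sketch builds the same command lines from a list of file IDs. The flags are copied verbatim from the slide, and the function name `batch_lines` is hypothetical.

```python
def batch_lines(file_ids, speaker="CHI", longest=5):
    """Build one maxwd | mlu command line per transcript ID."""
    # Flags are copied as they appear on the slide, not independently verified.
    template = "maxwd +t*{spk} +g1 +c{n} +dl {fid}.cha | mlu > {fid}.ml{n}.cex"
    return [template.format(spk=speaker, n=longest, fid=fid) for fid in file_ids]


commands = batch_lines(["14", "55", "66", "68", "98"])
print("\n".join(commands))
```

Saving the printed lines to a file such as batch.cex gives a batch file that can then be run as described above.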
Tables
Child  MLU50  MLU5   TTR    MLT Ratio
14      0.10   0.12   1.84  -0.90
55     -0.70  -0.65  -0.15  -0.94
66     -0.25  -0.19  -0.68  -1.14
68      3.10   2.56  -0.67   1.60
98     -0.95  -1.11  -0.55   0.31
The Editor
Playing a linked file
• Esc-8
• Esc-A
• Cont-Click
• F5
Linking a File - F5
• Cursor on *FAT
• Find file
• F5
• Press space for each utterance
• Save
F5 Tricks
• Go back to last good link
• Space quickly through contained overlap
• If a bullet is missing, cut and paste an old one
• For precision, try Sonic Mode
Sonic Mode
• Esc-0 to start
• Highlight area
• Shift-click to move edge
• Have cursor on line in file
• S to insert time marks
• Triple click a linked sentence
Transcribing
• Open new window (Command-N)
• Insert headers
– @Begin
– @Languages: en
– @Participants: CHI Target_Child, MOT Mother, FAT Father, ROS Brother
– @Date
• F5 with space at each utterance
• Go back and transcribe each bullet (c-click)
• Adjust time marks using Esc-A
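The header step can be scripted when many new transcripts start from the same participants. The Python sketch below builds only the headers named on this slide, using a tab after each header colon. The function name `chat_skeleton` is hypothetical, the language code is the one shown on the slide, and a finished transcript needs more than this (for example a closing @End line), so the result should still be run through CHECK.

```python
def chat_skeleton(participants, language="en", date=None):
    """Return the opening header lines for a new transcript as one string."""
    roles = ", ".join(code + " " + role for code, role in participants)
    lines = [
        "@Begin",
        "@Languages:\t" + language,
        "@Participants:\t" + roles,
    ]
    if date:
        # The date string is passed through unchanged; its format is up to the caller.
        lines.append("@Date:\t" + date)
    return "\n".join(lines) + "\n"


skeleton = chat_skeleton([
    ("CHI", "Target_Child"),
    ("MOT", "Mother"),
    ("FAT", "Father"),
    ("ROS", "Brother"),
])
print(skeleton)
```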
F5, locate sound, enter bullets
click on bullets, transcribe
Or use SoundWalker
Or use the Video Editor
CHECK
• CHECK is CRUCIAL
• Internal: Esc-L
• External: check *.cha
• External CHECK provides fuller control
Options
• Backup
• Wrapping
• Line Numbers
• CHECK
More Options
Line numbers, F5 bullets, SoundAnalyzer
Coder's Editor
• Open barry.cha
• Esc-0
• Cursor on first line
• Open codeshar.cut
• %spa
• Insert $NIA:AC:IN
Coder's Editor Commands
• F1 finish current tier and go to the next
• Esc-c finish coding current tier
• Esc-t restrict coding to a particular speaker
• Esc-Esc go on to the next speaker
• Esc-s rotate subcodes
• Control-g cancel illegal command
Send to Praat
Open Praat, Click before link, Send to Praat, Run Analysis
Learning to Digitize
Searching, Replacing
• Cont-R, Cont-F
• Space, No, !, control-G
Fixing Things
• CHSTRING
• INSERT (inserts @ID headers)
• FIXIT
• LONGTIER
• FIXBULLETS
• REN
• COMBTIER
Tour of English MOR Files
• Download a copy
• A-rules
• C-rules
• Sf.cut
• Lexicon
Running MOR
• Set MOR directory
• mor +xi (dogs)
• mor +xl barry.cha
• Open barry.ulx.cex
• Fix problems using KWAL
• mor *.cha
POST
• mor barry.cha +1 or else
• mor barry.cha and then
• ren *.mor.cex *.cha +f
• post *.cha +1
Fixing POST
• POST is 95% accurate, but some projects need 100% accuracy
• Eve training set may need error checking
• More data will train a better POST
• POST training is mostly about bootstrapping, using regexp to find and correct subcases leading to error
• Need to remove some POS possibilities and add them back through post-POST rules (spell as N)
CHAT
• What is an utterance?
• What is a word?
• Tour of the CHAT manual
Web Browsing of Video
Some examples
• Forrester
• Rollins
• Yasmin
• Paulo
• Brent, MacWhinney
• Classroom - JLS
Rollins Coding
Conclusions
• CHILDES and TalkBank provide solid tools for studying language learning and functioning
• Data-sharing has led to major advances in the field
• New approaches emphasize the use of multimedia analysis, computational linguistics, and speech technology