Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed Text. Dr Dale Chant, Red Centre Software Pty Ltd. ASC Conference: Making Sense of New Research Technologies. Critical Reflections on Methodology and Technology: Gamification, Text Analysis, and Data Visualisation. Friday 6th and Saturday 7th September 2013, University of Winchester
Exposing and Quantifying Narrative and Thematic
Structures in Well-formed and Ill-formed Text
Dr Dale Chant, Red Centre Software Pty Ltd
ASC Conference: Making Sense of New Research Technologies Critical Reflections on Methodology and Technology: Gamification, Text Analysis, and Data Visualisation
Friday 6th and Saturday 7th September 2013, University of Winchester
The Problem
• Coding open-ended verbatims takes a long time
• Inconsistent coding judgements can wreak havoc on small weekly samples
• Some bodies of free text, such as Twitter feeds, are beyond human capacity to digest due to sheer volume
• Machine coding by string matching assumes well-formed text – variant morphologies are difficult to accommodate
Naïve Auto-Coding
• Read all source words (or complete strings) into an array
• Sort alphabetically
• Assign codes from 1 to N, where N is the number of unique words (or unique strings)
• Write the assigned codes in the original word (or string) order
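The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the author's software: the function name `naive_auto_code` and the sample sentence are assumptions made for the example.

```python
def naive_auto_code(words):
    """Assign each unique word a code from 1 to N in alphabetical order.

    Returns the code frame (word -> code) and the codes in source order.
    """
    # Sort the unique words, then number them from 1 to N.
    frame = {word: code for code, word in enumerate(sorted(set(words)), start=1)}
    # Write the assigned codes back in the original word order.
    codes = [frame[word] for word in words]
    return frame, codes


# Hypothetical sample text, split on whitespace.
text = "the quick fox and the lazy fox"
frame, codes = naive_auto_code(text.split())
print(frame)  # {'and': 1, 'fox': 2, 'lazy': 3, 'quick': 4, 'the': 5}
print(codes)  # [5, 4, 2, 1, 5, 3, 2]
```

Coding complete strings instead of words only changes the input list: pass one string per case rather than the result of `split()`.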
Naïve Auto-Coding
• Well-formed published text is code-complete
The first line of Wuthering Heights
The complete code frame has 9,201 items
Netting to a Theme
• With the code frame defined, themes can be netted from individual words
• To be useful, the allowable distance for a positive match needs to scale against the length of the target strings
• ‘ox’ to ‘fox’ has distance 1 (insert at head). This would be a false positive
• ‘megalomania’ to ‘megalomaniacs’ has distance 2 (insert twice at tail). This is a good match
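The two distances quoted above can be checked with a short edit-distance routine. The sketch below assumes the restricted (optimal string alignment) form of Damerau-Levenshtein, which counts insert, delete, replace and adjacent transpose as one step each; the deck does not say which variant its software uses, and the function name `edit_distance` is an illustrative choice.

```python
def edit_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a
    for j in range(cols):
        d[0][j] = j  # insert every character of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete
                d[i][j - 1] + 1,         # insert
                d[i - 1][j - 1] + cost,  # replace (or keep)
            )
            # Adjacent transposition counts as a single step.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(edit_distance("ox", "fox"))                     # 1
print(edit_distance("megalomania", "megalomaniacs"))  # 2
```

Both pairs are small in absolute distance, which is why the allowable distance has to scale with string length: one edit is half of ‘ox’ but a small fraction of ‘megalomania’.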
Scaling the Algorithm (2)
• Short strings need a distance of zero
• Intermediate strings need 1 or 2
• Longer strings can bear 2 or 3 or more
• The thresholds for short/intermediate/long and allowed distances for a positive match are here termed the fuzz parameters
• Fuzz parameters are determined empirically, and will vary with the body of text being analysed.
What is Gained
The target string megalomaniac at an edit distance of 1 will match on:
12 * 26 in situ typos (negalomaniac)
+ 12 missing (megaomaniac)
+ 12 * 26 extraneous (megaloomaniac)
+ 11 transpositions (meglaomaniac)
+ 2 * 26 extra pre/post character (mmegalomaniac)
= 699 possible variations
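The tally above can be reproduced directly. The snippet assumes a 26-letter alphabet and simply follows the slide's own terms; it counts edit operations as the slide does, so a few of them (for example, replacing a letter with itself) do not produce a new string.

```python
target = "megalomaniac"
n = len(target)   # 12 characters
alphabet = 26     # assumed: lower-case a to z only

in_situ = n * alphabet     # replace any one character
missing = n                # drop any one character
extraneous = n * alphabet  # insert one character within the string
transpositions = n - 1     # swap any adjacent pair
pre_post = 2 * alphabet    # one extra character at the head or tail

total = in_situ + missing + extraneous + transpositions + pre_post
print(total)  # 699
```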
The Procedure
• Code the source text, one code per unique word
• Run a sorted frequency count to expose recurrent themes and concepts
• Review actual instances of these words in situ to determine appropriate fuzz parameters and the thematic and conceptual contexts
• Devise a compact target code frame which maps the theme and concept words of interest to synonym and variant lists
• Process the source text against the targets, to create a categorical variable which can be tabulated in the normal manner against any other variable
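A compact sketch of the exposure and netting steps follows, using exact matching only (all fuzz parameters at zero). The function names, the sample words and the two-theme code frame are hypothetical and chosen only to show the shape of the data; code 0 is used here to mean "not coded".

```python
from collections import Counter


def expose(words, top=10):
    """Sorted frequency count: the most common words first."""
    return Counter(words).most_common(top)


def net_to_themes(words, code_frame):
    """Map each word to a theme code via its synonym/variant list (0 = no theme)."""
    lookup = {
        variant: code
        for code, variants in code_frame.items()
        for variant in variants
    }
    return [lookup.get(word, 0) for word in words]


# Hypothetical source words and target code frame.
words = "love is love and death is death and love".split()
code_frame = {
    1: ["love", "loves"],  # theme 1: love and related
    2: ["death", "dead"],  # theme 2: conflict and death
}

print(expose(words, top=1))              # [('love', 3)]
print(net_to_themes(words, code_frame))  # [1, 0, 1, 0, 2, 0, 2, 0, 1]
```

The list of theme codes is the categorical variable: one value per source word, ready to be tabulated against position in the text or any other variable.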
Exposure and Quantification: Romeo and Juliet
• Since this text is bounded, code-complete and well-formed, the fuzz parameters can all be set to zero
• The Exposure step reveals dominance for
i) love and related
ii) misery and despair
iii) conflict and death
Love dominates, then diminishes
Romeo Romeo, wherefore art thou Romeo?
To be replaced by misery and despair
As Percents (share of scene)
Interlaced with Conflict and Death
Near mathematical symmetry:
Wuthering Heights
Ill-formed Text
• Tweets on Australian Federal Politics
• From 1 June 2013 to 31 July 2013
• Search term: #auspol OR #auspoll OR #ausvotes OR #ozcot
• 927,190 cases
• Average of 10,000 to 20,000 per day
• Huge spike on 26 June
Distorting the Message (2)
The Scheduled Automatons
Obsessive / Compulsives
HashTags
• Much more than just a metatag
• Function as message tokens too:
– Commentary on current affairs (#1000BoatDeaths, #20000JobCuts)
– Calls to action (#2013electiondateplease, #AbolishParliament)
– Political attack (#AbbotLies)
– Take a position (#AgeOfEntitlement)
– Make a joke or pun (#calmdownbirdie, #fraudband)
33,206 Unique HashTags over June/July
Colour Masked to Highlight the Clumps
Zoom-in on BattleRort
Hashtag Spawn
Quantification should capture as many variants as possible.
Sort on 19 July, Zoom
The PNG Solution is more punitive than anything the Conservatives have tried
But many instances missed
Three dominant tags are clear, but the variants will be lost under a search on just asylum OR asylumseeker/s
Ditto Refugee/Refugees, etc.
Quantification
• A smoothed percentage chart of all instances exposes the narratives, but to quantify them accurately, we cannot forgo counting the variants.
• To get a more precise read, we apply Damerau-Levenshtein.
• Recalling the four transformation rules (insert, delete, replace, transpose), the following matches (among many others) will be made to the dominant forms at run time:
Confirm it Works
• Set fuzz parameters as distance=0 for strings of 4 characters or fewer, distance=1 for 9 characters or fewer, and distance=2 for 10 characters or more
• Run the source hashtags against these targets to create a new variable comprising eight categorical codes
• To confirm, run a table of the eight coded categories against the original raw hashtag text
The source strings battelrort and calmdownbirdie are both correctly captured and coded.
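The fuzz parameters stated above can be combined with an edit-distance routine as follows. This is a sketch under stated assumptions: the distance is the restricted (optimal string alignment) form of Damerau-Levenshtein, the threshold is scaled on the length of the target string, matching is case-insensitive, and the three-entry `targets` frame is hypothetical (the deck's eight-category frame is not reproduced in this transcript).

```python
def edit_distance(a, b):
    """Restricted Damerau-Levenshtein distance (insert, delete, replace, transpose)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def allowed_distance(target):
    """Fuzz parameters from the slide: 0 up to 4 chars, 1 up to 9, 2 for 10 or more."""
    if len(target) <= 4:
        return 0
    if len(target) <= 9:
        return 1
    return 2


def code_string(source, targets):
    """Return the first category code whose target is within the allowed distance."""
    source = source.lower()  # assumption: case is ignored
    for code, variants in targets.items():
        for target in variants:
            if edit_distance(source, target.lower()) <= allowed_distance(target):
                return code
    return 0  # 0 = not coded


# Hypothetical target frame for illustration only.
targets = {1: ["battlerort"], 2: ["calmdownbirdie"], 3: ["rudd"]}

print(code_string("battelrort", targets))      # 1 (one transposition, 2 allowed)
print(code_string("calmdownbirdie", targets))  # 2 (exact match)
print(code_string("crudd", targets))           # 0 (distance 1, but 0 allowed for 4 chars)
```

Tabulating the returned codes against the raw source strings gives the confirmation table described above.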
Doing Likewise for the Tweet Text
Code    Category    Synonym/Variant Targets
the government
1    Rudd    Kevin Rudd/KevinRudd/Kevin13/Kevin747/Kevin/@KRuddMP/KRudd/Rudd/CrudDudd/KR/milky bar/messiah
2    Albanese    Albanese/Albosleaze/Albo
3    Gillard    Julia Gillard/JuliaGillard/Gillard/Juliar/Julia/Jules
4    Shorten    Bill Shorten/Shorten
5    Labor
Continued to Code 46
Code    Category    Synonym/variant Targets
34    Corruption    corruption/corrupt/fraud/sleaze/dishonest/stealing/steal/greed/shonky/crooks/criminals/