Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed Text Dr Dale Chant, Red Centre Software Pty Ltd ASC Conference: Making Sense of New Research Technologies Critical Reflections on Methodology and Technology: Gamification, Text Analysis, and Data Visualisation Friday 6 th and Saturday 7 th September 2013, University of Winchester

Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed Text

Feb 25, 2016

Transcript
Page 1: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Exposing and Quantifying Narrative and Thematic

Structures in Well-formed and Ill-formed Text

Dr Dale Chant, Red Centre Software Pty Ltd

ASC Conference: Making Sense of New Research Technologies Critical Reflections on Methodology and Technology: Gamification, Text Analysis, and Data Visualisation

Friday 6th and Saturday 7th September 2013, University of Winchester

Page 2: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

The Problem

• Coding open-ended verbatims takes a long time

• Inconsistent coding judgements can wreak havoc on small weekly samples

• Some bodies of free text, such as Twitter feeds, are beyond human capacity to digest due to sheer volume

• Machine coding by string matching assumes well-formed text – variant morphologies difficult to accommodate

Page 3: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Naïve Auto-Coding

• Read all source words (or complete strings) into an array

• Sort alphabetically

• Assign codes from 1 to N, where N is the number of unique words (or unique strings)

• Write the assigned codes in the original word (or string) order
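The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the author's software: the function name `naive_autocode` and the tokenisation rule (lowercased runs of letters and apostrophes) are assumptions made for the example.

```python
import re

def naive_autocode(text):
    """Return (code_frame, coded_text) for a body of text.

    code_frame maps each unique word to a code from 1 to N, assigned in
    alphabetical order; coded_text lists the codes in original word order.
    """
    # Assumed tokenisation: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    frame = {word: code for code, word in enumerate(sorted(set(words)), start=1)}
    coded = [frame[word] for word in words]
    return frame, coded

frame, coded = naive_autocode("The cat saw the other cat")
print(frame)  # {'cat': 1, 'other': 2, 'saw': 3, 'the': 4}
print(coded)  # [4, 1, 3, 4, 2, 1]
```

Because the codes follow alphabetical order, morphological variants of a word tend to receive adjacent codes.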

Page 4: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Naïve Auto-Coding

• Well-formed published text is code-complete

The first line of Wuthering Heights

The complete code frame has 9,201 items

Page 5: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Netting to a Theme

• With the code frame defined, themes can be netted from individual words

abandonment = abandon/abandoned/abandonment/reject/rejected/rejecting

Theme(1) = Text(3/5,6530/6532)

Coded Decoded
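Netting can be sketched as a lookup of code ranges over the coded text. The example below reads `Text(3/5,6530/6532)` as two inclusive code ranges (3 to 5 and 6530 to 6532), which is an interpretation of the slide's notation rather than documented syntax. The toy code frame and the function name `net_theme` are hypothetical.

```python
def net_theme(coded_words, code_ranges):
    """Return the positions whose code falls inside any inclusive range."""
    return [
        position
        for position, code in enumerate(coded_words)
        if any(low <= code <= high for low, high in code_ranges)
    ]

# Toy alphabetical code frame (hypothetical; the real frame has 9,201 items).
frame = {"abandon": 1, "abandoned": 2, "door": 3, "reject": 4, "rejected": 5, "walk": 6}
text = ["abandoned", "door", "rejected", "walk", "abandon"]
coded = [frame[word] for word in text]   # [2, 3, 5, 6, 1]

# Analogue of Theme(1) = Text(3/5,6530/6532) for the toy frame.
abandonment = [(1, 2), (4, 5)]
hits = net_theme(coded, abandonment)
print(hits)                      # [0, 2, 4]
print([text[i] for i in hits])   # ['abandoned', 'rejected', 'abandon']
```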

Page 6: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Code Incomplete

• Open-ended tracker Brand Awareness questions, time-dependent blog or social media exchanges

• Can never be code-complete, because forthcoming data may throw up unanticipated variations

Dog/dogs/mongrel/mongrels/mutt/mutts/dingo/wolf/…

Page 7: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Damerau-Levenshtein

• One approach is Approximate String Matching

• Match a source string to a target string by combinations of

i) Insert

ii) Delete

iii) Replace

iv) Transpose

• The edit distance is the number of transforms needed to get from the source to the target
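A minimal sketch of the edit distance follows. It implements the restricted variant of Damerau-Levenshtein (often called optimal string alignment), in which a transposed pair is not edited again. Whether the author's software uses this variant or the unrestricted one is not stated in the slides, so treat the choice as an assumption.

```python
def osa_distance(source, target):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(source) + 1, len(target) + 1
    # d[i][j] = edits needed to turn source[:i] into target[:j]
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete every source character
    for j in range(cols):
        d[0][j] = j                      # insert every target character
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete
                d[i][j - 1] + 1,         # insert
                d[i - 1][j - 1] + cost,  # replace (or keep)
            )
            # transpose two adjacent characters
            if (i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(osa_distance("ox", "fox"))                      # 1
print(osa_distance("megalomania", "megalomaniacs"))   # 2
print(osa_distance("meglaomaniac", "megalomaniac"))   # 1
```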

Page 8: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

The Algorithm in Action

• There is an interactive implementation of Damerau-Levenshtein at

http://fuzzy-string.com/Compare/

Page 9: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Scaling the Algorithm (1)

• To be useful, the allowable distance for a positive match needs to scale against the length of the target strings

• ‘ox’ to ‘fox’ has distance 1 (insert at head). This would be a false positive

• ‘megalomania’ to ‘megalomaniacs’ has distance 2 (insert twice at tail). This is a good match

Page 10: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Scaling the Algorithm (2)

• Short strings need a distance of zero

• Intermediate strings need 1 or 2

• Longer strings can bear 2 or 3 or more

• The thresholds for short/intermediate/long and allowed distances for a positive match are here termed the fuzz parameters

• Fuzz parameters are determined empirically, and will vary with the body of text being analysed.
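One way to represent the fuzz parameters is as a small record of length thresholds and allowed distances. The sketch below is illustrative only: the class and field names are hypothetical, and the default values are placeholders that would be tuned empirically for each body of text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FuzzParameters:
    # Illustrative defaults only; real values are determined empirically.
    short_max_len: int = 4           # targets this long or shorter are "short"
    intermediate_max_len: int = 9    # targets up to this length are "intermediate"
    short_distance: int = 0
    intermediate_distance: int = 1
    long_distance: int = 2

    def allowed_distance(self, target: str) -> int:
        """Largest edit distance that still counts as a match for this target."""
        n = len(target)
        if n <= self.short_max_len:
            return self.short_distance
        if n <= self.intermediate_max_len:
            return self.intermediate_distance
        return self.long_distance

def is_positive_match(distance: int, target: str, fuzz: FuzzParameters) -> bool:
    return distance <= fuzz.allowed_distance(target)

fuzz = FuzzParameters()
print(is_positive_match(1, "fox", fuzz))            # False: short target, distance 1
print(is_positive_match(2, "megalomaniacs", fuzz))  # True: long target, distance 2
```

Scaling against the target length, as here, rejects the ‘ox’ to ‘fox’ false positive while still accepting ‘megalomania’ to ‘megalomaniacs’.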

Page 11: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

What is Gained

The target string megalomaniac at an edit distance of 1 will match on:

12 * 26 in situ typos (negalomaniac)

+ 12 missing (megaomaniac)

+ 12 * 26 extraneous (megaloomaniac)

+ 11 transpositions (meglaomaniac)

+ 2 * 26 extra pre/post character (mmegalomaniac)

= 699 possible variations
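The tally above can be cross-checked by brute force. The sketch below enumerates every string one edit away from the target and counts the distinct results, assuming a lowercase a to z alphabet; the function name is hypothetical. Because the enumeration de-duplicates, its count can come out below the slide's arithmetic, which adds the categories without removing overlaps.

```python
import string

def variants_within_one_edit(target, alphabet=string.ascii_lowercase):
    """Distinct strings exactly one insert, delete, replace or transpose away."""
    splits = [(target[:i], target[i:]) for i in range(len(target) + 1)]
    deletes = {head + tail[1:] for head, tail in splits if tail}
    transposes = {head + tail[1] + tail[0] + tail[2:]
                  for head, tail in splits if len(tail) > 1}
    replaces = {head + c + tail[1:] for head, tail in splits if tail for c in alphabet}
    inserts = {head + c + tail for head, tail in splits for c in alphabet}
    # Replacing a letter with itself reproduces the target, so drop it.
    return (deletes | transposes | replaces | inserts) - {target}

variants = variants_within_one_edit("megalomaniac")
print(len(variants))
print("negalomaniac" in variants, "meglaomaniac" in variants)
```

Under these assumptions the enumeration reports 649 distinct strings. The slide's 699 is best read as an upper bound, since it counts same-letter replacements and insertions that yield the same string more than once.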

Page 12: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

The Procedure

• Code the source text, one code per unique word

• Run a sorted frequency count to expose recurrent themes and concepts

• Review actual instances of these words in situ to determine appropriate fuzz parameters and the thematic and conceptual contexts

• Devise a compact target code frame which maps the theme and concept words of interest to synonym and variant lists

• Process the source text against the targets, to create a categorical variable which can be tabulated in the normal manner against any other variable

Page 13: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Exposure and Quantification: Romeo and Juliet

• Since this text is bounded, code-complete and well-formed, the fuzz parameters can all be set to zero

• The Exposure step reveals dominance for

i) love and related

ii) misery and despair

iii) conflict and death

Page 14: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Love dominates, then diminishes

Romeo Romeo, wherefore art thou Romeo?

Page 15: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

To be replaced by misery and despair

Page 16: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

As Percents (share of scene)

Page 17: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Interlaced with Conflict and Death

Near mathematical symmetry:

Page 18: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Wuthering Heights

Page 19: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Ill-formed Text

• Tweets on Australian Federal Politics

• From 1 June 2013 to 31 July 2013

• Search term: #auspol OR #auspoll OR #ausvotes OR #ozcot

• 927,190 cases

• Average of between 10,000 and 20,000 per day

• Huge spike on 26 June

Page 20: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

http://votecompass.com/2013/07/25/are-you-among-australias-most-influential-political-tweeters-votecompass-maps-the-auspol-twittersphere/

Page 21: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Tweet Frequencies

Page 22: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Data Sources

• Two commercial data source providers were used: Gnip and ScraperWiki

• The Gnip data was collected in a single 28 hour run conducted on 15 Aug 2013

• ScraperWiki provides user-initiated searches for up to the prior seven days

• Because ScraperWiki is near real time, accounts that had been banned or suspended by 15 August, and are hence absent from the Gnip data, remain present in the ScraperWiki data

• The ScraperWiki data is used below only to demonstrate this point.

http://gnip.com/ https://scraperwiki.com/

Page 23: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Australian Federal Politics since 2007

Timeline (figure, axis ‘07 to ‘13). Leaders shown: Howard, Conservative; Rudd, Labor; Gillard, Labor; Abbott, Conservative. Events marked:

• Rudd defeats Howard at general election

• Gillard challenges and defeats Rudd, calls election, hung parliament

• Rudd challenges and defeats Gillard, calls election for 7 Sept

Page 24: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

The Grand Narrative

Assange Senate Bid

• With the Pretender to the Throne (Gillard) summarily dispatched

• The True and Rightful King (Rudd), triumphantly returned from (backbench) exile

• Now faces the Great Adversary (Abbott) in a battle to the (political) death for control of the realm

Warning: Aussie Vernacular Alert

Page 25: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Rudd’s Major Problem

Page 26: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

The Conservative’s Hammer

Page 27: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Distorting the Message (1)

Attack of the TweetBots

Page 28: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

@ALPDirt Posts Once a Minute

Page 29: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Distorting the Message (2)

The Scheduled Automatons

Page 30: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Obsessive / Compulsives

Page 31: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

HashTags

• Much more than just a metatag

• Function as message tokens too:

– Commentary on current affairs (#1000BoatDeaths, #20000JobCuts)

– Calls to action (#2013electiondateplease, #AbolishParliament)

– Political attack (#AbbotLies)

– Take a position (#AgeOfEntitlement)

– Make a joke or pun (#calmdownbirdie, #fraudband)

Page 32: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

33,206 Unique HashTags over June/July

Page 33: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Colour Masked to Highlight the Clumps

Page 34: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Zoom-in on BattleRort

Page 35: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Hashtag Spawn

Quantification should capture as many variants as possible.

Page 36: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Sort on 19 July, Zoom

The PNG Solution is more punitive than anything the Conservatives have tried

Page 37: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

But many instances missed

Three dominant tags are clear, but the variants will be lost under a search on just asylum OR asylumseeker/s

Ditto Refugee/Refugees, etc.

Page 38: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Quantification

• A smoothed percentage chart of all instances exposes the narratives, but to quantify them accurately, we cannot forgo counting the variants.

• To get a more precise read, we apply Damerau-Levenshtein.

• Recalling the four transformation rules (insert, delete, replace, transpose), the following matches (among many others) will be made to the dominant forms at run time:

• battelrort->battlerort (transpose once)

• calmdownbirdie->calmdownbridie (transpose once)

• asylumseeke->asylumseekers (insert twice)

• asylumseekeers->asylumseekers (delete once)

• asylymseekers->asylumseekers (replace once)

Page 39: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Prepare the synonym/variants lists for the dominant tags

The procedure is:

• Code the hashtags, one code per unique tag

• Generate a sorted frequency count table

• Choose a cut-off point - I have used 30

• Review all items > 30, define and initialise a coded synonym/variants list with the dominant tags

• Sort the table alphabetically by label

• Review label blocks for any variants which are too coarse for Damerau-Levenshtein, and add to the relevant synonym/variants target list
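The counting and sorting steps of this procedure can be sketched with the standard library. The hashtag pattern, the case-folding, and the function names are assumptions for illustration; the review steps themselves remain a human judgement.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")   # assumed pattern for a hashtag token

def hashtag_frequencies(tweets):
    """Count each unique hashtag (case-folded, an assumption) across tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tag.lower() for tag in HASHTAG.findall(tweet))
    return counts

def dominant_tags(counts, cutoff=30):
    """Tags occurring more than `cutoff` times, most frequent first."""
    return [(tag, n) for tag, n in counts.most_common() if n > cutoff]

def alphabetical_review(counts):
    """All tags sorted by label, so near-variants appear in adjacent blocks."""
    return sorted(counts.items())

# Toy data (hypothetical), with the cut-off lowered to suit the tiny sample.
tweets = ["#auspol #BattleRort again", "#battlerort #auspol", "typo #battelrort #auspol"]
counts = hashtag_frequencies(tweets)
print(dominant_tags(counts, cutoff=1))   # [('#auspol', 3), ('#battlerort', 2)]
print(alphabetical_review(counts))
```

In the alphabetical view the misspelt `#battelrort` sits directly beside `#battlerort`, which is what makes the label-block review practical.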

Page 40: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Define Coded Categories against Targets

Code | Category | Synonym/Variant Targets

1 | BattleRort | #BattleRort/#BattleRortAbbott/#BattleRortGate/#BattleRortMovies/#BattleRortSongs

2 | CalmDownBridie | #CalmDownBridie/#CalmDownTony/#CalmDownAbbott/#CalmDown

3 | Any Media | #qanda/#abcnews24/#abcnews/#abc24/#Insiders/#730report/#abc730/#730/#lateline/#thedrum/#pmlive/#pmagenda/#amagenda/#4corners/#contrarians/#abc/#MSMfail/#MSM/#contrarians/#theboltreport/#viewpoint/#media/#Murdoch/#ABC1/#mediawatch/#datelineSBS

4 | Any Border Protection | #asylumseekers/#refugees/#boatpeople/#asylum/#PNGSolution/#PNG/#Nauru/#ManusIsland/#Manus/#humanrights/#stoptheboats/#Indonesia/#immigration/#boats/#operationsovereignborders

5 | Islam | #islamophobe/#islamist/#islamlaw/#islamic/#Islam/#muslim

6 | NBN | #NBNCo/#NBN/#fraudband

7 | Any Environment | #climate/#coal/#fracking/#carbon/#energy/#CSG/#climatetax/#climatechange/#environment/#ETS/#carbonscam/#carbontaxscam/#climatescam/#climatecon/#green/#AGWHoax/#globalwarming/#greenarmy/#renewables/#votegreen/#climategate/#naturalcsg/#naturalgas

8 | pinkbatts | #pinkbatts

Page 41: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Confirm it Works

• Set fuzz parameters as distance=0 for strings 4 characters or less, distance=1 for 9 characters or less, and distance=2 for 10 characters or more

• Run the source hashtags against these targets to create a new variable comprising eight categorical codes

• To confirm, run a table of the eight coded categories against the original raw hashtag text

The source strings battelrort and calmdownbirdie are both correctly captured and coded.
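The confirmation run can be sketched end to end with the fuzz parameters stated on this slide. This is an illustration under assumptions, not the author's implementation: the distance function is the restricted (optimal string alignment) variant of Damerau-Levenshtein, tags are case-folded with the leading `#` stripped, only three of the eight categories are included, and the first matching category wins.

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein distance, keeping three rows in memory."""
    prev2, prev = None, list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            best = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb))
            if i > 1 and j > 1 and ca == b[j - 2] and a[i - 2] == cb:
                best = min(best, prev2[j - 2] + 1)   # adjacent transposition
            cur.append(best)
        prev2, prev = prev, cur
    return prev[-1]

def allowed_distance(target):
    """Fuzz parameters from the slide: 0 up to 4 chars, 1 up to 9, else 2."""
    if len(target) <= 4:
        return 0
    if len(target) <= 9:
        return 1
    return 2

# Subset of the coded categories, case-folded and without '#'.
TARGETS = {
    1: ["battlerort", "battlerortabbott", "battlerortgate",
        "battlerortmovies", "battlerortsongs"],
    2: ["calmdownbridie", "calmdowntony", "calmdownabbott", "calmdown"],
    6: ["nbnco", "nbn", "fraudband"],
}

def code_hashtag(tag):
    """Return the first category code whose targets fuzzily match, else None."""
    source = tag.lstrip("#").lower()
    for code, targets in TARGETS.items():
        for target in targets:
            if osa_distance(source, target) <= allowed_distance(target):
                return code
    return None

print(code_hashtag("#battelrort"))      # 1
print(code_hashtag("#calmdownbirdie"))  # 2
print(code_hashtag("#nbm"))             # None: 'nbn' is short, so distance must be 0
```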

Page 42: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Doing Likewise for the Tweet Text

Code | Category | Synonym/Variant Targets

the government

1 | Rudd | Kevin Rudd/KevinRudd/Kevin13/Kevin747/Kevin/@KRuddMP/KRudd/Rudd/CrudDudd/KR/milky bar/messiah

2 | Albanese | Albanese/Albosleaze/Albo

3 | Gillard | Julia Gillard/JuliaGillard/Gillard/Juliar/Julia/Jules

4 | Shorten | Bill Shorten/Shorten

5 | Labor Party | Labor Party/Labor/#ALP/ALP/the government/the govt

6 | Greens | Greens/Milne/Bob Brown

7 | Unions | unions/faceless men/AWU/HSU/Bill Ludwig/Ludwig/Paul Howes/Piggy Howes

the opposition

8 | Abbott | @TonyAbbotMHR/Tony Abbott/TonyAbbott/TAbbott/Abbott/Tony/TA/budgie smuggler/mad monk

9 | Turnbull | Malcolm Turnbull/@TurnbullMalcolm/Turnbull

10 | Coalition | Coalition/liberal/the opposition/Libs/#LNP/LNP

etc to code 46

Page 43: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Continued to Code 46

Code | Category | Synonym/Variant Targets

34 | Corruption | corruption/corrupt/fraud/sleaze/dishonest/stealing/steal/greed/shonky/crooks/criminals/thuggish/thugs/thieves/thief

35 | Treachery | treachery/treacherous/back stab/backstab/back stabbing/backstabbing/stabbed/stabbing/knifing/knifed/plot/betrayal/betrayed/betray/spill/leadership coup/ousting/oust

36 | Insanity | insanity/insane/nutter/crazy/lunatic/unstable/psychopath/psychotic/psycho/narcissist/delusional/delusion/egotist/egotistic/egotistical/egomaniac/ego/power mad/powermad/madness/deviant

37 | Stupidity | stupidity/stupid/wanker/numpty/imbecilic/imbecile/zombie/clueless/moron/retarded/retard/idiotic/idiot/bogan

38 | Incompetence | incompetent/mismanagement/dysfunctional/waste/inefficient/chaotic/chaos/destructive/inept

39 | Cowardice | cowardice/coward/gutless/ticker/frightened/scared

40 | Hypocrisy | hypocrisy/hypocrit/bigoted/bigot

41 | Arrogance | arrogance/arrogant/smart arse/smartarse/smart ass/self indulgent/smugness/smug

scandals

42 | Scandals | scandalous/scandal

43 | AWU Slush Fund | AWU slush fund/slush fund/Bruce Wilson

44 | ALP Scandals | Peter Slipper/Slipper/Craig Thompson/Thompson/Eddie Obeid/Obeid

45 | Heiner | Heiner

policy

46 | Policy | agenda/policies/policy

Page 44: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Share of Voice

All synonym and variant matches for Rudd, Abbott, Gillard, as percentages of the sum of their total mentions per day:
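The share-of-voice calculation described here is a per-day normalisation. A minimal sketch follows; the function name and the mention counts are hypothetical and are not taken from the tweet data.

```python
def share_of_voice(daily_mentions):
    """Convert per-day mention counts into percentages of that day's total."""
    shares = {}
    for day, counts in daily_mentions.items():
        total = sum(counts.values())
        shares[day] = {
            name: (100.0 * n / total if total else 0.0)
            for name, n in counts.items()
        }
    return shares

# Hypothetical counts for illustration only.
mentions = {
    "day 1": {"Rudd": 50, "Abbott": 30, "Gillard": 20},
    "day 2": {"Rudd": 10, "Abbott": 30, "Gillard": 0},
}
shares = share_of_voice(mentions)
print(shares["day 1"])  # {'Rudd': 50.0, 'Abbott': 30.0, 'Gillard': 20.0}
print(shares["day 2"])  # {'Rudd': 25.0, 'Abbott': 75.0, 'Gillard': 0.0}
```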

Page 45: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Compare June to July

Page 46: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Tweet Categories by Day

Page 47: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Rudd vs Abbott – Image Attributes

Page 48: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Corruption Share – June vs July

Page 50: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

Performance

• Machine: Standard business Dell laptop, dual core, 4 gig RAM, nothing fancy, no accelerations

• The bottleneck is the Damerau-Levenshtein step on the tweet text, which for the above 46 categories over 113 meg of plain text, takes about 15 hours

• Run time scales linearly with the number of individual target synonyms/variants

• Damerau-Levenshtein on the hashtags, a much smaller set of targets, completes in about 20 minutes

• The major time commitment from a human is in devising the target synonym and variants lists, here several hours

• For more routine applications of the technique, such as open-ended brand lists, preparing the target lists is trivial


Page 51: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

The Finale

http://www.theage.com.au/

Page 53: Exposing and Quantifying Narrative and Thematic Structures in Well-formed and Ill-formed  Text

DatDat: Rebirthing Dada for the Digital Milieu

Tentative: 5.30 Tuesday Goldsmiths. Email [email protected] to confirm.