G. Futia F. Cairo F. Morando L. Leschiutta Exploiting Linked Open Data and Natural Language Processing for Classification of Political Speech Krems, 22 nd May 2014
Exploiting Linked Open Data and Natural Language Processing for Classification of Political Speech

Jul 04, 2015

Slides presented during the International Conference for E-Democracy and Open Government 2014
Transcript
Page 1: Exploiting Linked Open Data  and Natural Language Processing for Classification of Political Speech

G. Futia F. Cairo F. Morando L. Leschiutta

Exploiting Linked Open Data and Natural Language Processing for Classification of Political Speech

Krems, 22nd May 2014

Page 2: Exploiting Linked Open Data  and Natural Language Processing for Classification of Political Speech

22nd May 2014 Giuseppe Futia – Politecnico di Torino 2

Introduction

● Our goal:

● help anyone interested in the automatic categorization of political speeches to identify unambiguously the main political trends addressed by the White House

● What we have to achieve our goal:

● TellMeFirst (http://tellmefirst.polito.it/), a topic extraction tool:

– it leverages the DBpedia knowledge base and the English Wikipedia linguistic corpus

– it exploits Linked Open Data (LOD) and Natural Language Processing (NLP) techniques

Page 3: Exploiting Linked Open Data  and Natural Language Processing for Classification of Political Speech

DBpedia

● A crowd-sourced community effort to extract structured information from Wikipedia and a central interlinking hub for the Linking Open Data project.

● It is a suitable knowledge base for text classification (Mendes et al., 2012; Hellmann et al., 2013; Steinmetz et al., 2013)

Page 4: Exploiting Linked Open Data  and Natural Language Processing for Classification of Political Speech

Why DBpedia for US political speeches?

Comparison between the coverage of US politics and the coverage of politics of other countries

The coverage of politics in Wikipedia is “often very good for recent or prominent topics but is lacking on older or more obscure topics” (Brown, 2011).

Page 5: Exploiting Linked Open Data  and Natural Language Processing for Classification of Political Speech

Text Categorization Approach

● An instance-based approach: TellMeFirst assigns target documents to classes based on a local comparison between a set of pre-classified documents and the target document itself

● This training set consists of all the Wikipedia paragraphs where a wikilink occurs. These paragraphs are stored in a Lucene index, where each document represents a DBpedia resource
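The instance-based comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the TellMeFirst implementation: the toy training texts, the helper names (`build_index`, `classify`) and the plain cosine similarity are all assumptions standing in for the Lucene index built from Wikipedia paragraphs.

```python
import math
import re
from collections import Counter

# Hypothetical toy training set: one "document" per DBpedia resource,
# standing in for the Wikipedia paragraphs that link to that resource.
TRAINING = {
    "http://dbpedia.org/resource/Social_Security_(United_States)":
        "retirement benefits payroll tax pension elderly workers benefits",
    "http://dbpedia.org/resource/Deepwater_Horizon_oil_spill":
        "oil spill gulf mexico drilling rig explosion coast cleanup",
    "http://dbpedia.org/resource/Patient_Protection_and_Affordable_Care_Act":
        "health insurance coverage premiums patients care reform",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(training):
    """Return (idf, vectors): IDF weights and one TF-IDF vector per URI."""
    counts = {uri: Counter(tokenize(text)) for uri, text in training.items()}
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())
    n_docs = len(counts)
    idf = {term: math.log(1 + n_docs / df) for term, df in doc_freq.items()}
    vectors = {
        uri: {term: tf * idf[term] for term, tf in term_counts.items()}
        for uri, term_counts in counts.items()
    }
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, idf, vectors, top_k=7):
    """Rank the training URIs by similarity to the target text."""
    term_counts = Counter(tokenize(text))
    query = {t: tf * idf[t] for t, tf in term_counts.items() if t in idf}
    ranked = sorted(vectors, key=lambda uri: cosine(query, vectors[uri]),
                    reverse=True)
    return ranked[:top_k]


idf, vectors = build_index(TRAINING)
top_topics = classify("The oil spill in the Gulf damaged the coast", idf, vectors)
```

Here `top_topics` is a list of URIs ordered by similarity, with the best match first.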

Page 6: Exploiting Linked Open Data  and Natural Language Processing for Classification of Political Speech

Success rate (%) of the TellMeFirst classification process on US Presidents' profiles

Input | 1st topic | Within the first 2 topics | Within the first 7 topics
Full text of the Presidents' profiles | 95.4% | 100% | 100%
Presidents' profiles without name and surname | 45.4% | 61.3% | 90.9%

TellMeFirst provides as output the seven most relevant topics of the document (in the form of DBpedia URIs), sorted by relevance
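A success rate "within the first k topics" can be computed as sketched below. The sample data, the `dbr:` shorthand and the function name `success_rate` are hypothetical and only illustrate the measure; they are not the evaluation data behind the table.

```python
def success_rate(results, k):
    """Percentage of documents whose expected URI is among the first k topics.

    `results` is a list of (expected_uri, ranked_uris) pairs, where
    `ranked_uris` is the classifier output sorted by relevance.
    """
    hits = sum(1 for expected, ranked in results if expected in ranked[:k])
    return 100.0 * hits / len(results)


# Hypothetical outputs for four profiles (not the real evaluation data).
SAMPLE_RESULTS = [
    ("dbr:George_Washington", ["dbr:George_Washington", "dbr:Virginia"]),
    ("dbr:John_Adams", ["dbr:Massachusetts", "dbr:John_Adams"]),
    ("dbr:Thomas_Jefferson", ["dbr:Virginia", "dbr:Monticello", "dbr:Thomas_Jefferson"]),
    ("dbr:James_Madison", ["dbr:Virginia", "dbr:Federalist_Papers"]),
]

rate_first = success_rate(SAMPLE_RESULTS, 1)
rate_top2 = success_rate(SAMPLE_RESULTS, 2)
rate_top7 = success_rate(SAMPLE_RESULTS, 7)
```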

Page 7: Exploiting Linked Open Data  and Natural Language Processing for Classification of Political Speech

whitehouse.gov

● 3173 videos in English were available on the White House website on the 24th of November 2013

● These videos are categorized according to a taxonomy not related to the subject of the speeches

● They need a semantic layer that points out the content of the speeches, so that questions such as “what is the First Lady talking about?” can be answered automatically

Page 8: Exploiting Linked Open Data  and Natural Language Processing for Classification of Political Speech

Not just a bag-of-words tool

Results obtained with TellMeFirst (on the left) and with TagCrowd (on the right)

«President Obama Speaks on the Affordable Care Act» http://1.usa.gov/1jR4Ky2

Page 9: Exploiting Linked Open Data  and Natural Language Processing for Classification of Political Speech

Results (i)

Topic | Occ. | % overall | % 2013 | % 2012 | % 2011 | % 2010 | % 2009
Barack Obama | 607 | 4.88% | 5.68% | 4.52% | 5.51% | 4.45% | 3.88%
Patient Protection and Affordable Care Act | 286 | 2.30% | 3.06% | 1.35% | 1.91% | 2.47% | 2.71%
American Recovery and Reinvestment Act of 2009 | 278 | 2.23% | 1.09% | 1.82% | 2.88% | 2.84% | 1.88%
Social Security | 272 | 2.19% | 2.58% | 1.77% | 3.54% | 1.61% | 0.78%

Amount and percentage of topic occurrences extracted with TellMeFirst
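Occurrence counts and percentages of this kind can be derived from the extracted topics with a simple tally. The sketch below assumes the classifier output has been flattened into `(year, topic)` pairs; that input shape, the sample data and the name `topic_shares` are assumptions for illustration.

```python
from collections import Counter


def topic_shares(extractions):
    """Return {topic: (occurrences, overall %, {year: % within that year})}."""
    overall = Counter()
    per_year = Counter()
    year_totals = Counter()
    for year, topic in extractions:
        overall[topic] += 1
        per_year[(year, topic)] += 1
        year_totals[year] += 1
    total = sum(overall.values())
    shares = {}
    for topic, occurrences in overall.items():
        by_year = {
            year: 100.0 * per_year[(year, topic)] / year_total
            for year, year_total in year_totals.items()
        }
        shares[topic] = (occurrences, 100.0 * occurrences / total, by_year)
    return shares


# Hypothetical extraction output: one (year, topic) pair per extracted topic.
SAMPLE = [
    (2010, "Deepwater Horizon oil spill"),
    (2010, "Barack Obama"),
    (2011, "Libya"),
    (2011, "Barack Obama"),
]

shares = topic_shares(SAMPLE)
```

As a consistency check on the table itself, 607 occurrences at 4.88% imply a total of roughly 607 / 0.0488 ≈ 12,400 topic occurrences across all speeches.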

Page 10: Exploiting Linked Open Data  and Natural Language Processing for Classification of Political Speech

Results (ii)

● “New Deal” (141 occurrences), probably used as a metaphor within the political speeches of President Obama

● “Libya” accounts for 1.00% of topic occurrences in 2011. This result can be related to the full-scale revolt that began in Libya on 17 February 2011

● “Deepwater Horizon oil spill” reaches 1.05% in 2010. This result is related to the marine oil spill in the Gulf of Mexico that began on 20 April 2010

Page 11: Exploiting Linked Open Data  and Natural Language Processing for Classification of Political Speech

Correlation among topics

Page 12: Exploiting Linked Open Data  and Natural Language Processing for Classification of Political Speech

A focus on the First Lady (i)

● According to Michelle Obama’s page on the White House website, the First Lady “looks forward to continuing her work on the issues close to her heart”:

● supporting military families

● helping working women balance career and family

● encouraging national service

● promoting the arts and arts education

● fostering healthy eating and healthy living for children and families across the country

Page 13: Exploiting Linked Open Data  and Natural Language Processing for Classification of Political Speech

A focus on the First Lady (ii)

● We tested whether TellMeFirst confirms these impressions and claims by manually selecting nine Wikipedia categories that seemed to be related to these issues

● We then queried the DBpedia SPARQL endpoint to collect all the topics in these categories

● We then associated each topic with one or more of the nine high-level categories: these categories encompassed almost 75% of the topics
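A query of this kind could look like the sketch below, which only builds the SPARQL text and the request URL without contacting the endpoint. The exact query used by the authors is not given in the slides: the use of `dct:subject`, the restriction to direct category members, the `LIMIT`, and the helper names are assumptions.

```python
from urllib.parse import urlencode

# Public DBpedia endpoint; the `format` parameter below is an assumption
# about how results would be requested.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def category_topics_query(category, limit=1000):
    """Build a SPARQL query listing resources filed directly under a category.

    `category` is the category name with underscores, e.g. "Health_care".
    Subcategories are not followed in this sketch.
    """
    return f"""
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dbc: <http://dbpedia.org/resource/Category:>
SELECT DISTINCT ?topic WHERE {{
  ?topic dct:subject dbc:{category} .
}} LIMIT {limit}
""".strip()


def request_url(category):
    """Return the GET URL that would submit the query to the endpoint."""
    params = {
        "query": category_topics_query(category),
        "format": "application/sparql-results+json",
    }
    return f"{DBPEDIA_ENDPOINT}?{urlencode(params)}"


nutrition_query = category_topics_query("Nutrition")
nutrition_url = request_url("Nutrition")
```

Running one such query per category and merging the results gives a topic-to-categories mapping; because a topic can be filed under several categories, category shares computed from it may sum to more than 100%.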

Page 14: Exploiting Linked Open Data  and Natural Language Processing for Classification of Political Speech

A focus on the First Lady (iii)

Wikipedia category | First Lady speeches (9 categories) | All speeches (9 categories)
Government of the United States | 26.68% | 32.68%
Education | 21.64% | 5.40%
Nutrition | 19.96% | 1.61%
Social issues | 14.71% | 28.38%
Barack Obama | 13.66% | 14.00%
Health care | 11.34% | 7.57%
Arts | 8.61% | 1.11%
Military personnel | 3.99% | 3.16%
Gender equality | 2.73% | 0.84%
Others (unclassified topics) | 25.63% | 38.34%

Page 15: Exploiting Linked Open Data  and Natural Language Processing for Classification of Political Speech

Conclusions (i)

● The ability for citizens to easily retrieve the content of political speeches and decisions is a crucial factor in e-participation

● Not guaranteed by a traditional keyword search, as in most public administration websites (the White House website included)

● Example: in a keyword-based system, typing the word "education" returns only videos that have the word "education" in their title

● All other terms that belong to the semantic area of education are omitted

Page 16: Exploiting Linked Open Data  and Natural Language Processing for Classification of Political Speech

Conclusions (ii)

● When documents are semantically classified through DBpedia URIs, all synonyms, hypernyms and hyponyms of lemmas are traced to the same concept, making user search more effective

● Leveraging Wikipedia categories would make it possible to go a step further, taking advantage of the links between concepts as designed by the Wikipedia community
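The contrast between title-based keyword search and search over DBpedia URIs can be illustrated with a toy catalogue. Everything here is hypothetical: the video titles, the topic annotations and the `LABEL_TO_URI` lookup merely stand in for the classified White House videos and for a real label resolution step.

```python
EDUCATION = "http://dbpedia.org/resource/Education"
HEALTH_INSURANCE = "http://dbpedia.org/resource/Health_insurance"

# Hypothetical catalogue: each video carries the URIs assigned by the classifier.
VIDEOS = [
    {"title": "Remarks on Education Reform", "topics": {EDUCATION}},
    {"title": "The First Lady Visits a High School", "topics": {EDUCATION}},
    {"title": "Weekly Address: Health Insurance", "topics": {HEALTH_INSURANCE}},
]

# Hypothetical lookup tracing related search terms to the same concept.
LABEL_TO_URI = {
    "education": EDUCATION,
    "schooling": EDUCATION,
    "health insurance": HEALTH_INSURANCE,
}


def keyword_search(term):
    """Return titles that literally contain the search term."""
    return [v["title"] for v in VIDEOS if term.lower() in v["title"].lower()]


def semantic_search(term):
    """Return titles of videos annotated with the concept behind the term."""
    uri = LABEL_TO_URI.get(term.lower())
    return [v["title"] for v in VIDEOS if uri in v["topics"]]


by_keyword = keyword_search("education")
by_concept = semantic_search("education")
```

In this toy catalogue the keyword search misses the school visit because its title never mentions "education", while the concept-based search returns both videos.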

Page 17: Exploiting Linked Open Data  and Natural Language Processing for Classification of Political Speech

Next steps

● Building a content search/navigation layer around the scraping/classification module

● Integration with other Linked Open Data repositories on the Web, combining the extracted topics with other information (President Obama's federal budget proposal?)

Page 18: Exploiting Linked Open Data  and Natural Language Processing for Classification of Political Speech

Thank you!

Giuseppe Futia ([email protected])

This paper was drafted in the context of the Network of Excellence in Internet Science EINS (GA n°288021), and, in particular, in relation to the activities concerning Evidence and Experimentation (JRA3).

Page 19: Exploiting Linked Open Data  and Natural Language Processing for Classification of Political Speech

Appendix - Scoring formula

● In a Lucene query, both the target document and the training set become weighted term vectors, where terms are weighted by means of the TF-IDF algorithm. The query returns a list of documents in the form of DBpedia URIs, ordered by similarity score. The scoring formula is:
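Assuming the slide refers to the practical scoring function of Lucene's classic TF-IDF similarity (an assumption, since the transcript does not include the formula), it can be written as follows, where $q$ is the query built from the target document, $d$ is an indexed document and $t$ ranges over the query terms:

```latex
\mathrm{score}(q, d) =
  \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q) \cdot
  \sum_{t \in q} \Big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^{2}
  \cdot \mathrm{boost}(t) \cdot \mathrm{norm}(t, d) \Big)
```

In that formulation, $\mathrm{tf}$ grows with the frequency of the term in $d$, $\mathrm{idf}$ decreases with the number of documents containing the term, $\mathrm{coord}$ rewards documents that match more query terms, and $\mathrm{norm}$ accounts for field length and index-time boosts.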

Page 20: Exploiting Linked Open Data  and Natural Language Processing for Classification of Political Speech

Appendix - Basic concepts

● Natural Language Processing - A field of computer science concerned with the interactions between computers and human (natural) languages.

● Linked Data - A recommended best practice for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web using URIs and RDF.

● DBpedia - A crowd-sourced community effort to extract structured information from Wikipedia and a central interlinking hub for the Linking Open Data project.