(Semi-)Automatic Analysis of Online Contents
Steffen Staab (@ststaab)
Institute for Web Science and Technologies · University of Koblenz-Landau, Germany &
Web and Internet Science Group · ECS · University of Southampton, UK
(Semi-)Automatic analysis of online content 2/68Steffen Staab
Content analysis
Text++
Content
Is it difficult?
„Nach dem Auspacken der LPS-105 präsentiert sich dem Betrachter ein stabiles Laufwerk, das genauso geringe Außenmaße besitzt wie die Maxtor.“
Unpacking the LPS 105 reveals a sturdy disk drive which is of the same small size as the Maxtor.
Text++
Content
„Content“ analysis: What is in online content?
....
Entailment
Summaries
Arguments
Discourse
Opinions, Sentiments
Facts – who, what, when?
Syntax
Semantics
Pragmatics
Knowledge
Text++
Content
CLing
Purpose
Technical objectives
• Search
• Data & knowledge bases:
  • facts
  • arguments
  • ...
Applications
• Google Search
• Watson
• „Watson 2“
Social science and humanities objectives
• Form hypotheses
• Find indications
• Recognize trends
• ...
Objective oriented content analysis
....
Entailment
Summaries
Arguments
Discourse
Opinions, Sentiments
Facts – who, what, when?
Syntax
Semantics
Pragmatics
Knowledge
Text++
Semantic Web
Trend hypotheses
Selection of facts, function, trust
CLing
SEMANTIC WEB ANNOTATION
CREAM – Creating Metadata (Handschuh et al 2002, 2003)
Document Viewer / Editor
Ontology Guidance & Fact Browser
Concepts
Instances of Concepts
Attribute Instances = instance of a property to a datatype instance
Relationship Instances = instance of a property to a class instance
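The distinction between attribute instances and relationship instances can be pictured as subject-property-value triples. The following is a minimal sketch, not CREAM's actual data model: the class names `Instance` and `Annotation`, the concepts `Person` and `University`, and the properties `name` and `worksAt` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Instance:
    """An instance of a concept (class) in the ontology."""
    identifier: str
    concept: str


@dataclass(frozen=True)
class Annotation:
    """One property instance: subject --prop--> value."""
    subject: Instance
    prop: str
    value: Union[Instance, str, int, float]

    @property
    def kind(self) -> str:
        # A property pointing at another instance is a relationship instance;
        # a property pointing at a datatype value is an attribute instance.
        return "relationship" if isinstance(self.value, Instance) else "attribute"


# Hypothetical concepts and properties, for illustration only.
person = Instance("staab", concept="Person")
university = Instance("uni-koblenz-landau", concept="University")

annotations = [
    Annotation(person, "name", "Steffen Staab"),  # attribute instance
    Annotation(person, "worksAt", university),    # relationship instance
]

for a in annotations:
    print(a.kind, a.subject.identifier, a.prop, a.value)
```

Keeping the value type as the only discriminator mirrors the definitions on the slide: the same property mechanism yields an attribute or a relationship depending on what it points to.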
CREAM – Creating Metadata (Handschuh et al 2002, 2003)
Open world. Target ontologies could now be:
• Schema.org (3 trillion facts collected by Google; 10,000s of concepts)
• Wikidata: 1,148,230 concepts (2 weeks ago)
Annotating facts with CREAM
+++
• Open (wrt ontologies)
• Flexible
• Semi-automatic: S-CREAM
---
• Effort for annotation (minimize # of clicks)
• Thick client
• Technology Readiness Level ~5
• A lot of effort to prepare the tool for a task
• Limited accuracy
Technology Readiness Levels
TRL 1: Observation and description of the basic principle (8-15 years to market maturity)
TRL 2: Description of the application of a technology
TRL 3: Proof of functionality of a technology (5-13 years to market maturity)
TRL 4: Experimental setup in the laboratory
TRL 5: Experimental setup in the operational environment
TRL 6: Prototype in the operational environment
TRL 7: Prototype in operation (1-5 years)
TRL 8: Qualified system with proof of functionality in the operational domain
TRL 9: Qualified system with proof of successful operation
CLUSTERING OF TEXT DATA
http://topicmodels.west.uni-koblenz.de
With Christoph Kling
Text Mining Documents
Documents are PDFs, emails, tweets, Flickr photo tags, Word companions, …
Documents consist of
• a bag of words
• metadata
  - author(s)
  - timestamp
  - geolocation
  - publisher
  - booktitle
  - device
  - ...
Example topics and their words:
Chinese food: dimsum, duck, eggs, ...
Vegan food: vegan, tofu, ...
Breakfast: eggs, ham, ...
Objective: Cluster, categorize, & explain
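The document representation above can be sketched as a word-count table plus a metadata record. This is only an illustration under stated assumptions: the `Document` class, the `make_document` helper, the whitespace tokenisation, and the example values are made up and not taken from the talk.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Document:
    """A document reduced to word counts plus free-form metadata."""
    words: Counter
    metadata: dict = field(default_factory=dict)


def make_document(text: str, **metadata) -> Document:
    # Naive whitespace tokenisation; a real pipeline would also strip
    # punctuation, remove stop words, and so on.
    return Document(Counter(text.lower().split()), dict(metadata))


# Made-up example document and metadata values.
doc = make_document("eggs ham eggs toast", author="alice", hour=8)
print(doc.words.most_common(1))
print(doc.metadata)
```

Word order is deliberately discarded, which is what "bag of words" means; the metadata stays attached so that a model can later relate it to topics.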
Latent Dirichlet Allocation (LDA)
Document-topic distributions
Topic-word distributions
K topics
M documents
Each document m of the M documents has length N_m
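The quantities above fit together in the standard LDA generative story: each topic has a distribution over words, each document has a distribution over topics, and every token first draws a topic and then a word. The sketch below uses only the Python standard library; the hyperparameters `alpha` and `beta`, the function names, and the toy vocabulary are assumptions for illustration and do not appear on the slide.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(K, M, doc_lengths, vocab, alpha=0.5, beta=0.5, seed=0):
    """Generate M synthetic documents from K topics (LDA generative process)."""
    rng = random.Random(seed)
    V = len(vocab)
    # Topic-word distributions: one distribution over the vocabulary per topic.
    phi = [sample_dirichlet([beta] * V, rng) for _ in range(K)]
    # Document-topic distributions: one distribution over topics per document.
    theta = [sample_dirichlet([alpha] * K, rng) for _ in range(M)]
    docs = []
    for m in range(M):
        words = []
        for _ in range(doc_lengths[m]):  # N_m tokens in document m
            z = rng.choices(range(K), weights=theta[m])[0]  # topic of this token
            w = rng.choices(vocab, weights=phi[z])[0]       # word given the topic
            words.append(w)
        docs.append(words)
    return phi, theta, docs


vocab = ["dimsum", "duck", "eggs", "vegan", "tofu", "ham"]
phi, theta, docs = generate_corpus(K=3, M=4, doc_lengths=[5, 3, 6, 4], vocab=vocab)
for d in docs:
    print(d)
```

Inference runs in the opposite direction: given only the documents, it estimates `phi` and `theta`. That step is not shown here.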
Use Metadata to Help Topic Prediction
Improve topic detection
→ Morning times may help to improve the breakfast topic
Describe dependencies: metadata ↔ topics
→ The breakfast topic occurs during morning hours
Example topics and their words:
Chinese food: dimsum, duck, eggs, ...
Vegan food: vegan, tofu, ...
Breakfast: eggs, ham, ...
Usage
Autocompletion
→ From words to words
Prediction of search queries
→ From metadata to words
→ From words to metadata
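The dependency between metadata and topics can be illustrated with simple counting: given documents that already carry a topic label and an hour of day, estimate how likely each topic is in each time bucket. This is a deliberately simplified sketch and not the model from the talk, which couples metadata and topics inside the topic model itself. The bucketing in `bucket_of`, the smoothing constant, and the observations are made up.

```python
from collections import Counter, defaultdict


def topic_given_bucket(assignments, topics, smoothing=1.0):
    """Estimate P(topic | metadata bucket) from (bucket, topic) pairs."""
    counts = defaultdict(Counter)
    for bucket, topic in assignments:
        counts[bucket][topic] += 1
    table = {}
    for bucket, c in counts.items():
        # Additive smoothing keeps unseen topics at a small non-zero probability.
        total = sum(c.values()) + smoothing * len(topics)
        table[bucket] = {t: (c[t] + smoothing) / total for t in topics}
    return table


def bucket_of(hour):
    # Hypothetical coarse bucketing of the hour of day.
    return "morning" if 5 <= hour < 11 else "other"


topics = ["breakfast", "chinese food", "vegan food"]
# Made-up (hour, topic) observations.
observed = [(7, "breakfast"), (8, "breakfast"), (9, "vegan food"),
            (13, "chinese food"), (19, "chinese food"), (20, "vegan food")]
table = topic_given_bucket([(bucket_of(h), t) for h, t in observed], topics)
print(table["morning"])
```

Read in one direction, the table predicts topics (and hence words) from metadata; inverting it with Bayes' rule would go from topics back to likely metadata values.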
Dataset
Linux Kernel Mailing List
3,400,000 emails with timestamps and mailing list ID
Structures of Metadata Spaces
Nominal
Ordinal
Cyclic
Spherical
Networked
Kern Desk Mail
The spatial model is not used in this application (but might be)!
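One way to see why these structures need different treatment is to compare how "closeness" is measured in each. The functions below are a minimal sketch under stated assumptions: the function names and the example interpretations (mailing list ID as nominal, timestamp as ordinal, hour of day as cyclic, geolocation as spherical) are illustrative choices, and the talk's models are not necessarily distance-based.

```python
import math
from collections import deque


def nominal_distance(a, b):
    # Nominal values (e.g. a mailing list ID) are only equal or different.
    return 0.0 if a == b else 1.0


def ordinal_distance(a, b):
    # Ordinal values (e.g. a timestamp) differ by their position on a line.
    return abs(a - b)


def cyclic_distance(a, b, period=24.0):
    # Cyclic values (e.g. hour of day) wrap around: 23h and 1h are 2h apart.
    d = abs(a - b) % period
    return min(d, period - d)


def spherical_distance(lat1, lon1, lat2, lon2):
    # Great-circle (haversine) distance in radians on a unit sphere.
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * math.asin(min(1.0, math.sqrt(h)))


def network_distance(graph, source, target):
    # Networked values: number of hops found by breadth-first search.
    # Returns None if the target cannot be reached.
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}  # made-up toy network
print(cyclic_distance(23, 1), network_distance(graph, "a", "c"))
```

The cyclic case shows the pitfall most clearly: treating hour of day as ordinal would place 23h and 1h 22 hours apart instead of 2.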
Topics
Professional topics:
Hobbyist topics:
Metadata weighting:
126,408 Online Fetish Users: First 8 Topics
Sociodemographics of Fetish dataset
Influence of Sociodemographics on Favorite Fetishes
Other applications of (extended) LDA
Sentiment and Topics (Naveed et al., ICWSM 2013)
Topics and spatial knowledge (Kling et al., WSDM 2014)
Modelling of power (Kling et al., ICWSM 2015)
BELIEVABILITY AND TRUST IN ONLINE NEWS
With Christoph Kling, Jérôme Kunegis
Collaboration with Jutta Milde, Karin Stengel, Ines Vogel
Ongoing work in KOMEPOL
Targets
Example article at Spiegel.de
Requirements
Scalability:
• # Documents
• # Annotators
• # Annotations per annotator
Tool:
• Administration
• Crowdsourcing
• Semi-automatic
Separating article management and coding
Text-Upload
Managing projects
Article
Defining a Coding-Job
Highlighting using Keywords and Clustering
Article coding
Preparing a code book (1)
Preparing a code book (2)
CONCLUSION
Lessons Learned
New targets
• Require new modeling of gaps
Challenges
• Technology Readiness Levels
• Many tools – no „good“ tool („done is better than perfect“?)
• Reproducibility
ToDos
• Eclipse/Protégé of annotation
  • modular
  • extensible
  • open
• Optimizing the processes
No tool to rule them all
....
Entailment
Summaries
Arguments
Discourse
Opinions, Sentiments
Facts – who, when, where, what?
Syntax
Semantics
Pragmatics
Knowledge
Text++
Semantic Web
Trend hypotheses
Selection of facts, function, trust
Gap
Gap
CLing
THANK YOU FOR YOUR ATTENTION!
Bibliography
C. C. Kling, J. Kunegis, S. Sizov, and S. Staab. “Detecting non-gaussian geographical topics in tagged photo collections.” In: Seventh ACM International Conference on Web Search and Data Mining, WSDM 2014, New York, NY, USA, February 24-28, 2014.
I. C. Vogel, J. Milde, K. Stengel, S. Staab, C. C. Kling, and J. Kunegis. “Glaubwürdigkeit und Vertrauen von Online-News.” In: Datenschutz und Datensicherheit 39.5 (2015), pp. 312–316.
S. Handschuh and S. Staab. “CREAM – CREAting Metadata for the Semantic Web.” In: Computer Networks 42.5 (2003), pp. 579–598. Elsevier.
S. Handschuh, S. Staab, and F. Ciravegna. “S-CREAM – Semi-automatic CREAtion of Metadata.” In: Proc. of the European Conference on Knowledge Acquisition and Management, EKAW-2002, Madrid, Spain, October 1-4, 2002. LNCS/LNAI 2473, Springer, 2002, pp. 358–372.
C. Kling. “Probabilistic Models for Context in Social Media. Novel Approaches and Inference Schemes.” Submitted as PhD thesis, Institute for Web Science and Technologies, University of Koblenz-Landau, to be defended Nov/Dec 2016.
N. Naveed, T. Gottron, and S. Staab. “Feature Sentiment Diversification of User Generated Reviews: The FREuD Approach.” In: ICWSM 2013.
C. C. Kling, J. Kunegis, H. Hartmann, M. Strohmaier, and S. Staab. “Voting Behaviour and Power in Online Democracy: A Study of LiquidFeedback in Germany's Pirate Party.” In: ICWSM 2015, pp. 208–217.
URLs
http://topicmodels.west.uni-koblenz.de
http://komepol.west.uni-koblenz.de
http://www.slideshare.net/steffenstaab