(Semi-)Automatic analysis of online contents

Apr 15, 2017

Steffen Staab
Transcript
Page 1: (Semi-)Automatic analysis of online contents

Institute for Web Science and Technologies · University of Koblenz-Landau, Germany

(Semi-)Automatic Analysis of Online Contents

Steffen Staab (@ststaab)

Web and Internet Science Group · ECS · University of Southampton, UK

Page 2: (Semi-)Automatic analysis of online contents

(Semi-)Automatic analysis of online content 2/68Steffen Staab

Content analysis

Text++

Content

Page 3: (Semi-)Automatic analysis of online contents

Is it difficult?

„Nach dem Auspacken der LPS-105 präsentiert sich dem Betrachter ein stabiles Laufwerk, das genauso geringe Außenmaße besitzt wie die Maxtor.“

Unpacking the LPS 105 reveals a sturdy disk drive which is of the same small size as the Maxtor.

Text++

Content

Page 4: (Semi-)Automatic analysis of online contents

„Content“ analysis: What is in online content?

....

Entailment

Summaries

Arguments

Discourse

Opinions / Sentiments

Facts – who, what, when?

Syntax

Semantics

Pragmatics

Knowledge

Text++

Content

CLing

Page 5: (Semi-)Automatic analysis of online contents

Purpose

Technical objectives
• Search
• Data & knowledge bases:
  • facts
  • arguments
  • ...

Applications
• Google Search
• Watson
• „Watson 2“

Social science and humanities objectives
• Form hypotheses
• Find indications
• Recognize trends
• ...

Page 6: (Semi-)Automatic analysis of online contents

Objective oriented content analysis

....

Entailment

Summaries

Arguments

Discourse

Opinions / Sentiments

Facts – who, what, when?

Syntax

Semantics

Pragmatics

Knowledge

Text++

Semantic Web

Trend hypotheses

Selection of facts, function, trust

CLing

Page 7: (Semi-)Automatic analysis of online contents

SEMANTIC WEB ANNOTATION

Page 8: (Semi-)Automatic analysis of online contents

CREAM – Creating Metadata (Handschuh et al 2002, 2003)

Document Viewer / Editor

Ontology Guidance & Fact Browser

Concepts

Instances of Concepts

Attribute Instances = instance of a property to a datatype instance

Relationship Instances = instance of a property to a class instance
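
The distinction between attribute instances and relationship instances can be sketched as a small data structure. This is only an illustration, not the CREAM implementation: the class names, field names, and the example annotations are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Instance:
    """An instance of an ontology concept, e.g. a person or an organization."""
    identifier: str
    concept: str


@dataclass(frozen=True)
class AttributeInstance:
    """A property that links an instance to a datatype value (string, number)."""
    subject: Instance
    prop: str
    value: Union[str, int, float]


@dataclass(frozen=True)
class RelationshipInstance:
    """A property that links an instance to another class instance."""
    subject: Instance
    prop: str
    target: Instance


# Hypothetical annotations of a web page; all identifiers are illustrative only.
author = Instance("person1", "Person")
institute = Instance("org1", "Organization")

annotations = [
    AttributeInstance(author, "name", "Steffen Staab"),
    RelationshipInstance(author, "worksFor", institute),
]

# Count how many annotations of each kind were created.
attribute_count = sum(isinstance(a, AttributeInstance) for a in annotations)
relationship_count = sum(isinstance(a, RelationshipInstance) for a in annotations)
```

The only structural difference is the type of the object position: a datatype value for attributes, another instance for relationships.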

Page 9: (Semi-)Automatic analysis of online contents

CREAM – Creating Metadata (Handschuh et al 2002, 2003)

Open world - Target ontologies now could be:
• Schema.org (3 Trillion facts collected by Google; 10,000 of concepts)
• Wikidata: 1,148,230 concepts (2 weeks ago)

Page 10: (Semi-)Automatic analysis of online contents

Annotating facts with Cream

+++
• Open (wrt ontologies)
• Flexible
• Semi-automatic: S-CREAM

---
• Effort for annotation (minimize # of clicks)
• Thick client
• Tech Readiness Level ~5
• A lot of effort to prepare tool for a task
• Limited accuracy

Page 11: (Semi-)Automatic analysis of online contents

Technology Readiness Levels

TRL 1: Observation and description of the basic principle (8-15 years to market readiness)
TRL 2: Description of the application of a technology
TRL 3: Proof of functionality of a technology (5-13 years to market readiness)
TRL 4: Experimental setup in the laboratory
TRL 5: Experimental setup in the operational environment
TRL 6: Prototype in the operational environment
TRL 7: Prototype in operation (1-5 years)
TRL 8: Qualified system with proof of functionality in the operational domain
TRL 9: Qualified system with proof of successful operation

Page 12: (Semi-)Automatic analysis of online contents

CLUSTERING OF TEXT DATA

http://topicmodels.west.uni-koblenz.de

With Christoph Kling

Page 13: (Semi-)Automatic analysis of online contents

Text Mining Documents

Documents are PDFs, emails, tweets, Flickr photo tags, Word companions, …

Documents consist of
• a bag of words
• metadata
  - author(s)
  - timestamp
  - geolocation
  - publisher
  - booktitle
  - device
  - ...

Chinese food: dimsum, duck, eggs, ...
Vegan food: vegan, tofu, ...
Breakfast: eggs, ham, ...

Objective: Cluster, categorize, & explain
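
A document in this sense can be sketched as word counts plus a dictionary of metadata. The `Document` class, the `make_document` helper, and the whitespace tokenization are assumptions chosen for this illustration, not part of the tooling shown in the talk.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Document:
    """A document as a bag of words plus arbitrary metadata."""
    words: Counter
    metadata: dict = field(default_factory=dict)


def make_document(text, **metadata):
    """Build a Document from raw text; word order is deliberately discarded."""
    tokens = text.lower().split()  # naive whitespace tokenization (assumption)
    return Document(words=Counter(tokens), metadata=metadata)


# Hypothetical example: a short post written at 8 in the morning.
doc = make_document("eggs ham eggs toast", author="alice", hour=8)
```

Here `doc.words` keeps only how often each word occurs, while `doc.metadata` carries context such as author or timestamp that a plain bag of words would lose.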

Page 14: (Semi-)Automatic analysis of online contents

Latent Dirichlet Allocation (LDA)

Page 15: (Semi-)Automatic analysis of online contents

Latent Dirichlet Allocation (LDA)

Document-topic distributions

Topic-word distributions

K topics
M documents
Each document m of the M documents has length N_m
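
The generative story behind these quantities can be sketched with the standard library alone. This is a minimal sketch of standard LDA sampling, not the extended models from the talk: the symmetric prior values `alpha` and `beta`, the function names, and the toy vocabulary are assumptions.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(K, doc_lengths, vocabulary, alpha=0.5, beta=0.5, seed=0):
    """Generate M = len(doc_lengths) documents from K topics."""
    rng = random.Random(seed)
    V = len(vocabulary)
    # Topic-word distributions: one distribution over the vocabulary per topic.
    phi = [sample_dirichlet([beta] * V, rng) for _ in range(K)]
    corpus, theta = [], []
    for n_m in doc_lengths:
        # Document-topic distribution for document m.
        theta_m = sample_dirichlet([alpha] * K, rng)
        theta.append(theta_m)
        doc = []
        for _ in range(n_m):
            z = rng.choices(range(K), weights=theta_m)[0]     # topic of this token
            w = rng.choices(vocabulary, weights=phi[z])[0]    # word drawn from topic z
            doc.append(w)
        corpus.append(doc)
    return corpus, theta, phi


vocabulary = ["dimsum", "duck", "eggs", "vegan", "tofu", "ham"]
corpus, theta, phi = generate_corpus(K=3, doc_lengths=[5, 8, 4], vocabulary=vocabulary)
```

Inference runs in the opposite direction: given only the words in `corpus`, it estimates `theta` and `phi`.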

Page 16: (Semi-)Automatic analysis of online contents

Use Metadata to Help Topic Prediction

Improve topic detection
→ Morning times may help to improve the breakfast topic

Describe dependencies: metadata ↔ topics
→ breakfast topic happens during morning hours

Chinese food: dimsum, duck, eggs, ...
Vegan food: vegan, tofu, ...
Breakfast: eggs, ham, ...

Page 17: (Semi-)Automatic analysis of online contents

Use Metadata to Help Topic Prediction

Improve topic detection
→ Morning times may help to improve the breakfast topic

Describe dependencies: metadata ↔ topics
→ breakfast topic happens during morning hours

Usage

Autocompletion
→ From words to words

Prediction of search queries
→ From metadata to words
→ From words to metadata

Chinese food: dimsum, duck, eggs, ...
Vegan food: vegan, tofu, ...
Breakfast: eggs, ham, ...
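
One way to read "from metadata to words" is as a mixture: weight each topic's word distribution by how likely the topic is under the given metadata, then sum. The sketch below assumes exactly that simple mixture; it is not the model from the talk, and all probabilities are made-up illustration values.

```python
def word_scores_given_metadata(topic_given_meta, word_given_topic):
    """Compute P(word | metadata) = sum over topics of P(topic | metadata) * P(word | topic)."""
    scores = {}
    for topic, p_topic in topic_given_meta.items():
        for word, p_word in word_given_topic[topic].items():
            scores[word] = scores.get(word, 0.0) + p_topic * p_word
    return scores


# Hypothetical topic-word distributions for the three example topics.
word_given_topic = {
    "chinese": {"dimsum": 0.5, "duck": 0.3, "eggs": 0.2},
    "vegan": {"vegan": 0.6, "tofu": 0.4},
    "breakfast": {"eggs": 0.6, "ham": 0.4},
}
# Hypothetical topic weights for the metadata value "morning".
topic_given_morning = {"chinese": 0.1, "vegan": 0.2, "breakfast": 0.7}

scores = word_scores_given_metadata(topic_given_morning, word_given_topic)
best_word = max(scores, key=scores.get)
```

With these numbers, "eggs" scores highest in the morning because it is likely under the breakfast topic and also receives a small contribution from the Chinese food topic.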

Page 18: (Semi-)Automatic analysis of online contents

Dataset

Linux Kernel Mailing List
3,400,000 emails with timestamps and mailing list ID

Page 19: (Semi-)Automatic analysis of online contents

Structures of Metadata Spaces

• Nominal
• Ordinal
• Cyclic
• Spherical
• Networked

Kern Desk Mail

Spatial model is not used in this application (but might be)!
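
Why the structure of a metadata space matters shows up in how closeness is measured. The sketch below illustrates two of the listed structures with standard formulas: wrap-around distance for cyclic values such as the hour of day, and great-circle (haversine) distance for spherical values such as geolocation. The function names and the Earth radius default are assumptions for this illustration.

```python
import math


def cyclic_distance(a, b, period):
    """Distance on a cycle, e.g. hours of the day with period 24."""
    d = abs(a - b) % period
    return min(d, period - d)


def great_circle_distance_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(h))


# 23:00 and 01:00 are two hours apart on the clock, not 22.
late_to_early = cyclic_distance(23, 1, 24)
# A quarter of the equator, from longitude 0 to longitude 90.
quarter_equator = great_circle_distance_km(0.0, 0.0, 0.0, 90.0)
```

Treating the hour of day as an ordinal value would place 23:00 and 01:00 far apart, which is the kind of error a cyclic model avoids.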

Page 20: (Semi-)Automatic analysis of online contents

Topics

Page 21: (Semi-)Automatic analysis of online contents

Topics

Page 22: (Semi-)Automatic analysis of online contents

Topics

Professional topics:

Hobbyist topics:

Page 23: (Semi-)Automatic analysis of online contents

Topics

Metadata weighting:

Page 24: (Semi-)Automatic analysis of online contents

126,408 Online Fetish Users: First 8 Topics

Page 25: (Semi-)Automatic analysis of online contents

Sociodemographics of Fetish dataset

Page 26: (Semi-)Automatic analysis of online contents

Influence of Sociodemographics on Favorite Fetishes

Page 27: (Semi-)Automatic analysis of online contents

Other applications of (extended) LDA

• Sentiment and topics (Naveed et al., ICWSM 2013)
• Topics and spatial knowledge (Kling et al., WSDM 2014)
• Modelling of power (Kling et al., ICWSM 2015)

Page 28: (Semi-)Automatic analysis of online contents

BELIEVABILITY AND TRUST IN ONLINE NEWS

With Christoph Kling, Jérôme Kunegis
Collaboration with Jutta Milde, Karin Stengel, Ines Vogel
Ongoing work in KOMEPOL

Page 29: (Semi-)Automatic analysis of online contents

Targets

Page 30: (Semi-)Automatic analysis of online contents

Example article at Spiegel.de

Page 31: (Semi-)Automatic analysis of online contents

Requirements

Scalability:
• # Documents
• # Annotators
• # Annotations per annotator

Tool:
• Administration
• Crowdsourcing
• Semi-automatic

Page 32: (Semi-)Automatic analysis of online contents

Separating article management and coding

Page 33: (Semi-)Automatic analysis of online contents

Text-Upload

Page 34: (Semi-)Automatic analysis of online contents

Managing projects

Page 35: (Semi-)Automatic analysis of online contents

Article

Page 36: (Semi-)Automatic analysis of online contents

Defining a Coding-Job

Page 37: (Semi-)Automatic analysis of online contents

Highlighting using Keywords and Clustering

Page 38: (Semi-)Automatic analysis of online contents

Article coding

Page 39: (Semi-)Automatic analysis of online contents

Preparing a code book (1)

Page 40: (Semi-)Automatic analysis of online contents

Preparing a code book (2)

Page 41: (Semi-)Automatic analysis of online contents

CONCLUSION

Page 42: (Semi-)Automatic analysis of online contents

Lessons Learned

New targets
• Require new modeling of gaps

Challenges
• Technology Readiness Levels
• Many tools – no „good“ tool („done is better than perfect“?)
• Reproducibility

ToDos
• Eclipse/Protege of annotation
  • modular
  • extensible
  • open
• Optimizing the processes

Page 43: (Semi-)Automatic analysis of online contents

No tool to rule them all

....

Entailment

Summaries

Arguments

Discourse

Opinions / Sentiments

Facts – who, when, where, what?

Syntax

Semantics

Pragmatics

Knowledge

Text++

Semantic Web

Trend hypotheses

Selection of facts, function, trust

Gap

Gap

CLing

Page 44: (Semi-)Automatic analysis of online contents

THANK YOU FOR YOUR ATTENTION!

Page 45: (Semi-)Automatic analysis of online contents

Bibliography

C. C. Kling, J. Kunegis, S. Sizov, and S. Staab. “Detecting non-gaussian geographical topics in tagged photo collections.” In: Seventh ACM International Conference on Web Search and Data Mining, WSDM 2014, New York, NY, USA, February 24-28, 2014.

I. C. Vogel, J. Milde, K. Stengel, S. Staab, C. C. Kling, and J. Kunegis. “Glaubwürdigkeit und Vertrauen von Online-News.” In: Datenschutz und Datensicherheit 39.5 (2015), pp. 312–316.

S. Handschuh and S. Staab. “CREAM – CREAting Metadata for the Semantic Web.” In: Computer Networks 42.5 (2003), pp. 579–598. Elsevier.

S. Handschuh, S. Staab, and F. Ciravegna. “S-CREAM – Semi-automatic CREAtion of Metadata.” In: Proc. of the European Conference on Knowledge Acquisition and Management, EKAW-2002, Madrid, Spain, October 1-4, 2002. LNCS/LNAI 2473, Springer, 2002, pp. 358–372.

C. Kling. “Probabilistic Models for Context in Social Media. Novel Approaches and Inference Schemes.” Submitted as PhD thesis, Institute for Web Science and Technologies, University of Koblenz-Landau, to be defended Nov/Dec 2016.

N. Naveed, T. Gottron, and S. Staab. “Feature Sentiment Diversification of User Generated Reviews: The FREuD Approach.” In: ICWSM 2013.

C. C. Kling, J. Kunegis, H. Hartmann, M. Strohmaier, and S. Staab. “Voting Behaviour and Power in Online Democracy: A Study of LiquidFeedback in Germany's Pirate Party.” In: ICWSM 2015, pp. 208–217.

Page 46: (Semi-)Automatic analysis of online contents

URLs

http://topicmodels.west.uni-koblenz.de

http://komepol.west.uni-koblenz.de

http://www.slideshare.net/steffenstaab