Jun 04, 2015
Data Mining for Moderation of Social Data
Fernando G. Guerrero, CEO, SolidQ
3 © 2011 SolidQ
Introductions
• Fernando G. Guerrero
• Global CEO of SolidQ
• Microsoft Regional Director for Spain since 2004
• SQL Server MVP from 2000 to 2007
• Usual suspect at many international conferences
SolidQ 2012… 10th anniversary
• 160 people in 23 countries:
  • Argentina, Australia, Austria, Bulgaria, Canada, Chile, Costa Rica, Croatia, Denmark, France, Germany, India, Israel, Italy, Mexico, Saudi Arabia, Serbia, Slovakia, Slovenia, Spain, Sweden, UK, USA
• 50 current or former RDs or MVPs
• Authors of many books, articles, and whitepapers
• Research collaboration with:
  • Universidad de Alicante
  • Universidad de les Illes Balears
  • Universidad de Santiago de Compostela
  • The European Union
  • The Spanish Ministry of Economy and Innovation
Agenda
• Social Data
• Market Research
• Sentiment Analysis, Text Mining
• Moderation, Data Mining
• SolidQ Research Lines in Social Data
Social data is everywhere
Social data is about everything
Music
Social is there
• Is your organization promoting social about you?
Products Services Stories
Social is there, reputation
• What is social saying about you?
  • Product
  • Services
  • Decisions
  • Image
Market Research
• What is social requesting of you?
  • Future services
  • Product updates
• Can you ask social questions?
  • Is this service going to succeed?
  • How can I fix the current problem?
  • Is society ready for this law?
Sentiment Analysis, Text Mining
The movie was fabulous! [ Sentimental ]
The movie stars Mr. X [ Factual ]
The movie was horrible! [ Sentimental ]
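The factual/sentimental split above can be illustrated with a very small word-list approach. This is only a sketch: the `OPINION_WORDS` lexicon and the `classify` function are hypothetical names introduced for illustration, and a production system would rely on a much larger vocabulary or a trained model.

```python
import re

# Hypothetical mini-lexicon of opinion words (an assumption for this sketch).
OPINION_WORDS = {"fabulous", "horrible", "great", "awful", "wonderful", "terrible"}


def classify(sentence: str) -> str:
    """Return 'Sentimental' if the sentence contains an opinion word, else 'Factual'."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return "Sentimental" if any(t in OPINION_WORDS for t in tokens) else "Factual"


examples = [
    "The movie was fabulous!",
    "The movie stars Mr. X",
    "The movie was horrible!",
]
for s in examples:
    print(f"{s} -> {classify(s)}")
```

A word list like this misses negation, sarcasm, and misspellings, which is one reason text mining and context analysis are used alongside it.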
What is Data Mining?
• Inform actionable business decisions
• Contrasts with “machine learning”
Media Case Study
• Millions of posts per year (different moderation scenarios)
• About 25% are human moderated
• About 10% of the moderated posts fail
• No Business Intelligence applications for analysis or reporting
Moderation, Data Mining
• Contextual information
  • Time
  • Location
  • User
• Comments posted at 10 AM are safer than comments posted at 2 AM.
• A user may be safe when talking about science but dangerous when talking about sports.
• If a thread is hot (dangerous), a new comment in it may be hot too.
• By combining context patterns, the system assigns a risk to each post without looking at its text.
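As a rough illustration of combining context patterns into one risk value, the sketch below takes historical fail rates for the hour, the user on a topic, and the thread, and merges them with a noisy-OR rule. The combination rule, the `Context` class, and all numbers are assumptions made for this example; the slides do not say how the real system combines its signals.

```python
from dataclasses import dataclass


@dataclass
class Context:
    """Historical fail rates (0.0 to 1.0) for the context of a new post."""
    hour_fail_rate: float        # share of failed posts at this hour of day
    user_topic_fail_rate: float  # share of this user's failed posts on this topic
    thread_fail_rate: float      # share of failed posts in this thread


def context_risk(ctx: Context) -> float:
    """Noisy-OR: the post is risky if any single context signal is risky."""
    p_safe = 1.0
    for rate in (ctx.hour_fail_rate, ctx.user_topic_fail_rate, ctx.thread_fail_rate):
        p_safe *= 1.0 - rate
    return 1.0 - p_safe


# Made-up values: a morning science post vs. a 2 AM sports post in a hot thread.
morning_science = Context(0.02, 0.01, 0.03)
night_sports_hot = Context(0.15, 0.30, 0.40)

print(round(context_risk(morning_science), 3))
print(round(context_risk(night_sports_hot), 3))
```

Note that the text of the post never enters the calculation, so the same scoring applies to comments in any language, links, pictures, or videos.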
Solution – Logical Model
• Post Context (behavior analysis)
  • Patterns, data mining.
• Post Content (text analysis)
  • Profanity, low score sentences, text mining, mood or tone (sentiment analysis)
Typically Available Data on Posts
• Historical and real-time data for:
  • User (e.g. userid, email, nationalid)
  • Location (e.g. Life & Style Fashion)
  • Time (e.g. 12 March 2011 18:56)
  • Content (e.g. text, link, picture, video)
  • Moderation result
• Other attributes such as geography, age, and education could be used
Post context, Patterns, Data Mining
• User behavior.
• Time behavior.
• Location behavior.
Building useful attributes
1. Thread (% fails in a certain thread)
2. User (% fails per user)
3. Diff Hour Forum Created (TimeDatePosted - TimeForumCreated)
4. User Forum (% fails in a certain forum)
5. Diff Last Fail for User (TimeDatePosted - TimeLastFailUser)
6. Hour of the day
7. Diff Hour UserJoined-Now (TimeDatePosted - TimeUserJoined)
8. User Thread (% fails per user in a thread)
9. Diff Hour Thread Created (TimeDatePosted - TimeThreadCreated)
10. Day of week
• More than 100 attributes.
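A few of the attributes listed above can be derived from past moderation results with very little code. The sketch below is illustrative only: the in-memory `history` list, the `build_attributes` function, and the attribute names are assumptions, and the real system would compute these from its own data store.

```python
from datetime import datetime

# Hypothetical history rows: (user_id, thread_id, posted_at, failed_moderation)
history = [
    ("u1", "t1", datetime(2011, 3, 12, 10, 5), False),
    ("u1", "t1", datetime(2011, 3, 12, 11, 0), True),
    ("u2", "t1", datetime(2011, 3, 12, 12, 30), False),
    ("u1", "t2", datetime(2011, 3, 12, 2, 15), True),
]


def fail_rate(rows):
    """Fraction of rows that failed moderation (0.0 when there is no history)."""
    rows = list(rows)
    return sum(r[3] for r in rows) / len(rows) if rows else 0.0


def build_attributes(user_id, thread_id, posted_at, thread_created_at):
    """Derive context attributes for a new post using only earlier posts."""
    prior = [r for r in history if r[2] < posted_at]
    return {
        "thread_fail_rate": fail_rate(r for r in prior if r[1] == thread_id),
        "user_fail_rate": fail_rate(r for r in prior if r[0] == user_id),
        "user_thread_fail_rate": fail_rate(
            r for r in prior if r[0] == user_id and r[1] == thread_id
        ),
        "hours_since_thread_created": (posted_at - thread_created_at).total_seconds() / 3600,
        "hour_of_day": posted_at.hour,
        "day_of_week": posted_at.weekday(),  # Monday is 0, Sunday is 6
    }


attrs = build_attributes(
    "u1", "t1", datetime(2011, 3, 12, 18, 56), datetime(2011, 3, 12, 9, 56)
)
print(attrs)
```

Only posts made before the new post are counted, so the attributes never use information that would be unavailable at moderation time.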
Hard Work
• Periods.
• Algorithms.
• Algorithms' parameters.
• Model refreshing.
• Attribute analysis.
• Outliers.
• Overpopulating.
• Behavior after this system is in production.
Data Mining Algorithms
• Decision Trees/Linear Regression
• Sequence Analysis
• Neural Networks/Logistic Regression
• Clustering
• Text Mining (Words and Phrases)
Conclusion on Context
• Risk based on the context of the post
  • Time
  • User’s history
  • Publish location
• Enables risk analysis for all types of content
  • Comments (in any language)
  • Links
  • Pictures
  • Videos
Logical Model: Post content
•Profanity Analysis • Text Mining
The first minister and his secretary found sleeping together last night. They got drunk at a nearby pub.
• Sentiment Analysis
Moderation, Data Mining System
Analysis and Reporting
• Published through an integrated web application
  • Moderation statistics.
  • User statistics.
  • News and stories statistics.
  • Peaks.
Conclusion: Benefits
• By moderating half of the total posts, the solution captures 90% of failing posts. The remaining 10% appear likely to be safe posts.
• Using Intelligent Moderation, media companies scan the whole universe of posts at a comparatively low cost.
• At peak times, Intelligent Moderation works perfectly.
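The "share of failing posts captured when only part of the posts are reviewed" figure can be measured by ranking posts by risk and checking how many known failures fall in the reviewed portion. The sketch below uses made-up scores purely to show the calculation; `capture_rate` and `toy_posts` are hypothetical names, and the numbers do not reproduce the project's results.

```python
def capture_rate(posts, review_fraction):
    """Share of failing posts found in the top `review_fraction` of posts by risk.

    posts: list of (risk_score, failed) pairs, where failed is True/False.
    """
    ranked = sorted(posts, key=lambda p: p[0], reverse=True)
    n_review = int(len(ranked) * review_fraction)
    total_fails = sum(failed for _, failed in ranked)
    if total_fails == 0:
        return 0.0
    caught = sum(failed for _, failed in ranked[:n_review])
    return caught / total_fails


# Made-up data: ten posts with a risk score and the eventual moderation outcome.
toy_posts = [
    (0.90, True), (0.80, True), (0.70, False), (0.60, True), (0.50, False),
    (0.40, False), (0.30, True), (0.20, False), (0.10, False), (0.05, False),
]

print(capture_rate(toy_posts, 0.5))
```

Plotting this value for several review fractions shows how much moderation effort is needed to reach a target capture level.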
Football night in Europe
• On January 25th, 2012:
  • Liverpool defeated Manchester City in the Carling Cup
  • Barcelona defeated Real Madrid in the Copa del Rey
• More than 100,000 comments arrived at the different BBC sites over 10 hours
• All comments were filtered through our system
• No problems were observed during that time
SolidQ Team in this project
• Project Managers
  • Francisco Gonzalez, Javier Torrenteras, Alejandro Leguizamo
• Developers
  • Itzik Ben-Gan, Enrique Puig, Ruben Pertusa, Carlos Martinez, Fernando G. Guerrero
• Technical Reviewers
  • Mark Tabladillo, Dejan Sarka
• Social Media Specialists
  • Jose Quinto, Rocio Díaz
SolidQ Research
• Incomplete Grammar Analysis
• Human interaction with IT systems
  • Collaboration
  • Contextual analysis
• Sentiment Analysis
  • Market Research
  • Reputation
• Data Mining of Social context
  • Moderation
  • Market Research
  • Reputation
Invisible computing…
… Driven by Social Data