AnHai Doan University of Wisconsin @WalmartLabs Social Media, Data Integration, and Human Computation @WalmartLabs
Feb 25, 2016
A Journey Starting in 2001 ...
Worked in data integration
– combine multiple data sources into one
– e.g., aggregation/comparison shopping sites, Google Scholar
– use schema matching, information extraction, entity disambiguation
Ph.D. thesis focused on schema matching
Find houses with 2 bedrooms under 400K
realestate.com
fsbo.com
homes.com
Schema Matching
address          price
31 Bagley Ct ... 250K
12 Hope St ...   375K

location         sold-at
14 Main St ...   249,000
25 West St ...   324,000

address = location
price = sold-at
Developed automatic solution using machine learning
Realized that automatic solutions are not good enough
– only 65-85% accuracy
– need human intervention
Proposed a crowdsourcing approach
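A minimal sketch of instance-based matching for the two listing tables on this slide. It replaces the machine-learning solution mentioned above with a hand-written similarity over two made-up features, so the function names (`column_features`, `match_columns`) and the regular expressions are illustrative assumptions, not the method from the thesis.

```python
import re

def column_features(values):
    """Crude profile of a column's values (hypothetical features)."""
    n = len(values)
    # Fraction of values that look like a price, e.g. "250K" or "249,000".
    price_like = sum(bool(re.fullmatch(r"[\d,]+K?", v)) for v in values) / n
    # Fraction that look like a street address: a number followed by a word.
    street_like = sum(bool(re.match(r"\d+\s+[A-Za-z]", v)) for v in values) / n
    return (price_like, street_like)

def similarity(f1, f2):
    # 1 minus the mean absolute difference between the two feature vectors.
    return 1 - sum(abs(a - b) for a, b in zip(f1, f2)) / len(f1)

def match_columns(source, target):
    """Map each source column to the most similar target column."""
    target_feats = {name: column_features(vals) for name, vals in target.items()}
    matches = {}
    for name, vals in source.items():
        feats = column_features(vals)
        matches[name] = max(target_feats,
                            key=lambda t: similarity(feats, target_feats[t]))
    return matches

source = {"address": ["31 Bagley Ct", "12 Hope St"], "price": ["250K", "375K"]}
target = {"location": ["14 Main St", "25 West St"],
          "sold-at": ["249,000", "324,000"]}
matches = match_columns(source, target)
```

On these two tables the sketch pairs `address` with `location` and `price` with `sold-at`. Real columns are far noisier, which is consistent with the 65-85% accuracy figure and the need for human intervention.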
Crowdsourced Schema Matching
Can crowdsource other DI tasks too
Difficult to publish
– Building data integration systems via mass collaboration, WebDB-03
– Subsequent reviews: great work, I don’t believe it, neutral
Build a large-scale DI system on the Web
Show that crowdsourcing is practical

address          price
31 Bagley Ct ... 250K
12 Hope St ...   375K

location         sold-at
14 Main St ...   249,000
25 West St ...   324,000

address = location
Yes, Yes, No
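The "Yes, Yes, No" answers on this slide suggest several users voting on one candidate match. A minimal sketch of how such votes could be combined, assuming simple majority voting with a minimum number of answers; the function name `aggregate_votes` and the `min_votes` parameter are hypothetical, and the talk does not say which aggregation rule was used.

```python
from collections import Counter

def aggregate_votes(votes, min_votes=3):
    """Return "yes" or "no" by simple majority, or None if undecided."""
    if len(votes) < min_votes:
        return None  # not enough answers yet, keep asking users
    counts = Counter(v.strip().lower() for v in votes)
    yes, no = counts["yes"], counts["no"]
    if yes == no:
        return None  # tie: ask more users
    return "yes" if yes > no else "no"

# Votes on the candidate match "address = location" from the slide.
decision = aggregate_votes(["Yes", "Yes", "No"])
```

Here `decision` is `"yes"`, so the match would be accepted. Weighting users by reliability is a natural refinement that this sketch leaves out.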
Started DBLife Project in 2005
Researcher Homepages, Conference Pages, Group Pages, DBworld mailing list, DBLP
Web pages
HV Jagadish give-talk SIGMOD-07
Superpages
Keyword search, SQL querying, Question answering, Browse, Mining, Alert/Monitor, News summary
File system, RDBMS, Hadoop
Example Superpage
Example Crowdsourcing
Picture is removed if enough users vote “no”.
Project Status in 2009
Data integration
– overall methodology: VLDB-07a, VLDB-07b, CIDR-09
– DI operators: VLDB-07c
– optimization: VLDB-07c, SIGMOD-08, ICDE-08a, SIGMOD-09a
– provenance/others: ICDE-07a, ICDE-07b, VLDB-08a
Crowdsourcing / human computation
– schema matching: ICDE-08b
– best-effort information extraction: SIGMOD-08
– human feedback into the DI pipeline: SIGMOD-09b
– how lay users can query the database: SIGMOD-09c
System development
– hard to build/maintain systems in academia
Wanted to know what’s going on in industry
Wanted to take DBLife to the next level
Joined Kosmix in 2010 to do “DBLife on steroids”
Kosmix
Founded by Anand Rajaraman & Venky Harinarayan
– formerly of Junglee, sold to Amazon for 250M
55M in funding, 30+ engineers
Integrated Web data sources into a giant taxonomy
IMDB, Musicbrainz, Tripadvisor, Wikipedia, …
all
people
actors
Angelina Jolie, Mel Gibson
places
Information extraction, Entity disambiguation, Entity merging, ...
File system, RDBMS, Hadoop
topic pages
Raised many interesting challenges
– e.g., incremental updates, recycling human edits
Very good in certain topics (e.g., health)
But hard to compete with Google and Wikipedia
Switched to social media in early 2010
Social Media Exploding
Every two days now we create as much information as we did from the dawn of civilization up until 2003.
-- Eric Schmidt
• 100 million tweets per day
• 1 billion Facebook shares per day
• 1.5 million Foursquare checkins per day
• 40,000 Flickr photos per second
Switching Made Much Business Sense
Lot of social media data
Lot of people using it, spending a lot of time on it
– lot of links now come from social media, not search engines
– Google is worried (hence Buzz, Google+, Google++)
New level playing field
Have a secret weapon: the giant taxonomy
Next hot Internet wave
– SoLoMo = social + local + mobile
But can we build interesting applications?
What is social media good for?
From Frivolous to Serious
95% of tweets are still junk
– I feel good today
Help teenagers track Justin Bieber
– the background noise of Twitter
Charlie Sheen, celebrity fighting, Weiner losing his job
Foster customer relationships
– follow your dentist
Spread news
Manage disasters
Promote e-commerce
Help organize events, movements
– revolutions
Lot of Companies / Actions in This Space
Build platforms for social media
– how to tweet more effectively
Understand social media
– social analytics / route relevant information to users
Use social media to make predictions
Use social media to effect real-world changes
Mostly operate at the keyword level
– how many times has the keyword “Obama” been mentioned today?
Kosmix: the leader in performing semantic analysis
– how many times has the entity President Obama been mentioned today?
– “Obama”, “Barack”, “Barry”, “BO”, “the Pres”, “the Messiah”, ...
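The difference between keyword-level and entity-level counting can be sketched as follows. This is an illustration only: the alias table is copied from the slide, while the names `ALIASES`, `count_keyword`, and `count_entity` and the sample tweets are assumptions, and plain alias matching stands in for the semantic analysis the talk describes.

```python
import re

# Hypothetical alias table; the aliases themselves come from the slide.
ALIASES = {
    "President Obama": ["obama", "barack", "barry", "bo", "the pres",
                        "the messiah"],
}

def build_pattern(aliases):
    # Longest aliases first so multi-word aliases win over shorter ones.
    alts = sorted(aliases, key=len, reverse=True)
    body = "|".join(re.escape(a) for a in alts)
    return re.compile(r"\b(?:" + body + r")\b", re.IGNORECASE)

PATTERNS = {entity: build_pattern(al) for entity, al in ALIASES.items()}

def count_keyword(tweets, keyword):
    """Keyword level: count tweets containing the literal keyword."""
    return sum(keyword.lower() in t.lower() for t in tweets)

def count_entity(tweets, entity):
    """Entity level: count tweets mentioning any alias of the entity."""
    return sum(bool(PATTERNS[entity].search(t)) for t in tweets)

tweets = [
    "Obama speaks tonight",
    "Barry is on TV again",
    "the Pres just landed",
    "I feel good today",
]
keyword_hits = count_keyword(tweets, "Obama")
entity_hits = count_entity(tweets, "President Obama")
```

The keyword count sees one tweet, the entity count sees three. Short aliases such as "BO" are ambiguous, so alias matching alone over-counts in practice and needs a disambiguation step.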
Kosmix Solution
IMDB, Musicbrainz, Wikipedia, …
Information extraction, Entity disambiguation, Entity merging, Schema matching, Event detection, Event monitoring, ...
Social Genome
Applications
Highly scalable real-time infrastructure: File system, RDBMS, Hadoop, Muppet, Slates, Stream servers
Crowdsourcing: internal analysts, users, Mechanical Turks, others
Social Genome
all
people
actors
Angelina Jolie, Mel Gibson
places
Tahrir, Cairo, Egypt
Twitter users
@melgibson, @dsmith, …
FB users
mel-gibson, davesmith, …
events
celebrities, sports, politics, …
Gibson car crash, Egyptian uprising
the-same-as, tweet-about, related-to, located-in, capital-of
@dsmith: Mel crashed. Maserati is gone.
@far213: Tahrir is packed!
Building Social Genome: Three Sample Challenges
Extraction and Disambiguation: Traditional Methods Ill Suited for Social Media
all
people
actors, directors
Angelina Jolie, Mel Gibson
places
Long-term, Web context: actor, movie, Oscar, Hollywood
Short-term, social context: crash, car, Maserati
@dsmith: mel crashed. maserati is gone.
Mel was arrested again. What a dramatic fall since his Oscar-winning day.
Mel Brooks
events
celebrities, sports, politics, …
Gibson car crash, Egyptian uprising
Extraction: use rule-based / NLP / machine learning techniques
Extraction: use dictionaries, use rules
Disambiguation
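A minimal sketch of how the two kinds of context on this slide could be used to decide which "Mel" a tweet refers to. The context words for Mel Gibson are taken from the slide; the words for Mel Brooks, the prefix matching, the `social_weight` parameter, and all function names are assumptions made for illustration, not the Kosmix implementation.

```python
import re

# Context profiles per candidate entity. Gibson's words are from the slide;
# the Brooks profile is an assumed placeholder.
CANDIDATES = {
    "Mel Gibson": {
        "web": {"actor", "movie", "oscar", "hollywood"},   # long-term
        "social": {"crash", "car", "maserati"},            # short-term
    },
    "Mel Brooks": {
        "web": {"director", "movie", "comedy"},            # assumed
        "social": set(),
    },
}

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def hits(words, context):
    # Crude prefix match so that "crashed" counts as a hit for "crash".
    return sum(any(w.startswith(c) for c in context) for w in words)

def disambiguate(text, social_weight=2.0):
    """Score each candidate by overlap with its Web and social context."""
    words = tokens(text)
    scores = {}
    for entity, ctx in CANDIDATES.items():
        scores[entity] = (hits(words, ctx["web"])
                          + social_weight * hits(words, ctx["social"]))
    best = max(scores, key=scores.get)
    return best, scores

tweet_entity, tweet_scores = disambiguate("mel crashed. maserati is gone.")
news_entity, news_scores = disambiguate(
    "Mel was arrested again. What a dramatic fall since his Oscar-winning day.")
```

The tweet contains no Web-context word at all and is resolved only through the short-term social context, while the longer sentence is resolved through "Oscar". That is the sense in which methods tuned for traditional text fit tweets poorly.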
Must Maintain a Highly Dynamic Social Genome
Long-term, Web context: actor, movie, Oscar, Hollywood
Short-term, social context: crash, car, Maserati
Latency less than 2 seconds
The Giant Traditional Taxonomy is the Secret Weapon
Without it, dictionary-based extraction is not possible
Provides a framework to
– “understand” social media, find related concepts, “hang” social contexts
Very hard to develop, takes years
– like learning a new foreign language
Partly explains why it was hard for others to catch up
Must integrate traditional data well, then bootstrap
all
people
actors
Angelina Jolie, Mel Gibson
places
Tahrir, Cairo, Egypt
located-in
capital-of
Event Detection: Current Solutions
• Focus on Twitter + Foursquare
• Lot of current work in academia / industry
• Limitations of most of the current solutions
  – exploit just one kind of heuristic
    • e.g., find popular, strongly correlated words (Egypt, revolt)
  – do not exploit crowdsourcing
  – do not scale
    • not designed explicitly for parallelism
events
celebrities, sports, politics, …
Gibson car crash, Egyptian uprising
Twitter, Foursquare, Facebook, Myspace, Flickr, …
Event detection
Event Detection: Kosmix Solution
Twitter, Foursquare
Detector 1, Detector 2, …, Detector n
Candidate events
Event evaluator and ranker
Ranked events
Population 1, Population 2, Population 3, ...
Hadoop, Muppet, Slates, Stream servers
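The pipeline on this slide (several detectors proposing candidate events, then one evaluator and ranker) can be sketched as below. The two detectors are simple stand-ins chosen for illustration (frequent terms and terms that burst relative to an earlier window); the talk does not say which heuristics the real detectors use, and the crowd "populations" and the Muppet infrastructure are not modelled. All names and sample tweets are assumptions.

```python
import re
from collections import Counter

def term_counts(tweets):
    """Number of tweets each term (longer than 3 letters) appears in."""
    counts = Counter()
    for t in tweets:
        counts.update({w for w in re.findall(r"[a-z]+", t.lower())
                       if len(w) > 3})
    return counts

def frequent_terms(current, previous, min_count=2):
    # Detector 1 (assumed): terms that are simply popular right now.
    return {w for w, c in term_counts(current).items() if c >= min_count}

def bursty_terms(current, previous, ratio=2.0):
    # Detector 2 (assumed): terms much more common now than before.
    now, before = term_counts(current), term_counts(previous)
    return {w for w, c in now.items() if c >= ratio * (before[w] + 1)}

def rank_candidates(current, previous, detectors):
    """Pool candidates; rank by detector votes, then by current support."""
    votes = Counter()
    for detect in detectors:
        votes.update(detect(current, previous))
    counts = term_counts(current)
    return sorted(votes, key=lambda w: (-votes[w], -counts[w], w))

previous = ["good morning", "coffee time", "good coffee"]
current = ["Tahrir is packed!", "Protest in Tahrir", "Good protest in Tahrir"]
ranked = rank_candidates(current, previous, [frequent_terms, bursty_terms])
```

Because each detector is an independent function over a window of tweets, detectors can run in parallel and new ones can be added without touching the ranker, which addresses the single-heuristic limitation listed for current solutions.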
Event Monitoring: Current Solutions
• Manually write rules to match tweets to events
  – e.g., tweet contains certain keywords / userids → positive
  – conceptually simple, relatively easy to implement
  – often achieve high initial precision
• Limitations
  – expensive, don’t scale
  – manually writing good rules can be hard
  – rules often become invalid/inadequate over time
    • e.g., Baltimore shooting → Johns Hopkins shooting
Baltimore shooting
@dsmith: Baltimore shooting on TV5!
Egyptian uprising
@far213: Tahrir is packed!
Event Monitoring: Kosmix Solution
Event: Baltimore shooting
Initial profile: {Baltimore, shoot}
Twitter firehose
Tweets: “Baltimore shooting on TV5!”, “Baltimore shooting. Johns Hopkins shut down.”, ...
Learning algorithm
New profile: {Baltimore, shoot, Johns Hopkins}
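A minimal sketch of the profile-learning loop on this slide: tweets that match the current profile are collected, and terms that recur in them are added to the profile. The matching rule (at least two profile terms, by prefix), the `min_fraction` threshold, the function names, and all tweets beyond the two quoted on the slide are assumptions; the actual learning algorithm is not described in the talk.

```python
import re
from collections import Counter

def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def matches(profile, text, min_hits=2):
    """A tweet matches if at least min_hits profile terms occur in it.
    Prefix matching lets "shoot" match "shooting"."""
    words = tokens(text)
    hits = sum(any(w.startswith(p) for w in words) for p in profile)
    return hits >= min_hits

def expand_profile(profile, tweets, min_fraction=0.5):
    """Add terms that occur in at least min_fraction of matching tweets."""
    matched = [t for t in tweets if matches(profile, t)]
    counts = Counter()
    for t in matched:
        counts.update({w for w in tokens(t)
                       if not any(w.startswith(p) for p in profile)})
    learned = {w for w, c in counts.items()
               if c >= min_fraction * len(matched)}
    return set(profile) | learned

initial_profile = {"baltimore", "shoot"}
tweets = [
    "Baltimore shooting on TV5!",
    "Baltimore shooting. Johns Hopkins shut down.",
    "Shooting in Baltimore near Hopkins",                # illustrative
    "Hopkins campus closed after Baltimore shooting",    # illustrative
    "I feel good today",                                 # illustrative
]
new_profile = expand_profile(initial_profile, tweets)
```

With the expanded profile, a tweet such as "Hopkins shooting: campus closed" is matched even though it never mentions Baltimore, which is the drift that hand-written rules miss over time.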
Social Analytics with The NYTimes
Tweets → Annotators → Tweets & Dimensions → SocialCubes → Stats
Annotators: e.g., Location, Sentiment, Entity extraction, etc.
Topics: Barack Obama, Medicare, Hillary Clinton
Location: Arizona, California, New York
Sentiment: Positive, Negative, Neutral
How many people in Arizona feel positive about the new Medicare plan?
How many feel negative about Barack Obama across the US?
How many are tweeting about Barack Obama in New York, by the minute for the last 60 mins, by hour for the last 24 hours, and by day for the last 10 days?
Barack Obama, President Obama, the Pres, Barry, BO, ...
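The questions on this slide are roll-up queries over annotated tweets. A minimal sketch of such a cube, assuming each tweet has already been annotated with one topic, one location, and one sentiment; the names `build_cube` and `query`, the `"*"` wildcard convention, and the sample data are hypothetical, and the time dimension (per minute, hour, day) is left out for brevity.

```python
from collections import Counter

DIMENSIONS = ("topic", "location", "sentiment")

def build_cube(annotated_tweets):
    """Pre-aggregate counts for every combination of dimension values.
    "*" means the dimension has been rolled up (any value)."""
    cube = Counter()
    for tw in annotated_tweets:
        values = [tw[d] for d in DIMENSIONS]
        # Each bit of mask decides whether a dimension keeps its value.
        for mask in range(2 ** len(DIMENSIONS)):
            key = tuple(v if (mask >> i) & 1 else "*"
                        for i, v in enumerate(values))
            cube[key] += 1
    return cube

def query(cube, topic="*", location="*", sentiment="*"):
    return cube[(topic, location, sentiment)]

# Illustrative annotated tweets (made-up data).
annotated = [
    {"topic": "Medicare", "location": "Arizona", "sentiment": "positive"},
    {"topic": "Medicare", "location": "Arizona", "sentiment": "negative"},
    {"topic": "Barack Obama", "location": "New York", "sentiment": "negative"},
    {"topic": "Barack Obama", "location": "California", "sentiment": "negative"},
    {"topic": "Medicare", "location": "Arizona", "sentiment": "positive"},
]
cube = build_cube(annotated)
arizona_medicare_positive = query(cube, topic="Medicare",
                                  location="Arizona", sentiment="positive")
obama_negative_us = query(cube, topic="Barack Obama", sentiment="negative")
```

Pre-aggregating every roll-up trades memory for constant-time lookups, which suits dashboards that repeat the same questions as new tweets arrive.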
Social Monitoring with an Unknown Agency
Twitter firehose
Justin Bieber, Charlie Sheen
Egyptian uprising
Jordan unrest
China unrest
North, Tibet, West, Southeast
Count tweets related to Wael Ghonim
– 146 in past 5 mins
– 3267 in past 12 hours
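Counts such as "in past 5 mins" and "in past 12 hours" are sliding-window counts over a stream. A minimal sketch, assuming timestamps in seconds that arrive in non-decreasing order; the class name `WindowCounter` and the sample timestamps are hypothetical.

```python
from collections import deque

class WindowCounter:
    """Counts events whose timestamp falls within the last window_seconds."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.times = deque()

    def add(self, timestamp):
        # Assumes timestamps are added in non-decreasing order.
        self.times.append(timestamp)

    def count(self, now):
        # Evict timestamps that have fallen out of the window.
        while self.times and self.times[0] <= now - self.window:
            self.times.popleft()
        return len(self.times)

five_min = WindowCounter(5 * 60)
twelve_hours = WindowCounter(12 * 3600)

# Timestamps (seconds) of tweets already matched to the monitored entity.
for ts in [0, 100, 40000, 43000, 43100]:
    five_min.add(ts)
    twelve_hours.add(ts)

recent = five_min.count(now=43200)
half_day = twelve_hours.count(now=43200)
```

Each counter keeps only the timestamps still inside its window, so memory is bounded by the tweet rate times the window length. At firehose scale one would typically bucket counts per minute instead of storing every timestamp.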
Bought by Walmart in May 2011
The Walmart Acquisition
Deal reported to be 250-300M
Kosmix became @WalmartLabs
– based in San Bruno
– local office in India
– plan new offices in China and Brazil
100 persons today, actively hiring
Why?
400+ B in revenue, only 5-10B online vs. 34B of Amazon
Major problems if it does not catch up within 5-10 years
– see Borders
@WalmartLabs can help in many ways
– Provides a core of technical people, attracts more
– Improves traditional e-commerce
  – SEO, SEM, search on walmart.com
  – build a vast product taxonomy
– Helps build the e-commerce of the future
  – social, local, and mobile
  – a good way to catch up and leapfrog Amazon
Improve Traditional E-Commerce
Product data from thousands of vendors
In-house data
Web data
all products
cars
US cars
Ford, Chevrolet
books
Information extraction, Entity disambiguation, Entity merging, ...
File system, RDBMS, Hadoop
search, ads
Help Build the E-Commerce of the Future: Social, Local, and Mobile
O2O (Online 2 Offline) emerging as a major trend
– increasingly tighter integration of online and offline parts
– e.g., Groupon, Living Social
Social, local, and mobile commerce examples
– gift recommendation:
  – “I love salt!”
  – “Your friend has just tweeted about the movie SALT. Would you like to buy something related for her birthday?”
– personalized “Groupon” with vendors:
  – “You seem to be interested in gourmet coffee. If 50 persons sign up to buy the new DeLonghi coffee maker, you can get that for a 50% discount.”
– stocking a local store
– a Siri-like shopping assistant
Wrapping Up
Social media has become a major frontier on the Web
Integrating social data is fundamentally much harder than integrating “traditional” data
– lack of context
– dynamic environment, new concepts appear quickly
– quality issues, lots of spam
– quick spread of information, user activities
– fast data
– solution will change over time, need human in the loop to monitor
Must integrate “traditional” data well, then bootstrap
– giant taxonomy critical
Crowdsourcing becomes indispensable
– but raises interesting challenges