What happens after crawling Big Data? Defining a process of filtering and automatically coding extracted Big Data from Twitter for social uses
José Carpio, [email protected]
Juan D. Borrero, [email protected]
Estrella Gualda, [email protected]
1st IMASS conference, Methods and Analyses in Social Sciences, 23-24 April 2014, Olhão, Portugal
Transcript
What happens after crawling Big Data?
Defining a process of filtering and automatically coding extracted Big Data from Twitter for social uses
1. Big Data: a huge amount of digital information, so big and so complex that usual database technology cannot process it efficiently.
2. The advent of the social web has made a significant contribution to the explosion of information from social computing systems such as Twitter, Facebook, Pinterest, YouTube…
1. Introduction
Big Data offers the social sciences and humanistic disciplines new opportunities for approaching the knowledge of particular social realities when considering messages from social media sites.
1. Introduction
Some studies are already deploying automatic data extraction techniques (Ackland and O’Neil, 2011; Carmel et al., 2009; Jones et al., 2008; Shumate and Dewitt, 2008; Wang and Jin, 2010; Xu et al., 2008) on big data.
Before analysis, a preliminary task is to filter and code the automatically crawled data, in order to reduce and "prepare" the information.
Table of Contents
Introduction
Focus and Topic
Focus
What is Twitter? Twitter is a free social networking and micro-blogging service that enables its users to send and read messages known as tweets.
Tweets are text-based posts of up to 140 characters displayed on the author's profile page and delivered to the author's subscribers, who are known as followers.
What are hashtags? People use the hashtag symbol # before a relevant keyword or phrase (no spaces) in their Tweet to categorize them.
(https://support.twitter.com/entries/49309)
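Extracting hashtags from tweet text is the first step of the pipeline described later. The slides contain no code, so this is only a minimal Python sketch (the function name is ours), assuming hashtags are `#` followed by word characters with no spaces, as defined above:

```python
import re

def extract_hashtags(tweet):
    """Return the hashtags in a tweet, without the leading '#'."""
    # \w matches letters, digits and underscore; a hashtag ends at
    # the first space or punctuation character
    return re.findall(r"#(\w+)", tweet)

print(extract_hashtags("No al desalojo #desahucios #StopDesahucios"))
# ['desahucios', 'StopDesahucios']
```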
Topic
Desahucios (Evictions)
It refers to the rise in housing repossessions and evictions enforced due to non-payment of rent or mortgage.
This theme refers to a social crisis caused by the economic crisis in Spain.
Topic
What is the problem?
The same concept is tagged with different hashtags:
SpanishRevolution == RevolutionInSpain
Table of Contents
Introduction
Framework
Framework
Big data challenge: efficiency and effectiveness
1. Efficiency: index compression, reducing lookup time or query caching.
2. Effectiveness: accurate feature extraction, personalization, relevance.
Framework
Drawbacks from Automatic Social Information Retrieval
2. Term variations: There is no standard for the structure of hashtags.
– Moreover, mis-tagging due to spelling errors occurs often, e.g., desahucios and deshaucios.
– Also, spacing is not allowed in a hashtag; therefore, both the underscore and the hyphen are typically used to separate words within a single tag, e.g., stop_desahucios and stop-desahucios.
– Additionally, different possible spellings of the same word and tags written in different languages generate term variations, e.g., sisepuede and sisepot.
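The case, accent and separator variations listed above can be collapsed by normalizing each hashtag before comparison. This is only an illustrative Python sketch under our own assumptions (the slides do not specify the normalization rules), not the authors' exact procedure:

```python
import unicodedata

def normalize_hashtag(tag):
    """Collapse common term variations: case, tildes/accents,
    and the underscore/hyphen word separators."""
    tag = tag.lower()
    # strip accents and tildes, e.g. 'sisepuedé' -> 'sisepuede'
    tag = unicodedata.normalize("NFKD", tag)
    tag = "".join(c for c in tag if not unicodedata.combining(c))
    # drop separators so 'stop_desahucios' equals 'stop-desahucios'
    return tag.replace("_", "").replace("-", "")

print(normalize_hashtag("Stop_Desahucios"))  # 'stopdesahucios'
print(normalize_hashtag("stop-desahucios"))  # 'stopdesahucios'
```

After this step, tags that differ only in case, accents or separators map to the same string, so simple equality can group them.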
Framework
Drawbacks from Automatic Social Information Retrieval
The vague-meaning problem is created by the following causes (Kroski, 2005; Golder et al., 2006; Hope et al., 2007; Marchetti et al., 2007):
Synonyms: multiple different hashtags share the same meaning.
Twitter users write in a natural, free way. Therefore, we find morphological variations or synonyms that are sometimes difficult to identify automatically.
Table of Contents
Introduction
Objectives
Objectives
1. To test a methodology for automatically filtering, coding and reducing the huge amount of data retrieved from Twitter, as a preliminary task before the analysis of Big Data.
2. To determine the reliability of the methodology after applying it to a dataset of 500,000 tweets on the 'desahucios' (evictions) theme.
Table of Contents
Introduction
Methodology
Methodology
Extraction
Topics for the extraction
Data collection
Output
Text processing
• Spelling correction (case, tildes…)
• Classification with Levenshtein distance thresholds
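The second bullet, classification with distance thresholds, could be sketched as a greedy grouping: each tag joins the first cluster whose representative is within an edit-distance threshold. This is our illustrative reading, not necessarily the authors' algorithm, and the threshold value of 2 is an assumption (the slides do not state one):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def classify(tags, threshold=2):  # threshold=2 is an assumed value
    """Greedy clustering: a tag joins the first cluster whose
    representative is within the edit-distance threshold."""
    clusters = {}  # representative tag -> list of variants
    for tag in tags:
        for rep in clusters:
            if levenshtein(tag, rep) <= threshold:
                clusters[rep].append(tag)
                break
        else:
            clusters[tag] = [tag]
    return clusters

print(classify(["desahucios", "deshaucios", "sisepuede"]))
# {'desahucios': ['desahucios', 'deshaucios'], 'sisepuede': ['sisepuede']}
```

With threshold 2, the mis-spelling deshaucios (two substitutions away) is folded into the desahucios cluster, while unrelated tags start clusters of their own.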
We extracted a random sample of 40,000 hashtags from a dataset of 499,420 tweets containing 784,583 hashtags around the desahucios theme, retrieved during the period 10 April to 28 May 2013.
Methodology
Text processing
Hashtags in this sample were automatically filtered, codified and reduced according to different algorithms.
We aim to reduce noise.
Methodology
Text processing / Labeling correction
How do we come up with other corrections?
We need a distance metric. We used the Levenshtein distance (edit distance). Created by Vladimir Levenshtein, this algorithm measures the difference (distance) between two strings.
It is computed as the minimum number of insertions, deletions, and substitutions needed to transform one string into the other.
Methodology
Text processing/Levenshtein
Min Edit Example
Words to be compared: methodology vs. metodology
Levenshtein distance: 1
One edit is needed, since we need to insert the h between t and o.
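The recurrence described above can be written as a short dynamic program; this is a standard textbook sketch (the function name is ours), which reproduces the worked example:

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to transform string a into string b."""
    prev = list(range(len(b) + 1))  # row 0: distance from "" to b[:j]
    for i, ca in enumerate(a, 1):
        cur = [i]  # column 0: distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

print(levenshtein("methodology", "metodology"))  # 1 — insert the missing h
```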