Open Source Social Media Analytics for Intelligence and Security Informatics Applications
Invited Talk at 4th International Big Data Analytics Conference (BDA), Hyderabad, India, December 17, 2015
Swati Agarwal, PhD Scholar at Information Management and Data Analytics Group, IIIT Delhi, India
Ashish Sureka, Principal Scientist at ABB Corporate Research Center, Bangalore, India
Vikram Goyal, Associate Professor at Indraprastha Institute of Information Technology, New Delhi, India
Agarwal S., Sureka A., Goyal V. - Invited Talk @ 4th International Big Data Analytics Conference, Hyderabad, India (December 17, 2015)
3. Multilingualism
Image Sources: twitter.com
4. Noisy Content
• Presence of low quality messages and content of low relevance
• Grammatical and spelling errors
• Presence of non-standard acronyms and abbreviations
• Use of emoticons and incorrect capitalization (informal nature of content)
Image Sources: twitter.com
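The kinds of noise listed for social media text (informal capitalization, emoticons, non-standard abbreviations) are often reduced with a normalization pass before any mining step. The following is a minimal sketch of such a pass, not a method taken from the talk: the abbreviation lexicon, the emoticon pattern, and the function name `normalize` are all illustrative assumptions.

```python
import re

# Hypothetical abbreviation lexicon; a real system would need a much larger, curated one.
ABBREVIATIONS = {"u": "you", "gr8": "great", "plz": "please", "b4": "before"}

# Matches a few common ASCII emoticons such as :) ;-( :D (illustrative, far from complete).
EMOTICON = re.compile(r"[:;=]-?[)(DPp]")


def normalize(post: str) -> str:
    """Drop emoticons, lowercase, squeeze elongated words, expand abbreviations."""
    text = EMOTICON.sub(" ", post).lower()
    # Collapse runs of 3+ identical characters ("sooooo" -> "soo").
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    tokens = re.findall(r"[a-z0-9']+", text)
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


print(normalize("PLZ come b4 5, it will be GR8 :) sooooo excited"))
# prints: please come before 5 it will be great soo excited
```

Spelling correction and multilingual content are deliberately left out of this sketch; both would need language-specific resources.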
5. Data Annotation and Ground Truth
• Need/Importance: the basis for several machine learning algorithms and for examining their performance
• Effort Intensive: a vast amount of data is uploaded to social media every second
• Imbalanced Data: 1 in 100,000 posts is hate promoting or posted by extremist users
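The imbalance figure of 1 in 100,000 has a practical consequence for evaluation that a small calculation makes concrete. The sketch below scores a trivial classifier that labels every post as benign; the function name and the choice of a trivial baseline are illustrative assumptions, not something presented in the talk.

```python
def accuracy_and_recall(n_posts: int, n_positive: int) -> tuple[float, float]:
    """Score a trivial classifier that labels every post as benign."""
    true_negatives = n_posts - n_positive  # every benign post is labeled correctly
    true_positives = 0                     # no hate promoting post is ever flagged
    accuracy = (true_positives + true_negatives) / n_posts
    recall = true_positives / n_positive
    return accuracy, recall


acc, rec = accuracy_and_recall(n_posts=100_000, n_positive=1)
print(f"accuracy={acc:.5f} recall={rec:.1f}")
# prints: accuracy=0.99999 recall=0.0
```

Accuracy alone therefore says little on data this skewed, which is one reason to report recall, precision, and F1-score alongside it.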
6. Manipulation, Fabrication and Adversarial Behavior
• Fake information
• Rumors
• Manipulative/Misleading Information (commonly seen in videos)
  Textual: Title, Tags, Description
  Media: Thumbnail, Annotation
Image Sources: twitter.com, youtube.com
Focus and Scope
Focus of Tutorial
Exploring two sub-problems within the broader area of
Intelligence and Security Informatics:
1. Online Radicalization Detection (presence of
extremist content, users and communities on
social media websites)
2. Online Civil Unrest Prediction (an early
prediction of civil unrest related events using
social media content)
Technology Landscape
Online Radicalization
[Timeline figure, 2006-2015, with two tracks: Content Identification; User and Community Identification]
Techniques shown on the timeline:
• KNN, Support Vector Machine (SVM), Naïve Bayes, Boosting, Adaboost, Logistic Regression, Decision Tree, Rocchio
• Rule Based Classifier, Regularized Least Square, Keyword Based Flagging, Honeypots, Face Recognition
• Topical Crawler (TC), Best First Search, Shark Search, BFS, DFS Link Analysis, Clustering- Blog Spider
• Exploratory Data Analysis (EDA), TREC, N-gram, Language Modeling
• OSLOM, Blockmodeling, Multi-dimensional Scaling, Spring Embedder
Online Civil Unrest Prediction
[Figure: techniques used in prior work on civil unrest prediction]
Techniques: Stochastic Hybrid Dynamic Model, Avatar ensembles of decision trees, Clustering Algorithm, Map Reduce, Apache Pig, Naïve Bayes, Logistic Regression, Maximum Likelihood, Generalized Linear Model, Dynamic Query Expansion, Heterogeneous Graph Modeling, Mathematical and Theoretical Model, Probabilistic Soft Logic
Prior work: Colbaugh et al., Hua et al., Naren et al., Jiejun Xu et al., Chen et al., Compton et al., Filchenkov et al., Sathappan et al.
Case Studies
1. Identification of Online
Extremist Content, Users and
Hidden Communities on YouTube
S. Agarwal and A. Sureka, A Focused Crawler for Mining Hate and Extremism Promoting Users, Videos and Communities on YouTube, 25th ACM Conference on Hypertext and Social Media (HT 2014), 1-4 Sep 2014, Santiago, Chile
YouTube: largest and most popular free video hosting and sharing website, founded in 2005
Image Sources: youtube.com
Research Problem
“Mining user generated content on social media platforms to identify topic based hate promoting content and locating hidden & virtual communities of extremist users.”
Research Contributions
1. Application of focused crawler for identifying extremist content and locating hate promoting users on YouTube
• Best First Search
• Shark Search
2. Content based characterization of hate promoting videos
• Focus of the content shown in video
• Targeted audiences
• Keywords present in the contextual metadata and spoken in the video
Focused/Topical Crawler
Source: Dongyang Hou et al., 2014
1. Subscriber
2. Featured Channel
3. Public Contact
Proposed Framework
[Figure: graph of YouTube channels linked by {Subscriber, Featured Channel, Public Contact} relations, with relevance scores 0.55, 0.85, 0.35, 0.25, 0.32 and 0.32; nodes scoring above the threshold (> th) are retained. Threshold: 0.30]
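The traversal suggested by the framework (follow subscriber, featured channel, and public contact links, and keep only channels whose relevance exceeds a threshold) can be sketched as a best-first search over a priority queue. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy link graph, the assignment of the scores 0.55, 0.85, 0.35, 0.25, 0.32 to particular nodes, and the function name `best_first_crawl` are all hypothetical.

```python
import heapq

# Toy link graph (hypothetical): channel -> channels reachable through
# subscriber, featured-channel, or public-contact links.
LINKS = {
    "seed": ["a", "b", "c"],
    "a": ["d"],
    "b": ["e"],
    "c": [],
    "d": [],
    "e": [],
}
# Hypothetical relevance scores; in practice these would come from a text classifier.
SCORES = {"seed": 0.55, "a": 0.85, "b": 0.35, "c": 0.25, "d": 0.32, "e": 0.32}


def best_first_crawl(seed, links, score, threshold=0.30, max_nodes=100):
    """Visit channels in order of decreasing relevance; do not expand irrelevant ones."""
    frontier = [(-score(seed), seed)]  # heapq is a min-heap, so negate the scores
    seen = {seed}
    relevant = []
    while frontier and len(relevant) < max_nodes:
        neg_score, node = heapq.heappop(frontier)
        if -neg_score <= threshold:
            continue  # at or below threshold: skip and do not follow its links
        relevant.append(node)
        for nxt in links.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-score(nxt), nxt))
    return relevant


print(best_first_crawl("seed", LINKS, SCORES.get))
# prints: ['seed', 'a', 'b', 'd', 'e']
```

Channel "c" (0.25) falls below the 0.30 threshold and is dropped, so nothing reachable only through it would be explored. A real crawler would replace `LINKS` with calls to the platform's API and would need rate limiting and error handling.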
Titles of Videos:
Image Sources: youtube.com
Experimental Setup
Discriminatory Features Set:
• YouTube Profile Summary
• User Activity Feeds (Title of Videos Uploaded, Shared, Commented and Favorited)
Training Data:
• Discriminatory Features for 35 Hate Promoting Channels (HinduismIslam, IndiaEternal, ISIcyberAGENT etc.)
Dynamic Parameters:
• 10 different YouTube Channels as Seeds (PakistanRoxxx, hiddenpakistani, GreaterPakistan etc.)
• Character N-Gram (3, 5)
• Threshold Value for Relevance Computation (-2.0, -2.5, -3.0)
• Focused Crawler: Best First Search and Shark Search
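The setup lists character n-grams and negative relevance thresholds, which suggests a log-probability style score, although the slides do not spell out the scoring function. The sketch below shows one way such a score could be computed: a character n-gram model trained on text from known channels, scoring new text by its average log probability per n-gram. The add-one smoothing, the base-10 logarithm, the class name `CharNgramModel`, and the sample strings are all assumptions for illustration; the scale of these scores need not match the thresholds quoted in the setup.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int) -> list[str]:
    """Return all overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class CharNgramModel:
    """Character n-gram model with add-one smoothing (an illustrative choice)."""

    def __init__(self, training_texts: list[str], n: int = 3):
        self.n = n
        self.counts = Counter(g for t in training_texts for g in char_ngrams(t, n))
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # +1 reserves mass for unseen n-grams

    def score(self, text: str) -> float:
        """Average log10 probability per n-gram; closer to 0 means more similar."""
        grams = char_ngrams(text, self.n)
        if not grams:
            return float("-inf")
        log_prob = sum(
            math.log10((self.counts[g] + 1) / (self.total + self.vocab))
            for g in grams
        )
        return log_prob / len(grams)


# Hypothetical training titles standing in for the activity feeds of known channels.
model = CharNgramModel(["liberate kashmir now", "freedom for kashmir"], n=3)
on_topic = model.score("kashmir freedom march")
off_topic = model.score("cute cat compilation")
print(on_topic > off_topic)  # the on-topic title shares more trigrams with the training text
```

A channel would then be treated as relevant when its score exceeds a chosen threshold, which is how a setting such as -2.0, -2.5, or -3.0 could be applied once the scoring scale is fixed.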
Experimental Results- Focused Crawler

Confusion Matrix for Best First Search and Shark Search Algorithm (60 iterations each)

Best First Search Approach
                     Predicted Relevant   Predicted Irrelevant
Actual Relevant      921                  314
Actual Irrelevant    125                  67

Shark Search Approach
                     Predicted Relevant   Predicted Irrelevant
Actual Relevant      991                  295
Actual Irrelevant    55                   29

Accuracy Results for Best First Search and Shark Search Algorithm (60 iterations each)

       TPR (Recall)   TNR    PPV (Precision)   NPV    F1-Score   Accuracy
BFS    0.75           0.35   0.88              0.18   0.81       0.69
SSA    0.77           0.35   0.95              0.09   0.85       0.74
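The accuracy results follow directly from the confusion matrices, reading rows as actual classes and columns as predicted classes. The short sketch below recomputes them; the function name `metrics` is illustrative, and the counts are the ones reported for the two crawlers.

```python
def metrics(tp: int, fn: int, fp: int, tn: int) -> dict[str, float]:
    """Derive the reported measures from the four confusion-matrix cells."""
    recall = tp / (tp + fn)       # TPR: relevant channels that were found
    precision = tp / (tp + fp)    # PPV: predicted-relevant channels that were right
    return {
        "TPR": recall,
        "TNR": tn / (tn + fp),
        "PPV": precision,
        "NPV": tn / (tn + fn),
        "F1": 2 * precision * recall / (precision + recall),
        "Accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


bfs = metrics(tp=921, fn=314, fp=125, tn=67)
ssa = metrics(tp=991, fn=295, fp=55, tn=29)
for name, m in (("BFS", bfs), ("SSA", ssa)):
    print(name, {k: round(v, 2) for k, v in m.items()})
```

Rounded to two decimals, the output matches the reported rows for BFS and SSA. The low TNR and NPV for both crawlers reflect how few irrelevant channels appear in the evaluated set.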
Social Network Analysis
Community Graph of YouTube Users After Applying Focused Crawlers: Best First Search
(Left), Shark Search (Right)
Examples- Content Based Characterization of Videos

Videos: 43
YT Category: News, Non-profit
Avg. Length (Sec): 151.68
Content Focus: Honor Killing, Harassment
Target Audience: Women, Refugee People, Children
Keywords: Child Marriage, Rape, Responsibility, Protest, Women, Asylum, Arrested, Brutal, Slave

Videos: 93
YT Category: News, Auto-Vehicle, Politics
Avg. Length (Sec): 2526.16
Content Focus: Islam Promotion
Target Audience: Jewish, Muslim People
Keywords: Taliban, Bombs, Battle, Courage, Allah, Islam, Courage, Belief, Macca, Money, Shaheed, Enemies Of Islam

Videos: 30
YT Category: Entertainment
Avg. Length (Sec): 319.61
Content Focus: Anti-India
Target Audience: India Haters
Keywords: Kashmir, Poverty, Liberate, Hindu, Beggars, Pundit, Untouchable, Extremism, Attack, Killed, Anti-Muslim, Anti-Pakistani, Hatred, Masks, Freedom
Examples- Content Based Characterization of Videos (continued)

Videos: 25
YT Category: News, Politics & Education
Avg. Length (Sec): 1225.56
Content Focus: Liberate Kashmir
Target Audience: Kashmiri People
Keywords: Muslim, Army, Military, 1947, Partition, Azad Kashmir, Liberate Kashmir, Pakistan, India, Killing, Murder, Border, Fighting, Democracy, Martyr, Torture

Videos: 83
YT Category: News & Politics
Avg. Length (Sec): 349.28
Content Focus: Anti-Muslims
Target Audience: Pakistan Haters
Keywords: Kashmir, Jihad, Pakistan, India, Quran, Muslim, Hindu, Qatil, Zakir Naik, Hate Speech, Masjid, Pandit, Defence, Madarsa, Tribute, Bharat, America, Attack, Napak, Holy, Kabba
Content Based
Characterization of Videos
Type of Video Content: Speech, News Segments,
Drawing, Interviews, Group Discussion, Animated
Videos, Lectures, Cartoon and Comics, Debate,
Recorded Videos, Textual Messages, Pictures with
Background Music
2. Early Prediction of
Civil Unrest Related
Events on Twitter
Twitter: largest and most popular free micro-blogging and social networking website, founded in 2006