Sentiment and Affect analysis of Dark Web Forums: Measuring Radicalization on the Internet Hsinchun Chen, Fellow, IEEE
Page 1: Slide 1

Sentiment and Affect analysis of Dark Web Forums: Measuring Radicalization on the Internet

Hsinchun Chen, Fellow, IEEE

Page 2: Slide 1

Introduction

• Web forums offer participants a medium to express their opinions and emotions freely in discussion.

• Extremist and terrorist groups also use web forums for community.

– Expression and dissemination of their ideologies and propaganda

• Such forums are often referred to as being part of the Dark Web.

Page 3: Slide 1

Introduction

• Information contained within Dark Web forums represents a significant source of knowledge for security and intelligence organizations.

• The opinions and emotions expressed within these forums provide valuable insights:

– The nature and position of the online community

– Characterizing individual participants

• Manual analysis of the vast quantities of messages to measure the opinions and emotions expressed is often infeasible.

Page 4: Slide 1

Introduction

• This paper presents an automated approach to sentiment and affect analysis of two Dark Web forums related to the Iraqi insurgency and Al-Qaeda.

• The automated approach utilizes a rich set of textual features and machine learning techniques.

Page 5: Slide 1

Related Work

• Sentiment and affect analysis are related tasks in text mining that focus on directional text, containing opinions, emotions, and biases.

[5] M. A. Hearst, “Direction-based text interpretation as an information access refinement,” In Text-Based Intelligent Systems: Current Research and Practice in Information Extraction and Retrieval. Lawrence Erlbaum Associates, 1992.

[6] J. Wiebe, “Tracking point of view in narrative,” Computational Linguistics, vol. 20 (2), pg. 233-287, 1994.

Page 6: Slide 1

Related Work

• Sentiment analysis attempts to identify, analyze, and measure opinions expressed in text.

• Affect analysis focuses on the emotional content of the communication.

R. Agrawal, S. Rajagopalan, R. Srikant, and Y. Xu, “Mining newsgroups using networks arising from social behavior,” Proc. of the 12th Int’l WWW Conf., 2003.

P. Subasic and A. Huettner, “Affect analysis of text using fuzzy semantic typing,” IEEE Trans. Fuzzy Systems, vol. 9 (4), pg. 483-496.

Page 7: Slide 1

Related Work

• There are some important distinctions between the two:

– Affect analysis evaluates the intensity of a number of potential emotions, including happiness, sadness, anger, fear, etc.

– Sentiment analysis considers the polarity of opinions along a positive-neutral-negative continuum.

– The words and phrases associated with sentiments are mutually exclusive.

– Segments of text can convey multiple affects.

Page 8: Slide 1

Related Work

• Researchers have utilized various machine learning approaches to perform automated sentiment and affect analysis.

B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up? Sentiment classification using machine learning techniques,” Proc. Empirical Methods in Natural Language Processing, pg. 79-86, 2002.

R. W. Picard, E. Vyzas, and J. Healey, “Toward machine emotional intelligence: analysis of affective physiological state,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23 (10), pg. 1179-1191, 2001.

Page 9: Slide 1

Related Work

• In particular, the SVM learning approach has been shown to be effective in determining whether a text segment contains an expression of a particular affect class.

• However, SVM applies only to discrete class labels.

Y. H. Cho and K. J. Lee, “Automatic affect recognition using natural language processing techniques and manually built affect lexicon,” IEICE Trans. Information Systems, vol. E89 (12), pg. 2964-2971, 2006.

Page 10: Slide 1

Related Work

• SVR is an alternate approach that is capable of predicting continuous sentiment and affect intensities while benefitting from the robustness of SVM.

A. Webb, Statistical Pattern Recognition. John Wiley & Sons, 2002.

Page 11: Slide 1

Research Questions

• In a recent book, Ryan highlights the critical role that Web forums play in militant Islamic radicalization on the Internet.

• Marc Sageman, an internationally renowned terrorism study consultant, also emphasizes the importance of the Internet, especially forums.

• This paper presents our web mining research on sentiment and affect analysis of two large-scale, international Jihadist forums.

Page 12: Slide 1

Research Questions

• This study seeks to answer the following research questions:

– How effective are automated methods of sentiment and affect analysis in measuring the polarities of opinions and intensities of emotions in Dark Web forums?

– What insights into the Dark Web forums are gained by performing sentiment and affect analysis?

Page 13: Slide 1

Data

• Two Dark Web forums were selected for sentiment and affect analysis:

– Al-Firdaws (www.alfirdaws.org/vb)

– Montada (www.montada.com)

• Al-Firdaws: a more radical forum

– Considerable content dedicated to support of the Iraqi insurgency and Al-Qaeda.

• Montada

– Montada is a general discussion forum with content pertaining to a variety of social and religious issues.

– Domain experts consider Montada to be more moderate compared to Al-Firdaws, with less radical content.

Page 14: Slide 1

Data

• Spidering programs were used to collect the content from the two web forums.

• A summary of the collection statistics is presented in Table I.

Data set is larger:

1. An older forum

2. Al-Firdaws is too radical

Page 15: Slide 1

Data

• Both Al-Firdaws and Montada are major forums for their respective purposes and communities, with relatively high membership levels and numerous authors.

Page 16: Slide 1

Data

• In both cases postings are more evenly distributed across web forum threads.

• Although the Montada forum has a larger average number of posts per thread compared to Al-Firdaws, the median number of posts per thread is nearly equal.

Page 17: Slide 1

Data

• 500 sentences were selected from each web forum and scored for the intensities of sentiments and affects expressed.

• The affects of interest in the study included those of most interest to security and intelligence organizations:

– Including violence, anger, hate, and racism.

• These affects were measured on a continuous scale ranging from 0 to 1.

• The sentiment measurement was on a continuous scale from -1 to 1.

Page 18: Slide 1

Data

Page 19: Slide 1

Methods

Page 20: Slide 1

Methods

• Annotation step

– Character, word, root, and collocation n-grams

• Character and word n-grams are commonly used in text mining applications.

• To derive root-level n-grams, Arabic words were converted to their roots using a clustering algorithm.

• Collocation n-grams included the Hapax and Dis collocations.

• Features with fewer than four occurrences in the test bed were excluded.
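The n-gram extraction and frequency cutoff can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the whitespace tokenization, and the toy English sentences are all assumptions, and root and collocation n-grams are omitted because the slides do not describe how they were built.

```python
from collections import Counter

def char_ngrams(text, n):
    """Return all contiguous character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(text, n):
    """Return all contiguous word n-grams, joined with a space."""
    words = text.split()  # assumption: simple whitespace tokenization
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def build_vocabulary(sentences, extractor, n, min_count=4):
    """Keep only n-grams that occur at least min_count times overall."""
    counts = Counter()
    for sentence in sentences:
        counts.update(extractor(sentence, n))
    return {gram for gram, count in counts.items() if count >= min_count}

# Hypothetical toy sentences standing in for forum text.
sentences = [
    "the forum post",
    "the forum thread",
    "the forum reply",
    "the forum topic",
    "a reply",
]
vocab = build_vocabulary(sentences, word_ngrams, 2)
print(vocab)
```

Only the bigram "the forum" reaches four occurrences here, so it is the only feature retained; every bigram seen once is dropped.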

Page 21: Slide 1

Methods

Page 22: Slide 1

Methods

• The machine learning approach for identifying the presence and intensities of sentiments and affects in Dark Web forum sentences utilized an SVR ensemble.

• SVR was utilized to leverage the robustness of SVM, while accommodating the continuous intensities of sentiments and affects.

• Ensemble classifiers aggregate multiple independent classifiers built using different techniques or feature subsets, improving performance over a single classifier.

Page 23: Slide 1

Methods

• For the analysis of the Al-Firdaws and Montada web forums, a separate classifier was developed for each of the five sentiment and affect classes.
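The one-model-per-class structure might look like the sketch below. It is only an illustration under stated assumptions: a hand-weighted linear scorer stands in for each trained SVR, and the feature names, weights, and the `ClassRegressor` and `score_sentence` names are hypothetical. The output ranges follow the scales given for the data (sentiment from -1 to 1, affects from 0 to 1).

```python
from dataclasses import dataclass

@dataclass
class ClassRegressor:
    """Stand-in for one trained regressor (e.g. an SVR) with its own feature subset."""
    features: tuple   # names of the features selected for this class
    weights: tuple    # one weight per selected feature (hypothetical values)
    bias: float
    low: float        # lower bound of the intensity scale
    high: float       # upper bound of the intensity scale

    def predict(self, sentence_features):
        raw = self.bias + sum(
            weight * sentence_features.get(name, 0.0)
            for name, weight in zip(self.features, self.weights)
        )
        # Clamp the raw score to the scale used for this class.
        return max(self.low, min(self.high, raw))

# One regressor per class; each sees only its own selected features.
ensemble = {
    "sentiment": ClassRegressor(("praise", "threat"), (0.5, -0.75), 0.0, -1.0, 1.0),
    "violence": ClassRegressor(("threat", "weapon"), (0.5, 0.25), 0.0, 0.0, 1.0),
    "anger": ClassRegressor(("insult",), (0.5,), 0.0, 0.0, 1.0),
    "hate": ClassRegressor(("insult", "threat"), (0.25, 0.25), 0.0, 0.0, 1.0),
    "racism": ClassRegressor(("slur",), (1.0,), 0.0, 0.0, 1.0),
}

def score_sentence(sentence_features):
    """Return one predicted intensity per class for a single sentence."""
    return {label: model.predict(sentence_features) for label, model in ensemble.items()}

scores = score_sentence({"threat": 1.0, "weapon": 2.0})
print(scores)
```

Keeping the models separate lets each class use a different feature subset, which matches the per-class feature selection described for the ensemble.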

Page 24: Slide 1

Methods

• Feature selection

– Information gain (IG) heuristic

• Discretization of intensities was performed before IG could be applied and the relevant features selected.

• To compensate for the discretization, multiple iterations were performed varying the number of class bins for intensity between 2 and 10.

• The IG heuristic was used recursively to select relevant features in these iterations using recursive feature elimination (RFE).
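A minimal sketch of discretizing intensities and scoring a feature with information gain is shown below. Equal-width binning and binary (present/absent) features are assumptions made for illustration; the slides do not say how the bins were formed or how feature values were encoded.

```python
import math
from collections import Counter

def discretize(intensities, bins, low=0.0, high=1.0):
    """Map continuous intensities in [low, high] to equal-width bin indices."""
    width = (high - low) / bins
    return [min(int((value - low) / width), bins - 1) for value in intensities]

def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_present, labels):
    """IG of a binary feature: H(labels) - H(labels | feature)."""
    total = len(labels)
    conditional = 0.0
    for value in (True, False):
        subset = [label for f, label in zip(feature_present, labels) if f == value]
        if subset:
            conditional += len(subset) / total * entropy(subset)
    return entropy(labels) - conditional

# Hypothetical intensities for four sentences and two binary features.
intensities = [0.1, 0.2, 0.8, 0.9]
informative = [False, False, True, True]
uninformative = [True, False, True, False]

# Repeat the scoring for each number of class bins from 2 to 10.
for bins in range(2, 11):
    labels = discretize(intensities, bins)
    print(bins, information_gain(informative, labels), information_gain(uninformative, labels))
```

Varying the number of bins matters because a feature that separates low from high intensities at two bins may look less useful once the scale is cut more finely.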

Page 25: Slide 1

Methods

IG: Selected each feature with a score above the threshold.

RFE: Removed half of the features in each iteration until only [...] remained.

$$F = \bigcup_{x=2}^{n} F_x, \qquad F_x = F_x^{IG} \cup F_x^{RFE}, \qquad x = 2, 3, \ldots, 10$$

where:

$F_x$ is the selected subset of features for class discretization $x$

$F_x^{IG}$ and $F_x^{RFE}$ are the selected features for the class discretization $x$ when using IG and RFE, respectively
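The two selection rules on this slide, and one way their results could be combined across discretizations, can be sketched as follows. Combining the per-discretization subsets by set union is an assumption, as are the function names, the threshold, the stopping size, and the toy scores.

```python
def select_by_threshold(scores, threshold):
    """IG rule: keep every feature whose score is above the threshold."""
    return {feature for feature, score in scores.items() if score > threshold}

def select_by_halving(scores, target):
    """RFE-style rule: drop the lower-scoring half until `target` features remain."""
    current = sorted(scores, key=lambda feature: scores[feature], reverse=True)
    while len(current) > target:
        current = current[:max(target, len(current) // 2)]
    return set(current)

# Hypothetical IG scores, keyed by the number of class bins x.
scores_by_bins = {
    2: {"a": 0.9, "b": 0.4, "c": 0.1, "d": 0.05},
    3: {"a": 0.2, "b": 0.8, "c": 0.7, "d": 0.01},
}

selected = set()
per_bins = {}
for x, scores in scores_by_bins.items():
    ig_subset = select_by_threshold(scores, threshold=0.5)
    rfe_subset = select_by_halving(scores, target=1)
    per_bins[x] = ig_subset | rfe_subset  # assumed: union of the two subsets
    selected |= per_bins[x]               # assumed: union across discretizations

print(per_bins, selected)
```

With these toy scores, feature "d" never scores well under either rule and is the only one left out of the final set.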

Page 26: Slide 1

Methods

• The feature selection phase resulted in a subset of the features identified in the test bed being selected for each of the 5 classifiers in the ensemble.

Originally 7556 features.

Only 22% were selected.

Page 27: Slide 1

Methods

• Evaluation was performed using 10-fold cross validation.
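The fold construction behind 10-fold cross validation can be sketched as below. The function name is hypothetical, and the 1000-item example assumes the 500 scored sentences from each forum are pooled; the slides do not say whether the forums were evaluated together or separately. In practice the items would normally be shuffled before splitting.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists; each item is tested exactly once."""
    indices = list(range(n_items))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(1000, k=10))
for train, test in folds:
    print(len(train), len(test))
```

Each model is trained on nine folds and scored on the held-out fold, so every sentence contributes to the reported error exactly once.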

Page 28: Slide 1

Results

• A sample of messages and their sentiment and affect intensities determined through automated analysis is presented in Table VII.

Page 29: Slide 1

Results

• Results confirm the assessment of the forums by domain experts.

• The Al-Firdaws forum contained higher intensities of violence and hate affects, with a more negative sentiment polarity.

Page 30: Slide 1

Results

• The percentage of postings containing intense levels of the four affects is greater in the Al-Firdaws forum than in the Montada forum, as shown in Figs. 8 and 9.

Page 31: Slide 1

Results

• The violence and hate affects were used by a relatively large percentage of Al-Firdaws authors.

Page 32: Slide 1

Results

• A time series analysis was performed to understand how forum affect intensities progressed over time.