Detecting the presence of cyberbullying using computer software

DETECTING THE PRESENCE OF CYBERBULLYING USING COMPUTER SOFTWARE

Ashish Arora

Department of Computer and Electrical Engineering and Computer Science

Florida Atlantic University

Mentor: Dr. Taghi M. Khoshgoftaar

WHAT IS CYBERBULLYING?

Cyberbullying is the use of electronic media or communication channels to bully a person, typically by sending messages of an intimidating or threatening nature.

The technology is used to intentionally hurt or embarrass another person.

It involves the use of information and communication technologies to support hostile behaviour by a person or group.

COMMENTS INVOLVING NEGATIVITY AND PROFANITY

Cyberbullying: Profanity, Negativity

Attributes: Sexuality, Race/Culture, Intelligence, Physical

ISSUES RELATED TO CYBERBULLYING

Conversations must be classified either as normal chat/text or under one of the bullying attributes.

Cyberbullying is one of the most mentally damaging problems on the internet.

It can have a catastrophic impact on self-esteem and personal lives, especially those of students.

The data needs to be categorized properly before any approach is used to stop cyberbullying activity.

WHAT IS THE SUITABLE SOLUTION?

Software to detect cyberbullying content:

Machine Learning: Online Patrol Crawler

Sentiment Analysis

MACHINE LEARNING METHOD: ONLINE PATROL CRAWLER

This method is designed to curb malicious online entries, especially on informal school websites.

It uses a machine learning method known as the Support Vector Machine (SVM) to detect inappropriate entries.

The software is designed to detect cyberbullying cases automatically.

The data for classification is taken from informal school websites.

These informal school websites contain slanderous information about teachers and students.

PREVIOUS APPROACH

1. Detection of cyberbullying activity

2. Saving the URL of the website

3. Printing out websites containing cyberbullying entries

4. Sending a deletion request for the suspicious entry to the website admin or internet provider

5. Informing the police or legal affairs bureau

6. Confirming the deletion of the entry containing cyberbullying activity

MACHINE LEARNING APPROACH

Machine Learning Module: Training Phase, Test Phase

TRAINING PHASE STEPS

Crawling School Website

Manually detecting cyberbullying entries

Extraction of vulgar words and adding them to lexicon

Estimating word similarity with Levenshtein distance

Training with Support Vector Machine Algorithm

TEST PHASE STEPS

Crawling School Website

Detecting Cyber-bullying entries by SVM model

Part-of-speech analysis of the detected harmful entry

Estimating word similarity with Levenshtein distance

Marking and visualizing harmful entries

ESTIMATION OF WORD SIMILARITY: LEVENSHTEIN DISTANCE

Suspicious entries were gathered manually to form a lexicon of vulgar words distinctive of cyberbullying entries.

Users often change the spelling of words and write in a non-normalized way. E.g. ‘See You’ is written as ‘CU’ in chats or forums.

Levenshtein distance is used to calculate the similarity of words used in chat.

The Levenshtein distance between two strings is the minimum number of operations required to transform one string into the other, where the only available operations are deletion, insertion, or substitution of a single character.

For example, the Levenshtein distance between "kitten" and "sitting" is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:

kitten → sitten (substitution of "s" for "k")

sitten → sittin (substitution of "i" for "e")

sittin → sitting (insertion of "g" at the end)

As a result, with the threshold set to 2, the system was able to correctly identify similar words with a precision of 85%.
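The distance computation and the threshold check can be sketched in a few lines of Python. This is a minimal illustration using the standard dynamic-programming formulation, not the code of the original system; the function names and the sample lexicon entries are hypothetical.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `a` into `b`."""
    # previous[j] is the distance between the prefix of `a` processed
    # so far (one character shorter than the current one) and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (ch_a != ch_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def matches_lexicon(word: str, lexicon: Iterable[str], threshold: int = 2) -> bool:
    """True if `word` is within `threshold` edits of any lexicon entry."""
    word = word.lower()
    return any(levenshtein(word, entry) <= threshold for entry in lexicon)


# Hypothetical lexicon entries, used only for illustration.
lexicon = {"stupid", "loser"}

print(levenshtein("kitten", "sitting"))     # 3
print(matches_lexicon("stoopid", lexicon))  # True
print(matches_lexicon("hello", lexicon))    # False
```

A fixed threshold of 2 is generous for very short words, which is one plausible reason why distance-based matching can produce inaccurate matches.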

SUPPORT VECTOR MACHINE METHOD OF CLASSIFICATION

SVM is a supervised machine learning method used for the classification of data.

Given a set of training samples divided into two categories, A and B, the SVM training algorithm generates a model that predicts whether a test sample belongs to category A or B. Here it is used to classify entries as harmful (cyberbullying) or non-harmful.

Samples are represented as points in space (vectors). SVM constructs a hyperplane in that space with the largest distance to the nearest training data points.

The larger the margin, the lower the generalization error of the classifier.

Training samples are divided into two categories and represented as points in space.
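To make the idea of a separating hyperplane concrete, here is a minimal sketch of a linear SVM trained by sub-gradient descent on the regularised hinge loss. It is not SVM_light and not the method's actual implementation; the 2-D points, the hyperparameter values, and the function names are assumptions chosen only for illustration.

```python
def train_linear_svm(samples, labels, lam=0.01, learning_rate=0.1, epochs=200):
    """Fit weights w and bias b for the hyperplane w.x + b = 0.

    Labels must be +1 (category A) or -1 (category B).
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Sample is misclassified or inside the margin:
                # move the hyperplane towards classifying it correctly.
                w = [wi + learning_rate * (y * xi - lam * wi)
                     for wi, xi in zip(w, x)]
                b += learning_rate * y
            else:
                # Sample is safely classified: apply only the
                # regularisation step, which favours a wide margin.
                w = [wi - learning_rate * lam * wi for wi in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical, linearly separable toy data.
samples = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -3), (-2, -3)]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(samples, labels)
print(predict(w, b, (3, 2)))    # 1
print(predict(w, b, (-3, -2)))  # -1
```

In a text classifier the points would be feature vectors built from each entry rather than 2-D coordinates, but the training rule is the same.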

EVALUATION OF SVM MODEL

Data needs to be prepared for training the SVM model.

For the training data, 966 entries were gathered during manual online patrol, of which human annotators classified 750 as harmful and 216 as non-harmful.

These entries were fed to SVM_light, a software package that implements the SVM algorithm.

The result is reported as an F-score, which is computed from precision and recall.
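The slides do not say which F-score variant was used. As a sketch, assuming the common balanced F1 (the harmonic mean of precision and recall, i.e. beta = 1), the score can be computed as follows; the function name is hypothetical.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta = 1 weights precision and recall equally."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Precision and recall reported for the SVM model in these slides.
print(round(f_score(0.799, 0.983), 2))  # 0.88
```

Because the harmonic mean is pulled towards the smaller of the two values, a model cannot reach a high F-score by maximising recall alone.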

METHODOLOGY

Data: pre-processing, feature extraction

Training data set: 966 entries (750 harmful, 216 not harmful)

SVM training with SVM light (a software for building SVM models)

10-fold cross validation

Test data set: evaluate

Result of SVM model: 79.9% precision and 98.3% recall
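The 10-fold cross validation step can be sketched as below. This is a generic illustration of how 966 entries might be split into folds, not the procedure SVM light itself uses; the function name is hypothetical, and in practice the entries would normally be shuffled before splitting.

```python
def k_fold_indices(n_samples: int, k: int = 10):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(966, k=10))
print([len(test) for _, test in folds])
# [97, 97, 97, 97, 97, 97, 96, 96, 96, 96]
```

Each entry is used for testing exactly once and for training in the other nine rounds, and the reported score is typically the average over the ten rounds.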

RANKING THE WORDS

Apart from classifying cyberbullying entries, there is a need to determine how harmful a given entry is.

The harmfulness of an entry is calculated using a T-score.

To calculate the harmfulness of a whole entry, the T-scores of all vulgar words in it are summed.

The more frequently a word occurs in a sentence, the higher its T-score.

The more frequently occurring words an entry contains, the higher it ranks in the ranking of harmfulness.

T-score = a/b

DISCUSSION

The SVM model used to distinguish between harmful and non-harmful information achieved 79.9% precision and 98.3% recall.

The approach is less accurate when preparing the lexicon of vulgar words: matching words by Levenshtein distance does not always give accurate results.

New vulgar words appear frequently, so a way is needed to extract new harmful words from the internet automatically.

DETECTING CYBERBULLYING ON A SOCIAL NETWORK SITE: TWITTER

A sentiment classifier is used to classify tweets into negative and positive categories using a machine learning algorithm.

The aim is to detect bullying instances in social networks and increase their visibility.

Twitter is used as the source of data.

PREVIOUS APPROACH

A machine learning algorithm classified the sentiment of Twitter messages.

The previous approach classified tweets as positive or negative according to specific emoticons found in the messages.

In this approach, commonly used abusive words are used for labelling instead of emoticons.

Graph visualizations, both dynamic and static, illustrate the clustering of bullies over a period of time.

PROPOSED APPROACH

This software application would be capable of accurately classifying Twitter messages as negative or positive with respect to some commonly used terms.

It focuses mainly on gender bullying, using four words with different polarity.

To confirm their “bullying” polarity, Amazon’s Mechanical Turk was used.

PROPOSED APPROACH (CONT.)

Once the polarity of the words is confirmed, the data is processed to extract relevant information, such as the username of the person who posted the negative tweet (the potential bully) and the username of the person mentioned in the tweet.

The outcome of the monitoring process will be several social graphs.

The social graphs will be categorized into bully and victim social graphs.

The purpose of these graphs is to visualize all detected bullying instances, find clusters of bullies, and show hidden connections between victims over a period of time.
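The username extraction step might look like the following sketch. The tweet records, field names, and function name are hypothetical, and the regular expression is a simplification of Twitter's actual mention syntax; the tweets are assumed to have already been labelled negative by the classifier.

```python
import re
from collections import Counter

# Simplified pattern for @mentions (an assumption, not Twitter's full rule).
MENTION = re.compile(r"@(\w+)")


def bullying_edges(negative_tweets):
    """Return (potential_bully, mentioned_user) pairs, one per mention."""
    edges = []
    for tweet in negative_tweets:
        for mentioned in MENTION.findall(tweet["text"]):
            if mentioned != tweet["user"]:  # ignore self-mentions
                edges.append((tweet["user"], mentioned))
    return edges


# Hypothetical tweets already classified as negative.
tweets = [
    {"user": "alice", "text": "@bob you are such a loser"},
    {"user": "carol", "text": "@bob @dave nobody likes you"},
    {"user": "alice", "text": "@dave go away"},
]

edges = bullying_edges(tweets)
bully_counts = Counter(bully for bully, _ in edges)
victim_counts = Counter(victim for _, victim in edges)

print(edges)
print(bully_counts["alice"], victim_counts["bob"])  # 2 2
```

An edge list of this kind is the sort of input a graph visualization tool can render as bully and victim graphs, with the counts hinting at repeat offenders and repeat targets.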

TECHNOLOGY USED

LingPipe: a tool kit for processing text using computational linguistics. It implements the naïve Bayes algorithm.

Tweet Extractor: extracts tweets from Twitter continuously.

Gephi: open-source graph visualization and manipulation software.

Amazon’s Mechanical Turk service: a crowdsourcing marketplace that coordinates the use of human intelligence to perform tasks that computers are unable to do.

DATA COLLECTION AND PRE-PROCESSING

Around 5000 tweets were collected from different sources.

A bag-of-words model is used: every word in a sentence is taken as a feature, and the whole sentence is represented as an unordered collection of words.

Sources of the 5000 tweets:

Previously collected data from Stanford students

Previously collected data from university professors

Mechanical Turk was used to validate the polarity of the tweets
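A bag-of-words representation can be sketched with the standard library alone. The tokenisation rule below (lowercasing and keeping letters, digits, and apostrophes) is an assumption for illustration; the slides do not describe how the tweets were tokenised.

```python
import re
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Map each word in the sentence to the number of times it occurs."""
    tokens = re.findall(r"[a-z0-9']+", sentence.lower())
    return Counter(tokens)


bag = bag_of_words("You are so so mean")
print(bag["so"])    # 2
print(bag["kind"])  # 0 (word not present)

# Word order is discarded: a reordered sentence gives the same bag.
print(bag == bag_of_words("mean so you are so"))  # True
```

Discarding word order keeps the features simple, at the cost of losing phrases whose meaning depends on the order of the words.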

APPROACH

A framework was built on top of the LingPipe tool kit for processing text using computational linguistics.

The framework uses LingPipe’s naïve Bayes machine learning classifier as a baseline.

The framework treats the classifier and feature extractor as one component.

As part of data collection and pre-processing, Twitter was searched for tweets containing the words of interest (negative words).

Framework: LingPipe naïve Bayes classifier + Tweet Extractor (extracts tweets)

DATA COLLECTION

Open-source library and Streaming API: crawls the Twitter timeline for tweets containing the words of interest, to be used as training data.

For training data, messages that contained the words “Gay,” “Homo,” “Dike,” and “Queer” were collected using our in-house tweet extractor.

The test data was collected at random by streaming in public tweets from Twitter’s public timeline.

To train the classifier, a training data set and a test data set were created. The training data consists of messages containing the four words of interest.

Of the 5000 tweets collected, approximately 3/4 were negative and 1/4 were positive.

460 tweets were manually labelled as negative, and 500 tweets were labelled as positive by Amazon’s Mechanical Turk service.

The labelled data is validated by selecting a random sample of the collected data and using Amazon’s Mechanical Turk to confirm their sentiment.

Survey Used

Opinion                                 Polarity Value
Negative with Bullying Intentions       B
Negative without Bullying Intentions    A
Positive or good content                P
Neutral                                 N

CLASSIFICATION: NAIVE BAYES CLASSIFIER

The focus of this approach is to find the polarity of tweets.

Each word in a tweet is considered a unique variable in the naïve Bayes model.

Goal: estimate the probability that a word belongs to the positive or the negative class.

Collecting data set for training → Pre-processing data set → Training data → Training the model → Sentiment detection (positive, negative)
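The classification step can be illustrated with a small multinomial naïve Bayes classifier written from scratch. This is a sketch of the general technique with add-one (Laplace) smoothing, not LingPipe's implementation or API; the class name, the whitespace tokenisation, and the four training sentences are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over whitespace-separated words."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha                       # smoothing constant
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.class_counts = Counter()            # label -> number of documents
        self.vocabulary = set()

    def train(self, documents):
        for text, label in documents:
            tokens = text.lower().split()
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods per word.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            denominator = total_words + self.alpha * len(self.vocabulary)
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + self.alpha) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labelled messages, for illustration only.
training = [
    ("you are a loser", "negative"),
    ("nobody likes you", "negative"),
    ("have a great day", "positive"),
    ("you are awesome", "positive"),
]

model = NaiveBayes()
model.train(training)
print(model.predict("you loser"))  # negative
print(model.predict("great day"))  # positive
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied, and smoothing keeps a single unseen word from forcing a class probability to zero.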

RESULTS

Amazon’s Mechanical Turk classified unlabelled data, which was used to verify and validate the newly labelled data provided by the machine learning algorithm.

Results (training: 500 tweets)

                  Positive   Negative   Accuracy
Naïve Bayes       65.7%      72.9%      67.3%
Amazon’s MTurk    65.2%      74.0%      67.1%

CONCLUSION

This approach leverages the power of sentiment analysis.

The classifier was close to 70% accurate. The result fell short of expectations because of restrictions on accessing unlimited content from Twitter.

CYBERBULLYING BLOCKER APPLICATION FOR ANDROID

New types of internet-connected devices, such as smartphones and tablets, have further exacerbated the problem of cyberbullying.

An Android application automatically detects possibly harmful content in a text.

The application uses a machine learning method to spot undesirable content.

APPLICATION

The application is built for devices running Android OS.

Java 8 and Android Studio were used.

It gives users an interface for detecting harmful content.

HARMFUL CONTENT DETECTION PROCESS

The application contains one activity responsible for interacting with the user.

To check for harmful content, the application starts a background thread.

The user can still use the device even if the checking process takes a while.

User inputs text on the mobile screen → Pushes a button to select the method → Feedback to the user

METHODOLOGY

The method classifies messages as harmful or not using a classifier trained with a language modelling method based on a brute-force algorithm.

Brute force: algorithms using a combinatorial approach usually generate a massive number of combinations, each a potential answer to a given problem.

The algorithm is applied to the automatic extraction of sentence patterns.

Actual data collected by Internet Patrol and annotated by experts was used: 1490 harmful and 1508 non-harmful entries.

All patterns used in classification were stored on the mobile device.

The method operates locally and does not require an internet connection.

During training, ordered non-repeated combinations were generated from all elements of the training sentences.

The method was tested on actual data obtained by Internet Patrol.
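Generating ordered, non-repeated combinations from a sentence can be sketched with `itertools.combinations`, which preserves the original word order and never reuses a position. This is only an illustration of the combinatorial idea, not the application's actual pattern extractor; the function name and the `max_len` cap are assumptions added to keep the output small.

```python
from itertools import combinations


def sentence_patterns(sentence: str, max_len: int = 3):
    """Return all ordered word combinations of up to `max_len` words."""
    tokens = sentence.lower().split()
    patterns = set()
    for size in range(1, min(max_len, len(tokens)) + 1):
        # combinations() keeps elements in their original order.
        for combo in combinations(tokens, size):
            patterns.add(combo)
    return patterns


patterns = sentence_patterns("you are so ugly")
print(len(patterns))                 # 14
print(("you", "ugly") in patterns)   # True  (order preserved, gap allowed)
print(("ugly", "you") in patterns)   # False (reversed order never generated)
```

Without a length cap, a sentence of n distinct words yields 2^n - 1 patterns, which is why the brute-force approach generates such a large number of candidates.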

METHOD CONT.…

RESULT

Precision = 79%

Recall = 79%

Requires minimal human effort

RECALL is the ratio of the number of relevant records retrieved to the total number of relevant records in the database.

PRECISION is the ratio of the number of relevant records retrieved to the total number of irrelevant and relevant records retrieved.
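Applied to a classifier, "relevant records retrieved" are entries correctly flagged as harmful. The two definitions can be sketched as follows; the label strings and the five-entry example are hypothetical.

```python
def precision_recall(actual, predicted, positive="harmful"):
    """Compute (precision, recall) for the `positive` label."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if p == positive and a == positive)
    false_pos = sum(1 for a, p in pairs if p == positive and a != positive)
    false_neg = sum(1 for a, p in pairs if p != positive and a == positive)
    flagged = true_pos + false_pos    # everything the classifier retrieved
    relevant = true_pos + false_neg   # everything that is actually harmful
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / relevant if relevant else 0.0
    return precision, recall


# Hypothetical annotations and classifier outputs.
actual = ["harmful", "harmful", "harmful", "ok", "ok"]
predicted = ["harmful", "harmful", "ok", "harmful", "ok"]

precision, recall = precision_recall(actual, predicted)
print(round(precision, 2), round(recall, 2))  # 0.67 0.67
```

Here two of the three flagged entries are truly harmful (precision 2/3), and two of the three harmful entries were found (recall 2/3).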

OTHER SOFTWARE PRODUCTS IN THE MARKET FOR DETECTING CYBERBULLYING

FearNot!: an interactive drama/video game that teaches children strategies to prevent bullying and social exclusion.

Samaritans Radar: the application alerted a user when it spotted someone who was being bullied, was depressed, or was sending disturbing suicidal signals. It was discontinued due to privacy concerns.

ReThink: a smartphone application that shows a pop-up warning message when the user tries to send a message with harmful content.

PocketGuardian: a parental monitoring app that detects not only cyberbullying in text messages but also harmful images. It uses a machine learning algorithm. Disadvantage: costs $4 per month.

PROPOSAL TO FILTER SUSPECTED MESSAGES

A filtering mechanism classifies messages as “abusive” or “non-abusive” (or “positive” and “negative,” respectively).

In a practical system, the filter will not be completely reliable; there will be false positives and false negatives in at least some cases.

Some cases, such as threats, require extra effort.

It is difficult to create an automated system that reliably recognizes threats that should be reported to the police.

The problem of false positives and the problem of discarding threats can both be mitigated by diverting messages labelled abusive to a trusted third party.

EXAMPLE OF FILTERING SYSTEM

CHALLENGES

Preventing the removal of valuable messages when attempting to filter the data

Privacy concerns

Incidents should be reported as early as possible

False reporting

THANK YOU
