Conversation Level Constraints on Pedophile Detection in Chat Rooms PAN 2012 — Sexual Predator Identification Claudia Peersman, Frederik Vaassen, Vincent Van Asch and Walter Daelemans

Feb 25, 2016

Conversation Level Constraints on Pedophile Detection in Chat Rooms
PAN 2012: Sexual Predator Identification

Claudia Peersman, Frederik Vaassen, Vincent Van Asch and Walter Daelemans

Overview

• Task 1: Sexual Predator Identification
  • Preprocessing
  • Experimental Setup and Results
  • Test Run Results
• Task 2: Identifying Grooming Posts
  • Grooming Dictionary
  • Test Run Results
• Discussion

Task 1: Preprocessing of the Data

• Data: PAN 2012 competition training set
  • predator vs. non-predator
  • info on the conversation, user and post level
• Two splits: training and validation set
• No user was present in more than one cluster, to prevent overfitting on user-specific features
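
The user-disjoint split described above could be sketched as follows. This is only an illustration, not the authors' procedure: the `(user_id, text, label)` tuple layout, the function name `split_by_user` and the 20% validation fraction are all assumptions, since the slides do not state how the split was drawn.

```python
import random
from collections import defaultdict

def split_by_user(posts, validation_fraction=0.2, seed=0):
    """Split (user_id, text, label) posts so each user lands in exactly one set.

    validation_fraction is an assumed value; the slides give no split size.
    """
    by_user = defaultdict(list)
    for post in posts:
        by_user[post[0]].append(post)

    # Shuffle users, not posts, so all posts of a user stay together.
    users = sorted(by_user)
    random.Random(seed).shuffle(users)
    n_val = max(1, int(len(users) * validation_fraction))

    validation = [p for u in users[:n_val] for p in by_user[u]]
    training = [p for u in users[n_val:] for p in by_user[u]]
    return training, validation
```

Splitting on users rather than on posts is what keeps a classifier from scoring well merely by recognising an author it has already seen.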

Experimental Setup

• Features: token unigrams

• LIBSVM

• Probability output

• Parameter optimization

• Experiments on 3 levels

• Data resampling
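
To make "token unigrams" concrete, here is a minimal sketch of turning a post into sparse unigram counts. The tokenizer (lowercasing, word runs and single punctuation marks) and the helper names are assumptions; the slides do not say how posts were tokenized or how features were weighted.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of word characters, or single punctuation marks.
TOKEN = re.compile(r"\w+|[^\w\s]")

def unigram_counts(post):
    """Count each token of a lowercased post."""
    return Counter(TOKEN.findall(post.lower()))

def build_vocabulary(posts):
    """Assign a feature index to every token seen in the training posts."""
    vocab = {}
    for post in posts:
        for token in unigram_counts(post):
            vocab.setdefault(token, len(vocab))
    return vocab

def to_sparse_vector(post, vocab):
    """Sparse {feature_index: count} mapping; tokens outside vocab are dropped."""
    return {vocab[t]: c for t, c in unigram_counts(post).items() if t in vocab}
```

A sparse index-to-count mapping of this kind is the usual input shape for an SVM over bag-of-words features.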

Level 1: the Post Classifier

• Resample the number of posts to get an equal distribution of posts per class

• About 40,000 posts per class in training

• No resampling in the validation sets
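
One way to obtain an equal number of posts per class is random downsampling of the larger class, sketched below. Whether the authors downsampled, oversampled or did both is not stated on the slide, so the strategy, the function name `balance_posts` and the `per_class` cap are assumptions.

```python
import random
from collections import defaultdict

def balance_posts(posts, per_class=None, seed=0):
    """Downsample (text, label) pairs so every class has the same size.

    per_class optionally caps the class size (e.g. 40000); by default the
    size of the smallest class is used.
    """
    by_label = defaultdict(list)
    for text, label in posts:
        by_label[label].append((text, label))

    target = min(len(items) for items in by_label.values())
    if per_class is not None:
        target = min(target, per_class)

    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], target))
    rng.shuffle(balanced)
    return balanced
```

Only the training data would be passed through such a function; the validation sets keep their original class distribution, as the slide notes.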

Level 1: the Post Classifier (2)

• Only output on the post level
• Aggregate the post level predictions to the user level:
  • LIBSVM’s probability outputs
  • Predators = average of the 10 highest predator class probabilities ≥ 0.85
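
The aggregation rule above can be sketched in a few lines. The function name is hypothetical, and the handling of users with fewer than 10 posts (averaging over whatever is available) is an assumption; the slide does not cover that case.

```python
def is_predator(post_probabilities, top_k=10, threshold=0.85):
    """Label a user from the predator-class probabilities of their posts.

    The user is flagged when the mean of the top_k highest probabilities
    reaches the threshold. Users with fewer than top_k posts are averaged
    over the posts they have (an assumption of this sketch).
    """
    if not post_probabilities:
        return False
    top = sorted(post_probabilities, reverse=True)[:top_k]
    return sum(top) / len(top) >= threshold
```

Averaging only the highest-scoring posts means a long, mostly harmless chat history does not dilute a handful of strongly suspicious posts.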

Results for the Predator Class

Scores      Post Classifier
Recall      0.93
Precision   0.36
F-score     0.52

Level 2: the User Classifier

• Resampling on the user level: exclude users with no suspicious posts
• Filter: dictionary of grooming vocabulary (see Task 2)
• Why?
  • reduce the amount of data
  • “hard” classification: higher precision?
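
A dictionary filter of this kind might look like the sketch below. The vocabulary entries are placeholders and not taken from the authors' dictionary, and plain substring matching on lowercased text is an assumed matching rule.

```python
def has_suspicious_post(posts, vocabulary):
    """True if any post contains any vocabulary term (case-insensitive)."""
    return any(term in post.lower() for post in posts for term in vocabulary)

def filter_users(posts_by_user, vocabulary):
    """Keep only users with at least one post matching the vocabulary."""
    return {
        user: posts
        for user, posts in posts_by_user.items()
        if has_suspicious_post(posts, vocabulary)
    }

# Placeholder terms for illustration only.
example_vocabulary = {"home alone", "how old"}
example_users = {
    "u1": ["hi", "are you home alone"],
    "u2": ["nice weather today"],
}
kept = filter_users(example_users, example_vocabulary)
```

Users removed by the filter never reach the user classifier, which is where the data reduction comes from.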

Updated Results (1)

Scores      Post Classifier   User Classifier
Recall      0.93              0.82
Precision   0.36              0.88
F-score     0.52              0.84

Combine systems?

Data reduction: up to 48.4%

Combining the systems

• Weighted voting using LIBSVM’s probability outputs
• 70% of the weight on the high precision User Classifier
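
Weighted voting over two probability outputs can be sketched as a convex combination. The 0.7 weight comes from the slide; the 0.5 decision threshold, the function names, and the idea that each system contributes a single per-user probability are assumptions.

```python
def combined_probability(post_prob, user_prob, user_weight=0.7):
    """Weighted vote: user_weight on the User Classifier, the rest on the Post Classifier."""
    return user_weight * user_prob + (1.0 - user_weight) * post_prob

def combined_label(post_prob, user_prob, user_weight=0.7, threshold=0.5):
    """Flag a user when the weighted probability reaches the (assumed) threshold."""
    return combined_probability(post_prob, user_prob, user_weight) >= threshold
```

With this weighting the high precision system dominates, while a confident Post Classifier can still tip borderline users over the threshold.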

Updated Results (2)

Scores      Post Classifier   User Classifier   Combined Results
Recall      0.93              0.82              0.85
Precision   0.36              0.88              0.84
F-score     0.52              0.84              0.84

Level 3: Conversation Level Constraints

• Both users in a conversation labeled as predators
• Our approach:
  • go back to predator probability output
  • use the high precision user classifier
  • Predator probability ≥ 0.75
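
The slide does not spell out the exact rule, so the following is one possible reading: when two or more participants of the same conversation are flagged, each keeps the predator label only if their User Classifier probability is at least 0.75. The data layout and function name are hypothetical.

```python
def apply_conversation_constraints(conversations, predicted, user_prob, threshold=0.75):
    """Resolve conversations in which several participants were flagged.

    conversations: {conversation_id: set of user ids}
    predicted:     set of user ids flagged by the combined system
    user_prob:     {user id: User Classifier predator probability}

    Assumed rule: in a conflicting conversation, flagged users whose
    probability is below the threshold lose the predator label.
    """
    demoted = set()
    for participants in conversations.values():
        flagged = [u for u in participants if u in predicted]
        if len(flagged) < 2:
            continue  # no conflict in this conversation
        for user in flagged:
            if user_prob.get(user, 0.0) < threshold:
                demoted.add(user)
    return predicted - demoted
```

Users who are the only flagged participant in their conversations are left untouched, so the constraint can only remove predictions, which fits a gain in precision at unchanged recall.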

System Overview

[Diagram with boxes: Post Classifier, User Classifier, Combined Prediction, Apply Conversation Level Constraints, Final Predator ID List]

Updated Results (3)

Scores      Post Classifier   User Classifier   Combined Results   Combined + Constraints
Recall      0.93              0.82              0.85               0.85
Precision   0.36              0.88              0.84               0.94
F-score     0.52              0.84              0.84               0.89

Results on the PAN 2012 Test Set

Scores            Combined + Constraints   PAN Test Set
Recall            0.85                     0.60
Precision         0.94                     0.89
F-score (β = 1)   0.89                     0.72

• Future research:
  • more splits
  • investigate ensembles

Task 2: Identifying Grooming Posts

• From the final predator ID list, detect posts expressing typical grooming behavior
• No gold standard labels: what is grooming?
• Predator conversations have predictable stages (e.g. Lanning, 2010; McGhee et al., 2011)

Task 2: Identifying Grooming Posts (2)

• Dictionary containing references to 6 stages:
  • sexual topic
  • reframing
  • approach
  • data requests
  • isolation from adult supervision
  • age (difference)
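
A stage dictionary like the one listed above could be applied to posts roughly as follows. The six stage names come from the slide; every term in `STAGE_TERMS` is a placeholder chosen for illustration (two stages are left empty), and whole-phrase matching on word boundaries is an assumption.

```python
import re

# Stage names from the slide; the terms are hypothetical placeholders.
STAGE_TERMS = {
    "sexual topic": [],   # left empty in this sketch
    "reframing": [],      # left empty in this sketch
    "approach": ["meet up", "come over"],
    "data requests": ["your address", "phone number"],
    "isolation from adult supervision": ["home alone", "parents home"],
    "age (difference)": ["how old", "too old"],
}

def stages_for_post(post, stage_terms=STAGE_TERMS):
    """Return the sorted names of all stages with a term occurring in the post."""
    text = post.lower()
    return sorted(
        stage
        for stage, terms in stage_terms.items()
        if any(re.search(r"\b" + re.escape(t) + r"\b", text) for t in terms)
    )

def grooming_posts(posts, stage_terms=STAGE_TERMS):
    """Keep the posts that match at least one stage, paired with their stages."""
    tagged = ((post, stages_for_post(post, stage_terms)) for post in posts)
    return [(post, stages) for post, stages in tagged if stages]
```

Run over the posts of users on the final predator ID list, the output of `grooming_posts` would be the candidate set of grooming posts.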

Task 2: Identifying Grooming Posts (3)

• Resources:
  • McGhee et al. (2011)
  • English Urban Dictionary website: http://www.urbandictionary.com/
  • English Synonyms: http://www.synonym.net/
• cf. user classifier filter

Results on the PAN 2012 Test Set

• Precision = 0.36

• Recall = 0.26

• F-score (β = 1) = 0.30

Discussion

• Use of β-factors to calculate the F-score:
  • Task 1: focus on precision (β = 0.5)
  • Task 2: focus on recall (β = 3.0)
• However, in practice:
  • find all predators (recall in Task 1)
  • find the most striking posts (precision in Task 2)
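
For reference, the β-weighted F-score is F_β = (1 + β²) · P · R / (β² · P + R), where β < 1 favours precision and β > 1 favours recall. A small helper (the function name is ours) makes the effect of β easy to try out:

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score; beta < 1 weights precision, beta > 1 weights recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

# With beta = 1 this reproduces the Post Classifier F-score from the slides.
post_classifier_f1 = round(f_beta(0.36, 0.93), 2)  # 0.52
```

Calling `f_beta` with `beta=0.5` or `beta=3.0` on the precision and recall values reported above shows how strongly the choice of β shifts the ranking of a high precision system against a high recall one.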

Questions?

Contact: [email protected]