
BOTs or not? A case study on bot recognition from web session log

Alberto Cabri - email: [email protected] student

DIBRIS - University of Genoa

June 6, 2017

1/23

2/23

Our aim...

- analyse the usage logs of a web site to verify whether it is possible to distinguish legitimate human users from bots using computational intelligence

- identify bot sessions by taking the earliest possible decision on online (real-time) HTTP requests of an incomplete session


3/23

What is a BOT?

- Bots are programs that perform specific actions on computers connected to a network, without any intervention by a human user

- "Bot" is short for Web Robot, also known as Autonomous Internet Agent

- Statistics report that more than half the traffic of a web site is due to bots [Zeifman - 2016]

- They can be good or malicious (the latter represent more than half of the bot traffic)


4/23

Other definitions

- A session is a sequence of web server requests that can be associated with a single IP address or a user

- HTTP is the HyperText Transfer Protocol, defined at the application level, used to exchange network resources in a client-server computing model

- A logfile, or simply log, is a file that records specific events that occur on a system


5/23

Bot types and operation modes - 1

- The primary application for bots is web crawling, typically for indexing purposes

- Social bots are used to post automatically generated messages on platforms like Twitter or Telegram

- Conversational bots are aimed at automatically interacting with humans using natural language

- Wikipedia bots perform routine maintenance tasks, such as adding templates and replacing text

[Microsoft bot framework]


6/23


- All these categories are collaborative agents (so-called ethical bots) and usually comply with the directives in the robots.txt file

- The main drawback of ethical bots is the increase in network traffic, which is usually kept under control by the bots themselves.
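As an illustration of how a collaborative bot can honour robots.txt, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` user-agent and the example.org URLs are invented for the example and do not come from the case study.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# An ethical bot asks before requesting each URL.
allowed_public = parser.can_fetch("ExampleBot", "https://example.org/index.html")
allowed_private = parser.can_fetch("ExampleBot", "https://example.org/private/data.html")
print(allowed_public, allowed_private)
```

A malicious bot simply skips this check, which is one reason robots.txt compliance cannot be relied upon for detection.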


7/23


- Malicious bots are used to perform harmful or fraudulent activities

- They can impersonate different user-agents or mimic human behaviour

- Botnets can be created to operate from different IP addresses, under the control of a supervisor bot, called the herder [Goodman - 2017]

- They are increasingly used to gain undue advantage in online business [Invalid Clicks]


8/23

The recognition problem

Distinguishing bots from humans can be formalized as the following two problems:

- Offline Bot Recognition – Given a set of HTTP requests from a web session, label it as BOT or NON-BOT [Suchacka - 2014 and 2015]

- Online Bot Recognition – Given an incoming stream of HTTP requests from a web session, detect BOTs as soon as possible, before the sequence ends (if doable)


9/23

The offline problem

- it's basically a classification problem, as the sessions can be regarded as sets, entirely available at the time the decision is taken

- the analysis is based on a set of descriptive summary features, extracted from web site logs

Our dataset consists of more than 13500 sessions


10/23

The online problem

- the requests must be considered as time-ordered sequences of descriptive features

- a correlation between two subsequent requests is likely to exist

- the shortest possible decision time is required to minimize the negative impact of bots

We’re now getting into the online problem.


11/23

Features pre-processing - 1

In the log file, a request record is a set of features as shown below:

| Feature | Sample value | Description |
|---|---|---|
| Interarrival time (ms) | 38 | time interval between two subsequent requests of the same session |
| HTTP method | GET | indicates the desired action to be performed for a given resource |
| HTTP code | 200 | response status codes, divided into five classes |
| Size (KB) | 1.43 | volume of data transferred in the response, in kilobytes |
| Empty referrer | True | boolean value indicating whether or not we know the web page requesting the resource |
| is embedded | False | boolean value indicating if the requested resource is an embedded object |
| is graphic | False | boolean value indicating if the requested resource is a graphic file |
| is style | False | boolean value indicating if the requested resource is a stylesheet |
| is datafile | False | boolean value indicating if the requested resource is a file with data |
| is script | False | boolean value indicating if the requested resource is a script |
| session # | 2 | incremental value for session identification: not used in BOT detection |

Table: Online request features
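The interarrival time is the only feature that depends on the previous request of the same session. A minimal sketch of how it could be derived from request timestamps follows; the function name, the sample timestamps and the choice of 0 for the first request of a session are assumptions, not details taken from the case study.

```python
from datetime import datetime, timedelta


def interarrival_ms(timestamps):
    """Gap in whole milliseconds between each request and the previous one.

    Assumption: the first request of a session has no predecessor, so its
    interarrival time is set to 0.
    """
    if not timestamps:
        return []
    gaps = [0]
    for previous, current in zip(timestamps, timestamps[1:]):
        gaps.append((current - previous) // timedelta(milliseconds=1))
    return gaps


# Hypothetical request times for one session (time-ordered).
session = [
    datetime(2017, 6, 6, 10, 0, 0, 0),
    datetime(2017, 6, 6, 10, 0, 0, 38000),   # 38 ms later
    datetime(2017, 6, 6, 10, 0, 1, 538000),  # 1.5 s later
]
gaps = interarrival_ms(session)
print(gaps)
```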


12/23

Features pre-processing - 2

To improve the classification results, the original features must be transformed as follows:

- for each boolean feature, True becomes 1 and False becomes 0

- the categorical features (namely HTTP method and code) are one-hot encoded

After encoding, the initial 10 feature columns (excluding the session id) become 25 input features, used to feed the neural network.
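The transformation above can be sketched as follows. The field names, the list of HTTP methods and the choice of encoding the HTTP code by its status class (1xx to 5xx) are assumptions made for illustration; with these hypothetical vocabularies the vector has 16 entries, so the sketch does not try to reproduce the 25 inputs of the actual dataset.

```python
# Hypothetical vocabularies: the real ones are not given in the slides.
HTTP_METHODS = ["GET", "POST", "HEAD"]
STATUS_CLASSES = [1, 2, 3, 4, 5]  # assumed: one-hot over 1xx..5xx classes
BOOLEAN_FIELDS = [
    "empty_referrer", "is_embedded", "is_graphic",
    "is_style", "is_datafile", "is_script",
]


def one_hot(value, vocabulary):
    """Return a list with 1.0 at the position of value and 0.0 elsewhere."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def encode_request(request):
    """Turn one request record (a dict) into a flat numeric input vector."""
    vector = [float(request["interarrival_ms"]), float(request["size_kb"])]
    vector += [1.0 if request[name] else 0.0 for name in BOOLEAN_FIELDS]
    vector += one_hot(request["method"], HTTP_METHODS)
    vector += one_hot(request["code"] // 100, STATUS_CLASSES)
    return vector


# Sample values taken from the feature table; the session id is left out.
sample = {
    "interarrival_ms": 38, "method": "GET", "code": 200, "size_kb": 1.43,
    "empty_referrer": True, "is_embedded": False, "is_graphic": False,
    "is_style": False, "is_datafile": False, "is_script": False,
}
encoded = encode_request(sample)
print(len(encoded), encoded)
```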


13/23

The challenge

Online bot classification is complex because:

- sessions have variable length

- bots may change their navigation style

- there's no a-priori information on user-agent strings to identify bots

- a reliable decision should be taken as-you-go, without acquiring the entire sequence at the beginning of the decision process

- samples must be processed one at a time, sequentially, therefore we can assume they are correlated and time-dependent


14/23

The approach - 1

Q. is it really a sequential classification problem?

A. some heuristic tests have been performed on the input dataset, in order to consolidate our perception of the data structure; experimental results show that a simple MLP is capable of classifying samples with an accuracy above 95% and up to 99.80%

This implies that core information on BOT requests is intrinsic and sequentially independent.

Great result!


15/23

The approach - 2

- consider the sequential nature of samples to improve classification results

- three outputs are possible for each observed sample:

1: the crawler is a BOT; no further observations sampled

0: the crawler is human; no further observations sampled

None: no decision is taken at present; it's delayed to future samples

- learning is supervised with an MLP, using a leave-one-out training scheme

- the decision is taken on the posterior probability of each class, according to a sequential probability ratio criterion [Wald - 1945]


16/23

The sequential probability ratio - 1

At step $t$, $f_1(x_t)$ and $f_0(x_t)$ are the class-conditional probabilities of the current observation $x_t$ for BOTs and humans respectively.

Assumptions

- known probabilities (output by MLP)

- mutual independence of observations (relaxed constraint)

The sequential probability ratio is:

$$
\frac{p_1(t)}{p_0(t)} = \frac{f_1(x_1)\, f_1(x_2) \cdots f_1(x_t)}{f_0(x_1)\, f_0(x_2) \cdots f_0(x_t)} = \prod_{i=1}^{t} \frac{f_1(x_i)}{f_0(x_i)} \tag{1}
$$

Equation (1) has been transformed into logarithmic form to avoid numerical problems.
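Written out, the logarithmic form of the ratio turns the product into a sum that can be updated one request at a time. The symbol $\Lambda_t$ is introduced here only as shorthand and is not used in the original slides.

```latex
\Lambda_t = \log\frac{p_1(t)}{p_0(t)}
          = \sum_{i=1}^{t} \bigl[\log f_1(x_i) - \log f_0(x_i)\bigr],
\qquad
\Lambda_t = \Lambda_{t-1} + \log f_1(x_t) - \log f_0(x_t), \quad \Lambda_0 = 0
```

Summing logarithms avoids the underflow or overflow that a long product of small or large ratios can cause.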


17/23

The sequential probability ratio - 2

Assigning two thresholds, $C_0$ and $C_1$, for humans and BOTs respectively, the classifier outputs:

$$
\text{output}(t) =
\begin{cases}
1 & \text{if } \dfrac{p_1(t)}{p_0(t)} \ge C_1 \\[6pt]
0 & \text{if } \dfrac{p_1(t)}{p_0(t)} \le C_0 \\[6pt]
\text{None} & \text{if } C_0 < \dfrac{p_1(t)}{p_0(t)} < C_1
\end{cases}
\tag{2}
$$

If None is still output when the session ends, the decision is taken based on the highest value of the sequential probability.

Note: our implementation uses symmetrical thresholds $C_1 > 0$ and $C_0 = \frac{1}{C_1}$.
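The decision rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used in the case study: the function name and the default threshold are hypothetical, the per-request BOT probability is taken as $f_1$ and its complement as $f_0$, the ratio is accumulated in log form, and $C_1 > 1$ is required so that $C_0 = 1/C_1 < C_1$.

```python
import math


def sprt_decide(bot_probs, c1=20.0):
    """Sequential decision over per-request BOT probabilities.

    Returns (label, steps_used), with label 1 = BOT and 0 = human.
    Assumptions: f1 = p_bot, f0 = 1 - p_bot, symmetric thresholds C0 = 1/C1.
    """
    if c1 <= 1.0:
        raise ValueError("c1 must exceed 1 so that C0 = 1/C1 < C1")
    upper = math.log(c1)   # log C1
    lower = -upper         # log C0 = log(1 / C1)
    log_ratio = 0.0
    steps = 0
    for steps, p_bot in enumerate(bot_probs, start=1):
        # Clamp to avoid log(0) when the classifier is fully confident.
        p_bot = min(max(p_bot, 1e-12), 1.0 - 1e-12)
        log_ratio += math.log(p_bot) - math.log(1.0 - p_bot)
        if log_ratio >= upper:
            return 1, steps   # BOT: stop observing
        if log_ratio <= lower:
            return 0, steps   # human: stop observing
    # Session ended while undecided: pick the more likely class
    # (a tie is resolved as human in this sketch).
    return (1 if log_ratio > 0 else 0), steps


early_bot = sprt_decide([0.9, 0.9, 0.9])
early_human = sprt_decide([0.1, 0.1, 0.1])
undecided = sprt_decide([0.6, 0.45])
print(early_bot, early_human, undecided)
```

Raising `c1` makes the classifier wait for more evidence before committing; lowering it gives earlier but less reliable decisions.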


18/23

The classifier architecture

Figure: MLP with SPRT

Modified multi-layer perceptron from Andrej Karpathy, with an additional SPRT output module, as shown in the figure.

- MLP geometry defined by successive refinements to find the optimal model

- cross-entropy loss function for training

- tanh as activation function

- softmax for output probability

- adaptive learning rate
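A minimal forward pass for such a network is sketched below in plain Python. It is not the modified Karpathy code: the helper names and the random, untrained weights are placeholders, and the layer sizes (25 inputs, two hidden layers of 50 units, 2 outputs) are simply one of the configurations reported in the results.

```python
import math
import random


def dense(inputs, weights, biases):
    """Affine layer: one weighted sum plus bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(z):
    """Softmax with the maximum subtracted for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, layers):
    """tanh on every hidden layer, softmax on the output layer."""
    activation = x
    for weights, biases in layers[:-1]:
        activation = [math.tanh(v) for v in dense(activation, weights, biases)]
    weights, biases = layers[-1]
    return softmax(dense(activation, weights, biases))


def random_layer(n_in, n_out, rng):
    """Placeholder initialisation: small Gaussian weights, zero biases."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
sizes = [25, 50, 50, 2]  # inputs, two hidden layers, two classes
layers = [random_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
probs = forward([rng.random() for _ in range(25)], layers)
print(probs)
```

The two output values play the role of the class probabilities that the SPRT module consumes.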


19/23

Classification results

Initial tests over 1000 sessions using the MLP without SPRT, with session request grouping and an initial learning rate LR = 0.0001:

| # hidden layers | # hidden units | Final LR | Session accuracy | Sample accuracy |
|---|---|---|---|---|
| 2 | 13 | 3.3 · 10^-6 | 80.96% | 99.07% |
| 2 | 25 | 1.56 · 10^-6 | 74.46% | 99.12% |
| 2 | 50 | 7.8 · 10^-7 | 85.41% | 99.80% |
| 3 | 50 | 7.8 · 10^-7 | 87.04% | 99.19% |

Initial tests over 1000 sessions with MLP, SPRT and leave-one-out:

Sample accuracy: 100% (TOO GOOD TO BE TRUE)

Preliminary results, to be verified on all available sessions!


20/23

The end

Thank you for your attention.

Contact: [email protected]


21/23

Cross Entropy Loss

It's a standard objective function for evaluating neuron learning.

If $y$ is the target value and $\hat{y}$ the estimated output, then

$$
H(y, \hat{y}) = -\sum \bigl[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\bigr]
$$

Properties

- Non-negative

- if the neuron output is close to the target, the cross-entropy is close to zero
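Both properties can be checked numerically with a small sketch. The function name and the clamping constant are illustrative choices; the sum runs over the listed target/output pairs.

```python
import math


def cross_entropy(targets, outputs, eps=1e-12):
    """Binary cross-entropy summed over paired targets and outputs."""
    total = 0.0
    for y, y_hat in zip(targets, outputs):
        y_hat = min(max(y_hat, eps), 1.0 - eps)  # avoid log(0)
        total += y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat)
    return -total


good_fit = cross_entropy([1, 0], [0.99, 0.01])  # outputs close to targets
bad_fit = cross_entropy([1, 0], [0.01, 0.99])   # outputs far from targets
print(good_fit, bad_fit)
```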


22/23

Hyperbolic Tangent

It can be expressed as:

$$
\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}
$$

and its derivative is:

$$
\tanh'(z) = 1 - \tanh^{2}(z)
$$
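As a quick sanity check, the closed-form derivative can be compared with a central finite difference. The function names, the step size and the test points are arbitrary choices for this sketch.

```python
import math


def tanh(z):
    """Direct transcription of the definition (overflows for very large |z|;
    math.tanh is preferable in practice)."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))


def tanh_prime(z):
    """Closed-form derivative: 1 - tanh^2(z)."""
    return 1.0 - tanh(z) ** 2


def numerical_derivative(f, z, h=1e-6):
    """Central finite-difference approximation of f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)


points = (-2.0, -0.5, 0.0, 0.5, 2.0)
max_error = max(abs(tanh_prime(z) - numerical_derivative(tanh, z)) for z in points)
print(max_error)
```

The simple form of the derivative is convenient in backpropagation, since it reuses the activation already computed in the forward pass.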


23/23

Softmax

Activation function for the MLP output layer that squashes a K-dim vector $z$ of arbitrary real values to a K-dim vector $\sigma(z)$ of real values in the range (0, 1] that add up to 1.

$$
\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}
$$

where $z = \sum w x + b$, with $x$ the neuron input, $w$ the input weights and $b$ the bias.

The $\sigma(z)$ values can be interpreted as output class probabilities.
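For the two-class case considered in this work (BOT vs. human, K = 2), the definition reduces to a logistic function of the difference between the two inputs. The indices 1 (BOT) and 0 (human) follow the notation of the sequential probability ratio slides; reading the softmax outputs as the probabilities fed to the SPRT is an assumption of this worked step.

```latex
\sigma(z)_1 = \frac{e^{z_1}}{e^{z_0} + e^{z_1}} = \frac{1}{1 + e^{-(z_1 - z_0)}},
\qquad
\log\frac{\sigma(z)_1}{\sigma(z)_0} = z_1 - z_0
```

Under that reading, each request contributes the difference of the two output-layer inputs to the logarithm of the sequential probability ratio.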

