Fast Statistical Spam Filter by Approximate Classifications Authors: Kang Li Zhenyu Zhong University of Georgia Reader: Deke Guo

Dec 19, 2015

Transcript
Page 1: Fast Statistical Spam Filter by Approximate Classifications Authors: Kang Li Zhenyu Zhong University of Georgia Reader: Deke Guo.

Fast Statistical Spam Filter by Approximate Classifications

Authors: Kang Li, Zhenyu Zhong (University of Georgia)

Reader: Deke Guo

Page 2:

Outline

Motivations of this paper

The concrete problems

Basic idea and solutions

Questions needed to clarify

Page 3:

Motivations

1. Speed up the classification process in order to defend against spam quickly and, furthermore, improve the throughput of the system.

2. Improve the scalability of statistics-based classification methods.

3. Keep high classification accuracy.

Page 4:

The background and concrete problem

Background

Statistics-based Bayesian filters and their variants are used to block spam.

The statistical value of each individual token is stored in a dictionary.

A decision is based on a summary of the values of many tokens.

Problems to investigate

How to improve the performance of the value retrieval operation for each individual token (motivations 1 and 2).

The solutions should not have much negative effect on classification accuracy (motivation 3).
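As a rough illustration of the decision step in the background above, the sketch below combines per-token spam values with the naive Bayes product rule. The combining rule, the names `combine` and `classify`, the 0.9 threshold, and the dictionary contents are all assumptions made for illustration; the slides do not say which Bayesian variant the paper uses.

```python
from math import prod

# Hypothetical dictionary: token -> probability that a message containing it is spam.
token_values = {"viagra": 0.99, "free": 0.80, "meeting": 0.10}

def combine(probabilities):
    """Combine per-token spam probabilities with the naive Bayes product rule."""
    spam = prod(probabilities)
    ham = prod(1.0 - p for p in probabilities)
    return spam / (spam + ham)

def classify(tokens, threshold=0.9, default=0.5):
    """Look up every token (unknown tokens get a neutral value) and decide."""
    values = [token_values.get(t, default) for t in tokens]
    score = combine(values)
    return score, score >= threshold
```

Every call to `classify` performs one dictionary lookup per token, which is the retrieval operation the paper tries to speed up.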

Page 5:

Basic idea and solutions (1)

A straightforward idea

Use a Bloom filter to store the values of tokens, and retrieve the value of any token on demand.

The first obstacle

A standard Bloom filter only answers membership queries. How can it be extended to store values?
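To make the obstacle concrete, here is a minimal sketch of a standard Bloom filter. It is not taken from the paper; the class name, the SHA-256 based hashing scheme, and the parameters `m` and `k` are assumptions chosen for illustration. Note that the filter can only report whether an item is possibly present, so there is nowhere to keep a token's statistical value.

```python
import hashlib

class BloomFilter:
    """Standard Bloom filter: 'possibly present' or 'definitely absent'."""

    def __init__(self, m, k):
        self.m = m              # number of bits in the vector
        self.k = k              # number of hash functions
        self.bits = [0] * m

    def _locations(self, item):
        # Derive k hash locations by salting SHA-256 with the function index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for loc in self._locations(item):
            self.bits[loc] = 1

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[loc] for loc in self._locations(item))

bf = BloomFilter(m=1024, k=3)
bf.add("token1")
```

After the `add` call, `"token1" in bf` is true, but the filter has no way to say which value was associated with `token1`.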

Page 6:

[Figure: a standard Bloom filter. Elements of data set A (x, y) and data set B (a, b, c, d) are mapped by a hash function family onto a bit vector indexed from 0 to m-1.]

Page 7:

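[Figure: the extended Bloom filter. Tokens from the token universe (token1 to token4 in test set B, plus x and y) are mapped by a hash function family onto a multi-bit vector. The first dimension is the hash location; the second dimension, indexed from 0 to q-1, holds the value bits. A bit-wise AND of the entries at a token's hash locations gives the output value, here with exactly one bit set to 1.]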

Page 8:

Basic idea and solutions (2)

Replace the bit vector with a two-dimensional vector of size m × q.

The first dimension denotes the hash locations of each token in an m-bit vector, the same as in the standard Bloom filter.

The second dimension at each hash location denotes the value of the token, with one bit per distinct value.

The second obstacle

The value universe is usually large, even huge. It is impossible to allocate a bit in the second dimension for every element of the value universe.
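The two-dimensional structure described on this slide can be sketched as follows. This is an illustrative reading of the slide, not the authors' implementation: the class name, the SHA-256 based hashing, and the use of a Python integer as each q-bit entry are assumptions.

```python
import hashlib

class ExtendedBloomFilter:
    """m hash locations, each holding a q-bit entry (one bit per encoded value)."""

    def __init__(self, m, q, h):
        self.m, self.q, self.h = m, q, h
        self.entries = [0] * m      # each int is used as a q-bit vector

    def _locations(self, token):
        # First dimension: h hash locations, as in a standard Bloom filter.
        for i in range(self.h):
            digest = hashlib.sha256(f"{i}:{token}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def put(self, token, value_index):
        """Set bit `value_index` in the entry at each of the token's locations."""
        if not 0 <= value_index < self.q:
            raise ValueError("value_index must be in [0, q)")
        for loc in self._locations(token):
            self.entries[loc] |= 1 << value_index

    def query(self, token):
        """Bit-wise AND of the token's entries; returns the value indices still set."""
        result = (1 << self.q) - 1
        for loc in self._locations(token):
            result &= self.entries[loc]
        return [i for i in range(self.q) if (result >> i) & 1]

f = ExtendedBloomFilter(m=1024, q=8, h=3)
f.put("token1", 5)
```

With a single stored pair, `f.query("token1")` returns `[5]`. An empty list means the token was never stored, and a list with several indices is the conflicting output that arises once entries are shared between tokens.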

Page 9:

Basic idea and solutions (3)

Encode

In this field, the value universe ranges from 0 to 1.

This paper does not propose a new encoding method; it uses an algorithm taken from paper [20].

Choose and tune the parameter q, which denotes the number of possible values produced by the encoding algorithm.
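The slide does not describe the encoding algorithm of [20], so the sketch below substitutes plain uniform quantization purely to show the role of q: a value in [0, 1] is mapped to one of q indices and decoded back to the midpoint of its bucket. The function names `encode` and `decode` and the choice of uniform buckets are assumptions, not the paper's method.

```python
def encode(value, q):
    """Map a value in [0, 1] to one of q bucket indices (uniform buckets, assumed)."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must be in [0, 1]")
    return min(int(value * q), q - 1)   # value == 1.0 falls into the last bucket

def decode(index, q):
    """Return the midpoint of bucket `index` as the approximate value."""
    if not 0 <= index < q:
        raise ValueError("index must be in [0, q)")
    return (index + 0.5) / q
```

Under this assumed scheme the decoding error is at most 1/(2q), so a larger q reduces the deviation at the cost of more bits per hash location.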

Page 10:

Why can the idea meet motivations one and two?

Space (for the set of (token, value) pairs)

If the extended Bloom filter is used to store them, it needs less space than other approaches: K bits for each token.

Given the same allocated memory, the solution can store more (token, value) pairs than other approaches.

Time

The extended Bloom filter is small enough to be loaded into memory, so no further I/O operations are needed.

The response delay of a query is constant for any input, no matter how many pairs have been stored.

In the same time slot, the solution can retrieve the values of more tokens than previous solutions.

Page 11:

The negative effects on the classification accuracy (1)

A query on the extended Bloom filter may produce two kinds of mistake.

For a query with a token outside the test data set, the filter may return a seemingly valid output entry (exactly one bit is set to 1).

For a query with a token inside the test data set, the filter may return a conflicting output entry (more than one bit is set to 1).

For any token, the decoded result usually does not equal the real statistical value.

Page 12:

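[Figure: the extended Bloom filter with token set A (x, y) and token set B (token1 to token4) mapped by a hash function family onto the multi-bit vector. The first dimension is the hash location; the second dimension is indexed from 0 to q-1. Here the bit-wise AND yields an output value with two bits set to 1, a multi-bit conflict.]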

Page 13:

The negative effects on the classification accuracy (2)

The misclassification

The former error affects the summary of the values of a message and may influence the decision.

For a multi-bit error, choose the smallest value. If that choice is wrong, the error only makes the message less likely to be classified as spam and may result in a false negative. This can be tolerated.

The decoding deviation

It cannot be avoided. Design better algorithms and/or select the parameters carefully.
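The rule for multi-bit errors can be sketched as a small helper. The function name `resolve` and the convention that a lower bit index stands for a lower (less spam-like) value are assumptions made for this example.

```python
def resolve(output_bits):
    """Pick a value index from an output entry.

    output_bits[i] == 1 means encoded value i survived the bit-wise AND.
    Assumes that a lower index encodes a lower spam value.
    """
    set_indices = [i for i, bit in enumerate(output_bits) if bit]
    if not set_indices:
        return None             # the token is not stored in the filter
    return min(set_indices)     # multi-bit conflict: choose the smallest value
```

For example, `resolve([0, 1, 0])` gives 1, while the conflicting entry `[1, 1, 0]` resolves to 0, which biases any mistake toward a false negative rather than a false positive.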

Page 14:

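Questions needed to clarify (1)

For a query output entry, the probability that a single bit of the output entry is zero is given as

$$P_{m,n,h}(0) = 1 - P_{m,n,h}(\mathrm{fpos}) = 1 - \left(1 - \left(1 - \frac{1}{m}\right)^{n h}\right)^{h}$$

For a query output entry, the probability of the former case:

$$P_{m,n,h,q}(\mathrm{fpos}) = 1 - \left(P_{m,n,h}(0)\right)^{q} \qquad (6)$$

The probability of the latter case:

$$P_{m,n,h,q}(\mathrm{multi}) = 1 - \left(P_{m,n,h}(0)\right)^{q} - q \left(1 - P_{m,n,h}(0)\right)^{(q-1)} \qquad (7)$$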

Page 15:

Questions needed to clarify (1)

Formulas 6 and 7 are wrong, or at least not consistent with the error definitions.

The probability of the event (exactly one bit of the output entry is set to 1) is:

$$P_{m,n,h,q}(\mathrm{fpos}) = \binom{q}{1} \, P_{m,n,h}(\mathrm{fpos}) \, \left(P_{m,n,h}(0)\right)^{q-1}$$

The probability of the event (more than one bit of the output entry is set to 1) is one minus the probability of all bits being 0 and the probability of exactly one bit being 1:

$$P_{m,n,h,q}(\mathrm{multi}) = 1 - \left(P_{m,n,h}(0)\right)^{q} - P_{m,n,h,q}(\mathrm{fpos})$$
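A quick numerical check of the reader's corrected probabilities can be written as below. It treats the q bits of an output entry as independent, which is an assumption the formulas on these slides appear to make as well; the function names and the sample parameters are hypothetical.

```python
from math import comb

def p_bit_one(m, n, h):
    """Probability that one given bit of the output entry is (falsely) 1."""
    return (1.0 - (1.0 - 1.0 / m) ** (n * h)) ** h

def entry_probabilities(m, n, h, q):
    """Return (no bit set, exactly one bit set, more than one bit set)."""
    p1 = p_bit_one(m, n, h)
    p0 = 1.0 - p1
    none = p0 ** q
    exactly_one = comb(q, 1) * p1 * p0 ** (q - 1)
    multi = 1.0 - none - exactly_one
    return none, exactly_one, multi

# Hypothetical parameters: 1000 locations, 100 tokens, 3 hash functions, 8 value bits.
none, exactly_one, multi = entry_probabilities(m=1000, n=100, h=3, q=8)
```

Because the three events partition all outcomes, the returned probabilities sum to 1, which is the consistency property that formulas 6 and 7 as transcribed do not satisfy.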

Page 16:

Questions needed to clarify (2)

Can this idea be a general way to extend the standard Bloom filter so that it stores and retrieves values?

The size of the value universe.

The multi-bit output error.

The deletion of (key, value) pairs.

Page 17:

Questions and Answers

Page 18:

Beyond Bloom Filters: From Approximate Membership Checks to Approximate State Machines

Authors: Flavio Bonomi, Michael Mitzenmacher, Rina Panigrahy

SIGCOMM 2006

Reader: Deke Guo

Page 19:

Questions

How can each network device track the simultaneous state of a large number of connections?

The tracking structure should be small enough to fit in on-chip memory.

Page 20:

Solution (1)

Uses standard Bloom filters to summarize the simultaneous state of a large number of connections.

Looks up the state of each connection in this summary.

Introduces a new error type named “don’t know”, besides false positives and false negatives.

Page 21:

Solution (1)

Introduces a timing-based deletion mechanism to deal with ill-behaved or non-terminating connections.

Operations:

Put (id, state)

Lookup (id) or Lookup (id, state)

Delete (id, state)

Update (id, old state, new state)

Ill-behaved or attacking connections may result in false negative errors.

Page 22:

[Figure: a standard Bloom filter over data set A (x, y) and data set B (a, b, c, d), with hash functions h1, h2, h3, ..., hk. x does not belong to set B, yet all of its bits have been set to 1. y does not belong to set B, and its bits are not all 1. a belongs to set B, and its bits are all 1.]

Page 23:

[Figure: the same Bloom filter over data set A (x, y) and data set B (a, b, c, d) after the false deletion of x. a belongs to set B, but its bits are no longer all 1.]

A false positive error may result in at most k false negatives.
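The effect of a false deletion can be reproduced with a toy example. The hash locations below are fixed by hand rather than computed, and deletion simply clears bits; both are simplifying assumptions made so that the outcome is easy to follow.

```python
# Hypothetical hash locations (k = 3) in a 10-bit vector, chosen by hand.
LOCATIONS = {"a": (1, 3, 6), "b": (3, 8, 9), "x": (1, 8, 9)}

bits = [0] * 10

def insert(item):
    for loc in LOCATIONS[item]:
        bits[loc] = 1

def lookup(item):
    return all(bits[loc] for loc in LOCATIONS[item])

def delete(item):
    # Naive deletion: clear every bit the item hashes to.
    for loc in LOCATIONS[item]:
        bits[loc] = 0

insert("a")
insert("b")
false_positive = lookup("x")    # x was never inserted, yet bits 1, 8 and 9 are set
delete("x")                     # false deletion triggered by the false positive
a_after = lookup("a")           # bit 1 was cleared, so a is now reported absent
b_after = lookup("b")           # bits 8 and 9 were cleared, so b is lost too
```

Here one false positive on x turns into two false negatives, for a and b, after x is deleted.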

Page 24:

Solution (2)

Introduces the Stateful Bloom Filter (SBF) approach.

Replace the bit vector used by standard Bloom filters with a cell vector.

Its false positive rate is lower than that of standard Bloom filters.

Note that the two filters do not use the same amount of storage space, so they need to be compared more carefully.
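A minimal sketch of a cell-based filter in the spirit of the Stateful Bloom Filter is given below. It is an interpretation, not the paper's specification: each cell is assumed to hold a state plus a counter, a cell written with two different states is assumed to become “don’t know”, and the names, the SHA-256 hashing, and the lookup rules are illustrative choices. The Update operation is left out.

```python
import hashlib

EMPTY = None
DONT_KNOW = "DK"    # sentinel; assumed never to be used as a real state

class StatefulBloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.state = [EMPTY] * m    # state stored in each cell
        self.count = [0] * m        # how many insertions touched each cell

    def _cells(self, flow_id):
        # A set avoids counting a cell twice when two hashes collide.
        cells = set()
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{flow_id}".encode()).digest()
            cells.add(int.from_bytes(digest[:8], "big") % self.m)
        return cells

    def put(self, flow_id, state):
        for c in self._cells(flow_id):
            if self.count[c] == 0:
                self.state[c] = state
            elif self.state[c] != state:
                self.state[c] = DONT_KNOW   # conflicting writes
            self.count[c] += 1

    def lookup(self, flow_id):
        seen = {self.state[c] for c in self._cells(flow_id)}
        if EMPTY in seen:
            return None                     # definitely not stored
        seen.discard(DONT_KNOW)
        if len(seen) == 1:
            return seen.pop()
        # No definite state left: every cell was DK. Otherwise states conflict.
        return DONT_KNOW if not seen else None

    def delete(self, flow_id):
        for c in self._cells(flow_id):
            if self.count[c] > 0:
                self.count[c] -= 1
                if self.count[c] == 0:
                    self.state[c] = EMPTY   # a DK cell only resets at count 0

sbf = StatefulBloomFilter(m=64, k=3)
sbf.put("flow-1", "SYN")
```

In this sketch `sbf.lookup("flow-1")` returns `"SYN"`, and a lookup that finds an empty cell or two conflicting definite states reports the flow as absent instead of returning a false positive.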

Page 25:

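[Figure: a Stateful Bloom Filter whose cells hold values instead of single bits, over data set A (x, y) and data set B (a, b, c, d), with hash functions h1, h2, h3, ..., hk. x does not belong to set B, and the lookup on the filter also makes the right judgment.]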

Page 26:

Solution (3)

An approach using d-left hashing

The authors do not explain, through formal comparison and analysis, why it is the best of the three solutions.

The simulation tries to show this, but the evidence is not strong enough; in particular, the solutions are not compared under the same space budget.

Page 27:

[Figure: cell vectors for data set A (x, y) and data set B (a, b, c, d).]

Page 28:

Questions needed to analyze

Analyze the relationship between false positives and false negatives, and try to give a formula.

If the old value of a cell was “don’t know”, the cell keeps that value until its register becomes 0.

Analyze the fraction of cells whose value is “don’t know”, and compute the rate of this error.

If the register drops to 1 from a larger value, the value “don’t know” should become a definite value, but the SBF cannot support this transformation.

If we use the idea of the SBF to redesign standard Bloom filters, can we achieve some benefits, such as a lower false positive rate?