Copyright 2011 Trend Micro Inc. 1 Mathematical Modeling for Practical Problems Liwei Ren, Ph.D Scientific Adviser, Trend Micro May 12, 2014, UC Santa Cruz, Silicon Valley Center, Santa Clara

Mathematical Modeling for Practical Problems

Apr 15, 2017


Liwei Ren
Transcript
Page 1: Mathematical Modeling for Practical Problems

Copyright 2011 Trend Micro Inc. 1

Mathematical Modeling for Practical Problems

Liwei Ren, Ph.D

Scientific Adviser, Trend Micro

May 12, 2014, UC Santa Cruz, Silicon Valley Center, Santa Clara

Page 2: Mathematical Modeling for Practical Problems


Background:

• Liwei Ren

– Research interests:

• DLP, cloud data security, network security, differential compression, math modeling & practical algorithms.

– Education:

• MS/BS in mathematics, Tsinghua University, Beijing

• Ph.D in mathematics, MS in information science, University of Pittsburgh

– Relevant works for this talk:

• Provilla : a startup focusing on endpoint based DLP products and solutions. It was co-founded by Liwei and acquired by Trend Micro.

• Patents: Liwei has been granted 20 patents in DLP and differential compression; most of this work includes strong algorithmic elements.

• Trend Micro™

– Global security software company with headquarters in Tokyo and R&D centers in Nanjing, Taipei and Silicon Valley.

– Acquired Provilla™ in 2007.

Page 3: Mathematical Modeling for Practical Problems

Agenda

• What Is a Math Model?

• A Process of Practice

• A Problem from a Startup

• Math Modeling

• Math Modeling Again

• Summary

Page 4: Mathematical Modeling for Practical Problems

What is a Math Model?

• A math model describes a practical problem in mathematical language:

– Using mathematical symbols, expressions, concepts, and even logic operations;

– Using mathematical equations;

– Using mathematical structures such as graphs;

– Using mathematical procedures such as algorithms.

• A math model may describe a practical problem approximately:

– It needs to include the most essential parts of the problem while ignoring unimportant features.

– However, we cannot go too far in ignoring features that seem unimportant.

Page 5: Mathematical Modeling for Practical Problems

What is a Math Model?

• A simple example:

– Problem: Two cars drive toward each other on a street, starting one and a half miles apart, at 4 miles/hr and 6 miles/hr respectively. A naughty dog runs back and forth between them at 20 miles/hr until the cars meet. What total distance, in miles, does the dog run?

Page 6: Mathematical Modeling for Practical Problems

What is a Math Model?

• A simple example:

– Analysis:

– To calculate the distance the dog runs, one needs the time T for which it runs. T is how long the two cars take to meet.

– T = D / (V1 + V2), where D is the initial distance and V1, V2 are the speeds of the two cars.

– Math model: d = V * D / (V1 + V2), where V is the dog's speed.

– Solution: d = 20 * 1.5 / (4 + 6) = 3 miles.
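The model can be checked numerically. The following is a minimal sketch, not part of the original slides: it evaluates the closed form and, as a cross-check, sums the dog's individual back-and-forth legs. The function names and the assumption that the dog starts at the first car are illustrative only.

```python
def dog_distance(D, v1, v2, v_dog):
    """Closed form: the dog runs for exactly as long as the cars take to meet."""
    return v_dog * D / (v1 + v2)


def dog_distance_by_legs(D, v1, v2, v_dog, eps=1e-12):
    """Sum the dog's legs one by one until the cars have (almost) met.

    Assumes the dog starts at the first car and is faster than both cars.
    """
    gap = D                 # current distance between the two cars
    total = 0.0             # distance the dog has run so far
    toward_second = True    # which car the dog is currently running toward
    while gap > eps:
        target_speed = v2 if toward_second else v1
        t = gap / (v_dog + target_speed)   # time until the dog meets the oncoming car
        total += v_dog * t
        gap -= (v1 + v2) * t               # both cars keep closing the gap meanwhile
        toward_second = not toward_second
    return total
```

Both functions agree on the slide's numbers (D = 1.5, V1 = 4, V2 = 6, V = 20), which illustrates why the simple model is enough: the leg-by-leg geometry is an unimportant feature that the model can ignore.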

Page 7: Mathematical Modeling for Practical Problems


What is a Math Model?

• A notable example:

– Seven Bridges of Königsberg (in Prussia, 18th century)

– Problem: find a walk through the city that crosses each bridge once and only once.

Page 8: Mathematical Modeling for Practical Problems

What is a Math Model?

• A notable example:

– Analysis: Leonhard Euler, 1735.

Page 9: Mathematical Modeling for Practical Problems

What is a Math Model?

• Classic example:

– Model: find a path (an Euler trail) that uses each edge of this undirected graph exactly once.

• Solution: Euler proved that no such path exists.

• Contribution: This problem helped start two important branches of modern mathematics: graph theory and topology.
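Euler's result can be reproduced with the standard criterion that a connected undirected multigraph has an Euler trail only if it has zero or two vertices of odd degree. The sketch below is illustrative: the vertex labels (A for the central island, B and C for the river banks, D for the eastern land mass) are an assumed labelling, since the slide's figure is not part of the transcript.

```python
from collections import Counter, defaultdict

# Assumed labelling of the four land masses; each pair is one of the seven bridges.
KOENIGSBERG_BRIDGES = [
    ("A", "B"), ("A", "B"),
    ("A", "C"), ("A", "C"),
    ("A", "D"), ("B", "D"), ("C", "D"),
]


def odd_degree_vertices(edges):
    """Return the vertices touched by an odd number of edges."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sorted(v for v, d in degree.items() if d % 2 == 1)


def is_connected(edges):
    """Depth-first search over the vertices that appear in at least one edge."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    if not adjacency:
        return True
    start = next(iter(adjacency))
    seen, stack = {start}, [start]
    while stack:
        for nxt in adjacency[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == len(adjacency)


def has_euler_trail(edges):
    """Euler's criterion: connected, with zero or two odd-degree vertices."""
    return is_connected(edges) and len(odd_degree_vertices(edges)) in (0, 2)
```

In this graph all four vertices have odd degree (5, 3, 3, 3), so `has_euler_trail(KOENIGSBERG_BRIDGES)` is false, matching Euler's conclusion.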

Page 10: Mathematical Modeling for Practical Problems


A Process of Practice

• Let me summarize a process from my experience:

– How to create mathematical models from practical problems.

Page 11: Mathematical Modeling for Practical Problems

A Problem from a Startup

• A conversation in 2004 :

Page 12: Mathematical Modeling for Practical Problems

A Problem from a Startup

• Text Model for constructing EvalSim:

Page 13: Mathematical Modeling for Practical Problems

A Problem from a Startup

• A conversation in 2004 :

Page 14: Mathematical Modeling for Practical Problems

A Problem from a Startup

• A conversation in 2004 :

Data Inspection Problem: S is a set of documents. For any document d, we need to find the documents D in S such that EvalSim(D, d) ≥ X%.

Page 15: Mathematical Modeling for Practical Problems


A Problem from a Startup

• A conversation in 2004 :

Page 16: Mathematical Modeling for Practical Problems

Math Modeling

• To solve the DLP Data Inspection Problem, we introduce the concept of fingerprints:

1. To identify unique and robust features from a string;

2. To generate fingerprints from these features by hashing.

• Given a string T, we denote its fingerprints as:

– SFP(T) = {FP1, FP2, …, FPm}, where m = m(T) is the number of fingerprints of T.

NOTE: Many years later, we realized that this problem is close to the problem of Near Duplicate Document Detection.

Page 17: Mathematical Modeling for Practical Problems


Math Modeling

• With fingerprints, the problem is divided into two parts:

– Indexing:

• Each string T ∊ S is assigned a unique string ID (SID). We generate its fingerprints SFP(T), then index the SID under every fingerprint in SFP(T).

• The whole index is stored in FP-DB.

– Searching + Matching:

• For a given T, we compute SFP(T) and search it against FP-DB to identify candidates (i.e., suspects) for similar strings, say {t1, t2, …, tk}.

• Calculate EvalSim(T, tj) for j = 1, 2, …, k.

• Pick those with EvalSim(T, tj) ≥ X% as the result.

• This is similar to keyword-based search if we view fingerprints as keywords.

• What remains:

– How to generate fingerprints from a given string?
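The indexing and searching + matching steps can be sketched as a small inverted index. This is an illustrative sketch under stated assumptions, not the product's implementation: `FingerprintIndex` and `find_similar` are hypothetical names, FP-DB is modeled as an in-memory dictionary, and both the fingerprint generator `sfp` and the similarity function `eval_sim` are supplied by the caller because the slides define them elsewhere.

```python
from collections import defaultdict


class FingerprintIndex:
    """A toy in-memory FP-DB: maps each fingerprint to the set of SIDs that contain it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, sid, fingerprints):
        """Indexing: register the SID under every one of its fingerprints."""
        for fp in fingerprints:
            self._postings[fp].add(sid)

    def candidates(self, fingerprints):
        """Searching: every SID that shares at least one fingerprint with the query."""
        suspects = set()
        for fp in fingerprints:
            suspects |= self._postings.get(fp, set())
        return suspects


def find_similar(query, corpus, index, sfp, eval_sim, threshold):
    """Matching: keep only the candidates whose similarity reaches the threshold.

    `corpus` maps SID -> string; `threshold` is X% expressed as a fraction in [0, 1].
    """
    suspects = index.candidates(sfp(query))
    return sorted(sid for sid in suspects if eval_sim(query, corpus[sid]) >= threshold)
```

The design mirrors keyword search: the index narrows a large set S to a few suspects cheaply, and the more expensive EvalSim runs only on those.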

Page 18: Mathematical Modeling for Practical Problems

Math Modeling

• String fingerprints:

1. Fingerprints are generated from features of a given string.

2. Robust: we expect SFP(T1) ∩ SFP(T2) ≠ NIL if T1 and T2 are similar;

3. Unique: we expect SFP(T1) ∩ SFP(T2) = NIL if T1 and T2 are unrelated.

• How to select robust and unique features?

– Selecting anchor points may be a good choice.

– A character in the string is an anchor point if:

• Its neighborhood (of fixed length M) is, with high probability, a common sub-string across similar strings.

– A fingerprint is generated by hashing the neighborhood:

• When M is long enough, we should have uniqueness;

• The high probability means robustness, i.e., resilience to changes.

Page 19: Mathematical Modeling for Practical Problems

Math Modeling

• Anchor points and fingerprints:

• How to identify anchor points?

Page 20: Mathematical Modeling for Practical Problems

Math Modeling

• Review: A character in the string is an anchor point if:

• Its neighborhood is, with high probability, a common sub-string across similar strings.

• This definition is not rigorous.

• Let us try to describe anchor points rigorously:

– That is what mathematical modeling is about.

• Math Modeling for Anchor Points:

– Let A = {0x00, 0x01, …, 0xFF} be the binary alphabet.

– Let K be a small integer (say, 5). We select K different characters from A to identify anchor point candidates.

– Two requirements:

1. The candidates must occur with high frequency in the given string;

2. They are distributed as evenly as possible.

Page 21: Mathematical Modeling for Practical Problems


Math Modeling

• Math Modeling for Anchor Points:

– We use a score function F to describe the requirements:

[Formula for F(b) not preserved in the transcript.]

where b ∈ A, n is the number of occurrences of character b, and {P1, P2, …, Pn} are the offsets of b in the string.

– The 1st term measures the frequency of character b (intuitively).

– The 2nd term measures its distribution.

• WHY?

Page 22: Mathematical Modeling for Practical Problems

Math Modeling

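• Let us consider the constrained optimization problem:

[Objective function not preserved in the transcript.]

where X1 + X2 + … + Xm = C (C is a constant), and Xi ≥ 0, i = 1, 2, …, m.

• It is equivalent to the problem:

[Objective function not preserved in the transcript.]

where X1 + X2 + … + Xm = C and Xi ≥ 0, i = 1, 2, …, m.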

Page 23: Mathematical Modeling for Practical Problems

Math Modeling

• Its solution is Xi = C/m, i = 1, 2, …, m.

• It means the even distribution of character b in the string:

– Let Xi = Pi+1 − Pi, i = 1, 2, …, m, with m = n − 1;

– For an even distribution, we have Pi+1 − Pi = C/(n − 1) for i = 1, 2, …, n − 1.

– Meaning: if character b appears n times within a range of constant length C, F(b) achieves its maximum value when the occurrences are evenly distributed!
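The even-spacing conclusion can be illustrated with a short derivation. The transcript does not preserve the slide's objective function, so the block below assumes, purely for illustration, that the distribution term penalizes the sum of squared gaps; other objectives lead to the same equal-gap optimum.

```latex
\begin{aligned}
&\text{Assumed problem: } \min_{X_1,\dots,X_m} \sum_{i=1}^{m} X_i^2
  \quad \text{subject to} \quad \sum_{i=1}^{m} X_i = C,\; X_i \ge 0. \\
&\text{By the Cauchy--Schwarz inequality: }
  C^2 = \Big(\sum_{i=1}^{m} 1 \cdot X_i\Big)^2 \le m \sum_{i=1}^{m} X_i^2
  \;\Longrightarrow\; \sum_{i=1}^{m} X_i^2 \ge \frac{C^2}{m}. \\
&\text{Equality holds if and only if } X_1 = X_2 = \dots = X_m = \frac{C}{m}. \\
&\text{With } X_i = P_{i+1} - P_i \text{ and } m = n - 1: \quad P_{i+1} - P_i = \frac{C}{n-1}.
\end{aligned}
```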

Page 24: Mathematical Modeling for Practical Problems

Math Modeling

• With this score function F(b), we select the K characters {b1, b2, …, bK} from A with the top K scores.

• For each selected character bk, at each of its occurrences in the string, we generate a fingerprint from its neighborhood with a hash function H1.

• We obtain a set of fingerprints {FP1, FP2, …, FPn}.

• We sort them in ascending order and pick the first N fingerprints. The number N may be pre-selected depending on the string size.
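The selection procedure can be sketched as follows. This is a hedged illustration, not DataDNA 1.0 itself: the score function F is not preserved in the transcript, so `score` is a caller-supplied parameter that receives the list of offsets of a character. The hash `h1` (BLAKE2b truncated to 8 bytes) and the choice of the neighborhood as the M bytes starting at the anchor are also assumptions.

```python
import hashlib


def h1(chunk: bytes) -> int:
    """Stand-in for the hash function H1 (an assumption, not the slide's choice)."""
    return int.from_bytes(hashlib.blake2b(chunk, digest_size=8).digest(), "big")


def model1_fingerprints(data: bytes, score, K: int = 5, M: int = 16, N: int = 4) -> set:
    """Pick the K top-scoring byte values, hash each occurrence's neighborhood,
    and keep the N smallest fingerprints per selected byte value."""
    # Collect the offsets {P1, ..., Pn} of every byte value b.
    offsets = {}
    for pos, b in enumerate(data):
        offsets.setdefault(b, []).append(pos)

    # Rank byte values by the caller-supplied score of their offsets.
    selected = sorted(offsets, key=lambda b: score(offsets[b]), reverse=True)[:K]

    result = set()
    for b in selected:
        # Neighborhood assumed to be the M bytes starting at the anchor point.
        fps = sorted({h1(data[p:p + M]) for p in offsets[b] if p + M <= len(data)})
        result.update(fps[:N])   # ascending order, first N
    return result
```

Passing `score=len` ranks characters by frequency alone, which covers only the first of the two requirements; a faithful score would also reward even spacing of the offsets.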

Page 25: Mathematical Modeling for Practical Problems

Math Modeling

• We get K*N anchor points ( to generate K*N fingerprints).

• We are done with modeling the anchor points:

– It should be very easy to provide an algorithm based on the model.

• Let us name this math model (of anchor points) MODEL 1.

• With MODEL 1, we developed an algorithm to generate fingerprints from a given string:

– DataDNA 1.0.

• With DataDNA 1.0, we solve the DLP Data Inspection Problem:

S is a set of documents. For any document d, we need to find the documents D in S such that EvalSim(D, d) ≥ X%.

Page 26: Mathematical Modeling for Practical Problems


Math Modeling Again

• Before long, we started to face a few challenges:

1. If more than 60% of a document D is changed, the new document d may share 0 fingerprints with D;

2. Our customers challenged us with a question:

• If we copy & paste a small text into a very large document, does your DLP Data Inspection technology still work?

3. Due to a product architecture change, we replaced the original EvalSim with a new formula:

[Formula not preserved in the transcript.]

NOTE: This is because the original EvalSim has to compare two strings byte by byte for common sub-strings. The new formula is based on the number of common fingerprints.

• We have an issue: the anchor points selected by DataDNA 1.0 are not evenly distributed over the string, so EvalSim() as calculated above is not as accurate as expected. We need to fix it!

Page 27: Mathematical Modeling for Practical Problems


Math Modeling Again

• We had to propose a new model to select anchor points.

– This time we use a rolling hash H to describe anchor points.

NOTE 1: Many applications use a similar trick to identify anchor points:

• Data de-duplication (cut points)

• SSDEEP

NOTE 2: We can use

• Karp-Rabin rolling hash, or

• Adler-32.

Page 28: Mathematical Modeling for Practical Problems


Math Modeling Again

• After identifying anchor points, we can generate fingerprints from the right neighborhoods (of anchor points) with another hash function h:

– This h can be a regular hash function; however, it is better to use a second rolling hash for performance.

Page 29: Mathematical Modeling for Practical Problems

Math Modeling Again

• This is MODEL 2 for describing anchor points. It can solve the 3 issues that we raised.

• WHY?

– Statistically, the condition H(x) = 0 mod p gives us one anchor point per p consecutive characters on average.

– This is close to our expectation:

• Even distribution of anchor points.
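The rolling-hash idea can be sketched in a few lines. This is a minimal illustration in the spirit of a Karp-Rabin hash, not DataDNA 2.0: the constants `BASE` and `MOD`, the window length, the divisor `p`, the neighborhood length `m`, and the use of BLAKE2b for the second hash h are all assumptions made for the example.

```python
import hashlib

BASE = 257              # assumed polynomial base
MOD = (1 << 61) - 1     # assumed modulus (a Mersenne prime)


def anchor_points(data: bytes, window: int = 8, p: int = 16) -> list:
    """Offsets whose `window`-byte window has rolling hash H with H mod p == 0."""
    if len(data) < window:
        return []
    high = pow(BASE, window - 1, MOD)   # weight of the byte that leaves the window
    h = 0
    for byte in data[:window]:
        h = (h * BASE + byte) % MOD
    anchors = []
    for start in range(len(data) - window + 1):
        if start > 0:
            # Roll: drop data[start - 1], shift, append the new last byte.
            h = ((h - data[start - 1] * high) * BASE + data[start + window - 1]) % MOD
        if h % p == 0:
            anchors.append(start)
    return anchors


def fingerprints(data: bytes, window: int = 8, p: int = 16, m: int = 16) -> set:
    """Hash the right neighborhood (assumed: m bytes from the anchor) of each anchor."""
    fps = set()
    for a in anchor_points(data, window, p):
        chunk = data[a:a + m]
        if len(chunk) == m:             # skip neighborhoods cut off by the end
            fps.add(hashlib.blake2b(chunk, digest_size=8).hexdigest())
    return fps
```

Because each anchor and its fingerprint depend only on nearby bytes, a passage pasted into a much larger document keeps its fingerprints in this sketch, which is the property the copy & paste challenge asks for.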

Page 30: Mathematical Modeling for Practical Problems

Math Modeling Again

• With MODEL 2, we developed an algorithm to generate fingerprints from a given string.

– DataDNA 2.0

• With DataDNA 2.0, we solve the DLP Data Inspection Problem with a better solution and a simple EvalSim function:

[EvalSim formula not preserved in the transcript.]

S is a set of documents. For any document d, we need to find the documents D in S such that EvalSim(D, d) ≥ X%.

Page 31: Mathematical Modeling for Practical Problems


Summary

• We proposed a process for math modeling of real-world problems.

• We practiced the process on the DLP Data Inspection Problem.

– Proposed by a DLP startup many years ago.

• The problem was reduced to a string fingerprinting problem:

• MODEL 1 was introduced to describe anchor points for generating fingerprints.

• MODEL 2 was introduced to describe evenly distributed anchor points for generating fingerprints.

Page 32: Mathematical Modeling for Practical Problems


Summary

• The problem of DLP Data Inspection has been studied as the problem of Near Duplicate Document Detection.

• Many applications:

– Data leak prevention

– Document classification and clustering

– Anti-plagiarism

– eDiscovery

– Web search engine: index optimization.

– More….

Page 33: Mathematical Modeling for Practical Problems

Q&A

• Thank you for your attention.

• Do you have questions?

Page 34: Mathematical Modeling for Practical Problems

References

1. US Patent 8359472, Document fingerprinting with asymmetric selection of anchor points, Jan 2013

2. US Patent 8266150, Scalable document signature search engine, Sep 2012

3. US Patent 7860853, Document matching engine using asymmetric signature generation, Dec 28, 2010

4. US Patent 7516130, Matching engine with signature generation, April 2009

5. My information:

– Email: [email protected]

– LinkedIn: http://www.linkedin.com/in/drliweiren

– Academic Space: https://pittsburgh.academia.edu/LiweiRen