Kargus: A Highly‐scalable Software‐based Intrusion Detection System
Challenge 1: Packet Acquisition
• Default packet capture module: Packet CAPture (PCAP) library
– Unsuitable for multi-core environments
– Low performance
– High power consumption
• A multi-core packet capture library is required
▲ Figure: pcap-based packet acquisition — Cores 1-5 and 7-11 read from 10 Gbps NICs A-D; packet RX bandwidth: 0.4-6.7 Gbps at 100 % CPU utilization (measured on an Intel Xeon X5680, 3.33 GHz, 12 MB L3 cache).
Solution: PacketShader I/O
• PacketShader I/O
– Uniformly distributes packets across cores based on flow information via RSS hashing
• Source/destination IP addresses, port numbers, protocol ID
– One core can read packets from the RSS queues of multiple NICs
– Reads packets in batches (32 ~ 4096); see the sketch below
• Symmetric Receive-Side Scaling (RSS)
– Passes the packets of one connection to the same queue
* S. Han et al., “PacketShader: a GPU‐accelerated software router”, ACM SIGCOMM 2010
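To make the per-core reading pattern concrete, here is a minimal sketch of the acquisition loop described above. It is written against a hypothetical, psio-style driver interface: rx_queue_read_batch(), the constants, and the stubbed behavior are illustrative placeholders, not the actual PacketShader I/O API.

    #include <stdint.h>
    #include <stdio.h>

    #define MAX_NICS   2     /* e.g., 10 Gbps NIC A and NIC B */
    #define BATCH_SIZE 64    /* batched reads; the slide cites 32 ~ 4096 */

    struct pkt { const uint8_t *data; uint16_t len; };

    /* Stub for a driver call that fills up to max_pkts packets from RSS
       queue `queue_id` of NIC `nic_id` and returns how many were read. */
    static int rx_queue_read_batch(int nic_id, int queue_id,
                                   struct pkt *pkts, int max_pkts)
    {
        (void)nic_id; (void)queue_id; (void)pkts; (void)max_pkts;
        return 0;  /* no traffic in this stub */
    }

    /* One acquisition cycle of an engine core: the core owns RSS queue
       `core_id` on every NIC and drains each queue in batches. */
    static void acquisition_cycle(int core_id)
    {
        struct pkt batch[BATCH_SIZE];
        for (int nic = 0; nic < MAX_NICS; nic++) {
            int cnt = rx_queue_read_batch(nic, core_id, batch, BATCH_SIZE);
            for (int i = 0; i < cnt; i++) {
                /* packets of one connection always arrive on this queue
                   (symmetric RSS), so flow state can stay core-local */
            }
        }
    }

    int main(void)
    {
        acquisition_cycle(0);  /* e.g., core 1's engine runs this repeatedly */
        printf("acquisition cycle completed\n");
        return 0;
    }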
▲ Figure: PacketShader I/O packet acquisition — each of Cores 1-5 reads from its own RSS queue on every NIC (RxQA1-RxQA5 on NIC A, RxQB1-RxQB5 on NIC B); packet RX bandwidth rises from 0.4-6.7 Gbps to 40 Gbps while CPU utilization drops from 100 % to 16-29 %.
Challenge 2: Pattern Matching
• CPU-intensive task: scanning packet payloads serially
• Major bottlenecks
– Multi-string matching (Aho-Corasick phase)
– PCRE evaluation (if a 'pcre' rule option exists in the rule)
• On an Intel Xeon X5680 (3.33 GHz, 12 MB L3 cache)
– Aho-Corasick analyzing bandwidth per core: 2.15 Gbps
– PCRE analyzing bandwidth per core: 0.52 Gbps
Solution: GPU for Pattern Matching
• GPUs
– Contain hundreds of SIMD processors
• 512 cores on an NVIDIA GTX 580
– Ideal for parallel data processing without branches
• DFA-based pattern matching on GPUs
– Multi-string matching using the Aho-Corasick algorithm
– PCRE matching
• Pipelined execution on CPU/GPU
– Concurrent copy and execution
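As a concrete illustration of DFA-based matching on the GPU, below is a minimal CUDA sketch — not Kargus' actual kernel, which adds further optimizations — in which each GPU thread walks one packet's payload through a full-matrix Aho-Corasick DFA and records whether it reached an accepting state. The toy three-state DFA in main() only matches the string "AB" and exists just to make the example self-contained.

    #include <stdint.h>
    #include <stdio.h>
    #include <cuda_runtime.h>

    /* One GPU thread scans one packet against a full-matrix DFA (|states| x 256). */
    __global__ void ac_match_kernel(const uint8_t *payloads, const int *offsets,
                                    const int *lengths, int num_pkts,
                                    const int *dfa, const uint8_t *accepting,
                                    int *match_out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= num_pkts) return;

        const uint8_t *p = payloads + offsets[i];
        int state = 0;                        /* DFA start state */
        int matched = 0;
        for (int j = 0; j < lengths[i]; j++) {
            state = dfa[state * 256 + p[j]];  /* one table lookup per payload byte */
            matched |= accepting[state];
        }
        match_out[i] = matched;
    }

    int main(void)
    {
        /* Toy 3-state DFA accepting any payload that contains "AB":
           state 0 = start, 1 = just saw 'A', 2 = matched (absorbing). */
        static int dfa[3 * 256];
        uint8_t accepting[3] = {0, 0, 1};
        for (int s = 0; s < 2; s++)
            for (int c = 0; c < 256; c++)
                dfa[s * 256 + c] = (c == 'A') ? 1 : 0;
        for (int c = 0; c < 256; c++)
            dfa[2 * 256 + c] = 2;
        dfa[1 * 256 + 'B'] = 2;

        const uint8_t payload[] = "xxAByy";
        int offsets[1] = {0}, lengths[1] = {6}, match[1] = {-1};

        uint8_t *d_pay, *d_acc;
        int *d_off, *d_len, *d_dfa, *d_match;
        cudaMalloc(&d_pay, sizeof payload);   cudaMalloc(&d_acc, sizeof accepting);
        cudaMalloc(&d_off, sizeof offsets);   cudaMalloc(&d_len, sizeof lengths);
        cudaMalloc(&d_dfa, sizeof dfa);       cudaMalloc(&d_match, sizeof match);
        cudaMemcpy(d_pay, payload, sizeof payload, cudaMemcpyHostToDevice);
        cudaMemcpy(d_acc, accepting, sizeof accepting, cudaMemcpyHostToDevice);
        cudaMemcpy(d_off, offsets, sizeof offsets, cudaMemcpyHostToDevice);
        cudaMemcpy(d_len, lengths, sizeof lengths, cudaMemcpyHostToDevice);
        cudaMemcpy(d_dfa, dfa, sizeof dfa, cudaMemcpyHostToDevice);

        /* In the real system, thousands of packets are batched per launch and
           cudaMemcpyAsync with streams overlaps the copies with execution. */
        ac_match_kernel<<<1, 32>>>(d_pay, d_off, d_len, 1, d_dfa, d_acc, d_match);
        cudaMemcpy(match, d_match, sizeof match, cudaMemcpyDeviceToHost);
        printf("packet 0 matched: %d\n", match[0]);  /* expect 1 ("AB" is present) */
        return 0;
    }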
▲ Figure: Each engine thread (Packet Acquisition → Preprocess → Multi-string Matching → Rule Option Evaluation) offloads work to a GPU dispatcher thread through a multi-string matching queue and a PCRE matching queue; the dispatcher runs both matchers on the GPU. Aho-Corasick bandwidth: 2.15 Gbps (one CPU core) → 39 Gbps (GPU); PCRE bandwidth: 0.52 Gbps → 8.9 Gbps.
Optimization 1: IDS Architecture
• How do we best utilize the multi-core architecture?
• Pattern matching is the eventual bottleneck
• Run the entire engine on each core (see the sketch below)
Function                    Time %    Module
acsmSearchSparseDFA_Full    51.56     multi-string matching
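A minimal sketch of how "one full engine per core" can be realized on Linux with pthread CPU affinity; the thread count and function names are illustrative, not Kargus' actual code.

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    #define NUM_ENGINES 5   /* e.g., cores 0-4 run engines; another core hosts the GPU dispatcher */

    /* Each engine thread runs the whole pipeline for its own packets:
       acquisition -> preprocess -> multi-string matching -> rule option evaluation. */
    static void *engine_loop(void *arg)
    {
        long core = (long)arg;
        printf("engine pinned to core %ld\n", core);
        return NULL;   /* real code would loop over acquisition cycles here */
    }

    int main(void)
    {
        pthread_t tid[NUM_ENGINES];
        for (long c = 0; c < NUM_ENGINES; c++) {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET((int)c, &set);
            pthread_create(&tid[c], NULL, engine_loop, (void *)c);
            /* pin the freshly created engine thread to core c */
            pthread_setaffinity_np(tid[c], sizeof(set), &set);
        }
        for (int c = 0; c < NUM_ENGINES; c++)
            pthread_join(tid[c], NULL);
        return 0;
    }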
▲ Figure: One complete engine thread (Packet Acquisition → Preprocess → Multi-string Matching → Rule Option Evaluation) runs on each of Cores 1-5; a single GPU dispatcher thread is pinned to its own core (Core 6).
Architecture
• Non-Uniform Memory Access (NUMA)-aware
• Core framework as deployed on a dual hexa-core system
• Can be configured for various NUMA set-ups accordingly
▲ Kargus configuration on a dual-NUMA, hexa-core machine with 4 NICs and 2 GPUs
Optimization 2: GPU Usage
• Caveats
– Long per-packet processing latency
• Buffering in the GPU dispatcher
– More power consumption
• NVIDIA GTX 580: 512 cores
• Use:
– the CPU when the ingress rate is low (GPU stays idle)
– the GPU when the ingress rate is high
Solution: Dynamic Load Balancing
• Load balancing between CPU & GPU
– Reads packets from the NIC queues each cycle
– Analyzes a small number of packets per cycle (rates a < b < c)
– Increases the analyzing rate as the internal queue length grows
– Activates the GPU if the queue length keeps growing
▲ Figure: per-engine internal packet queue with length thresholds α, β, γ; the regions map to CPU analysis at rates a and b and to GPU offloading at rate c. Per-packet latency: 640 μsec with the GPU vs. 13 μsec with the CPU.
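A compact sketch of this decision logic, keeping the slide's names (thresholds α < β < γ on the internal queue length, per-cycle analyzing rates a < b < c). The concrete numbers and the exact threshold-to-regime mapping are illustrative assumptions, not Kargus' actual values.

    #include <stddef.h>
    #include <stdio.h>

    enum analyzer { USE_CPU, USE_GPU };

    struct lb_decision {
        enum analyzer where;      /* where to analyze this cycle */
        size_t pkts_per_cycle;    /* how many queued packets to analyze */
    };

    /* Thresholds on the per-engine internal packet queue (alpha < beta < gamma)
       and per-cycle analyzing rates (a < b < c); the values are placeholders. */
    static struct lb_decision decide(size_t queue_len)
    {
        const size_t alpha = 1024, beta = 4096, gamma = 16384;
        const size_t a = 32, b = 128, c = 512;
        struct lb_decision d;

        if (queue_len < alpha)      { d.where = USE_CPU; d.pkts_per_cycle = a; }
        else if (queue_len < beta)  { d.where = USE_CPU; d.pkts_per_cycle = b; }
        else if (queue_len < gamma) { d.where = USE_CPU; d.pkts_per_cycle = c; }
        else                        { d.where = USE_GPU; d.pkts_per_cycle = c; }
        return d;
    }

    int main(void)
    {
        size_t samples[] = { 100, 2000, 8000, 60000 };
        for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
            struct lb_decision d = decide(samples[i]);
            printf("queue=%zu -> %s, %zu pkts/cycle\n", samples[i],
                   d.where == USE_GPU ? "GPU" : "CPU", d.pkts_per_cycle);
        }
        return 0;
    }

Staying on the CPU at low load also avoids the GPU batching latency quoted in the figure (640 μsec vs. 13 μsec per packet).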
Optimization 3: Batched Processing
• Huge per-packet processing overhead
– > 10 million packets per second for small-sized packets at 10 Gbps
– Reduces the overall processing throughput
• Function-call batching (see the sketch below)
– Reads a group of packets from the RX queues at once
– Passes the batch of packets to each function
• Batching from packet acquisition to the output of the detection engine (GPU): 1.9x higher throughput
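A sketch of what function-call batching means for the engine pipeline: each stage is called once per batch of packets instead of once per packet. The stage names follow the slides; the empty bodies are placeholders.

    #include <stdint.h>

    struct pkt { const uint8_t *data; uint16_t len; };

    /* Each stage receives the whole batch, amortizing the call overhead
       (and improving instruction-cache locality) over n packets. */
    static void preprocess_batch(struct pkt *pkts, int n)         { (void)pkts; (void)n; }
    static void multi_string_match_batch(struct pkt *pkts, int n) { (void)pkts; (void)n; }
    static void rule_option_eval_batch(struct pkt *pkts, int n)   { (void)pkts; (void)n; }

    static void engine_cycle(struct pkt *pkts, int n)
    {
        preprocess_batch(pkts, n);          /* one call for all n packets */
        multi_string_match_batch(pkts, n);
        rule_option_eval_batch(pkts, n);
    }

    int main(void)
    {
        struct pkt batch[64] = {{0}};
        engine_cycle(batch, 64);
        return 0;
    }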
▲ Figure: power consumption vs. ingress traffic rate — "Always GPU"* (offloads to the GPU unless the packet size is too small) vs. Kargus' power-efficient, opportunistic offloading to GPUs: 15 % power saving.
* G. Vasiliadis, M. Polychronakis, and S. Ioannidis, "MIDeA: a multi-parallel intrusion detection architecture", ACM CCS 2011
Receive‐Side Scaling (RSS)
• RSS uses the Toeplitz hash function (with a random secret key, RSK)
Algorithm: RSS Hash Computation
function ComputeRSSHash(Input[], RSK)ret = 0;for each bit b in Input[] do
if b == 1 thenret ^= (left‐most 32 bits of RSK);
endifshift RSK left 1 bit position;
end forend function
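The pseudocode above maps directly to the following C sketch of the Toeplitz hash over a byte string. The 12-byte srcIP | dstIP | srcPort | dstPort layout and the values in main() are illustrative assumptions.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Toeplitz/RSS hash of `len` input bytes under a 40-byte secret key (RSK).
       For every set input bit (MSB first), the current left-most 32 bits of the
       key are XORed into the result; the key window then slides left by one bit. */
    static uint32_t toeplitz_hash(const uint8_t *input, size_t len, const uint8_t *rsk)
    {
        uint32_t ret = 0;
        uint32_t window = ((uint32_t)rsk[0] << 24) | ((uint32_t)rsk[1] << 16) |
                          ((uint32_t)rsk[2] << 8)  |  (uint32_t)rsk[3];
        size_t next_bit = 32;  /* index of the next key bit to slide into the window */

        for (size_t i = 0; i < len; i++) {
            for (int b = 7; b >= 0; b--) {
                if (input[i] & (1u << b))
                    ret ^= window;
                window = (window << 1) |
                         ((rsk[next_bit / 8] >> (7 - next_bit % 8)) & 1u);
                next_bit++;
            }
        }
        return ret;
    }

    int main(void)
    {
        /* Hypothetical IPv4 TCP 4-tuple: 10.0.0.1:1234 -> 10.0.0.2:80,
           laid out as srcIP | dstIP | srcPort | dstPort in network byte order. */
        const uint8_t tuple[12] = { 10,0,0,1, 10,0,0,2, 0x04,0xd2, 0x00,0x50 };
        uint8_t rsk[40];
        for (int i = 0; i < 40; i++)
            rsk[i] = (uint8_t)(i * 37 + 11);   /* placeholder "random" key */

        printf("RSS hash = 0x%08x\n", toeplitz_hash(tuple, sizeof tuple, rsk));
        return 0;
    }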
Symmetric Receive‐Side Scaling
• Update the RSK so that both directions of a connection hash to the same value (Shinae et al.)
Default RSK:
    0x6d5a 0x56da 0x255b 0x0ec2 0x4167 0x253d 0x43a3 0x8fb0 0xd0ca 0x2bcb
    0xae7b 0x30b4 0x77cb 0x2d3a 0x8030 0xf20c 0x6a42 0xb73b 0xbeac 0x01fa
Symmetric RSK (every 16-bit word set to 0x6d5a):
    0x6d5a 0x6d5a 0x6d5a 0x6d5a 0x6d5a 0x6d5a 0x6d5a 0x6d5a 0x6d5a 0x6d5a
    0x6d5a 0x6d5a 0x6d5a 0x6d5a 0x6d5a 0x6d5a 0x6d5a 0x6d5a 0x6d5a 0x6d5a
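Replacing the main() of the previous sketch with the check below illustrates the property that motivates this key: with every 16-bit word of the RSK set to 0x6d5a, both directions of a connection produce the same hash and therefore land in the same RSS queue. The addresses and ports are again hypothetical.

    /* Uses toeplitz_hash() from the previous sketch. */
    int main(void)
    {
        uint8_t rsk[40];
        for (int i = 0; i < 40; i += 2) { rsk[i] = 0x6d; rsk[i + 1] = 0x5a; }

        /* 10.0.0.1:1234 <-> 10.0.0.2:80, forward and reverse directions. */
        const uint8_t fwd[12] = { 10,0,0,1, 10,0,0,2, 0x04,0xd2, 0x00,0x50 };
        const uint8_t rev[12] = { 10,0,0,2, 10,0,0,1, 0x00,0x50, 0x04,0xd2 };

        uint32_t h1 = toeplitz_hash(fwd, sizeof fwd, rsk);
        uint32_t h2 = toeplitz_hash(rev, sizeof rev, rsk);
        printf("fwd=0x%08x rev=0x%08x -> %s\n", h1, h2,
               h1 == h2 ? "same queue" : "different queues");
        return 0;
    }

Because the key repeats with a 16-bit period, the 32-bit key window seen by an input bit depends only on its offset modulo 16; swapping the two IP addresses (offset difference of 32 bits) or the two ports (offset difference of 16 bits) therefore leaves the hash unchanged.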
Why use a GPU?
▲ Figure: NVIDIA GTX 580 (512 cores; die area dominated by ALUs) vs. Intel Xeon X5680 (6 cores; large control logic and caches).
* Adapted from the NVIDIA CUDA C Programming Guide, Version 4.2 (Figure 1-2)
GPU Microbenchmarks – Aho-Corasick
▲ Chart: Aho-Corasick matching throughput (Gbps) vs. batch size (32 to 16,384 packets per batch); GPU throughput (2 B per DFA entry) reaches 39 Gbps, compared to 2.15 Gbps on a CPU core.
GPU Microbenchmarks – PCRE
▲ Chart: PCRE matching throughput (Gbps) vs. batch size (32 to 16,384 packets per batch); GPU throughput reaches 8.9 Gbps, compared to 0.52 Gbps on a CPU core.
Effects of NUMA-aware Data Placement
• Minimal use of global variables
– Avoids compulsory cache misses
– Eliminates cross-NUMA cache-bouncing effects
▲ Chart: performance speedup (roughly 1x to 2.8x) from NUMA-aware data placement vs. packet size (64 to 1518 bytes), for innocent and malicious traffic.
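One way to realize this (a sketch, not Kargus' actual data layout): keep all mutable engine state in a per-engine context that the pinned engine thread allocates and initializes itself, so Linux's first-touch policy places it on the local NUMA node and nothing is shared, or bounced, across nodes.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Illustrative per-engine state that would otherwise live in globals. */
    struct engine_ctx {
        uint64_t pkts_seen;
        uint64_t alerts;
        /* flow tables, DFA scratch state, counters, ... */
    };

    /* Called by an engine thread that is already pinned to its core: because
       this thread performs both the allocation and the first write, the memory
       is placed on the thread's local NUMA node (first-touch) and is never
       touched by engines running on the other node. */
    static struct engine_ctx *engine_ctx_create(void)
    {
        struct engine_ctx *ctx = (struct engine_ctx *)malloc(sizeof *ctx);
        if (ctx)
            memset(ctx, 0, sizeof *ctx);
        return ctx;
    }

    int main(void)
    {
        struct engine_ctx *ctx = engine_ctx_create();
        printf("per-engine context allocated at %p\n", (void *)ctx);
        free(ctx);
        return 0;
    }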
CPU-only analysis for small-sized packets
• Offloading small-sized packets to the GPU is expensive
– Contention on the page-locked, DMA-accessible memory shared with the GPU
– The GPU-side cost of handling per-packet metadata increases