Page 1
@ITCAMPRO #ITCAMP17 Community Conference for IT Professionals
Scaling Face Recognition
with Big Data
Bogdan BOCȘE
Solutions Architect & Co-founder VisageCloud
https://VisageCloud.com
https://www.linkedin.com/in/bogdanbocse/
https://twitter.com/bocse
Page 2
Many thanks to our sponsors & partners!
Page 3
Agenda
• How to learn?
• What to learn?
• Defining learning objectives
• How to scale learning?
• Gotchas
• VisageCloud
  – Architecture
  – Use Cases
Page 4
Objectives
• What questions to ask before writing the code?
• How to look at the data before feeding it to the machine?
• What is the state of the art regarding ML?
• What frameworks to use?
• What are the common traps to avoid?
• How to design for scale?
Page 5
HOW TO LEARN?
Page 6
How to Learn?
Vision
• Convolutional Neural Networks
• Inception Paper
NLP
• Word2Vec
• GloVe: Global Vectors for Word Representation
Generic
• Classification
• Prediction
Page 7
Convolutional Neural Networks: Big Picture
Page 8
Convolutional Neural Networks: Components
• Pooling / Max Pooling
• Convolution
• Fully Connected
• Activation
  – Activation function, e.g. ReLU
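The components listed on this slide can be sketched in a few lines of plain Python. This is an illustrative toy, not how a framework implements them: the function names, the 5×5 image, and the 2×2 edge kernel are all assumptions made for the example.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2D cross-correlation, the operation CNN layers call convolution."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Activation function: negative responses are clipped to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; trailing rows/columns that do not fill a window are dropped."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + dr][c * size + dc]
                for dr in range(size) for dc in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Hypothetical input: dark on the left, bright on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical kernel that responds where brightness increases from left to right.
kernel = [[-1, 1], [-1, 1]]

features = max_pool(relu(conv2d(image, kernel)))
```

The pooled feature map is non-zero only in the column that covers the dark-to-bright edge, which is the kind of local pattern a convolutional layer is meant to pick up.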
Page 9
Taking a Step Back: The Math
• Learning is an optimization problem
  – Find the parameters of a system (neural network) that minimize a fixed error function
  – Not unlike planning orbital paths
• Defining the network architecture
• Defining the training algorithm
  – Stochastic Gradient Descent
    • With momentum
    • With noise
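To make "learning is an optimization problem" concrete, here is a minimal sketch of gradient descent with momentum and optional noise. The quadratic error function, the hyperparameter values, and all names are assumptions for illustration; the gradient here is exact, whereas real stochastic gradient descent would estimate it from a mini-batch of training data.

```python
import random


def sgd_momentum(grad, w0, lr=0.1, momentum=0.9, noise_std=0.0, steps=200, seed=0):
    """Minimize an error function given its gradient, using momentum and optional noise."""
    rng = random.Random(seed)
    w = list(w0)
    velocity = [0.0] * len(w)
    for _ in range(steps):
        g = grad(w)
        for i in range(len(w)):
            # Optional Gaussian noise added to the gradient (disabled by default).
            noise = rng.gauss(0.0, noise_std) if noise_std > 0 else 0.0
            # Momentum: keep a fraction of the previous step and add the new one.
            velocity[i] = momentum * velocity[i] - lr * (g[i] + noise)
            w[i] += velocity[i]
    return w


def error_gradient(w):
    """Gradient of the toy error (w[0] - 3)^2 + (w[1] + 1)^2, whose minimum is at (3, -1)."""
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]


w = sgd_momentum(error_gradient, [0.0, 0.0])
```

With these settings the parameters settle close to (3, -1). In a neural network the same loop runs over millions of parameters, with the gradient supplied by backpropagation.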
Page 10
Frameworks
• DeepLearning4j
  – Independent company
  – Java interface with native C/C++ bindings for performance
• TensorFlow
  – Python & C++ API
  – Developed by Google
  – Compatible with TPU
• Torch
  – Backed by Facebook
  – Written in Lua (LuaJIT); Python available through PyTorch
Page 11
WHAT TO LEARN?
Page 12
Data Sets
• Public data sets
  – Labeled Faces in the Wild (LFW)
  – YouTube Faces
  – Kaggle
• Private data sets
• Build your own
  – Outsourcing: Mechanical Turk
  – Crowdsourcing: reCAPTCHA model
Page 13
Preparing Data
Clean data
Cropping
Structure
Homogeneity
Normalization
Histograms
Filtering
Page 14
Preparing Data
• Machine learning is not magic
• If you can’t understand the data, a machine probably won’t either
• Preprocessing often makes the difference between good and poor results
• Applying filters, normalization, and anomaly detection is computationally inexpensive
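As a rough illustration of how cheap these preprocessing steps are, the sketch below implements min-max normalization, a coarse histogram, and a z-score anomaly filter in plain Python. The function names, the bin count, and the z-score threshold are assumptions chosen for the example, not values from the talk.

```python
import statistics


def normalize(pixels):
    """Min-max scale grayscale values to the range [0, 1]."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return [0.0 for _ in pixels]
    return [(p - lo) / (hi - lo) for p in pixels]


def histogram(pixels, bins=4):
    """Count normalized values per equal-width bin; a quick way to look at the data."""
    counts = [0] * bins
    for v in normalize(pixels):
        # The value 1.0 would land in bin index `bins`, so clamp it into the last bin.
        counts[min(int(v * bins), bins - 1)] += 1
    return counts


def drop_outliers(values, max_z=2.0):
    """Simple anomaly filter: keep values within max_z standard deviations of the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return list(values)
    return [v for v in values if abs(v - mean) / std <= max_z]


# Hypothetical measurements with one obvious anomaly (200).
readings = [10, 11, 9, 10, 12, 10, 11, 200]
cleaned = drop_outliers(readings)
```

Each function is a single pass over the data, which is negligible next to the cost of training a network on the result.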
Page 15
DEFINING LEARNING OBJECTIVES
Page 16
Defining learning objectives
• Supervised
  – Classification
  – Scoring and regression
  – Identification
• Unsupervised
  – Clustering
Page 17
Classification
• Projecting input onto a fixed set of classes
• “Don’t use a cannon to kill a fly”
  – Support Vector Machines
    • Linear
    • Radial Basis Functions (RBF)
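The two SVM variants named on this slide differ only in the kernel, the function that measures similarity between two inputs. A minimal sketch of both kernels follows; the function names and the `gamma` default are assumptions for illustration, and training the classifier itself would normally be left to a library (for example scikit-learn's `SVC` with `kernel="linear"` or `kernel="rbf"`).

```python
import math


def linear_kernel(x, y):
    """Plain dot product: the resulting decision boundary is a straight hyperplane."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=0.5):
    """Radial basis function kernel: similarity decays with squared Euclidean distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


same = rbf_kernel([1.0, 2.0], [1.0, 2.0])      # identical points
apart = rbf_kernel([0.0, 0.0], [1.0, 1.0])     # squared distance of 2
```

The linear kernel is the cheaper choice when the classes are roughly separable by a plane; the RBF kernel allows curved boundaries at the cost of an extra hyperparameter.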
Page 18
Identification
• Embedding
  – Projecting input (image) into a vector space with a known property
• Triplet Loss Function
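A triplet loss trains the embedding so that two images of the same person end up closer together than images of different people. The sketch below uses the common squared-Euclidean formulation; the margin value and the tiny two-dimensional embeddings are assumptions for illustration (real face embeddings typically have a hundred or more dimensions).

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero when the negative is farther from the anchor than the positive by at least the margin."""
    return max(
        0.0,
        squared_distance(anchor, positive) - squared_distance(anchor, negative) + margin,
    )


anchor = [0.0, 0.0]      # embedding of a face
positive = [0.0, 1.0]    # another image of the same person
far_negative = [3.0, 0.0]   # a different person, already well separated
near_negative = [1.0, 0.0]  # a different person, as close as the positive

easy = triplet_loss(anchor, positive, far_negative)
hard = triplet_loss(anchor, positive, near_negative)
```

Only the second triplet produces a non-zero loss, so only it would push the network to change the embedding.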
Page 19
Clustering
• Splitting a set of items into non-overlapping subsets, based on item attributes
• Counting people in video streams
• Algorithms:
  – Fixed threshold
  – K-means
  – Rank-order clustering
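The simplest algorithm on this list, fixed-threshold clustering, can be sketched as a greedy pass over the embeddings. This is one possible reading of "fixed threshold", not necessarily the variant used in the talk; the function name, the threshold, and the sample points are assumptions.

```python
import math


def threshold_cluster(embeddings, threshold):
    """Greedy clustering: join the first cluster whose representative is within the threshold."""
    clusters = []  # each cluster is a list; its first member acts as the representative
    for embedding in embeddings:
        for cluster in clusters:
            if math.dist(embedding, cluster[0]) <= threshold:
                cluster.append(embedding)
                break
        else:
            # No existing cluster is close enough, so this embedding starts a new one.
            clusters.append([embedding])
    return clusters


# Hypothetical face embeddings extracted from video frames: two people, five detections.
detections = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
people = threshold_cluster(detections, threshold=1.0)
people_count = len(people)
```

Counting people in a video stream then reduces to counting clusters. Unlike k-means, this approach does not need the number of people in advance, but the result depends on the threshold and on the order of the detections.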
Page 20
HOW TO SCALE LEARNING?
Page 21
How to scale learning?
• Scaling training
  – Requires shared memory space
  – Vertical scaling
    • GPU
    • Coming soon: TPU (tensor processing unit)
• Scaling evaluation
  – Shared-nothing architecture
  – Neural network/classifier rarely changes
  – Load balancing pattern
  – Partitioning data if needed
Page 22
Ideas for horizontal scaling
• There is no “reduce” for neural networks
• Averaging weights/parameters
  – Usually not a good idea
• Genetic algorithms
  – Requires a lot of processing power
  – Running independent iterations on different machines
  – Crossover between weights/parameters of independently trained neural networks after each epoch
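One common explanation for why averaging is usually a bad idea is that two networks can compute the same function with their hidden units in a different order, so averaging mixes unrelated weights. The toy below shows this, followed by a uniform crossover as one possible genetic operator. The network layout, the per-unit crossover scheme, and all names are assumptions for illustration, not the method described in the talk.

```python
import random


def predict(net, x):
    """One hidden layer of ReLU units: sum over units of v_i * relu(w_i * x)."""
    return sum(v * max(0.0, w * x) for w, v in zip(net["w"], net["v"]))


def average(net_a, net_b):
    """Naive 'reduce': element-wise mean of the two parameter sets."""
    return {k: [(a + b) / 2 for a, b in zip(net_a[k], net_b[k])] for k in net_a}


def crossover(net_a, net_b, rng):
    """Uniform crossover: each hidden unit is inherited whole from one parent."""
    picks = [rng.random() < 0.5 for _ in net_a["w"]]
    return {
        k: [a if pick else b for a, b, pick in zip(net_a[k], net_b[k], picks)]
        for k in net_a
    }


net_a = {"w": [1.0, -1.0], "v": [1.0, 1.0]}   # computes |x|
net_b = {"w": [-1.0, 1.0], "v": [1.0, 1.0]}   # same function, hidden units swapped

averaged = average(net_a, net_b)               # input weights cancel out to [0, 0]
child = crossover(net_a, net_b, random.Random(0))
```

Both parents return |x|, yet the averaged network returns 0 for every input. A crossover child is not guaranteed to be good either; in a genetic algorithm each child would be evaluated and only the fittest kept, which is where the processing cost comes from.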
Page 23
GOTCHAS
Page 24
The Curse of Dimensionality
• Our 2D and 3D intuition often fails in high dimensions
• Distances tend to become relatively “the same” as the number of dimensions increases
• Dimensionality reduction
  – Embedding functions
  – Principal component analysis
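The claim that distances become "the same" can be checked with a small experiment: measure how much the farthest random point differs from the nearest one, relative to the nearest. The uniform random data, the point count, and the function name are assumptions made for this sketch.

```python
import math
import random


def relative_contrast(dim, n_points=100, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)


low_dim = relative_contrast(2)
high_dim = relative_contrast(500)
```

In two dimensions the farthest point is many times farther away than the nearest one; in 500 dimensions the two are nearly equidistant, so a nearest-neighbour search on raw high-dimensional data carries little signal. This is the motivation for reducing dimensionality first.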
Page 25
Local Optima
• “The bottom of a valley is not necessarily the lowest point on Earth”
• Learning algorithms may get stuck in local optima
• Using momentum or some random noise reduces this possibility
• Using genetic algorithms can be even more robust, but they are computationally expensive
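A one-dimensional toy makes the "valley" problem visible. The error function below has two valleys, and plain gradient descent ends up in whichever one it starts nearest to. The function, the learning rate, and the starting points are assumptions chosen for illustration.

```python
def error(x):
    """Toy error with a shallow valley near x = 1.13 and a deeper one near x = -1.30."""
    return x ** 4 - 3 * x ** 2 + x


def error_slope(x):
    """Derivative of the toy error function."""
    return 4 * x ** 3 - 6 * x + 1


def gradient_descent(x, lr=0.01, steps=500):
    """Plain gradient descent, with no momentum and no noise."""
    for _ in range(steps):
        x -= lr * error_slope(x)
    return x


stuck = gradient_descent(2.0)    # starts on the right, settles in the shallow valley
best = gradient_descent(-2.0)    # starts on the left, reaches the deeper valley
```

Both runs stop where the slope is zero, but only one of them finds the lower error. Momentum, noise, or a population of candidates (as in genetic algorithms) are ways of giving the search a chance to leave the shallow valley.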
Page 26
Visualizing Local Optima
[Figure: monkey saddle surface]
Page 27
Evaluation of Learning
“Based on state-of-the-art machine learning, our weather forecast system can predict tomorrow’s weather with 72% accuracy”
You get the same results by saying “it’s going to be the same as today”
Page 28
Evaluation of Learning
• Don’t test on the data you train on
  – Use a different data set
  – Split the data sets you have
• Beware of data biases
  – Confirmation bias
  – Survivorship bias
  – Selection bias
• Compare against a benchmark, even a dummy one
  – Coin flip
  – Linear algorithms
  – “Same-as-before”
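The "same-as-before" benchmark takes only a few lines to compute, which is a good reason to always report it next to a model's accuracy. The weather sequence below is made up for the example, and the function names and the 75/25 split ratio are assumptions.

```python
def accuracy(predictions, actual):
    """Fraction of predictions that match the observed values."""
    return sum(p == a for p, a in zip(predictions, actual)) / len(actual)


def same_as_before(series):
    """Dummy benchmark: predict that each day repeats the previous day."""
    return series[:-1]


def split(series, train_fraction=0.75):
    """Chronological split so the model is never tested on data it was trained on."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


# Hypothetical daily observations, invented for this example.
weather = ["sun", "sun", "sun", "rain", "rain", "sun",
           "sun", "sun", "sun", "rain", "rain", "rain"]

baseline_accuracy = accuracy(same_as_before(weather), weather[1:])
train, test = split(weather)
```

On this invented sequence the dummy benchmark is right 8 times out of 11, about 73%. A model that reports 72% on similar data has learned nothing beyond "tomorrow looks like today".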
Page 29
Architecture and Use Cases
Page 30
High Level Architecture
VisageCloud Production (diagram components):
• API Consumer (Customer Infrastructure)
• HAProxy (reverse proxy)
• Service (API Controller)
• Neural Networks (OpenCV, Dlib, Torch, pixie magic)
• Cassandra
• Containers (Docker)
• Image Storage (AWS S3)
• Connection protocols shown: HTTPS, HTTP, CQL Binary
Page 31
Processing Pipeline
Detect faces → Align faces → Pre-processing → Feature extraction → Feature comparison
Page 32
Partitioning Data: Application Level Logic
• The collection
  – Slice of data used together
  – 10K-100K records
• The Cache-Inside Pattern
  – Loading / preloading collection in one application server
  – Content-based routing/balancing to maximize cache hits
  – No logic in the database layer
  – Requires periodic polling for updates
• Weaker consistency
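A minimal sketch of the Cache-Inside pattern, assuming the routing key is a collection identifier: requests for the same collection always reach the same application server, which loads the collection once and then serves it from memory. The class, the hash-based routing rule, and the loader function are all hypothetical; periodic polling for updates is left out.

```python
import hashlib


class AppServer:
    """Application server that keeps whole collections in local memory."""

    def __init__(self, name, load_collection):
        self.name = name
        self._load_collection = load_collection  # stands in for a database read
        self._cache = {}
        self.loads = 0  # number of times the database had to be read

    def get(self, collection_id):
        if collection_id not in self._cache:
            self._cache[collection_id] = self._load_collection(collection_id)
            self.loads += 1
        return self._cache[collection_id]


def route(collection_id, servers):
    """Content-based routing: a stable hash of the collection id picks the server."""
    digest = hashlib.sha256(collection_id.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


def fake_load(collection_id):
    """Hypothetical loader; a real one would read the records from the database."""
    return [f"{collection_id}-record-{i}" for i in range(3)]


servers = [AppServer(f"app-{i}", fake_load) for i in range(3)]
target = route("collection-42", servers)
first = target.get("collection-42")   # cache miss: loads from the "database"
second = target.get("collection-42")  # cache hit: served from memory
```

Because the routing is deterministic, repeated requests for one collection hit a warm cache. The trade-off named on the slide follows directly: a write made through another path is invisible until the server polls for updates.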
Page 33
Partitioning Data: Application Level Logic
[Diagram]
• Content-based balancing/routing
• Application Layer: Application, Application, Application
• Between the layers: Preload collection, Poll for updates, Write updates
• Cassandra (Database Layer): Cassandra Node, Cassandra Node, Cassandra Node, Cassandra Node
Page 34
Partitioning Data: Database Level Logic
• Perform comparison logic in the database
  – User Defined Aggregate Functions
• Removes the need to move data around between application and database
• Harder to deploy/test
• Stronger consistency
Page 35
Key Takeaways
• It’s math, not magic
• If you don’t understand the data, neither will the machine
• Preprocessing makes the difference
• Test against a benchmark, any benchmark
• Evaluate first, scale later
Page 36
Let’s keep in touch
[email protected]
+(40) 724 714 234
https://www.linkedin.com/in/bogdanbocse/
https://twitter.com/bocse
Page 38
Q & A