1
Approved for Public Release: Distribution Unlimited. AFOTEC Public Affairs Public Release Number 2017-01
Machine Learning: Overview & Applications to Test
1st Lt Takayuki Iguchi
1st Lt Megan E. Lewis
AFOTEC/Det 5/DTS
Release Date: 6 MAR 17
2
Why use Machine Learning in test?
• It takes more time to analyze large, high-dimensional data than it does to collect it
  − Video
  − Audio
  − Bus data
• Machine learning is designed to work with large, high-dimensional data
3
Visualizing Large High Dimensional Data
9
What is Machine Learning?
• A computer program is said to learn from experience 𝐸 with respect to some class of tasks 𝑇 and performance measure 𝑃, if its performance at tasks in 𝑇, as measured by 𝑃, improves with experience 𝐸.
• “The field of study that gives computers the ability to learn without being explicitly programmed.”
10
Types of Machine Learning
• Reinforcement Learning
− Learn to select actions that maximize the accumulated reward over time
• Unsupervised Learning
− Infer a function from unlabeled training data
• Supervised Learning
− Infer a function from labeled training data
11
Types of Machine Learning
• Unsupervised Learning
  “These things are similar” (van der Maaten [2008])
  “These things will add up to something that will look like a 2.” (Hinton [2013])
12
Types of Machine Learning
• Supervised Learning
  “These are camels. Those are people.” (ImageNet [2014])
  “This is the correct salary of a professor given the time since highest degree earned.” (Weisberg [1985])
  [Figure: simple linear regression]
13
Unsupervised Learning Tasks
• Anomaly detection / outlier detection
• Dimensionality reduction
• Manifold learning
• Clustering
14
Anomaly Detection
• As instrumentation has improved, the limiting factor is often no longer too little data but too much
• In flight test there is often only a small time window between sorties, so a quick, cursory data analysis is desired but not currently practical
• Anomaly detection methods can help identify otherwise hidden issues (not detected by aircrew) before they grow into larger problems
15
Anomaly Detection
• Perform problem identification with logged & uncontrollable factors
  − A variety of algorithms and methodologies exist
  − The choice of algorithm & methodology depends on the application and the nature of the data
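As one minimal sketch of the idea (not a method from this briefing), scikit-learn's Isolation Forest can flag records that look unlike the rest of a dataset; the data here is synthetic and the `contamination` setting is illustrative:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
nominal = rng.normal(loc=0.0, scale=1.0, size=(200, 3))   # well-behaved "sensor" records
injected = rng.uniform(low=6.0, high=8.0, size=(5, 3))    # obviously anomalous records
X = np.vstack([nominal, injected])

# fit_predict returns +1 for nominal points and -1 for anomalies
forest = IsolationForest(contamination=0.05, random_state=0)
labels = forest.fit_predict(X)
anomaly_idx = np.where(labels == -1)[0]                   # indices flagged for review
```

In practice the flagged indices would point an analyst at time slices of bus, video, or audio data worth a closer look.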
16
Dimensionality Reduction
• The goal: Take a high-dimensional dataset and find a “good” representation in a lower dimension (e.g., 2-D).
• Signal decomposition methods:
  − Principal Component Analysis (PCA)
  − Kernel PCA
  − Factor Analysis
  − Non-negative matrix factorization
• Manifold learning:
  − Isomap
  − Locally linear embedding (LLE)
  − Spectral embedding
  − Multi-dimensional scaling (MDS)
  − 𝒕-distributed Stochastic Neighbor Embedding (𝒕-SNE)
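Using scikit-learn (which the deck cites), the goal above can be sketched in a few lines; the synthetic data, with one deliberately high-variance direction, is only illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 50))   # 500 samples in 50 dimensions
X[:, 0] *= 10.0                  # give one direction most of the variance

# project down to a 2-D representation
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# how much of the original variance the first component kept
kept = pca.explained_variance_ratio_[0]
```

The first principal component recovers the high-variance direction, which is exactly what a variance-based method should do.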
17
𝒕-distributed Stochastic Neighbor Embedding
• PCA is variance based.
• If the structure in the high-dimensional space lies on a non-linear manifold, PCA will not work well.
(Vanderplas, scikit-learn [2016])
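scikit-learn also ships a t-SNE implementation; a minimal sketch on a small sample of the bundled digits dataset (the subset size and perplexity are illustrative choices, and t-SNE is much slower than PCA on large data):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]   # subsample to keep t-SNE fast

# embed the 64-dimensional digit images into 2-D
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
```

Plotting `emb` colored by `y` typically shows the digit classes separating into clusters, which is the effect the following slides illustrate.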
18
𝒕-distributed Stochastic Neighbor Embedding
(van der Maaten [2016])
22
𝒕-distributed Stochastic Neighbor Embedding
[Figure: 2-D embeddings of the MNIST digits 0–9 produced by t-SNE, Isomap, Sammon mapping, and Locally Linear Embedding]
(van der Maaten [2008])
23
Clustering
• The goal: Partition a dataset to maximize similarity within each partition.
• Connectivity-based / hierarchical clustering
  − Single Linkage Clustering (SLINK)
• Centroid-based clustering
  − 𝒌-means++
  − 𝒌-medians
• Density-based clustering
  − Density-based spatial clustering of applications with noise (DBSCAN)
• Distribution-based clustering
  − Gaussian Mixture Models
24
𝒌-means
• Randomly draw cluster centroids
• Until the clustering remains unchanged:
  − Assign points to the nearest centroid
  − Calculate new centroids
• Output clustering
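The loop above can be sketched in a few lines of NumPy (a minimal illustration on made-up two-blob data; 𝑘 and the data are assumptions, and a production implementation would also handle empty clusters and multiple restarts):

```python
import numpy as np

def kmeans(X, k, rng):
    # randomly draw cluster centroids from the data points
    centroids = X[rng.choice(len(X), k, replace=False)]
    labels = None
    while True:
        # assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # stop when the clustering remains unchanged
        if labels is not None and np.array_equal(new_labels, labels):
            return labels, centroids
        labels = new_labels
        # calculate new centroids as the mean of each cluster
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)),   # blob near (0, 0)
               rng.normal(5.0, 0.5, (50, 2))])  # blob near (5, 5)
labels, centroids = kmeans(X, 2, rng)
```

On well-separated blobs like these, the loop converges in a handful of iterations regardless of the random start.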
33
Supervised Learning Tasks
• Classification
  − Output is discrete (e.g., speech recognition, image classification)
• Regression
  − Output is continuous
34
Neural Networks
• A neuron
• Mathematical model for a neuron:
  𝑎𝑗 = 𝑔(Σ𝑖 𝑤𝑖𝑗 𝑎𝑖), with bias input 𝑎0 = 1
(Russell, Norvig [2010])
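The model above in a few lines of NumPy: the output is 𝑎𝑗 = 𝑔(Σ𝑖 𝑤𝑖𝑗 𝑎𝑖), with 𝑎0 = 1 acting as the bias input. The logistic activation and the sample numbers are illustrative:

```python
import numpy as np

def g(z):
    # logistic activation; other choices (threshold, tanh, ReLU) also fit the model
    return 1.0 / (1.0 + np.exp(-z))

def neuron(inputs, weights):
    a = np.concatenate(([1.0], inputs))   # prepend the bias input a_0 = 1
    return g(np.dot(weights, a))          # g of the weighted sum of inputs

# weights are (w_0j, w_1j, w_2j); inputs are (a_1, a_2)
out = neuron(np.array([0.5, -0.2]), np.array([0.1, 0.4, 0.3]))
```

Here the weighted sum is 0.1 + 0.4·0.5 + 0.3·(−0.2) = 0.24, and the output is 𝜎(0.24) ≈ 0.56.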
35
Perceptrons
[Diagram: inputs 𝑥1, 𝑥2, 𝑥3 connected to output 𝑦 through weights 𝑤1, 𝑤2, 𝑤3]
• All inputs connected directly to outputs
• Error function:
  𝐸 = ½ (𝑡 − 𝑦)²
• Update weights with each training case:
  − Output unit is a threshold unit:
    Δ𝑤𝑖 = 𝜖 (𝑡 − threshold(𝒘⊤𝒙)) 𝑥𝑖
  − Output unit is a logistic unit, 𝑦 = 𝜎(𝒘⊤𝒙):
    Δ𝑤𝑖 = −𝜖 ∂𝐸/∂𝑤𝑖 = 𝜖 (𝑡 − 𝑦) 𝑦 (1 − 𝑦) 𝑥𝑖
(Russell, Norvig [2010])
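The threshold-unit update above is easy to run end to end; a minimal sketch that learns logical AND, which is linearly separable (the learning rate, epoch count, and task are illustrative choices):

```python
import numpy as np

def threshold(z):
    return (z >= 0).astype(float)

def train_perceptron(X, t, epochs=50, eps=0.1):
    Xb = np.hstack([np.ones((len(X), 1)), X])   # bias column x_0 = 1
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for x_i, t_i in zip(Xb, t):
            y_i = threshold(w @ x_i)
            w += eps * (t_i - y_i) * x_i        # the update rule from the slide
    return w

# logical AND: output 1 only when both inputs are 1
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 0, 0, 1], dtype=float)
w = train_perceptron(X, t)
preds = threshold(np.hstack([np.ones((4, 1)), X]) @ w)
```

Because AND is linearly separable, the weights stop changing after a few epochs and the learned unit classifies all four cases correctly; XOR, by contrast, would never converge, which motivates the multi-layer perceptron on the next slide.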
36
Multi-layer Perceptron
• Better performing than a single-layer feed-forward neural network
• Trained with backpropagation
[Diagram: inputs 𝑥1, 𝑥2, 𝑥3 feed hidden units ℎ1, ℎ2 through weights 𝑤1–𝑤6; the hidden units feed output 𝑦 through weights 𝑤7, 𝑤8]
37
Image Text Recognition
• Over-the-shoulder videos are common data sources
  − Cheap to implement
  − Processing is time intensive
• ANNs can help
(Karpathy [2015]) (Shi, et al. [2016])
38
Convolutional Neural Network
• Typically used for image classification
• An RGB image can be thought of as a 3-D array (height × width × color channels)
• Fully connected hidden layers would require too many weights
• The forward pass: pass a filter over the image
(Hinton [2013])
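“Pass a filter over the image” can be written out directly in NumPy; a minimal single-channel sketch (a valid-mode cross-correlation, which is what most deep learning libraries call convolution; the image and filter values are made up):

```python
import numpy as np

def conv2d(image, kernel):
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    # slide the filter over every position where it fits entirely in the image
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

img = np.arange(16, dtype=float).reshape(4, 4)   # a tiny "image" with a constant gradient
edge = np.array([[1.0, -1.0]])                   # horizontal difference filter
fmap = conv2d(img, edge)                         # the resulting feature map
```

Because weights are shared across every position, a CNN layer needs only `kh * kw` parameters per filter instead of one weight per input pixel, which is exactly why it sidesteps the fully-connected weight blowup noted above.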
48
Hyperspectral Classification
• Per-pixel classification from hyperspectral data
• Data from https://engineering.purdue.edu/biehl/MultiSpec/hyperspectral.html
49
CNNs on MNIST
• Misclassifications of LeNet-5
(LeCun [1998])
50
Recurrent Neural Networks
• Directed cycles in their connection graph
• MLPs and CNNs require fixed-size input
• Used to model sequential data
• Hard to train
[Diagram: an input layer, hidden layer, and output layer unrolled over time steps 𝑡1–𝑡6, with the hidden layer feeding back into itself]
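A minimal forward pass shows why the cycle matters: the hidden state feeds back into itself, so one set of weights handles a sequence of any length. All sizes and the random weights below are illustrative:

```python
import numpy as np

rng = np.random.RandomState(0)
W_xh = rng.normal(scale=0.1, size=(4, 3))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(4, 4))   # hidden -> hidden (the directed cycle)
W_hy = rng.normal(scale=0.1, size=(2, 4))   # hidden -> output

def rnn_forward(xs):
    h = np.zeros(4)                          # hidden state starts at zero
    ys = []
    for x in xs:                             # one pass per time step t1, t2, ...
        h = np.tanh(W_xh @ x + W_hh @ h)     # new state depends on the old state
        ys.append(W_hy @ h)
    return np.array(ys)

seq = rng.normal(size=(6, 3))                # a length-6 sequence of 3-D inputs
outs = rnn_forward(seq)                      # one 2-D output per time step
```

The same three weight matrices would process a length-100 sequence unchanged, which is what lets RNNs model variable-length audio or bus-data streams.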
51
Recurrent Neural Networks
(Karpathy [2015])
52
Image Text Recognition
• Different ways of thinking about the problem
• A Long Short-Term Memory layer was used most recently
(Karpathy [2015]) (Shi, et al. [2016])
53
Audio-speech Recognition
• Traditional speech models chain three components:
  Speech waveform → Acoustic Model → Pronunciation Model → Language Model → Sentence
  (phonemes, e.g. “kaˈfā” → words, e.g. “cafe”)
• With 𝑋 the audio features, 𝐿 the phoneme sequence, and 𝑊 the word sequence:
  argmax𝑊 𝑃(𝑊|𝑋) = argmax𝑊,𝐿 𝑃(𝑋|𝐿) 𝑃(𝐿|𝑊) 𝑃(𝑊)
(Beaufays [2016])
54
Other Acoustic Models
• Other DNN-based approaches to acoustic modeling (Beaufays [2016]):

  Method                                        Year
  DBN                                           2012
  Long Short-Term Memory (LSTM)                 2013
  Convolutional LSTM DNN                        2014
  Connectionist Temporal Classification (CTC)   2015
55
Summary of Applications
• Audio to text
  − Transcribe in-flight audio/conversations
  − Transcribe survey conversations
  − Easily slew to audio of interest
• Image captioning
  − Write text in an image to a text file
  − In-flight data
• Object recognition in images
  − Help label truth data when testing sensors
• Video just adds a time dimension to images
  − Techniques from images may be applied to video
56
Next Steps
• Low-hanging fruit
  − Use existing open-source text recognition for images/video
    OpenCV
  − Use free audio transcription software
    TensorFlow (Google)
    SwiftScribe (Baidu)
57
Next Steps
• Open areas for development:
  − Transcribing acronyms
  − Using machine learning on bus data to alert a maintainer to a specific risk
  − ATC radar more accurately narrowing down aircraft location in real time (Hrastovec et al. [2014])
  − Identifying early indications of airframe stress and strain (Hickinbotham et al. [2000])
58
Acknowledgements
• Workshop organizers
• AFOTEC Det 5 leadership
• Mr. Jeff Wilson
• Capt Joshua Vaughan
59
References
• Hinton, Geoffrey. Artificial Neural Networks. Coursera (2013).
• ImageNet (2014). http://www.image-net.org/challenges/LSVRC/2014/ui/det.html
• Karpathy, Andrej, et al. CS231n online course notes: http://cs231n.stanford.edu/
• Karpathy, Andrej. RNN github page (2015): http://karpathy.github.io/2015/05/21/rnn-effectiveness/
• LeCun, Yann. “Gradient-Based Learning Applied to Document Recognition.” Proceedings of the IEEE (1998).
• MATLAB documentation (2017): https://www.mathworks.com/discovery/support-vector-machine.html
• van der Maaten. 𝒕-SNE github page (2016): https://lvdmaaten.github.io/tsne/
• van der Maaten, Hinton. “Visualizing Data using 𝒕-SNE.” JMLR (2008).
• Russell, Norvig. Artificial Intelligence: A Modern Approach. 3rd Ed. (2010). New Jersey: Pearson.
• scikit-learn documentation (2016). http://scikit-learn.org/stable/documentation.html
• Weisberg, S. (1985). Applied Linear Regression, Second Edition. New York: John Wiley and Sons.
• Wolberg, W.H., & Mangasarian, O.L. (1990). “Multisurface method of pattern separation for medical diagnosis applied to breast cytology.” Proceedings of the National Academy of Sciences, 87, 9193–9196.
60
References
• Shi, Baoguang, Xiang Bai, and Cong Yao. “An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition.” IEEE Transactions on Pattern Analysis and Machine Intelligence (2016).
• Ghamisi, Pedram, et al. “Advanced Supervised Spectral Classifiers for Hyperspectral Images: A Review.” IEEE Geoscience and Remote Sensing Magazine (GRSM) (2017).
• Dahl, George E., Dong Yu, Li Deng, and Alex Acero. “Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition.” IEEE Transactions on Audio, Speech, and Language Processing (2012).
• Gupta, Manish, et al. “Outlier Detection for Temporal Data: A Survey.” IEEE Transactions on Knowledge and Data Engineering (2014).
• Beaufays, Françoise. “Speech Recognition.” Google I/O (2016).
• Yoon, Seunghyun, et al. “Efficient Transfer Learning Schemes for Personalized Language Modeling using Recurrent Neural Network.” arXiv preprint arXiv:1701.03578 (2017).
61
Questions?