Supervised vs. unsupervised learning: How to choose the right approach and data labeling technique for your ML projects
Introduction
Artificial intelligence (AI) augmentation will generate $2.9 trillion in
business value and recover 6.2 billion hours of worker productivity by
2021, according to Gartner1. Companies will realize this value by using AI
techniques to build differentiated business models around data and drive
new revenue, customer experiences, and product innovations.
Machine learning (ML) is one AI technique well suited for the challenges
and opportunities that lie in what many herald as the fourth industrial
revolution. The principle is simple: Give a machine access to a large store
of data and, rather than providing it with predefined rules, allow it to learn
a decision-making model from the data itself. Thanks to the availability
of cheap storage, dynamic infrastructure, and open-source solutions,
enterprise data scientists can now focus on ML applications that leverage
these massive data stores and event streams to address pointed business
objectives.
1 Lovelock et al., Forecast: The Business Value of Artificial Intelligence, Worldwide, 2017-2025 (Gartner, 2018).
There’s no question that AI/ML will continue to revolutionize business
models for enterprises with access to large volumes of data, but how will
they leverage this intelligence to extract the most value out of their data?
We will examine the technical and business significance of choosing a
supervised or an unsupervised approach to machine learning, relevant
applications and limitations of each, and discuss when an investment in
data labeling efforts is required to ensure a successful outcome.
Enterprises are employing ML techniques like computer vision (CV) and
natural language processing (NLP) to drive value from new revenue centers,
lower operational costs, and improved customer experiences. Because
ML has the inherent ability to learn and deliver increasing value over time,
organizations that invest in ML stand to gain greater revenue and cost
advantages over those who are late to adopt AI/ML initiatives.
Supervised vs. Unsupervised Learning

[Figure: Common machine learning applications by approach. Supervised Learning: Classification (image classification, identity fraud detection, customer retention, diagnostics) and Regression (advertising popularity prediction, weather forecasting, market forecasting, estimating life expectancy, population growth prediction). Unsupervised Learning: Clustering (recommendation systems, targeted marketing, customer segmentation) and Dimensionality Reduction (meaningful compression, structure discovery, big data visualization, feature elicitation)]
Data scientists broadly classify ML approaches as supervised or
unsupervised, depending on how and what the models learn from the
input data. They address different types of problems, and the appropriate
approach depends on the business objective and the use case. In general,
an unsupervised learning approach will describe characteristics of a data
set, and supervised learning approaches will answer a prescribed question
about data points in a data set. The more prescriptive the use case, the
better the fit for supervised learning. For example, identifying guardrail
damage from a drone video enables better management of highway safety,
cost, and logistics for repairs.
Unsupervised learning, on the other hand, is a better fit when the business
objective is open ended, such as exploring options to optimize targeted
marketing campaigns or understanding market segments for product
development. It is common to combine both approaches: unsupervised
techniques for exploring the data and supervised techniques to answer
specific questions for a predefined use case. Note, however, that the two
solve different types of problems and are not interchangeable.
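The contrast can be made concrete with a toy sketch in pure Python (the transaction amounts and labels are hypothetical): the supervised model learns an actionable decision rule from labeled examples, while the unsupervised model only uncovers structure that a human must still interpret.

```python
# Toy illustration: the same transaction amounts, approached two ways.
amounts = [12.0, 9.5, 11.2, 310.0, 295.5, 402.0]

# --- Supervised: labels ("targets") answer a prescribed question per point.
labels = [0, 0, 0, 1, 1, 1]  # 1 = flagged as fraud by a human reviewer

def fit_threshold(xs, ys):
    """Learn a decision threshold from labeled examples: the midpoint
    between the means of the two classes."""
    lo = [x for x, y in zip(xs, ys) if y == 0]
    hi = [x for x, y in zip(xs, ys) if y == 1]
    return (sum(lo) / len(lo) + sum(hi) / len(hi)) / 2

threshold = fit_threshold(amounts, labels)
predict = lambda x: int(x > threshold)   # actionable: flag or don't flag

# --- Unsupervised: 1-D two-means clustering finds groups, but the
# cluster ids carry no business meaning until a human interprets them.
def kmeans_1d(xs, iters=10):
    c = [min(xs), max(xs)]  # initial centroids
    for _ in range(iters):
        groups = [[x for x in xs if abs(x - c[0]) <= abs(x - c[1])],
                  [x for x in xs if abs(x - c[0]) > abs(x - c[1])]]
        c = [sum(g) / len(g) for g in groups]
    return c

centroids = kmeans_1d(amounts)
```

The supervised output is directly actionable (flag a transaction or not); the clustering output merely describes the data set's structure.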
Selecting the Right Approach: Supervised vs. Unsupervised
Supervised
• Purpose/characterization: used to turn data into actionable information; answers a prescribed question about data points in a data set; can be used to carry out specific, isolated tasks; data has known labels or output
• Techniques: classification, regression
• Output: prescriptive results; prediction of the right answer; answers to questions for specified input
• Example application (computer vision): Does the person in the frame have product X in their basket?
• Is the output actionable? Yes: you can charge the shopper for item X that has been placed in their checkout basket.

Unsupervised
• Purpose/characterization: used for data discovery; answers questions about the aggregate data set; finds structures and patterns in the data; data does not have known labels or output
• Techniques: clustering/segmentation, dimensionality reduction
• Output: descriptive results about the data set; summary of data distribution; feature separability
• Example application (customer segmentation): Can we identify clusters of customers with similar characteristics?
• Is the output actionable? No: you can identify the segments, but the output will not specify what actions to take on those segments.
How the Models Learn

Unsupervised models learn by identifying characteristics of a data set.
Different unsupervised learning techniques yield different characterizations.
This may include clustering, which groups data into similar segments
and exposes patterns in the data set, or dimensionality reduction, which
identifies features of the data values that provide the most differentiation
among data points. Because unsupervised learning is exploratory in nature,
there is no concept of a right or a wrong answer.
Supervised models, in contrast, learn by example how to answer a
predefined question about each data point. They generate a model that
maps input data to specific desired outcomes, called targets. The target
represents the ‘right answer’. The input data for supervised learning is a set
of training data that has been labeled so that it contains the target value.
The supervised approach uses this association between the targets and
input data records in its learning process.
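A minimal sketch of "learning by example": each training record pairs input features with a target, the known right answer, and a 1-nearest-neighbor model answers the predefined question for a new point by recalling the most similar labeled example. The features and targets here are hypothetical.

```python
# Each training record pairs input features with a target value.
training_data = [
    # ([length_cm, weight_kg], target)
    ([30.0, 0.4], "cat"),
    ([35.0, 0.5], "cat"),
    ([90.0, 9.0], "dog"),
    ([80.0, 8.0], "dog"),
]

def predict(features):
    """Answer the predefined question for a new data point by returning
    the target of the closest labeled training example (1-NN)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    _, target = min(training_data, key=lambda rec: dist(rec[0], features))
    return target
```

Without the target column, the model would have nothing to map the input features onto, which is exactly why supervised learning requires labeled training data.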
The data an organization has collected may not inherently contain the
target values. If the targets cannot be extracted from the existing data, it is
possible to modify business processes to capture the target values as part
of day-to-day operations. For example, a company using an ML model to
automate product quality control can amend their data collection process
so that human quality control decisions (the target value) are captured at
the same time the photo or video input data are collected.
When the existing data does not include the targets or when the targets
cannot be captured as part of the data collection, the data must be
manually labeled after-the-fact. The data labeling is almost always done
by humans, because of the high level of accuracy required to train the ML
models. Supervised learning in this context benefits by learning from data
that has been accurately labeled using human judgment. Humans have a
unique cognitive capacity to find and extract meaning from context—a skill
that machines still lack. It is this context that enables supervised models to
map features of the data to desired outcomes.
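One way to picture a human-generated label is as a target field embedded alongside the existing input record. The schema below is purely hypothetical, for illustration:

```python
from dataclasses import dataclass, asdict

@dataclass
class LabeledFrame:
    """A hypothetical record pairing existing input data with a
    human-judged target value."""
    frame_id: str        # existing input data: a video frame reference
    shopper_id: str
    has_product_1: bool  # the target: "Does the shopper have Product 1
                         # in their cart?"
    labeler: str = "human"  # provenance of the judgment

record = LabeledFrame(frame_id="frame_0412", shopper_id="s_17",
                      has_product_1=True)
```

The target field is what a supervised model trains against; the rest of the record is the input it learns to map from.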
Organizations can use data labeling to embed specific targets into
the existing data, in order to answer specific questions such as,
“Does the shopper have Product 1 in their cart?”
[Figure: Examples of Data Labeling]
When Do You Need Data Labeling?
Unsupervised and supervised learning approaches each solve
different types of problems and have different use cases. The power of
unsupervised methods has been widely touted recently, but the term unsupervised
has become overloaded. The preferred term for using ML to harness the
power of vast amounts of data without requiring external data labeling
is self-supervised learning2. This approach can be powerful, given a set of
data containing meaningful associations or segmentations.
2 Yann LeCun, 2018: "Most of human and animal learning is unsupervised learning. If intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake." In April 2019 he added: "I now call it 'self-supervised learning', because 'unsupervised' is both a loaded and confusing term." https://www.facebook.com/722677142/posts/10155934004262143/
Yet it cannot deliver answers to the specific questions a business may need
answered to deploy targeted ML solutions (e.g., "Is the guardrail in the drone
footage damaged?"), because the answers simply do not exist within the available
data. When the business needs answers to these specific questions to derive
value, teams can label the data after the fact to embed the targets into
the available data. The labeled data set enables data science teams to take the
supervised approach and broaden their business applications.
Do you need labeled data?

What do you need to do with the data?
• Explore the aggregate data set and discover patterns → Unsupervised Learning. Labeling is typically not required for unsupervised learning.
• Answer a specific question about each data point → Supervised Learning. Have you identified the target value for each input data record?
  • Yes, target values have been collected with the data → additional labeling is not required.
  • No, target values have not been identified in the data → data labeling (human-generated labels) is required.
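The decision flow above can be sketched as a small function; the goal strings and parameter names are illustrative, not a formal API.

```python
def needs_data_labeling(goal: str, targets_collected: bool = False) -> bool:
    """goal: 'explore' (aggregate patterns, unsupervised) or
    'answer' (a specific question about each data point, supervised)."""
    if goal == "explore":
        # Unsupervised learning: labeling is typically not required.
        return False
    if goal == "answer":
        # Supervised learning: label only if targets weren't collected
        # alongside the data.
        return not targets_collected
    raise ValueError(f"unknown goal: {goal!r}")
```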
Tapping into the Potential Value of Data
In order to extract business value from data, an organization must have
the ability to mine unique insights from that data. AI and ML are powerful
tools in this effort but they rely on insights available within the data. If
those insights are not already encoded in the data itself, data labeling can
be a means to drive significant value for the organization. Although data
labeling consumes time and resources, the potential return on investment
is substantially higher. In many cases, after-the-fact labeling is the only way
to effectively mine these insights. Labeled data enables supervised ML to
extract actionable business value that is impossible to achieve otherwise.
Building a Data Strategy

A sound data strategy sets the stage for enterprises to get value from
their data using ML. Organizational considerations for the data itself, the
team of people building value from the data, the infrastructure supporting
data operations, and the project timeline and resources are all critical
foundations for success. System readiness may also impact the decision
on which ML approach the data science team chooses to invest in.
DATA ACCESSIBILITY
It is vital for organizations to understand what data they have, where it
resides, and what data they don’t yet have access to. It is equally important
to assess the sophistication of the overall data strategy. What kind of cross-
functional support and data integrations already exist within the enterprise?
Is data collection complete or ongoing?
PEOPLE
Implementing a successful ML operation in an enterprise requires a team
of skilled people who understand the technology and application needs.
For enterprise-level efforts, the best approach is to have a mix of technical,
data, and domain experts: technical experts to deploy and maintain
technical ecosystems, data experts to extract and translate data assets, and
domain experts to provide business context.
INFRASTRUCTURE
In the same way that supporting an ML effort requires specialized
people and skill sets, it also requires the right infrastructure to support
the integration, storage, accessibility, and labeling of training data.
Organizations should ask if they have an existing infrastructure to build,
train, manage, and deploy ML models. Does that infrastructure easily
support experimentation? If not, should the organization build this
infrastructure? Does this infrastructure support the necessary encryption
and data security methodologies?
TIME AND RESOURCES
Organizations must determine whether or not they have the time and
resources to collect, transform, explore, label, and operationally use the
data in their ML project. Projects vary in potential payoff and risk—payoff
resulting from a well-performing model trained on viable data, and the risk
of deploying a poorly performing model that fails to deliver business value.
They will also need to be ready to scale their operations as an ML model is
being developed under real-world circumstances.
Why Quality Matters in Data Labeling

When the business is able to identify a specific question of value that
can be answered with the available data, they can take a supervised ML
approach to realize that value. To train the supervised ML model, you must
identify a set of input values (the data) and corresponding target values (the
known answers), which need to be encoded into the training data set via
data labeling. This process is easier said than done, as the labeling must
be done with a high degree of accuracy and massive volumes of high-
quality training data are required to ensure the effectiveness of the model.
It is well known that training data quality determines the performance of
machine learning systems. This is the “garbage-in, garbage-out” concept.
Overall performance of even the most sophisticated model can be easily
compromised if it is trained on data that is poorly labeled or does not
accurately reflect the target values. Errors in labeling target values or
misclassifying the targets lead to errors in the model. These errors also
proliferate as they flow through an ML application. Failing to maintain
accuracy as data labeling scales from proof-of-concept to production can
bring a project to a halt. Building the ability to generate and scale high-
quality training data over time is a means of protecting the organization’s
ML investments and mitigating the risk of compromising an otherwise
successful model.
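The "garbage-in, garbage-out" effect is easy to demonstrate in a few lines of pure Python (the numbers are hypothetical): the same simple threshold model is trained twice, once on correct labels and once with a single mislabeled record, then scored on clean held-out data.

```python
def fit_threshold(xs, ys):
    """Learn a decision threshold: the midpoint between class means."""
    lo = [x for x, y in zip(xs, ys) if y == 0]
    hi = [x for x, y in zip(xs, ys) if y == 1]
    return (sum(lo) / len(lo) + sum(hi) / len(hi)) / 2

def accuracy(threshold, xs, ys):
    return sum(int(x > threshold) == y for x, y in zip(xs, ys)) / len(xs)

train_x = [10, 20, 30, 40, 60, 70, 80, 90]
clean_y = [0, 0, 0, 0, 1, 1, 1, 1]
noisy_y = [0, 0, 0, 0, 1, 1, 1, 0]   # one record mislabeled

# Clean held-out evaluation data with correct labels.
test_x = [5, 45, 52, 55, 95]
test_y = [0, 0, 1, 1, 1]

acc_clean = accuracy(fit_threshold(train_x, clean_y), test_x, test_y)
acc_noisy = accuracy(fit_threshold(train_x, noisy_y), test_x, test_y)
```

Even a single labeling error shifts the learned threshold and costs held-out accuracy; at production scale, such errors compound across the whole model.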
Establishing a Training Data Pipeline
A simple way to think about data labeling is to liken it to the process of
converting crude oil to refined fuel. While data is the new currency in AI/
ML, quantity and availability of data is only part of the story. Supervised
ML models cannot run on data that lacks targets, in the way a vehicle
cannot run on raw petroleum. In fact, the lack of ability to scale labeled
data or training data is often the biggest blocker to accelerated ML model
development. Organizations that can find the most reliable and economical
path to accurately labeled data quickly gain value and competitive
advantage. This path is an essential part of the training data pipeline that
carries tremendous impact on both the performance of the ML model and
productivity of the data science team.
Options to label your data include in-house labeling using homegrown tools,
third-party crowdsourcing, and full-service data labeling platforms. Teams
need to evaluate labeling options against their ML use case, timeline
and resources, and data volume and accuracy requirements, then determine
how much time their data scientists can afford to invest in labeling
efforts. The selection process deserves careful consideration,
as each technique produces output that varies in accuracy, speed,
security controls, and operational cost, as well as in its impact on model
performance.
Data Labeling Techniques

In-house Labeling
• What is it? Data science teams label the data with homegrown tools.
• Pros: high accuracy; can be labeled by domain experts; visibility into the process; can manage subjective cases.
• Cons: extremely resource intensive and expensive; high operational cost; detracts data scientists' time from advancing ML; long lead time; not scalable.

Crowdsourcing
• What is it? Use of external crowdsourcing resources to label data.
• Pros: economical for simple tasks; easy to access resources; fast; scalable.
• Cons: difficult to increase accuracy; poor quality and security controls; unknown workers; not designed to support complex use cases.

Data Programming
• What is it? Use of scripts to programmatically label data.
• Pros: can be produced without manual labeling; useful for cleaning up "noisy data"; faster and cheaper than manual labeling.
• Cons: less accurate than manual labeling; limited ML applications; data quality issues.

Synthetic Labeling
• What is it? Use of a generative model to programmatically generate data that imitates specific properties of the real data.
• Pros: can overcome data usage restrictions; scalable; can simulate data or conditions not available in real data.
• Cons: limited ability to represent real data; limited to simulating general trends; can lead to poorly generalized models; requires higher computational power.

Data Labeling Platforms
• What is it? External platforms with partial or full services, designed to scale data labeling.
• Pros: high accuracy; removes labeling tasks from the team; can manage complex or subjective tasks; qualified workers; scalable.
• Cons: higher cost than crowdsourcing; level of quality, speed, and security depends on the chosen platform.
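The data programming technique above can be sketched with a few heuristic labeling functions whose majority vote produces a programmatic label; the heuristics themselves are hypothetical spam rules, chosen only for illustration.

```python
ABSTAIN = None  # a labeling function may decline to vote

def lf_keyword(text):
    """Rule: an obvious spam phrase marks the message as spam (1)."""
    return 1 if "free money" in text.lower() else ABSTAIN

def lf_shouting(text):
    """Rule: all-caps messages look like spam."""
    return 1 if text.isupper() else ABSTAIN

def lf_greeting(text):
    """Rule: personal greetings look legitimate (0)."""
    return 0 if text.lower().startswith(("hi", "hello")) else ABSTAIN

def programmatic_label(text, lfs=(lf_keyword, lf_shouting, lf_greeting)):
    """Majority vote over non-abstaining labeling functions;
    abstain entirely when no heuristic fires."""
    votes = [v for lf in lfs if (v := lf(text)) is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)
```

This captures both the appeal (fast, cheap, no manual labeling) and the cons listed in the table: records where no rule fires go unlabeled, and the labels are only as accurate as the heuristics.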
Choosing a Data Labeling Platform
Third-party data labeling platforms combine the best of these techniques to accelerate ML initiatives. An enterprise-grade platform
needs to be able to scale your data labeling and take all labeling tasks off your team's plate, without compromising on
accuracy. Data science teams who need to scale data labeling as they move from proof-of-concept to production, or those who need
to offset the high cost of in-house data labeling, will benefit from a third-party labeling platform.
While labeling quality and throughput vary by platform, most platforms offer a combination of annotation tools,
task design frameworks, and a team of humans who will annotate your data. In selecting the right platform for your project, it is
important to know ahead of time what you're trying to do with the data and the level of complexity involved in labeling it.
Some platforms are great for straightforward tasks, while others are made to handle complex use cases and subjective judgments.
Below are additional things to consider as you look for a good project-to-platform fit.
PLATFORM
Is the platform able to build and distribute high-volume labeling tasks
to a qualified workforce? Can the platform be configured to adapt to
organizational needs? Can it be integrated with existing internal ML
development technologies or support API data exchange? Look for
providers who deliver a high level of visibility, feedback, and flexibility to
allow for continuous feedback and optimizations.
TOOLS
What tools are available to label data, whether it be text, images, audio,
or video? Are these tools customizable, or do they put constraints on data
labeling capabilities? Does the platform technology support high accuracy
labeling for subjective use cases? And is the platform provider investing in
R&D to optimize their platform tools and improve throughput and accuracy?
PEOPLE
Who comprises the workforce that labels the data? Are the annotators
screened, vetted, and trained appropriately to make domain-specific
judgments? Are project management and support services included?
For managed platforms, how well does your account manager
understand your data objectives, and are they able to take task design,
workforce training, tooling configuration, and quality management off the
hands of your data science team?
PROCESS & BEST PRACTICES
Does the solution employ industry best practices to scale and improve
training data without compromising on accuracy? What is their track record
on delivering large-scale enterprise data labeling, and does the platform
offer flexibility to support use cases requiring multi-stage annotation
workflows, complex classifications, or conditional quality control strategies?
QUALITY CONTROL
Does the provider deliver a targeted quality control strategy that is
optimized for the organization’s use case and budget, or is it one size
fits all? What QC strategy do they plan to implement to increase your
data accuracy? Does the provider have a robust screening and training
curriculum to qualify the annotation workforce? Are they able to evaluate the
labeling accuracy against gold or ground truth data?
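As a sketch of the gold-data check described above, one might score an annotator's labels against a small ground-truth set, reporting overall and per-class agreement so systematic errors on a particular class stay visible. The labels here are hypothetical.

```python
def score_against_gold(annotator, gold):
    """Compare an annotator's labels to gold (ground truth) labels.
    Returns overall accuracy and a per-class breakdown."""
    assert len(annotator) == len(gold)
    overall = sum(a == g for a, g in zip(annotator, gold)) / len(gold)
    per_class = {}
    for cls in set(gold):
        pairs = [(a, g) for a, g in zip(annotator, gold) if g == cls]
        per_class[cls] = sum(a == g for a, g in pairs) / len(pairs)
    return overall, per_class

gold      = ["cat", "cat", "dog", "dog", "dog", "bird"]
annotator = ["cat", "dog", "dog", "dog", "cat", "bird"]
overall, per_class = score_against_gold(annotator, gold)
```

A per-class view matters because a high overall score can hide an annotator who consistently confuses two specific classes.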
SECURITY
What data and platform security protocols are in place? Are the integration
points and APIs safe for data transfer? Are their annotators NDA-ready? Do
they have data encryption and portal access controls in place, and are they
able to meet your organization’s compliance requirements?
Operationalizing Machine Learning Models
Building a powerful machine learning model, especially with today’s most
popular deep-learning algorithms, requires vast amounts of labeled training
data. However, training is not the only point in the ML development lifecycle
that requires labeled data. Testing and validating models consume data
with externally generated labels, and models in production require labeled
data to handle exceptions or verify that existing labels still accurately reflect
the real-world context.
Therefore, supervised ML initiatives require continuous learning through
a steady supply of high-quality, high-volume labeled datasets throughout
the development lifecycle. Building a systematic solution or a training data
pipeline not only drives model success and business value, it represents an
opportunity to create competitive advantages.
[Figure: The ML development lifecycle. Data Preparation → Model Selection & Training (training data set: train a range of selected models) → Model Scoring & Validation (validation data set: evaluate model fit; check for overfitting or underfitting) → Model Deployment (test/holdout set: evaluate the accuracy of the trained model with test data) → Production Model Monitoring (model monitoring set: evaluate the new data set on the deployed model to account for context or data drift)]
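The lifecycle above implies reserving labeled data for each stage. A minimal sketch, with illustrative (not prescriptive) proportions, assuming records have been shuffled upstream:

```python
def lifecycle_split(records, fractions=(0.6, 0.2, 0.1, 0.1)):
    """Split labeled records into train / validation / test (holdout) /
    monitoring sets. The last split absorbs any rounding remainder."""
    assert abs(sum(fractions) - 1.0) < 1e-9
    splits, start = [], 0
    for frac in fractions[:-1]:
        end = start + int(len(records) * frac)
        splits.append(records[start:end])
        start = end
    splits.append(records[start:])  # remainder goes to the monitoring set
    return splits

data = list(range(100))  # stand-ins for labeled records
train, val, test, monitor = lifecycle_split(data)
```

Keeping a monitoring slice with externally generated labels is what allows a deployed model to be checked for context or data drift after release.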
Generating Advantages through Data Labeling Platforms
Organizations have the ability to develop differentiated business value and
accelerate their ML initiatives through purpose-specific data labeling. Here
are five key ways that a data labeling platform can improve the execution of
ML initiatives and, ultimately, the quality of ML applications.
1. Increase net data value – Extract value from your data and develop
new feature opportunities by identifying and embedding target attributes
in your existing data, at scale. Data labeling platforms can be the best
means of accurately injecting the targets or labels into your data,
enabling teams to take the supervised learning approach.
2. Accelerate ML initiatives – Great data labeling platforms convert
collected data into high-accuracy, model-ready data and seamlessly
feed it into ML models across the stages of model development. The
ability to scale data labeling drives higher model confidence, broadens
the range of ML applications, and speeds time to value.
3. Maintain model accuracy with continuous learning – AI/ML
models can deliver exponential growth in data value and productivity
gains if well monitored. By creating a training data workstream to
train, validate, and test models over time, organizations ensure the
models continue to produce accurate results amid context and data drift.
4. Mitigate project risk – Lower risk for ML initiatives by avoiding wasted
effort or time working with low-quality data labels that can undermine an
organization’s models and AI initiatives. Take advantage of a platform
provider who securely stores and transports data, and who provides a
workforce that protects the value of their client’s proprietary data.
5. Reduce costs and resource investments – Eliminate the costs of
building data annotation tools and project management by outsourcing
data labeling to a platform designed to scale data labeling. Top-
tier data labeling platforms enable data science teams to focus on
advancing their ML model development and eliminate wasted cycles
that compromise project success.
Getting Ahead with AI and ML
It is an exciting time for enterprise AI. Enterprises that understand how
to effectively extract value from their data using ML will differentiate
themselves from their competitors through increased revenue, cost
savings, and more advanced product features. Because ML models
rely on massive volumes of high-quality training data throughout the
project lifecycle, establishing and maintaining a robust training data
pipeline is critical to success.

Data labeling platforms present a powerful option to increase the
value of your data, secure a sound training data strategy, and create a
competitive advantage for the business and its data science teams.
ABOUT ALEGION
Alegion provides data labeling to power enterprise machine learning (ML)
initiatives. Our offering operates at massive scale, combining a data and
task management platform with a global network of trained data specialists
to handle use cases ranging from the straightforward to the highly complex.
We assist data science teams throughout the ML lifecycle, delivering high-
quality data labeling for model training, model validation, and exception
handling. Our platform integrates the best of ML-assisted annotation and
quality control workflows with human insights from a team of trained
specialists to deliver model training data for computer vision, natural
language processing, and entity resolution.
With our specialized labeling platform and dedicated specialists, we
offload all data annotation tasks, freeing data science teams to focus on
their model development. The Alegion platform delivers data to power
ML projects across a range of industries including retail, tech, finance,
healthcare, automotive, and manufacturing.
Want to learn more? REACH OUT TO