Supervised vs. unsupervised learning: How to choose the right approach and data labeling technique for your ML projects
Introduction
Artificial intelligence (AI) augmentation will generate $2.9 trillion in
business value and recover 6.2 billion hours of worker productivity by
2021, according to Gartner1. Companies will realize this value by using AI
techniques to build differentiated business models around data and drive
new revenue, customer experiences, and product innovations.
Machine learning (ML) is one AI technique well suited for the challenges
and opportunities that lie in what many herald as the fourth industrial
revolution. The principle is simple: Give a machine access to a large store
of data and, rather than providing it with predefined rules, allow it to learn
a decision-making model from the data itself. Thanks to the availability
of cheap storage, dynamic infrastructure, and open-source solutions,
enterprise data scientists can now focus on ML applications that leverage
these massive data stores and event streams to address pointed business
objectives.
1 Lovelock et al., Forecast: The Business Value of Artificial Intelligence, Worldwide, 2017-2025 (Gartner, 2018).
There’s no question that AI/ML will continue to revolutionize business
models for enterprises with access to large volumes of data, but how will
they leverage this intelligence to extract the most value out of their data?
We will examine the technical and business significance of choosing a
supervised or an unsupervised approach to machine learning, relevant
applications and limitations of each, and discuss when an investment in
data labeling efforts is required to ensure a successful outcome.
Enterprises are employing ML techniques like computer vision (CV) and
natural language processing (NLP) to drive value from new revenue centers,
lower operational costs, and improved customer experiences. Because
ML has the inherent ability to learn and deliver increasing value over time,
organizations that invest in ML stand to gain greater revenue and cost
advantages over those who are late to adopt AI/ML initiatives.
Supervised vs. Unsupervised Learning

[Figure: Common machine learning applications by approach. Supervised Learning: Classification (image classification, identity fraud detection, customer retention, diagnostics) and Regression (advertising popularity prediction, weather forecasting, market forecasting, estimating life expectancy, population growth prediction). Unsupervised Learning: Clustering (recommendation systems, targeted marketing, customer segmentation) and Dimensionality Reduction (meaningful compression, structure discovery, big data visualization, feature elicitation)]
Data scientists broadly classify ML approaches as supervised or
unsupervised, depending on how and what the models learn from the
input data. They address different types of problems, and the appropriate
approach depends on the business objective and the use case. In general,
an unsupervised learning approach will describe characteristics of a data
set, and supervised learning approaches will answer a prescribed question
about data points in a data set. The more prescriptive the use case, the
better the fit for supervised learning. For example, identifying guardrail
damage from a drone video enables better management of highway safety,
cost, and logistics for repairs.
Unsupervised learning, on the other hand, is a better fit when the business
objective is open ended, such as exploring options to optimize targeted
marketing campaigns or understanding market segments for product
development. It is common to combine both approaches: unsupervised
techniques for exploring the data and supervised techniques to answer
specific questions for a predefined use case. Note, however, that the two
solve different types of problems and are not interchangeable.
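The contrast can be made concrete with a toy sketch in pure Python (the transaction amounts and labels are hypothetical): the supervised model learns an actionable decision rule from labeled examples, while the unsupervised model only uncovers structure that a human must still interpret.

```python
# Toy illustration: the same transaction amounts, approached two ways.
amounts = [12.0, 9.5, 11.2, 310.0, 295.5, 402.0]

# --- Supervised: labels ("targets") answer a prescribed question per point.
labels = [0, 0, 0, 1, 1, 1]  # 1 = flagged as fraud by a human reviewer

def fit_threshold(xs, ys):
    """Learn a decision threshold from labeled examples: the midpoint
    between the means of the two classes."""
    lo = [x for x, y in zip(xs, ys) if y == 0]
    hi = [x for x, y in zip(xs, ys) if y == 1]
    return (sum(lo) / len(lo) + sum(hi) / len(hi)) / 2

threshold = fit_threshold(amounts, labels)
predict = lambda x: int(x > threshold)   # actionable: flag or don't flag

# --- Unsupervised: 1-D two-means clustering finds groups, but the
# cluster ids carry no business meaning until a human interprets them.
def kmeans_1d(xs, iters=10):
    c = [min(xs), max(xs)]  # initial centroids
    for _ in range(iters):
        groups = [[x for x in xs if abs(x - c[0]) <= abs(x - c[1])],
                  [x for x in xs if abs(x - c[0]) > abs(x - c[1])]]
        c = [sum(g) / len(g) for g in groups]
    return c

centroids = kmeans_1d(amounts)
```

The supervised output is directly actionable (flag a transaction or not); the clustering output merely describes the data set's structure.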
Selecting the Right Approach: Supervised vs. Unsupervised
Supervised
• Purpose/characterization: used to turn data into actionable information; answers a prescribed question about data points in a data set; can be used to carry out specific, isolated tasks; data has known labels or output
• Techniques: classification, regression
• Output: prescriptive results; prediction of the right answer; answers to questions for specified input
• Example application (computer vision): Does the person in the frame have product X in their basket?
• Is the output actionable? Yes: you can charge the shopper for item X that has been placed in their checkout basket.

Unsupervised
• Purpose/characterization: used for data discovery; answers questions about the aggregate data set; finds structures and patterns in the data; data does not have known labels or output
• Techniques: clustering/segmentation, dimensionality reduction
• Output: descriptive results about the data set; summary of data distribution; feature separability
• Example application (customer segmentation): Can we identify clusters of customers with similar characteristics?
• Is the output actionable? No: you can identify the segments, but the output will not specify what actions to take on those segments.
How the Models Learn

Unsupervised models learn by identifying characteristics of a data set.
Different unsupervised learning techniques yield different characterizations.
This may include clustering, which groups data into similar segments
and exposes patterns in the data set, or dimensionality reduction, which
identifies features of the data values that provide the most differentiation
among data points. Because unsupervised learning is exploratory in nature,
there is no concept of a right or a wrong answer.
Supervised models, in contrast, learn by example how to answer a
predefined question about each data point. They generate a model that
maps input data to specific desired outcomes, called targets. The target
represents the ‘right answer’. The input data for supervised learning is a set
of training data that has been labeled so that it contains the target value.
The supervised approach uses this association between the targets and
input data records in its learning process.
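A minimal sketch of "learning by example": each training record pairs input features with a target, the known right answer, and a 1-nearest-neighbor model answers the predefined question for a new point by recalling the most similar labeled example. The features and targets here are hypothetical.

```python
# Each training record pairs input features with a target value.
training_data = [
    # ([length_cm, weight_kg], target)
    ([30.0, 0.4], "cat"),
    ([35.0, 0.5], "cat"),
    ([90.0, 9.0], "dog"),
    ([80.0, 8.0], "dog"),
]

def predict(features):
    """Answer the predefined question for a new data point by returning
    the target of the closest labeled training example (1-NN)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    _, target = min(training_data, key=lambda rec: dist(rec[0], features))
    return target
```

Without the target column, the model would have nothing to map the input features onto, which is exactly why supervised learning requires labeled training data.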
The data an organization has collected may not inherently contain the
target values. If the targets cannot be extracted from the existing data, it is
possible to modify business processes to capture the target values as part
of day-to-day operations. For example, a company using an ML model to
automate product quality control can amend their data collection process
so that human quality control decisions (the target value) are captured at
the same time the photo or video input data are collected.
When the existing data does not include the targets or when the targets
cannot be captured as part of the data collection, the data must be
manually labeled after-the-fact. The data labeling is almost always done
by humans, because of the high level of accuracy required to train the ML
models. Supervised learning in this context benefits by learning from data
that has been accurately labeled using human judgment. Humans have a
unique cognitive capacity to find and extract meaning from context—a skill
that machines still lack. It is this context that enables supervised models to
map features of the data to desired outcomes.
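One way to picture a human-generated label is as a target field embedded alongside the existing input record. The schema below is purely hypothetical, for illustration:

```python
from dataclasses import dataclass, asdict

@dataclass
class LabeledFrame:
    """A hypothetical record pairing existing input data with a
    human-judged target value."""
    frame_id: str        # existing input data: a video frame reference
    shopper_id: str
    has_product_1: bool  # the target: "Does the shopper have Product 1
                         # in their cart?"
    labeler: str = "human"  # provenance of the judgment

record = LabeledFrame(frame_id="frame_0412", shopper_id="s_17",
                      has_product_1=True)
```

The target field is what a supervised model trains against; the rest of the record is the input it learns to map from.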
Organizations can use data labeling to embed specific targets into
the existing data, in order to answer specific questions such as,
“Does the shopper have Product 1 in their cart?”
[Figure: Examples of Data Labeling]
When Do You Need Data Labeling?
Unsupervised and supervised learning approaches each solve
different types of problems and have different use cases. The power of
unsupervised methods has been widely touted recently, but the term unsupervised
has become overloaded. The preferred term for using ML to harness the
power of vast amounts of data without requiring external data labeling
is self-supervised learning2. This approach can be powerful, given a set of
data containing meaningful associations or segmentations.
2 Yann LeCun, 2018: "Most of human and animal learning is unsupervised learning. If intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake." In April 2019 he added: "I now call it 'self-supervised learning', because 'unsupervised' is both a loaded and confusing term." https://www.facebook.com/722677142/posts/10155934004262143/
Yet it cannot deliver answers to the specific questions a business may need
answered to deploy targeted ML solutions (e.g., "Is the guardrail in the drone
footage damaged?"), because the answers simply do not exist within the available
data. When the business needs answers to these specific questions to derive
value, teams can label the data after the fact to embed the targets into
the available data. The labeled data set enables data science teams to take the
supervised approach and broaden their business applications.
Do you need labeled data?

What do you need to do with the data?
• Explore the aggregate data set and discover patterns → Unsupervised Learning. Labeling is typically not required for unsupervised learning.
• Answer a specific question about each data point → Supervised Learning. Have you identified the target value for each input data record?
  • Yes, target values have been collected with the data → additional labeling is not required.
  • No, target values have not been identified in the data → data labeling (human-generated labels) is required.
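The decision flow above can be sketched as a small function; the goal strings and parameter names are illustrative, not a formal API.

```python
def needs_data_labeling(goal: str, targets_collected: bool = False) -> bool:
    """goal: 'explore' (aggregate patterns, unsupervised) or
    'answer' (a specific question about each data point, supervised)."""
    if goal == "explore":
        # Unsupervised learning: labeling is typically not required.
        return False
    if goal == "answer":
        # Supervised learning: label only if targets weren't collected
        # alongside the data.
        return not targets_collected
    raise ValueError(f"unknown goal: {goal!r}")
```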
Tapping into the Potential Value of Data
In order to extract business value from data, an organization must have
the ability to mine unique insights from that data. AI and ML are powerful
tools in this effort but they rely on insights available within the data. If
those insights are not already encoded in the data itself, data labeling can
be a means to drive significant value for the organization. Although data
labeling consumes time and resources, the potential return on investment
is substantially higher. In many cases, after-the-fact labeling is the only way
to effectively mine these insights. Labeled data enables supervised ML to
extract actionable business value that is impossible to achieve otherwise.
Building a Data Strategy

A sound data strategy sets the stage for enterprises to get value from
their data using ML. Organizational considerations for the data itself, the
team of people building value from the data, the infrastructure supporting
data operations, and the project timeline and resources are all critical
foundations for success. System readiness may also impact the decision
on which ML approach the data science team chooses to invest in.
DATA ACCESSIBILITY
It is vital for organizations to understand what data they have, where it
resides, and what data they don’t yet have access to. It is equally important
to assess the sophistication of the overall data strategy. What kind of cross-
functional support and data integrations already exist within the enterprise?
Is data collection complete or ongoing?
PEOPLE
Implementing a successful ML operation in an enterprise requires a team
of skilled people who understand the technology and application needs.
For enterprise-level efforts, the best approach is to have a mix of technical,
data, and domain experts: technical experts to deploy and maintain
technical ecosystems, data experts to extract and translate data assets, and
domain experts to provide business context.
INFRASTRUCTURE
In the same way that supporting an ML effort requires specialized
people and skill sets, it also requires the right infrastructure to support
the integration, storage, accessibility, and labeling of training data.
Organizations should ask if they have an existing infrastructure to build,
train, manage, and deploy ML models. Does that infrastructure easily
support experimentation? If not, should the organization build this
infrastructure? Does this infrastructure support the necessary encryption
and data security methodologies?
TIME AND RESOURCES
Organizations must determine whether or not they have the time and
resources to collect, transform, explore, label, and operationally use the
data in their ML project. Projects vary in potential payoff and risk—payoff
resulting from a well-performing model trained on viable data, and the risk
of deploying a poorly performing model that fails to deliver business value.
They will also need to be ready to scale their operations as an ML model is
being developed under real-world circumstances.
Why Quality Matters in Data Labeling

When the business is able to identify a specific question of value that
can be answered with the available data, they can take a supervised ML
approach to realize that value. To train the supervised ML model, you must
identify a set of input values (the data) and corresponding target values (the
known answers), which need to be encoded into the training data set via
data labeling. This process is easier said than done, as the labeling must
be done with a high degree of accuracy and massive volumes of high-
quality training data are required to ensure the effectiveness of the model.
It is well known that training data quality determines the performance of
machine learning systems. This is the “garbage-in, garbage-out” concept.
Overall performance of even the most sophisticated model can be easily
compromised if it is trained on data that is poorly labeled or does not
accurately reflect the target values. Errors in labeling target values or
misclassifying the targets lead to errors in the model. These errors also
proliferate as they flow through an ML application. Failing to maintain
accuracy as data labeling scales from proof-of-concept to production can
bring a project to a halt. Building the ability to generate and scale high-
quality training data over time is a means of protecting the organization’s
ML investments and mitigating the risk of compromising an otherwise
successful model.
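The "garbage-in, garbage-out" effect is easy to demonstrate in a few lines of pure Python (the numbers are hypothetical): the same simple threshold model is trained twice, once on correct labels and once with a single mislabeled record, then scored on clean held-out data.

```python
def fit_threshold(xs, ys):
    """Learn a decision threshold: the midpoint between class means."""
    lo = [x for x, y in zip(xs, ys) if y == 0]
    hi = [x for x, y in zip(xs, ys) if y == 1]
    return (sum(lo) / len(lo) + sum(hi) / len(hi)) / 2

def accuracy(threshold, xs, ys):
    return sum(int(x > threshold) == y for x, y in zip(xs, ys)) / len(xs)

train_x = [10, 20, 30, 40, 60, 70, 80, 90]
clean_y = [0, 0, 0, 0, 1, 1, 1, 1]
noisy_y = [0, 0, 0, 0, 1, 1, 1, 0]   # one record mislabeled

# Clean held-out evaluation data with correct labels.
test_x = [5, 45, 52, 55, 95]
test_y = [0, 0, 1, 1, 1]

acc_clean = accuracy(fit_threshold(train_x, clean_y), test_x, test_y)
acc_noisy = accuracy(fit_threshold(train_x, noisy_y), test_x, test_y)
```

Even a single labeling error shifts the learned threshold and costs held-out accuracy; at production scale, such errors compound across the whole model.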
Establishing a Training Data Pipeline
A simple way to think about data labeling is to liken it to the process of
converting crude oil to refined fuel. While data is the new currency in AI/
ML, quantity and availability of data is only part of the story. Supervised
ML models cannot run on data that lacks targets, in the way a vehicle
cannot run on raw petroleum. In fact, the lack of ability to scale labeled
data or training data is often the biggest blocker to accelerated ML model
development. Organizations that can find the most reliable and economical
path to accurately labeled data quickly gain value and competitive
advantage. This path is an essential part of the training data pipeline that
carries tremendous impact on both the performance of the ML model and
productivity of the data science team.
Options to label your data include in-house labeling using homegrown tools,
third-party crowdsourcing, and full-service data labeling platforms. Teams
need to evaluate labeling options against their ML use case, timeline
and resources, and data volume and accuracy requirements, then determine
how much time their data scientists can afford to invest in labeling
efforts. The selection process deserves careful consideration,
as each technique produces output that varies in accuracy, speed,
security controls, and operational cost, as well as in its impact on model
performance.
Data Labeling Techniques

In-house Labeling
• What is it? Data science teams label the data with homegrown tools.
• Pros: high accuracy; can be labeled by domain experts; visibility into the process; can manage subjective cases.
• Cons: extremely resource intensive and expensive; high operational cost; detracts data scientists' time from advancing ML; long lead time; not scalable.

Crowdsourcing
• What is it? Use of external crowdsourcing resources to label data.
• Pros: economical for simple tasks; easy to access resources; fast; scalable.
• Cons: difficult to increase accuracy; poor quality and security controls; unknown workers; not designed to support complex use cases.

Data Programming
• What is it? Use of scripts to programmatically label data.
• Pros: can be produced without manual labeling; useful for cleaning up "noisy data"; faster and cheaper than manual labeling.
• Cons: less accurate than manual labeling; limited ML applications; data quality issues.

Synthetic Labeling
• What is it? Use of a generative model to programmatically generate data that imitates specific properties of the real data.
• Pros: can overcome data usage restrictions; scalable; can simulate data or conditions not available in real data.
• Cons: limited ability to represent real data; limited to simulating general trends; can lead to poorly generalized models; requires higher computational power.

Data Labeling Platforms
• What is it? External platforms with partial or full services, designed to scale data labeling.
• Pros: high accuracy; removes labeling tasks from the team; can manage complex or subjective tasks; qualified workers; scalable.
• Cons: higher cost than crowdsourcing; level of quality, speed, and security depends on the chosen platform.
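The data programming technique above can be sketched with a few heuristic labeling functions whose majority vote produces a programmatic label; the heuristics themselves are hypothetical spam rules, chosen only for illustration.

```python
ABSTAIN = None  # a labeling function may decline to vote

def lf_keyword(text):
    """Rule: an obvious spam phrase marks the message as spam (1)."""
    return 1 if "free money" in text.lower() else ABSTAIN

def lf_shouting(text):
    """Rule: all-caps messages look like spam."""
    return 1 if text.isupper() else ABSTAIN

def lf_greeting(text):
    """Rule: personal greetings look legitimate (0)."""
    return 0 if text.lower().startswith(("hi", "hello")) else ABSTAIN

def programmatic_label(text, lfs=(lf_keyword, lf_shouting, lf_greeting)):
    """Majority vote over non-abstaining labeling functions;
    abstain entirely when no heuristic fires."""
    votes = [v for lf in lfs if (v := lf(text)) is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)
```

This captures both the appeal (fast, cheap, no manual labeling) and the cons listed in the table: records where no rule fires go unlabeled, and the labels are only as accurate as the heuristics.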
Choosing a Data Labeling Platform
Third-party data labeling platforms combine the best of these techniques to accelerate ML initiatives. An enterprise-grade platform
needs to be able to scale your data labeling and take all labeling tasks off your team's plate, without compromising on
accuracy. Data science teams who need to scale data labeling as they move from proof-of-concept to production, or those who need
to offset the high cost of in-house data labeling, will benefit from a third-party labeling platform.
While labeling quality and throughput vary by platform, most platforms offer a combination of annotation tools,
task design frameworks, and a team of humans who will annotate your data. In selecting the right platform for your project, it is
important to know ahead of time what you're trying to do with the data and the level of complexity involved in labeling it.
Some platforms are great for straightforward tasks, while others are made to handle complex use cases and subjective judgments.
Below are additional things to consider as you look for a good project-to-platform fit.
PLATFORM
Is the platform able to build and distribute high-volume labeling tasks
to a qualified workforce? Can the platform be configured to adapt to
organizational needs? Can it be integrated with existing internal ML
development technologies or support API data exchange? Look for
providers who deliver a high level of visibility, feedback, and flexibility to
allow for continuous feedback and optimizations.
TOOLS
What tools are available to label data, whether it be text, images, audio,
or video? Are these tools customizable, or do they put constraints on data
labeling capabilities? Does the platform technology support high accuracy
labeling for subjective use cases? And is the platform provider investing in
R&D to optimize their platform tools and improve throughput and accuracy?
PEOPLE
Who comprises the workforce that labels the data? Are the annotators
screened, vetted, and trained appropriately to make domain-specific
judgments? Are project management and support services included?
For managed platforms, how well does your account manager
understand your data objectives, and are they able to take task design,
workforce training, tooling configuration, and quality management off the
hands of your data science team?
PROCESS & BEST PRACTICES
Does the solution employ industry best practices to scale and improve
training data without compromising on accuracy? What is their track record
on delivering large-scale enterprise data labeling, and does the platform
offer flexibility to support use cases requiring multi-stage annotation
workflows, complex classifications, or conditional quality control strategies?
QUALITY CONTROL
Does the provider deliver a targeted quality control strategy that is
optimized for the organization’s use case and budget, or is it one size
fits all? What QC strategy do they plan to implement to increase your
data accuracy? Does the provider have a robust screening and training
curriculum to qualify the annotation workforce? Are they able to evaluate the
labeling accuracy against gold or ground truth data?
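As a sketch of the gold-data check described above, one might score an annotator's labels against a small ground-truth set, reporting overall and per-class agreement so systematic errors on a particular class stay visible. The labels here are hypothetical.

```python
def score_against_gold(annotator, gold):
    """Compare an annotator's labels to gold (ground truth) labels.
    Returns overall accuracy and a per-class breakdown."""
    assert len(annotator) == len(gold)
    overall = sum(a == g for a, g in zip(annotator, gold)) / len(gold)
    per_class = {}
    for cls in set(gold):
        pairs = [(a, g) for a, g in zip(annotator, gold) if g == cls]
        per_class[cls] = sum(a == g for a, g in pairs) / len(pairs)
    return overall, per_class

gold      = ["cat", "cat", "dog", "dog", "dog", "bird"]
annotator = ["cat", "dog", "dog", "dog", "cat", "bird"]
overall, per_class = score_against_gold(annotator, gold)
```

A per-class view matters because a high overall score can hide an annotator who consistently confuses two specific classes.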
SECURITY
What data and platform security protocols are in place? Are the integration
points and APIs safe for data transfer? Are their annotators NDA-ready? Do
they have data encryption and portal access controls in place, and are they
able to meet your organization’s compliance requirements?
Operationalizing Machine Learning Models
Building a powerful machine learning model, especially with today’s most
popular deep-learning algorithms, requires vast amounts of labeled training
data. However, training is not the only point in the ML development lifecycle
that requires labeled data. Testing and validating models consume data
with externally generated labels, and models in production require labeled
data to handle exceptions or verify that existing labels still accurately reflect
the real-world context.
Therefore, supervised ML initiatives require continuous learning through
a steady supply of high-quality, high-volume labeled datasets throughout
the development lifecycle. Building a systematic solution or a training data
pipeline not only drives model success and business value, it represents an
opportunity to create competitive advantages.
[Figure: The ML development lifecycle. Data Preparation → Model Selection & Training (training data set: train a range of selected models) → Model Scoring & Validation (validation data set: evaluate model fit; check for overfitting or underfitting) → Model Deployment (test/holdout set: evaluate the accuracy of the trained model with test data) → Production Model Monitoring (model monitoring set: evaluate the new data set on the deployed model to account for context or data drift)]
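The lifecycle above implies reserving labeled data for each stage. A minimal sketch, with illustrative (not prescriptive) proportions, assuming records have been shuffled upstream:

```python
def lifecycle_split(records, fractions=(0.6, 0.2, 0.1, 0.1)):
    """Split labeled records into train / validation / test (holdout) /
    monitoring sets. The last split absorbs any rounding remainder."""
    assert abs(sum(fractions) - 1.0) < 1e-9
    splits, start = [], 0
    for frac in fractions[:-1]:
        end = start + int(len(records) * frac)
        splits.append(records[start:end])
        start = end
    splits.append(records[start:])  # remainder goes to the monitoring set
    return splits

data = list(range(100))  # stand-ins for labeled records
train, val, test, monitor = lifecycle_split(data)
```

Keeping a monitoring slice with externally generated labels is what allows a deployed model to be checked for context or data drift after release.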
Generating Advantages through Data Labeling Platforms
Organizations have the ability to develop differentiated business value and
accelerate their ML initiatives through purpose-specific data labeling. Here
are five key ways that a data labeling platform can improve the execution of
ML initiatives and, ultimately, the quality of ML applications.
1. Increase net data value – Extract value from your data and develop
new feature opportunities by identifying and embedding target attributes
in your existing data, at scale. Data labeling platforms can be the best
means of accurately injecting the targets or labels into your data,
enabling teams to take the supervised learning approach.
2. Accelerate ML initiatives – Great data labeling platforms convert
collected data into high-accuracy, model-ready data and seamlessly
feed it into ML models across the stages of model development. The
ability to scale data labeling drives higher model confidence, broadens
the range of ML applications, and speeds time to value.
3. Maintain model accuracy with continuous learning – AI/ML
models can deliver exponential growth in data value and productivity
gains if well monitored. By creating a training data workstream to
train, validate, and test models over time, organizations ensure the
models continue to produce accurate results amid context and data drift.
4. Mitigate project risk – Lower risk for ML initiatives by avoiding wasted
effort or time working with low-quality data labels that can undermine an
organization’s models and AI initiatives. Take advantage of a platform
provider who securely stores and transports data, and who provides a
workforce that protects the value of their client’s proprietary data.
5. Reduce costs and resource investments – Eliminate the costs of
building data annotation tools and project management by outsourcing
data labeling to a platform designed to scale data labeling. Top-
tier data labeling platforms enable data science teams to focus on
advancing their ML model development and eliminate wasted cycles
that compromise project success.
Getting Ahead with AI and ML
It is an exciting time for enterprise AI. Enterprises that understand how
to effectively extract value from their data using ML will differentiate
themselves from their competitors through increased revenue, cost
savings, and more advanced product features. Because ML models
rely on massive volumes of high-quality training data throughout the
project lifecycle, establishing and maintaining a robust training data
pipeline is critical to success.

Data labeling platforms present a powerful option to increase the
value of your data, secure a sound training data strategy, and create a
competitive advantage for the business and its data science teams.
ABOUT ALEGION
Alegion provides data labeling to power enterprise machine learning (ML)
initiatives. Our offering operates at massive scale, combining a data and
task management platform with a global network of trained data specialists
to handle use cases ranging from the straightforward to the highly complex.
We assist data science teams throughout the ML lifecycle, delivering high-
quality data labeling for model training, model validation, and exception
handling. Our platform integrates the best of ML-assisted annotation and
quality control workflows with human insights from a team of trained
specialists to deliver model training data for computer vision, natural
language processing, and entity resolution.
With our specialized labeling platform and dedicated specialists, we
offload all data annotation tasks, freeing data science teams to focus on
their model development. The Alegion platform delivers data to power
ML projects across a range of industries including retail, tech, finance,
healthcare, automotive, and manufacturing.
Want to learn more? REACH OUT TO