IBM Cloud and Cognitive Software Fast Start 2020 #FastStart2020

IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak™ for Data – A Data Quality Deep Dive

Dan Schallenkamp
Data and AI, Offering Manager for Data Quality

Thurs. 30-April-2020 CHI UG Meeting

Legal Disclaimer

© IBM Corporation 2020. All Rights Reserved. The information contained in this publication is provided for informational purposes only. While efforts were made to verify the completeness and accuracy of the information contained in this publication, it is provided AS IS without warranty of any kind, express or implied. In addition, this information is based on IBM’s current product plans and strategy, which are subject to change by IBM without notice. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this publication or any other materials. Nothing contained in this publication is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.

References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in this presentation may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.

Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.

All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer.

Session Agenda

• Where is Data Quality positioned in our offerings?
• Business Value / Purpose
• Data Quality – Key Capabilities
• What’s New in the current GA release?
• Demo
• What’s Planned in the Next release?
• Demo

You Are Here: How this session fits in the DataOps story

The AI Ladder
A prescriptive approach to accelerating the journey to AI

• Infuse: Operationalize AI throughout the business
• Analyze: Build and scale AI with trust and transparency
• Organize: Create a business-ready analytics foundation
• Collect: Make data simple and accessible
• Modernize: Make your data ready for an AI and hybrid cloud world

DataOps is the concept to deliver Business Ready Data

Collect, Organize, Analyze, Infuse your data with AI

Analytics and AI at scale and speed, to drive:
• Operational Efficiency
• Data Quality
• Data privacy & compliance

DataOps (DevOps for Data + Data Operations)

• A concept, like DevOps for Data, enabling collaboration between data consumer & data provider at speed & scale

• Automated data operations providing curated data pipeline

• Drives agility and innovation everywhere

People Process Technology

Data Quality – Key Capabilities

Cloud Pak for Data

End-to-End Platform for Business-Ready Data
Integration of data quality (from Information Analyzer), data governance (from Information Governance Catalog) and data consumption (from Watson Knowledge Catalog), now under one experience and brand.

Enterprise Data Integration (DataStage)
• Build ETL jobs
• Run ETL jobs
• Monitor
• Extract data
• Collect metadata
• Move data
• Ingest data

Enterprise Data Quality
• Deep data profiling
• Data quality scoring
• Apply and monitor validation rules against source data

Enterprise Data Governance
• Data lineage
• Data ownership
• Data stewardship
• Data governance workflow
• Discover metadata assets
• Classify data assets
• Build data glossary
• Manage metadata repository
• Manage Reference Data

Enterprise Data Consumption
• Search and find relevant data
• Connect & prepare data for consumption & analysis
• Consume and analyze the data
• Comment, rate and share

AI Lifecycle (Watson Studio, Watson Machine Learning, and Open Scale)
• Ground Truth gathering
• Data Cleansing
• Feature Engineering
• Model Selection
• Parameter Optimization
• Ensemble
• Model Validation
• Model Deployment
• Runtime Monitoring
• Model Improvement

Personas: Data Engineers, Data Governance Teams, Data Citizens

IBM Watson Knowledge Catalog on Cloud Pak for Data

Data Profiling and Quality – Core Capabilities

Analyze – Deep Data Profiling & Analysis
Provides the key understanding of the source data

• Column analysis
• Business Term Assignments
• Data Classification
• Data Quality scores
• Primary Key analysis
• Relationship and Overlap analysis

Monitor Data Quality – using Business Rules
Evaluates user-defined rules against the source data

• Data Rules – targeted evaluation
• Rule Sets – combined assessment
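To make the two core capabilities concrete, here is a minimal Python sketch of column analysis and of data rules versus rule sets. It is an illustration only, not how Information Analyzer or Watson Knowledge Catalog implements them. The sample rows, the column names, the rule names and the choice that a record passes a rule set only when it passes every rule are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical sample rows; column names are illustrative only.
rows = [
    {"customer_id": "C001", "email": "ann@example.com", "age": 34},
    {"customer_id": "C002", "email": None, "age": 51},
    {"customer_id": "C003", "email": "bob-at-example", "age": 230},
    {"customer_id": "C003", "email": "cy@example.com", "age": 19},
]

def profile_column(rows, column):
    """Basic column analysis: row count, nulls, distinct values, key candidacy."""
    values = [r[column] for r in rows]
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(counts),
        # Assumption: a primary key candidate is complete and unique.
        "key_candidate": len(non_null) == len(values) and len(counts) == len(values),
    }

# Data rules: each one is a targeted check on a single record.
data_rules = {
    "email_has_at_sign": lambda r: r["email"] is not None and "@" in r["email"],
    "age_in_range": lambda r: 0 <= r["age"] <= 120,
}

def run_rule(rows, rule):
    """Return the fraction of rows that pass one data rule."""
    return sum(1 for r in rows if rule(r)) / len(rows)

def run_rule_set(rows, rules):
    """Rule set (assumed semantics): a row passes only if it passes every rule."""
    return sum(1 for r in rows if all(rule(r) for rule in rules.values())) / len(rows)

print(profile_column(rows, "customer_id"))
for name, rule in data_rules.items():
    print(name, run_rule(rows, rule))
print("rule set", run_rule_set(rows, data_rules))
```

The duplicate `C003` shows why primary key analysis matters: the column is complete but not unique, so it is not reported as a key candidate.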


How to get the best results from Quick scan and Auto Discovery ... Example: for your critical data elements (CDEs)

Step 1: Business Terms
Define Terms, Policies and Rules for your top 50 or 150 CDEs

Step 2: Data Classes
Examine the 200+ built-in data classes, disable those you don’t need, create and test custom data classes. You must link every data class to a business term.

Step 3: Automation Rules
Create Automation Rules for your top 50 or 150 CDEs
- ARs trigger based on Business term assignments
- Can automatically bind/create Quality Rules

Step 4: DQ Dimensions
Examine the 11 built-in data quality dimensions, enable/disable as needed, create and install custom dimensions. Used to calculate the DQ Score for given columns.

Step 5: Auto Discover
• Automatic metadata import
• Analysis
• Auto classification
• Auto term assignment
• Data quality scores

Innovation

Homework: Spend time customizing the tool
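Step 4 says the enabled dimensions feed the DQ Score for a column. The slides do not give the scoring formula, so the sketch below assumes a simple one: the score is the share of values that violate no enabled dimension. The two dimension names and their checks are hypothetical stand-ins, not the product's built-in dimensions.

```python
def missing_value(value):
    """Hypothetical dimension: value is null or empty."""
    return value is None or value == ""

def suspect_length(value, max_len=10):
    """Hypothetical custom dimension: value is longer than expected."""
    return value is not None and len(str(value)) > max_len

# Each dimension can be enabled or disabled, as in Step 4.
dimensions = {
    "missing_values": {"enabled": True, "violates": missing_value},
    "suspect_length": {"enabled": False, "violates": suspect_length},
}

def dq_score(values, dimensions):
    """Assumed formula: share of values with no violation in any enabled dimension."""
    active = [d["violates"] for d in dimensions.values() if d["enabled"]]
    clean = sum(1 for v in values if not any(check(v) for check in active))
    return clean / len(values)

values = ["Chicago", "", None, "New York City", "Boston"]
print(dq_score(values, dimensions))
```

Enabling `suspect_length` lowers the score for this column because "New York City" exceeds the assumed length limit, which is the kind of effect to look for when deciding which dimensions to keep on.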


Quick scan – Blazing Fast Bulk Discovery

An easy way to start the import, analysis, quality scores, data classification (to find PII data) and automatic business term assignments all with one easy operation.

(see screen shots in demo section below)


Classification and Automatic Business Term Assignment

Data Sources: Systems of Record, Systems of Engagement, Systems of Insights, Cloud, Social Media, News, Documents, Hadoop, Others

Data Discovery (Quick scan)
• Cognitive & Deep Learning
• ML Classification
• Rule Based Classifiers

Recommendations & Auto Term Assignment

Curator Dashboard decisions: Approve, Reject / Modify

Enterprise Data Catalog

Publish, Feedback, Training


Automated Data Classification

• Regex/Valid Value/Java Classifiers
• JavaScript Classifiers
• Column Similarity classifiers
• Public Domain Classifiers
• Table Classifiers
• Auto Grouping and Suggestion
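As an illustration of the first classifier type in the list, the following Python sketch builds a regex classifier and a valid-value classifier and assigns a data class to a column when enough values match. The data class names, the patterns, the value list and the 0.8 threshold are assumptions for this example, not product defaults.

```python
import re

def regex_classifier(pattern):
    """Build a classifier that matches a whole value against a regex."""
    compiled = re.compile(pattern)
    return lambda value: compiled.fullmatch(value) is not None

def valid_values_classifier(allowed):
    """Build a classifier that checks membership in a list of valid values."""
    allowed_set = {a.upper() for a in allowed}
    return lambda value: value.upper() in allowed_set

# Hypothetical custom data classes.
data_classes = {
    "US ZIP Code": regex_classifier(r"\d{5}(-\d{4})?"),
    "US State Code": valid_values_classifier(["IL", "NY", "CA", "TX"]),
}

def classify_column(values, data_classes, threshold=0.8):
    """Pick the data class matching the largest share of values, if above threshold."""
    best_name, best_share = None, 0.0
    for name, matches in data_classes.items():
        share = sum(1 for v in values if matches(v)) / len(values)
        if share > best_share:
            best_name, best_share = name, share
    if best_share >= threshold:
        return best_name, best_share
    return "No Class Detected", best_share

print(classify_column(["60601", "10001-1234", "94105", "7300A", "73301"], data_classes))
print(classify_column(["IL", "ny", "CA"], data_classes))
print(classify_column(["abc", "def"], data_classes))
```

Testing a custom class this way against a handful of known-good and known-bad values is a cheap check before enabling it for a full discovery run.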


Automated Data Quality

• Quality Analysis
• Quality Rules
• Quality Dimensions
• Automation Rules
• ML-suggested rules
• Business Term Assignment


Data Quality

- The Importance of Quality Addresses
- A word on Workflow

The Importance of Quality Addresses

Good quality addresses are foundational to so many initiatives including:

• Know Your Customer (Prospect, Employee, Vendor, Patient)

• Data Quality in general and Matching and Deduplication specifically

• Shipping, mailing, logistics

IBM’s QualityStage Address Verification Interface (AVI) is tightly integrated with QualityStage

Questions:

• What do you use today to parse, correct, enhance & verify addresses?

• How often do you cleanse all your addresses and at what cost?

• Do you need to add lat/long coordinates to addresses?


Address Parse/Validate/Enhance

Capabilities

– Supports over 248 countries and territories
– Improved verification, suggestion and correction results in batch or real time
– Bi-directional Transliteration support for 8 languages
– Tightly integrated into InfoSphere QualityStage
– Process multiple countries in a single run
– Latitude and longitude assignment
– US Census* and UK PAF data

Benefits

– Reduced errors in shipping/mailing & other activity, lowers cost
– Better customer service and increased revenue
– Increase business confidence when using enterprise data for critical decision making
– Enhanced and standardized address data supports record matching & de-duplication


Data Quality – What’s New in Watson Knowledge Catalog?

EVERYTHING is New! All DQ is New!

Data Quality – Retire the two older IA clients in 11.7.1 SP2

11.7.1 clients:

– Information Analyzer OneUI: zero footprint, microservices based client (requires the ‘UG Stack’)
– Information Analyzer Workbench: Windows based thick client
– Information Analyzer Thin Client (old/first thin client)


A Unified User eXperience (UX) across IIS and WKC

Information Analyzer + Watson Knowledge Catalog + Information Governance Catalog

IBM Cloud Pak for Data: Unified User Experience & Single Catalog

Product Strategy (New)


Data Quality within ICP/WKC

+Watson Knowledge Catalog

IBM Cloud Pak for Data

New



Data Rule Definition Management – For the business user


Accelerating Data Quality through ML based automation

Machine Learning assisted Data Quality

• Auto Business Term Assignment – ML assisted
• Auto Business Rule Suggestion – via Automation Rules based on term assignment and data class
• Auto Discovery – a quick way to kick off bulk analysis operations including:
  • Metadata import
  • Data profiling
  • Data quality scores
  • Term assignment

Innovation

Accelerating the Quality & Governance Process

Automating the Governance Process

• Utilizing Machine Learning for an accelerated Metadata Classification Process (Auto Business Term assignment)

• Automatically classify data -- including understanding your PII risk

Innovation

Automation through Machine Learning


Automation Rules – Designed for the business user

• Automatic Actions/Rules and DQ threshold based on Term assignments
• Enable/Disable all or individual built-in data quality dimensions
• Auto-bind one or more Data Rule Definitions

Innovation
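The three bullets can be read as "when a column carries a given business term, apply these settings". The Python sketch below models that reading. The class names, field names and the example term are hypothetical, and the real product stores and evaluates automation rules in its own way.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AutomationRule:
    """Hypothetical automation rule keyed on a business term."""
    term: str
    dq_threshold: float
    disabled_dimensions: set = field(default_factory=set)
    rule_definitions: list = field(default_factory=list)

@dataclass
class Column:
    """Hypothetical analyzed column with its assigned business terms."""
    name: str
    terms: set
    dq_threshold: Optional[float] = None
    disabled_dimensions: set = field(default_factory=set)
    bound_rules: list = field(default_factory=list)

def apply_automation_rules(column, automation_rules):
    """Apply every automation rule whose term is assigned to the column."""
    for rule in automation_rules:
        if rule.term in column.terms:
            column.dq_threshold = rule.dq_threshold
            column.disabled_dimensions |= rule.disabled_dimensions
            column.bound_rules.extend(rule.rule_definitions)
    return column

rules = [
    AutomationRule(
        term="Email Address",
        dq_threshold=0.95,
        disabled_dimensions={"suspect_length"},
        rule_definitions=["email_has_at_sign"],
    )
]

email_col = apply_automation_rules(Column("CONTACT_EMAIL", {"Email Address"}), rules)
city_col = apply_automation_rules(Column("CITY", {"City"}), rules)
print(email_col)
print(city_col)
```

Because the trigger is the term assignment, improving automatic term assignment directly increases how many columns receive rules and thresholds without manual binding.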


SQL Virtual Tables

Can greatly simplify the creation and maintenance of data rule logic by ‘pushing’ the complexities to the source database. Table JOINs, filters, etc.
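A small sketch of the idea, using Python's built-in `sqlite3` module and an ordinary SQL view as a stand-in for a SQL virtual table. The table names, column names and the rule itself are made up for illustration. The point is only that the JOIN lives in the database, so the rule logic stays a simple filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'US'), (2, 'DE');
INSERT INTO orders VALUES (10, 1, 25.0), (11, 2, -5.0), (12, 3, 40.0);

-- The JOIN is pushed to the source database once, inside the view.
CREATE VIEW order_check AS
SELECT o.id AS order_id, o.amount, c.country
FROM orders o LEFT JOIN customers c ON c.id = o.customer_id;
""")

# The rule only needs to reference the flat view:
# an order is an exception if its amount is not positive
# or if it has no matching customer.
exceptions = conn.execute(
    "SELECT order_id FROM order_check "
    "WHERE amount <= 0 OR country IS NULL ORDER BY order_id"
).fetchall()
print(exceptions)
conn.close()
```

Order 11 fails on amount and order 12 fails because customer 3 does not exist, and neither check required the rule author to write a join.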


Data Quality – What’s New?

In IIS 11.7.1 SP2 and Also in WKC?

What’s New with the Nov 2019 Release?

IIS 11.7.1 SP2 and CPD WKC 2.5

1. 90% of IA (including Quick scan and Auto discovery) is included in WKC, with a common UX - Demo
2. Create/edit/delete virtual columns (both)
3. Limit the number of Data Rule output exceptions (both)
4. Validity Benchmark is back in Data Rules (both)
5. ‘Manage’ Flag in Data Rules (IIS only today)
6. Remember many user choices/preferences (both)


Create/Edit/Delete Virtual Columns (both) 1 of 2

• Choose ‘Create virtual column’ from the Columns tab

• If you ‘Select’ an existing virtual column you can choose ‘Edit’ or ‘Delete’


Create/Edit/Delete Virtual Columns (both) 2 of 2

• Add two or more columns

• Move up or down

• Choose field separator and other settings

• Provide a name and description

• Treated like any other column. You can analyze, run Rules against it, etc.
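Conceptually, a virtual column concatenates two or more source columns in a chosen order with a chosen separator. The Python sketch below mimics that behavior on in-memory rows. The function name and parameters are hypothetical and do not correspond to a product API.

```python
def add_virtual_column(rows, name, source_columns, separator=" "):
    """Return new rows with a virtual column built by joining source columns in order."""
    if len(source_columns) < 2:
        raise ValueError("a virtual column needs two or more source columns")
    result = []
    for row in rows:
        new_row = dict(row)  # leave the original row untouched
        new_row[name] = separator.join(str(row[c]) for c in source_columns)
        result.append(new_row)
    return result

people = [{"first": "Ann", "last": "Lee"}, {"first": "Bo", "last": "Ruiz"}]

# Column order matters, which is what "Move up or down" controls.
full_name = add_virtual_column(people, "full_name", ["first", "last"])
sort_name = add_virtual_column(people, "sort_name", ["last", "first"], separator=", ")
print(full_name[0]["full_name"])
print(sort_name[0]["sort_name"])
```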


Limit # of Data Rule output exceptions (both)

• Sometimes the first 100 or 1000 exceptions are more than enough to share in order to describe and diagnose the quality issue

• Can be a big time savings and disk savings vs the output of all exceptions
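The savings come from stopping early instead of materializing every failing record. A minimal Python sketch of that idea, using a generator and `itertools.islice`, is shown here. The rule, the synthetic data and the function names are assumptions for illustration.

```python
from itertools import islice

def find_exceptions(rows, rule):
    """Lazily yield rows that fail the rule."""
    for row in rows:
        if not rule(row):
            yield row

def limited_exceptions(rows, rule, max_exceptions=100):
    """Collect at most max_exceptions failing rows, then stop reading input."""
    return list(islice(find_exceptions(rows, rule), max_exceptions))

# Synthetic source of one million rows, produced lazily.
rows = ({"id": i, "amount": i % 3 - 1} for i in range(1_000_000))

# Hypothetical rule: amount must not be negative.
sample = limited_exceptions(rows, lambda r: r["amount"] >= 0, max_exceptions=5)
print([r["id"] for r in sample])
```

Only the first few rows are ever read here, because the generator is abandoned as soon as the cap is reached.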


Validity Benchmark is back (both)

• A longtime IA feature that some customers are using

• Added to help those customers make the move to the new UI and to WKC


‘Manage’ Flag in Data Rules (IIS only today)

• Previously only available in DQEC

• And only showed up in DQEC if the Data Rule has been executed


Planned Live Demo


IBM Cloud Pak for Data WKC
Select Roadmap Items


What Can We Expect in the Next Release?

Planned for mid-June, 2020 release (subject to change): WKC 3.0 and 11.7.1 FP1

1. New, much more intuitive Data Quality menu structure (both)
2. Negative term classification (both)
3. WKC experience for Data Rule exceptions (DQEC replacement) (WKC)
4. Data Rule binding drag and drop (both)
5. Visualization of Data Quality scores over time (both)
6. On-going DQ architecture modernization (WKC)
7. New ‘Column Similarity’ (aka Fingerprint) data class (WKC)
8. Many minor UX improvements (retain user preferences, etc.) (both)
9. Relationship Analysis more intuitive (both)
10. Globalization (Translation of our UIs into several languages) (WKC)
11. ML Based Data Rule Definition Generation (WKC)
12. Suggested Automation Rule (available today in 11.7.1 SP2, planned for WKC)


Negative Term Classification

• Improving DQ & Governance for business term assignment

• Remember what the user has manually rejected

• Compare to what is already published
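One possible reading of these bullets, sketched in Python: keep a record of term suggestions the user has rejected, and drop any new suggestion that was either rejected before or merely repeats an assignment that is already published. The data shapes, names and this interpretation of "compare to what is already published" are assumptions, not the planned product behavior.

```python
def filter_suggestions(suggestions, rejected, published):
    """Drop suggestions the user rejected earlier or that are already published."""
    kept = []
    for column, term, confidence in suggestions:
        if (column, term) in rejected:
            continue  # negative feedback: the user already said no
        if published.get(column) == term:
            continue  # nothing new: this assignment is already published
        kept.append((column, term, confidence))
    return kept

# Hypothetical suggestions as (column, term, confidence) tuples.
suggestions = [
    ("CUST_EMAIL", "Email Address", 0.91),
    ("CUST_EMAIL", "Employee ID", 0.55),
    ("ZIP", "Postal Code", 0.88),
]
rejected = {("CUST_EMAIL", "Employee ID")}
published = {"ZIP": "Postal Code"}

print(filter_suggestions(suggestions, rejected, published))
```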


Innovation – Column Similarity

• ‘No Class Detected’ columns are grouped based on similarity

• User can inspect each group, determine the cutoff score

• Create a new codeless Data Class

• The next time analysis is run, the new Data Class is working

• This is a quick way to create codeless custom Data Classes that are unique to a given customer’s data

Easy Data Class Creation – ‘Column Similarity’

• Mimic how a human brain thinks

• Find patterns that are similar across the multiple datasets under evaluation,

• Present them to the user as clusters of “similar patterns”
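The slides do not describe the similarity algorithm, so the following Python sketch swaps in a simple, well-known technique to show the workflow: reduce each value to a character pattern, compare columns by the cosine similarity of their pattern counts, and group columns whose score meets a cutoff. The pattern scheme, the greedy grouping and the sample columns are all assumptions for illustration.

```python
import math
import re
from collections import Counter

def value_pattern(value):
    """Generalize a value: letters become 'A', digits become '9'."""
    pattern = re.sub(r"[A-Za-z]", "A", str(value))
    return re.sub(r"\d", "9", pattern)

def fingerprint(values):
    """Count how often each pattern occurs in a column."""
    return Counter(value_pattern(v) for v in values)

def similarity(fp_a, fp_b):
    """Cosine similarity between two pattern-count fingerprints."""
    dot = sum(fp_a[p] * fp_b[p] for p in fp_a.keys() & fp_b.keys())
    norm_a = math.sqrt(sum(c * c for c in fp_a.values()))
    norm_b = math.sqrt(sum(c * c for c in fp_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def group_columns(columns, cutoff=0.8):
    """Greedy grouping: join the first group whose seed column is similar enough."""
    groups = []
    for name, values in columns.items():
        fp = fingerprint(values)
        for group in groups:
            if similarity(fp, group["fingerprint"]) >= cutoff:
                group["members"].append(name)
                break
        else:
            groups.append({"fingerprint": fp, "members": [name]})
    return [g["members"] for g in groups]

# Hypothetical 'No Class Detected' columns.
columns = {
    "policy_no": ["AB-1234", "CD-5678", "EF-9012"],
    "contract_ref": ["ZX-0001", "QW-4455", "12345"],
    "phone": ["555-0100", "555-0199", "555-0142"],
}

print(group_columns(columns, cutoff=0.8))
print(group_columns(columns, cutoff=0.95))
```

Raising the cutoff splits `contract_ref` away from `policy_no`, which mirrors the step where the user inspects each group and decides on the cutoff score before creating a data class from it.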


New Visualizations and Navigation


New Visualizations – Data Quality score over time



New Navigation Structure



Relationship Analysis


Thank you

Dan Schallenkamp
Data and AI, Offering Manager for Data Quality
[email protected]
+1-704-458-0467