IBM Cloud and Cognitive Software Fast Start 2020 #FastStart2020
IBM Watson® Knowledge Catalog, powered by IBM Cloud Pak™ for Data – A Data Quality Deep Dive
Dan Schallenkamp
Data and AI, Offering Manager for Data Quality
Thurs. 30-April-2020, CHI UG Meeting
completeness and accuracy of the information contained in this publication, it is provided AS IS without warranty of any kind, express or implied. In addition, this information is based on IBM’s current product plans and strategy, which are subject to change by IBM without notice. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this publication or any other materials. Nothing contained in this publication is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.
References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in this presentation may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer.
Session Agenda
• Where is Data Quality positioned in our offerings?
• Business Value / Purpose

• Search and find relevant data
• Connect & prepare data for consumption & analysis
• Consume and analyze the data
• Comment, rate and share

• Data lineage
• Data ownership
• Data stewardship
• Data governance workflow
• Discover metadata assets
• Classify data assets
• Build data glossary
• Manage metadata repository
• Manage Reference Data

• Deep data profiling
• Data quality scoring
• Apply and monitor validation rules against source data
Data Governance Teams
Data Citizens
IBM Watson Knowledge Catalog on Cloud Pak for Data
AI Lifecycle
Ground Truth gathering
Data Cleansing
Feature Engineering
Model Selection
Parameter Optimization
Ensemble
Model Validation
Model Deployment
Runtime Monitoring
Model Improvement
Watson Studio, Watson Machine Learning, and Open Scale
• Build ETL jobs
• Run ETL jobs
• Monitor
• Extract data
• Collect metadata
• Move data
• Ingest data
Data Engineers
End-to-End Platform for Business-Ready Data
Integration of data quality (from Information Analyzer), data governance (from Information Governance Catalog) and data consumption (from Watson Knowledge Catalog), now under one experience and brand.
[Figure: Column Analysis, Primary Key Analysis, Relationship & Overlap Analysis, and Rules Analysis, shown across Source 1 and Source 2]
Analyze – Deep Data Profiling & Analysis
Provides the key understanding of the source data
• Column analysis
• Business Term Assignments
• Data Classification
• Data Quality scores
• Primary Key analysis
• Relationship and Overlap analysis
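As a rough illustration of what column, primary key, and overlap analysis compute, the following Python sketch profiles two small in-memory columns. It is not how Watson Knowledge Catalog or Information Analyzer implement profiling; the function names `profile_column` and `overlap` and the sample customer IDs are hypothetical.

```python
from typing import Any


def profile_column(values: list[Any]) -> dict[str, Any]:
    """Basic column analysis: row, null, and distinct counts (hypothetical helper)."""
    non_null = [v for v in values if v is not None]
    distinct = set(non_null)
    total = len(values)
    return {
        "rows": total,
        "nulls": total - len(non_null),
        "distinct": len(distinct),
        # A single-column primary key candidate has a value in every row
        # and no repeated values.
        "pk_candidate": total > 0 and len(non_null) == total and len(distinct) == total,
    }


def overlap(left: list[Any], right: list[Any]) -> float:
    """Fraction of distinct non-null values in `left` that also occur in `right`."""
    left_values = {v for v in left if v is not None}
    right_values = {v for v in right if v is not None}
    return len(left_values & right_values) / len(left_values) if left_values else 0.0


# Illustrative data only: the same logical column in two sources.
source1_customer_id = [101, 102, 103, 104]
source2_customer_id = [103, 104, 105, None]

p1 = profile_column(source1_customer_id)
p2 = profile_column(source2_customer_id)
shared = overlap(source1_customer_id, source2_customer_id)
print(p1)
print(p2)
print(shared)
```

In this toy data, the first column qualifies as a key candidate, the second does not because of its null, and half of the first column's values also appear in the second.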
Monitor Data Quality – using Business Rules
Evaluates user-defined rules against the source data
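To make "evaluating a user-defined rule against source data" concrete, here is a minimal Python sketch, assuming rows are plain dictionaries and a rule is a function that returns True when a row passes. `evaluate_rule`, `age_is_valid`, and the parameters `benchmark` and `max_exceptions` are hypothetical names; the last two only loosely mirror the validity benchmark and the exception limit mentioned elsewhere in these slides, and none of this is the product's API.

```python
from typing import Callable

Row = dict[str, object]


def evaluate_rule(
    rows: list[Row],
    rule: Callable[[Row], bool],
    benchmark: float = 1.0,
    max_exceptions: int = 100,
) -> dict[str, object]:
    """Evaluate `rule` on each row and compare the pass rate with `benchmark`."""
    failed = 0
    exceptions: list[Row] = []
    for row in rows:
        if not rule(row):
            failed += 1
            # Keep only the first `max_exceptions` failing rows for review.
            if len(exceptions) < max_exceptions:
                exceptions.append(row)
    total = len(rows)
    pass_rate = (total - failed) / total if total else 1.0
    return {
        "total": total,
        "failed": failed,
        "pass_rate": pass_rate,
        "meets_benchmark": pass_rate >= benchmark,
        "exceptions": exceptions,
    }


def age_is_valid(row: Row) -> bool:
    """Example rule (an assumption): age must be an integer from 0 to 120."""
    age = row.get("age")
    return isinstance(age, int) and 0 <= age <= 120


# Illustrative data only.
customers = [
    {"id": 1, "age": 34},
    {"id": 2, "age": -5},
    {"id": 3, "age": None},
    {"id": 4, "age": 61},
]

result = evaluate_rule(customers, age_is_valid, benchmark=0.9, max_exceptions=1)
print(result["pass_rate"], result["meets_benchmark"], result["exceptions"])
```

Capping the stored exceptions keeps the output small while the failure count still reflects every failing row.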
How to get the best results from Quick scan and Auto Discovery ... Example: for your critical data elements
Step 1: Business Terms
Define terms, policies and rules for your top 50 or 150 CDEs.
Step 2: Data Classes
Examine the 200+ built-in data classes, disable those you don’t need, and create and test custom data classes.
You must link every data class to a business term.
Step 3: Automation Rules
Create Automation Rules (ARs) for your top 50 or 150 CDEs.
- ARs trigger based on business term assignments
- ARs can automatically bind/create Quality Rules
Step 4: DQ Dimensions
Examine the 11 built-in data quality dimensions, enable/disable them as needed, and create and install custom dimensions.
Dimensions are used to calculate the DQ score for given columns.
Step 5: Auto Discover
• Automatic metadata import
• Analysis
• Auto classification
• Auto term assignment
• Data quality scores
An easy way to start the import, analysis, quality scores, data classification (to find PII data) and automatic business term assignments all with one easy operation.
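The slides say that data quality dimensions can be enabled or disabled and that they feed the DQ score of a column, but they do not give the formula. The Python sketch below uses one simple stand-in: the score is the share of values that violate none of the enabled dimensions. The dimension names `missing_value` and `too_long`, the length limit, and the function `dq_score` are all assumptions made for illustration, not the product's built-in dimensions or scoring method.

```python
from typing import Callable, Optional

# A check returns True when the value VIOLATES the dimension.
Check = Callable[[Optional[str]], bool]

# Hypothetical dimensions, not the product's built-in set.
DIMENSIONS: dict[str, Check] = {
    "missing_value": lambda v: v is None or v.strip() == "",
    "too_long": lambda v: v is not None and len(v) > 10,
}


def dq_score(values: list[Optional[str]], enabled: set[str]) -> float:
    """Share of values with no violation in any enabled dimension (assumed formula)."""
    if not values:
        return 1.0
    clean = 0
    for value in values:
        if not any(DIMENSIONS[name](value) for name in enabled):
            clean += 1
    return clean / len(values)


# Illustrative data only.
postal_codes = ["60601", "", None, "60614-123456", "60616"]

all_dims = dq_score(postal_codes, {"missing_value", "too_long"})
only_missing = dq_score(postal_codes, {"missing_value"})
print(all_dims, only_missing)
```

Disabling a dimension raises the score here because values that only violate that dimension are no longer counted against the column.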
Automation Rules – Designed for the business user Innovation
• Automatic Actions/Rules and DQ threshold based on Term assignments
• Enable/Disable all or individual built-in data quality dimensions
• Auto-bind one or more Data Rule Definitions
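The behavior listed above (actions triggered by a business term assignment) can be sketched in Python as follows. This is a simplified model under stated assumptions, not the product's data model or API: the classes `AutomationRule` and `Column`, the function `apply_automation_rules`, and the term, rule definition, and dimension names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AutomationRule:
    term: str                              # business term that triggers the rule
    dq_threshold: Optional[float] = None   # threshold to set on matching columns
    bind_rule_definitions: list[str] = field(default_factory=list)
    disable_dimensions: list[str] = field(default_factory=list)


@dataclass
class Column:
    name: str
    terms: set[str]                        # business terms assigned to the column
    dq_threshold: Optional[float] = None
    bound_rules: list[str] = field(default_factory=list)
    disabled_dimensions: set[str] = field(default_factory=set)


def apply_automation_rules(column: Column, rules: list[AutomationRule]) -> Column:
    """Apply every rule whose trigger term is assigned to the column."""
    for rule in rules:
        if rule.term not in column.terms:
            continue
        if rule.dq_threshold is not None:
            column.dq_threshold = rule.dq_threshold
        for definition in rule.bind_rule_definitions:
            if definition not in column.bound_rules:
                column.bound_rules.append(definition)
        column.disabled_dimensions.update(rule.disable_dimensions)
    return column


# Illustrative rules and column only.
rules = [
    AutomationRule(
        term="Email Address",
        dq_threshold=0.95,
        bind_rule_definitions=["email_format_is_valid"],
    ),
    AutomationRule(term="Free Text", disable_dimensions=["too_long"]),
]

email = apply_automation_rules(Column("CONTACT_EMAIL", {"Email Address"}), rules)
print(email.dq_threshold, email.bound_rules, email.disabled_dimensions)
```

Keying the rule on the term rather than on the column is what lets one rule cover every column that later receives the same term assignment.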
1. 90% of IA (including Quick scan and Auto Discovery) is included in WKC, with a common UX - Demo
2. Create/edit/delete virtual columns (both)
3. Limit the number of Data Rule output exceptions (both)
4. Validity Benchmark is back in Data Rules (both)
5. ‘Manage’ Flag in Data Rules (IIS only today)
6. Remember many user choices/preferences (both)
What Can We Expect in the Next Release?
Planned for the mid-June 2020 release (subject to change): WKC 3.0 and 11.7.1 FP1
1. New, much more intuitive Data Quality menu structure (both)
2. Negative term classification (both)
3. WKC experience for Data Rule exceptions (DQEC replacement) (WKC)
4. Data Rule binding drag and drop (both)
5. Visualization of Data Quality scores over time (both)
6. Ongoing DQ architecture modernization (WKC)
7. New ‘Column Similarity’ (aka Fingerprint) data class (WKC)
8. Many minor UX improvements (retain user preferences, etc.) (both)
9. Relationship Analysis more intuitive (both)
10. Globalization (translation of our UIs into several languages) (WKC)
11. ML-based Data Rule Definition generation (WKC)
12. Suggested Automation Rule (available today in 11.7.1 SP2, planned for WKC)