Data Quality Services in SQL Server 2012 Data Tools
A bit about me
Hi, I’m Grant Holly
I’m from Portland, OR
I love the outdoors, good beers, mathematics, metal working, and I’m a huge SQL geek
Pictures! + Relevant XKCD
I SQL
What I do
Technical training and consulting in: DB administration DB development DB performance tuning ETL / SSIS SSAS cubes and tabular models SSRS reports
What are we doing today?
What is this DQS thing?
Why not just keep doing what we’re doing?
Ok, so why use it?
You mean other people can help me enforce data quality?
I’m sold! Let’s set this bad boy up!
Wait, I have a question.
What is “Data Quality?”
Businesses need reliable data
Reliability means both accuracy and availability
Quality issues: invalid data, inconsistent format, duplication, etc.
Especially important as the number and diversity of data sources increase
What is Data Quality Services?
New to SQL Server 2012 Data Tools (BIDS)
Designed to ease data stewardship
Designed to be end-user-able
Spread the data validation work (love?) out
Integrate into ETL process
How does it work?
“Domain based” focus
Uses knowledge bases
Removes erroneous or improperly formatted data
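To make the "domain based" idea concrete, here is a minimal sketch of how a domain might check one column's values. It is not the DQS API: the `Domain` class, its fields, and the status labels are hypothetical names chosen for illustration, only loosely modeled on the way DQS reports cleansing results.

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    """Hypothetical stand-in for one knowledge base domain (not the DQS API)."""
    name: str
    valid_values: set
    # Known bad spellings (lowercase) mapped to the value they should become.
    corrections: dict = field(default_factory=dict)

    def cleanse(self, raw):
        """Return (value, status); status is 'correct', 'corrected', or 'invalid'."""
        key = raw.strip().lower()
        canonical = {v.lower(): v for v in self.valid_values}
        if key in canonical:
            fixed = canonical[key]
            # Only formatting (case, whitespace) differed from a valid value.
            return fixed, ("correct" if fixed == raw else "corrected")
        if key in self.corrections:
            return self.corrections[key], "corrected"
        # Unknown value: leave it untouched and flag it.
        return raw, "invalid"


# Example domain: US state codes, with one known misspelling.
state = Domain("State", {"OR", "WA"}, {"oregon": "OR"})
```

With this sketch, `state.cleanse(" or ")` and `state.cleanse("Oregon")` both come back as `("OR", "corrected")`, while an unknown value such as `"Orgon"` is returned unchanged and flagged `"invalid"`.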
How does it work?
Choose from reference knowledge bases
Explore a sample of data
Data stewards make the call on “liners”
A picture!
How does it work?
Can find and fix duplicated data
Duplicates evaluated based on policies and thresholds
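As a rough illustration of "policies and thresholds", the sketch below scores pairs of records with a weighted string similarity and reports the pairs that clear a cutoff. It swaps in Python's standard `difflib.SequenceMatcher` for the similarity measure; the field weights, the `0.85` threshold, and the function names are assumptions for the example, not DQS settings or its actual matching algorithm.

```python
from difflib import SequenceMatcher

# Hypothetical matching policy: field name -> weight (weights sum to 1.0).
POLICY = {"name": 0.6, "city": 0.4}
# Hypothetical threshold: minimum weighted score to treat two records as duplicates.
THRESHOLD = 0.85


def similarity(a, b):
    """Case- and whitespace-insensitive similarity between two strings, 0.0 to 1.0."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def match_score(rec_a, rec_b, policy=POLICY):
    """Weighted sum of per-field similarities, as defined by the policy."""
    return sum(weight * similarity(rec_a[f], rec_b[f]) for f, weight in policy.items())


def find_duplicates(records, policy=POLICY, threshold=THRESHOLD):
    """Return index pairs (i, j), i < j, whose score meets the threshold."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if match_score(records[i], records[j], policy) >= threshold:
                pairs.append((i, j))
    return pairs


records = [
    {"name": "Acme Corp", "city": "Portland"},
    {"name": "ACME Corp.", "city": "portland"},
    {"name": "Globex", "city": "Seattle"},
]
```

Here `find_duplicates(records)` pairs up the two Acme rows and leaves Globex alone. Raising the threshold makes matching stricter; shifting weight between fields changes which differences matter most.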
Another picture!
What did we do / are we doing in place of DQS?
Lookups
PL/SQL / T-SQL
Scripting
How can DQS help?
Dedicated tool for validation
Interface aimed at end-users
Offload work from ETL stream
Cleansing data
Validating data
Knowledge bases start out pretty good!
Knowledge bases grow over time with user input
Less complicated than scripting
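The "grow over time with user input" point can be sketched as a feedback loop: a value the knowledge base does not recognize is flagged for review, a steward approves a correction, and every later run fixes that value automatically. The `KnowledgeBase` class and its method names below are hypothetical and exist only to show the loop; they are not part of DQS.

```python
class KnowledgeBase:
    """Hypothetical in-memory knowledge base for a single domain (not the DQS API)."""

    def __init__(self, valid_values):
        self.valid_values = set(valid_values)
        self.corrections = {}  # steward-approved: bad value -> good value

    def check(self, value):
        """Return (value, status) using only what the knowledge base knows so far."""
        if value in self.valid_values:
            return value, "correct"
        if value in self.corrections:
            return self.corrections[value], "corrected"
        return value, "needs review"

    def approve(self, bad_value, good_value):
        """Record a steward's decision so future runs fix bad_value automatically."""
        if good_value not in self.valid_values:
            raise ValueError(f"{good_value!r} is not a valid value for this domain")
        self.corrections[bad_value] = good_value


kb = KnowledgeBase({"OR", "WA"})
first_pass = kb.check("Oregon")   # not known yet, so it goes to a steward
kb.approve("Oregon", "OR")        # the steward makes the call once
second_pass = kb.check("Oregon")  # later runs apply the correction
```

The steward's one-time decision replaces what would otherwise be another special case in a cleanup script.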
De-duplicating data
Define policies
Define thresholds
Much less overhead than using lookups!
Reference Data Services
Can integrate trusted third-party data
Data used as a reference to check against
Can be built into knowledge bases
Works with Azure Marketplace (if you’re into that kind of thing)
Automation through SSIS
DQS Cleansing transform in SSIS
Can be used in-line with other ETL packages
Implement DQS on data sources or destinations
Automate DQS with SQL Server Agent
What’s the catch?
Interface feels a little “1.0”
Requires a DQS database to be set up
Admins still have to define knowledge bases and policies / thresholds for de-duplication
Requirements
DQS client install
DQS database connected to clients
User training
Stakeholder collaboration
Recap
DQS gives stewards tools to put quality-validated data in line with your ETL process
Specialized tool to offload validation and de-duplication
Can cleanse and de-duplicate data
Can be in-lined with the DQS Cleansing transform in SSIS
Can be automated through SQL Server Agent
OK, real questions