Top Banner
Applied Data Quality Ezra Kenigsberg Data Architect salesforce.com Admin III: Expanding into new areas
59

Aen007 Kenigsberg 091807

Nov 22, 2014

Download

Economy & Finance

Dreamforce07

Full session information and video available at Successforce.com.
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Aen007 Kenigsberg 091807

Applied Data Quality

Ezra Kenigsberg

Data Architect

salesforce.com

Admin III: Expanding into new areas

Page 2: Aen007 Kenigsberg 091807

Safe Harbor Statement

“Safe harbor” statement under the Private Securities Litigation Reform Act of 1995: This presentation may contain forward-looking statements including but not limited to statements concerning the potential market for our existing service offerings and future offerings. All of our forward looking statements involve risks, uncertainties and assumptions. If any such risks or uncertainties materialize or if any of the assumptions proves incorrect, our results could differ materially from the results expressed or implied by the forward-looking statements we make.

The risks and uncertainties referred to above include - but are not limited to - risks associated with possible fluctuations in our operating results and cash flows, rate of growth and anticipated revenue run rate, errors, interruptions or delays in our service or our Web hosting, our new business model, our history of operating losses, the possibility that we will not remain profitable, breach of our security measures, the emerging market in which we operate, our relatively limited operating history, our ability to hire, retain and motivate our employees and manage our growth, competition, our ability to continue to release and gain customer acceptance of new and improved versions of our service, customer and partner acceptance of the AppExchange, successful customer deployment and utilization of our services, unanticipated changes in our effective tax rate, fluctuations in the number of shares outstanding, the price of such shares, foreign currency exchange rates and interest rates.

Further information on these and other factors that could affect our financial results is included in the reports on Forms 10-K, 10-Q and 8-K and in other filings we make with the Securities and Exchange Commission from time to time. These documents are available on the SEC Filings section of the Investor Information section of our website at www.salesforce.com/investor. Salesforce.com, inc. assumes no obligation and does not intend to update these forward-looking statements, except as required by law.

Page 3: Aen007 Kenigsberg 091807

Purpose

To give you data techniques that can help make you a better administrator

Page 4: Aen007 Kenigsberg 091807

This session… Other sessions…

Practical demos—things you can do TODAY

Tactical ideas and technical detail

Data Loader, Excel,Excel Connector, Access

Bigger-picture data strategy

Third-party tools

Page 5: Aen007 Kenigsberg 091807

Outside this session:Please use these resources

Successforce.com

Salesforce.com’s Data Services group Brian Wiebe, Program Director Regional Data Architects

Product demos downstairs Developer Network: salesforce.com/developer

This Year’s Sessions: Data Data Everywhere! Data Quality: Who Doesn’t Want It?

Last Year’s Sessions: xrl.us/DataDataData xrl.us/ExcelConnectorDemo xrl.us/SystemOverload

Page 6: Aen007 Kenigsberg 091807

Overview

1. De-duplicating existing data 20 min

2. Managing data loads better 20 min

3. Keeping data clean 5 min

4. Q&A 10 min

Page 7: Aen007 Kenigsberg 091807

WHAT’S HOW do I

my data? match?

De-duplicatinga dupe?

WHAT’S HOW do I

my data? match?

I’ve gotta de-dupe the worst records by tomorrow Four questions:

1. What qualifies as a duplicate?

2. What’s my data look like?

3. How will I determine matches?

4. How will I merge records?

a dupe? merge?

merge?

Page 8: Aen007 Kenigsberg 091807

Typical questions: Should we track individual physical customer locations? Should we merge all Accounts in a single country/region? Is it okay to merge clearly different records? Do we need different definitions of “dupe” for different business

units/regions?

. . . answers to these questions determineHierarchy Depth

Records per Customer

WHAT’S

a dupe?

A dupe’s definition determines the dataset!

De-duplicating1: What’s a dupe?

Page 9: Aen007 Kenigsberg 091807

De-duplicating1: What’s a dupe? (cont’d)

PROBLEM: Too many records in org Successive de-dupe runs

were not eliminating as many records as expected

SOLUTION: Make records more

generalized by1. merging obvious

duplicates

AND2. merging different physical-

location records within US into unified records

WHAT’S

a dupe?

Fortune 500 company8,000+ EmployeesOperates in 30+ countries

Case Study:

Page 10: Aen007 Kenigsberg 091807

Some hierarchy possibilities:1. One-and-only-one Account record2. Single parent Account and many children

(no grandparents or grandchildren)3. Multi-tier

Multi-tier example: D&B implemented within Salesforce Global Ultimate one across the entire world

o Domestic Ultimate one per country• Branch multiple physical locations

– Subsidiary one per physical location

De-duplicating1: What’s a dupe? (cont’d)

DecreasingDecreasingEase-of-Ease-of-

MaintenanceMaintenance

WHAT’S

a dupe?

Page 11: Aen007 Kenigsberg 091807

For this example: We’ll de-dupe to the physical-location level We’ll allow hierarchy to be multi-tier We’ll use Excel Connector

• Following steps apply regardless of de-dupe tool

De-duplicating1: What’s a dupe? (cont’d)

WHAT’S

a dupe?

Page 12: Aen007 Kenigsberg 091807

And now. . . a tool

Excel Connector Work with Salesforce data directly in Excel:

Insert, Update, Extract, Delete Too slow for large data volumes (typ. more than hundreds) No longer being updated, not supported by salesforce.com

(alas) xrl.us/ExcelConnector

Dreamforce session “Excel Connector: Your Golden Ticket to Clean Data” xrl.us/ExcelConnectorDemo

Apex tools (including Excel Connector) xrl.us/ApexTools

WHAT’S

my data?

Page 13: Aen007 Kenigsberg 091807

De-duplicating2: What’s my data?

First stop: Names Examine the toughest cases to get a feel for all data

How do apparent dupes’ names seem to match?• Perfectly?

• Systematically?

• Randomly?

Sort Filter

WHAT’S

my data?

Page 14: Aen007 Kenigsberg 091807

Initial data set

Look for a tough case, then sort and filter to get…

WHAT’S

my data?

Page 15: Aen007 Kenigsberg 091807

(1) Sorted and (2) filtered on name

…a set of records to work with

WHAT’S

my data?

Page 16: Aen007 Kenigsberg 091807

De-duplicating3: How do I match?

Look to other fields after Name1. Account numbers/IDs2. Phone numbers

• Scrub to raw numbers

3. Addresses• Select address fields that look promising• Concatenate as needed; Name & ZIP, Street & ZIP

HOW do I

match?

Page 17: Aen007 Kenigsberg 091807

Numbers/IDs

…IDs/Numbers don’t help here

HOW do I

match?

Page 18: Aen007 Kenigsberg 091807

De-duplicating3: How do I match? (cont’d)

Look to other fields after Name1. Account numbers/IDs2. Phone numbers

• Scrub to raw numbers3. Addresses

• Select address fields that look promising• Concatenate as needed; Name & ZIP, Street & ZIP

HOW do I

match?

Page 19: Aen007 Kenigsberg 091807

Scrubbing phone numbersHOW do I

match?

Page 20: Aen007 Kenigsberg 091807

Scrubbing phone numbers (cont’d)Easier to line up matches

HOW do I

match?

Page 21: Aen007 Kenigsberg 091807

De-duplicating3: How do I match?

Look to other fields after Name1. Account numbers/IDs2. Phone numbers

• Scrub to raw numbers

3. Addresses• Select address fields that look promising• Concatenate as needed; Street & ZIP, Name & ZIP

HOW do I

match?

Page 22: Aen007 Kenigsberg 091807

Concatenating address parts

…consistent addresses …consistent namesBetter when we have…

HOW do I

match?

Page 23: Aen007 Kenigsberg 091807

De-duplicating3: How do I match?

Use these techniques Scrubbing Native & Third-Party Concatenating Native & Third-Party Address-standardizing Third-Party Address lookup Third-Party Fuzzy matching Third-Party Automated scoring Third-Party

Widespread merging requires a third-party tool CRMFusion’s DemandTools Trillium’s Diamond Data Informatica’s Data Quality Microsoft’s SQL Server 2005, while not specifically adapted to Salesforce,

provides fuzzy matching Many other solutions downstairs!

HOW do I

match?

Page 24: Aen007 Kenigsberg 091807

De-duplicating4: How do I merge?

Designate a Winner rec with all proper data Modify records:

Third-Party: create “ID” field and “Winner/Loser” field Salesforce: modify Name field

• Winner:“merge” & generic Account Name & “winner”

• Loser:“merge” & generic Account Name

HOW do I

merge?

Page 25: Aen007 Kenigsberg 091807

Modifying names in SalesforceHOW do I

merge?

Page 26: Aen007 Kenigsberg 091807

Merging in the Salesforce UIHOW do I

merge?

Page 27: Aen007 Kenigsberg 091807

Merging in the Salesforce UI (cont’d)HOW do I

merge?

Page 28: Aen007 Kenigsberg 091807

De-duplicatingKey lessons

1. Establish what qualifies as a duplicate before starting2. Use a tough case to get a feel for the data3a. Determine which other fields can be used

Account numbers/IDs Phone numbers Addresses

3b. Determine which methods can be used Scrubbing Native & Third-Party Concatenating Native & Third-Party Address-standardizing Third-Party Address lookup Third-Party Fuzzy matching Third-Party Automated scoring Third-Party

4. Designate a Winner record and merge Losers into it

a dupe?

WHAT’S HOW do I

my data? match? merge?

Page 29: Aen007 Kenigsberg 091807

Managing data loads better

Gotta import new records by tomorrow We’re creating a repeatable, documented process

“Just load it”…• …is difficult to audit after the fact

• …may not be reversible if I’ve made a mistake

Best practices for loading

Page 30: Aen007 Kenigsberg 091807

Managing data loads better (cont’d)

Three steps:1. How should I map my data?

2. How can I automate the generating of CSVs?

3. How can I load data in an auditable way?

HOW do I

generate? load?map?

HOW do I generate? load?map?

Page 31: Aen007 Kenigsberg 091807

And now. . . a tool

Data Loader Insert, Update, UPSERT, Extract, Delete via CSVs Supported by salesforce.com xrl.us/DataLoader

Apex tools (including Data Loader) xrl.us/ApexTools

HOW do I map?

Page 32: Aen007 Kenigsberg 091807

And now. . . a tool (cont’d)HOW do I

map?

Relational Database: Data stored in tables Capable of importing CSVs Capable of creating joins

between tables

Page 33: Aen007 Kenigsberg 091807

And now. . . a tool (cont’d)HOW do I

map?

Query functionality: Queries/views stored separately

from tables Queries/views use SQL (Structured

Query Language) Bonus: toggling between SQL and

graphical interfaces

Page 34: Aen007 Kenigsberg 091807

Managing data loads better1: How do I map?

Create a mapping file Create list of fields in legacy file/s Create list of API field names (not the UI field labels!)

• Get them with Data Loader

Tie Force.com API field names to legacy fields

HOW do I map?

Page 35: Aen007 Kenigsberg 091807

Grab the source field names

1

2

Transposed using Excel’s“Edit | Paste Special | Transpose”command

HOW do I map?

Page 36: Aen007 Kenigsberg 091807

Grab the Force.com API field names

1

2Transposed using Excel’s“Edit | Paste Special | Transpose”command

HOW do I map?

Page 37: Aen007 Kenigsberg 091807

Build out mapping file

An index columnfacilitates latersorting Not every

column has to map

Comment column is useful for concatenates, lookups

HOW do I map?

Page 38: Aen007 Kenigsberg 091807

Managing data loads better2: How do I generate CSVs?

Import legacy data into Access Reusable tool – Great for subsequent loads

• Test-then-production• Pilot-then-phase 1

Change all fields to Text Concatenate strings to create query

HOW do I generate?

Page 39: Aen007 Kenigsberg 091807

Importing into Access

1

2

HOW do I generate?

Page 40: Aen007 Kenigsberg 091807

Managing data loads better2: How do I generate CSVs?

Import legacy data into Access Reusable tool – Great for subsequent loads

• Test-then-production

• Pilot-then-phase 1

Change all fields to Text Concatenate strings to create query

HOW do I generate?

Page 41: Aen007 Kenigsberg 091807

Access table design

After import, change field types to Text

HOW do I generate?

Page 42: Aen007 Kenigsberg 091807

Managing data loads better2: How do I generate CSVs?

Import legacy data into Access. Why? Reusable tool – Great for subsequent loads

• Test-then-production

• Pilot-then-phase 1

Change all fields to Text Concatenate strings to create query

HOW do I generate?

Page 43: Aen007 Kenigsberg 091807

Concatenate strings to create query

Concatenation formula produces clauses suitable for pasting in SQL SELECT query

HOW do I generate?

Page 44: Aen007 Kenigsberg 091807

Customize SQL where needed

Customized, concatenated DESCRIPTION field (as specified in mapping document)

Lookups to other tables would also require customization

HOW do I generate?

Page 45: Aen007 Kenigsberg 091807

Managing data loads better3: How do I load?

Save query as CSV Folder naming:

Use Date & Task format: [YYYY]-[MM]-[DD]#[NN] [Object] [Task] [Desc]

Save source, success, error files in appropriate folder

Note how Auto-Match works with CSV

HOW do I load?

Page 46: Aen007 Kenigsberg 091807

Folder naming

YYYY-MM-DD#NNYYYY-MM-DD#NNforces foldersforces folders

to sort into sort inchronologicalchronological

orderorder

HOW do I load?

Page 47: Aen007 Kenigsberg 091807

Managing data loads better3: How do I load?

Save query as CSV Folder naming:

Use Date & Task format: [YYYY]-[MM]-[DD]#[NN] [Object] [Task] [Desc]

Save source, success, error files in appropriate folder

Note how Auto-Match works with CSV

HOW do I load?

Page 48: Aen007 Kenigsberg 091807

Folder contents

Save source file, success files, and error files all in same directory

HOW do I load?

Page 49: Aen007 Kenigsberg 091807

1

2

Folder contents (cont’d)

When loading errors:• Copy error file and• Use as source for next load

HOW do I load?

Page 50: Aen007 Kenigsberg 091807

Managing data loads3: How do I load?

Save query as CSV Folder naming:

Use Date & Task format: [YYYY]-[MM]-[DD]#[NN] [Object] [Task] [Desc]

Save source, success, error files in appropriate folder

Note how Auto-Match works with CSV

HOW do I load?

Page 51: Aen007 Kenigsberg 091807

Perfect auto-match

With a file created this way, Auto-Match gets every field mapped

HOW do I load?

Page 52: Aen007 Kenigsberg 091807

Managing data loadsKey lessons

1. Build a mapping file to document how data gets transformed

2. Leverage a tool (in this example, Access) to generate CSVs Concatenating strings can help save time mapping

3. Use a folder & file name discipline to make your work self-documenting and easy-to-follow Use Date & Task format:

[YYYY]-[MM]-[DD]#[NN] [Object] [Task] [Desc]

HOW do I generate? load?map?

Page 53: Aen007 Kenigsberg 091807

Keeping data clean

Record DeDup 1.4xrl.us/RecordDeDupV14

Data Quality Dashboardsxrl.us/DataQualityDashboardsV1

Custom Reports(go crazy!)

Page 54: Aen007 Kenigsberg 091807

Session FeedbackLet us know how we’re doing!

Please score the session from 5 to 1 (5=excellent,1=needs improvement) in the following categories:

Overall rating of the session Quality of content Strength of presentation delivery Relevance of the session to your organization

We strive to improve, thank you for filling out our survey.

Additionally, please score each individual speaker on: Overall delivery of session

Page 55: Aen007 Kenigsberg 091807

Q&A

Post-Session Questions?Ezra KenigsbergData Architect (Midwest) [email protected]

Links: Successforce.com salesforce.com/developer Previous Dreamforce presentations:

xrl.us/DataDataData xrl.us/ExcelConnectorDemo xrl.us/SystemOverload

Data tools: xrl.us/ApexTools xrl.us/DataLoader xrl.us/ExcelConnector

Keeping data clean: xrl.us/RecordDeDupV14 xrl.us/DataQualityDashboardsV1

Page 56: Aen007 Kenigsberg 091807

Extras:Ezra’s Toughest Data Projects

De-duplication Data normalization/denormalization

Mapping a table of addresses or phone numbers to specific fields according to rules

Data flow-through “If value is not in table A, use value in table B”

Forcing free text entries to behave like external keys Dealing with typos, whitespace

Attributing record ownership

Page 57: Aen007 Kenigsberg 091807

Extras:Manage your time! Make shortcuts!

Get good with keyboard shortcuts We all know Alt+Tab… Alt+Spacebar and its subsequent accelerators Use accelerator keys…

• …in naming app & internet shortcuts• …in navigating menus

Learn individual apps’ keystrokes Shortcuts for apps and sites

Populate your Start Menu or Taskbar• Sort your Programs menu often

Ctrl+Alt+(letter) Use your Bookmarks/Favorites

Shortcuts for Salesforce tabs Shortcuts for Salesforce setup links Users want to go to a specific page?

• Help them put it on their Start Menus, Taskbars, or Desktops

Be curious!

…after

before…

Page 58: Aen007 Kenigsberg 091807

Extras:Ezra’s favorite keystrokes

User-defined app shortcuts: Ctrl+Alt+(letter) Windows app menu: Alt+Spacebar Show desktop toggle: Windows+D Menu navigation: Alt, then letters Excel select row: Shift+Spacebar Excel select column: Ctrl+Spacebar Excel insert: Ctrl+Shift+Plus Excel delete: Ctrl+Minus Excel numformats: Ctrl+Shift+(number) Excel hide (unhide) row: Ctrl+(Shift)+9 Excel hide (unhide) column: Ctrl+(Shift)+0 Excel edit: F2

Page 59: Aen007 Kenigsberg 091807

Extras:Other cool data tricks

Use FixID() to convert 15-char IDs to 18-char IDs Funnction available in Excel Connector

Use Pivot Tables to quickly list and count unique values Skip unneeded Wizard steps

Fill in blank cells in a table with a killer combo: Go To… Special | Blanks Ctrl+Enter