Applied Data Quality Ezra Kenigsberg Data Architect salesforce.com Admin III: Expanding into new areas
Nov 22, 2014
Applied Data Quality
Ezra Kenigsberg
Data Architect
salesforce.com
Admin III: Expanding into new areas
Safe Harbor Statement
“Safe harbor” statement under the Private Securities Litigation Reform Act of 1995: This presentation may contain forward-looking statements including but not limited to statements concerning the potential market for our existing service offerings and future offerings. All of our forward looking statements involve risks, uncertainties and assumptions. If any such risks or uncertainties materialize or if any of the assumptions proves incorrect, our results could differ materially from the results expressed or implied by the forward-looking statements we make.
The risks and uncertainties referred to above include - but are not limited to - risks associated with possible fluctuations in our operating results and cash flows, rate of growth and anticipated revenue run rate, errors, interruptions or delays in our service or our Web hosting, our new business model, our history of operating losses, the possibility that we will not remain profitable, breach of our security measures, the emerging market in which we operate, our relatively limited operating history, our ability to hire, retain and motivate our employees and manage our growth, competition, our ability to continue to release and gain customer acceptance of new and improved versions of our service, customer and partner acceptance of the AppExchange, successful customer deployment and utilization of our services, unanticipated changes in our effective tax rate, fluctuations in the number of shares outstanding, the price of such shares, foreign currency exchange rates and interest rates.
Further information on these and other factors that could affect our financial results is included in the reports on Forms 10-K, 10-Q and 8-K and in other filings we make with the Securities and Exchange Commission from time to time. These documents are available on the SEC Filings section of the Investor Information section of our website at www.salesforce.com/investor. Salesforce.com, inc. assumes no obligation and does not intend to update these forward-looking statements, except as required by law.
Purpose
To give you data techniques that can help make you a better administrator
This session… Other sessions…
Practical demos—things you can do TODAY
Tactical ideas and technical detail
Data Loader, Excel,Excel Connector, Access
Bigger-picture data strategy
Third-party tools
Outside this session:Please use these resources
Successforce.com
Salesforce.com’s Data Services group Brian Wiebe, Program Director Regional Data Architects
Product demos downstairs Developer Network: salesforce.com/developer
This Year’s Sessions: Data Data Everywhere! Data Quality: Who Doesn’t Want It?
Last Year’s Sessions: xrl.us/DataDataData xrl.us/ExcelConnectorDemo xrl.us/SystemOverload
Overview
1. De-duplicating existing data 20 min
2. Managing data loads better 20 min
3. Keeping data clean 5 min
4. Q&A 10 min
WHAT’S HOW do I
my data? match?
De-duplicatinga dupe?
WHAT’S HOW do I
my data? match?
I’ve gotta de-dupe the worst records by tomorrow Four questions:
1. What qualifies as a duplicate?
2. What’s my data look like?
3. How will I determine matches?
4. How will I merge records?
a dupe? merge?
merge?
Typical questions: Should we track individual physical customer locations? Should we merge all Accounts in a single country/region? Is it okay to merge clearly different records? Do we need different definitions of “dupe” for different business
units/regions?
. . . answers to these questions determineHierarchy Depth
Records per Customer
WHAT’S
a dupe?
A dupe’s definition determines the dataset!
De-duplicating1: What’s a dupe?
De-duplicating1: What’s a dupe? (cont’d)
PROBLEM: Too many records in org Successive de-dupe runs
were not eliminating as many records as expected
SOLUTION: Make records more
generalized by1. merging obvious
duplicates
AND2. merging different physical-
location records within US into unified records
WHAT’S
a dupe?
Fortune 500 company8,000+ EmployeesOperates in 30+ countries
Case Study:
Some hierarchy possibilities:1. One-and-only-one Account record2. Single parent Account and many children
(no grandparents or grandchildren)3. Multi-tier
Multi-tier example: D&B implemented within Salesforce Global Ultimate one across the entire world
o Domestic Ultimate one per country• Branch multiple physical locations
– Subsidiary one per physical location
De-duplicating1: What’s a dupe? (cont’d)
DecreasingDecreasingEase-of-Ease-of-
MaintenanceMaintenance
WHAT’S
a dupe?
For this example: We’ll de-dupe to the physical-location level We’ll allow hierarchy to be multi-tier We’ll use Excel Connector
• Following steps apply regardless of de-dupe tool
De-duplicating1: What’s a dupe? (cont’d)
WHAT’S
a dupe?
And now. . . a tool
Excel Connector Work with Salesforce data directly in Excel:
Insert, Update, Extract, Delete Too slow for large data volumes (typ. more than hundreds) No longer being updated, not supported by salesforce.com
(alas) xrl.us/ExcelConnector
Dreamforce session “Excel Connector: Your Golden Ticket to Clean Data” xrl.us/ExcelConnectorDemo
Apex tools (including Excel Connector) xrl.us/ApexTools
WHAT’S
my data?
De-duplicating2: What’s my data?
First stop: Names Examine the toughest cases to get a feel for all data
How do apparent dupes’ names seem to match?• Perfectly?
• Systematically?
• Randomly?
Sort Filter
WHAT’S
my data?
Initial data set
Look for a tough case, then sort and filter to get…
WHAT’S
my data?
(1) Sorted and (2) filtered on name
…a set of records to work with
WHAT’S
my data?
De-duplicating3: How do I match?
Look to other fields after Name1. Account numbers/IDs2. Phone numbers
• Scrub to raw numbers
3. Addresses• Select address fields that look promising• Concatenate as needed; Name & ZIP, Street & ZIP
HOW do I
match?
Numbers/IDs
…IDs/Numbers don’t help here
HOW do I
match?
De-duplicating3: How do I match? (cont’d)
Look to other fields after Name1. Account numbers/IDs2. Phone numbers
• Scrub to raw numbers3. Addresses
• Select address fields that look promising• Concatenate as needed; Name & ZIP, Street & ZIP
HOW do I
match?
Scrubbing phone numbersHOW do I
match?
Scrubbing phone numbers (cont’d)Easier to line up matches
HOW do I
match?
De-duplicating3: How do I match?
Look to other fields after Name1. Account numbers/IDs2. Phone numbers
• Scrub to raw numbers
3. Addresses• Select address fields that look promising• Concatenate as needed; Street & ZIP, Name & ZIP
HOW do I
match?
Concatenating address parts
…consistent addresses …consistent namesBetter when we have…
HOW do I
match?
De-duplicating3: How do I match?
Use these techniques Scrubbing Native & Third-Party Concatenating Native & Third-Party Address-standardizing Third-Party Address lookup Third-Party Fuzzy matching Third-Party Automated scoring Third-Party
Widespread merging requires a third-party tool CRMFusion’s DemandTools Trillium’s Diamond Data Informatica’s Data Quality Microsoft’s SQL Server 2005, while not specifically adapted to Salesforce,
provides fuzzy matching Many other solutions downstairs!
HOW do I
match?
De-duplicating4: How do I merge?
Designate a Winner rec with all proper data Modify records:
Third-Party: create “ID” field and “Winner/Loser” field Salesforce: modify Name field
• Winner:“merge” & generic Account Name & “winner”
• Loser:“merge” & generic Account Name
HOW do I
merge?
Modifying names in SalesforceHOW do I
merge?
Merging in the Salesforce UIHOW do I
merge?
Merging in the Salesforce UI (cont’d)HOW do I
merge?
De-duplicatingKey lessons
1. Establish what qualifies as a duplicate before starting2. Use a tough case to get a feel for the data3a. Determine which other fields can be used
Account numbers/IDs Phone numbers Addresses
3b. Determine which methods can be used Scrubbing Native & Third-Party Concatenating Native & Third-Party Address-standardizing Third-Party Address lookup Third-Party Fuzzy matching Third-Party Automated scoring Third-Party
4. Designate a Winner record and merge Losers into it
a dupe?
WHAT’S HOW do I
my data? match? merge?
Managing data loads better
Gotta import new records by tomorrow We’re creating a repeatable, documented process
“Just load it”…• …is difficult to audit after the fact
• …may not be reversible if I’ve made a mistake
Best practices for loading
Managing data loads better (cont’d)
Three steps:1. How should I map my data?
2. How can I automate the generating of CSVs?
3. How can I load data in an auditable way?
HOW do I
generate? load?map?
HOW do I generate? load?map?
And now. . . a tool
Data Loader Insert, Update, UPSERT, Extract, Delete via CSVs Supported by salesforce.com xrl.us/DataLoader
Apex tools (including Data Loader) xrl.us/ApexTools
HOW do I map?
And now. . . a tool (cont’d)HOW do I
map?
Relational Database: Data stored in tables Capable of importing CSVs Capable of creating joins
between tables
And now. . . a tool (cont’d)HOW do I
map?
Query functionality: Queries/views stored separately
from tables Queries/views use SQL (Structured
Query Language) Bonus: toggling between SQL and
graphical interfaces
Managing data loads better1: How do I map?
Create a mapping file Create list of fields in legacy file/s Create list of API field names (not the UI field labels!)
• Get them with Data Loader
Tie Force.com API field names to legacy fields
HOW do I map?
Grab the source field names
1
2
Transposed using Excel’s“Edit | Paste Special | Transpose”command
HOW do I map?
Grab the Force.com API field names
1
2Transposed using Excel’s“Edit | Paste Special | Transpose”command
HOW do I map?
Build out mapping file
An index columnfacilitates latersorting Not every
column has to map
Comment column is useful for concatenates, lookups
HOW do I map?
Managing data loads better2: How do I generate CSVs?
Import legacy data into Access Reusable tool – Great for subsequent loads
• Test-then-production• Pilot-then-phase 1
Change all fields to Text Concatenate strings to create query
HOW do I generate?
Importing into Access
1
2
HOW do I generate?
Managing data loads better2: How do I generate CSVs?
Import legacy data into Access Reusable tool – Great for subsequent loads
• Test-then-production
• Pilot-then-phase 1
Change all fields to Text Concatenate strings to create query
HOW do I generate?
Access table design
After import, change field types to Text
HOW do I generate?
Managing data loads better2: How do I generate CSVs?
Import legacy data into Access. Why? Reusable tool – Great for subsequent loads
• Test-then-production
• Pilot-then-phase 1
Change all fields to Text Concatenate strings to create query
HOW do I generate?
Concatenate strings to create query
Concatenation formula produces clauses suitable for pasting in SQL SELECT query
HOW do I generate?
Customize SQL where needed
Customized, concatenated DESCRIPTION field (as specified in mapping document)
Lookups to other tables would also require customization
HOW do I generate?
Managing data loads better3: How do I load?
Save query as CSV Folder naming:
Use Date & Task format: [YYYY]-[MM]-[DD]#[NN] [Object] [Task] [Desc]
Save source, success, error files in appropriate folder
Note how Auto-Match works with CSV
HOW do I load?
Folder naming
YYYY-MM-DD#NNYYYY-MM-DD#NNforces foldersforces folders
to sort into sort inchronologicalchronological
orderorder
HOW do I load?
Managing data loads better3: How do I load?
Save query as CSV Folder naming:
Use Date & Task format: [YYYY]-[MM]-[DD]#[NN] [Object] [Task] [Desc]
Save source, success, error files in appropriate folder
Note how Auto-Match works with CSV
HOW do I load?
Folder contents
Save source file, success files, and error files all in same directory
HOW do I load?
1
2
Folder contents (cont’d)
When loading errors:• Copy error file and• Use as source for next load
HOW do I load?
Managing data loads3: How do I load?
Save query as CSV Folder naming:
Use Date & Task format: [YYYY]-[MM]-[DD]#[NN] [Object] [Task] [Desc]
Save source, success, error files in appropriate folder
Note how Auto-Match works with CSV
HOW do I load?
Perfect auto-match
With a file created this way, Auto-Match gets every field mapped
HOW do I load?
Managing data loadsKey lessons
1. Build a mapping file to document how data gets transformed
2. Leverage a tool (in this example, Access) to generate CSVs Concatenating strings can help save time mapping
3. Use a folder & file name discipline to make your work self-documenting and easy-to-follow Use Date & Task format:
[YYYY]-[MM]-[DD]#[NN] [Object] [Task] [Desc]
HOW do I generate? load?map?
Keeping data clean
Record DeDup 1.4xrl.us/RecordDeDupV14
Data Quality Dashboardsxrl.us/DataQualityDashboardsV1
Custom Reports(go crazy!)
Session FeedbackLet us know how we’re doing!
Please score the session from 5 to 1 (5=excellent,1=needs improvement) in the following categories:
Overall rating of the session Quality of content Strength of presentation delivery Relevance of the session to your organization
We strive to improve, thank you for filling out our survey.
Additionally, please score each individual speaker on: Overall delivery of session
Q&A
Post-Session Questions?Ezra KenigsbergData Architect (Midwest) [email protected]
Links: Successforce.com salesforce.com/developer Previous Dreamforce presentations:
xrl.us/DataDataData xrl.us/ExcelConnectorDemo xrl.us/SystemOverload
Data tools: xrl.us/ApexTools xrl.us/DataLoader xrl.us/ExcelConnector
Keeping data clean: xrl.us/RecordDeDupV14 xrl.us/DataQualityDashboardsV1
Extras:Ezra’s Toughest Data Projects
De-duplication Data normalization/denormalization
Mapping a table of addresses or phone numbers to specific fields according to rules
Data flow-through “If value is not in table A, use value in table B”
Forcing free text entries to behave like external keys Dealing with typos, whitespace
Attributing record ownership
Extras:Manage your time! Make shortcuts!
Get good with keyboard shortcuts We all know Alt+Tab… Alt+Spacebar and its subsequent accelerators Use accelerator keys…
• …in naming app & internet shortcuts• …in navigating menus
Learn individual apps’ keystrokes Shortcuts for apps and sites
Populate your Start Menu or Taskbar• Sort your Programs menu often
Ctrl+Alt+(letter) Use your Bookmarks/Favorites
Shortcuts for Salesforce tabs Shortcuts for Salesforce setup links Users want to go to a specific page?
• Help them put it on their Start Menus, Taskbars, or Desktops
Be curious!
…after
before…
Extras:Ezra’s favorite keystrokes
User-defined app shortcuts: Ctrl+Alt+(letter) Windows app menu: Alt+Spacebar Show desktop toggle: Windows+D Menu navigation: Alt, then letters Excel select row: Shift+Spacebar Excel select column: Ctrl+Spacebar Excel insert: Ctrl+Shift+Plus Excel delete: Ctrl+Minus Excel numformats: Ctrl+Shift+(number) Excel hide (unhide) row: Ctrl+(Shift)+9 Excel hide (unhide) column: Ctrl+(Shift)+0 Excel edit: F2
Extras:Other cool data tricks
Use FixID() to convert 15-char IDs to 18-char IDs Funnction available in Excel Connector
Use Pivot Tables to quickly list and count unique values Skip unneeded Wizard steps
Fill in blank cells in a table with a killer combo: Go To… Special | Blanks Ctrl+Enter