Comparing Files without Proc Compare
Pharmasug 2008
Alejandro Jaramillo
Russ Lavery
• Purpose
  • To present an efficient methodology for comparing and validating files that are expected to have the same data structure and contents.
• Business Case
  • When data is migrated to a new system, business rules may be inadvertently changed. An adequate method and process for independently and efficiently flagging unexpected changes in the data is key to the project's success.
• Parallel File Comparison
  • Parallel file comparison is the process in which an independent team recreates data files from the raw data sources and compares them to the files the development team produced to feed the enterprise or production system. The goal is to detect differences caused by differing interpretation or application of business rules, or by human error.
Scenarios for Parallel Data Comparisons
• Forward: Data is compared and validated independently at every stage before it goes into production to feed enterprise applications.
• Backward: Data is fed to the enterprise application and the results are regenerated independently from raw data. Even if the enterprise results match, the granular data feeding the enterprise application is still validated.
[Diagram: four data levels (Raw Data, Granular, Pre-summarization, Enterprise Results), each with a Validation step; arrows labeled Forward and Backward.]
The Parallel File Comparison Method
• The method discussed in this presentation uses SAS for comparison and validation. However, the method can be applied with any other system.
• SAS Proc Compare provides an excellent way to compare files that are expected to have no differences. However, when Proc Compare shows differences, a more detailed methodology is needed to trace their source.
• The method for comparing two files that are supposed to have the same data and file structure, but show differences via Proc Compare, has the following steps:
  1. Produce the files to be compared against development or production data.
  2. Start comparing pairs of similar files using Proc Compare. If the comparison fails, go to step 3.
  3. Compare file structure via Proc Contents. If this fails, stop and get the files to conform to the same structure.
  4. Define keys and data.
  5. On both files, run summaries on major keys (time, period, product code, market code, etc.).
  6. Compare both raw files at the record level with regard to keys and data.
  If step 5 or 6 fails, an inquiry into the file differences, using the raw data, must follow.
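The summary step (step 5) can be illustrated outside SAS. The following is a minimal Python sketch, not the authors' program: it totals the data columns for each value of one major key in both files and reports the key values whose totals disagree. The function names, the column names `prod` and `q1`, and the sample rows are all hypothetical.

```python
from collections import defaultdict

def summarize(rows, key, measures):
    """Total the measure columns for each value of one major key."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += sum(row[m] for m in measures)
    return dict(totals)

def summary_diffs(left_rows, right_rows, key, measures):
    """Return {key value: (left total, right total)} where the totals disagree.

    A total of None means the key value is absent from that file.
    """
    left = summarize(left_rows, key, measures)
    right = summarize(right_rows, key, measures)
    return {
        k: (left.get(k), right.get(k))
        for k in sorted(set(left) | set(right))
        if left.get(k) != right.get(k)
    }

# Hypothetical rows: a product code plus one quarterly count.
left_rows = [
    {"prod": "0P1", "q1": 12},
    {"prod": "0P1", "q1": 10},
    {"prod": "0P2", "q1": 10},
]
right_rows = [
    {"prod": "0P1", "q1": 12},
    {"prod": "0P1", "q1": 10},
    {"prod": "1P1", "q1": 19},
]

diffs = summary_diffs(left_rows, right_rows, "prod", ["q1"])
print(diffs)
```

In this sketch product 0P1 agrees in both files, so only 0P2 and 1P1 are reported; those are the key values worth tracing at the record level.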
Early Diagnosis
If Proc Compare shows differences, a more detailed analysis is required. Start with Proc Contents.
After confirming that the files have the same structure and the same number of observations, a more detailed check on the raw data must be conducted.
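For readers working outside SAS, the structure check can be approximated as follows. This is a rough Python sketch of the idea behind comparing Proc Contents output, not a replacement for it: it compares column names, value types, and row counts of two files held as lists of dictionaries. All names and sample rows are hypothetical.

```python
def describe(rows):
    """Return (row count, {column name: type name}) for a list of dict rows."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            # Record the type of the first value seen for each column.
            columns.setdefault(name, type(value).__name__)
    return len(rows), columns

def structure_report(left_rows, right_rows):
    """List structural differences between two files."""
    n_left, cols_left = describe(left_rows)
    n_right, cols_right = describe(right_rows)
    shared = set(cols_left) & set(cols_right)
    return {
        "only_left": sorted(set(cols_left) - set(cols_right)),
        "only_right": sorted(set(cols_right) - set(cols_left)),
        "type_mismatch": sorted(c for c in shared if cols_left[c] != cols_right[c]),
        "row_counts": (n_left, n_right),
    }

# Hypothetical files: q1 is numeric on the left but character on the right.
left_rows = [{"store": "AAA", "prod": "0P1", "q1": 12}]
right_rows = [
    {"store": "AAA", "prod": "0P1", "q1": "12"},
    {"store": "BBB", "prod": "0P1", "q1": "10"},
]

report = structure_report(left_rows, right_rows)
print(report)
```

Any non-empty list or unequal row count in the report is a reason to stop and reconcile the structure before comparing values.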
Checking Keys and Data gives exact answers
This method checks EVERY value. Match lines on Key and use an array and a loop to compare data values.

Left File
Store  Reg  Prod  LQ1  LQ2  LQ3
AAA    A    0P1    12   10    8
BBB    A    0P1    10   11    7
FFF    A    0P1    17   11    8
CCC    A    0P1    12   10    8
DDD    B    0P2    10   15    2
EEE    B    0P3    10   15    2

Right File
Store  Reg  Prod  LQ1  LQ2  LQ3
AAA    A    0P1    12   10    8
BBB    A    0P1    10   11    7
FFF    A    0P1    19   10    6
EEE    B    0P3    10   15    4
NNN    c    1P1    19   15   11
CCZ    A    0P1    12   10    8

Merged by key (. = missing)
                  ---Left File---  ---Right File---
Store  Reg  Prod  LQ1  LQ2  LQ3    RQ1  RQ2  RQ3    Logic
AAA    A    0P1    12   10    8     12   10    8    Match
BBB    A    0P1    10   11    7     10   11    7    Match
FFF    A    0P1    17   11    8     19   10    6    Data
CCC    A    0P1    12   10    8      .    .    .    Key
DDD    B    0P2    10   15    2      .    .    .    ODD
EEE    B    0P3    10   15    2     10   15    4    Data
NNN    c    1P1     .    .    .     19   15   11    ODD
CCZ    A    0P1     .    .    .     12   10    8    Key

Logic value "Key" marks a Key Error.
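The record-level comparison on this slide (match lines on key, then loop over the data values) can be sketched in Python as below. This is an illustrative stand-in for the SAS merge and array loop the slide describes, not the authors' code. The rows are the Left File and Right File from the slide; the column names `q1` to `q3`, the helper names, and the status labels are assumptions.

```python
KEYS = ("store", "reg", "prod")   # key columns, as on the slide
DATA = ("q1", "q2", "q3")         # data columns (hypothetical names)

def parse(text):
    """Turn whitespace-separated lines into dict rows."""
    rows = []
    for line in text.strip().splitlines():
        store, reg, prod, *values = line.split()
        row = {"store": store, "reg": reg, "prod": prod}
        row.update(zip(DATA, map(int, values)))
        rows.append(row)
    return rows

def compare_records(left_rows, right_rows):
    """Match rows on the key, then compare every data value in a loop."""
    left = {tuple(r[k] for k in KEYS): r for r in left_rows}
    right = {tuple(r[k] for k in KEYS): r for r in right_rows}
    result = {}
    for key in sorted(set(left) | set(right)):
        if key not in right:
            result[key] = ("left only", [])
        elif key not in left:
            result[key] = ("right only", [])
        else:
            differing = [c for c in DATA if left[key][c] != right[key][c]]
            result[key] = ("data" if differing else "match", differing)
    return result

left_rows = parse("""
AAA A 0P1 12 10 8
BBB A 0P1 10 11 7
FFF A 0P1 17 11 8
CCC A 0P1 12 10 8
DDD B 0P2 10 15 2
EEE B 0P3 10 15 2
""")
right_rows = parse("""
AAA A 0P1 12 10 8
BBB A 0P1 10 11 7
FFF A 0P1 19 10 6
EEE B 0P3 10 15 4
NNN c 1P1 19 15 11
CCZ A 0P1 12 10 8
""")

result = compare_records(left_rows, right_rows)
for key, (status, columns) in result.items():
    print(key, status, columns)
```

The sketch does not try to tell a Key row from an ODD row; it only reports which file an unmatched key came from, leaving that judgment to the analyst.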
Top view for comparing left and right files
• Merge the Left File and the Right File by keys, generating matching variables. The merge produces three outputs:
  • Both: observations found in both files (good and bad matches)
  • bad_left: observations found only in the left file
  • Badright: observations found only in the right file
• Run freqs on the matching variables.
• List and compare a few raw records from the bad files to get an idea of the source of the mismatches.
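The three-way split described above might look like this outside SAS. It is a minimal Python sketch under stated assumptions: the function name, the `in_left` and `in_right` flags (standing in for the matching variables), and the sample rows are hypothetical.

```python
def split_by_key(left_rows, right_rows, keys):
    """Merge two files by key and route each key to one of three outputs."""
    left = {tuple(r[k] for k in keys): r for r in left_rows}
    right = {tuple(r[k] for k in keys): r for r in right_rows}
    both, bad_left, bad_right = [], [], []
    for key in sorted(set(left) | set(right)):
        in_left, in_right = key in left, key in right  # matching variables
        if in_left and in_right:
            both.append((left[key], right[key]))
        elif in_left:
            bad_left.append(left[key])
        else:
            bad_right.append(right[key])
    return both, bad_left, bad_right

# Hypothetical rows keyed by store only.
left_rows = [{"store": "AAA", "q1": 12}, {"store": "CCC", "q1": 12}]
right_rows = [{"store": "AAA", "q1": 12}, {"store": "CCZ", "q1": 12}]

both, bad_left, bad_right = split_by_key(left_rows, right_rows, ("store",))

# List a few raw records from each bad file side by side.
for row in bad_left[:5]:
    print("left only :", row)
for row in bad_right[:5]:
    print("right only:", row)
```

Printing the unmatched records next to each other is often enough to spot a pattern, such as the one-character difference between CCC and CCZ in this example.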
Checking Keys and Data gives exact answers

Merged by key (. = missing)
Store  Reg  Prod  LQ1  LQ2  LQ3    RQ1  RQ2  RQ3    Logic
AAA    A    0P1    12   10    8     12   10    8    Match
BBB    A    0P1    10   11    7     10   11    7    Match
FFF    A    0P1    17   11    8     19   10    6    Data
CCC    A    0P1    12   10    8      .    .    .    Key
DDD    B    0P2    10   15    2      .    .    .    ODD
EEE    B    0P3    10   15    2     10   15    4    Data
NNN    c    1P1     .    .    .     19   15   11    ODD
CCZ    A    0P1     .    .    .     12   10    8    Key

Table of mismatch by Left_vs_Right
Each cell shows Frequency, then Percent.

mismatch                       | 1 = Obs in | 10 = Obs in | 11 = Obs in both | Total
                               | Left only  | Right only  | Left and Right   |
-------------------------------+------------+-------------+------------------+--------
NO problems with key or data   |      0     |      0      |         2        |     2
                               |   0.00     |   0.00      |     25.00        |  25.00
-------------------------------+------------+-------------+------------------+--------
Yes: Problems with key or data |      2     |      2      |         2        |     6
                               |  25.00     |  25.00      |     25.00        |  75.00
-------------------------------+------------+-------------+------------------+--------
Total                          |      2     |      2      |         4        |     8
                               |  25.00     |  25.00      |     50.00        | 100.00

Callouts on the table: "We are comparing data with missing values." and "Data problem".
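The frequency table can be reproduced from the eight merged rows with a short tabulation. The Python sketch below is illustrative only. It assumes the source code is formed as 1 for "in left" plus 10 for "in right", which is consistent with the 1, 10, and 11 column labels but is not stated on the slide; the record layout and function name are also hypothetical.

```python
from collections import Counter

def crosstab(records):
    """Tabulate (problem flag, source code) cells as (frequency, percent)."""
    counts = Counter()
    for in_left, in_right, problem in records:
        # Assumed coding: 1 = left only, 10 = right only, 11 = both.
        source = (1 if in_left else 0) + (10 if in_right else 0)
        counts[("Yes" if problem else "No", source)] += 1
    total = sum(counts.values())
    return {cell: (n, round(100 * n / total, 2)) for cell, n in counts.items()}

# One record per merged row: (in_left, in_right, problem with key or data).
records = [
    (True, True, False),   # AAA  Match
    (True, True, False),   # BBB  Match
    (True, True, True),    # FFF  Data
    (True, False, True),   # CCC  Key
    (True, False, True),   # DDD  ODD
    (True, True, True),    # EEE  Data
    (False, True, True),   # NNN  ODD
    (False, True, True),   # CCZ  Key
]

table = crosstab(records)
for cell in sorted(table):
    print(cell, table[cell])
```

Cells with a zero count, such as ("No", 1), simply do not appear in the result.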
Checking Keys and Data gives exact answers

Merged by key (. = missing)
Store  Reg  Prod  STrx1  STrx2  STrx3    OTrx1  OTrx2  OTrx3    Logic
AAA    A    0P1      12     10      8       12     10      8    Match
BBB    A    0P1      10     11      7       10     11      7    Match
FFF    A    0P1      17     11      8       19     10      6    Data
CCC    A    0P1      12     10      8        .      .      .    Key
DDD    B    0P2      10     15      2        .      .      .    ODD
EEE    B    0P3      10     15      2       10     15      4    Data
NNN    c    1P1       .      .      .       19     15     11    ODD
CCZ    A    0P1       .      .      .       12     10      8    Key

Table of mismatch by Sand_vs_ODW
Each cell shows Frequency, then Percent.

mismatch                       | 1 = Obs in | 10 = Obs in | 11 = Obs in both | Total
                               | Left only  | Right only  | Left and Right   |
-------------------------------+------------+-------------+------------------+--------
NO problems with key or data   |      0     |      0      |         2        |     2
                               |   0.00     |   0.00      |     25.00        |  25.00
-------------------------------+------------+-------------+------------------+--------
Yes: Problems with key or data |      2     |      2      |         2        |     6
                               |  25.00     |  25.00      |     25.00        |  75.00
-------------------------------+------------+-------------+------------------+--------
Total                          |      2     |      2      |         4        |     8
                               |  25.00     |  25.00      |     50.00        | 100.00

Callouts on the table: "Ideally, all obs should be here" (no problems, obs in both files) and "Keys Match, problems with the data" (problems, obs in both files).
Timeline
For each of the Left File and the Right File:
• Check for duplicates
• Check for bad codes
• Clean the file
• Contents: date & size
• Freq by Prod_code
Then, across both files:
• Merge and calculate differences by Prod_cd
• Merge and calculate high-level differences
• Identify every row with a problem
• Problem Analysis – Row
• Key Analysis
• Problem Analysis – Rx
Outputs along the timeline are reports (Rpt) and electronic copies.
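The per-file checks at the start of this timeline (duplicates, bad codes, frequency by product code) can be sketched with Python's standard library. This is a hedged illustration, not the authors' QC programs: the function names, the column name `prod_code`, the list of valid codes, and the sample rows are all hypothetical.

```python
from collections import Counter

def duplicate_keys(rows, keys):
    """Return {key: count} for every key that appears more than once."""
    counts = Counter(tuple(r[k] for k in keys) for r in rows)
    return {key: n for key, n in counts.items() if n > 1}

def bad_codes(rows, column, valid):
    """Return the sorted codes in a column that are not in the valid set."""
    return sorted({r[column] for r in rows if r[column] not in valid})

def freq_diff(left_rows, right_rows, column):
    """Return {code: left count - right count} where the counts differ."""
    left = Counter(r[column] for r in left_rows)
    right = Counter(r[column] for r in right_rows)
    return {
        code: left[code] - right[code]
        for code in sorted(set(left) | set(right))
        if left[code] != right[code]
    }

# Hypothetical rows and a hypothetical list of valid product codes.
left_rows = [
    {"store": "AAA", "prod_code": "0P1"},
    {"store": "AAA", "prod_code": "0P1"},
    {"store": "DDD", "prod_code": "0P2"},
]
right_rows = [
    {"store": "AAA", "prod_code": "0P1"},
    {"store": "NNN", "prod_code": "1P1"},
]
valid = {"0P1", "0P2", "0P3"}

dups = duplicate_keys(left_rows, ("store", "prod_code"))
bad = bad_codes(right_rows, "prod_code", valid)
diff = freq_diff(left_rows, right_rows, "prod_code")
print(dups, bad, diff)
```

Running these checks on each file before the merge keeps duplicate keys and invalid codes from being mistaken for genuine differences between the files.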
Timeline QC Process
1. QC Programming: write programs, for a series of files, in anticipation of file delivery.
2. A batch of files to be compared is delivered.
3. Run the QC programs on the batch files.
4. Assemble a report on the batch files (concurrent with the run).
5. Review and annotate the report (1 day).
6. Arrange a meeting with the Responsible Group (1 week).
7. Discuss the report with the Responsible Group and create action items (1 day).
8. Outcome of the discussion:
   • FAIL: investigate / fix action items (1 week) and create a new version of the files (2 weeks).
   • File is OK or "close": if the files are close, the user runs reports with the new file and compares results (1 week).
9. Outcome of the user comparison:
   • Pass: log as file done.
   • FAIL
Conclusion & Recommendations
• When data sources and processes change, a systematic approach such as the one outlined in this presentation, comparing data at both the top level and the record level, provides an efficient mechanism to track progress and to identify and resolve potential problems.
• Comparison and validation should be included in the project timeline.
• QC metrics should be established for the development team. However, total validation must be conducted independently.
• Differences in data must be accounted for 100% of the time.