Lab For LU12 – Physical Database Design and Data Conversion
In this lab you will learn the basics of physical database
design and how to use SQL to perform data conversion.
Learning Objectives
· Describe the steps in the physical design phase
· Explain the various data storage formats for attributes
· Explain the appropriate conditions to de-normalize
· Describe sequential, indexed and direct (hashed) file
organizations
· Explain when to use indexing as a method for improving
performance
· Explain the issues associated with data migration
· Describe various techniques for migrating data in SQL
implementations
Lab Goals
Our lab goals are to:
1. Learn how to import data into SQL Server
2. Use SQL to cleanse, transform and load our source data into
the target tables.
3. Use SQL to perform data verification.
This lab consists of two parts. In part one, we will walk you
through a data migration scenario. In part two, you will attempt a
data migration on your own.
What you will need to begin
1. Microsoft SQL Server Management Studio (for executing SQL
queries).
2. Microsoft Access 2003 or higher for importing data.
3. Your account credentials on the ISTSTUDENTS server.
4. The lab zip file containing the data to import and the target
SQL schema.
Part 1: Data Migration Walk-Thru
Scenario
You have been hired by fictitious book-selling giant Fudgenbooks
to improve the design of their existing database. Fudgenbooks
currently keeps inventory in a spreadsheet - which is far from
ideal. You normalized their existing database and created the SQL
for the new data model. The last step is to migrate the data into
the new data model.
Step one: Import Source Data into SQL Server
Fudgenbooks keeps their existing database in a spreadsheet. Your
first task will be to import that spreadsheet as-is into your SQL
Server database. One of the things Microsoft Access does very well
is connect to other data sources. This makes Access a very
versatile tool for assisting in data migration.
Create a new project in Microsoft Access. Access Data Projects
are different from Access Databases in that the back-end data is in
SQL Server. This creates a two-tier client/server model and a
distributed database with location transparency.
1. Open up Microsoft Access from the start menu. From the Office
menu, select New. Then click on [Blank Database]
2. Click on the folder icon near the filename to browse for the
file destination folder
3. Select your folder, then click on the [Save as type] drop-down.
Select the [Microsoft Access Projects (*.adp)] option, then type a
filename (no need to include the extension). Notice the name and the
.adp extension. Click on Create.
4. You will be asked whether you want to connect to an existing
SQL Server database. Select Yes.
5. Complete the details using the sample below.
(Be sure to check this option, make sure this is your database,
and test your connection before clicking OK.)
1. From the menu, select External Data → Excel. From the import
dialog, change the file type to Excel and select the fudgenbooks.xls
file (the file is available in the LMS).
When you’re ready, select Import.
2. Follow along with the Import Spreadsheet Wizard. Check First
row contains column headings then click Next. Store data in a new
table, click Next. Name the table fudgenbooks_import and click
Finish.
3. If you are successful, you should have a table in your
database called fudgenbooks_import which looks like this (yes, it
is not a lot of data, but it is enough).
At this point, you no longer need Microsoft Access. It has done
its job by helping us get the spreadsheet data into SQL Server.
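As an aside, it is also possible to pull a spreadsheet into SQL Server without Access at all, using OPENROWSET. The sketch below is illustrative only: the provider name, connection string, file path, and sheet name are all assumptions that depend on what is installed on your server, and the Access route above remains the supported path for this lab.
-- Requires the ACE OLE DB provider and the 'Ad Hoc Distributed
-- Queries' server option; the path and [Sheet1$] name are guesses.
SELECT *
INTO fudgenbooks_import
FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
                'Excel 8.0;Database=C:\lab\fudgenbooks.xls',
                'SELECT * FROM [Sheet1$]');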
Step two: Create target database schema
In this next step you will use SQL Server Management Studio to
create the target database schema (tables, keys, and constraints). It
should be noted that this is not a typical step in the data
migration process, since usually the target schema already
exists.
1. Open SQL Server Management Studio and connect to your SQL
server instance.
2. From the menu choose File → Open → File. Open the file from the
zip file archive called create_fudgenbooks_schema.sql. After the
file opens, execute the SQL script to create the 4 fb_* tables.
3. Create a database diagram by adding all the tables that begin
with fb_. This should create a diagram like this:
4. Save the diagram as Fudgenbooks.
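For orientation, the publishers table created by the script looks roughly like the sketch below. The column names are inferred from the queries later in this lab; the data types are assumptions, so treat the actual create_fudgenbooks_schema.sql as authoritative.
-- Rough sketch of fb_publishers; types are assumed, not from the script.
CREATE TABLE fb_publishers (
    pub_id   int PRIMARY KEY,
    pub_name varchar(100) NOT NULL,
    pub_url  varchar(255) NULL
);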
Step three: Data cleansing
The next step in the migration process is data cleansing. In this
step you “clean up” the source data so that things will be nice and
neat when we do the data transformations. What do we need to clean
up? Take a look at the following:
(The data here represents the same thing, but it is represented
differently, ‘mysql’ vs. ‘MySQL’; this needs to be corrected before
we transform the data.)
To clean the data up we run SQL update statements on the source
data. Open a new query window in SQL Management Studio and type the
following:
UPDATE fudgenbooks_import SET Subj1 = 'MySQL' WHERE lower(Subj1) = 'mysql';
UPDATE fudgenbooks_import SET Subj2 = 'MySQL' WHERE lower(Subj2) = 'mysql';
UPDATE fudgenbooks_import SET Subj1 = 'MSSQL' WHERE lower(Subj1) = 'mssql';
UPDATE fudgenbooks_import SET Subj2 = 'MSSQL' WHERE lower(Subj2) = 'mssql';
The resultant source data changes to look like this:
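If you want to confirm the cleanup without scanning the grid by eye, a quick distinct-value check works; a minimal sketch:
-- Both columns should now show only the canonical spellings.
select distinct Subj1 from fudgenbooks_import;
select distinct Subj2 from fudgenbooks_import;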
Now that the data is clean, we’re ready to transform!
Step four: Transform the source data and load into the target
tables
This next step is the critical one. Here’s where we move data
from the source to the destination tables. Looking at our target
schema, we must start the import with the tables that have no
foreign keys. Therefore we first migrate data into the fb_publishers
table, then fb_bookstore, fb_authors, and fb_subjects, in that order.
Regardless of the table we’re migrating, the approach is the
same. First, create the SQL SELECT statement(s) that output the
data as you need it to appear in the target table. Keep in mind
these SQL techniques as you attempt to accomplish the task (a
combined sketch follows the list):
· The DISTINCT keyword outputs unique rows of data. This is
important for building lookup tables, and tables on the “one” side
of the 1-M relationship.
· The UNION keyword allows you to combine like SQL statements
into one output. This is useful for addressing multi-valued
attributes and repeating groups of data.
· To combine columns of data, use the + operator, e.g. [Last Name]
+ ', ' + [First Name] as [Full Name]
· To split one column into two, use the CHARINDEX function, e.g.
left([Full Name], charindex(',', [Full Name]) - 1) as [Last Name]
(note that CHARINDEX takes the string to find first, then the string
to search)
· Use CAST and CONVERT to change the data type of columns
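Here is the combined sketch promised above, showing several of these techniques in one query. The names_import table and its columns are hypothetical, invented purely for illustration; they are not part of this lab’s data.
-- Hypothetical example: de-duplicate rows, split a 'Last, First'
-- column in two, and cast a column to int.
select distinct
    left([Full Name], charindex(',', [Full Name]) - 1) as [Last Name],
    ltrim(right([Full Name], len([Full Name]) - charindex(',', [Full Name]))) as [First Name],
    cast([Hire Year] as int) as [Hire Year]
from names_import
where [Full Name] is not null;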
First we load up the fb_publishers table:
-- 1st step: this statement populates the publishers table
insert into fb_publishers
select distinct Pub_id, [Publisher Name], [Publisher URL]
from fudgenbooks_import
Then the table data should look like this.
Next we load the fb_bookstore table:
-- 2nd step: this statement populates the fb_bookstore table
insert into fb_bookstore
select distinct convert(int, [ISBN]), [Title], [Price], [Pages], [Pub_id]
from fudgenbooks_import
And the table data looks like:
Finally we move on to the fb_authors and fb_subjects tables.
These require union queries since columnar data needs to be moved
into rows:
-- 3rd this statement populates the fb_authors table
insert into fb_authors
select Author1, convert(int,ISBN) from fudgenbooks_import
where Author1 is not null
union
select Author2, convert(int,ISBN) from fudgenbooks_import
where Author2 is not null
union
select Author3, convert(int,ISBN) from fudgenbooks_import
where Author3 is not null
And the table data looks like:
-- Last: this statement populates the fb_subjects
insert into fb_subjects
select Subj1, convert(int,ISBN) from fudgenbooks_import
where Subj1 is not null
union
select Subj2, convert(int,ISBN) from fudgenbooks_import
where Subj2 is not null
Whew! That seems like a lot of work. We did, however, automate it
to the point where it will work each and every time regardless of
the actual data. One thing I might change in hindsight is to alter
the ISBN column in the fudgenbooks_import table during the
cleansing phase to avoid having to use convert(int, ISBN). (This was
necessary because ISBN was imported as a float, and converting a
float to varchar retains the decimal, so an ISBN could have come
over looking like 121398742.0!)
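That hindsight fix would look something like this sketch, run during the cleansing phase before any of the INSERT statements (it assumes every ISBN value converts cleanly to int):
-- Change ISBN from float to int once, up front, so the
-- transformation queries can reference ISBN directly.
ALTER TABLE fudgenbooks_import ALTER COLUMN ISBN int;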
Step five: Data Verification
The final step is to perform data verification on the target data
to make sure it does indeed have the same data as the source. The
easiest (but farthest from complete) means of accomplishing this is
to check row counts. The following query compares the total row
counts of the original data source and the corresponding table in
the target data source.
SELECT
    (select count(*) from fb_bookstore) as fb_bookstore_count,
    (select count(*) from fudgenbooks_import) as fudgenbooks_count
Row counts work as long as you’re also checking to make sure the
SQL commands are not generating errors. Data that errors out during
transformation / loading will definitely affect row counts.
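One way to surface such errors explicitly, rather than hoping you spot them in the messages pane, is to wrap each load in a TRY/CATCH block. A minimal sketch using the publishers load from step four:
-- If the INSERT fails (bad conversion, constraint violation, etc.)
-- the CATCH block reports the error instead of it scrolling past.
BEGIN TRY
    insert into fb_publishers
    select distinct Pub_id, [Publisher Name], [Publisher URL]
    from fudgenbooks_import;
END TRY
BEGIN CATCH
    PRINT 'Load failed: ' + ERROR_MESSAGE();
END CATCH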
While not true data verification, it is a good idea to spot
check some sample data for correctness. For example, this query
joins all the tables back together to give you a view that closely
resembles the original data.
SELECT b.isbn, title, price, pages, p.pub_id, pub_name, pub_url,
    author, subj
FROM fb_bookstore b
    JOIN fb_publishers p ON b.pub_id = p.pub_id
    JOIN fb_authors a ON b.isbn = a.isbn
    JOIN fb_subjects s ON b.isbn = s.isbn
ORDER BY title
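If you want a check that is stronger than eyeballing, the EXCEPT operator returns rows present in one result set but missing from the other. A sketch comparing two columns (it assumes the Price data types on the two sides are comparable):
-- Source rows with no matching target row; empty output is a good sign.
select [Title], [Price] from fudgenbooks_import
except
select title, price from fb_bookstore;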
Questions
1. Data cleansing and data transformation are similar tasks, but
they do differ in some significant ways. How do they differ?
2. What are the challenges associated with automating the
transformation and load phases of data migration?
3. In this example why is it not possible to re-create the
original source data from the target data? Hint: what normal form
was the source in when we started?
4. Now that you’ve experienced a migration, do you think
explicit transactions might bring anything useful to the process?
If so, what?
5. Many DBMSs like SQL Server have their own built-in import and
export facilities which can be used for the extraction phase of
migration. What might the advantages of using a 3rd-party
application such as Microsoft Access be?
Part 2: Perform your own data migration
In this part, you will perform your own data migration by
applying the principles and techniques used in the walk-thru
portion of the lab in part one.
Overview
Migrate data from the spreadsheet named fudgeconsulting.xls (in
the lab zip file) into the SQL schema for Fudge Consulting
(create_fudgeconsulting_schema.sql). Once you’ve performed the data
migration, verify the data by “eyeballing” the original table
against a SELECT statement that joins the target data model back
together.
Here are some hints to make the process a little easier:
1. Use Microsoft Access to import the data as-is into SQL
Server, just like we did in part 1 of the lab.
2. Execute the SQL file to create the target schema and then
create a database diagram so that you have a picture of what the
internal model looks like, and can determine which tables to
migrate first.
3. Microsoft Access imports the [Proj Num] and [Emp No] columns
as float. During the data cleansing step, alter the
fudgeconsulting_import table and change the columns’ data type to
int. This will make the transformation / load step much easier. For
example:
ALTER TABLE fudgeconsulting_import ALTER COLUMN [Proj Num] int
You’ll have to do this a second time for the [Emp No] column.
4. Remember to use DISTINCT to only return the unique rows. Also,
when you need to split the last name and first name from the [Emp
Name] column, use expressions. Hint: this gives you the last name
(see the sketch after this list for the matching first-name
expression):
right([Emp Name], len([Emp Name]) - charindex(' ', [Emp Name]))
5. You don’t need to worry about [Total billed]; it has been
implemented as a calculated column with a function.
6. When you verify the migration, run two SQL queries and compare
them:
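As promised in hint 4, the matching first-name expression is a sketch along these lines (assuming [Emp Name] is stored as ‘First Last’ with a single space):
-- Everything before the first space is the first name.
left([Emp Name], charindex(' ', [Emp Name]) - 1) as [First Name]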
Questions
1. Write down the SQL statement you used to transform data into
the fc_rates table.
2. Write down the SQL statement you used to transform data into
the fc_projects table.
3. Write down the SQL statement you used to transform data into
the fc_employees table.
4. Write down the SQL statement you used to transform data into
the fc_consulting table.
5. Write down the SQL statement you used to verify the data is
the same as the source. (The output should look like the bottom
grid in #6 above.)