Database Architecture and Design Workshop
George Grubbs
May 17, 2005
RTI International is a trade name of Research Triangle Institute
3040 Cornwallis Road ■ P.O. Box 12194 ■ Research Triangle Park, North Carolina, USA 27709
Phone 919-990-8397 ■ e-mail [email protected] ■ Fax 919-541-6178
Field Director’s Guide to “Database Design Appreciation”
Why should you care? Expectation setting
Become familiar with data modeling and the database design process, terminology and concepts.
To understand what goes on when a survey is being developed or changed.
To better communicate with database designers and programmers when developing or modifying a survey instrument.
To appreciate the overall database design process and its value.
To get better study outcomes and smoother system development efforts.
A database is a structured collection of related data.
Office terminology: file, record, field.
Database terminology (logical): entity, entity instance, attribute.
Database terminology (physical): table, row, column.
Example: Survey database using “logical terminology”: Questionnaire, Respondent, and Response entities. A Response entity instance = “John Smith’s response to question 10”. The Response entity’s attribute might be called “answer”, and the data value for “John Smith’s response to question 10” might be “true”.
What is the process for getting data into an (electronic) database from a source document? That’s next.
Source document to electronic form and into the database
Example data: Survey_ID: “123”, Q1: “Y”, Q2: “N”.
Program: some kind of programming language (Visual Basic, “C”, EntryPoint, etc.). Uses SQL (Structured Query Language) to interface with the DBMS.
DBMS = MS Access, SQL Server 2000, Oracle, DB2.
OS = Operating System = Windows, Linux, Unix.
Hard drive.
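The path from keyed-in values to stored rows can be sketched in a few lines. This is only an illustration: it uses Python with the built-in SQLite engine standing in for the DBMS, and the table name `survey` and its column names are hypothetical (the slide only shows the values Survey_ID “123”, Q1 “Y”, Q2 “N”).

```python
import sqlite3

# SQLite (in memory) stands in for the DBMS; any DBMS from the slide would
# play the same role. Table and column names are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE survey (survey_id TEXT PRIMARY KEY, q1 TEXT, q2 TEXT)")

# Values keyed in from the source document.
form = {"survey_id": "123", "q1": "Y", "q2": "N"}

# The program passes SQL to the DBMS; the DBMS and the operating system
# take care of writing the data to disk.
conn.execute(
    "INSERT INTO survey (survey_id, q1, q2) VALUES (:survey_id, :q1, :q2)", form
)
conn.commit()

stored = conn.execute("SELECT survey_id, q1, q2 FROM survey").fetchall()
print(stored)
```

The program never touches the hard drive directly; it only issues SQL statements.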
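Example: Survey with 400 questions and each response averages 100 characters, so one survey takes about 400 × 100 = 40,000 bytes (at one byte per character).
1 GB (2^30 bytes) ≈ 26,844 surveys.
1 TB (2^40 bytes) ≈ 27,487,791 surveys.
1 PB (2^50 bytes) ≈ 28,147,497,671 surveys.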
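The survey counts above can be reproduced with a short calculation. The sketch assumes one byte per character, binary units (1 GB = 2^30 bytes, and so on), rounding to the nearest survey, and no DBMS storage overhead; these assumptions are not stated on the slide but they match its figures.

```python
QUESTIONS_PER_SURVEY = 400
CHARS_PER_RESPONSE = 100

# Assumption: one byte per character, no indexes or other DBMS overhead.
BYTES_PER_SURVEY = QUESTIONS_PER_SURVEY * CHARS_PER_RESPONSE

# Assumption: binary units.
capacities = {"1 GB": 2**30, "1 TB": 2**40, "1 PB": 2**50}

surveys = {
    label: round(size / BYTES_PER_SURVEY) for label, size in capacities.items()
}
for label, count in surveys.items():
    print(f"{label} = {count:,} surveys")
```

A real database would hold fewer surveys than this, because keys, indexes and page structure also take space.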
Database design. First things first: Requirements
Diagram labels: Output and End Users; Questionnaire, Interview instrument; Interviewers; process.
You might get some requirements from considering the interviewees
The Population
Data modeling is the heart of db design
1. Construct a logical data model
   Entity-Relationship Diagram (ERD)
   Key-Based Data Model (KBDM)
   Fully-Attributed Data Model (FADM)
2. Construct a physical data model
   Physical Data Model (PDM)
Make data model improvements
Main goals in database design
Minimize redundant data (ideally, each data value should be in only one place in the database).
Reflect the business rules of the application domain (data quality).
Construct a clear and understandable data model that is well-documented (used to “communicate”).
Benefits: data quality, structural integrity, data consistency, performance, better-understood requirements.
Logical Data Model - ERD
The ERD is very simple: it only considers entities and their relationships.
An entity models something in the “real world”, that is, something in our “application domain” (the “survey domain” in our case). For example, an entity could be a “person”.
Let’s look at some entity examples, then deal with relationships.
Example entities with a few attributes
Questionnaire:
  Type A, Eff Date: 2/13/04, …
  Type B, Eff Date: 2/13/03, …
  Type B, Eff Date: 4/1/05, …
Respondent:
  Joyce E. Smith, Female, lives in North Carolina, Age 42, …
Question:
  1. What state are you from?
  2. What is your age?
  1. Plan to re-visit?
Interviewer:
  John W. Romano, Male, 5 years of experience, …
  Carrie Jones, Female, 1 year of experience, …
Response:
  True, False, True, True, …
So how are these entities related?
Let’s see
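Entity relationships
Entities: Questionnaire, Respondent, Interviewer, Response, Question.
Relationships (each line in the diagram is labeled in both directions):
  Questionnaire is completed by Respondent; Respondent completes Questionnaire.
  Questionnaire has Question; Question is part of Questionnaire.
  Respondent makes Response; Response is made by Respondent.
  Interviewer interviews Respondent; Respondent is interviewed by Interviewer.
  Question has Response; Response is answer to Question.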
Cardinality
Cardinality is the occurrence relationship between two entities: the number of times one entity instance can occur for each instance of a related entity.
Diagram: the Questionnaire, Respondent, Interviewer, Response and Question entities, with four relationship lines labeled 1 to N and one labeled N to M.
KBDM: Key-Based Data Model. Primary Keys (PK)
Respondent: respondent_id (PK)
Interviewer: interviewer_id (PK)
Questionnaire: questionnaire_id (PK)
Question: question_id (PK)
Response: response_id (PK)
A primary key value uniquely identifies a row in a table. The lines are used to indicate types of relationships and cardinality.
KBDM: Key-Based Data Model. Primary Keys and Foreign Keys (FK)
Interviewer: interviewer_id (PK)
Questionnaire: questionnaire_id (PK)
Question: questionnaire_id (PK)(FK), question_id (PK)
Response: respondent_id (PK)(FK), question_id (PK)(FK), questionnaire_id (PK)(FK)
Resp_Intrvr_Assoc: respondent_id (PK)(FK), interviewer_id (PK)(FK)
Respondent: respondent_id (PK), questionnaire_id (FK)
Foreign keys are used to establish relationships between tables.
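As a rough illustration, the key-based model can be written out as table definitions. The sketch below uses SQLite through Python; the table and key names come from the slide, while the INTEGER column types and the NOT NULL constraints are assumptions made only so that the example runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

# Column types are assumed; the slide lists key names only.
conn.executescript("""
CREATE TABLE Questionnaire (
    questionnaire_id INTEGER PRIMARY KEY
);
CREATE TABLE Interviewer (
    interviewer_id INTEGER PRIMARY KEY
);
CREATE TABLE Respondent (
    respondent_id INTEGER PRIMARY KEY,
    questionnaire_id INTEGER NOT NULL
        REFERENCES Questionnaire (questionnaire_id)
);
CREATE TABLE Question (
    questionnaire_id INTEGER NOT NULL
        REFERENCES Questionnaire (questionnaire_id),
    question_id INTEGER NOT NULL,
    PRIMARY KEY (questionnaire_id, question_id)
);
CREATE TABLE Response (
    respondent_id INTEGER NOT NULL REFERENCES Respondent (respondent_id),
    questionnaire_id INTEGER NOT NULL,
    question_id INTEGER NOT NULL,
    PRIMARY KEY (respondent_id, questionnaire_id, question_id),
    FOREIGN KEY (questionnaire_id, question_id)
        REFERENCES Question (questionnaire_id, question_id)
);
-- Association table that resolves the many-to-many relationship
-- between respondents and interviewers.
CREATE TABLE Resp_Intrvr_Assoc (
    respondent_id INTEGER NOT NULL REFERENCES Respondent (respondent_id),
    interviewer_id INTEGER NOT NULL REFERENCES Interviewer (interviewer_id),
    PRIMARY KEY (respondent_id, interviewer_id)
);
""")

tables = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]
print(sorted(tables))
```

Each one-to-many line in the diagram becomes a foreign key on the “many” side, and the many-to-many line becomes its own table.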
FADM: Fully-Attributed Data Model. Add Attributes (and Normalize)
Questionnaire: questionnaire_id (PK), type_code, effective_date
Question: question_id (PK), questionnaire_id (FK), question_text
Response: respondent_id (FK), question_id (FK), questionnaire_id (FK), response
Respondent: respondent_id (PK), last_name, questionnaire_id (FK)
Interviewer: interviewer_id (PK), last_name
Resp_Intrvr_Assoc: respondent_id (FK), interviewer_id (FK), notes
Only a few attributes shown.
A word about “normalization”
In practice, to normalize a database design usually means to put it in third normal form (3NF).
There are quite a few normal forms: 1NF, 2NF, 3NF, BCNF, 4NF, 5NF and even others.
The goal of normalization is primarily to minimize data redundancy. A fully normalized database can, however, be inefficient because its queries become complex; therefore, once actual performance is known, parts of the design may be de-normalized to improve it.
Normalization examples
1. Respondent: respondent_id (PK), interviewer_1_name, interviewer_2_name
2. Respondent: respondent_id (PK)
   Respondent_Intvwr: respondent_id (PK)(FK), interviewer_name (PK)
3. Respondent: respondent_id (PK)
   Respondent_Intvwr: respondent_id (PK)(FK), interviewer_id (PK)(FK)
   Interviewer: interviewer_id (PK), interviewer_name
What’s wrong with having repeating fields?
What if you need to have more than 2 interviewers?
What if an interviewer’s name changes?
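The two questions above can be tried out against the fully normalized design (Respondent, Respondent_Intvwr, Interviewer). The following is a minimal sketch using SQLite through Python; column types are assumed, and the names “Pat Lee” and “Carrie Smith” are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Respondent (respondent_id INTEGER PRIMARY KEY);
CREATE TABLE Interviewer (
    interviewer_id INTEGER PRIMARY KEY,
    interviewer_name TEXT NOT NULL
);
CREATE TABLE Respondent_Intvwr (
    respondent_id INTEGER REFERENCES Respondent (respondent_id),
    interviewer_id INTEGER REFERENCES Interviewer (interviewer_id),
    PRIMARY KEY (respondent_id, interviewer_id)
);
INSERT INTO Respondent VALUES (1), (2);
INSERT INTO Interviewer VALUES
    (10, 'Carrie Jones'), (11, 'John W. Romano'), (12, 'Pat Lee');
-- Respondent 1 has three interviewers: more rows, no new columns.
INSERT INTO Respondent_Intvwr VALUES (1, 10), (1, 11), (1, 12), (2, 10);
""")

# A name change touches exactly one row, however many respondents
# the interviewer is linked to.
cur = conn.execute(
    "UPDATE Interviewer SET interviewer_name = 'Carrie Smith' "
    "WHERE interviewer_id = 10"
)
rows_changed = cur.rowcount

names = conn.execute("""
    SELECT ri.respondent_id, i.interviewer_name
    FROM Respondent_Intvwr AS ri
    JOIN Interviewer AS i ON i.interviewer_id = ri.interviewer_id
    WHERE i.interviewer_id = 10
    ORDER BY ri.respondent_id
""").fetchall()
print(rows_changed, names)
```

With the repeating-field design, the same name change would have to be made in every Respondent row that mentions the interviewer.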
A word about “referential integrity”
Respondent: respondent_id (PK), last_name, questionnaire_id (FK)
Questionnaire: questionnaire_id (PK), type_code, effective_date
Would it make sense to have someone in the “Respondent” table with a “questionnaire_id” that did not “point to” a questionnaire in the “Questionnaire” table?
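A DBMS can refuse such a row when the foreign key is declared. The sketch below shows this with SQLite through Python; the column types and sample rows are assumptions, and other DBMS products report the violation with their own error messages.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE Questionnaire (
    questionnaire_id INTEGER PRIMARY KEY,
    type_code TEXT,
    effective_date TEXT
);
CREATE TABLE Respondent (
    respondent_id INTEGER PRIMARY KEY,
    last_name TEXT,
    questionnaire_id INTEGER REFERENCES Questionnaire (questionnaire_id)
);
INSERT INTO Questionnaire VALUES (1, 'A', '2/13/04');
""")

# Points to an existing questionnaire: accepted.
conn.execute("INSERT INTO Respondent VALUES (1, 'Smith', 1)")

# Points to questionnaire 99, which does not exist: rejected by the DBMS.
try:
    conn.execute("INSERT INTO Respondent VALUES (2, 'Jones', 99)")
    rejected = False
except sqlite3.IntegrityError as exc:
    rejected = True
    print("Rejected:", exc)

respondent_count = conn.execute("SELECT COUNT(*) FROM Respondent").fetchone()[0]
```

Only the valid respondent is stored, so every “questionnaire_id” in Respondent is guaranteed to match a row in Questionnaire.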
UPDATE Questionnaire SET type_code = 'B' WHERE questionnaire_id = 1;
Retrieve data from a database
Retrieving information from the database
List the respondents from North Carolina along with their age.
SELECT first_name, middle_initial, last_name, age FROM Respondent WHERE home_state_code = 'NC';
What are the questions for the Type B (4/10/05) questionnaire?
SELECT question_text FROM Questionnaire, Question WHERE Questionnaire.questionnaire_id = Question.questionnaire_id AND type_code = 'B' AND effective_date = '4/10/05' ORDER BY question_label;
Equating table keys (e.g., “questionnaire_id”) in the WHERE clause is called a “join”.
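To see the join at work, the query for the Type B questionnaire can be run against a few sample rows. This sketch uses SQLite through Python; the column types, the sample rows, and the storage of the date as plain text are assumptions. It also shows the equivalent explicit JOIN ... ON spelling, which returns the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Questionnaire (
    questionnaire_id INTEGER PRIMARY KEY,
    type_code TEXT,
    effective_date TEXT   -- stored as text here to match the slide's literal
);
CREATE TABLE Question (
    question_id INTEGER PRIMARY KEY,
    questionnaire_id INTEGER REFERENCES Questionnaire (questionnaire_id),
    question_label TEXT,
    question_text TEXT
);
INSERT INTO Questionnaire VALUES (1, 'A', '2/13/04'), (2, 'B', '4/10/05');
INSERT INTO Question VALUES
    (1, 1, '1', 'Plan to re-visit?'),
    (2, 2, '1', 'What state are you from?'),
    (3, 2, '2', 'What is your age?');
""")

# Join written by equating keys in the WHERE clause, as on the slide.
implicit = conn.execute("""
    SELECT question_text FROM Questionnaire, Question
    WHERE Questionnaire.questionnaire_id = Question.questionnaire_id
      AND type_code = 'B' AND effective_date = '4/10/05'
    ORDER BY question_label
""").fetchall()

# The same join written with explicit JOIN ... ON syntax.
explicit = conn.execute("""
    SELECT question_text FROM Questionnaire
    JOIN Question ON Questionnaire.questionnaire_id = Question.questionnaire_id
    WHERE type_code = 'B' AND effective_date = '4/10/05'
    ORDER BY question_label
""").fetchall()

print(implicit)
```

Only the questions whose “questionnaire_id” matches the Type B questionnaire are returned.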
More SQL
What are the questions and responses for Joyce Smith and what is her home state?
(Notice the use of “t1”, “t2” and “t3” – that is just a shorthand way of referring to table names.)
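The query itself is not reproduced in this transcript. One way such a three-table query might look is sketched below, using SQLite through Python. The aliases “t1”, “t2” and “t3” follow the note above, but which table each alias stands for, the column types, and the sample rows are all assumptions, and the questionnaire key is left out to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Respondent (
    respondent_id INTEGER PRIMARY KEY,
    first_name TEXT, last_name TEXT, home_state_code TEXT
);
CREATE TABLE Question (
    question_id INTEGER PRIMARY KEY,
    question_text TEXT
);
CREATE TABLE Response (
    respondent_id INTEGER REFERENCES Respondent (respondent_id),
    question_id INTEGER REFERENCES Question (question_id),
    response TEXT
);
INSERT INTO Respondent VALUES (1, 'Joyce', 'Smith', 'NC');
INSERT INTO Question VALUES
    (1, 'What state are you from?'), (2, 'What is your age?');
INSERT INTO Response VALUES (1, 1, 'North Carolina'), (1, 2, '42');
""")

# t1, t2 and t3 are short aliases for the three table names (assumed mapping).
rows = conn.execute("""
    SELECT t2.question_text, t3.response, t1.home_state_code
    FROM Respondent t1, Question t2, Response t3
    WHERE t1.respondent_id = t3.respondent_id
      AND t2.question_id = t3.question_id
      AND t1.first_name = 'Joyce' AND t1.last_name = 'Smith'
    ORDER BY t2.question_id
""").fetchall()

for question_text, response, home_state in rows:
    print(question_text, "->", response, "| home state:", home_state)
```

The Response table sits in the middle: it is joined to Respondent on “respondent_id” and to Question on “question_id”.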
Tables with columns for keys and for fields to contain what the respondents enter.
Include a place for “Other” inputs. Make the design flexible to accommodate changes. Normalize the design.
Q1: State, City, Zip, Country. Ex: “NC”, “Charlotte”, “28212”, “USA”.
Q2: Reasons for visiting Miami: First, Second, Third reasons. Pick from lists, plus “other” text.
Q3: Leisure activities: Multiple – pick from list, plus “other” text.
Q4: Time spent on trip. Ex: “2”, “days”; “5”, “hours”.
Q5: Number of nights away from home on trip: Ex: “0”, “1”, “6”.
Q6: Number of total visits to Miami in 2 years: Ex: “1”, “4”.
Q7: Plan to return to Miami? Ex: “Yes”, “No”. Reason.
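As a starting point for discussion, here is one possible way to hold a multiple-choice question with an “other” option, such as Q3. It is a sketch only, not a model answer: the table names `Activity` and `Respondent_Activity`, the codes, and the sample answers are all hypothetical, and SQLite through Python is used just to make it runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Pick list: adding a new activity is a new row, not a design change.
CREATE TABLE Activity (
    activity_code TEXT PRIMARY KEY,
    description TEXT NOT NULL
);
-- One row per activity a respondent picked; other_text is filled in
-- only when the 'OTHER' code is chosen.
CREATE TABLE Respondent_Activity (
    respondent_id INTEGER NOT NULL,
    activity_code TEXT NOT NULL REFERENCES Activity (activity_code),
    other_text TEXT,
    PRIMARY KEY (respondent_id, activity_code)
);
INSERT INTO Activity VALUES
    ('BEACH', 'Beach'), ('MUSEUM', 'Museum'), ('OTHER', 'Other');
INSERT INTO Respondent_Activity VALUES
    (1, 'BEACH', NULL), (1, 'OTHER', 'Kite surfing'), (2, 'MUSEUM', NULL);
""")

picks = conn.execute("""
    SELECT a.description, COALESCE(ra.other_text, '')
    FROM Respondent_Activity AS ra
    JOIN Activity AS a ON a.activity_code = ra.activity_code
    WHERE ra.respondent_id = 1
    ORDER BY a.activity_code
""").fetchall()
print(picks)
```

The same pattern avoids repeating fields such as activity_1, activity_2, and so on, and could be adapted to the ranked reasons in Q2 by adding a rank column.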