File Management
Bit: smallest unit of data a computer can process (0 or 1)
Byte: 8 bits (one character, e.g. a letter, digit or symbol)
UNICODE: UTF-8, UTF-16, UTF-32
Field: a logical grouping of characters, e.g. a student's name appears in the name field
Record: a logical group of related fields (student name, course, marks, rank)
File: a logical group of related records
Database: a logical group of related files
Primary Key: a field in a record that uniquely identifies that record
Secondary Key: carries some identifying information but need not be unique, e.g. student surname
Foreign Key: provides a relationship between two tables. A foreign key is a field in a relational table that matches the primary key of another table.
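The key and record concepts above can be sketched with a small SQLite example (Python's built-in sqlite3 module; the table and column names here are invented for illustration):

```python
import sqlite3

# Hypothetical student/enrolment schema illustrating primary, secondary
# and foreign keys.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enable FK enforcement in SQLite
conn.execute("""
    CREATE TABLE student (
        student_cd TEXT PRIMARY KEY,  -- primary key: uniquely identifies a record
        surname    TEXT               -- secondary key: identifying, not unique
    )""")
conn.execute("""
    CREATE TABLE enrolment (
        course     TEXT,
        student_cd TEXT REFERENCES student(student_cd)  -- foreign key
    )""")
conn.execute("INSERT INTO student VALUES ('K105', 'ABC')")
conn.execute("INSERT INTO enrolment VALUES ('DBMS', 'K105')")

# The foreign key rejects an enrolment for a student that does not exist.
try:
    conn.execute("INSERT INTO enrolment VALUES ('DBMS', 'K999')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

Note that the foreign key only works because it matches the primary key of the `student` table, exactly as the definition above states.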
Sequential File Organization: data records are retrieved in the same physical sequence in which they are stored (e.g. magnetic tapes in tape recorders)
Direct File Access Method: records can be retrieved in any sequence, without regard to their actual physical order on the storage medium (e.g. CD).
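The difference between the two access methods can be sketched with an in-memory "file" of fixed-length records (record size and contents are invented for illustration):

```python
import io

RECLEN = 5
f = io.BytesIO(b"AAAAABBBBBCCCCCDDDDD")  # four fixed-length records

# Sequential access: read records in the order they are physically stored.
f.seek(0)
first_two = [f.read(RECLEN) for _ in range(2)]

# Direct access: jump straight to the fourth record via its computed offset,
# without reading the records before it.
f.seek(3 * RECLEN)
fourth = f.read(RECLEN)

print(first_two, fourth)  # [b'AAAAA', b'BBBBB'] b'DDDDD'
```

A tape drive can only do the first pattern efficiently; a disk or CD supports the second.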
File Environment: Major Problems
i) Data Redundancy: duplicate data in several files linked with different applications developed over time (wastes data entry effort and physical storage)
ii) Data Inconsistency: caused by data redundancy; data values across files are not synchronized (a change in a customer address should be reflected in all associated files)
iii) Data Isolation: with applications uniquely designed and implemented, data files are likely to be organized differently and stored in different formats (sales in units of thousands vs lakhs). Manual comparison of printed outputs from different applications (product-wise daily sales, product-wise inventory) defeats the purpose of introducing IT.
iv) Data Integrity: difficult to enforce data integrity constraints uniformly across multiple data files (e.g. a student roll number should not contain an alphabetic character)
v) Concurrency Problems: while one application is updating a record, another may access that record; the second application may not get the desired result.
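The concurrency problem above is the classic "lost update". A deterministic sketch (the record and values are invented; real systems interleave the steps unpredictably across processes):

```python
# Two applications each read the same record, modify their own copy,
# and write back; whichever writes last silently erases the other's update.
record = {"balance": 100}

# Both applications read the record before either writes.
a1_view = record["balance"]
a2_view = record["balance"]

# Each writes back a value computed from its stale copy.
record["balance"] = a1_view + 50   # app 1 deposits 50
record["balance"] = a2_view - 30   # app 2 withdraws 30: the deposit is lost

print(record["balance"])  # 70, not the expected 120
```

A DBMS avoids this by locking the record (or versioning it) between the read and the write.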
vi) Security: new files are added on an ad-hoc basis; more applications mean more people accessing data.
vii) Data Dependency: in a file environment, applications and their associated data files are interdependent.
viii) No Central Listing of Files: with hundreds of applications and data files, no one knows what each application does or what data it requires.
Centralized Databases: Advantages
All files are at one location; database administration is easy; maintaining consistency, protecting against unauthorized access and implementing disaster recovery are easy.
Disadvantages
Vulnerable to a single point of failure: access for all users is affected
Transmission costs and delays in access when users are at far-off distances
Distributed Databases:
Replicated: complete copies of the entire database at many locations, to alleviate the single-point-of-failure problem of a centralized database and to increase user responsiveness.
However, maintenance overheads are high (keeping the replicated databases consistent).
DBMS
Persistence: attributes are permanently stored on a hard drive until removed.
Query Ability: requesting attribute information, e.g. how many two-door cars in Texas are green?
Concurrency: many users can change or read data simultaneously.
Backup: replication done to guard against failure.
Rule Enforcement: rules defined for filling a particular attribute field; they can easily be added and removed as required.
Security: limits on who can access or change data.
Computation: calculations such as sums, averages and sorts are done easily.
Change and Access Logging: it is often necessary to know who accessed which attributes, and what was changed and when. This is done by logging services that keep a record of access occurrences and changes.
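Two of the features above, query ability and rule enforcement, can be sketched with SQLite (the `cars` table and its values are invented, echoing the two-door-car question above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE cars (
        model  TEXT,
        doors  INTEGER CHECK (doors BETWEEN 2 AND 5),  -- rule enforcement
        state  TEXT,
        colour TEXT
    )""")
conn.executemany("INSERT INTO cars VALUES (?,?,?,?)", [
    ("coupe", 2, "Texas", "green"),
    ("sedan", 4, "Texas", "green"),
    ("coupe", 2, "Texas", "red"),
])

# Query ability: how many two-door cars in Texas are green?
n = conn.execute(
    "SELECT COUNT(*) FROM cars "
    "WHERE doors = 2 AND state = 'Texas' AND colour = 'green'"
).fetchone()[0]
print(n)  # 1

# Rule enforcement: a nine-door car violates the CHECK rule and is rejected.
try:
    conn.execute("INSERT INTO cars VALUES ('bus', 9, 'Texas', 'blue')")
except sqlite3.IntegrityError:
    print("rejected by CHECK constraint")
```

The rule lives in the schema, so every application inserting into the table is bound by it, unlike in a file environment.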
1. Data Model: defines the way data are conceptually structured (network, hierarchical, relational, multidimensional, etc.)
2. Data Definition Language: language used by programmers to specify the type and structure of the database (the link between logical and physical views)
3. Data Manipulation Language: helps extract data from the database as per information requests; helps to sort, display and delete data, e.g. SQL (Structured Query Language)
4. Data Dictionary: stores definitions of data elements (fields) and data characteristics such as ownership, authorization and security. It reduces data inconsistencies (as data elements are defined once), makes program development faster (as data details are predefined) and makes modification easier (the presence of elements in the database is documented).
Metadata: data dictionaries are a form of metadata. Metadata is information about information; it helps companies track transactions and analyse their success.
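A short sketch of a data manipulation language at work, using SQL through SQLite (the `marks` table and its rows are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE marks (student TEXT, score INTEGER)")
conn.executemany("INSERT INTO marks VALUES (?, ?)",
                 [("ABC", 72), ("DEF", 91), ("GHI", 58)])

# Extract and sort: display students in rank order of score.
ranked = conn.execute(
    "SELECT student, score FROM marks ORDER BY score DESC").fetchall()
print(ranked)  # [('DEF', 91), ('ABC', 72), ('GHI', 58)]

# Delete: remove records matching a condition.
conn.execute("DELETE FROM marks WHERE score < 60")
remaining = conn.execute("SELECT COUNT(*) FROM marks").fetchone()[0]
print(remaining)  # 2
```

Each request names *what* data is wanted; the DBMS decides *how* to fetch it, which is the point of a DML.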
Normalization is a set of rules for transforming data into a form with more desirable properties: internal consistency, minimal redundancy and maximum stability.
Properties of a normalized design
The process systematically removes duplication of data, which leads to minimum storage requirements.
Since data items are stored in a minimum number of places, the chances of data inconsistency are minimized.
Normalized structures are optimal for updates (insert, update and delete), as such an operation needs access to a minimum amount of data (at the cost of retrieval!).
An attribute A is 'fully functionally dependent' on a set of attributes B if there does not exist a proper subset of B on which A is functionally dependent.
Student_cd   Student_nm   Address
K105         ABC          A-123, XYZ
K106         DEF          Z-333, ABC
Attribute student_nm is functionally dependent on attribute student_cd, because a value of student_cd determines the value of student_nm.
The same logic applies to Address being functionally dependent on student_cd.
Address is functionally dependent on (student_cd + student_nm). However, it is not fully functionally dependent on this composite key, because it is functionally dependent on student_cd alone.
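A functional dependency B → A can be checked mechanically over sample rows: it holds when no two rows agree on B but differ on A. A small sketch using the student data above (the `holds` helper is hypothetical):

```python
rows = [
    {"student_cd": "K105", "student_nm": "ABC", "address": "A-123, XYZ"},
    {"student_cd": "K106", "student_nm": "DEF", "address": "Z-333, ABC"},
]

def holds(rows, lhs, rhs):
    """True if the attributes in lhs functionally determine those in rhs."""
    seen = {}
    for r in rows:
        key = tuple(r[a] for a in lhs)
        val = tuple(r[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False  # same determinant value, different dependent value
    return True

print(holds(rows, ["student_cd"], ["student_nm"]))             # True
print(holds(rows, ["student_cd", "student_nm"], ["address"]))  # True, but not
# a *full* dependency: address already depends on student_cd alone, as the
# next check shows.
print(holds(rows, ["student_cd"], ["address"]))                # True
```

Such a check only refutes a dependency on sample data; whether it truly holds is a design decision about the domain.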
OLTP (Online Transaction Processing) Systems
Record business interactions as they occur (in real time); they support the day-to-day operations of an organization.
Optimized for efficiently processing and storing transactions
Based on rules of normalization
Not designed to deliver large aggregates (producing them may take large processing power and time, and many records will be locked while aggregates are produced; this can have a serious impact on the efficiency of OLTP systems).
Benefits
Online Transaction Processing has two key benefits: simplicity and efficiency. Reduced paper trails and faster, more accurate forecasts for revenues and expenses are both examples of how OLTP makes things simpler for businesses.
BI measures are often aggregates of hundreds, thousands or millions of transactions. Providing these aggregates efficiently requires a different type of data storage optimization.
Problems arise when we try to use OLTP systems as the source for BI.
The issue of Archiving (an archive is a collection of historical records, as well as the place where they are located)
If transaction tables become large, OLTP systems become slow. Archiving causes problems for BI (business intelligence), as we may want comparative trends over several years.
The issue of Disparate Systems
An organization may have several OLTP systems for different operations: order processing, accounting, manufacturing, personnel. Even in ERP systems it is unlikely that all transactional data is in one place.
BI measures may be based on data from different systems; they treat the organization as a whole.
e.g. a measure could be 'profit margin for a particular product'. For this we need the list of raw materials from the manufacturing system, the cost of those materials from the accounting system, the cost of labor required to produce the product from the time-entry system, and the amount paid for the product from the order-processing system. Thus, to compute this measure, we need to combine data across systems.
Further, the coding schemes, calendars, etc. may be different. The same product may be '123' in one system and 'ABC' in another. We need to find some common ground to combine the data from these systems.
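Finding that common ground usually means a mapping table that conforms each system's code to one shared code before combining. A minimal sketch (the mapping, systems and amounts are invented; real conformance handles calendars and units too):

```python
# Hypothetical mapping: each (system, local code) pair -> one warehouse code.
code_map = {
    ("order_processing", "123"): "P-001",
    ("manufacturing",    "ABC"): "P-001",
}

orders = [("order_processing", "123", 250.0)]  # amount paid for the product
costs  = [("manufacturing",    "ABC", 180.0)]  # cost of its raw materials

# Conform the codes, then combine data from both systems per product.
margin = {}
for system, code, amount in orders:
    margin[code_map[(system, code)]] = amount
for system, code, cost in costs:
    margin[code_map[(system, code)]] -= cost

print(margin)  # {'P-001': 70.0}: a margin figure no single system could give
```

Neither source system knows the product as "P-001"; the shared code exists only for analysis.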
Components of a Data Warehouse
1. Transactional or operational databases and external sources from which the data warehouse is populated. (Note that you would not handle a savings account or manage inventory with a data warehouse.)
2. A process to extract data from databases and bring it into the data warehouse. This process must often transform the data from database structure to the internal formats of the data warehouse.
3. A process to cleanse the data to ensure data quality.
4. A process to load the cleansed data into the data warehouse.
(The four processes, extract, transform, cleanse and load, are collectively called data staging.)
5. A process to create desired summaries of the data: pre-calculated totals, averages, etc., which are expected to be requested often.
6. Metadata, "data about data": information about the contents of the data warehouse.
7. The data warehouse database itself (detailed and summary data).
[A data warehouse is not used for processing individual transactions. Its database need not be optimized for individual record retrieval based on keys; instead, it is optimized for the access patterns used in analysis.]
8. Query tools to enable OLAP (online analytical processing): the end-user interface for posing questions to the DW.
9. A DW may also include automated tools for uncovering patterns in the data (data mining).
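The staging steps in the list above (extract, transform, cleanse, load) can be sketched end to end; the source rows, field names and formats here are invented for illustration:

```python
# Pretend operational data: inconsistent names, one unparseable amount.
source_rows = [
    {"cust": " alice ", "amount": "100.50", "date": "2023-01-05"},
    {"cust": "BOB",     "amount": "n/a",    "date": "2023-01-06"},
]

def extract():
    """Pull rows from the operational source."""
    return list(source_rows)

def transform(rows):
    """Convert source structure to the warehouse's internal format."""
    return [{"customer": r["cust"].strip().title(),
             "amount": r["amount"],
             "date": r["date"]} for r in rows]

def cleanse(rows):
    """Ensure data quality: reject rows whose amount cannot be parsed."""
    out = []
    for r in rows:
        try:
            r["amount"] = float(r["amount"])
            out.append(r)
        except ValueError:
            pass  # in practice bad rows would be logged, not dropped silently
    return out

warehouse = []
def load(rows):
    warehouse.extend(rows)

load(cleanse(transform(extract())))
print(warehouse)  # one clean row for Alice; Bob's bad record was rejected
```

Real staging tools add logging, incremental extraction and error queues, but the pipeline shape is the same.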
A DW project has to be put in business perspective (build a business case for the DW before planning it). A DW project should be tied to solving business problems.
A business case should, however include:
Type of data to be included in DW
Decisions likely to be made using this data
Manner in which these decisions are made without DW
How these decisions will benefit the organization
If benefits cannot be quantified, try to quantify some associated parameters.
Ex: "Optimize an advertising budget of $4 m/yr"; "Minimize discounts from full fare, which are currently $200 m/yr"; "Choose suppliers whose products result in the fewest warranty claims, which currently cost us $25 m/yr to handle".
None of these statements says anything about how much the DW will benefit the organization, but all provide information through which top management can understand and appreciate its benefits.
In the context of a data warehouse, we have two types of data: dimension data and fact data.
Dimension Data: the dimensions along which data can be analysed. (In this example, data can be analysed by salesperson or by type of pet; thus "Salesperson" and "Type of pet" are dimension data. In reality, time is also a dimension in most databases, e.g. to study performance trends over time. Here we have skipped this by saying that the data is for a specific time period.)
Fact Data: Actual data along the dimensions
Data warehouses generally store data using multidimensional databases (or simply dimensional databases). Such a database stores data in the form of a giant hypercube (the extension of a cube to four or more dimensions).
Databases configured for OLAP use a multidimensional data model, allowing complex analytical and ad-hoc queries with rapid execution time. They borrow aspects of navigational databases and hierarchical databases that are faster than relational databases.[6]
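A tiny two-dimensional "cube" over the salesperson and pet-type dimensions can show what an OLAP roll-up does; the names and figures are invented for illustration:

```python
from collections import defaultdict

# Fact data indexed by the two dimensions (salesperson, type of pet).
facts = {
    ("Anu",  "dog"): 5, ("Anu",  "cat"): 3,
    ("Ravi", "dog"): 2, ("Ravi", "cat"): 7,
}

# Roll up along each dimension: total sales per salesperson and per pet type.
by_person = defaultdict(int)
by_pet = defaultdict(int)
for (person, pet), qty in facts.items():
    by_person[person] += qty
    by_pet[pet] += qty

print(dict(by_person))  # {'Anu': 8, 'Ravi': 9}
print(dict(by_pet))     # {'dog': 7, 'cat': 10}
```

A real multidimensional database pre-computes and indexes such aggregates across many dimensions (including time), which is what makes ad-hoc OLAP queries fast.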
Automated Analysis: Data Mining
Definition: "the efficient discovery of valuable, non-obvious information from large collections of data"
(In data mining, special software sifts through large data sets to discover new facts and relationships.)
Advantage: Unexpected relationships may get revealed some of which are