Data warehouse

• The data warehouse is the portion of an overall Architected Data Environment that serves as the single integrated source of data for processing information.

• A data warehouse contains historical data as well as current data.

Relational Database

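• The software that manages it is known as an RDBMS (relational database management system).

• Data is stored as relations (tables).

• It uses the concept of primary and foreign keys.

• It supports the ACID properties.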

ACID properties

• ACID properties are an important concept for databases. The acronym stands for Atomicity, Consistency, Isolation, and Durability.

• In the context of databases, a single logical operation on the data is called a transaction. An example of a transaction is a transfer of funds from one account to another, even though it might consist of multiple individual operations (such as debiting one account and crediting another). The ACID properties guarantee that such transactions are processed reliably.

Atomicity

• Atomicity refers to the ability of the DBMS to guarantee that either all of the tasks of a transaction are performed or none of them are.

• The transfer of funds can be completed or it can fail for a multitude of reasons, but atomicity guarantees that one account won't be debited if the other is not credited as well.
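The all-or-nothing behaviour described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a production design: the `accounts` table, the account names, and the `fail_midway` flag (which simulates a crash between the debit and the credit) are assumptions made for the example.

```python
import sqlite3

def transfer(conn, src, dst, amount, fail_midway=False):
    """Move `amount` from `src` to `dst` as one all-or-nothing transaction."""
    try:
        # `with conn` commits if the block succeeds and rolls back if it raises.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            if fail_midway:
                # Hypothetical failure between the debit and the credit.
                raise RuntimeError("simulated crash")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except RuntimeError:
        return False

def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()

ok = transfer(conn, "alice", "bob", 30)                       # completes
failed = transfer(conn, "alice", "bob", 30, fail_midway=True)  # rolled back
print(ok, failed, balances(conn))
```

The second transfer debits one account and then fails, yet the rollback leaves both balances exactly as they were after the first transfer.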

Consistency

• Consistency refers to the database being in a legal state when the transaction begins and when it ends.

• This means that a transaction can't break the rules, or integrity constraints, of the database. If an integrity constraint states that all accounts must have a positive balance, then any transaction violating this rule will be aborted.
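As a rough sketch of the positive-balance rule above, the constraint can be declared as a CHECK constraint so that the database itself aborts any violating transaction. The example uses Python's sqlite3 module; the table layout and the `withdraw` helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The integrity constraint: every balance must stay positive.
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance > 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100)")
conn.commit()

def withdraw(conn, name, amount):
    """Try to withdraw; return False if the constraint aborts the transaction."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, name),
            )
        return True
    except sqlite3.IntegrityError:
        return False

accepted = withdraw(conn, "alice", 60)  # leaves 40, which is legal
rejected = withdraw(conn, "alice", 60)  # would leave -20, so it is aborted
final_balance = conn.execute(
    "SELECT balance FROM accounts WHERE name = 'alice'"
).fetchone()[0]
print(accepted, rejected, final_balance)
```

The database starts and ends each transaction in a legal state: the second withdrawal is refused and the balance stays at 40.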

Isolation

• Isolation refers to the ability of the DBMS to make operations in a transaction appear isolated from all other operations.

• This means no operation outside the transaction can ever see the data in an intermediate state.
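One way to observe this is with two connections to the same SQLite database file: a reader should not see a writer's change until the writer commits. The sketch below relies on SQLite's default rollback-journal behaviour and on a temporary file; the table, values, and helper name are assumptions made for the example, and other DBMSs offer several isolation levels with different guarantees.

```python
import os
import sqlite3
import tempfile

def read_balance(conn):
    """Read alice's balance, fully consuming the result set."""
    rows = conn.execute(
        "SELECT balance FROM accounts WHERE name = 'alice'"
    ).fetchall()
    return rows[0][0]

# A file-backed database so that two separate connections can share it.
path = os.path.join(tempfile.mkdtemp(), "bank.db")

writer = sqlite3.connect(path)
writer.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
writer.execute("INSERT INTO accounts VALUES ('alice', 100)")
writer.commit()

reader = sqlite3.connect(path)

# The writer changes the balance inside a transaction but has not committed.
writer.execute("UPDATE accounts SET balance = 40 WHERE name = 'alice'")
seen_before_commit = read_balance(reader)  # intermediate state is hidden

writer.commit()
seen_after_commit = read_balance(reader)   # committed state is visible

writer.close()
reader.close()
print(seen_before_commit, seen_after_commit)
```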

Durability

• Durability refers to the guarantee that once the user has been notified of success, the transaction will persist and not be undone.

Data Warehouse Characteristics

1. Subject Oriented
2. Integrated
3. Non Volatile
4. Time variant
5. Accessible

1. Subject Oriented

• Information is presented according to specific subjects or areas of interest, not simply as computer files. Data is manipulated to provide information about a particular subject.

2. Integrated

• A single source of information for understanding multiple areas of interest.

• The data warehouse provides one-stop shopping and contains information about a variety of subjects.

• For example, a university data warehouse has information on students, faculty and staff, instructional workload, and student outcomes.

3. Non Volatile

• Stable information that doesn’t change each time an operational process is executed.

• Information is consistent regardless of when the warehouse is accessed.

4. Time variant

• Containing a history of the subject, as well as current information.

• Historical information is an important component of a data warehouse.

5. Accessible

• The primary purpose of a data warehouse is to provide readily accessible information to end-users.

Some definitions

• Data Warehouse: A data structure that is optimized for distribution. It collects and stores integrated sets of historical data from multiple operational systems and feeds them to one or more data marts. It may also provide end-user access to support enterprise views of data.

• Data Mart: A data structure that is optimized for access. It is designed to facilitate end-user analysis of data. It typically supports a single, analytic application used by a distinct set of workers.

• Staging Area: Any data store that is designed primarily to receive data into a warehousing environment.

• Operational Data Store: A collection of data that addresses operational needs of various operational units. It is not a component of a data warehousing architecture, but a solution to operational needs.

• OLAP (On-Line Analytical Processing): A method by which multidimensional analysis occurs.

• Multidimensional Analysis: The ability to manipulate information by a variety of relevant categories or “dimensions” to facilitate analysis and understanding of the underlying data. It is also sometimes referred to as “drilling down”, “drilling across”, and “slicing and dicing”.

• Hypercube: A means of visually representing multidimensional data.

• OLAP Tools: A set of software products that attempt to facilitate multidimensional analysis. Can incorporate data acquisition, data access, data manipulation, or any combination thereof.
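To make the multidimensional analysis and hypercube definitions above more concrete, here is a small sketch in plain Python. The three dimensions (product, region, quarter), the sales figures, and the helper names `slice_cube` and `roll_up` are all hypothetical; real OLAP tools provide far richer operations.

```python
from collections import defaultdict

# Each cell of the cube is keyed by one value per dimension.
DIMENSIONS = ("product", "region", "quarter")
cube = {
    ("laptop", "east", "Q1"): 10,
    ("laptop", "east", "Q2"): 12,
    ("laptop", "west", "Q1"): 7,
    ("phone", "east", "Q1"): 20,
    ("phone", "west", "Q1"): 15,
    ("phone", "west", "Q2"): 18,
}

def slice_cube(cube, **fixed):
    """Slice/dice: keep only cells matching every fixed dimension value."""
    wanted = {DIMENSIONS.index(dim): value for dim, value in fixed.items()}
    return {
        key: value
        for key, value in cube.items()
        if all(key[i] == v for i, v in wanted.items())
    }

def roll_up(cube, *keep):
    """Sum the measure over every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(dim) for dim in keep]
    totals = defaultdict(int)
    for key, value in cube.items():
        totals[tuple(key[i] for i in positions)] += value
    return dict(totals)

by_product = roll_up(cube, "product")                   # summary view
by_product_region = roll_up(cube, "product", "region")  # drill down one level
q1_east = slice_cube(cube, region="east", quarter="Q1")  # slice and dice
print(by_product, by_product_region, q1_east)
```

Going from `by_product` to `by_product_region` is a drill-down: the same totals are broken out by one more dimension.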

Steps in DW implementation

• Requirements Modeling Attribute: This attribute focuses on techniques of capturing business requirements and modeling them. For building a data warehouse, understanding and representing user requirements accurately is very important.

• Data Modeling Attribute: Once the requirements are captured, an information model (also called a warehouse model) is created based on those requirements. The model is logically represented in the form of an ERD. The logical model is then transformed into a relational schema.

• Support for Normalization/Denormalization Attribute: The normalization/denormalization process is an important part of a data warehousing methodology. To support OLAP queries, relational databases require frequent table joins, which can be very costly. To improve query performance, a methodology must support denormalization.

• Implementation Strategy Attribute: Depending on the methodology, the implementation strategy could vary between an SDLC-type approach and a RAD-type approach. Within the RAD category, most vendors have adopted the iterative prototyping approach.

• Metadata Management Attribute.

• Query Design Attribute: Large data warehouse tables take a long time to process, especially if they must be joined with others. Because query performance is an important issue, some vendors place a lot of emphasis on how queries are designed and processed. Some DBMS vendors allow parallel query generation and execution.

Star Schema

• In computing, the star schema (also called star-join schema, data cube, or multi-dimensional schema) is the simplest style of data warehouse schema. The star schema consists of one or more fact tables referencing any number of dimension tables. The star schema is an important special case of the snowflake schema, and is more effective for handling simpler queries.

• A benefit of the star schema is ease of access: queries against it are simple to write.
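A minimal sketch of a star schema, using Python's sqlite3 module, may help show why such queries are easy to write. The table names (`fact_sales`, `dim_product`, `dim_date`), the columns, and the sample rows are hypothetical; the point is only that one fact table references each dimension table directly, so every dimension is a single join away.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one table per dimension.
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (
    date_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);

-- Fact table: measures plus a foreign key to each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    amount INTEGER);

INSERT INTO dim_product VALUES (1, 'laptop', 'computers'), (2, 'phone', 'mobile');
INSERT INTO dim_date VALUES (1, 2023, 'Q1'), (2, 2023, 'Q2');
INSERT INTO fact_sales VALUES (1, 1, 1000), (1, 2, 1200), (2, 1, 500), (2, 2, 700);
""")

# Total sales per category and quarter: one join per dimension.
rows = conn.execute("""
    SELECT p.category, d.quarter, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.quarter
    ORDER BY p.category, d.quarter
""").fetchall()
print(rows)
```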

Snowflake Schema

• In computing, a snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity relationship diagram resembles a snowflake in shape. The snowflake schema is represented by centralized fact tables which are connected to multiple dimensions.

• However, in the snowflake schema, dimensions are normalized into multiple related tables, whereas the star schema's dimensions are denormalized, with each dimension represented by a single table.

• Star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than the efficiency of data manipulations.
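The normalization of a dimension can be sketched as follows with Python's sqlite3 module. The tables and rows are hypothetical: the product dimension no longer stores the category name itself but points to a separate `dim_category` table, so a query by category needs one extra join compared with a single-table dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The product dimension is normalized: category lives in its own table.
CREATE TABLE dim_category (
    category_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY, name TEXT,
    category_id INTEGER REFERENCES dim_category(category_id));

CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    amount INTEGER);

INSERT INTO dim_category VALUES (1, 'computers'), (2, 'mobile');
INSERT INTO dim_product VALUES (1, 'laptop', 1), (2, 'tablet', 1), (3, 'phone', 2);
INSERT INTO fact_sales VALUES (1, 1000), (2, 300), (3, 500), (3, 700);
""")

# Sales per category: fact -> product -> category (two joins for one dimension).
rows = conn.execute("""
    SELECT c.name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_category AS c ON c.category_id = p.category_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)
```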

OLAP and OLTP

• OLTP vs. OLAP: IT systems can be divided into transactional (OLTP) and analytical (OLAP) systems. In general, OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it.

• OLTP (On-line Transaction Processing) is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is on very fast query processing and on maintaining data integrity in multi-access environments; effectiveness is measured by the number of transactions per second. An OLTP database holds detailed, current data, and the schema used to store it is the entity model (usually normalized).

• OLAP (On-line Analytical Processing) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is the measure of effectiveness. OLAP applications are widely used with data mining techniques. An OLAP database holds aggregated, historical data, stored in multi-dimensional schemas (usually a star schema).

Data Mining Algorithms

• Classification algorithms predict one or more discrete variables, based on the other attributes in the dataset.

• Regression algorithms predict one or more continuous variables, such as profit or loss, based on other attributes in the dataset.

• Segmentation algorithms divide data into groups, or clusters, of items that have similar properties.

• Association algorithms find correlations between different attributes in a dataset. The most common application of this kind of algorithm is for creating association rules, which can be used in a market basket analysis.

K-Means Algorithm

• Step 1: Randomly place the initial group centroids in the 2D space.

• Step 2: Assign each object to the group that has the closest centroid.

• Step 3: Recalculate the positions of the centroids.

• Step 4: If the positions of the centroids didn't change, go to the next step; otherwise go to Step 2.

• Step 5: End.
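The five steps above can be sketched in plain Python as follows. This is one possible reading, with some assumptions: the initial centroids are chosen by sampling k of the data points (a common way to do the random placement), distance is squared Euclidean, an empty group keeps its old centroid, and `max_iter` is a safety cap that the steps themselves do not mention. The sample points are made up.

```python
import random

def k_means(points, k, seed=0, max_iter=100):
    """Cluster 2D points into k groups; returns (centroids, clusters)."""
    rng = random.Random(seed)
    # Step 1: random initial centroids (here: k distinct data points).
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign each point to the group with the closest centroid.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            nearest = min(
                range(k),
                key=lambda i: (x - centroids[i][0]) ** 2 + (y - centroids[i][1]) ** 2,
            )
            clusters[nearest].append((x, y))
        # Step 3: recalculate each centroid as the mean of its group.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                mean_x = sum(px for px, _ in cluster) / len(cluster)
                mean_y = sum(py for _, py in cluster) / len(cluster)
                new_centroids.append((mean_x, mean_y))
            else:
                new_centroids.append(centroids[i])  # empty group: keep as is
        # Step 4: stop when the centroids no longer move, else repeat Step 2.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Step 5: end.
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)
print(centroids, clusters)
```

With two well-separated groups like these, the loop settles after a few iterations; on less tidy data the result can depend on the initial centroids.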

Association Rules: Support and Confidence

• The support of an itemset X, written supp(X), is defined as the proportion of transactions in the data set which contain the itemset. In the example database, the itemset {milk, bread, butter} has a support of 1/5 = 0.2, since it occurs in 20% of all transactions (1 out of 5 transactions).

• The confidence of a rule X ⇒ Y is defined as conf(X ⇒ Y) = supp(X ∪ Y) / supp(X). For example, the rule {milk, bread} ⇒ {butter} has a confidence of 0.2 / 0.4 = 0.5 in the database, which means that for 50% of the transactions containing milk and bread the rule is correct (50% of the time a customer buys milk and bread, butter is bought as well). Be careful when reading the expression: here supp(X ∪ Y) means “support for occurrences of transactions where X and Y both appear”, not “support for occurrences of transactions where either X or Y appears”.
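The two measures can be computed directly from a list of transactions. The example database itself is not reproduced in this text, so the five transactions below are a hypothetical stand-in chosen only to be consistent with the figures quoted above (20% support, 50% confidence); the function names are likewise assumptions.

```python
# Hypothetical five-transaction database matching the quoted figures.
transactions = [
    {"milk", "bread"},
    {"butter"},
    {"beer"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def support(itemset, transactions):
    """Proportion of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """conf(X => Y) = supp(X union Y) / supp(X)."""
    both = set(antecedent) | set(consequent)  # X and Y must appear together
    return support(both, transactions) / support(antecedent, transactions)

supp_all_three = support({"milk", "bread", "butter"}, transactions)
conf_rule = confidence({"milk", "bread"}, {"butter"}, transactions)
print(supp_all_three, conf_rule)
```

Note that the union is taken over the items, which is what makes the combined itemset require both X and Y in the same transaction.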
