
Application software – office packets, databases and data warehouses. Piotr Mielecki Ph. D. Introduction to Computer Systems (8) mielecki@wssk.wroc.pl.

Mar 31, 2015

Page 1: Application software – office packets, databases and data warehouses. Piotr Mielecki Ph. D. Introduction to Computer Systems (8) mielecki@wssk.wroc.pl.

Application software – office packets, databases and data warehouses.

Piotr Mielecki Ph. D.

Introduction to Computer Systems (8)

mielecki@wssk.wroc.pl

http://www.wssk.wroc.pl/~mielecki


1. Office packets.

1.1. Definitions.

An office packet (software package), usually called an office application suite or productivity suite, is a set of applications intended for typical office workers and knowledge workers. The components are generally distributed together, have a consistent user interface and can usually interact with each other, sometimes in ways that the operating system would not normally allow, for example through mechanisms like Object Linking and Embedding (OLE).

Most office application suites include at least:

Word processor – (more formally known as a document preparation system) an application used for the production (including composition, editing, formatting and possibly printing) of any sort of printable material, stored as an electronic document.

Spreadsheet – a rectangular table (or grid) of arranged information (very often financial). An electronic spreadsheet supports automatic calculations (mathematical, statistical, financial etc.) and tools for graphical presentation (various charts).


In addition to these, the suite may contain:

Presentation program – designed for preparing sets of electronic slides, usually based on the contents of other electronic documents and various multimedia formats. Applications of this kind usually have poor support for editing multimedia files or documents, but the OLE mechanism (an advanced graphics editor acting as an OLE server, for example) can be used to speed up work on a presentation. MS-PowerPoint has become a standard for applications of this kind.

Desktop database tool – an application which can be used to create small (desktop) databases, or as a client and/or report generator for remote client-server (SQL) databases. Currently MS-Access is the most popular application of this kind.

Graphics suite – an application (or set of applications) designed for editing various bitmap and vector graphics formats. Currently the more advanced graphics suites (CorelDraw! or Adobe Photoshop, for example) are not included in any particular office package, but they can act as OLE servers for them. On the other hand, the OpenOffice.org suite has its own, reasonably capable vector graphics editor.


Other additional components of the office suite:

Communication tools, including:

e-mail client and/or

Personal Information Manager (PIM) or groupware package. Microsoft Outlook is a good example of this kind.

Programming language designed to support automatic processing of documents and of the data included in these documents (Visual Basic for Applications in MS-Office, for example).


1.2. Common problems.

One of the most important problems is the standard format (or set of formats) for electronic documents. Attempts were (and still are) made to establish a format suitable for different office packets (from different manufacturers). Using a format that does not depend on a particular office suite would be much better for all customers: they could change their office software without changing document formats, while still having access to older documents. They could also exchange electronic documents with each other (via e-mail, for example) without necessarily using the same office software to display and work on them. Of course, large manufacturers like Microsoft are not interested in developing universal standards.

The Rich Text Format (RTF) is one of the best-known formats for word processors (formatted text files), but its implementation in MS-Word is still not quite correct.

The ISO/IEC 26300 Open Document Format (ODF), based on XML and supporting formatted text files, spreadsheets, diagrams and presentations, is an alternative to "closed" formats like DOC (DOCX), XLS or PPT. The international non-profit consortium which develops this standard is the Organization for the Advancement of Structured Information Standards (OASIS).


The other important issue is good support for work in organized groups. That means document-flow management inside the organization's structure, sometimes across many locations. In a small team it is relatively easy to share documents in a Local Area Network (LAN) environment, sometimes using Virtual Private Network (VPN) channels to connect to the company's LAN from remote locations, and to exchange document files simply by accessing them in shared folders or by using ordinary e-mail.

Large companies have problems with the safe and efficient flow of different documents (and their subsequent versions), created by individuals or by workgroups together. Application software which can support these tasks is rather sophisticated and usually database-oriented (meaning that an efficient database server with appropriate client software is the center of the company's document-flow system). IBM Lotus Notes / Domino is now probably the most advanced system of this kind. "Domino" is the name of the server software package, while "Lotus Notes" is the client application (somewhat similar to MS-Outlook) which organizes the user's work. Other well-known solutions are Microsoft Exchange (relatively poor in comparison with Lotus but much cheaper and fully integrated with Active Directory, based on a non-SQL database and with MS-Outlook as the only client) and Novell GroupWise.


2. Databases and Data Warehouses.

2.1. Definitions.

A database is a structured collection of records or data that is stored in a computer system so that a computer program, or a person using a query language (SQL in most cases), can consult it to answer queries or append new data. The records retrieved in answer to queries are information that can be used to make decisions. The computer software used to manage and query a database is known as a Database Management System (DBMS).

The central concept of a database is that of a collection of records, or pieces of information. Typically, for a given database, there is a structural description of the type of facts held in that database: this description is known as a schema. The schema describes the objects that are represented in the database, and the relationships among them. There are a number of different ways of organizing a schema, that is, of modeling the database structure; they are known as database models (or data models).


The data model in most common use today is the relational model, which represents all information in the form of multiple related tables, each consisting of rows and columns (the formal definition uses mathematical terminology and is much harder to follow, so IT professionals rarely use it).

This model represents relationships by the use of values common to more than one table: keys. Other models, such as the hierarchical model and the network model, use a more explicit representation of relationships.


2.2. Relational databases.

Three key terms are used extensively in relational database models:

Relations – a relation is a table with columns and rows (note: the term relation does not mean a relationship between two or more tables).

Attributes – the named columns of the relation are called attributes.

Domains – the domain is the set of values the attributes are allowed to take.

The basic data structure of the relational model is the table, where information about a particular kind of object, an entity (a student, for example), is represented in columns and rows (also called tuples). Thus, "relation" refers to the various tables in the database: a relation is a set of tuples. The columns enumerate the various attributes of the entity (the student's last name, first name, unique ID number and date of birth, for example), and a row is an actual instance of the entity (a particular student) that is represented by the relation. As a result, each tuple (or simply row) of the table of students represents the various attributes of a single student.
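The terms relation, attribute, domain and tuple can be made concrete with a small sketch. It assumes SQLite accessed through Python's standard sqlite3 module; the column names, the year_of_study attribute with its 1 to 5 range, and the two sample students are hypothetical illustrations, not taken from the lecture.

```python
import sqlite3


def build_students_relation():
    """Create an in-memory 'students' relation and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE students (
            student_id    INTEGER PRIMARY KEY,          -- unique ID number
            last_name     TEXT NOT NULL,
            first_name    TEXT NOT NULL,
            date_of_birth TEXT NOT NULL,                -- ISO date string
            year_of_study INTEGER NOT NULL
                CHECK (year_of_study BETWEEN 1 AND 5)   -- domain (assumed range)
        )
        """
    )
    # Hypothetical sample data: each tuple is one instance of the entity.
    conn.executemany(
        "INSERT INTO students VALUES (?, ?, ?, ?, ?)",
        [
            (1, "Kowalski", "Jan", "1995-04-17", 2),
            (2, "Nowak", "Anna", "1996-11-02", 1),
        ],
    )
    return conn


def violates_domain(conn):
    """Return True if a value outside the attribute's domain is rejected."""
    try:
        conn.execute(
            "INSERT INTO students VALUES (3, 'Lis', 'Ewa', '1997-01-01', 9)"
        )
    except sqlite3.IntegrityError:
        return True
    return False


conn = build_students_relation()
# Attributes are the named columns of the relation.
attributes = [row[1] for row in conn.execute("PRAGMA table_info(students)")]
# Tuples are the rows currently stored in the relation.
tuples = conn.execute("SELECT * FROM students ORDER BY student_id").fetchall()
```

Here the CHECK constraint plays the role of a domain: a row whose year_of_study lies outside the allowed set is refused by the DBMS rather than by application code.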


All relations (tables) in a relational database have to adhere to some basic rules to qualify as relations:

the ordering of columns in a table is not important,

there cannot be identical tuples (rows) in a table,

each tuple contains a single value (only one at a time) for each of its attributes, i.e. each attribute value is atomic.

A relational database contains multiple tables, each similar to the one in the "flat" database model. One of the advantages of the relational model is that any value occurring in two different records (belonging to the same table or, more frequently, to different tables) implies a relationship between those two records. In order to enforce explicit integrity constraints, relationships between records in tables can also be defined explicitly, by identifying parent-child relationships characterized by their cardinality:

1 to 1 (or 0),

1 to Many,

Many to Many.
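The three cardinalities listed above can be sketched with explicit foreign keys. This is only one possible layout, assuming SQLite through Python's sqlite3 module; the id_cards, courses and enrollments tables and all sample values are hypothetical and do not come from the lecture.

```python
import sqlite3

SCHEMA = """
CREATE TABLE faculties (
    faculty_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE students (
    student_id INTEGER PRIMARY KEY,
    last_name  TEXT NOT NULL
);
-- 1 to 1 (or 0): a student has at most one ID card.
CREATE TABLE id_cards (
    student_id  INTEGER PRIMARY KEY REFERENCES students(student_id),
    card_number TEXT NOT NULL
);
-- 1 to Many: one faculty offers many courses.
CREATE TABLE courses (
    course_id  INTEGER PRIMARY KEY,
    faculty_id INTEGER NOT NULL REFERENCES faculties(faculty_id),
    title      TEXT NOT NULL
);
-- Many to Many: a junction table links students and courses.
CREATE TABLE enrollments (
    student_id INTEGER NOT NULL REFERENCES students(student_id),
    course_id  INTEGER NOT NULL REFERENCES courses(course_id),
    PRIMARY KEY (student_id, course_id)
);
"""


def open_db():
    """Open an in-memory database with foreign-key enforcement switched on."""
    conn = sqlite3.connect(":memory:")
    # SQLite checks foreign keys only when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def try_insert(conn, sql, params):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute(sql, params)
    except sqlite3.IntegrityError:
        return False
    return True


conn = open_db()
conn.execute("INSERT INTO faculties VALUES (1, 'Computer Science')")
conn.execute("INSERT INTO students VALUES (1, 'Kowalski')")
conn.execute("INSERT INTO courses VALUES (10, 1, 'Databases')")

valid_enrollment = try_insert(conn, "INSERT INTO enrollments VALUES (?, ?)", (1, 10))
orphan_enrollment = try_insert(conn, "INSERT INTO enrollments VALUES (?, ?)", (99, 10))
first_card = try_insert(conn, "INSERT INTO id_cards VALUES (?, ?)", (1, "A-001"))
second_card = try_insert(conn, "INSERT INTO id_cards VALUES (?, ?)", (1, "A-002"))
```

The enrollment for the non-existent student 99 and the second card for student 1 are both refused, which is the kind of integrity constraint that explicitly declared relationships let the DBMS enforce.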


Tables can also have a designated single attribute or a set of attributes that can act as a "key", which can be used to uniquely identify each tuple in the table. Such a unique key is called a primary key.

Keys are commonly used to join or combine data from two or more tables. For example, the Students table may contain a column named Faculty which contains a value that matches the key of a Faculties table (note that one student can have relationships with many faculties, so this example with only one Faculty column in the Students table may be oversimplified).
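The Students and Faculties example can be tried out directly. The sketch below assumes SQLite through Python's sqlite3 module; the table names Students and Faculties and the Faculty column follow the text, while the other column names and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Faculties (
        FacultyID INTEGER PRIMARY KEY,
        Name      TEXT NOT NULL
    );
    CREATE TABLE Students (
        StudentID INTEGER PRIMARY KEY,
        LastName  TEXT NOT NULL,
        Faculty   INTEGER REFERENCES Faculties(FacultyID)  -- matches the Faculties key
    );
    INSERT INTO Faculties VALUES (1, 'Electronics'), (2, 'Mathematics');
    INSERT INTO Students VALUES (1, 'Kowalski', 1), (2, 'Nowak', 2), (3, 'Lis', 1);
    """
)


def students_with_faculty(conn):
    """Combine both tables by matching Students.Faculty to Faculties.FacultyID."""
    return conn.execute(
        """
        SELECT s.LastName, f.Name
        FROM Students AS s
        JOIN Faculties AS f ON s.Faculty = f.FacultyID
        ORDER BY s.StudentID
        """
    ).fetchall()
```

Calling students_with_faculty(conn) returns one row per student, with the faculty name looked up through the shared key value instead of being stored in the Students table.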

Keys are also critical in the creation of indices (indexes), which facilitate fast retrieval of data from large tables.

Any column can be a key, or multiple columns can be grouped together into a compound key.
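A short sketch of an index and of a compound key, assuming SQLite through Python's sqlite3 module. The exam_results table, the index name and the sample values are hypothetical; the exact wording of the query plan text differs between SQLite versions, so the sketch only collects it rather than relying on a fixed string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE students (
        student_id INTEGER PRIMARY KEY,
        last_name  TEXT NOT NULL,
        first_name TEXT NOT NULL
    );
    -- Index on a column that is searched often.
    CREATE INDEX idx_students_last_name ON students(last_name);

    -- Compound key: neither column is unique alone, the pair is.
    CREATE TABLE exam_results (
        student_id  INTEGER NOT NULL,
        course_code TEXT NOT NULL,
        grade       REAL NOT NULL,
        PRIMARY KEY (student_id, course_code)
    );
    """
)


def query_plan(conn, sql, params=()):
    """Return SQLite's description of how it would execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " ".join(str(row[-1]) for row in rows)  # last column is the detail text


def insert_result(conn, student_id, course_code, grade):
    """Return True if the row is accepted, False if the key rejects it."""
    try:
        conn.execute(
            "INSERT INTO exam_results VALUES (?, ?, ?)",
            (student_id, course_code, grade),
        )
    except sqlite3.IntegrityError:
        return False
    return True


plan = query_plan(conn, "SELECT * FROM students WHERE last_name = ?", ("Nowak",))
first = insert_result(conn, 1, "DB101", 4.5)
duplicate = insert_result(conn, 1, "DB101", 5.0)  # same (student, course) pair
other_course = insert_result(conn, 1, "OS102", 4.0)
```

The plan text names idx_students_last_name, showing that the lookup goes through the index instead of scanning the whole table, and the compound key accepts the same student twice only when the course differs.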


2.3. Relational operations.

Users (or programs) request data from a relational database by sending it a query written in a special language, usually a dialect of Structured Query Language (SQL). Although SQL was originally intended for end users, it is much more common for SQL queries to be embedded in software that provides an easier user interface (see the chapter about Data Warehouses). In response to a query, the database returns a result set, which is just a list of rows from one or more (related) tables, containing the answers.

The simplest query just returns all the rows from a single table of interest:

SELECT * FROM <table>

More often, the rows are filtered in some way to return just the answer wanted:

SELECT <column_list> FROM <table> WHERE <condition>

Very often data from multiple tables are combined into one result by doing a JOIN. There are a number of relational operations in addition to JOIN.
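The query forms above can be run against a toy database. The sketch assumes SQLite through Python's sqlite3 module; the students and graduates tables and their rows are hypothetical, and UNION is shown only as one example of a relational operation other than JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE students (
        student_id    INTEGER PRIMARY KEY,
        last_name     TEXT NOT NULL,
        year_of_study INTEGER NOT NULL
    );
    CREATE TABLE graduates (
        student_id INTEGER PRIMARY KEY,
        last_name  TEXT NOT NULL
    );
    INSERT INTO students VALUES (1, 'Kowalski', 2), (2, 'Nowak', 1), (3, 'Lis', 2);
    INSERT INTO graduates VALUES (7, 'Zielinski'), (8, 'Nowak');
    """
)

# SELECT * FROM <table>: every row and every column.
all_rows = conn.execute("SELECT * FROM students ORDER BY student_id").fetchall()

# SELECT <column_list> FROM <table> WHERE <condition>:
# projection (one column) combined with selection (filtered rows).
second_year = conn.execute(
    "SELECT last_name FROM students WHERE year_of_study = ? ORDER BY last_name",
    (2,),
).fetchall()

# UNION: rows from two compatible result sets, duplicates removed.
all_names = conn.execute(
    """
    SELECT last_name FROM students
    UNION
    SELECT last_name FROM graduates
    ORDER BY last_name
    """
).fetchall()
```

The value 2 is passed as a parameter rather than pasted into the SQL text, which is the usual way for application software to embed queries safely.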


2.4. Database normalization.

Database normalization is a technique for designing relational database tables to minimize duplication of information and, in so doing, to safeguard the database against certain types of logical or structural problems (data anomalies).

For example, when multiple instances of a given piece of information occur in a table, the possibility exists that these instances will not be kept consistent when the data within the table is updated, leading to a loss of data integrity.

A table that is sufficiently normalized is less vulnerable to problems of this kind.

On the other hand, in systems designed to hold important electronic documentation (medical, for example) it is very important to save all the subsequent versions of each document, so sometimes very similar or even identical tuples can be found. A good design should make it possible to distinguish between them (using a "version number" column, for example).
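The loss of integrity caused by duplicated information can be reproduced in a few lines. The sketch assumes SQLite through Python's sqlite3 module; the students_flat table, the faculty phone number and all values are hypothetical and serve only to illustrate an update anomaly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Not normalized: the faculty phone number is repeated in every student row.
    CREATE TABLE students_flat (
        student_id    INTEGER PRIMARY KEY,
        last_name     TEXT NOT NULL,
        faculty_name  TEXT NOT NULL,
        faculty_phone TEXT NOT NULL
    );
    INSERT INTO students_flat VALUES
        (1, 'Kowalski', 'Electronics', '555-0100'),
        (2, 'Nowak',    'Electronics', '555-0100');

    -- Normalized: the phone number is stored once, in the faculties table.
    CREATE TABLE faculties (
        faculty_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        phone      TEXT NOT NULL
    );
    CREATE TABLE students (
        student_id INTEGER PRIMARY KEY,
        last_name  TEXT NOT NULL,
        faculty_id INTEGER NOT NULL REFERENCES faculties(faculty_id)
    );
    INSERT INTO faculties VALUES (1, 'Electronics', '555-0100');
    INSERT INTO students VALUES (1, 'Kowalski', 1), (2, 'Nowak', 1);
    """
)

# An update that reaches only one of the duplicated values...
conn.execute("UPDATE students_flat SET faculty_phone = '555-0199' WHERE student_id = 1")
# ...compared with one update of the single stored value.
conn.execute("UPDATE faculties SET phone = '555-0199' WHERE faculty_id = 1")


def phones_flat(conn):
    """Distinct phone numbers recorded for Electronics in the flat table."""
    rows = conn.execute(
        "SELECT DISTINCT faculty_phone FROM students_flat "
        "WHERE faculty_name = 'Electronics' ORDER BY faculty_phone"
    )
    return [row[0] for row in rows]


def phones_normalized(conn):
    """Distinct phone numbers seen by students of Electronics after a join."""
    rows = conn.execute(
        "SELECT DISTINCT f.phone FROM students AS s "
        "JOIN faculties AS f ON s.faculty_id = f.faculty_id ORDER BY f.phone"
    )
    return [row[0] for row in rows]
```

After the partial update the flat table holds two different phone numbers for the same faculty, while the normalized design cannot get into that state because the value exists in only one place.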


Database theory describes a table's degree of normalization in terms of normal forms of successively higher degrees. A table in third normal form (3NF), for example, is by definition in second normal form (2NF) as well. Higher degrees of normalization typically involve more tables and create the need for a larger number of joins, which can reduce performance. Accordingly, more highly normalized tables are typically used in database applications involving many isolated transactions, while less normalized tables tend to be used in database applications that do not need to map complex relationships between data entities and data attributes.

Although the normal forms are often defined informally in terms of the characteristics of tables, rigorous definitions of the normal forms are concerned with the characteristics of pure mathematical constructs known as relations (a more theoretical concept than just tables). Whenever information is represented relationally, it is meaningful to consider the extent to which the representation is normalized.


2.5. Data Warehouses.

2.5.1. Definitions.

A Data Warehouse is something more than a database-supported information system or simply a database itself. It is the main repository of all of an organization's historical data, its "corporate memory". It contains the raw material for management's decision support systems, first of all Enterprise Resource Planning (ERP) systems.

The critical factor leading to the use of a Data Warehouse is that a data analyst can perform complex queries and analysis, such as data mining, on the information without slowing down the operational or application-level information systems (sometimes called Operational Systems) which support everyday work (accounting, human-resource management and so on).


Formally, we can define a Data Warehouse in the following terms:

Subject-oriented – the data in the database is organized so that all the data elements relating to the same real-world event or object are linked together.

Time-variant – all changes to the data in the database are tracked and recorded so that reports can be produced showing changes over time.

Non-volatile – data in the database is never overwritten or deleted; once committed, the data is static and read-only, but retained for future reporting.

Integrated – the database contains data from most or all of an organization's operational applications, and this data is made consistent.
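The time-variant and non-volatile properties can be imitated on a small scale with an append-only table. This is a toy sketch, assuming SQLite through Python's sqlite3 module; real warehouse products use their own mechanisms, and the customer_history table, its triggers and the sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer_history (
        customer_id INTEGER NOT NULL,
        version     INTEGER NOT NULL,   -- grows with every recorded change
        city        TEXT NOT NULL,
        loaded_at   TEXT NOT NULL,      -- when this version was loaded
        PRIMARY KEY (customer_id, version)
    );
    -- Non-volatile: committed rows may not be overwritten...
    CREATE TRIGGER customer_history_no_update
    BEFORE UPDATE ON customer_history
    BEGIN
        SELECT RAISE(ABORT, 'warehouse rows are read-only');
    END;
    -- ...and may not be deleted.
    CREATE TRIGGER customer_history_no_delete
    BEFORE DELETE ON customer_history
    BEGIN
        SELECT RAISE(ABORT, 'warehouse rows are never deleted');
    END;
    """
)


def record_change(conn, customer_id, city, loaded_at):
    """Time-variant: store a change as a new version instead of overwriting."""
    next_version = conn.execute(
        "SELECT COALESCE(MAX(version), 0) + 1 FROM customer_history "
        "WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO customer_history VALUES (?, ?, ?, ?)",
        (customer_id, next_version, city, loaded_at),
    )
    return next_version


def is_rejected(conn, sql):
    """Return True if the database refuses to run the statement."""
    try:
        conn.execute(sql)
    except sqlite3.DatabaseError:
        return True
    return False


record_change(conn, 1, "Wroclaw", "2015-01-10")
record_change(conn, 1, "Krakow", "2015-03-02")
history = conn.execute(
    "SELECT version, city FROM customer_history "
    "WHERE customer_id = 1 ORDER BY version"
).fetchall()
update_rejected = is_rejected(
    conn, "UPDATE customer_history SET city = 'Gdansk' WHERE customer_id = 1"
)
delete_rejected = is_rejected(conn, "DELETE FROM customer_history")
```

Because every change becomes a new version, a report can show how the customer's data looked at any earlier point in time, and neither UPDATE nor DELETE can erase that record.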

Operational Systems are optimized for simplicity and speed of modification, using Online Transaction Processing (OLTP) for data entry and retrieval, database normalization and an entity-relationship model for clear design. The data warehouse is optimized for reporting and analysis (Online Analytical Processing – OLAP). Data in Data Warehouses are frequently heavily denormalised, summarised or stored in a dimension-based model. However, this is not always required to achieve very short query response times.
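A dimension-based model is often laid out as a central fact table surrounded by dimension tables (a so-called star schema). The sketch below is a minimal illustration, assuming SQLite through Python's sqlite3 module; the sales subject area, the table names and all figures are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables describe the "by what" of a report.
    CREATE TABLE dim_date (
        date_id INTEGER PRIMARY KEY,
        year    INTEGER NOT NULL,
        month   INTEGER NOT NULL
    );
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        category   TEXT NOT NULL
    );
    -- The fact table holds the measured values, keyed by the dimensions.
    CREATE TABLE fact_sales (
        date_id    INTEGER NOT NULL REFERENCES dim_date(date_id),
        product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
        amount     REAL NOT NULL
    );
    INSERT INTO dim_date VALUES (1, 2014, 12), (2, 2015, 1), (3, 2015, 2);
    INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Software');
    INSERT INTO fact_sales VALUES
        (1, 1, 100.0), (2, 1, 50.0), (3, 1, 25.0),
        (2, 2, 200.0), (3, 2, 300.0);
    """
)


def sales_by_year_and_category(conn):
    """A typical analytical query: summarise the facts along two dimensions."""
    return conn.execute(
        """
        SELECT d.year, p.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_date AS d ON f.date_id = d.date_id
        JOIN dim_product AS p ON f.product_id = p.product_id
        GROUP BY d.year, p.category
        ORDER BY d.year, p.category
        """
    ).fetchall()
```

An OLTP system would mostly insert or change single sales rows, whereas this OLAP-style query reads many rows at once and returns only summaries.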

2.5.2. History.

Data Warehouses are a distinct type of computer database that were first developed during the late 1980s and early 1990s. They were developed to meet a growing demand for management information and analysis that could not be met by operational systems. Operational systems were unable to meet this need for a range of reasons:

The processing load of sophisticated reporting increased the response time of the operational systems, so everyday work in heavily computer-supported organizations slowed down.

The database designs of operational systems were not optimized for advanced information analysis and reporting.

Most organizations had more than one operational system (several domain subsystems), so company-wide reporting could not be supported from a single system.

Development of reports in operational systems often required writing specific computer programs, which were slow to develop, difficult to use and expensive.

As a result, separate computer databases began to be built that were specifically designed to support management information and advanced analysis purposes.

These Data Warehouses were able to bring in data from a range of different data sources, such as mainframe computers, minicomputers, as well as personal computers and office automation software such as spreadsheets, and integrate this information in a single place.

This capability, coupled with user-friendly reporting tools and freedom from operational impacts, has led to a growth of this type of computer system.

As technology improved (lower cost for more performance) and user requirements increased (faster data load cycle times and more features), Data Warehouses have evolved through several fundamental stages:

Off-line Operational Databases – Data Warehouses in this initial stage are developed by simply copying the database of an operational system to an off-line server, where the processing load of reporting does not affect the operational system’s performance.

Off-line Data Warehouses – Data Warehouses in this stage of evolution are updated on a regular time cycle (usually daily, weekly or monthly) from the operational systems and the data is stored in an integrated reporting-oriented data structure (different from the operational one). In this case Extract-Transform-Load (ETL) processes performed by enterprise-level integration tools are responsible for supplying data from operational systems.

Real Time Data Warehouses – Data Warehouses at this stage are updated on a transaction or event basis, every time an operational system performs a transaction (e.g. an order, a delivery or a booking). An on-line Enterprise Service Bus (ESB) is responsible for supplying data from operational systems.

Integrated Data Warehouses – Data Warehouses at this stage are used to generate activity or transactions that are passed back into the operational systems for use in the daily activity of the organization. Today’s application-level protocols, defined for data interchange between different domain-oriented applications used by the organization or some organizations working together (HL7 in medicine, for example), were designed on this concept and can be used with ESB subsystems.
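
The periodic Extract-Transform-Load cycle of an off-line Data Warehouse can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration only: the two in-memory SQLite databases, the table names orders and daily_sales, the column layout and the sample rows do not come from any real system, and real ETL is normally done by dedicated integration tools rather than hand-written scripts.

```python
import sqlite3

# Hypothetical operational (OLTP) database with a single orders table.
operational = sqlite3.connect(":memory:")
operational.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, "
    "amount_cents INTEGER, ordered_on TEXT)"
)
operational.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, " acme ", 12050, "2015-03-30"),
        (2, "Acme", 7950, "2015-03-30"),
        (3, "globex", 30000, "2015-03-31"),
    ],
)

# Hypothetical warehouse with a reporting-oriented structure:
# one summary row per day and customer instead of one row per order.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE daily_sales (day TEXT, customer TEXT, total REAL, orders INTEGER)"
)


def extract(conn):
    """Extract: read the raw rows from the operational system."""
    return conn.execute(
        "SELECT customer, amount_cents, ordered_on FROM orders"
    ).fetchall()


def transform(rows):
    """Transform: make customer names consistent and summarize per day."""
    totals = {}
    for customer, cents, day in rows:
        key = (day, customer.strip().upper())
        total, count = totals.get(key, (0, 0))
        totals[key] = (total + cents, count + 1)
    return [
        (day, customer, cents / 100, count)
        for (day, customer), (cents, count) in sorted(totals.items())
    ]


def load(conn, rows):
    """Load: append the summarized rows to the warehouse table."""
    conn.executemany("INSERT INTO daily_sales VALUES (?, ?, ?, ?)", rows)
    conn.commit()


# One load cycle; an off-line Data Warehouse would repeat this daily, weekly or monthly.
load(warehouse, transform(extract(operational)))
report = warehouse.execute(
    "SELECT day, customer, total, orders FROM daily_sales ORDER BY day, customer"
).fetchall()
print(report)
```

The reporting query runs only against the warehouse copy, so it puts no load on the operational database between load cycles.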

2.5.3. Architecture of Data Warehouses.

The concept of “data warehousing” dates back at least to the mid-1980s, and possibly earlier. In essence, it was intended to provide an architectural model for the flow of data from operational systems to decision support environments. It attempted to address the various problems associated with this flow, and the high costs associated with it.

In the absence of such an architecture, there usually existed an enormous amount of redundancy in the delivery of management information. In larger corporations it was typical for multiple decision support projects to operate independently, each serving different users but often requiring much of the same data.

The process of gathering, cleaning and integrating data from various sources, often legacy systems (old computer systems or application programs that continue to be used because the organization doesn’t want to replace or redesign them), was typically replicated for each decision support project. Moreover, legacy systems were frequently being revisited as new requirements emerged, each requiring a different view of the legacy data.

Based on analogies with real-life warehouses, Data Warehouses were intended as large-scale collection/storage/staging areas for corporate data.

From here data could be distributed to “retail stores” or “data marts” which were tailored for access by decision support users (or “consumers”). While the Data Warehouse was designed to manage the bulk supply of data from its “suppliers” (e.g. operational systems), and to handle the organization and storage of this data, the “retail stores” or “data marts” could be focused on packaging and presenting selections of the data to end-users, to meet specific management information needs.

Somewhere along the way this analogy and architectural vision were lost, as some vendors and industry speakers redefined the Data Warehouse as simply a management reporting database. This is a subtle but important deviation from the original vision of the Data Warehouse as the hub of a management information architecture, where the decision support systems were actually the “data marts” or “retail stores”.

2.5.4. Data storage models for Data Warehouses.

The goal of a Data Warehouse is to bring data together from a variety of existing databases to support management and reporting needs. The generally accepted principle is that data should be stored at its most elementary level because this provides the most useful and flexible basis for reporting and information analysis.

However, because of different focus on specific requirements, there can be alternative methods for designing and implementing Data Warehouses. There are two leading approaches to organizing the data in a Data Warehouse: the dimensional approach and the normalized approach.

The dimensional approach is very useful in data mart design, but it can result in serious problems of long term data integration and abstraction complications when used in a Data Warehouse. In the “dimensional” approach, transaction data is partitioned into either measured “facts”, which are generally numeric data capturing specific values, or “dimensions”, which contain the reference information that gives each transaction its context:

As an example, a sales transaction could be broken up into facts such as the number of products ordered and the price paid, and dimensions such as date, customer, product (and its price), geographical location and salesperson.

The main advantage of a dimensional approach is that the Data Warehouse is easy for business staff with limited IT experience to understand and use.

Also, because the data is pre-joined into the dimensional form, the Data Warehouse tends to operate very quickly.

The main disadvantage of the dimensional approach is that it’s difficult to add or change the stored information later, if the company changes the way in which it does business.
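
The split of a sales transaction into facts and dimensions can be sketched as a small star schema. The sketch below uses Python with an in-memory SQLite database; the table names (fact_sales, dim_customer, dim_product, dim_date), the columns and the sample rows are all hypothetical and serve only to illustrate the idea.

```python
import sqlite3

star = sqlite3.connect(":memory:")
star.executescript("""
-- Dimensions: reference information that gives each transaction its context.
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name TEXT, list_price REAL);
CREATE TABLE dim_date     (date_id     INTEGER PRIMARY KEY, day TEXT, month TEXT);

-- Facts: numeric measures of each sale, keyed by the dimensions.
CREATE TABLE fact_sales (
    date_id     INTEGER REFERENCES dim_date(date_id),
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    product_id  INTEGER REFERENCES dim_product(product_id),
    quantity    INTEGER,
    price_paid  REAL
);

INSERT INTO dim_customer VALUES (1, 'Acme', 'North'), (2, 'Globex', 'South');
INSERT INTO dim_product  VALUES (1, 'Widget', 10.0), (2, 'Gadget', 25.0);
INSERT INTO dim_date     VALUES (1, '2015-03-30', '2015-03'), (2, '2015-03-31', '2015-03');
INSERT INTO fact_sales   VALUES (1, 1, 1, 3, 30.0), (1, 2, 2, 1, 25.0), (2, 1, 2, 2, 50.0);
""")

# A typical report: one join from the fact table to a single dimension.
revenue_by_region = star.execute("""
    SELECT c.region, SUM(f.price_paid) AS revenue, SUM(f.quantity) AS units
    FROM fact_sales AS f
    JOIN dim_customer AS c ON c.customer_id = f.customer_id
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()
print(revenue_by_region)
```

Each report needs only the fact table plus the dimensions it groups by, which is one reason why such a structure tends to be easy to query. Changing what counts as a fact or a dimension later means restructuring these tables.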

The normalized approach uses database normalization. In this method, the data in the Data Warehouse is stored in third normal form (3NF). Tables are then grouped together by subject areas that reflect the general definition of the data (customer, product, finance, etc.).

The main advantage of this approach is that it is quite straightforward to add new information into the database.

The primary disadvantage is that because of the number of tables involved, it can be rather slow to produce information and reports.

Furthermore, since the segregation of facts and dimensions is not explicit in this type of data model, it is difficult for users to join the required data elements into meaningful information without a precise understanding of the data structure (which requires detailed documentation of the database and more advanced skills).
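
A minimal sketch of the normalized approach, again in Python with an in-memory SQLite database, shows why reports need more joins and more knowledge of the structure. The tables (region, customer, product, orders, order_line) and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

normalized = sqlite3.connect(":memory:")
normalized.executescript("""
-- Each kind of information is stored once, in its own table (third normal form).
CREATE TABLE region   (region_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT,
                       region_id INTEGER REFERENCES region(region_id));
CREATE TABLE product  (product_id INTEGER PRIMARY KEY, name TEXT, unit_price REAL);
CREATE TABLE orders   (order_id INTEGER PRIMARY KEY,
                       customer_id INTEGER REFERENCES customer(customer_id),
                       ordered_on TEXT);
CREATE TABLE order_line (order_id   INTEGER REFERENCES orders(order_id),
                         product_id INTEGER REFERENCES product(product_id),
                         quantity   INTEGER);

INSERT INTO region     VALUES (1, 'North'), (2, 'South');
INSERT INTO customer   VALUES (1, 'Acme', 1), (2, 'Globex', 2);
INSERT INTO product    VALUES (1, 'Widget', 10.0), (2, 'Gadget', 25.0);
INSERT INTO orders     VALUES (1, 1, '2015-03-30'), (2, 2, '2015-03-30'), (3, 1, '2015-03-31');
INSERT INTO order_line VALUES (1, 1, 3), (2, 2, 1), (3, 2, 2);
""")

# Revenue per region: the user has to know the path through five tables.
revenue_per_region = normalized.execute("""
    SELECT r.name, SUM(l.quantity * p.unit_price) AS revenue
    FROM order_line AS l
    JOIN orders   AS o ON o.order_id    = l.order_id
    JOIN customer AS c ON c.customer_id = o.customer_id
    JOIN region   AS r ON r.region_id   = c.region_id
    JOIN product  AS p ON p.product_id  = l.product_id
    GROUP BY r.name
    ORDER BY r.name
""").fetchall()
print(revenue_per_region)
```

Adding new information, for example a supplier table referenced from product, would not disturb the existing tables, which reflects the main advantage of this approach.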

[Figure: Operational (Domain) Systems (Accounting, Personal Resources, Sales, Production, ERP); Integration Tools (ETL, ESB); Business Intelligence (BI) System containing the Data Warehouse and BI applications (Data Marts); Analytical workers.]