01DataWarehoudingandAbInitioConcepts.ppt
Post on 13-Oct-2015
Accenture Ab Initio Training
Introduction to Ab Initio
Prepared By : Ashok Chanda
Ab Initio Session 1
Introduction to DWH
Explanation of DW Architecture
Operating System / Hardware Support
Introduction to ETL Process
Introduction to Ab Initio
Explanation of Ab Initio Architecture
What is a Data Warehouse?
A data warehouse is a copy of transaction data specifically structured for querying and reporting.
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.
A data warehouse is a central repository for all or significant parts of the data that an enterprise's various business systems collect.
Data Warehouse: Definitions
A data warehouse is a database geared towards the business intelligence requirements of an organization. The data warehouse integrates data from the various operational systems and is typically loaded from these systems at regular intervals. Data warehouses contain historical information that enables analysis of business performance over time. A data warehouse can also be described as a collection of databases combined with a flexible data extraction system.
Data Warehouse
A data warehouse can be normalized or denormalized. It can be a relational database, multidimensional database, flat file, hierarchical database, object database, etc. Data warehouse data often gets changed, and data warehouses often focus on a specific activity or entity.
Why Use a Data Warehouse?
Data exploration and discovery
Integrated and consistent data
Quality-assured data
Easily accessible data
Production and performance awareness
Access to data in a timely manner
Simplified Data Warehouse Architecture
Data Warehouse Architecture
Data Warehouses can be architected in many different ways, depending on the specific needs of a business. The model shown below is the "hub-and-spokes" Data Warehousing architecture that is popular in many organizations.
In short, data is moved from databases used in operational systems into a data warehouse staging area, then into a data warehouse and finally into a set of conformed data marts. Data is copied from one database to another using a technology called ETL (Extract, Transform, Load).
The ETL Process
Capture
Scrub or data cleansing
Transform
Load and index
ETL Technology
ETL Technology is an important component of the Data Warehousing Architecture. It is used to copy data from Operational Applications to the Data Warehouse Staging Area, from the DW Staging Area into the Data Warehouse and finally from the Data Warehouse into a set of conformed Data Marts that are accessible by decision makers.
The ETL software extracts data, transforms values of inconsistent data, cleanses "bad" data, filters data and loads data into a target database. The scheduling of ETL jobs is critical. Should there be a failure in one ETL job, the remaining ETL jobs must respond appropriately.
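The extract, cleanse, transform, filter and load steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the row layout, the `COUNTRY_CODES` mapping and every function name are hypothetical, and plain lists stand in for the source and target databases.

```python
# Hypothetical source rows; the field names are illustrative only.
SOURCE_ROWS = [
    {"id": 1, "country": "usa", "amount": "10.50"},
    {"id": 2, "country": "U.S.A.", "amount": "7.25"},
    {"id": 3, "country": "SG", "amount": "not-a-number"},  # "bad" data
    {"id": 4, "country": "sg", "amount": "0.00"},           # filtered out later
]

# Assumed mapping from inconsistent spellings to one consistent code.
COUNTRY_CODES = {"usa": "US", "u.s.a.": "US", "us": "US", "sg": "SG"}


def extract(source):
    """Extract: read all rows from the source (a list stands in for a database)."""
    return list(source)


def cleanse(rows):
    """Cleanse: parse amounts and set aside rows whose amount cannot be parsed."""
    good, rejected = [], []
    for row in rows:
        try:
            good.append({**row, "amount": float(row["amount"])})
        except ValueError:
            rejected.append(row)
    return good, rejected


def transform(rows):
    """Transform: replace inconsistent country spellings with one code."""
    return [{**row, "country": COUNTRY_CODES[row["country"].lower()]} for row in rows]


def keep_nonzero(rows):
    """Filter: drop rows with a zero amount."""
    return [row for row in rows if row["amount"] > 0]


def load(rows, target):
    """Load: append the rows to the target and report how many were loaded."""
    target.extend(rows)
    return len(rows)


def run_etl(source, target):
    good, rejected = cleanse(extract(source))
    loaded = load(keep_nonzero(transform(good)), target)
    return loaded, rejected


warehouse = []
loaded_count, rejected_rows = run_etl(SOURCE_ROWS, warehouse)
```

Returning the rejected rows instead of silently dropping them is one possible design choice; it lets a scheduler decide how downstream jobs should respond when a job sees bad input.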
Data Warehouse Staging Area
The Data Warehouse Staging Area is a temporary location where data from source systems is copied. A staging area is mainly required in a Data Warehousing Architecture for timing reasons. In short, all required data must be available before data can be integrated into the Data Warehouse.
Due to varying business cycles, data processing cycles, hardware and network resource limitations and geographical factors, it is not feasible to extract all the data from all Operational databases at exactly the same time.
Examples: Staging Area
For example, it might be reasonable to extract sales data on a daily basis; however, daily extracts might not be suitable for financial data that requires a month-end reconciliation process. Similarly, it might be feasible to extract "customer" data from a database in Singapore at noon Eastern Standard Time, but this would not be feasible for "customer" data in a Chicago database.
Data in the Data Warehouse can be either persistent (i.e. remains around for a long period) or transient (i.e. only remains around temporarily).
Not all businesses require a Data Warehouse Staging Area. For many businesses it is feasible to use ETL to copy data directly from operational databases into the Data Warehouse.
Data Warehouse
The purpose of the Data Warehouse in the overall Data Warehousing Architecture is to integrate corporate data. It contains the "single version of truth" for the organization that has been carefully constructed from data stored in disparate internal and external operational databases.
The amount of data in the Data Warehouse is massive. Data is stored at a very granular level of detail. For example, every "sale" that has ever occurred in the organization is recorded and related to dimensions of interest. This allows data to be sliced and diced, summed and grouped in unimaginable ways.
Data Warehouse
Contrary to popular opinion, the Data Warehouse does not contain all the data in the organization. Its purpose is to provide key business metrics that are needed by the organization for strategic and tactical decision making.
Decision makers don't access the Data Warehouse directly. This is done through various front-end Data Warehouse Tools that read data from subject-specific Data Marts.
The Data Warehouse can be either "relational" or "dimensional". This depends on how the business intends to use the information.
Data Warehouse Environment
In addition to a relational/multidimensional database, a data warehouse environment often consists of an ETL solution, an OLAP engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.
Data Mart
A subset of a data warehouse, for use by a single department or function.
A repository of data gathered from operational data and other sources that is designed to serve a particular community of knowledge workers.
A subset of the information contained in a data warehouse.
Data marts have the same definition as the data warehouse (see above), but data marts have a more limited audience and/or data content.
Data Mart
ETL (Extract, Transform, Load) jobs extract data from the Data Warehouse and populate one or more Data Marts for use by groups of decision makers in the organization. The Data Marts can be dimensional (Star Schemas) or relational, depending on how the information is to be used and what "front end" Data Warehousing Tools will be used to present the information.
Each Data Mart can contain different combinations of tables, columns and rows from the Enterprise Data Warehouse. For example, a business unit or user group that doesn't require a lot of historical data might only need transactions from the current calendar year in the database. The Personnel Department might need to see all details about employees, whereas data such as "salary" or "home address" might not be appropriate for a Data Mart that focuses on Sales.
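As a rough sketch of the idea that a Data Mart holds a chosen combination of rows and columns, the Python snippet below uses the standard `sqlite3` module to derive a sales mart from two warehouse tables. The table names (`dw_sales`, `dw_employee`, `mart_sales`), the columns and the sample rows are all assumptions made for illustration, and 2015 is assumed to be the current calendar year.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical warehouse tables (names and columns are illustrative).
CREATE TABLE dw_sales (
    sale_id INTEGER PRIMARY KEY, sale_year INTEGER, employee_id INTEGER, amount REAL
);
CREATE TABLE dw_employee (
    employee_id INTEGER PRIMARY KEY, name TEXT, salary REAL, home_address TEXT
);
INSERT INTO dw_sales VALUES (1, 2014, 7, 100.0), (2, 2015, 7, 250.0), (3, 2015, 8, 75.0);
INSERT INTO dw_employee VALUES (7, 'Ana', 50000.0, '1 Main St'), (8, 'Raj', 60000.0, '2 Oak Ave');

-- Sales mart: current-year rows only, and no salary or home_address columns.
CREATE TABLE mart_sales AS
SELECT s.sale_id, s.sale_year, e.name AS employee_name, s.amount
FROM dw_sales AS s
JOIN dw_employee AS e ON e.employee_id = s.employee_id
WHERE s.sale_year = 2015;
""")

# Column names of the mart (second field of each PRAGMA table_info row).
mart_columns = [row[1] for row in conn.execute("PRAGMA table_info(mart_sales)")]
mart_rows = conn.execute("SELECT * FROM mart_sales ORDER BY sale_id").fetchall()
```

In a real environment the mart would be populated by a scheduled ETL job rather than a single `CREATE TABLE ... AS SELECT`, but the row and column selection is the same idea.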
Star Schema
The star schema is perhaps the simplest data warehouse schema.
It is called a star schema because the entity-relationship diagram of this schema resembles a star, with points radiating from a central table.
The center of the star consists of a large fact table and the points of the star are the dimension tables.
Star Schema continued
A star schema is characterized by one or more very large fact tables that contain the primary information in the data warehouse, and a number of much smaller dimension tables (or lookup tables), each of which contains information about the entries for a particular attribute in the fact table.
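A minimal star schema and a typical star query might look like the sketch below, written with Python's standard `sqlite3` module. The table names (`fact_sales`, `dim_product`, `dim_date`), the columns and the sample data are assumptions chosen for illustration and do not come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Small dimension (lookup) tables: the points of the star.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, calendar_year INTEGER, calendar_month INTEGER);

-- Central fact table: foreign keys to the dimensions plus numeric facts.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    quantity INTEGER,
    amount REAL
);

INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware'), (3, 'Manual', 'Books');
INSERT INTO dim_date VALUES (10, 2015, 9), (11, 2015, 10);
INSERT INTO fact_sales VALUES (1, 10, 2, 20.0), (2, 10, 1, 15.0), (3, 11, 4, 40.0), (1, 11, 1, 10.0);
""")

# A star query: join the fact table to each dimension, then group by dimension attributes.
star_query = """
SELECT p.category, d.calendar_month, SUM(f.amount)
FROM fact_sales AS f
JOIN dim_product AS p ON p.product_id = f.product_id
JOIN dim_date AS d ON d.date_id = f.date_id
GROUP BY p.category, d.calendar_month
ORDER BY p.category, d.calendar_month
"""
sales_by_category_month = conn.execute(star_query).fetchall()
```

Each dimension is one join away from the fact table, which is what keeps star queries simple to write.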
Advantages of Star Schemas
Provide a direct and intuitive mapping between the business entities being analyzed by end users and the schema design.
Provide highly optimized performance for typical star queries.
Are widely supported by a large number of business intelligence tools, which may anticipate or even require that the data-warehouse schema contain dimension tables.
Star schemas are used for both simple data marts and very large data warehouses.
Star Schema
Diagrammatic representation of star schema
Snowflake Schema
The snowflake schema is a more complex data warehouse model than a star schema, and is a type of star schema.
It is called a snowflake schema because the diagram of the schema resembles a snowflake.
Snowflake schemas normalize dimensions to eliminate redundancy.
Snowflake Schema - Example
That is, the dimension data has been grouped into multiple tables instead of one large table. For example, a product dimension table in a star schema might be normalized into a products table, a product_category table, and a product_manufacturer table in a snowflake schema. While this saves space, it increases the number of dimension tables and requires more foreign key joins. The result is more complex queries and reduced query performance.
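The product example above can be sketched with Python's `sqlite3` module. The `products`, `product_category` and `product_manufacturer` table names come from the text; the columns, the `fact_sales` table and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The product dimension, normalized into three tables.
CREATE TABLE product_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE product_manufacturer (manufacturer_id INTEGER PRIMARY KEY, manufacturer_name TEXT);
CREATE TABLE products (
    product_id INTEGER PRIMARY KEY,
    product_name TEXT,
    category_id INTEGER REFERENCES product_category(category_id),
    manufacturer_id INTEGER REFERENCES product_manufacturer(manufacturer_id)
);
-- Hypothetical fact table that references the products table.
CREATE TABLE fact_sales (product_id INTEGER REFERENCES products(product_id), amount REAL);

INSERT INTO product_category VALUES (1, 'Hardware'), (2, 'Books');
INSERT INTO product_manufacturer VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO products VALUES (1, 'Widget', 1, 1), (2, 'Gadget', 1, 2), (3, 'Manual', 2, 1);
INSERT INTO fact_sales VALUES (1, 20.0), (2, 15.0), (3, 40.0), (1, 10.0);
""")

# Reaching the category name now takes two joins (fact -> products -> product_category);
# with a single denormalized product dimension it would take one.
snowflake_query = """
SELECT c.category_name, SUM(f.amount)
FROM fact_sales AS f
JOIN products AS p ON p.product_id = f.product_id
JOIN product_category AS c ON c.category_id = p.category_id
GROUP BY c.category_name
ORDER BY c.category_name
"""
sales_by_category = conn.execute(snowflake_query).fetchall()
```

The category name 'Hardware' is stored once instead of once per product, which is the space saving, and the extra join is the cost.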
Diagrammatic representation for Snowflake Schema
Fact Table
The central table in a star schema is called the fact table. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key that is made up of all of its foreign keys.
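The composite primary key can be illustrated with a short `sqlite3` sketch in Python. The `fact_sales` table, its three key columns and the sample rows are hypothetical, and the dimension tables themselves are left out to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE fact_sales (
    product_id INTEGER NOT NULL,   -- would reference a product dimension
    date_id    INTEGER NOT NULL,   -- would reference a date dimension
    store_id   INTEGER NOT NULL,   -- would reference a store dimension
    quantity   INTEGER,            -- fact
    amount     REAL,               -- fact
    PRIMARY KEY (product_id, date_id, store_id)
)
""")
conn.execute("INSERT INTO fact_sales VALUES (1, 10, 100, 2, 20.0)")
conn.execute("INSERT INTO fact_sales VALUES (1, 10, 101, 1, 10.0)")


def try_insert(row):
    """Return True if the row was inserted, False if it violated the primary key."""
    try:
        conn.execute("INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


# Same (product, date, store) combination as the first row: rejected.
duplicate_accepted = try_insert((1, 10, 100, 5, 50.0))
# A different product for the same date and store: accepted.
new_key_accepted = try_insert((2, 10, 100, 5, 50.0))
row_count = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
```

With this key, each combination of dimension values identifies at most one fact row.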
What happens during the ETL process?
During extraction, the desired data is identified and extracted from many different sources, including database systems and applications. Depending on the source system's capabilities (for example, operating system resources), some transformations may take place during this extraction process. The size of the extracted data varies from hundreds of kilobytes up to gigabytes, depending on the source system and the business situation. After the data is extracted, it has to be physically transported to the target system or an intermediate system for further processing.
Examples of Second-Generation ETL Tools
Powermart 4.5 (Informatica Corporation): Pioneer due to market share
Ardent DataStage (Ardent Software, Inc.): General-purpose tool oriented to data marts
Sagent Data Mart Solution 3.0 (Sagent Technology): Progressively integrated with Microsoft
Ab Initio 2.2 (Ab Initio Software): A kit of tools that can be used to build applications
Tapestry 2.1 (D2K, Inc): End-to-end data warehousing solution from a single vendor
What to look for in ETL tools
Use optional data cleansing tool to clean up source data
Use extraction/transformation/load tool to retrieve, cleanse, transform, summarize, aggregate, and load data
Use modern, engine-driven technology for fast, parallel operation
Goal: define 100% of the transform rules with a point-and-click interface
Support development of logical and physical data models
Generate and manage central metadata repository
Open metadata exchange architecture to integrate central metadata with local metadata
Support metadata standards
Provide end users access to metadata in business terms
Operating System / Hardware Support
This section discusses how a DBMS utilizes OS/hardware features such as parallel functionality, SMP/MPP support, and clustering. These OS/hardware features greatly extend the scalability and improve performance. However, managing an environment with these features is difficult and expensive.
Parallel Functionality
The introduction and maturation of parallel processing environments are key enablers of increasing database sizes, as well as providing acceptable response times for storing, retrieving, and administering data. DBMS vendors are continually bringing products to market that take advantage of multi-processor hardware platforms. These products can perform table scans, backups, loads, and queries in parallel.
Parallel Features
An overview of typical parallel functionality is given below:
Queries: Parallel queries can enhance scalability for many query operations.
Data load: Performance is always a serious issue when loading large databases. Meeting response time requirements is the overriding factor for determining the best load method and should be a key part of a performance benchmark.
Create table as select: This feature makes it possible to create aggregated tables in parallel.
Index creation: Parallel index creation exploits the benefits of parallel hardware by distributing the workload generated by creating a large index across a large number of processors.
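The common pattern behind these features is to split the work, process the pieces concurrently, and combine the partial results. The Python sketch below shows that pattern for a table scan that counts and sums a column. It is an illustration, not how any particular DBMS implements it: the function names are hypothetical, and Python threads are used only to show the split/combine structure, since they do not give true CPU parallelism for this kind of work.

```python
from concurrent.futures import ThreadPoolExecutor


def split(rows, parts):
    """Split rows into at most `parts` contiguous chunks of near-equal size."""
    size = max(1, -(-len(rows) // parts))  # ceiling division, at least 1
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def scan_chunk(chunk):
    """Partial aggregate over one chunk: (row count, sum of amounts)."""
    return len(chunk), sum(chunk)


def parallel_scan(rows, workers=4):
    """Scan the chunks concurrently, then combine the partial aggregates."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(scan_chunk, split(rows, workers)))
    return sum(count for count, _ in partials), sum(total for _, total in partials)


AMOUNTS = list(range(1, 101))  # stand-in for one numeric column of a table
row_count, total = parallel_scan(AMOUNTS)
```

Counts and sums combine cleanly because the partial results can simply be added; an average would have to be derived from the combined count and sum rather than by averaging the partial averages.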
Which parallel processor configuration, SMP or MPP?
SMP and clustered SMP environments have the flexibility and ability to scale in small increments.
SMP environments are often useful for the large but static data warehouse, where the data cannot be easily partitioned, due to the unpredictable nature of how the data is joined over multiple tables for complex searches and ad-hoc queries.
Which parallel processor configuration, SMP or MPP?
MPP works well in environments where growth is potentially unlimited, access patterns to the database are predictable, and the data can be easily partitioned across different MPP nodes with minimal data accesses crossing between them. This often occurs in large OLTP environments, where transactions are generally small and predictable, as opposed to decision support and data warehouse environments, where multiple tables can be joined in unpredictable ways.
In fact, data warehousing and decision support are the areas most vendors of parallel hardware platforms and DBMSs are targeting.
MPP does not scale well if heavy data warehouse database accesses must cross MPP nodes, causing I/O bottlenecks over the MPP interconnect, or if multiple MPP nodes are continually locked for concurrent record updates.
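To make "easily partitioned with minimal data accesses crossing between nodes" concrete, the sketch below hash-partitions rows by customer so that each node can total its own customers without consulting any other node. Everything here is an assumption for illustration: lists stand in for MPP nodes, and the function names and sample rows are hypothetical. `zlib.crc32` is used instead of Python's built-in `hash` because the latter varies between runs for strings.

```python
import zlib
from collections import defaultdict


def node_for(key, node_count):
    """Pick a node for a key using a hash that is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % node_count


def partition(rows, node_count):
    """Send each (customer_id, amount) row to the node that owns its customer."""
    nodes = [[] for _ in range(node_count)]
    for customer_id, amount in rows:
        nodes[node_for(customer_id, node_count)].append((customer_id, amount))
    return nodes


def local_totals(node_rows):
    """Total per customer using only the rows stored on one node."""
    totals = defaultdict(float)
    for customer_id, amount in node_rows:
        totals[customer_id] += amount
    return dict(totals)


ROWS = [("c1", 10.0), ("c2", 5.0), ("c1", 2.5), ("c3", 1.0), ("c2", 4.0)]
nodes = partition(ROWS, node_count=3)
per_node = [local_totals(node_rows) for node_rows in nodes]

# Each customer lives on exactly one node, so merging never has to add
# values for the same key coming from two different nodes.
merged = {}
for totals in per_node:
    merged.update(totals)
```

A query that groups by some other attribute, such as product, would not line up with this partitioning and would have to move rows between nodes, which is the cross-node traffic the slide warns about.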
A Multi-CPU Computer (SMP)
A Network of Multi-CPU Nodes
A Network of Networks
Parallel Computer Architecture
Computers come in many shapes and sizes:
Single-CPU, Multi-CPU
Network of single-CPU computers
Network of multi-CPU computers
Multi-CPU machines are often called SMPs (for Symmetric Multi Processors).
Specially-built networks of machines are often called MPPs (for Massively Parallel Processors).
Introduction to Ab Initio
History of Ab Initio
Ab Initio Software Corporation was founded in the mid-1990s by Sheryl Handler, the former CEO at Thinking Machines Corporation, after TMC filed for bankruptcy. In addition to Handler, other former TMC people involved in the founding of Ab Initio included Cliff Lasser, Angela Lordi, and Craig Stanfill.
Ab Initio is known for being very secretive in the way that they run their business, but their software is widely regarded as top notch.
History of Ab Initio
The Ab Initio software is a fourth generation data analysis, batch processing, data manipulation graphical user interface (GUI)-based parallel processing tool that is used mainly to extract, transform and load data.
The Ab Initio software is a suite of products that together provide a platform for robust data processing applications. The Core Ab Initio Products are:
The [Co>Operating System]
The Component Library
The Graphical Development Environment
What Does Ab Initio Mean?
Ab Initio is Latin for "From the Beginning".
From the beginning our software was designed to support a complete range of business applications, from simple to the most complex. Crucial capabilities like parallelism and checkpointing can't be added after the fact.
The Graphical Development Environment and a powerful set of components allow our customers to get valuable results from the beginning.
Ab Initio's focus
Moving data
  Move small and large volumes of data in an efficient manner
  Deal with the complexity associated with business data
High performance
  Scalable solutions
Better productivity
Ab Initio's Software
Ab Initio software is a general-purpose data processing platform for mission-critical applications such as:
  Data warehousing
  Batch processing
  Click-stream analysis
  Data movement
  Data transformation
Applications of Ab Initio Software
Processing just about any form and volume of data.
Parallel sort/merge processing.
Data transformation.
Rehosting of corporate data.
Parallel execution of existing applications.
Ab Initio Provides For:
Distribution - a platform for applications to execute across a collection of processors within the confines of a single machine or across multiple machines.
Reduced Run Time Complexity - the ability for applications to run in parallel, from a single point of control, on any combination of computers where the Ab Initio Co>Operating System is installed.
Applications of Ab Initio Software in terms of Data Warehouse
Front end of Data Warehouse:
  Transformation of disparate sources
  Aggregation and other preprocessing
  Referential integrity checking
  Database loading
Back end of Data Warehouse:
  Extraction for external processing
  Aggregation and loading of Data Marts
Ab Initio or Informatica - Powerful ETL
1. Informatica and Ab Initio both support parallelism, but Informatica supports only one type, whereas Ab Initio supports three: component parallelism, data parallelism, and pipeline parallelism. In Informatica, the developer has to define partitions in the Server Manager to achieve parallelism; in Ab Initio, the tool itself takes care of it.
2. Ab Initio has no built-in scheduler like Informatica's; you need to schedule jobs through a script or run them manually.
3. Ab Initio supports different types of text files, meaning you can read the same file with different structures, which is not possible in Informatica. Ab Initio is also more user-friendly than Informatica, so there are many differences between the two tools.
4. Ab Initio doesn't need a dedicated administrator; a UNIX or NT admin will suffice, whereas other ETL tools do involve administrative work.
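The three kinds of parallelism named above can be illustrated with a small sketch. This is plain Python, not Ab Initio code, and every name in it (clean, data_parallel, component_parallel, pipeline) is made up for illustration. Threads and generators only stand in for the separate processes a real engine would presumably use.

```python
from concurrent.futures import ThreadPoolExecutor


def clean(record):
    # Hypothetical transform: trim whitespace and capitalise a name.
    return record.strip().title()


def clean_partition(part):
    return [clean(r) for r in part]


def data_parallel(records, n=2):
    # Data parallelism: split the records into n partitions (round-robin)
    # and run the same transform on every partition at the same time.
    parts = [records[i::n] for i in range(n)]
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(clean_partition, parts))


def component_parallel(customers, products):
    # Component parallelism: two independent branches (here, two sorts on
    # unrelated inputs) run concurrently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        a = pool.submit(sorted, customers)
        b = pool.submit(sorted, products)
        return a.result(), b.result()


def pipeline(records):
    # Pipeline parallelism: the filter stage consumes each record as soon as
    # the clean stage yields it, instead of waiting for the whole input.
    cleaned = (clean(r) for r in records)
    return (r for r in cleaned if r)
```

For example, data_parallel([" ann ", "bob", "cy "], 2) cleans two partitions side by side, while list(pipeline([" ann ", "  ", "bob"])) streams records through both stages one at a time.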
Ab Initio or Informatica - Powerful ETL - continued
Error Handling - In Ab Initio you can attach error and reject files to each transformation and capture and analyze the messages and data separately. Informatica has one huge log, which is very inefficient when working on a large process with numerous points of failure.
Robust transformation language - Informatica is very basic as far as transformations go. While I will not go into a function-by-function comparison, Ab Initio seems much more robust.
Instant feedback - On execution, Ab Initio tells you how many records have been processed, rejected, etc., and gives detailed performance metrics for each component. Informatica has a debug mode, but it is slow and difficult to adapt to.
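The idea of per-transformation reject and error outputs, together with record counts, can be sketched as follows. This is a generic Python illustration under assumed names (run_transform, parse_amount), not Ab Initio's actual mechanism: good records, rejected records, and error messages go to three separate collections, and the counts are reported per run.

```python
def run_transform(records, transform):
    # Route each record to exactly one place: the output, or the reject list
    # with a matching entry in the error list.
    out, rejects, errors = [], [], []
    for rec in records:
        try:
            out.append(transform(rec))
        except (ValueError, KeyError) as exc:
            rejects.append(rec)                            # the offending data
            errors.append(f"{type(exc).__name__}: {exc}")  # the message
    counts = {"read": len(records), "written": len(out), "rejected": len(rejects)}
    return out, rejects, errors, counts


def parse_amount(rec):
    # Hypothetical business rule: the amount field must be numeric.
    return {"id": rec["id"], "amount": float(rec["amount"])}
```

Keeping the rejected data and the messages separate means each can be inspected on its own, per transformation, instead of being searched for in one shared log.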
Both tools are fundamentally different
Which one to use depends on the work at hand and on the existing infrastructure and resources available. Informatica is an engine-based ETL tool: the power of this tool is in its transformation engine, and the code that it generates after development cannot be seen or modified. Ab Initio is a code-based ETL tool: it generates ksh, bat, or similar code, which can be modified to achieve goals that cannot be handled through the ETL tool itself.
Ab Initio Product Architecture
[Figure: layered architecture diagram. Labels: Native Operating System (Unix, Windows, OS/390); The Ab Initio Co>Operating System; Component Library; 3rd Party Components; User-defined Components; Development Environments (GDE, Shell); User Applications; Ab Initio EME.]
Ab Initio Architecture - Explanation
The Ab Initio Co>Operating System unites a network of computing resources (CPUs, storage disks, programs, datasets) into a production-quality data processing system with scalable performance and mainframe-class reliability.
The Co>Operating System is layered on top of the native operating systems of the collection of servers. It provides a distributed model for process execution, file management, debugging, process monitoring, and checkpointing. A user may perform all these functions from a single point of control.
Co>Operating System Services
Parallel and distributed application execution
  Control
  Data transport
Transactional semantics at the application level.
Checkpointing.
Monitoring and debugging.
Parallel file management.
Metadata-driven components.
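Checkpointing, one of the services listed above, can be pictured with a minimal sketch. It assumes a job made of named phases and a JSON checkpoint file. Both are inventions for this illustration and say nothing about how the Co>Operating System stores its checkpoints.

```python
import json
import os


def run_with_checkpoints(phases, checkpoint_path):
    # phases: list of (name, zero-argument function) pairs, run in order.
    # Load the names of phases that completed in an earlier run, if any.
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)

    executed = []
    for name, action in phases:
        if name in done:
            continue  # finished before the restart, so skip it
        action()      # an exception here leaves the checkpoint untouched
        executed.append(name)
        done.append(name)
        with open(checkpoint_path, "w") as f:
            json.dump(done, f)  # record progress after each phase
    return executed
```

If a later phase fails, rerunning the same call resumes from the failed phase rather than repeating the phases already recorded as complete.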
Ab Initio: What We Do
Ab Initio software helps you build large-scale data processing applications and run them in parallel environments. Ab Initio software consists of two main programs:
Co>Operating System: which your system administrator installs on a host Unix or Windows NT server, as well as on processing computers.
The Graphical Development Environment (GDE): which you install on your PC (GDE Computer) and configure to communicate with the host.
The Ab Initio Co>Operating System
The Co>Operating System:
  Runs across a variety of operating systems and hardware platforms, including OS/390 on mainframe, Unix, and Windows.
  Supports distributed and parallel execution.
  Can provide scalability proportional to the hardware resources provided.
  Supports platform-independent data transport.
The Ab Initio Co>Operating System-Continued
The Ab Initio Co>Operating System depends on parallelism to connect (i.e., cooperate with) diverse databases. It extracts, transforms, and loads data to and from Teradata and other data sources.
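[Figure: the Co>Operating System as a top layer over any native OS (Solaris, AIX, NT, Linux, NCR), with several GDE clients connected to the Co>Operating System layer.]
The same Co>Operating System commands work on any OS.
Graphs can be moved from one OS to another without any changes.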
The Ab Initio Co>Operating System Runs on:
  Sun Solaris
  IBM AIX
  Hewlett-Packard HP-UX
  Siemens Pyramid Reliant UNIX
  IBM DYNIX/ptx
  Silicon Graphics IRIX
  Red Hat Linux
  Windows NT 4.0 (x86)
  Windows 2000 (x86)
  Compaq Tru64 UNIX
  IBM OS/390
  NCR MP-RAS
Connectivity to Other Software
Common, high-performance database interfaces:
  IBM DB2, DB2/PE, DB2 EEE, UDB, IMS
  Oracle, Informix XPS, Sybase, Teradata, MS SQL Server 7
  OLE-DB
  ODBC
Other software packages:
  Connectors to many other third-party products
  Trillium, ERwin, Siebel, etc.
Ab Initio Co>Operating System
Ab Initio Software Corporation, headquartered in Lexington, MA, develops software solutions that process vast amounts of data (well into the terabyte range) in a timely fashion by employing many (often hundreds of) server processors in parallel. Major corporations worldwide use Ab Initio software in mission-critical, enterprise-wide data processing systems. Together, Teradata and Ab Initio deliver:
  End-to-end solutions for integrating and processing data throughout the enterprise
  Software that is flexible, efficient, and robust, with unlimited scalability
  Professional and highly responsive support
The Co>Operating System executes your application by creating and managing the processes and data flows that the components and arrows represent.
Graphical Development Environment (GDE)
The GDE
The Graphical Development Environment (GDE) provides a graphical user interface to the services of the Co>Operating System. The Graphical Development Environment:
  Enables you to create applications by dragging and dropping components.
  Lets you perform point-and-click operations on executable flowcharts. The Co>Operating System can execute these flowcharts directly.
  Provides graphical monitoring of running applications, which lets you quantify data volumes and execution times and helps you spot opportunities for improving performance.
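To make the notion of an "executable flowchart" concrete, here is a small sketch in plain Python. It is not the GDE or the Co>Operating System. Components are functions, flows are (upstream, downstream) pairs, and the runner executes the components in dependency order while recording how many records each one produced. The names run_graph, components, and flows are assumptions made for the example, which needs Python 3.9 or later for graphlib.

```python
from graphlib import TopologicalSorter


def run_graph(components, flows, source_data):
    # components: name -> function taking a list of input record lists
    # flows: (upstream, downstream) pairs, i.e. the arrows of the flowchart
    deps = {name: set() for name in components}
    for up, down in flows:
        deps[down].add(up)

    outputs, volumes = {}, {}
    # static_order() yields every component after all of its upstream ones.
    for name in TopologicalSorter(deps).static_order():
        # Components with no incoming flow read the source data.
        inputs = [outputs[up] for up in sorted(deps[name])] or [source_data]
        outputs[name] = components[name](inputs)
        volumes[name] = len(outputs[name])  # simple per-component monitoring
    return outputs, volumes


components = {
    "read": lambda ins: list(ins[0]),
    "filter": lambda ins: [r for r in ins[0] if r > 0],
    "sort": lambda ins: sorted(ins[0]),
}
flows = [("read", "filter"), ("filter", "sort")]
outputs, volumes = run_graph(components, flows, [3, -1, 2])
```

The volumes dictionary plays the role of the record counts shown during monitoring: it makes it easy to see where in the graph data is being dropped.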
The Graph Model
The Component Library:
The Component Library: reusable software modules for sorting, data transformation, database loading, etc. The components adapt at runtime to the record formats and business rules controlling their behavior.
Ab Initio products have helped reduce a project's development and research time significantly.
Components
Components may run on any computer running the Co>Operating System.
Different components do different jobs.
The particular work a component accomplishes depends upon its parameter settings.
Some parameters are data transformations, that is, business rules to be applied to one or more inputs to produce a required output.
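The point that a component's work depends on its parameters can be sketched in Python. This is only an illustration: make_reformat, the record format list, and the pricing rule are hypothetical and do not reflect Ab Initio's component or record-format syntax.

```python
def make_reformat(record_format, transform):
    # record_format: list of (field name, type) pairs describing each line
    # transform: the business rule applied to every parsed record
    def component(lines, delimiter=","):
        for line in lines:
            values = line.split(delimiter)
            record = {name: cast(value)
                      for (name, cast), value in zip(record_format, values)}
            yield transform(record)
    return component


# The same generic component, configured for one particular job.
order_format = [("id", int), ("price", float), ("qty", int)]


def order_total(rec):
    # Hypothetical business rule: total = price * quantity.
    return {"id": rec["id"], "total": rec["price"] * rec["qty"]}


reformat_orders = make_reformat(order_format, order_total)
```

Changing only order_format or order_total yields a component that does a different job, with no change to the generic code that reads and parses the records.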
3rd Party Components
EME
The Enterprise Meta>Environment (EME) is a high-performance object-oriented storage system that inventories and manages various kinds of information associated with Ab Initio applications. It provides storage for all aspects of your data processing system, from design information to operations data.
The EME also provides a rich store for the applications themselves, including data formats and business rules. It acts as a hub for data and definitions. Integrated metadata management provides a global and consolidated view of the structure and meaning of applications and data - information that is usually scattered throughout your business.
Benefits of EME
The Enterprise Meta>Environment provides a rich store for applications and all of their associated information, including:
  Technical metadata - application-related business rules, record formats, and execution statistics
  Business metadata - user-defined documentation of job functions, roles, and responsibilities
Metadata is data about data and is critical to understanding and driving your business processes and computational resources. Storing and using metadata is as important to your business as storing and using data.
EME - Ab Initio Relevance
By integrating technical and business metadata, you can grasp the entirety of your data processing, from operational to analytical systems.
The EME is a completely integrated environment. The following figure shows how it fits into the high-level architecture of Ab Initio software.
Stepwise explanation of Ab Initio Architecture
You construct your application from building blocks called components, manipulating them through the Graphical Development Environment (GDE).
You check in your applications to the EME.
The EME and GDE use the underlying functionality of the Co>Operating System to perform many of their tasks. The Co>Operating System unites the distributed resources into a single "virtual computer" to run applications in parallel.
Ab Initio software runs on Unix, Windows NT, and MVS operating systems.
Stepwise explanation of Ab Initio Architecture - continued
Ab Initio connector applications extract metadata from third-party metadata sources into the EME, or extract it from the EME into a third-party destination.
You view the results of project and application dependency analysis through a Web user interface. You also view and edit your business metadata through a Web user interface.
EME: Various user constituencies served
The EME addresses the metadata needs of three different constituencies:
  Business users
  Developers
  System administrators
EME: Various user constituencies served
Business users are interested in exploiting data for analysis, in particular with regard to databases, tables, and columns.
Developers tend to be oriented towards applications, needing to analyze the impact of potential program changes.
System administrators and production personnel want job status information and run statistics.
EME Interfaces
We can create and manage the EME through 3 interfaces:
  GDE
  Web User Interface
  Air Utility
Thank You
End of Session 1