  • Accenture Ab Initio Training 1

    Introduction to Ab Initio

    Prepared By : Ashok Chanda

  • Accenture Ab Initio Training 2

    Ab Initio Session 1

    Introduction to DWH
    Explanation of DW Architecture
    Operating System / Hardware Support
    Introduction to ETL Process
    Introduction to Ab Initio
    Explanation of Ab Initio Architecture

  • Accenture Ab Initio Training 3

    What is a Data Warehouse

    A data warehouse is a copy of transaction data specifically structured for querying and reporting.

    A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.

    A data warehouse is a central repository for all or significant parts of the data that an enterprise's various business systems collect.

  • Accenture Ab Initio Training 4

    Data Warehouse - Definitions

    A data warehouse is a database geared towards the business intelligence requirements of an organization. The data warehouse integrates data from the various operational systems and is typically loaded from these systems at regular intervals. Data warehouses contain historical information that enables analysis of business performance over time. A data warehouse can also be described as a collection of databases combined with a flexible data extraction system.

  • Accenture Ab Initio Training 5

    Data Warehouse

    A data warehouse can be normalized or denormalized. It can be a relational database, multidimensional database, flat file, hierarchical database, object database, etc. Data warehouse data often gets changed, and data warehouses often focus on a specific activity or entity.

  • Accenture Ab Initio Training 6

    Why Use a Data Warehouse?

    Data exploration and discovery
    Integrated and consistent data
    Quality-assured data
    Easily accessible data
    Production and performance awareness
    Access to data in a timely manner

  • Accenture Ab Initio Training 7

    Simplified Data Warehouse Architecture

  • Accenture Ab Initio Training 8

    Data Warehouse Architecture

    Data Warehouses can be architected in many different ways, depending on the specific needs of a business. The model shown below is the "hub-and-spokes" Data Warehousing architecture that is popular in many organizations.

    In short, data is moved from databases used in operational systems into a data warehouse staging area, then into a data warehouse and finally into a set of conformed data marts. Data is copied from one database to another using a technology called ETL (Extract, Transform, Load).

  • Accenture Ab Initio Training 9

  • Accenture Ab Initio Training 10

    The ETL Process

    Capture
    Scrub or data cleansing
    Transform
    Load and index

  • Accenture Ab Initio Training 11

    ETL Technology

    ETL Technology is an important component of the Data Warehousing Architecture. It is used to copy data from Operational Applications to the Data Warehouse Staging Area, from the DW Staging Area into the Data Warehouse and finally from the Data Warehouse into a set of conformed Data Marts that are accessible by decision makers.

    The ETL software extracts data, transforms values of inconsistent data, cleanses "bad" data, filters data and loads data into a target database. The scheduling of ETL jobs is critical. Should there be a failure in one ETL job, the remaining ETL jobs must respond appropriately.
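
The extract, transform, cleanse, filter and load steps described above can be sketched in a few lines of Python. This is only an illustration and not how any particular ETL product works: the source rows, the field names, the `COUNTRY_CODES` mapping and the rule that negative amounts are filtered out are all assumptions made up for the example.

```python
# Hypothetical source rows; field names and values are invented for illustration.
SOURCE_ROWS = [
    {"id": "1", "country": "usa", "amount": "10.50"},
    {"id": "2", "country": "U.S.A.", "amount": "n/a"},  # "bad" amount
    {"id": "3", "country": "SG", "amount": "7.25"},
    {"id": "4", "country": "US", "amount": "-1"},       # filtered out below
]

# Assumed transform rule: map inconsistent spellings to one country code.
COUNTRY_CODES = {"usa": "US", "u.s.a.": "US", "us": "US", "sg": "SG"}


def extract(rows):
    """Extract: copy rows out of the (in-memory) source."""
    return [dict(row) for row in rows]


def transform(rows):
    """Transform inconsistent values, set aside bad data, filter unwanted rows."""
    clean, rejected = [], []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            rejected.append(row)  # cleanse: unparseable amounts are set aside
            continue
        if amount < 0:
            continue              # filter: assumed rule, drop negative amounts
        clean.append({
            "id": int(row["id"]),
            "country": COUNTRY_CODES[row["country"].lower()],
            "amount": amount,
        })
    return clean, rejected


def load(rows, target):
    """Load: append the transformed rows to the target (a plain list here)."""
    target.extend(rows)
    return len(rows)
```

Calling `transform(extract(SOURCE_ROWS))` and passing the clean rows to `load` leaves two rows in the target and one row in the rejected list; a real job would typically write the rejects somewhere for later inspection.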

  • Accenture Ab Initio Training 12

    Data Warehouse Staging Area

    The Data Warehouse Staging Area is a temporary location where data from source systems is copied. A staging area is mainly required in a Data Warehousing Architecture for timing reasons. In short, all required data must be available before data can be integrated into the Data Warehouse.

    Due to varying business cycles, data processing cycles, hardware and network resource limitations and geographical factors, it is not feasible to extract all the data from all Operational databases at exactly the same time.

  • Accenture Ab Initio Training 13

    Examples - Staging Area

    For example, it might be reasonable to extract sales data on a daily basis; however, daily extracts might not be suitable for financial data that requires a month-end reconciliation process. Similarly, it might be feasible to extract "customer" data from a database in Singapore at noon Eastern Standard Time, but this would not be feasible for "customer" data in a Chicago database.

    Data in the Data Warehouse can be either persistent (i.e. remains around for a long period) or transient (i.e. only remains around temporarily).

    Not all businesses require a Data Warehouse Staging Area. For many businesses it is feasible to use ETL to copy data directly from operational databases into the Data Warehouse.

  • Accenture Ab Initio Training 14

    Data Warehouse

    The purpose of the Data Warehouse in the overall Data Warehousing Architecture is to integrate corporate data. It contains the "single version of truth" for the organization that has been carefully constructed from data stored in disparate internal and external operational databases.

    The amount of data in the Data Warehouse is massive. Data is stored at a very granular level of detail. For example, every "sale" that has ever occurred in the organization is recorded and related to dimensions of interest. This allows data to be sliced and diced, summed and grouped in many different ways.

  • Accenture Ab Initio Training 15

    Data Warehouse

    Contrary to popular opinion, the Data Warehouse does not contain all the data in the organization. Its purpose is to provide key business metrics that are needed by the organization for strategic and tactical decision making.

    Decision makers don't access the Data Warehouse directly. This is done through various front-end Data Warehouse Tools that read data from subject-specific Data Marts.

    The Data Warehouse can be either "relational" or "dimensional". This depends on how the business intends to use the information.

  • Accenture Ab Initio Training 16

    Data Warehouse Environment

    In addition to a relational/multidimensional database, a data warehouse environment often consists of an ETL solution, an OLAP engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.

  • Accenture Ab Initio Training 17

    Data Mart

    A subset of a data warehouse, for use by a single department or function.

    A repository of data gathered from operational data and other sources that is designed to serve a particular community of knowledge workers.

    A subset of the information contained in a data warehouse.

    Data marts have the same definition as the data warehouse (see above), but data marts have a more limited audience and/or data content.

  • Accenture Ab Initio Training 18

    Data Mart

    ETL (Extract, Transform, Load) jobs extract data from the Data Warehouse and populate one or more Data Marts for use by groups of decision makers in the organization. The Data Marts can be dimensional (Star Schemas) or relational, depending on how the information is to be used and what "front end" Data Warehousing Tools will be used to present the information.

    Each Data Mart can contain different combinations of tables, columns and rows from the Enterprise Data Warehouse. For example, a business unit or user group that doesn't require a lot of historical data might only need transactions from the current calendar year in the database. The Personnel Department might need to see all details about employees, whereas data such as "salary" or "home address" might not be appropriate for a Data Mart that focuses on Sales.
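
As a rough sketch of "different combinations of columns and rows", the Python snippet below uses the standard `sqlite3` module to populate a small sales mart from a warehouse table. The table names, column names, sample rows and the hard-coded start of the "current calendar year" are assumptions invented for the example.

```python
import sqlite3


def build_sales_mart(conn, year_start="2024-01-01"):
    """Populate a sales mart with a subset of rows and columns.

    All table and column names are hypothetical; year_start stands in for
    the start of the current calendar year.
    """
    conn.executescript("""
        CREATE TABLE warehouse_sales (
            sale_id   INTEGER PRIMARY KEY,
            sale_date TEXT,
            employee  TEXT,
            salary    REAL,   -- sensitive: kept out of the sales mart
            amount    REAL
        );
        INSERT INTO warehouse_sales VALUES
            (1, '2023-11-30', 'Ann', 50000, 120.0),
            (2, '2024-02-14', 'Ann', 50000, 80.0),
            (3, '2024-03-01', 'Bob', 45000, 200.0);
        CREATE TABLE sales_mart (
            sale_id   INTEGER PRIMARY KEY,
            sale_date TEXT,
            employee  TEXT,
            amount    REAL
        );
    """)
    # Row subset: current-year sales only. Column subset: no salary.
    conn.execute(
        "INSERT INTO sales_mart "
        "SELECT sale_id, sale_date, employee, amount "
        "FROM warehouse_sales WHERE sale_date >= ?",
        (year_start,),
    )
```

A mart for the Personnel Department would be built the same way but with a different column list, which is the point the slide makes.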

  • Accenture Ab Initio Training 19

    Star Schema

    The star schema is perhaps the simplest data warehouse schema.

    It is called a star schema because the entity-relationship diagram of this schema resembles a star, with points radiating from a central table.

    The center of the star consists of a large fact table and the points of the star are the dimension tables.

  • Accenture Ab Initio Training 20

    Star Schema continued

    A star schema is characterized by one or more very large fact tables that contain the primary information in the data warehouse, and a number of much smaller dimension tables (or lookup tables), each of which contains information about the entries for a particular attribute in the fact table.
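
A minimal star schema can be sketched with Python's standard `sqlite3` module: one fact table in the center and two small dimension tables around it. The table names (`fact_sales`, `dim_product`, `dim_store`), the columns and the sample rows are assumptions chosen for illustration only.

```python
import sqlite3


def build_star(conn):
    """Create a tiny star schema: one fact table, two dimension tables."""
    conn.executescript("""
        CREATE TABLE dim_product (
            product_id INTEGER PRIMARY KEY, name TEXT, category TEXT
        );
        CREATE TABLE dim_store (
            store_id INTEGER PRIMARY KEY, city TEXT
        );
        CREATE TABLE fact_sales (
            product_id INTEGER REFERENCES dim_product(product_id),
            store_id   INTEGER REFERENCES dim_store(store_id),
            quantity   INTEGER,
            amount     REAL
        );
        INSERT INTO dim_product VALUES (1, 'Pen', 'Stationery'), (2, 'Lamp', 'Home');
        INSERT INTO dim_store VALUES (1, 'Chicago'), (2, 'Singapore');
        INSERT INTO fact_sales VALUES (1, 1, 10, 15.0), (1, 2, 4, 6.0), (2, 1, 1, 30.0);
    """)


def sales_by_category(conn):
    """A typical star query: join the fact table to one dimension and aggregate."""
    return conn.execute("""
        SELECT p.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        GROUP BY p.category
        ORDER BY p.category
    """).fetchall()
```

Each dimension is one join away from the fact table, so grouping sales by any product or store attribute needs a single join per dimension.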

  • Accenture Ab Initio Training 21

    Advantages of Star Schemas

    Provide a direct and intuitive mapping between the business entities being analyzed by end users and the schema design.

    Provide highly optimized performance for typical star queries.

    Are widely supported by a large number of business intelligence tools, which may anticipate or even require that the data-warehouse schema contain dimension tables.

    Star schemas are used for both simple data marts and very large data warehouses.

  • Accenture Ab Initio Training 22

    Star Schema

    Diagrammatic representation of star schema

  • Accenture Ab Initio Training 23

    Snowflake Schema

    The snowflake schema is a more complex data warehouse model than a star schema, and is a type of star schema.

    It is called a snowflake schema because the diagram of the schema resembles a snowflake.

    Snowflake schemas normalize dimensions to eliminate redundancy.

  • Accenture Ab Initio Training 24

    Snowflake Schema - Example

    That is, the dimension data has been grouped into multiple tables instead of one large table. For example, a product dimension table in a star schema might be normalized into a products table, a product_category table, and a product_manufacturer table in a snowflake schema. While this saves space, it increases the number of dimension tables and requires more foreign key joins. The result is more complex queries and reduced query performance.
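
The product example above can be sketched with Python's standard `sqlite3` module. The three dimension table names come from the slide; the column names, the `fact_sales` table and the sample rows are assumptions made for illustration.

```python
import sqlite3


def build_snowflake(conn):
    """Product dimension normalized into three tables, plus a small fact table."""
    conn.executescript("""
        CREATE TABLE product_category (
            category_id INTEGER PRIMARY KEY, category TEXT
        );
        CREATE TABLE product_manufacturer (
            manufacturer_id INTEGER PRIMARY KEY, manufacturer TEXT
        );
        CREATE TABLE products (
            product_id      INTEGER PRIMARY KEY,
            name            TEXT,
            category_id     INTEGER REFERENCES product_category(category_id),
            manufacturer_id INTEGER REFERENCES product_manufacturer(manufacturer_id)
        );
        CREATE TABLE fact_sales (
            product_id INTEGER REFERENCES products(product_id),
            amount     REAL
        );
        INSERT INTO product_category VALUES (1, 'Stationery'), (2, 'Home');
        INSERT INTO product_manufacturer VALUES (1, 'Acme'), (2, 'Globex');
        INSERT INTO products VALUES (1, 'Pen', 1, 1), (2, 'Lamp', 2, 2), (3, 'Pencil', 1, 1);
        INSERT INTO fact_sales VALUES (1, 15.0), (3, 5.0), (2, 30.0);
    """)


def sales_by_category_and_manufacturer(conn):
    """Three joins are needed where a star schema would need one."""
    return conn.execute("""
        SELECT c.category, m.manufacturer, SUM(f.amount)
        FROM fact_sales AS f
        JOIN products AS p ON p.product_id = f.product_id
        JOIN product_category AS c ON c.category_id = p.category_id
        JOIN product_manufacturer AS m ON m.manufacturer_id = p.manufacturer_id
        GROUP BY c.category, m.manufacturer
        ORDER BY c.category, m.manufacturer
    """).fetchall()
```

The category and manufacturer names are stored once each rather than repeated per product, which is the space saving; the price is the two extra joins in every query that needs them.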

  • Accenture Ab Initio Training 25

    Diagrammatic representation for Snowflake Schema

  • Accenture Ab Initio Training 26

    Fact Table

    The central table in a star schema is called the fact table. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key that is made up of all of its foreign keys.
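
A composite primary key built from the foreign keys can be sketched as follows, again with Python's standard `sqlite3` module. The column names are hypothetical, and the dimension tables the keys would point to are left out to keep the example short.

```python
import sqlite3


def make_fact_table(conn):
    """Fact table whose primary key is the combination of its dimension keys."""
    conn.executescript("""
        CREATE TABLE fact_sales (
            date_id    INTEGER NOT NULL,  -- key into a (not shown) date dimension
            product_id INTEGER NOT NULL,  -- key into a (not shown) product dimension
            store_id   INTEGER NOT NULL,  -- key into a (not shown) store dimension
            quantity   INTEGER,           -- fact
            amount     REAL,              -- fact
            PRIMARY KEY (date_id, product_id, store_id)
        );
    """)


def insert_sale(conn, row):
    """Insert one fact row; return False if its key combination already exists."""
    try:
        conn.execute("INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False
```

With this key, a second row for the same date, product and store is rejected, while the same product sold in a different store on the same date is accepted.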

  • Accenture Ab Initio Training 27

    What happens during the ETL process?

    During extraction, the desired data is identified and extracted from many different sources, including database systems and applications. Depending on the source system's capabilities (for example, operating system resources), some transformations may take place during this extraction process. The size of the extracted data varies from hundreds of kilobytes up to gigabytes, depending on the source system and the business situation. After the data is extracted, it has to be physically transported to the target system or an intermediate system for further processing.

  • Accenture Ab Initio Training 28

    Examples of Second-Generation ETL Tools

    Powermart 4.5 (Informatica Corporation): Pioneer due to market share

    Ardent DataStage (Ardent Software, Inc.): General-purpose tool oriented to data marts

    Sagent Data Mart Solution 3.0 (Sagent Technology): Progressively integrated with Microsoft

    Ab Initio 2.2 (Ab Initio Software): A kit of tools that can be used to build applications

    Tapestry 2.1 (D2K, Inc): End-to-end data warehousing solution from a single vendor

  • Accenture Ab Initio Training 29

    What to look for in ETL tools

    Use optional data cleansing tool to clean up source data
    Use extraction/transformation/load tool to retrieve, cleanse, transform, summarize, aggregate, and load data
    Use modern, engine-driven technology for fast, parallel operation
    Goal: define 100% of the transform rules with a point-and-click interface
    Support development of logical and physical data models
    Generate and manage central metadata repository
    Open metadata exchange architecture to integrate central metadata with local metadata
    Support metadata standards
    Provide end users access to metadata in business terms

  • Accenture Ab Initio Training 30

    Operating System / Hardware Support

    This section discusses how a DBMS utilizes OS/hardware features such as parallel functionality, SMP/MPP support, and clustering. These OS/hardware features greatly extend scalability and improve performance. However, managing an environment with these features is difficult and expensive.

  • Accenture Ab Initio Training 31

    Parallel Functionality

    The introduction and maturation of parallel processing environments are key enablers of increasing database sizes, as well as providing acceptable response times for storing, retrieving, and administering data. DBMS vendors are continually bringing products to market that take advantage of multi-processor hardware platforms. These products can perform table scans, backups, loads, and queries in parallel.

  • Accenture Ab Initio Training 32

    Parallel Features

    An overview of typical parallel functionality is given below:

    Queries: Parallel queries can enhance scalability for many query operations.

    Data load: Performance is always a serious issue when loading large databases. Meeting response time requirements is the overriding factor for determining the best load method and should be a key part of a performance benchmark.

    Create table as select: This feature makes it possible to create aggregated tables in parallel.

    Index creation: Parallel index creation exploits the benefits of parallel hardware by distributing the workload of creating a large index across a large number of processors.
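
The common idea behind these features is to split the data into partitions, let several workers process the partitions at the same time, and then combine the partial results. The sketch below shows that pattern for a simple aggregate using Python's standard `concurrent.futures` module. It is an illustration of the pattern only, not of how any DBMS implements it: the function names are made up, and threads stand in for processors (in CPython they will not speed up CPU-bound work).

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n):
    """Split rows into n round-robin partitions."""
    return [rows[i::n] for i in range(n)]


def scan_partition(rows):
    """Work done by one worker: scan its partition and return a partial sum."""
    return sum(amount for _, amount in rows)


def parallel_total(rows, workers=4):
    """Scan all partitions concurrently, then combine the partial sums."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(scan_partition, partition(rows, workers)))
    return sum(partials)
```

For rows of the form `(id, amount)`, `parallel_total(rows)` returns the same value as a serial `sum` over the amounts; only the way the work is divided differs.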

  • Accenture Ab Initio Training 33

    Which parallel processor configuration, SMP or MPP?

    SMP and clustered SMP environments have the flexibility and ability to scale in small increments.

    SMP environments are often useful for the large but static data warehouse, where the data cannot be easily partitioned, due to the unpredictable nature of how the data is joined over multiple tables for complex searches and ad-hoc queries.

  • Accenture Ab Initio Training 34

    Which parallel processor configuration, SMP or MPP?

    MPP works well in environments where growth is potentially unlimited, access patterns to the database are predictable, and the data can be easily partitioned across different MPP nodes with minimal data accesses crossing between them. This often occurs in large OLTP environments, where transactions are generally small and predictable, as opposed to decision support and data warehouse environments, where multiple tables can be joined in unpredictable ways.

    In fact, data warehousing and decision support are the areas most vendors of parallel hardware platforms and DBMSs are targeting.

    MPP does not scale well if heavy data warehouse database accesses must cross MPP nodes, causing I/O bottlenecks over the MPP interconnect, or if multiple MPP nodes are continually locked for concurrent record updates.
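
The difference between a predictable access that stays on one node and an access that crosses nodes can be illustrated with a toy partitioning scheme in Python. Everything here is an assumption for the sake of the example: rows are plain dictionaries, nodes are lists, and ownership is decided by a simple modulo on an integer key.

```python
def node_for(key, num_nodes):
    """Pick the node that owns an integer key (simple modulo partitioning)."""
    return key % num_nodes


def distribute(rows, key_field, num_nodes):
    """Place each row on the node chosen by its partitioning key."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[node_for(row[key_field], num_nodes)].append(row)
    return nodes


def nodes_touched(nodes, predicate):
    """Count how many nodes hold at least one row matching the predicate."""
    return sum(1 for rows in nodes if any(predicate(r) for r in rows))


# Hypothetical order rows, partitioned on customer_id across 4 nodes.
ORDERS = [{"customer_id": c, "product_id": c % 3} for c in range(12)]
NODES = distribute(ORDERS, "customer_id", 4)
```

A lookup on the partitioning key, such as `nodes_touched(NODES, lambda r: r["customer_id"] == 6)`, is answered by a single node. A query on another column, such as `product_id == 0`, finds matching rows on every node, which is the kind of cross-node access the slide warns about.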

  • Accenture Ab Initio Training 35

    A Multi-CPU Computer (SMP)

  • Accenture Ab Initio Training 36

    A Network of Multi-CPU Nodes

  • Accenture Ab Initio Training 37

    A Network of Networks

  • Accenture Ab Initio Training 38

    Parallel Computer Architecture

    Computers come in many shapes and sizes:
    Single-CPU, Multi-CPU
    Network of single-CPU computers
    Network of multi-CPU computers

    Multi-CPU machines are often called SMPs (for Symmetric Multi Processors).

    Specially-built networks of machines are often called MPPs (for Massively Parallel Processors).

  • Accenture Ab Initio Training 39

    Introduction to Ab Initio

  • Accenture Ab Initio Training 40

    History of Ab Initio

    Ab Initio Software Corporation was founded in the mid-1990s by Sheryl Handler, the former CEO of Thinking Machines Corporation, after TMC filed for bankruptcy. In addition to Handler, other former TMC people involved in the founding of Ab Initio included Cliff Lasser, Angela Lordi, and Craig Stanfill.

    Ab Initio is known for being very secretive in the way it runs its business, but its software is widely regarded as top notch.

  • Accenture Ab Initio Training 41

    History of Ab Initio

    The Ab Initio software is a fourth-generation, graphical user interface (GUI)-based parallel processing tool for data analysis, batch processing and data manipulation that is used mainly to extract, transform and load data.

    The Ab Initio software is a suite of products that together provide a platform for robust data processing applications. The core Ab Initio products are:
    The Co>Operating System
    The Component Library
    The Graphical Development Environment

  • Accenture Ab Initio Training 42

    What Does Ab Initio Mean?

    Ab Initio is Latin for "From the Beginning."

    From the beginning our software was designed to support a complete range of business applications, from simple to the most complex. Crucial capabilities like parallelism and checkpointing can't be added after the fact.

    The Graphical Development Environment and a powerful set of components allow our customers to get valuable results from the beginning.

  • Accenture Ab Initio Training 43

    Ab Initio's focus

    Moving Data
        move small and large volumes of data in an efficient manner
        deal with the complexity associated with business data
    High Performance
        scalable solutions
    Better productivity

  • Accenture Ab Initio Training 44

    Ab Initio's Software

    Ab Initio software is a general-purpose data processing platform for mission-critical applications such as:
        Data warehousing
        Batch processing
        Click-stream analysis
        Data movement
        Data transformation

  • Accenture Ab Initio Training 45

    Applications of Ab Initio Software

    Processing just about any form and volume of data.

    Parallel sort/merge processing.

    Data transformation.

    Rehosting of corporate data.

    Parallel execution of existing applications.

  • Accenture Ab Initio Training 46

    Ab Initio Provides For:

    Distribution - a platform for applications to execute across a collection of processors, either within the confines of a single machine or across multiple machines.

    Reduced Run Time Complexity - the ability for applications to run in parallel, from a single point of control, on any combination of computers where the Ab Initio Co>Operating System is installed.

  • Accenture Ab Initio Training 47

    Applications of Ab Initio Software in terms of Data Warehouse

    Front end of Data Warehouse:
        Transformation of disparate sources
        Aggregation and other preprocessing
        Referential integrity checking
        Database loading

    Back end of Data Warehouse:
        Extraction for external processing
        Aggregation and loading of Data Marts

  • Accenture Ab Initio Training 48

    Ab Initio or Informatica - Powerful ETL

    1. Informatica and Ab Initio both support parallelism, but Informatica supports only one type, whereas Ab Initio supports three. In Informatica the developer needs to define partitions in the Server Manager to achieve parallelism. In Ab Initio the tool itself takes care of parallelism, which comes in three types: component parallelism, data parallelism, and pipeline parallelism.

    2. Ab Initio does not have a scheduler like Informatica's; you need to schedule jobs through a script or run them manually.

    3. Ab Initio supports different types of text files, meaning you can read the same file with different structures, which is not possible in Informatica. Ab Initio is also more user friendly than Informatica.

    4. Ab Initio doesn't need a dedicated administrator; a UNIX or NT admin will suffice, whereas other ETL tools do involve dedicated administrative work.
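
    The three types of parallelism named on this slide can be illustrated outside Ab Initio. What follows is a minimal Python sketch, not Ab Initio code: the function names (clean, data_parallel, component_parallel, pipeline) and the sample records are hypothetical, and threads and generators merely stand in for the processes that the Co>Operating System would manage.

```python
from concurrent.futures import ThreadPoolExecutor


def clean(record):
    """Hypothetical transform: trim whitespace and upper-case a name."""
    return record.strip().upper()


def data_parallel(records, ways=3):
    """Data parallelism: partition the data, run the SAME step on each partition."""
    partitions = [records[i::ways] for i in range(ways)]  # round-robin partitioning
    with ThreadPoolExecutor(max_workers=ways) as pool:
        results = list(pool.map(lambda part: [clean(r) for r in part], partitions))
    # Gather the partitions back into a single sorted flow.
    return sorted(r for part in results for r in part)


def component_parallel(customers, orders):
    """Component parallelism: DIFFERENT steps run at the same time on different data."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        sorted_customers = pool.submit(sorted, customers)
        order_count = pool.submit(len, orders)
        return sorted_customers.result(), order_count.result()


def pipeline(records):
    """Pipeline parallelism (approximated): a record moves to the next stage
    before the previous stage has seen the whole input. Generators interleave
    the stages; a real engine would run each stage as its own process."""
    trimmed = (r.strip() for r in records)
    upper = (r.upper() for r in trimmed)
    return list(upper)


if __name__ == "__main__":
    print(data_parallel([" bob", "alice ", "carol"]))
    print(component_parallel(["b", "a"], [1, 2, 3]))
    print(pipeline([" x ", "y"]))
```

    The point of the sketch is only the shape of each technique: one step over many partitions, many steps side by side, and stages chained so that downstream work starts early.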

  • Accenture Ab Initio Training 49

    Ab Initio or Informatica - Powerful ETL - continued

    Error Handling - In Ab Initio you can attach error and reject files to each transformation and capture and analyze the messages and the data separately. Informatica has one huge log, which is very inefficient when working on a large process with numerous points of failure.

    Robust transformation language - Informatica is very basic as far as transformations go. While I will not go into a function-by-function comparison, Ab Initio seems to be much more robust.

    Instant feedback - On execution, Ab Initio tells you how many records have been processed, rejected, and so on, and gives detailed performance metrics for each component. Informatica has a debug mode, but it is slow and difficult to adapt to.
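
    The idea of separate reject and error outputs with per-step record counts can be sketched in plain Python. This is an illustration under stated assumptions, not Ab Initio's actual mechanism: run_transform, parse_amount, and the record layout are hypothetical names chosen for the example.

```python
def run_transform(records, transform):
    """Apply `transform` to each record, routing failures to separate outputs.

    Returns (output, rejects, errors, counts):
      output  - successfully transformed records
      rejects - the original records that failed
      errors  - one message per rejected record, in the same order
      counts  - simple read/written/rejected statistics for this step
    """
    output, rejects, errors = [], [], []
    for record in records:
        try:
            output.append(transform(record))
        except (KeyError, ValueError) as exc:
            rejects.append(record)
            errors.append(f"{type(exc).__name__}: {exc}")
    counts = {"read": len(records), "written": len(output), "rejected": len(rejects)}
    return output, rejects, errors, counts


def parse_amount(record):
    """Hypothetical business rule: require an id and a numeric amount."""
    return {"id": record["id"], "amount": float(record["amount"])}


if __name__ == "__main__":
    sample = [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "abc"}, {"amount": "3"}]
    out, rejects, errors, counts = run_transform(sample, parse_amount)
    print(counts)
    for record, message in zip(rejects, errors):
        print(record, "->", message)
```

    Keeping rejected data and error messages apart from the main output is what makes a failure at one step easy to inspect without searching a single combined log.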

  • Accenture Ab Initio Training 50

    Both tools are fundamentally different

    Which one to use depends on the work at hand and on the existing infrastructure and resources available. Informatica is an engine-based ETL tool: the power of the tool is in its transformation engine, and the code that it generates after development cannot be seen or modified. Ab Initio is a code-based ETL tool: it generates ksh, bat, or similar code, which can be modified to achieve any goals that cannot be taken care of through the ETL tool itself. Ab Initio doesn't need a dedicated administrator; a UNIX or NT admin will suffice, whereas other ETL tools do involve dedicated administrative work.

  • Accenture Ab Initio Training 51

    Ab Initio Product Architecture

    [Figure: architecture diagram. Labels: User Applications; Development Environments (GDE, Shell); Component Library; 3rd Party Components; User-defined Components; The Ab Initio Co>Operating System; Native Operating System (Unix, Windows, OS/390); Ab Initio EME.]

  • Accenture Ab Initio Training 52

    Ab Initio Architecture - Explanation

    The Ab Initio Co>Operating System unites a network of computing resources (CPUs, storage disks, programs, datasets) into a production-quality data processing system with scalable performance and mainframe-class reliability.

    The Co>Operating System is layered on top of the native operating systems of the collection of servers. It provides a distributed model for process execution, file management, debugging, process monitoring, and checkpointing. A user may perform all these functions from a single point of control.

  • Accenture Ab Initio Training 53

    Co>Operating System Services

    Parallel and distributed application execution
        Control
        Data Transport
    Transactional semantics at the application level.
    Checkpointing.
    Monitoring and debugging.
    Parallel file management.
    Metadata-driven components.
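
    Checkpointing, one of the services listed above, means recording which parts of a job have completed so that a restart resumes from there instead of from the beginning. The slide gives no detail on how the Co>Operating System does this, so the following Python sketch only shows the general idea: run_phases, the JSON checkpoint file, and its "completed" field are all assumptions made for illustration.

```python
import json
import os


def run_phases(phases, checkpoint_path):
    """Run `phases` (a list of zero-argument callables) in order.

    After each phase succeeds, the number of completed phases is written to
    `checkpoint_path`. If the file already exists, phases recorded there are
    skipped. Returns the indexes of the phases executed in this run.
    """
    completed = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as handle:
            completed = json.load(handle)["completed"]

    executed = []
    for index, phase in enumerate(phases):
        if index < completed:
            continue  # finished in an earlier run
        phase()  # an exception here leaves the last checkpoint in place
        executed.append(index)
        with open(checkpoint_path, "w") as handle:
            json.dump({"completed": index + 1}, handle)

    os.remove(checkpoint_path)  # the whole job finished, nothing to resume
    return executed
```

    A failed run leaves the checkpoint file behind, so calling run_phases again with the same path picks up at the first phase that did not complete.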

  • Accenture Ab Initio Training 54

    Ab Initio: What We Do

    Ab Initio software helps you build large-scale data processing applications and run them in parallel environments. Ab Initio software consists of two main programs:

    Co>Operating System: which your system administrator installs on a host Unix or Windows NT server, as well as on processing computers.

    The Graphical Development Environment (GDE): which you install on your PC (GDE Computer) and configure to communicate with the host.

  • Accenture Ab Initio Training 55

    The Ab Initio Co>Operating System

    The Co>Operating System:
        Runs across a variety of operating systems and hardware platforms, including OS/390 on mainframe, Unix, and Windows.
        Supports distributed and parallel execution.
        Can provide scalability proportional to the hardware resources provided.
        Supports platform-independent data transport.

  • Accenture Ab Initio Training 56

    The Ab Initio Co>Operating System - Continued

    The Ab Initio Co>Operating System depends on parallelism to connect (i.e., cooperate with) diverse databases. It extracts, transforms and loads data to and from Teradata and other data sources.

  • Accenture Ab Initio Training 57

    [Figure: the Co>Operating System layer as the top layer over any OS (Solaris, AIX, NT, Linux, NCR), with four GDE clients. Notes on the figure: the same Co-Op command works on any OS; graphs can be moved from one OS to another without any changes.]

  • Accenture Ab Initio Training 58

    The Ab Initio Co>Operating System Runs on:

        Sun Solaris
        IBM AIX
        Hewlett-Packard HP-UX
        Siemens Pyramid Reliant UNIX
        IBM DYNIX/ptx
        Silicon Graphics IRIX
        Red Hat Linux
        Windows NT 4.0 (x86)
        Windows 2000 (x86)
        Compaq Tru64 UNIX
        IBM OS/390
        NCR MP-RAS

  • Accenture Ab Initio Training 59

    Connectivity to Other Software

    Common, high performance database interfaces:
        IBM DB2, DB2/PE, DB2 EEE, UDB, IMS
        Oracle, Informix XPS, Sybase, Teradata, MS SQL Server 7
        OLE-DB
        ODBC

    Other software packages:
        Connectors to many other third-party products: Trillium, ERwin, Siebel, etc.

  • Accenture Ab Initio Training 60

    Ab Initio Co>Operating System

    Ab Initio Software Corporation, headquartered in Lexington, MA, develops software solutions that process vast amounts of data (well into the terabyte range) in a timely fashion by employing many (often hundreds of) server processors in parallel. Major corporations worldwide use Ab Initio software in mission-critical, enterprise-wide data processing systems. Together, Teradata and Ab Initio deliver:
        End-to-end solutions for integrating and processing data throughout the enterprise
        Software that is flexible, efficient, and robust, with unlimited scalability
        Professional and highly responsive support

    The Co>Operating System executes your application by creating and managing the processes and data flows that the components and arrows represent.

  • Accenture Ab Initio Training 61

    Graphical Development Environment (GDE)

  • Accenture Ab Initio Training 62

    The GDE

    The Graphical Development Environment (GDE) provides a graphical user interface into the services of the Co>Operating System. The GDE enables you to create applications by dragging and dropping components, and lets you perform point-and-click operations on executable flowcharts. The Co>Operating System can execute these flowcharts directly. Graphical monitoring of running applications allows you to quantify data volumes and execution times, helping you spot opportunities for improving performance.

  • Accenture Ab Initio Training 63

    The Graph Model

  • Accenture Ab Initio Training 64

    The Component Library

    The Component Library: reusable software modules for sorting, data transformation, database loading, etc. The components adapt at runtime to the record formats and business rules controlling their behavior.

    Ab Initio products have helped reduce a project's development and research time significantly.
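
    To make "components adapt at runtime to the record formats" concrete, here is a small Python sketch of a reader whose behavior is driven entirely by a format description passed in as data. It is an analogy only: parse_record, the list-of-pairs format, and the "|" delimiter are hypothetical and do not reflect Ab Initio's own record format syntax.

```python
def parse_record(line, record_format, delimiter="|"):
    """Parse one delimited line according to `record_format`.

    `record_format` is a list of (field_name, converter) pairs. The same
    function handles any layout, because the layout is supplied at runtime.
    """
    values = line.rstrip("\n").split(delimiter)
    if len(values) != len(record_format):
        raise ValueError(
            f"expected {len(record_format)} fields, got {len(values)}"
        )
    return {name: convert(value) for (name, convert), value in zip(record_format, values)}


if __name__ == "__main__":
    # Two different layouts handled by the same reusable function.
    customer_format = [("id", int), ("name", str), ("balance", float)]
    order_format = [("order_id", int), ("sku", str)]
    print(parse_record("7|Ann|12.50", customer_format))
    print(parse_record("1001,AB-3", order_format, delimiter=","))
```

    Because the layout is data rather than code, one module can be reused across datasets simply by handing it a different format description.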

  • Accenture Ab Initio Training 65

    Components

    Components may run on any computer running the Co>Operating System.

    Different components do different jobs.

    The particular work a component accomplishes depends upon its parameter settings.

    Some parameters are data transformations, that is, business rules to be applied to one or more inputs to produce a required output.
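
    The statement that a component's work depends on its parameter settings, some of which are transformations, can be sketched as a generic step configured with functions. This is a hedged Python illustration, not an Ab Initio component: make_component and its transform and select parameters are names invented for the example.

```python
def make_component(transform=None, select=None):
    """Build a generic record-processing step from its parameters.

    transform - optional function applied to each record that passes
    select    - optional predicate; records failing it are dropped
    """
    def component(records):
        for record in records:
            if select is None or select(record):
                yield transform(record) if transform else record
    return component


if __name__ == "__main__":
    # The same building block, given a business rule as a parameter.
    add_total = make_component(
        transform=lambda r: {**r, "total": r["qty"] * r["price"]},
        select=lambda r: r["qty"] > 0,
    )
    orders = [{"qty": 2, "price": 3.0}, {"qty": 0, "price": 1.0}]
    print(list(add_total(orders)))
```

    Changing the parameters, not the component, is what changes the job it performs.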

  • Accenture Ab Initio Training 66

    3rd Party Components

  • Accenture Ab Initio Training 67

    EME

    The Enterprise Meta>Environment (EME) is a high-performance object-oriented storage system that inventories and manages various kinds of information associated with Ab Initio applications. It provides storage for all aspects of your data processing system, from design information to operations data.

    The EME also provides a rich store for the applications themselves, including data formats and business rules. It acts as a hub for data and definitions. Integrated metadata management provides a global and consolidated view of the structure and meaning of applications and data, information that is usually scattered throughout your business.

  • Accenture Ab Initio Training 68

    Benefits of EME

    The Enterprise Meta>Environment provides a rich store for applications and all of their associated information, including:

    Technical Metadata - application-related business rules, record formats, and execution statistics.

    Business Metadata - user-defined documentation of job functions, roles, and responsibilities.

    Metadata is data about data and is critical to understanding and driving your business processes and computational resources. Storing and using metadata is as important to your business as storing and using data.

  • Accenture Ab Initio Training 69

    EME - Ab Initio Relevance

    By integrating technical and business metadata, you can grasp the entirety of your data processing, from operational to analytical systems.

    The EME is a completely integrated environment. The following figure shows how it fits into the high-level architecture of Ab Initio software.

  • Accenture Ab Initio Training 70

  • Accenture Ab Initio Training 71

    Stepwise explanation of Ab Initio Architecture

    You construct your application from building blocks called components, manipulating them through the Graphical Development Environment (GDE).

    You check in your applications to the EME.

    The EME and GDE use the underlying functionality of the Co>Operating System to perform many of their tasks. The Co>Operating System unites the distributed resources into a single "virtual computer" to run applications in parallel.

    Ab Initio software runs on Unix, Windows NT, and MVS operating systems.

  • Accenture Ab Initio Training 72

    Stepwise explanation of Ab Initio Architecture - continued

    Ab Initio connector applications extract metadata from third-party metadata sources into the EME, or extract it from the EME into a third-party destination.

    You view the results of project and application dependency analysis through a Web user interface. You also view and edit your business metadata through a Web user interface.

  • Accenture Ab Initio Training 73

    EME: Various user constituencies served

    The EME addresses the metadata needs of three different constituencies:
        Business Users
        Developers
        System Administrators

  • Accenture Ab Initio Training 74

    EME: Various user constituencies served

    Business users are interested in exploiting data for analysis, in particular with regard to databases, tables, and columns.

    Developers tend to be oriented towards applications, needing to analyze the impact of potential program changes.

    System administrators and production personnel want job status information and run statistics.

  • Accenture Ab Initio Training 75

    EME Interfaces

    We can create and manage the EME through 3 interfaces:
        GDE
        Web User Interface
        Air Utility

  • Accenture Ab Initio Training 76

    Thank You

    End of Session 1
