Chap8 Infrastructure

Jun 04, 2018

  • 8/14/2019 Chap8 Infrastructure

    1/32

    Infrastructure of Data Warehouse

    Ms. Ashwini Rao

    Asst. Prof., IT

    Infrastructure supporting architecture

    Infrastructure

    Elements that enable the architecture to be implemented.

    Operational (helps keep the DW going)
        People
        Procedures
        Training
        Management software

    Physical
        Hardware components
        Operating system
        Network, network software

    Physical Infrastructure

    Features of Hardware & OS

    Hardware
        Scalability
        Vendor support
        Vendor stability
        Vendor reference

    OS
        Scalability
        Security
        Reliability
        Availability
        Preemptive multitasking
        Memory protection

    Mnemonic for the OS features: RS SPAM

    Possible options of Hardware & OS

    Mainframes
        Old hardware
        Designed for OLTP
        Expensive
        Not easily scalable

    Open System Servers
        UNIX servers are the most common choice
        Robust
        Adapted for parallel processing

    NT Servers
        Suited to medium-sized data warehouses
        Limited parallel processing
        Cost effective for small or medium DW

    Platform Options

    A computing platform is the set of hardware components, operating system, network, and network software.

    Both Online Transaction Processing (OLTP) and Decision Support Systems (DSS) need a computing platform.

    Single Platform Option

    All functions, from back-end data extraction to front-end query processing, are performed on one platform.

    Data flows smoothly; no conversions required
    No middleware required

    Limitations
        Legacy platform stretched to capacity
        Non-availability of tools
        Multiple legacy platforms
        Company's migration policy

    Hybrid Platform Option

    Eliminates the drawbacks of the single platform option

    Data extraction: Each source is extracted on its own computing platform.
    Initial reformatting & merging: The extracted files from each source are reformatted & merged on their respective platforms.
    Preliminary data cleansing: Extracted data is verified for missing values & data types.
    Transformation & consolidation: Performed on the platform where the staging area resides.
    Validation & final quality check
    Creation of load images
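
The preliminary data cleansing step listed above (checking extracted data for missing values and data types) can be sketched as follows. This is a minimal illustration only: the column names, the `EXPECTED_TYPES` schema, and the `find_problems` helper are hypothetical and do not come from the slides.

```python
from typing import Any

# Hypothetical schema for one extracted file: column name -> expected type.
EXPECTED_TYPES = {"customer_id": int, "region": str, "amount": float}


def find_problems(rows: list[dict[str, Any]]) -> list[tuple[int, str, str]]:
    """Return (row_index, column, problem) for each missing or mistyped value."""
    problems = []
    for i, row in enumerate(rows):
        for column, expected in EXPECTED_TYPES.items():
            value = row.get(column)
            if value is None or value == "":
                problems.append((i, column, "missing"))
            elif not isinstance(value, expected):
                problems.append((i, column, "wrong type"))
    return problems


# Made-up sample rows standing in for an extracted source file.
extracted = [
    {"customer_id": 1, "region": "East", "amount": 120.5},
    {"customer_id": 2, "region": "", "amount": 80.0},
    {"customer_id": "3", "region": "West", "amount": 42.0},
]
problems = find_problems(extracted)
for row_index, column, problem in problems:
    print(f"row {row_index}: {column} is {problem}")
```

Rows flagged here would be corrected or rejected before the transformation and consolidation step on the staging platform.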

    Data Movement Considerations

    Shared Disk

    Mass Data Transmission
        Through ports

    Real Time Connection
        TCP/IP

    Manual Methods
        External medium

    Data movement options

    Client/Server architecture for DW

    Considerations on client workstations

    Depends on type of users
        Casual user: Web browser and HTML reports
        Analyst: more powerful workstation machine

    A practically feasible solution is a minimum configuration on an appropriate platform that would support a standard set of information delivery tools in the DW.

    Platform options as DW matures

    Parallel processing

    Symmetric multiprocessing (SMP)
    Clusters
    Massively parallel processing (MPP)
    Cache-coherent Non-Uniform Memory Architecture (NUMA)

    Symmetric Multiprocessing

    Features:
        This is a shared-everything architecture, the simplest parallel processing machine.
        Each processor has full access to the shared memory through a common bus.
        Communication between processors occurs through common memory.

    Benefits:
        Provides high concurrency. You can run many concurrent queries.
        Balances workload very well.
        Gives scalable performance. Simply add more processors to the system bus.
        Being a simple design, the server is easy to administer.

    Limitations:
        Available memory may be limited.
        Performance may be limited by bandwidth for processor-to-processor communication, I/O, and bus communication.
        Availability is limited, as in a single computer with many processors.

    Clusters

    Features:
        Each node consists of one or more processors and associated memory.
        Memory is not shared among the nodes; it is shared only within each node.
        Communication occurs over a high-speed bus.
        Each node has access to the common set of disks.
        This architecture is a cluster of nodes.

    Benefits:
        This architecture provides high availability; all data is accessible even if one node fails.
        Preserves the concept of one database.
        This option is good for incremental growth.

    Limitations:
        Bandwidth of the bus could limit the scalability of the system.
        This option comes with a high operating system overhead.
        Each node has a data cache; the architecture needs to maintain cache consistency for internode synchronization.
        Main memory is like a big file cabinet stretching across the entire room.
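
The cache consistency requirement noted above can be illustrated with a small sketch. It assumes a simple write-invalidate scheme, which the slides do not specify; the `SharedDisk` and `Node` classes are hypothetical stand-ins for the shared disks and cluster nodes.

```python
class SharedDisk:
    """Stands in for the common set of disks every node can reach."""

    def __init__(self):
        self.blocks = {}


class Node:
    """A cluster node with a private data cache in front of the shared disk."""

    def __init__(self, name, disk, cluster):
        self.name = name
        self.disk = disk
        self.cache = {}
        self.cluster = cluster
        cluster.append(self)

    def read(self, key):
        # Serve from the local cache, loading from the shared disk on a miss.
        if key not in self.cache:
            self.cache[key] = self.disk.blocks[key]
        return self.cache[key]

    def write(self, key, value):
        # Update disk and local cache, then invalidate stale copies elsewhere.
        self.disk.blocks[key] = value
        self.cache[key] = value
        for other in self.cluster:
            if other is not self:
                other.cache.pop(key, None)


disk = SharedDisk()
disk.blocks["order_total"] = 100
cluster = []
node_a = Node("A", disk, cluster)
node_b = Node("B", disk, cluster)

first_read = node_b.read("order_total")   # node B caches the value
node_a.write("order_total", 250)          # node A updates it
second_read = node_b.read("order_total")  # node B reloads from disk
```

Without the invalidation loop in `write`, node B would keep returning its stale cached value. Every write in this sketch touches every other node, which is one way to picture the internode synchronization overhead listed above.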

    Massively Parallel Processing

    Features:
        This is a shared-nothing architecture.
        This architecture is more concerned with disk access than memory access.
        Works well with an operating system that supports transparent disk access.
        If a database table is located on a particular disk, access to that disk depends entirely on the processor that owns it.
        Internode communication is by processor-to-processor connection.

    Benefits:
        This architecture is highly scalable.
        The option provides fast access between nodes.
        Any failure is local to the failed node; this improves system availability.
        Generally, the cost per node is low.

    Limitations:
        The architecture requires rigid data partitioning.
        Data access is restricted.
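
The rigid data partitioning that a shared-nothing design requires can be sketched as below. The node count, the modulo partitioning rule, and the helper names are assumptions made for illustration; real MPP database products use their own partitioning schemes.

```python
NODE_COUNT = 4  # hypothetical number of MPP nodes

# Each node owns a private store; no other node can read it directly.
nodes = [dict() for _ in range(NODE_COUNT)]


def owner_of(key: int) -> int:
    """Partitioning rule: the owning node is fixed by the key alone."""
    return key % NODE_COUNT


def insert(key: int, row: dict) -> None:
    nodes[owner_of(key)][key] = row


def lookup(key: int):
    # Access goes through the owning node only.
    return nodes[owner_of(key)].get(key)


def total_amount() -> int:
    # Each node aggregates its own partition; only partial sums are exchanged.
    partial_sums = [sum(row["amount"] for row in store.values()) for store in nodes]
    return sum(partial_sums)


for key in range(10):
    insert(key, {"amount": key * 10})

row_seven = lookup(7)
grand_total = total_amount()
```

Because `owner_of` fixes where each row lives, a lookup touches exactly one node. A skewed key distribution would overload some nodes, which is one reason the partitioning has to be planned carefully.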

    NUMA

    Features:
        This is the newest architecture.
        The NUMA architecture is like a big SMP broken into smaller SMPs that are easier to build.
        Hardware considers all memory units as one giant memory. The system has a single real memory address space over the entire machine; memory addresses begin with 1 on the first node and continue on the following nodes. Each node contains a directory of memory addresses within that node.
        In this architecture, the amount of time needed to retrieve a memory value varies because the first node may need the value that resides in the memory of the third node. That is why this architecture is called non-uniform memory access (NUMA) architecture.

    Benefits:
        Provides maximum flexibility.
        Overcomes the memory limitations of SMP.
        Better scalability than SMP.
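
The single address space described above, with addresses starting at 1 on the first node and continuing on the following nodes, can be modelled in a few lines. The node sizes and the local/remote cost figures are invented for illustration and are not measurements of any real machine.

```python
from bisect import bisect_left
from itertools import accumulate

NODE_SIZES = [1000, 1000, 1000]              # hypothetical memory units per node
LAST_ADDRESS = list(accumulate(NODE_SIZES))  # last address held by each node

LOCAL_COST = 1   # assumed relative cost of a local access
REMOTE_COST = 3  # assumed relative cost of reaching another node's memory


def home_node(address: int) -> int:
    """Return the index of the node whose memory holds this address."""
    if not 1 <= address <= LAST_ADDRESS[-1]:
        raise ValueError(f"address {address} is outside the address space")
    return bisect_left(LAST_ADDRESS, address)


def access_cost(requesting_node: int, address: int) -> int:
    """Cost is non-uniform: it depends on where the address lives."""
    return LOCAL_COST if home_node(address) == requesting_node else REMOTE_COST


local = access_cost(0, 500)    # first node reads its own memory
remote = access_cost(0, 2500)  # first node reads memory on the third node
```

The same load instruction costs more when the address belongs to another node, which is the non-uniformity the name refers to.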

    Limitations:
        Programming for the NUMA architecture is more complex than even for MPP.
        Software support for NUMA is fairly limited.
        Technology is still maturing.

    Database Software

    Many operations can be parallelized:
        mass loading of data
        full table scans
        queries with exclusion conditions
        queries with grouping
        selection with distinct values
        aggregation
        sorting
        creation of tables using subqueries
        creating and rebuilding indexes
        inserting rows into a table from other tables
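
As a rough illustration of how a grouped aggregation from the list above can be split across workers, the sketch below scans partitions of a table concurrently and merges the partial results. It uses Python threads purely to show the structure; the table contents and function names are hypothetical, and a database engine would implement this internally rather than in application code.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def scan_partition(rows):
    """Scan one partition and return partial sums per group."""
    partial = Counter()
    for region, amount in rows:
        partial[region] += amount
    return partial


def parallel_group_sum(table, workers=4):
    """Equivalent in spirit to: SELECT region, SUM(amount) ... GROUP BY region."""
    # Deal the rows out to the workers round-robin.
    partitions = [table[i::workers] for i in range(workers)]
    totals = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(scan_partition, partitions):
            totals.update(partial)  # merge step: add the partial sums together
    return dict(totals)


# Made-up sales rows: (region, amount).
sales = [("East", 10), ("West", 5), ("East", 7), ("North", 3), ("West", 1)]
result = parallel_group_sum(sales)
```

The scan of each partition is independent, so only the small partial results need to be combined at the end. The same split, work, and merge shape applies to sorting and index builds, with a different merge step.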

    Types of parallelization

    Software Tools

    Summing up

    Infrastructure acts as the foundation supporting the data warehouse architecture.

    Data warehouse infrastructure consists of operational infrastructure and physical infrastructure.

    Hardware and operating systems make up the computing environment for the DW.

    Several options exist for the computing platforms needed to implement the various architectural components.

    Selecting the server hardware is a key decision. Invariably, the choice is one of the four parallel server architectures.

    Current database software products are able to perform interquery and intraquery parallelization.

    Software tools are used in the data warehouse for data modeling, data extraction, data transformation, data loading, data quality assurance, queries and reports, and online analytical processing (OLAP).

    Tools are also used as middleware, alert systems, and for data warehouse administration.
