2933 West Germantown Pike, Building 2, Suite 204, Fairview Village, PA 19409
Toll Free: 800 618-0836  Fax: 610 666-1006  Email: [email protected]
June 2011
Data Integration Study and Results: ETL Versus Cloud Based Data Integration
Christopher C. Biddle
2933 West Germantown Pike, Building 2, Suite 204, Fairview Village, PA 19409, USA
• Comparison Categories
  – Business Application
  – Platform Deployment
  – Connectivity to Data Sources
  – Synchronization
  – Transformation
  – Data Movement
  – Test, Development and Operations Environments
  – Data Modeling
  – Data Quality & Data Governance
  – Architecture and Standards
• Ideal Systems, Inc. was formed in 1994 and provided Enterprise Software and consulting services
• IdealNet, Inc. was formed in 2002 and absorbed the Ideal Systems, Inc. consulting practice
• Provides expert Business and Technical consulting services to Life Sciences and Financial Institution customers
• Clients include all of the top 10 Tier 1 Pharmaceutical Manufacturers, Leading Financial Institutions including Investment Banks and Commercial Banks, Biotech, Medical Device, Hospital Group Purchasing Organizations (GPOs), and Distributors
• Areas of expertise include Commercial and Regulatory Contracting, Finance, Merger & Acquisition Support, Trading Strategies, Business Intelligence & Analytics, and Sales Force Automation
• Application Development and Customization, Master Data Management and Application Integration
• Many years of experience in implementing, integrating, and upgrading Enterprise Software solutions
• Based on the East Coast in a Philadelphia, PA suburb, with clients throughout the world
• Terminology Varies Considerably
  – ETL (extract/transform/load) – this is the most consistent term
  – Data Integration, EAI (transaction level), EII (failed in the market – extinct), ESB
  – Data Virtualization (instance), Advanced Data Virtualization (persistent metadata server)
  – Data Alignment, Data Synchronization, Data Harmonization
  – Master Data Management, Virtual Master Data Management
  – Cloud: IaaS (Integration as a Service), IaaS (Information as a Service), iPaaS
• Key Parameters
  – Data Integration – regular movement of data
  – Data Migration – one-time movement of data
  – Business Intelligence – loading the BI vendor “domain”
  – Data Warehouse, Data Mart
  – Master Data Management – complex and difficult movement of data
  – Virtual Master Data Management – new version of MDM based upon advanced data virtualization technology
• Cautions
  – One “shoe does not fit all”
  – Cloud and object applications in the cloud don’t fit well with ETL based technologies
• Best Practice
  – Get a metadata discovery tool (note the free Queplix offering amongst others)
  – Clearly define the data sources, the amount of data to be moved, the keys that will help make this work, the frequency of update, and the exact minimum of information to be moved/synchronized/federated prior to selecting a technology
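The best practice above amounts to writing the requirements down in a structured form before touching any tool. A minimal sketch of such a checklist, with illustrative field names that are not taken from any product:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: record the parameters the slide lists (sources, keys,
# volume, frequency, minimum field set) before selecting a technology.
@dataclass
class IntegrationRequirement:
    source: str                      # system the data comes from
    target: str                      # system the data goes to
    row_volume: int                  # approximate rows moved per run
    keys: list = field(default_factory=list)    # fields matching records across systems
    frequency: str = "daily"         # "one-time", "hourly", "daily", "real-time"
    fields: list = field(default_factory=list)  # exact minimum set of fields to move

    def is_migration(self) -> bool:
        # Per the slide's definitions: data migration is a one-time movement,
        # data integration is a regular movement.
        return self.frequency == "one-time"

req = IntegrationRequirement(
    source="CRM", target="ERP", row_volume=50_000,
    keys=["customer_id"], frequency="daily",
    fields=["customer_id", "name", "billing_address"],
)
print(req.is_migration())  # False: regular movement, so this is data integration
```

Filling in one of these per data flow makes the one-time-versus-regular distinction explicit before any vendor conversation starts.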
• Key Parameters
  – On-premise
  – Cloud (Public and/or Private)
  – Software as a Service (SaaS)
• Cautions
  – ETL tools are not designed for cloud models or Software as a Service
  – This includes large enterprise private cloud – problems are not limited to public cloud
  – No basic security mechanisms in ETL for cloud deployment
  – Technologies such as QueCloud support VPN integration and optionally a special security module that implements an enterprise model for cloud-to-enterprise connectivity
• Best Practice
  – Understand the full potential for architectural expansion up front
  – ETL will not adapt to expanded horizons, and this will create large problems later
• Key Parameters
  – Direct database connectivity (insert/update/delete)
  – Application program interface (access only through an API)
  – Flat file (EDI), Standards
• Cautions
  – Programming to proprietary vendor APIs on legacy software such as SAP®, Siebel® and PeopleSoft® requires extensive application-specific skills and may substantially increase the cost of a project
  – Stick with SOA
  – Custom fields – how will you know they exist? How will your vendor handle them?
  – Cloud based applications are object oriented, not relational – find a toolset that works well with objects, not just relational tables
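The object-versus-relational caution above can be made concrete. A cloud application typically returns nested objects (for example JSON over its API), and those must be flattened before they fit a relational table. A minimal sketch, with an illustrative record shape:

```python
# Sketch: flatten a nested object, as a cloud API might return it, into a
# single-level dict that maps onto relational columns. The record below is
# illustrative, not from any specific vendor's API.
def flatten(obj, prefix=""):
    """Flatten a nested dict into one level, joining keys with dots."""
    row = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=name + "."))
        else:
            row[name] = value
    return row

account = {
    "id": "001A", "name": "Acme",
    "billing": {"city": "Philadelphia", "state": "PA"},
}
print(flatten(account))
# {'id': '001A', 'name': 'Acme', 'billing.city': 'Philadelphia', 'billing.state': 'PA'}
```

A toolset that works natively with objects does this traversal (and the reverse mapping) for you; with a purely relational ETL tool it becomes hand-written glue for every custom field.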
• Best Practices
  – Use a metadata discovery tool (2nd time)
  – Use intelligent interfaces that eliminate the need to know a vendor API (Application Software Blades™ – Queplix)
  – No vendor has every interface, or ever will – understand their strategy for adding yours
• Key Parameters
  – Batch
  – Real-time or Near Real-time
  – Other
• Cautions
  – Real-time may mean an ACID transaction – being in a transaction flow is a very different problem than customer data alignment; it is really enterprise application integration, because it requires application-level changes, not just data integration at the database level
  – ETL often requires that you set triggers in a database – database administrators don’t like this
• Best Practice
  – Update (synchronize or harmonize) when a “record of truth” changes, and then update only the fields that need updating – you don’t need to update an entire “row” if you choose the right technology set
  – Understand strategies for real-time and near-real-time that don’t require explicit, database-invasive triggers
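The field-level update idea above can be sketched in a few lines: diff the record of truth against the target copy and emit only the fields that changed, rather than rewriting the whole row or relying on database triggers. Field names here are illustrative:

```python
# Sketch of field-level synchronization: compare the "record of truth"
# against the target system's copy and return only the differing fields.
def changed_fields(truth: dict, target: dict) -> dict:
    """Return the subset of fields where the record of truth differs."""
    return {k: v for k, v in truth.items() if target.get(k) != v}

truth  = {"id": 7, "email": "a@example.com", "phone": "555-0100"}
target = {"id": 7, "email": "a@example.com", "phone": "555-0199"}

delta = changed_fields(truth, target)
print(delta)  # {'phone': '555-0100'}: one field to update, not the whole row
```

In practice a sync engine would detect the change event and apply `delta` to the target; the point is that only the changed field crosses the wire.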
• Key Parameters
  – Target User, Dashboard, Workflow
  – Security, Data Quality
  – Data dictionary – metadata repository
  – Shared library of transforms
  – Test, Development and Production environments
  – Reporting, Analytics, Graphics
  – Disaster recovery
• Cautions
  – How will your vendor support flow from test and development into production operations? Automated back-up?
  – Integration with a software instance is significantly more limiting than a server based product with persistence
  – Hub and spoke architectures require LDAP support and more – ETL doesn’t do this
• Best Practice
  – Understand the difference between test, development and production operations environments – SMB customers need to understand this better
• Key Parameters
  – Connection to data sources
  – Discovery of metadata structure in data sources
  – Representation of metadata structures in data sources
  – Automatic update of metadata in data sources
  – Semantic discovery support
  – Search of metadata across multiple sources
  – Direct access to underlying data from metadata – in a useful and navigable format
  – Ability to model all data sources
  – Virtual structures in metadata to facilitate mapping
  – Lineage of metadata and metadata export
• Cautions
  – This should all be integrated – metadata discovery, data dictionary, semantic discovery, data catalog access, data integration transforms
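At its core, metadata discovery means reading a source's own catalog rather than asking a programmer what the schema is. A minimal sketch against one relational source, using SQLite's catalog; a real tool would run this across many heterogeneous sources and feed a shared data dictionary:

```python
import sqlite3

# Minimal metadata discovery sketch: query the database catalog to learn
# which tables and columns exist, instead of hard-coding the schema.
def discover(conn):
    """Return {table: [column names]} read from the SQLite catalog."""
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'").fetchall()
    for (table,) in tables:
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        catalog[table] = [c[1] for c in cols]  # field 1 of each row is the column name
    return catalog

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, name TEXT, city TEXT)")
print(discover(conn))  # {'customer': ['id', 'name', 'city']}
```

The same pattern, repeated per source type (relational catalog, API schema, object model), is what lets a discovery tool find the custom fields the cautions above warn about.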
• Best Practice
  – Get a metadata discovery tool (3rd time!)
• Key Parameters
  – Basic data quality deeply integrated with the data integration process
    • Development and source clean-up
    • Production environment – ongoing automation?
    • The Cloud ***MUST*** support data quality – or you don’t have a solution
  – Data governance – implementation by business rule
• Cautions
– Data quality problems derail business intelligence and data integration all the time – don’t let it happen to you
– Some 1st and 2nd Generation and most 3rd Generation products integrate data quality – separate products, except for the largest organizations, probably don’t make sense
– Gartner® has noted the convergence of data quality with data integration and other toolsets
• Best Practice
  – Data integration proposals that do not completely address data cleansing, data quality and the associated data governance guidelines won’t produce the results you expect (timeframe, quality, return on investment)
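"Data quality deeply integrated with the data integration process" means validation runs inside the flow, quarantining bad rows before they reach the target. A minimal sketch; the rules and field names are illustrative:

```python
# Sketch: data-quality rules applied during the integration flow itself.
# Each rule is a predicate on one field; failing rows are quarantined
# with the list of fields that failed. Rules here are illustrative.
RULES = {
    "email": lambda v: isinstance(v, str) and "@" in v,
    "state": lambda v: isinstance(v, str) and len(v) == 2,
}

def validate(rows):
    """Split rows into clean rows and (row, failed_fields) rejects."""
    clean, rejected = [], []
    for row in rows:
        errors = [f for f, ok in RULES.items() if f in row and not ok(row[f])]
        if errors:
            rejected.append((row, errors))
        else:
            clean.append(row)
    return clean, rejected

rows = [{"email": "a@x.com", "state": "PA"},
        {"email": "not-an-email", "state": "Penna"}]
good, bad = validate(rows)
print(len(good), len(bad))  # 1 1
```

The quarantine list is what feeds the governance side: each rejected row carries the business rule it violated, so clean-up can be automated in production rather than handled ad hoc.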
• Established ETL Vendors
  – Informatica® is the leader of the pack for ETL – see release 9.1
  – Clearly wins and defines ETL for batch oriented bulk file transfer
  – What does Hadoop® file-level integration support actually deliver? Hadoop®-resident database support is missing, so there is not much today
  – Talend® and Pentaho® are the open source leaders with associated ETL and BI strategies – these are excellent choices
  – ETL doesn’t really scale – hubs for MDM are all manually programmed, expensive to set up and error prone – each connection is essentially standalone
• Data Virtualization Based Data Integration
  – Queplix technology is a solid 3rd Generation product which merits your review – they are clearly the leader in data virtualization today
  – QueCloud (or the on-premise product, Virtual Data Manager™) integrates 2, 3, 4 or more sources, as uniquely enabled by advanced data virtualization – this reduces risk, lowers cost and provides much more capability
  – Lower costs by 50% or more, and increase savings as you add more application integrations
  – Data virtualization is the future of data integration – persistent metadata servers do things that ETL technology cannot do – learn more about it
• Sponsored Message (Queplix, Inc.)
  – Please go to www.queplix.com and download the free Metadata Discovery Tool, or email [email protected] and they will set you up for free
  – Queplix will email out a copy of my report, or follow this link to obtain a copy directly from my website: http://www.idealnetinc.com/IdealNet_Analysis_of_Data_Integration_Technologies.pdf
  – Queplix is running promotions for a free turnkey QueCloud implementation – essentially all setup and 1 year free – please contact [email protected] for the terms and conditions, or see www.netsuite.com for more about that promotion in the partner section
  – Queplix has free video, without registration, accessible from their home page and from their collateral page (under “About”)
  – Queplix has other video and white papers in a registration section
• This Presentation
  – You can request a copy via email from me at [email protected]