reserved. No part of this material may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photographic, or
otherwise, without the explicit written permission of the copyright owners.
by
Rick F. van der Lans
R20/Consultancy BV
Twitter @rick_vanderlans
www.r20.nl
Data Vault + Data Virtualization = Double Flexibility
Rick F. van der Lans
Rick F. van der Lans is an independent consultant, lecturer, and author. He specializes in data warehousing, business intelligence, database technology, and data virtualization. He is managing director of R20/Consultancy B.V. Rick has been involved in various projects in which data warehousing and integration technology were applied.
Rick van der Lans is an internationally acclaimed lecturer. He has lectured professionally for the last twenty-five years in many European and Middle Eastern countries, the USA, South America, and Australia. He has been invited by several major software vendors to present keynote speeches.
He is the author of several books on computing, including his new Data Virtualization for Business Intelligence Systems. Some of these books are available in different languages. The popular Introduction to SQL is available in English, Dutch, Italian, Chinese, and German and is sold worldwide. He also authored The SQL Guide to Ingres and SQL for MySQL Developers.
As author for TechTarget.com and BeyeNetwork.com, writer of whitepapers, chairman of the annual European Enterprise Data and Business Intelligence Conference, and columnist for a few IT magazines, he has close contacts with many vendors.
R20/Consultancy B.V. is located in The Hague, The Netherlands, www.r20.nl. You can get in touch with Rick via:
Email: [email protected]
Twitter: @Rick_vanderlans
LinkedIn: http://www.linkedin.com/pub/rick-van-der-lans/9/207/223
Define data structures
Define ETL logic
Install a database instance
Create a database
Implement the tables
Design physical database structure
Initial load of the tables
Periodic load of the tables
Tune and optimize the database (regularly)
Tune and optimize ETL logic
Monitor database usage
Develop and run backup and recovery processes
Unload data
Change data structure
Change ETL logic
Tune and optimize physical database design
Tune and optimize ETL logic
Reload data
…
Gartner in Data Management Cost-Cutting Tips, March 10, 2008:
Consolidate data marts into an application-neutral data warehouse or smaller data marts to reduce the cost and complexity of the data integration processes feeding the data marts. Gartner predicts this could save you 50 percent of what you're spending to support the siloed data marts.
Cirro Data Hub
Cisco/Composite Information Server
Denodo Platform
IBM InfoSphere Federation Server
Informatica Data Services
Information Builders EII
Oracle Data Services Integrator
Progress Easyl
Red Hat Teiid and JBoss Data Virtualization
Stone Bond Enterprise Enabler Virtuoso
And many more …
All the satellite data is added to hubs and links
A record in a hub table represents a version of a hub object
A record in a link table represents a version of a link object
The hub/link id + startdate are the primary keys
Join with the original Hub table and get the business key(s):
    SELECT HUB1.HUB_ID, SATELLITES.STARTDATE, SATELLITES.ENDDATE, HUB1.BUSINESS_KEY
    FROM HUB1 LEFT OUTER JOIN
         (SELECT HUB_ID, META_LOAD_DTS AS STARTDATE, META_LOAD_END_DTS AS ENDDATE
          FROM HUB1_SATELLITE1
          UNION
          SELECT HUB_ID, META_LOAD_DTS, META_LOAD_END_DTS
          FROM HUB1_SATELLITE2) AS SATELLITES
         ON HUB1.HUB_ID = SATELLITES.HUB_ID
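The start-date query above can be tried out on a few rows. The following is a minimal sketch, assuming an in-memory SQLite database reached through Python's sqlite3 module; the sample rows and the NAME and CITY satellite columns are hypothetical and only serve to show that every satellite change produces a start date for the hub object, and that a hub without satellite data still appears.

```python
import sqlite3

# Hypothetical sample data: hub 1 has two satellites that change at
# different moments; hub 2 has no satellite data at all.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE HUB1 (HUB_ID INTEGER PRIMARY KEY, BUSINESS_KEY TEXT);
CREATE TABLE HUB1_SATELLITE1
  (HUB_ID INTEGER, META_LOAD_DTS TEXT, META_LOAD_END_DTS TEXT, NAME TEXT);
CREATE TABLE HUB1_SATELLITE2
  (HUB_ID INTEGER, META_LOAD_DTS TEXT, META_LOAD_END_DTS TEXT, CITY TEXT);
INSERT INTO HUB1 VALUES (1, 'CUST-001'), (2, 'CUST-002');
INSERT INTO HUB1_SATELLITE1 VALUES
  (1, '2013-01-01 00:00:00', '2013-03-31 23:59:59', 'Jones'),
  (1, '2013-04-01 00:00:00', NULL, 'Jones-Smith');
INSERT INTO HUB1_SATELLITE2 VALUES
  (1, '2013-02-01 00:00:00', NULL, 'The Hague');
""")

# Same shape as the query on the slide; ORDER BY is added only to make
# the output deterministic.
startdates = conn.execute("""
SELECT HUB1.HUB_ID, SATELLITES.STARTDATE, SATELLITES.ENDDATE, HUB1.BUSINESS_KEY
FROM HUB1 LEFT OUTER JOIN
     (SELECT HUB_ID, META_LOAD_DTS AS STARTDATE, META_LOAD_END_DTS AS ENDDATE
      FROM HUB1_SATELLITE1
      UNION
      SELECT HUB_ID, META_LOAD_DTS, META_LOAD_END_DTS
      FROM HUB1_SATELLITE2) AS SATELLITES
     ON HUB1.HUB_ID = SATELLITES.HUB_ID
ORDER BY HUB1.HUB_ID, SATELLITES.STARTDATE
""").fetchall()

for row in startdates:
    print(row)
```

With this sample data, hub 1 yields three start dates (one per satellite record) and hub 2 yields a single row with empty dates, because of the left outer join.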
    SELECT DISTINCT HUB_ID, STARTDATE,
           CASE WHEN ENDDATE_NEW <= ENDDATE_OLD THEN ENDDATE_NEW ELSE ENDDATE_OLD END AS ENDDATE,
           BUSINESS_KEY
    FROM (SELECT S1.HUB_ID,
                 ISNULL(S1.STARTDATE, '1900-01-01 00:00:00') AS STARTDATE,
                 (SELECT ISNULL(MIN(STARTDATE - INTERVAL '1' SECOND), '9999-12-31 00:00:00')
                  FROM STARTDATES AS S2
                  WHERE S1.HUB_ID = S2.HUB_ID
                  AND S1.STARTDATE < S2.STARTDATE) AS ENDDATE_NEW,
                 ISNULL(S1.ENDDATE, '9999-12-31 00:00:00') AS ENDDATE_OLD,
                 S1.BUSINESS_KEY
          -- Assumption: S1 scans the same STARTDATES result as S2
          FROM STARTDATES AS S1) AS S3
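The end-date logic is easier to follow on concrete rows. The sketch below is an adaptation to SQLite, not the dialect used on the slide: IFNULL stands in for ISNULL and datetime(..., '-1 second') for the interval subtraction. The STARTDATES table and its rows are hypothetical and stand for the result of the previous step; the assumption that S1 and S2 both scan STARTDATES is carried over from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical stand-in for the result of the start-date step.
CREATE TABLE STARTDATES
  (HUB_ID INTEGER, STARTDATE TEXT, ENDDATE TEXT, BUSINESS_KEY TEXT);
INSERT INTO STARTDATES VALUES
  (1, '2013-01-01 00:00:00', '2013-03-31 23:59:59', 'CUST-001'),
  (1, '2013-02-01 00:00:00', NULL, 'CUST-001'),
  (1, '2013-04-01 00:00:00', NULL, 'CUST-001'),
  (2, NULL, NULL, 'CUST-002');
""")

# Each version ends one second before the next start date of the same hub
# object, or at the original end date if that comes earlier.
versions = conn.execute("""
SELECT DISTINCT HUB_ID, STARTDATE,
       CASE WHEN ENDDATE_NEW <= ENDDATE_OLD THEN ENDDATE_NEW ELSE ENDDATE_OLD END AS ENDDATE,
       BUSINESS_KEY
FROM (SELECT S1.HUB_ID,
             IFNULL(S1.STARTDATE, '1900-01-01 00:00:00') AS STARTDATE,
             (SELECT IFNULL(MIN(datetime(S2.STARTDATE, '-1 second')),
                            '9999-12-31 00:00:00')
              FROM STARTDATES AS S2
              WHERE S1.HUB_ID = S2.HUB_ID
              AND S1.STARTDATE < S2.STARTDATE) AS ENDDATE_NEW,
             IFNULL(S1.ENDDATE, '9999-12-31 00:00:00') AS ENDDATE_OLD,
             S1.BUSINESS_KEY
      FROM STARTDATES AS S1) AS S3
ORDER BY HUB_ID, STARTDATE
""").fetchall()

for row in versions:
    print(row)
```

The three start dates of hub 1 become three consecutive, non-overlapping versions, and hub 2, which has no satellite data, gets one version running from 1900-01-01 to 9999-12-31.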
    FROM HUB1_VERSIONS
         LEFT OUTER JOIN HUB1_SATELLITE1
         ON HUB1_VERSIONS.HUB_ID = HUB1_SATELLITE1.HUB_ID
         AND (HUB1_VERSIONS.STARTDATE <= HUB1_SATELLITE1.META_LOAD_END_DTS
         AND HUB1_VERSIONS.ENDDATE >= HUB1_SATELLITE1.META_LOAD_DTS)
         LEFT OUTER JOIN HUB1_SATELLITE2
         ON HUB1_VERSIONS.HUB_ID = HUB1_SATELLITE2.HUB_ID
         AND (HUB1_VERSIONS.STARTDATE <= HUB1_SATELLITE2.META_LOAD_END_DTS
         AND HUB1_VERSIONS.ENDDATE >= HUB1_SATELLITE2.META_LOAD_DTS)
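The join condition above is an interval-overlap test: a satellite record belongs to a version when the version starts before the record ends and ends after the record starts. A minimal sketch in SQLite via Python follows. The tables, rows, and the NAME and CITY columns are hypothetical, the SELECT list is an assumption (the fragment above starts at FROM), and open-ended satellite records are assumed to carry the end date 9999-12-31, since a NULL end date would never satisfy the comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical stand-in for the HUB1_VERSIONS view.
CREATE TABLE HUB1_VERSIONS
  (HUB_ID INTEGER, STARTDATE TEXT, ENDDATE TEXT, BUSINESS_KEY TEXT);
CREATE TABLE HUB1_SATELLITE1
  (HUB_ID INTEGER, META_LOAD_DTS TEXT, META_LOAD_END_DTS TEXT, NAME TEXT);
CREATE TABLE HUB1_SATELLITE2
  (HUB_ID INTEGER, META_LOAD_DTS TEXT, META_LOAD_END_DTS TEXT, CITY TEXT);
INSERT INTO HUB1_VERSIONS VALUES
  (1, '2013-01-01 00:00:00', '2013-01-31 23:59:59', 'CUST-001'),
  (1, '2013-02-01 00:00:00', '2013-03-31 23:59:59', 'CUST-001'),
  (1, '2013-04-01 00:00:00', '9999-12-31 00:00:00', 'CUST-001');
-- Assumption: open-ended records use a high end date instead of NULL.
INSERT INTO HUB1_SATELLITE1 VALUES
  (1, '2013-01-01 00:00:00', '2013-03-31 23:59:59', 'Jones'),
  (1, '2013-04-01 00:00:00', '9999-12-31 00:00:00', 'Jones-Smith');
INSERT INTO HUB1_SATELLITE2 VALUES
  (1, '2013-02-01 00:00:00', '9999-12-31 00:00:00', 'The Hague');
""")

# One wide row per version, with the satellite values valid in that period.
supernova_rows = conn.execute("""
SELECT V.HUB_ID, V.STARTDATE, V.ENDDATE, V.BUSINESS_KEY, S1.NAME, S2.CITY
FROM HUB1_VERSIONS AS V
     LEFT OUTER JOIN HUB1_SATELLITE1 AS S1
     ON V.HUB_ID = S1.HUB_ID
     AND (V.STARTDATE <= S1.META_LOAD_END_DTS AND V.ENDDATE >= S1.META_LOAD_DTS)
     LEFT OUTER JOIN HUB1_SATELLITE2 AS S2
     ON V.HUB_ID = S2.HUB_ID
     AND (V.STARTDATE <= S2.META_LOAD_END_DTS AND V.ENDDATE >= S2.META_LOAD_DTS)
ORDER BY V.HUB_ID, V.STARTDATE
""").fetchall()

for row in supernova_rows:
    print(row)
```

The first version has a name but no city yet, because the second satellite only starts loading in February; the left outer joins keep that version in the result with an empty city.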
    FROM LINK LEFT OUTER JOIN
         (SELECT LINK_ID, META_LOAD_DTS AS STARTDATE, META_LOAD_END_DTS AS ENDDATE
          FROM LINK_SATELLITE1
          UNION
          SELECT LINK_ID, META_LOAD_DTS, META_LOAD_END_DTS
          FROM LINK_SATELLITE2) AS SATELLITES
         ON LINK.LINK_ID = SATELLITES.LINK_ID)
    SELECT DISTINCT LINK_ID, STARTDATE,
           CASE WHEN ENDDATE_NEW <= ENDDATE_OLD THEN ENDDATE_NEW ELSE ENDDATE_OLD END AS ENDDATE,
           HUB1_ID, HUB2_ID, EVENTDATE
    FROM (SELECT S1.LINK_ID,
                 ISNULL(S1.STARTDATE, '1900-01-01') AS STARTDATE,
                 (SELECT ISNULL(MIN(STARTDATE - INTERVAL '1' SECOND), '9999-12-31 00:00:00')
                  FROM STARTDATES AS S2
                  WHERE S1.LINK_ID = S2.LINK_ID
                  AND S1.STARTDATE < S2.STARTDATE) AS ENDDATE_NEW,
                 ISNULL(S1.ENDDATE, '9999-12-31') AS ENDDATE_OLD,
                 S1.HUB1_ID, S1.HUB2_ID, S1.EVENTDATE
          -- Assumption: S1 scans the same STARTDATES result as S2
          FROM STARTDATES AS S1) AS S3
A link is joined with all its satellites using the data in the LINK_VERSIONS view:
    FROM LINK_VERSIONS
         LEFT OUTER JOIN LINK_SATELLITE1
         ON LINK_VERSIONS.LINK_ID = LINK_SATELLITE1.LINK_ID
         AND (LINK_VERSIONS.STARTDATE <= LINK_SATELLITE1.META_LOAD_END_DTS
         AND LINK_VERSIONS.ENDDATE >= LINK_SATELLITE1.META_LOAD_DTS)
         LEFT OUTER JOIN LINK_SATELLITE2
         ON LINK_VERSIONS.LINK_ID = LINK_SATELLITE2.LINK_ID
         AND (LINK_VERSIONS.STARTDATE <= LINK_SATELLITE2.META_LOAD_END_DTS
         AND LINK_VERSIONS.ENDDATE >= LINK_SATELLITE2.META_LOAD_DTS)
Data is shown in a filtered manner
Data is shown in aggregated form
Data is shown in one large, highly denormalized table
Data is shown in a star schema form
Data is shown with a service interface
…
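The first two presentation forms in the list above can be sketched as views defined on top of a SuperNova table, so that no data is copied into a physical data mart. This is a minimal illustration in SQLite via Python, not the syntax of any particular data virtualization server; the table SUPERNOVA_HUB1, the view names, and the rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical SuperNova table: one row per version of a hub object.
CREATE TABLE SUPERNOVA_HUB1
  (HUB_ID INTEGER, STARTDATE TEXT, ENDDATE TEXT,
   BUSINESS_KEY TEXT, NAME TEXT, CITY TEXT);
INSERT INTO SUPERNOVA_HUB1 VALUES
  (1, '2013-01-01 00:00:00', '2013-01-31 23:59:59', 'CUST-001', 'Jones', NULL),
  (1, '2013-02-01 00:00:00', '2013-03-31 23:59:59', 'CUST-001', 'Jones', 'The Hague'),
  (1, '2013-04-01 00:00:00', '9999-12-31 00:00:00', 'CUST-001', 'Jones-Smith', 'The Hague'),
  (2, '1900-01-01 00:00:00', '9999-12-31 00:00:00', 'CUST-002', NULL, NULL);

-- Filtered form: only the version that is valid now.
CREATE VIEW HUB1_CURRENT AS
  SELECT HUB_ID, BUSINESS_KEY, NAME, CITY
  FROM SUPERNOVA_HUB1
  WHERE ENDDATE = '9999-12-31 00:00:00';

-- Aggregated form: number of versions per business key.
CREATE VIEW HUB1_VERSION_COUNT AS
  SELECT BUSINESS_KEY, COUNT(*) AS NUMBER_OF_VERSIONS
  FROM SUPERNOVA_HUB1
  GROUP BY BUSINESS_KEY;
""")

current = conn.execute(
    "SELECT * FROM HUB1_CURRENT ORDER BY HUB_ID").fetchall()
counts = conn.execute(
    "SELECT * FROM HUB1_VERSION_COUNT ORDER BY BUSINESS_KEY").fetchall()

print(current)
print(counts)
```

Changing what a report sees then means changing a view definition rather than unloading, restructuring, and reloading a physical data mart.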
Define data structures
Define ETL/DV logic
Install a database instance
Create a database
Implement the tables
Design physical database structure
Initial load of the tables
Periodic load of the tables
Tune and optimize the database (regularly)
Tune and optimize ETL logic
Monitor database usage
Develop and run backup and recovery processes
Unload data
Change data structure
Change ETL/DV logic
Tune and optimize physical database design
Tune and optimize ETL logic
Reload data
…
Not database server independent
More advanced distributed join features
More advanced heterogeneous join features
More advanced caching/refreshing features
Database views offer no lineage/impact analysis
Database views offer only one API: SQL
No versioning of joins
No data cleansing features
No business glossary
…
Data Vault offers data model extensibility and report reproducibility
Data Vault is half the solution
SuperNova (with data virtualization) is the other half
With data virtualization a more flexible reporting and analytical environment can be developed (quickly)
Avoid the (physical) data mart explosion! Go virtual!