8/10/2019 Metadata Managementc
1/16
1. Metadata Management
1.1 Metadata Management Life Cycle
The Metadata Management Life Cycle defines the phases of the end-to-end metadata
management process, from planning through maintenance to retirement of metadata.
1.1.1 Governance and Planning
Governance and Planning involves initial planning, defining the objectives for the metadata management
process, and identifying owners and the associated roles and responsibilities for each of the stakeholders.
The environment provides the ability to ingest and explore any data, including structured, semi-structured and unstructured
data. Given this usage, it is challenging to enforce strict control and governance on the data being
ingested into the Data Warehouse environments, and hence governance of metadata is of relatively
lesser significance in this context.
1.1.2 Metadata Content
Metadata content defines the types of metadata that need to be captured as part of the metadata
management process.
Type of Metadata | Definition / Description
Business Metadata | Business Metadata defines the data in the Warehouse in user friendly terms. Business Metadata captures what data is stored in the Warehouse, where the data is sourced from, how the data is used and its relationship to other data in the Warehouse.
Technical Metadata | Technical Metadata defines the data, objects and processes in the Warehouse from a technical point of view. Technical Metadata captures system metadata such as tables, data elements, indices, partitions in a relational database, files stored in the cluster, security classification for the data elements etc.
Operational Metadata | Operational Metadata (or sometimes also referred to as the Process Metadata) is the data about the processes in the Warehouse. Operational Metadata captures process schedules, frequency of batch processes, status summary and usage statistics for various processes etc.
Business Rules & Transformation Rules | Business Rules and Transformation Rules related metadata capture the rules applied on data elements during the data acquisition, data ingestion or data extraction and loading processes in the Data Warehouse. In some cases, this metadata can also be used to dynamically process and load the source data feeds into the Data Warehouse.
System Statistics | System Statistics related metadata captures data related to system resource utilization for proactive monitoring and maintenance within a Data Warehouse environment.
Metadata for Downstream Process | Metadata for downstream processes captures the Technical Metadata including mapping of data elements from the Warehouse to downstream processes or applications such as BI tools, analytical models or any other downstream applications.
1.1.3 Metadata Capture Strategy
Metadata capture strategy defines the process and / or tools that need to be used for capturing the
required metadata. Strategy for metadata capture can include multiple tools / approaches based on
the type of data and feasibility constraints. The strategy outlines the guidelines for using an
appropriate tool or mechanism for identified use cases.
1.1.4 Metadata Model and Integration
Metadata Modelling defines the data modelling strategy for the metadata repository. Metadata
Integration defines the approach for integration of various types of metadata including integration
from various metadata repositories, if applicable.
1.1.5 Metadata Visibility
Metadata Visibility defines the processes associated with enabling access to the metadata elements,
types of analyses and use-cases for usage of metadata by end-users.
1.1.6 Metadata Standards and Quality
Metadata Standards and Quality are of relatively lesser significance compared to the other phases in
the context of Data Warehouse. Metadata is created once and is occasionally used by a limited set of
users. Hence typically Organizations do not invest in tracking or enhancing the quality of metadata
captured either through an automated process or through a manual process.
1.1.7 Maintenance and Retirement
Maintenance and Retirement defines the following aspects of the metadata management processes:
• Purging and archival of obsolete metadata (Operational Metadata, for example)
• Restructuring and enhancements to the Metadata Model
• Processes and governance for ensuring accuracy and timeliness of the metadata captured with on-going changes and project releases
1.2 Metadata Content
This section details the recommended metadata data elements that need to be captured for the various types of
Metadata as part of the Metadata Management strategy for the environment.
1.2.1 Business Metadata
Following are the recommended Business Metadata data elements that need to be captured. The Conceptual Model
and Logical Model information is also stored in the Business Metadata for ease of use and to support impact
analysis for any business changes.
Metadata Data Elements | Level
Source Feed Business Name | Source Feed
Source Feed Business Description | Source Feed
Source Feed Usage | Source Feed
Source Feed Group Name | Source Feed
External Data Source Indicator | Source Feed
Source Host Code Name | Source Feed
Source Feed Business Owner / Contact | Source Feed
Source Feed Technical Contact | Source Feed
Source Column Business Name | Source Column
Source Column Business Description | Source Column
Target File Business Name | Target File
Target File Business Description | Target File
Target File Usage | Target File
Subject Area | Target File
Data Security Classification | Target File
Target Column Business Name | Target Column
Target Column Business Description | Target Column
Target Column Synonym(s) | Target Column
1.3 Technical Metadata
Following are the recommended Technical Metadata data elements that need to be captured for the ODS, Data
Warehouse, Data Marts and Source Systems. These should be captured for all sources, targets and extracts provided.
Level | Metadata Data Elements
Source Feed | Source Feed Name
Source Feed | Source Database Name
Source Feed | Source Table Technical Name
Source Feed | Source Data File Name
Source Feed | Source Feed Group Name
Source Feed | Source Host Type
Source Feed | Source System Code Name
Source Feed | Source Feed Format Type
Source Feed | Source File Layout Definition (XSD / JSON etc.)
Source Feed | Source Trigger File Name
Source Feed | Source Trigger File Type and Format
Source Feed | Source Encryption Method
Source Feed | Source Feed Profile Path
Source Feed | Source Feed Delivery Frequency
Source Feed | Exception Days for the Source Feed
Source Feed | Expected Delivery Time of the Source Feed
Source Feed | Expected Number of Records
Source Feed | Number of Columns (Source Feed)
Source Column | Source Column Technical Name
Source Column | Source Column Data Format
Source Column | Source Column Data Type
Source Column | Source Column Data Length
Source Column | Required / Optional (NULL) Indicator
Target File | Target File Name
Target File | Target File Format Type
Target File | Target File Layout Definition (XSD / JSON etc.)
Target File | HDFS Location (Directory Path)
Target File | Target Data Security (ARD Role)
Data Source | Ingestion Method / Extraction Method
Target File | Archive Location
Target File | Target Encryption Method
Target Object | Target Resource Size
Target File / Table | Update Frequency
Target File / Table | Update Type
Target Column | Target Column Technical Name
Target Column | Target Column Data Format
Target Column | Target Column Data Type
Target Column | Target Column Data Length
Target Column | Expression / Transformation (Source to Target)
Column | Column Delimiter Used
Column | System of Record / System of Reference
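As an illustration only, a few of the Source Feed and Source Column elements listed above could be held in a small structure such as the following. This is a minimal sketch: the class names, field names and sample values are hypothetical and cover only a subset of the listed elements.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SourceColumn:
    """Subset of the Source Column level elements (names are illustrative)."""
    technical_name: str
    data_type: str
    data_length: Optional[int] = None
    data_format: Optional[str] = None
    required: bool = False  # Required / Optional (NULL) Indicator


@dataclass
class SourceFeed:
    """Subset of the Source Feed level elements (names are illustrative)."""
    feed_name: str
    system_code_name: str
    format_type: str
    delivery_frequency: str
    expected_number_of_records: Optional[int] = None
    columns: List[SourceColumn] = field(default_factory=list)

    @property
    def number_of_columns(self) -> int:
        # Derived from the captured columns rather than entered by hand.
        return len(self.columns)


# Hypothetical sample feed, for illustration only.
feed = SourceFeed(
    feed_name="customer_daily",
    system_code_name="CRM",
    format_type="delimited",
    delivery_frequency="daily",
    columns=[
        SourceColumn("CUST_ID", "INTEGER", required=True),
        SourceColumn("CUST_NAME", "VARCHAR", data_length=100),
    ],
)
print(feed.number_of_columns)
```

Deriving the column count from the captured columns avoids keeping two values in sync by hand; whether that suits a given repository design is a separate decision.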
1.3.1 Operational Metadata
Following are the data elements recommended to be captured as part of the Operational Metadata. The
Operational Metadata captured does not vary based on the source system or the type of the source data.
Operational Metadata data elements can be classified into 2 broad categories, Data Movement and Data Usage,
for each of the source data types.
Following are the recommended Operational Metadata data elements that need to be captured
Metadata Data Elements Structured Unstructured
Data Movement Metadata
Source Feed Delivery Time SLA
Source Feed Delivery Time (Actual)
Source Feed Exception Indicator
Source Feed Exception Details
Number of Records Received
Expected Number of Columns
Actual Number of Columns Received
Data Load Rule Name
Data Load Rule Threshold Type
Data Load Rule Failure Value
Data Load Rule Last Failure Date and Time
Business Date
Last Data Load Date and Time
Data As of Date
Job Name
Job Description
Job Location
Job Type (Batch / Real-Time etc.)
Job Execution Frequency
Job Execution Start Time
Job Execution End Time
Job Status
Job Completion Time SLA
Job Execution Exception Indicator
Job Execution Exception Type
Job Execution Exception Details
Number of Success Records
Number of Exception Records
Number of Rejected Records
Data Usage Metadata
Access Count
Last Access Date and Time
Last Access User / Process
Number of Queries / Extractions
Last Extraction Date and Time
Output Protocol (FTP, Tumbleweed etc.)
1.3.2 Business Rules and Transformation Rules
Following are the recommended Business Rules and Transformation Rules related Metadata data elements that
need to be captured
Metadata Data Elements File Level Column Level
Rule Name
Rule Type
Rule Level Name
Rule Threshold Type
Alert Threshold Value
Abort Threshold Value
Rule Default Value
Trigger Field Name
Rule Filter Condition
Rule Parameter Name
Rule Parameter Value
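The rule elements above name a threshold type, an alert threshold and an abort threshold but do not define how they are evaluated. The sketch below shows one plausible reading, assuming that the threshold type is either a percentage or a count of failed records, and that crossing the alert value raises a warning while crossing the abort value stops the load. All names and values are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    PASS = "pass"
    ALERT = "alert"
    ABORT = "abort"


@dataclass
class LoadRule:
    name: str
    threshold_type: str      # assumed values: "percent" or "count"
    alert_threshold: float
    abort_threshold: float

    def evaluate(self, failed_records: int, total_records: int) -> Outcome:
        """Compare the observed failures against the alert and abort thresholds."""
        if self.threshold_type == "percent":
            observed = 100.0 * failed_records / total_records if total_records else 0.0
        elif self.threshold_type == "count":
            observed = float(failed_records)
        else:
            raise ValueError(f"unknown threshold type: {self.threshold_type}")
        if observed >= self.abort_threshold:
            return Outcome.ABORT
        if observed >= self.alert_threshold:
            return Outcome.ALERT
        return Outcome.PASS


# Hypothetical rule: alert at 1% failed records, abort at 5%.
rule = LoadRule("customer_id_not_null", "percent", 1.0, 5.0)
print(rule.evaluate(0, 1000))    # Outcome.PASS
print(rule.evaluate(20, 1000))   # Outcome.ALERT (2%)
print(rule.evaluate(60, 1000))   # Outcome.ABORT (6%)
```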
1.3.3 System Statistics
Following are the recommended System Statistics that need to be captured. The metadata data elements listed
are high-level statistics, each of which can comprise one or more detailed statistics. The detailed list of system statistics
that can be captured depends on the operating system, the monitoring tools used etc. The table below provides
examples of detailed statistics for each category
Metadata Data Elements | Examples
CPU Utilization | CPU Utilization of System Processes, CPU Utilization of Applications / Users, CPU Idle Time etc.
Memory Utilization | Total Physical Memory, Memory used for Swap, Memory Used for Caching etc.
Storage Utilization | Total Space Available, Utilized Space
I/O Utilization | Number of Transfers per Second, Data Reads (kB/s), Data Writes (kB/s), I/O Wait Time, Reads per Second, Writes per Second etc.
1.4 Metadata Capture Strategy
In the context of the Data Warehouse, Metadata is captured only in the production environment.
The approach or strategy for capturing the Metadata for the Warehouse can be broadly classified into the following
categories:
• Metadata capture for structured data
• Metadata capture for semi-structured / unstructured data sources
• Metadata capture for downstream processes from the Warehouse
The following table summarizes the metadata capture strategy by type of Metadata
Metadata Type | Options
Business Metadata | Sourced from Commercial BI Metadata Repository; Manual Capture
Technical Metadata | Sourced from Commercial BI Metadata Repository; Auto-Capture (from system tables / repositories); Manual Capture
Operational Metadata | Published to Metadata Repository; Auto-Capture (from Application Repositories)
Business Rules & Transformation Rules | Custom Manual Capture (through the portal)
System Statistics | Auto-Capture
Metadata for Downstream Processes | Manual Capture
1.4.1 Business Metadata
Business metadata provides the data definition for each of the data elements processed and loaded into the
Warehouse. The metadata management process should provide a mechanism for manual capture of Business
Metadata during the design phase.
Following are the general guidelines for capturing the Business Metadata
• For structured data sources
  o If the Business Metadata is available within the Source Metadata Repository, the required data
    elements should be sourced and loaded into the Data Warehouse Metadata Repository
  o If the Business Metadata is not available within the Source Metadata Repository, the data owner
    responsible for the movement of the data from Source to Data Warehouse should provide the
    business metadata. The metadata can be captured manually using a customized template used
    for the Metadata Management process.
    ▪ Data Stewards or Analysts responsible for capturing (creating) the business metadata
      should be able to upload the metadata through a self-service portal. This would enable
      authentication and authorization for the users capturing or creating the metadata.
    ▪ Alternatively, Data Stewards or Analysts can be provided with a UI on the portal for
      creating the business metadata that cannot be sourced programmatically.
• For any other source data feeds and target objects (in all cases), business metadata should be captured
  using the manual capture process. When the data is captured through the manual process
  o Metadata should be certified, validated and released
The table below captures the details of metadata capture by layer for Business Metadata
Layer | When | Metadata Capture Strategy | Responsible Party
Data Access Layer | Design Phase | Manual Capture | Business Analysts
Data Storage Layer | Design Phase | Manual Capture | Business Analysts
1.4.2 Technical Metadata
Technical metadata captures the details of how, what and where the data elements are stored within the Data
Warehouse environments. Given the multitude of options for modelling and storing the various types of data in a
Data Warehouse, the Technical Metadata captured varies based on the type of data being sourced or ingested into
the Data environment.
The table below captures the details of metadata capture by layer for Technical Metadata
Layer | When | Metadata Capture Strategy | Responsible Party
Data Access Layer | Design Phase | Auto-Capture | Data Stewards
Data Access Layer | Design Phase | Manual Capture | Data Stewards
Data Landing Layer | Design Phase | Auto-Capture | Data Stewards
Data Integration Layer | Design Phase | Manual Capture | Data Stewards
Data Storage Layer | Design / Development Phase | Auto-Capture | Data Stewards
Data Storage Layer | Design Phase | Manual Capture | Data Stewards
1.4.3 Operational Metadata
Operational Metadata captures data from the auditing and logging for data acquisition, data transformation and
loading processes, BI usage data, details around data integration job and report execution times etc.
The approach and guidelines for capturing the Operational Metadata depend on the type of operational data
being captured and can be broadly classified into the following categories:
• Operational Metadata for Data Movement
• Operational Metadata for BI and Analytics
The Metadata Management process implemented should capture the Operational Metadata for data movement
during the actual job execution. The metadata should be captured programmatically without any manual
intervention. Operational Metadata for Data Usage, however, can be extracted on a periodic basis and can be
scheduled.
Metadata Repository
• An Operational Metadata repository should be created for the Data Warehouse
• It is recommended to implement a metadata repository at least for Operational Metadata irrespective of
  the Data Modelling strategy adopted
• If an integrated Metadata Repository is implemented, the Operational Metadata can be part of the
  repository (subject area approach)
Guidelines
Following are the general guidelines for capturing Operational Metadata for Data Movement
• A common approach is used for capturing Operational Metadata for structured, semi-structured and
  unstructured data
• Metadata capture should be event driven and required data elements should be published into the
  metadata repository as soon as the data movement process / cycle completes
• Data Ingestion, Data Extraction and the Data Load processes should have a mechanism to publish the
  required data elements into the Operational Metadata repository
  o The data elements may be published using pre and post processing scripts for the batch
    processes
  o Alternatively, a control script can continuously monitor the batch process and publish the required
    data elements into the operational metadata repository
Following are the general guidelines for capturing Operational Metadata for BI and Analytics
• Operational Metadata for BI and analytics will be primarily sourced from the application repositories
• Metadata capture can be batch oriented, with the ability to support intra-day batches
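The pre and post processing approach for data movement described above can be sketched as a wrapper that records job start and end times, status and record counts, and publishes them as soon as the job completes. This is a minimal sketch under stated assumptions: an SQLite table stands in for the Operational Metadata repository, and the table, column, function and job names are all hypothetical.

```python
import sqlite3
from contextlib import contextmanager
from datetime import datetime, timezone


def open_repository(path=":memory:"):
    """Open a stand-in Operational Metadata repository (SQLite, for illustration)."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS job_run (
               job_name TEXT, start_time TEXT, end_time TEXT, status TEXT,
               exception_details TEXT, success_records INTEGER,
               rejected_records INTEGER)"""
    )
    return conn


@contextmanager
def capture_job_run(conn, job_name):
    """Record start time before the job and publish the run when it finishes."""
    counts = {"success_records": 0, "rejected_records": 0}
    start = datetime.now(timezone.utc).isoformat()   # pre-processing step
    status, details = "SUCCESS", None
    try:
        yield counts                                 # the job body runs here
    except Exception as exc:
        status, details = "FAILED", repr(exc)
        raise
    finally:
        end = datetime.now(timezone.utc).isoformat() # post-processing step
        conn.execute(
            "INSERT INTO job_run VALUES (?, ?, ?, ?, ?, ?, ?)",
            (job_name, start, end, status, details,
             counts["success_records"], counts["rejected_records"]),
        )
        conn.commit()


conn = open_repository()
with capture_job_run(conn, "load_customer_daily") as counts:
    # A real job would move data here; the counts are made up for the example.
    counts["success_records"] = 980
    counts["rejected_records"] = 20

print(conn.execute("SELECT job_name, status, success_records FROM job_run").fetchall())
```

Because the insert happens in the `finally` clause, a failed job is still published, with its status and exception details, before the error propagates.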
The table below captures the details of metadata capture by layer for Operational Metadata
Layer | When | Metadata Capture Strategy | Responsible Party
Data Integration Layer | Data Movement | Auto-Capture |
Data Storage Layer | Post Go-Live, on regular basis | Auto-Capture |
1.4.4 Business Rules & Transformation Rules
Business Rules and Transformation Rules applied to the data sourced into the Data environment are always
captured through a custom manual process. This section provides the general guidelines for capturing the Business
Rules and / or Transformation Rules based on the type of data
Structured Data
• Business Rules and Transformation Rules should be captured as separate rules
• Applicable Business Rules and Transformation Rules should be captured at both Source Table level as well as
  Source Column level
• Linkage between the Business Rules and Transformation Rules should be established through the source
  object
• Multiple rules may be associated with a given Source Table or Source Column
• Rules may either be captured and stored in the metadata repository (database) or maintained as Excel
  files associated with the source object
Semi-Structured / Unstructured Data
• Business Rules and Transformation Rules should be captured as separate rules
• Rules should be captured at source feed level
• Multiple rules may be associated with a given source feed
• It is recommended to capture the rules using Excel files associated with the source objects
  o Business rules can be optional at field level
  o Transformation rules applicable to field level may be captured in the Excel files
Business Rules and Transformation Rules related metadata is dependent on the Technical Metadata for the source
data feeds or source data elements. In order to ensure data quality and accuracy of the metadata, it is
recommended to capture the business rules and transformation rules metadata through a UI on the portal with
the following checks and balances:
• Source data feeds and data elements should be pre-populated from the Technical Metadata available in
  the metadata repository
• End-users should not be able to edit or modify the source data elements
• The UI can have basic validations to ensure mandatory metadata elements are captured
• The UI should also have a provision to allow users to upload a file with the rules either at source data feed
  level or at source data element level
• Users should be able to edit, update or delete any rules entered through the UI
The table below captures the details of metadata capture by layer for Business Rules and Transformation Rules
related Metadata
Layer | When | Metadata Capture Strategy | Responsible Party
Data Integration Layer | Design Phase | Manual Capture (Custom Process) |
1.4.5 System Statistics
System Statistics for the Warehouse environment should be captured using automated capture from the system
logs or through the use of system monitoring tools and utilities.
Following are the general guidelines for capturing System Statistics
• System statistics should always be captured using an automated process
• Key utilization statistics such as CPU or memory utilization should be tracked continuously
• Utilization statistics for other resources such as storage may be captured on a periodic basis
The table below captures the details of metadata capture by layer for System Statistics
Layer | When | Metadata Capture Strategy | Responsible Party
Data Landing Layer | Post Go-Live, on regular basis | Auto-capture | System Administrators
Data Integration Layer | Post Go-Live, on regular basis | Auto-capture | System Administrators
Data Storage Layer | Post Go-Live, on regular basis | Auto-capture | System Administrators
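As a small illustration of automated capture, the sketch below takes a storage utilization snapshot using only the Python standard library. It is an assumption-laden example rather than a recommended tool: the function name and the shape of the returned record are hypothetical, and CPU, memory and I/O statistics would normally come from the operating system or a dedicated monitoring tool.

```python
import shutil
from datetime import datetime, timezone


def capture_storage_utilization(path="."):
    """Return a storage utilization snapshot for the file system holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    pct = round(100.0 * usage.used / usage.total, 2) if usage.total else 0.0
    return {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "path": path,
        "total_bytes": usage.total,
        "used_bytes": usage.used,
        "utilization_pct": pct,
    }


snapshot = capture_storage_utilization(".")
print(snapshot["utilization_pct"])
```

A scheduler could call such a function periodically and publish each snapshot to the metadata repository, in line with the guideline that storage statistics may be captured on a periodic basis.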
1.4.6 Metadata for Downstream Processes
Metadata for the downstream processes comprises business metadata for the target objects and technical
metadata for the target objects, including the lineage from the Warehouse / Hadoop to the downstream data
repositories (data marts / Hive / HBase etc.), BI tools or analytical models. This metadata is required to enable
complete lineage analysis from the source systems to the target applications.
Following are the general guidelines for capturing the metadata for downstream processes
• Business Analysts or the data stewards responsible for moving the data from the Data Warehouse to the
  downstream applications should be primarily responsible for capturing the Business Metadata elements
• Technical SMEs / technical point-of-contact for the downstream applications should be primarily
  responsible for capturing the Technical Metadata including the lineage metadata
• Any business rules and transformation rules applied should be captured at both Entity and Attribute level
The table below captures the details of metadata capture by layer for Metadata for Downstream Processes
Layer | When | Metadata Capture Strategy | Responsible Party
Data Storage Layer | Design Phase | Manual Capture | Business Analysts, Data Analysts, Data Stewards
1.5 Metadata Modeling and Integration
Metadata modelling defines the approach or data modelling strategy for the metadata repository. This section
describes various options for metadata modelling and provides a comparative analysis between each of the
options.
1.5.1 Metadata Refresh
Metadata Refresh defines the process and frequency for capturing and updating the metadata on an on-going
basis. The processes and frequency of Metadata refresh vary based on the type of Metadata and the
environment for which Metadata is being captured and refreshed.
The table below provides a consolidated view of the Metadata refresh strategy for each of the environments
Type of Metadata | Description
Business Metadata
• Metadata is created
• Initial Metadata captured during Design Phase
• Metadata needs to be updated continuously whenever there is a change to
  source data feed or target structures, enforced as part of the code release
  process
Technical Metadata
• Metadata is created
• Metadata that needs to be captured manually is created during the Design
  Phase
• Metadata captured using automated process is initially created during the
  development phase and certified before code release
• Metadata needs to be updated continuously whenever there is a change to
  source data feed or target structures, enforced as part of the code release
  process
Operational Metadata
• Data Movement related Operational Metadata is captured using event
  driven approach, but on ad-hoc basis
• Data Usage related Operational Metadata can be captured on a need basis
  (Optional)
Business Rules and Transformation Rules
• Rules related Metadata should be created
• Initial metadata should be created after the Technical Metadata is sourced
  into the repository
• Metadata should be updated on a continuous basis, as and when there is a
  need for change, using the custom manual approach defined
System Statistics
• Captured using automated process on a need basis
• Needs to be captured and maintained on a regular basis only if required (for
  a usage based charge-back mechanism, for example)
Metadata for Downstream Processes / Applications
• For any downstream applications designed, metadata should be created in
  environment
• Metadata should be captured during the Design phase
1.6 Metadata Visibility
Visibility or access to the Metadata captured for the Data Warehouse should be enabled only through a standard
intranet portal. The portal should provide the following functionalities
• Provide a layer of abstraction for the metadata capture, integration and storage aspects
• Ability to authenticate users accessing the portal
  o It is assumed that there is no need for user authorization (data security)
• Ability to search on the metadata captured, using any of the use-cases identified
  o Provide a layer of abstraction between the User Interface and the underlying data elements on
    which the search operation is performed. For example a basic search on UI for table name
    could perform a search on table technical name, table business name, table business description
    and the source data file name.
  o Provide ability to perform advanced search using a combination of search criteria. For example
    search for a given table name within a subject area for a given Market.
  o Pagination of the search results for better readability
  o Ability to sort the search results on predefined criteria including search relevance (this use case
    may need further discussion and elaboration)
  o Should provide ability to export the search results to Excel for offline analysis
• Ability to establish data lineage for data entities and elements within the Data Warehouse
  o Should support bi-directional lineage analysis
  o Completeness and quality of data lineage information will be dependent on the accuracy and
    completeness of the metadata captured either through automated process or through the
    manual capture process
• Ability to generate and view standard operational reports
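The basic search behaviour described in the list above, where one search term is matched against the table technical name, business name, business description and source data file name, can be sketched as follows, together with simple pagination. This is a minimal in-memory illustration; the class, function names and sample catalog entries are hypothetical, and a real portal would query the metadata repository instead.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TableMetadata:
    technical_name: str
    business_name: str
    business_description: str
    source_file_name: str


# Fields covered by a basic "table name" search, as described in the text.
SEARCH_FIELDS = ("technical_name", "business_name",
                 "business_description", "source_file_name")


def basic_search(catalog: List[TableMetadata], term: str) -> List[TableMetadata]:
    """Case-insensitive substring match across all basic search fields."""
    needle = term.lower()
    return [t for t in catalog
            if any(needle in getattr(t, f).lower() for f in SEARCH_FIELDS)]


def paginate(results: list, page: int, page_size: int = 10) -> list:
    """Return one page of results; pages are numbered from 1."""
    start = (page - 1) * page_size
    return results[start:start + page_size]


# Hypothetical catalog entries, for illustration only.
catalog = [
    TableMetadata("CUST_MSTR", "Customer Master", "One row per customer", "cust_master.dat"),
    TableMetadata("ORD_HDR", "Order Header", "Orders placed by a customer", "orders.dat"),
    TableMetadata("PROD_DIM", "Product", "Product reference data", "products.dat"),
]

results = basic_search(catalog, "customer")
print([t.technical_name for t in results])                    # ['CUST_MSTR', 'ORD_HDR']
print([t.technical_name for t in paginate(results, 2, page_size=1)])  # ['ORD_HDR']
```

Keeping the list of searched fields in one place is what provides the layer of abstraction: the UI passes a single term and does not need to know which underlying data elements are consulted.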
Following are the general guidelines with respect to the Metadata Visibility
• End users of metadata (data analysts, for example) should never be provided direct access to the
  metadata repository database tables or the Excel files within the Data Warehouse
• Only system administrators and technical SMEs for the Data Warehouse may have direct access to the
  metadata repository including the physical storage
• Access to metadata environments should be enabled through separate user interfaces (separate
  portals, sub-sites etc.)
1.6.1 User Groups and Associated Usage
This section captures the details of the target user groups who would need access to the portal and their
associated usage of the portal, in each of the environments
1.6.2 Metadata Analysis & Usage
The Metadata Repository portal supports the following types of analysis and usage of the metadata captured.
Lineage Analysis
Lineage analysis is one of the key requirements for the proposed Metadata Management solution. The metadata
captured should support the following types of lineage analysis
• For structured data extracted from Source, the metadata in the Data Warehouse Metadata Repository should
  support bi-directional lineage analysis from the tables in Source / Warehouse to the Data Warehouse or
  any downstream applications from the Data Warehouse
  o The metadata should support lineage analysis at table and column level
  o For each of the tables / files from Source, the System of Record information for the original
    source feed may be made available as additional information. However, the lineage from the
    original source data feed to the Source files / tables will be out of scope for lineage analysis
  o The completeness of lineage metadata will be dependent on the process implemented for
    capturing the metadata for downstream processes / applications
• For semi-structured or unstructured data sources, the metadata captured should support lineage analysis
  as follows
  o Bi-directional lineage analysis at object level (web files, video files etc.)
  o For data sources like IVR where each transaction can potentially contain an audio file, lineage
    analysis should capture the linkage of audio files to the transaction and the source feed
  o For structure metadata captured as part of unstructured data sources, the metadata should
    support lineage analysis at column (data element) level
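Bi-directional lineage analysis as described above amounts to walking source-to-target mappings in either direction. The following is a minimal sketch assuming that lineage is captured as pairs of (source element, target element); the class name, method names and element identifiers are hypothetical.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Source-to-target mappings, indexed in both directions."""

    def __init__(self):
        self._downstream = defaultdict(set)
        self._upstream = defaultdict(set)

    def add_mapping(self, source: str, target: str) -> None:
        self._downstream[source].add(target)
        self._upstream[target].add(source)

    @staticmethod
    def _walk(start: str, edges) -> set:
        # Breadth-first traversal collecting every reachable element.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        seen.discard(start)
        return seen

    def downstream(self, element: str) -> set:
        """Everything that is fed, directly or indirectly, by `element`."""
        return self._walk(element, self._downstream)

    def upstream(self, element: str) -> set:
        """Everything that `element` is derived from, directly or indirectly."""
        return self._walk(element, self._upstream)


# Hypothetical column-level mappings, for illustration only.
graph = LineageGraph()
graph.add_mapping("src.customer.id", "dw.dim_customer.customer_id")
graph.add_mapping("dw.dim_customer.customer_id", "mart.sales.customer_id")

print(sorted(graph.downstream("src.customer.id")))
print(sorted(graph.upstream("mart.sales.customer_id")))
```

The same structure works at table level or object level by changing what the identifiers refer to; as the text notes, the result is only as complete as the mappings that were captured.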
Data Usage Analysis
Data usage analysis primarily provides the ability to track what data within the Warehouse is being used, the frequency of
usage and the access log of end-users accessing the data. Data usage analysis helps to identify how frequently
data elements are accessed, improve the data modelling and restructure the data to provide easier and quicker
access to end-users.
Data usage analysis requires the Data Usage related operational metadata to be captured as part of the metadata
management process. Some of this operational metadata for structured data can be captured through
automated processes, either from the system logs or system tables. However, for semi-structured or unstructured
data, capturing operational metadata may require some level of tracking at the operating system level and is
subject to feasibility, specific use case requirements and the decision to implement tracking of user activity at such
a detailed level.
BI Usage Analysis
Operational Metadata required for supporting BI usage analysis will be primarily sourced from the application
metadata repositories. BI usage analysis helps to understand user behaviour on BI tools and applications and
thus to identify potential opportunities for redesign and / or optimization.
Following are some examples of analyses typically performed on BI usage:
• Number of users executing reports on a daily / weekly basis
• Average number of reports executed on a daily / weekly basis
• Number of times a report is run in the last x days
Audit Analysis
Audit analysis requires Operational Metadata to be captured for the data integration and load processes. Audit
analysis primarily helps to understand the effectiveness of the data movement and data loading processes and
helps to identify potential opportunities for redesign and / or optimization.
Examples of audit analysis reports are as follows:
• Average execution times for batch processes, by subject area
• Long running jobs at potential risk of missing data loading SLAs (for proactive tuning)
• Jobs exceeding the average execution times on a daily / weekly basis
• Average number of errors or exceptions on a periodic basis
• Frequently occurring errors or exceptions by Source Feed or Subject Area
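Two of the audit reports listed above, average execution time per job and runs exceeding that average, can be derived directly from captured job run records. The sketch below assumes each record is a (job name, execution time in seconds) pair; the function names and sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def average_duration_by_job(runs):
    """Average execution time per job from (job_name, seconds) records."""
    durations = defaultdict(list)
    for job_name, seconds in runs:
        durations[job_name].append(seconds)
    return {job: mean(values) for job, values in durations.items()}


def runs_exceeding_average(runs):
    """Runs that took longer than the average for their own job."""
    averages = average_duration_by_job(runs)
    return [(job, seconds) for job, seconds in runs if seconds > averages[job]]


# Made-up execution times, for illustration only.
runs = [
    ("load_orders", 100),
    ("load_orders", 200),
    ("load_orders", 300),
    ("load_customers", 50),
]

print(average_duration_by_job(runs))   # {'load_orders': 200, 'load_customers': 50}
print(runs_exceeding_average(runs))    # [('load_orders', 300)]
```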
1.7 Metadata Maintenance and Retirement
The Metadata Maintenance and Retirement process is closely related to, and dependent on, the Governance and
Planning for Metadata. For the Warehouse, the Metadata Maintenance and Retirement strategy needs to cater to
the differences in target audience, data movement strategy and data retention strategy for each of these
environments.
Following are the general guidelines for Metadata Maintenance and Retirement:
• Metadata will be captured only for the Shared Area
• No metadata will be captured or maintained for user specific directories (Private Area)
• Metadata capture and updates for any metadata captured using a manual or custom process need to be
  enforced as part of the code release checklist, and the metadata should be up-to-date at any given time
• Technical metadata captured using an automated process should also be maintained completely and
  accurately for all objects
• The following metadata captured using an automated process may be refreshed on a need basis
  o Operational Metadata
  o System Statistics
• When data is purged, all metadata associated with that data / data objects should also be purged from
  the metadata repository