SAP BOBJ Data Services
Data Services is an end-to-end data integration, data management and text analytics product.
Data Integration: any software that provides extraction, integration, transformation and loading functionality is called data integration software.
Data Management: software that helps cleanse master data; any software that provides parsing, correcting, standardization, enhancement, match and consolidation functionality is called data management software.
Text Analytics: software that converts unstructured text data into structured data.
Advantages of BODS: one product, multiple services; single-window application; supports unstructured sources (big data); tight integration with SAP; supports the in-memory database (HANA); migration tool; content management tool; supports real-time extraction; price; low maintenance cost.
Architecture
1) Designer: a desktop application and client tool; it is the developer tool used to build BODS jobs; it supports batch jobs and real-time jobs; it is only an application, it does not store data.
2) Repository: a storage location in a database that stores BODS objects. A repository is a group of BODS metadata tables. Databases that can host a BODS repository are Oracle, SQL Server, MySQL, DB2, Sybase and HANA (4.1 onwards). There are three types of repositories:
A local repository (known in Designer as the Local Object Library) is used by an application designer to store definitions of source and target metadata and Data Services objects.
A central repository (known in Designer as the Central Object Library) is an optional component that can be used to support multi-user development. The Central Object Library provides a shared library that allows developers to check objects in and out for development.
A profiler repository is used to store information that is used to determine the quality of data.
3) BODS Services: Job Server, Data Engines and Access Server.
I. Job Server: a service that processes BODS batch jobs. Whenever a batch job is executed or triggered by a user, the assigned job server studies the design of the job and estimates the number of threads and processes to be allocated.
II. Data Engines: services that generate threads and processes as per the job server's estimation. Threads are involved in extraction and loading, and processes are involved in the transformation of data. Together, the job server and the data engines take responsibility for processing batch jobs.
III. Access Server: also called the message broker; the Access Server is the service responsible for processing real-time jobs. To manage real-time jobs an extra configuration, called real-time configuration, has to be done on both the sender and the receiver side. This configuration sends a message alert whenever data is updated on the sender side. The Access Server runs 24/7 and waits for a message from the sender to trigger the real-time job; once the real-time job finishes successfully, the Access Server sends a message back to the sender as an acknowledgement.
4) Data quality components:
I. Directories: these are also a kind of repository and store cleansing packages. They are physical folders at OS level, named reference data, under LINK_DIR/DataQuality/reference_data, and they hold country-specific data. Directories enable the match and consolidation functions as part of data cleansing.
II. Dictionaries: storage locations in a database that contain data classifications (the format of an entity); dictionaries help enable standardization of data.
III. Address Server: a service that processes the data quality transforms.
5) BODS Management Console: an administrator tool and a web application; URL: http://<host>:<port>/DataServices. It is a secured application: in 3.x it has its own admin user name and password, from 4.x it uses CMC credentials.
I. Administrator: scheduling, monitoring and executing batch jobs; configuring, starting and stopping real-time services; configuring Job Server, Access Server and repository usage; configuring and managing adapters; managing users; publishing batch jobs and real-time services via web services; reporting on metadata.
II. Auto Documentation: view, analyze and print graphical representations of all objects as depicted in Data Services Designer, including their relationships, properties and more.
III. Data Validation: evaluate the reliability of your target data based on the validation rules you created in your Data Services batch jobs, to quickly review, assess and identify potential inconsistencies or errors in source data.
IV. Impact and Lineage Analysis: analyze end-to-end impact and lineage for Data Services tables and columns and SAP BusinessObjects Business Intelligence platform objects such as universes, business views and reports.
V. Operational Dashboard: view dashboards of status and performance execution statistics of Data Services jobs for one or more repositories over a given time period.
VI. Data Quality Reports: use data quality reports to view and export SAP Crystal Reports for batch and real-time jobs that include statistics-generating transforms. Report types include job summaries, transform-specific reports and transform group reports.
6) CMC / IPS (Central Management Console / Information Platform Services): the CMC is the BO BI platform (like NetWeaver for SAP); it is a web application; it is an administrator application; it is a secured, role-based application. In the CMC, with respect to BODS, we can do repository management, user management, security management and license management. URL: http://<host>:<port>/BOE/CMC.
Installation of BODS 4.0: we need a database (MySQL) and a web application server (Tomcat).
I. Non-BOE customer: IPS -> BODS Server -> BODS Client.
II. BOE customer: BO CMC -> BODS Server -> BODS Client.
III. BOE customer with dual platform: BO CMC + IPS -> BODS Server -> BODS Client.
BODS Tools
1. CMC/IPS tools
I. Central Configuration Manager (CCM): helps in managing BO services; a desktop tool, server tool and administrator tool; here we can manage the EIM Adaptive Processing Server service, which helps in the creation of the RFC connection for SAP extraction.
II. Central Management Console (CMC): see above.
III. Lifecycle Management Console (LCM): helps in deploying BOE (CMC) content; a version management tool; a web application; a server tool; an administrator tool; URL: http://<host>:<port>. From BOE 4.0 SP5 onwards the CMC has two new options instead of LCM: promotion management and version management (Subversion).
IV. Upgrade Management Tool (UMT): helps in upgrading BOE content and applications from a lower version (4.0) to a higher version (4.1); a desktop tool, server tool and administrator tool.
V. Web Deployment Tool (WDT): helps in deploying web applications, i.e. when our web applications get corrupted we can redeploy them using WDT by deploying the .WAR files into the web application server.
2. BODS Server tools
I. Repository Manager: a tool to create repositories; it also helps to check the version of a repository and to upgrade a repository (when the BODS version is upgraded we upgrade the repository). It is an administrator tool, desktop tool and server tool.
II. Server Manager: a tool to create the Job Server and Access Server; the assignment of repositories to a job server is done here. It is a server tool, administrator tool and desktop tool; we can do e-mail configuration, define the pageable memory path settings, and do the real-time job configuration on the receiver side.
III. BODS Management Console: see above.
IV. Metadata Integrator: a tool that integrates BODS and BO BI; it brings the metadata of universes, reports, etc. into BODS so that impact and lineage analysis can be done. It is a desktop tool, server tool and administrator tool; the Metadata Integrator can be scheduled to collect the data from the CMS database into the BODS repository at regular intervals.
V. License Manager: used to add, delete and modify licenses; a desktop tool, administrator tool and server tool.
3. BODS Client tools
I. Designer: a desktop application and client tool; it is the developer tool used to build BODS jobs; it supports batch jobs and real-time jobs; it is only an application, it does not store data.
II. Locale Selector: helps you select the default language of Designer; a developer tool, client tool and desktop tool.
III. Documentation: the technical manuals of the tool.
BODS Objects
BODS objects are of two types:
I. Reusable objects: a reusable object has a single definition; all calls to the object refer to that definition. Reusable objects are accessed through the local object library. Data flows, work flows, datastores, functions, etc. are examples of reusable objects.
II. Single-use objects: some objects are defined only within the context of a single job or data flow; for example scripts, projects, loops, try-catch blocks and conditionals are single-use objects.
Projects: a project is a reusable object that allows you to group jobs. A project is the highest level of organization offered by the software; you can use a project to group jobs that have schedules that depend on one another or that you want to monitor together.
Jobs: a job is the only object you can execute. You can manually execute and test jobs in development.
Dataflow: an object that defines an ETL definition in BODS (what to extract, how to extract and where to load).
Workflow: an object that helps in grouping data flows, i.e. grouping a set of ETL operations.
Datastores: datastores represent connection configurations between the software and databases or applications. There are three kinds of datastores:
I. Database datastores: provide a simple way to import metadata directly from an RDBMS.
II. Application datastores: let users easily import metadata from most Enterprise Resource Planning (ERP) systems.
III. Adapter datastores: can provide access to an application's data and metadata, or just metadata. For example, if the data source is SQL-compatible, the adapter might be designed to access metadata, while Data Services extracts data from or loads data directly to the application.
Formats: an object that helps in connecting file-type sources and targets in BODS.
Scripts: an object that enables BODS customization; we write code in the BODS scripting language (similar to SQL). A script is a single-use object that is used to call functions and assign values to variables in a work flow. A script can contain: function calls, if statements, while statements, assignment statements and operators.
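For illustration, a minimal sketch of a script placed at the start of a work flow; the variable names ($G_LOAD_DATE, $G_REGION) are assumed global variables, while to_char(), sysdate() and nvl() are built-in functions and print() writes to the trace log:

  # set the load date once so every data flow in the job uses the same value
  $G_LOAD_DATE = to_char(sysdate(), 'YYYY.MM.DD');
  # default the region if the operator did not supply one at execution time
  $G_REGION = nvl($G_REGION, 'ALL');
  print('Starting load for region [$G_REGION] on date [$G_LOAD_DATE]');

Note the square brackets inside the string: they substitute the variable's value, as described under the scripting language special characters below.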
Transforms: ready-made logic provided by SAP along with the tool to perform data transformations. There are more than 20 transforms, grouped into Data Integrator, Platform and Data Quality transforms.
Template tables: a template table is a BODS object that helps in creating a table in any database; template tables are used as target tables in BODS. Once the job is executed, the template table object is created in the target database.
BODS GUI
Project area: contains the current project (with the jobs and other objects within it) available to you at a given time.
a) Designer tab: where we design and view objects; the objects are shown in a hierarchical fashion.
b) Monitor tab: where we can see currently running jobs (red = failed, yellow = currently running, green = successful).
c) Log tab: where we can see the logs of jobs (log history is 120 days by default).
Workspace: the area of the application window in which you define, display and modify objects.
Local object library: provides access to local repository objects, including built-in system objects such as transforms and the objects you build and save, such as jobs and data flows; all of these are reusable objects.
Tool palette: buttons on the tool palette let you add new objects to the workspace by drag and drop.
BODS Job Design Methodology
1. Identify the source and target system types. Systems are of two types: a) database type, b) file type.
2. Import the source and target system metadata (schema) into BODS: file type -> Format, database type -> Datastore.
3. Create a BODS job:
a. Open an existing project / create a project.
b. Define the ETL definition, i.e. create a dataflow (define source, define target, define transformations).
c. Validate and save the job.
d. Execute and monitor the job.
Excel workbook sheet extraction
BODS 3.x supports .xls; from BODS 4.0 onwards both .xls and .xlsx are supported. The add-on called Microsoft Access Database Engine should be installed on both the server and the client side so that the Designer application can read Microsoft Excel files; this add-on is provided by SAP along with the BODS software and is placed under C:\Program Files\SAPBO\Data Services\ext\Microsoft.
Multiple Excel file extraction dynamically:
Recommendations: the field structure should be identical in all files; the data should be in named ranges / sheets; the data should be in the same sheet name / sheet number in every file; maintain a common prefix or suffix in the file names.
Solutions: 1. wildcard characters (* or ?); OR 2. a list of file names with a comma delimiter.
Dynamic Excel file selection:
Recommendations: the structure in all files should be identical; the data should be in a sheet / named range; maintain a unique sheet name or sheet number; the naming convention of the file should contain a time stamp from when it was delivered.
Solutions: 1. a variable is used to pass the file name dynamically at runtime (see the sketch below); 2. a script is used to define the file name.
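A minimal sketch of such a script, assuming a global variable $G_FILENAME to which the Excel format's file name field is set, and a daily naming convention with an assumed prefix:

  # build today's file name, e.g. SALES_20240131.xls, and hand it to the format via $G_FILENAME
  $G_FILENAME = 'SALES_' || to_char(sysdate(), 'YYYYMMDD') || '.xls';
  print('Picking up workbook [$G_FILENAME]');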
Multiple Excel workbook sheets extraction dynamically:
Recommendations: the structure should be identical in all sheets; an extra sheet in the file should hold the list of sheet names where the data is placed.
Solution: 1. a variable to pass the sheet name dynamically at runtime; 2. a script to define the sheet name; 3. a loop to repeat the process and get the data from all the sheets.
Limitations of Excel workbooks in BODS: BODS cannot write (distribute) data to an Excel workbook; BODS 3.2 does not support the .xlsx format; there is no property to terminate the extraction after n errors in an Excel file.
Flat file extraction
A flat file is a structured file; a flat file has no data size limitations.
Types of flat files:
Delimited: a structured text file with markers to identify rows, columns and text is called a delimited flat file. Column markers: comma, space, tab, semicolon; row markers: Windows new line, Unix new line, new line; text markers: single quotes, double quotes. With a custom row marker BODS also supports special markers.
Fixed-width flat file: a structured flat file where the columns are identified by a fixed width (size); row markers: Windows new line, Unix new line, new line; text markers: single quotes, double quotes. With a custom row marker BODS also supports special markers.
SAP transport flat file format: supports reading a flat file from and writing a flat file to the SAP application server.
Unstructured text flat file: used to read data from unstructured text files (e.g. LinkedIn, Facebook); here the data is in the form of text (characters).
Unstructured binary: here the data is in binary form, e.g. .doc, .pdf, .xml, .html, etc.
Advantages of flat files: we can process parallel threads; we have an option to terminate the flat file read if it contains a specific number of errors; with the help of a Query transform we can generate a flat file.
XML data extraction
XML is a nested structure; we get the header (schema) and the data separately; it supports extraction at different levels; it supports real time. We get XML metadata in two fashions: DTD format, which helps in importing the XML metadata if the header file is provided as a DTD; and XML Schema Definition, which helps in importing the metadata if the header file is provided as an XML schema definition.
Make Current is available in the Query transform output schema and allows you to select the required columns from a nested schema; we cannot apply it on multiple nodes at a time.
Unnest with sub-schemas is available in the Query transform output schema and helps to convert a nested schema structure into a flat structure.
XML_PIPELINE Transform:
Functionality: helps in extracting nested structured data; it can be used to extract part of the XML data. Type: Data Integrator; Behaviour: transform; Input: nested input (takes a single input); Output: multiple identical outputs; Properties: no create properties, no target properties, no source object editor properties, no target object editor properties; Schema in: source schema (nested structure); Schema out: under user control (we can define the schema, i.e. create columns here); Limitation: does not allow selecting multiple columns from same-level nodes.
COBOL data extraction
COBOL copybook format: we get the header and the file (fixed-width flat file or variable-width flat file) in separate files; it supports nested structures; it supports extraction of only one node's columns at a time. COBOL copybooks always act as sources only; if there are 3 nodes in the COBOL copybook we get 3 options, of which we can select any one node.
Time Dimension Implementation
The time dimension is a global standard dimension; there is no need to depend on the source system. Reference entity: based on a reference entity we can derive new entities; day is the smallest reference entity in the time dimension.
DATE_GENERATION is a transform that provides the smallest reference time attribute, the day; based on it we can derive new time entities according to our requirement.
DATE_GENERATION Transform:
Functionality: generates a column that holds date values based on the start and end dates provided as input to the transform, using the increment set on it. Type: Data Integrator transform; Behaviour: acts like a source; Input: no input; Output: multiple identical outputs; Schema in: no input schema because it acts as a source; Schema out: under system control; we get a single output column, DI_GENERATED_DATE.
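As an illustration, the Query transform that usually follows DATE_GENERATION derives the other time attributes from DI_GENERATED_DATE; a sketch of possible output-column mappings (the column names are assumptions, the functions are from the built-in list):

  DAY_DATE        : DI_GENERATED_DATE
  MONTH_NUM       : month(DI_GENERATED_DATE)
  QUARTER_NUM     : quarter(DI_GENERATED_DATE)
  WEEK_IN_YEAR    : week_in_year(DI_GENERATED_DATE)
  DAY_IN_WEEK     : day_in_week(DI_GENERATED_DATE)
  IS_WEEKEND_FLAG : isweekend(DI_GENERATED_DATE)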
Time-dependent dimension implementation
A dimension in which the attributes change over a period of time, and in which those changes are captured, is called a time-dependent dimension. We use them for master data tracking or history reports; we can define the validity of an attribute (effective-from date and effective-to date).
EFFECTIVE_DATE Transform:
Functionality: derives the valid-to column based on the valid-from column and a sequence column. Type: Data Integrator; Behaviour: transform; Input: the input should contain an effective-from column and a sequence column; Output: multiple identical outputs; Schema out: input schema plus one additional date column.
Slowly Changing Dimensions
Dimensions that change slowly are called slowly changing dimensions. There are three basic SCD types: SCD-T0, SCD-T1 and SCD-T2; T3, T4 and T6 (hybrid SCD-T2) are different representations of SCD-T2.
SCD-T0: holds current data only; it does not maintain historical data and does not capture changes. The functionality to enable on the ETL side is truncate and reload; in BODS it is implemented by the "Delete data before loading" check box in the database table target editor properties.
SCD-T1: holds current data and also holds historical data, but it does not capture changes. The functionality to enable on the ETL side is overwrite; in BODS it is implemented with the Table_Comparison transform and, optionally, Map_Operation.
SCD-T2: holds current data, contains historical data and also captures changed data. The functionality to enable on the ETL side is that every change is inserted; we add four additional columns: SID (surrogate key), CUR_REC_IND (current record indicator), EFF_FROM and EFF_TO.
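For example (illustrative values only; the 9999.12.31 "high date" is just a common convention), a customer who moves city ends up with two rows in an SCD-T2 dimension:

  SID  CUST_ID  CITY       EFF_FROM    EFF_TO      CUR_REC_IND
  1    C100     Hyderabad  2010.01.01  2014.06.30  N
  2    C100     Bangalore  2014.07.01  9999.12.31  Y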
TRANSFORMS
TABLE_COMPARISON Transform:
Functionality: helps to identify the type (operation code) of each input record that you extracted. Type: Data Integrator transform; Behaviour: transform; Prerequisites: there should be at least one common key (primary key) between the input table and the comparison table, and the common columns of the input table and the comparison table must have the same names and data types (not all columns, only the common ones); Input: single input; Output: multiple single outputs; Schema out: under system control.
It compares two data sets and produces the difference between them as a data set with rows flagged as INSERT, UPDATE or DELETE. The Table_Comparison transform allows you to detect and forward changes that have occurred since the last time a target was updated. Comparison methods:
Row-by-row select: looks up the target table using SQL every time it receives an input row. This option is best if the target table is large.
Cached comparison table: loads the comparison table into memory. This option is best when the table fits into memory and you are comparing the entire target table.
Sorted input: reads the comparison table in the order of the primary key column(s) using a sequential read. This option improves performance because Data Integrator reads the comparison table only once. Add a query between the source and the Table_Comparison transform, then, from the query's input schema, drag the primary key columns into the Order By box of the query.
Input primary key column(s): the input data set columns that uniquely identify each row. These columns must be present in the comparison table with the same column names and data types.
Other options: Input contains duplicate keys, Generated key column, Detect deleted row(s) from comparison table (detect all rows / detect row with largest generated key value).
If we check the option "Input contains duplicate keys", then in the Generated key column we select the key column whose largest generated key value identifies the row to compare among the duplicates.
If we select the option "Detect deleted row(s) from comparison table", the missing record(s) are flagged as DELETE.
MAP_OPERATION Transform:
Functionality: helps in changing the record modes (operation codes) of records in BODS. Type: Platform transform; Behaviour: behaves like a transform; Input: single input; Output: multiple single outputs; Schema out: under system control; the output schema equals the input schema.
HISTORY_PRESERVING Transform:
Functionality: helps in defining the current-record-indicator flag and the validity of each input record as part of an SCD-T2 implementation. Type: Data Integrator; Behaviour: transform; Input: single input (the input should come from a Table_Comparison); Output: multiple single outputs; Schema out: under system control; the output schema structure is equal to the Table_Comparison structure.
It allows you to produce a new row in your target rather than updating an existing row. You can indicate in which columns the transform identifies changes to be preserved. If the value of certain columns changes, this transform creates a new row for each row flagged as UPDATE in the input data set.
KEY_GENERATION Transform:
Functionality: helps in implementing the SID (like a sequence generator in a database). Type: Data Integrator; Behaviour: transform; Input: single input; Output: multiple single outputs. It generates new keys for new rows in a data set: the Key_Generation transform looks up the maximum existing key value in a table and uses it as the starting value to generate new keys.
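The same logic is also available as the built-in key_generation() function, which can be called in a Query output-column mapping or a script; a sketch, where the datastore, owner, table and column names are assumptions and the usual argument order is (table, key column, increment):

  # return max(CUSTOMER_SID) from the dimension plus an increment of 1 for each new row
  key_generation('DS_DWH.DBO.DIM_CUSTOMER', 'CUSTOMER_SID', 1)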
MAP OPERATION Allows conversions between data manipulation
operations. The Map_Operation transform allows you to change
operation codes on data sets to produce the desired output. For
example, if a row in the input data set has been updated in some
previous operation in the data flow, you can use this transform to
map the UPDATE operation to an INSERT. The result could be to
convert UPDATE rows to INSERT rows to preserve the existing row in
the target. Data Integrator can push Map_Operation transforms to
the source database.
MERGE Transform:
Functionality: enables data integration (union) functionality in BODS. Type: Platform transform; Behaviour: transform; Input: multiple inputs; Output: multiple single outputs; Prerequisites: before passing inputs to the Merge transform all the sources should have an identical structure (number of columns, column names, data types and order of columns); Schema out: under system control; the output schema is equal to the input schema.
Combines incoming data sets, producing a single output data set
with the same schema as the input data sets.
ROW_GENERATION Transform:
Using this transform we can maintain one special record in a dimension table (which handles null records in the DWH).
Functionality: helps in generating the requested number of rows. Type: Platform transform; Behaviour: acts as a source; Input: no input; Output: multiple single outputs.
Objective: produces a data set with a single column. The column values start from zero and increment by one up to a specified number of rows.
Data inputs: none.
Options:
Row count: a positive integer indicating the number of rows in the output data set. For added flexibility, you can enter a variable for this option.
Join rank: a positive integer indicating the weight of the output data set if the data set is used in a join. Sources in the join are accessed in order based on their join ranks; the highest-ranked source is accessed first to construct the join.
Cache: select this check box to hold the output from the transform in memory for use in subsequent transforms. Select Cache only if the resulting data set is small enough to fit in memory.
Editor: the Row_Generation transform editor includes the target schema and the transform options.
Data outputs: the Row_Generation transform produces a data set with a single column and the number of rows specified in the Row count option. The rows contain integer values in sequence, starting from zero and incrementing by one in each row.
PIVOT Transform:
Functionality: helps in applying pivoting on the input data, i.e. converting columns into rows. Type: Data Integrator; Behaviour: transform; Input: single input; Output: multiple single outputs; Schema out: under system control. Default output schema of a Pivot transform: PIVOT_SEQ (int), PIVOT_HDR (varchar), PIVOT_DATA (varchar); the output schema of the pivot = the non-pivot columns + the default schema; the pivot sequence column should be the primary key.
Pivot sequence column: for each row created from a pivot column, Data Integrator increments and stores a sequence number.
Non-pivot columns: the columns in the source that are to appear in the target without modification.
Pivot set: the number that identifies a pivot set. Each pivot set must have a group of pivot columns, a unique data field column and a header column. Data Integrator automatically saves this information.
Data field column: contains the pivoted data; this column contains all of the pivot columns' values.
Header column: lists the names of the columns where the corresponding data originated.
Creates a new row for each value in a column that you identify
as a pivot column. The Pivot transform allows you to change how the
relationship between rows is displayed. For each value in each
pivot column, DI produces a row in the output data set. You can
create pivot sets to specify more than one pivot column.
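A small illustration (table and column names made up): pivoting the four quarter columns of one input row produces four output rows.

  Input : REGION = 'APAC', Q1 = 100, Q2 = 120, Q3 = 90, Q4 = 130
  Output: REGION  PIVOT_SEQ  PIVOT_HDR  PIVOT_DATA
          APAC    1          Q1         100
          APAC    2          Q2         120
          APAC    3          Q3         90
          APAC    4          Q4         130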
REVERSE_PIVOT Transform:
Functionality: helps in applying a reverse pivot operation on the input data, i.e. converting rows into columns. Type: Data Integrator; Behaviour: transform; Input: only one input of flat structure; Output: multiple single outputs; Schema out: under system control.
Non-pivot columns: the columns in the source table that will appear in the target table without modification.
Pivoted columns: a set of columns will be created for each unique value in the pivot axis column.
Pivot axis column: the column that determines what new columns are needed in the output table. At run time, a new column is created for each pivoted column and each unique value in this column.
Duplicate value: the action taken when a collision occurs. A collision occurs when there is more than one row with the same key and value in the pivot axis column; in this case you can store either the first row or the last row, or you can abort the transform process.
Axis value: the value of the pivot axis column that represents a particular set of output columns.
Column prefix: text added to the front of the pivoted column names when creating new column names for the rotated data.
Creates one row of data from several existing rows. The Reverse
Pivot transform allows you to combine data from several rows into
one row by creating new columns. For each unique value in a pivot
axis column and each selected pivot column, DI produces a column in
the output data set.
CASE Transform:
Functionality: helps in distributing or categorizing data. Type: Platform; Behaviour: transform; Input: a single flat-structure input; Output: multiple distinct outputs; the output schema structure is equal to the input schema structure.
Hierarchies: a grouping of master data is a hierarchy. In hierarchy flattening the data is stored in two fashions. Horizontal fashion: enables drill-down and drill-up functionality (the mandatory format); horizontal structure: current leaf, level 0, level 1 ... level n, leaf level. Vertical fashion: enables global filtering (a performance tuning option); vertical structure: ancestor ID, descendant ID, root flag, leaf flag, depth.
HIERARCHY_FLATTENING Transform:
Functionality: helps in handling hierarchy data in BODS; it stores the hierarchy in the two fashions above. Input: a structured flat source with either 2 or 4 columns, which include the parent and the child; Output: multiple single outputs.
Constructs a complete hierarchy from parent/child relationships,
then produces a description of the hierarchy in vertically or
horizontally flattened format. Flattening types: horizontal and vertical.
Horizontal Flattening Each row of the output describes a single
node in the hierarchy and the path to that node from the root. This
mode requires that you specify the maximum path length through the
tree as the Maximum depth. If there is more than one path to a
node, all paths are described in the result.
Vertical Flattening Each row of the output describes a single
relationship between ancestor and descendent and the number of
nodes the relationship includes. There is a row in the output for
each node and all of the descendents of that node. Each node is
considered its own descendent and therefore is listed one time as
both ancestor and descendent.
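A tiny illustration (values made up, column names simplified) for a parent/child input World -> India -> Hyderabad:

  Horizontal output (one row per node, with the path from the root):
    LEAF       LEVEL0  LEVEL1  LEVEL2     DEPTH
    Hyderabad  World   India   Hyderabad  2

  Vertical output (one row per ancestor/descendant pair, each node also paired with itself):
    ANCESTOR   DESCENDANT  DEPTH
    World      World       0
    World      India       1
    World      Hyderabad   2
    India      India       0
    India      Hyderabad   1
    Hyderabad  Hyderabad   0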
MAP_CDC_OPERATION Transform:
Functionality: helps in handling how data from a source-based CDC source system should be applied to the target, i.e. it simply sets the operation value (I, U, D) in the DI_OPERATION_TYPE column based on the CDC columns. Type: Data Integrator; Behaviour: transform; Input: only one input, of type Oracle / Microsoft CDC table or SAP extractor; Output: multiple single outputs (with different record modes).
Using its input requirements, it performs three functions: sorts
input data based on values in Sequencing column box and the
Additional Grouping Columns box. Maps output data based on values
in Row Operation Column box. Source table rows are mapped to
INSERT, UPDATE, or DELETE operations before passing them on to the
target. Resolves missing, separated, or multiple before- and
after-images for UPDATE rows. While commonly used to support Oracle
or mainframe changed-data capture, this transform supports any data
stream as long as its input requirements are met. This transform is
typically the last object before the target in a data flow because
it produces INSERT, UPDATE and DELETE operation codes. Data
Integrator produces a warning if other objects are used.
VALIDATION Transform:
Functionality: helps in enabling the validation layer in BODS. Type: Platform; Behaviour: transform; Input: a single input with a flat structure (no XML); Output: in 3.x two distinct output sets, in 4.x three distinct output sets; Schema out: the output schema of the Pass set is equal to the input schema; the Fail output schema is equal to the input schema plus two columns (DI_ERRORACTION and DI_ERRORCOLUMN; if we enable the check box we also get a DI_ROWID column); the Rule Violation set schema is DI_ROWID, DI_RULENAME and DI_COLUMNNAME.
One source in a data flow. Qualifies a data set based on rules
for input schema columns. Allows one validation rule per column.
Filter out or replace data that fails your criteria. Outputs two
schemas: Pass and Fail.
DATA_TRANSFER Transform:
Functionality: helps in pushing the source data to any connected database, to a BODS job server file location, or to cache memory, so that later operations can be pushed down. Type: Data Integrator; Behaviour: transform; Input: only one input of flat type; Output: multiple single outputs; the output schema is equal to the input schema.
SQL Transform:
Functionality: a transform that helps to push down SELECT operations to the database level. Type: Platform; Behaviour: acts as a source; Input: no input; Output: multiple single outputs; the output schema is equal to the output schema of the SELECT statement.
Performs the indicated SQL query operation. Use this transform
to perform standard SQL operations for things that cannot be
performed using other built-in transforms. The options for the SQL
transform include specifying a datastore, join rank, cache, array
fetch size, and entering SQL text.
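Related to this, the built-in sql() function can run an arbitrary statement against a datastore from a script; a minimal sketch, where the datastore name, table names and the $G_ROW_COUNT global variable are assumptions:

  # run a statement on the staging database before the load
  sql('DS_STAGE', 'TRUNCATE TABLE STG_SALES');
  # for SELECTs, sql() returns the value of the first row/column
  $G_ROW_COUNT = sql('DS_STAGE', 'SELECT COUNT(*) FROM SRC_SALES');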
QUERY Transform: the Query transform can perform the following operations: choose (filter) the data to extract from sources; join data from multiple sources; map columns from input to output schemas; perform transformations and functions on the data; perform data nesting and unnesting; add new columns, nested schemas, and function results to the output schema; assign primary keys to output columns.
DATA ASSESSMENT using BODS
Data assessment is for analyzing and measuring the data; it is mainly about studying the sources with respect to the data. Functions available in BODS: the View Data function, Validation transform statistics, auditing and data profiling. Profiling types: column profiling and relationship profiling. Column profiling is of two types: basic column profiling (min length, max length, medium length, min %, max %, medium %, % of nulls, % of blanks) and detailed column profiling (distinct %, pattern %, median). Relationship profiling: defining the relationship between two objects; % of common data, % of non-common data. SAP has a separate tool for this: Data Insight / Information Steward.
BODS Variables
Variables enable dynamism in ETL jobs.
Global variables (G_XXX): job dependent; can be used anywhere inside the job (but not across jobs); processing options: default value, user entry and through a script; input value: single value.
Global variables are global within a job. Setting parameters is
not necessary when you use global variables. However, once you use
a name for a global variable in a job, that name becomes reserved
for the job. Global variables are exclusive within the context of
the job in which they are created.
Local variables (L_XXX): job and work flow dependent; can be used only in the place where they are defined; processing option: through a script; input value type: single value.
To pass a local variable to another object, define the local variable, then, from the calling object, create a parameter and map the parameter to the local variable by entering a parameter value.
Parameters (P_XXX): generally parameters are used, like the input and output parameters of a function, to pass values in and take values out; they are work flow and data flow dependent; they can be used only in the place where they are defined; processing option: only through local variable assignment; input value type: single value.
Parameters can pass their values into and out of work flows and pass their values into data flows. Each parameter is assigned a type: input, output, or input/output. The value passed by the parameter can be used by any object called by the work flow or data flow.
Substitution parameters ($$_XXX): they hold file path information; they are repository dependent; they can be used anywhere inside the repository; processing options: default value, user entry and through a script; input value type: single value.
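For illustration, a sketch of how the different variable types typically show up in a job (the names $G_LOAD_TYPE, $L_REGION and $$SOURCE_DIR are assumptions):

  # global variable (job level), e.g. set in a script or at execution time
  $G_LOAD_TYPE = 'DELTA';
  # local variable (work flow level), passed on to a data flow through a parameter
  $L_REGION = 'APAC';
  # substitution parameter (repository level), e.g. $$SOURCE_DIR holding a file path,
  # referenced in a file format's root directory instead of hard-coding the path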
LOOKUP functions
Lookups always work at column level; transforms work at row level.
Normal lookup(): derives only one column (see the example below); the lookup source can only be a database table; supports only the equality condition (=) and only one such condition; performance tuning: pre-load cache and demand-load cache.
lookup_ext(): derives multiple columns from one lookup_ext() call; lookup sources can be database tables or flat files; supports all operators (=, <, >, etc.); supports multiple conditions/columns; supports SCD-T2 sources; performance tuning: pre-load cache, demand-load cache and run as a separate process.
lookup_seq(): derives only one column; lookup sources are only database tables; supports only equality conditions, and exactly two of them; meant for SCD-T2 handling; no performance tuning options.
LOOKUP: retrieves a value in a table or file based on the values in a different source table or file.
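A sketch of a lookup() call as it might appear in a Query output-column mapping; the datastore, table and column names are assumptions, and the argument order follows the usual lookup(translate table, result column, default value, cache spec, compare column, expression) pattern:

  lookup(DS_DWH.DBO.DIM_CUSTOMER, CUSTOMER_NAME, 'UNKNOWN', 'PRE_LOAD_CACHE', CUSTOMER_ID, QRY_SRC.CUSTOMER_ID)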
LOOKUP_EXT:
A graphic editor is available in the function wizard. lookup_ext() retrieves a value in a table or file based on the values in a different source table or file, but it also provides extended functionality allowing you to: return multiple columns from a single lookup; choose from more operators to specify a lookup condition; perform multiple (including recursive) lookups; call lookup_ext in scripts and custom functions (which also lets you reuse the lookups packaged inside scripts); define custom SQL, using the SQL_override parameter, to populate the lookup cache, narrowing large quantities of data to only the sections relevant for your lookups; use lookup_ext to dynamically execute SQL; call lookup_ext, using the function wizard, in the query output mapping to return multiple columns in a Query transform; design jobs to use lookup_ext without having to hard-code the name of the translation file at design time; use lookup_ext with memory datastore tables.
LOOKUP_SEQ:
Retrieves a value in a table or file based on the values in a different source table or file and a particular sequence value.
NEW FUNCTION CALL:
Works the same as a normal lookup_ext call; the only difference is that it remains editable in the wizard form, which is not possible with a plain lookup_ext mapping, and there is no need to use global/local variables for the output ports.
BODS Project Deployment
Repository content: two methods - the direct method (repo to repo), good for a first-time deployment, and the export/import method (.atl / .xml files), good for version management.
BODS Management Console content: the configurations made in the BODS Management Console are not stored in the repository; they are stored in the files admin.xml, as.xml and sapconnection.xml, which live in the conf and bin folders of the installation.
CMC content: use LCM or Promotion Management to deploy from the source CMS database to the target.
Substitution parameter configurations and system configurations: using the export/import option inside BODS Designer we can deploy them to the target system.
CENTRAL REPOSITORY
Limitations of a local repository: no multi-user access, no auto-lock feature, no version management and no security. A central repository enables a multi-user environment and version management in BODS. There are two types of central repositories: unsecured and secured.
Exporting a new job to the Central Repository
1. Right-click the new job you created and choose "Export to Central Repository"; the job is exported to the path mentioned in the repository.
2. To make modifications and maintain versions in the central repository, follow the steps below.
Checking out objects from the central repository - to check out an object and its dependent objects:
1. Open the central object library.
2. In the central object library, right-click the job.
3. Click Check Out > Object and dependents.
Checking in objects to the central repository - to check in a single object:
1. Open the central object library.
2. Right-click the DF_EmpLoc data flow in the central object library and click Check In > Object. A Comment window opens.
3. Type the required comment; this labels the version so we can later find out what modification was made to the job that is already present.
BODS Performance Tuning Techniques
Environmental performance tuning techniques: OS level - I/O ratio; application level - increase the application-level cache memory; database level - primary indexes and regular DB statistics collection; network level - increase the bandwidth; BODS application side - check boxes in the execution properties such as monitor sample rate, print all trace messages, disable data validation statistics collection and enable auditing.
Multi-threading: partitioning the source table; for flat files, allocating parallel threads.
Multi-processing: at transform level, run as a separate process; at data flow level, degree of parallelism.
Parallel processing: running data flows in parallel is parallel processing.
Push down: pushing BODS operations to another system; BODS automatically pushes operations down if the source and target are in the same schema; if they are in different schemas we can use a DB link, the Data_Transfer transform or the SQL transform.
Bulk loading: only enabled for database-type targets; performed with the help of the database bulk loader tool (e.g. SQL*Loader) from BODS; target object editor properties -> bulk load check box.
Throughput: at data flow level - cache type (in-memory cache, pageable cache); at transform level - Table_Comparison (cached comparison method) and the lookup functions (pre-load cache and demand-load cache).
Other: at source level - array fetch size (best practice 1000-5000) and join rank (normal, ranking and cache algorithms); at target level - rows per commit 1000-5000 and Advanced -> General -> number of loaders.
Error handling: target editor properties -> error handling -> use overflow file.
Source-based performance options: using array fetch size, caching data, join ordering, minimizing extracted data. Target-based performance options: loading method and rows per commit, staging tables to speed up auto-correct loads. Job design performance options: improving throughput, maximizing the number of pushed-down operations, minimizing data type conversion, minimizing locale conversion, improving Informix repository performance.
1. Utilize a database (like Oracle / Sybase / Informix / DB2
etc...) for significant data handling operations (such as sorts,
groups, aggregates). In other words, staging tables can be a huge
benefit to parallelism of operations. In parallel design - simply
defined by mathematics, nearly always cuts your execution time.
Staging tables have many benefits. Please see the staging table
discussion in the methodologies section for full details. 1.
Localize. Localize all target tables on to the SAME instance of
Oracle (same SID), or same instance of Sybase. Try not to use
Synonyms (remote database links) for anything (including: lookups,
stored procedures, target tables, sources, functions, privileges,
etc...). Utilizing remote links will most certainly slow things
down. For Sybase users, remote mounting of databases can definitely
be a hindrance to performance. 1. If you can - localize all target
tables, stored procedures, functions, views, sequences in the
SOURCE database. Again, try not to connect across synonyms.
Synonyms (remote database tables) could potentially affect
performance by as much as a factor of 3 times or more. 1. Remove
external registered modules. Perform pre-processing /
post-processing utilizing PERL, SED, AWK, GREP instead. The
Application Programmers Interface (API) which calls externals is
inherently slow (as of: 1/1/2000). Hopefully Informatica will speed
this up in the future. The external module which exhibits speed
problems is the regular expression module (Unix: Sun Solaris E450,
4 CPU's 2 GIGS RAM, Oracle 8i and Informatica). It broke speed from
1500+ rows per second without the module - to 486 rows per second
with the module. No other sessions were running. (This was a
SPECIFIC case - with a SPECIFIC map - it's not like this for all
maps). 1. Remember that Informatica suggests that each session
takes roughly 1 to 1 1/2 CPU's. In keeping with this - Informatica
plays well with RDBMS engines on the same machine, but does NOT
get along (performance wise) with ANY other engine (reporting
engine, java engine, OLAP engine, java virtual machine, etc...) 1.
Remove any database based sequence generators. This requires a
wrapper function / stored procedure call. Utilizing these stored
procedures has caused performance to drop by a factor of 3 times.
This slowness is not easily debugged - it can only be spotted in
the Write Throughput column. Copy the map, replace the stored proc
call with an internal sequence generator for a test run - this is
how fast you COULD run your map. If you must use a database
generated sequence number, then follow the instructions for the
staging table usage. If you're dealing with GIG's or Terabytes of
information - this should save you lots of hours tuning. IF YOU
MUST - have a shared sequence generator, then build a staging table
from the flat file, add a SEQUENCE ID column, and call a POST
TARGET LOAD stored procedure to populate that column. Place the
post target load procedure in to the flat file to staging table
load map. A single call to inside the database, followed by a batch
operation to assign sequences is the fastest method for utilizing
shared sequence generators. 1. TURN OFF VERBOSE LOGGING. The
session log has a tremendous impact on the overall performance of
the map. Force over-ride in the session, setting it to NORMAL
logging mode. Unfortunately the logging mechanism is not "parallel"
in the internal core, it is embedded directly in to the operations.
1. Turn off 'collect performance statistics'. This also has an
impact - although minimal at times - it writes a series of
performance data to the performance log. Removing this operation
reduces reliance on the flat file operations. However, it may be
necessary to have this turned on DURING your tuning exercise. It
can reveal a lot about the speed of the reader, and writer threads.
1. If your source is a flat file - utilize a staging table (see the
staging table slides in the presentations section of this web
site). This way - you can also use SQL*Loader, BCP, or some other
database Bulk-Load utility. Place basic logic in the source load
map, remove all potential lookups from the code. At this point - if
your reader is slow, then check two things: 1) if you have an item
in your registry or configuration file which sets the
"ThrottleReader" to a specific maximum number of blocks, it will
limit your read throughput (this only needs to be set if the
sessions have a demonstrated problems with constraint based loads)
2) Move the flat file to local internal disk (if at all possible).
Try not to read a file across the network, or from a RAID device.
Most RAID arrays are fast, but Informatica seems to top out, where
internal disk continues to be much faster. Here - a link will NOT
work to increase speed - it must be the full file itself - stored
locally. 1. Try to eliminate the use of non-cached lookups. By
issuing a non-cached lookup, your performance will be impacted
significantly. Particularly if the lookup table is also a "growing"
or "updated" target table - this generally means the indexes are
changing during operation, and the optimizer loses track of the
index statistics. Again - utilize staging tables if possible. In
utilizing staging tables, views in the database can be built which
join the data together; or Informatica's joiner object can be used
to join data together - either one will help dramatically increase
speed. 1. Separate complex maps - try to break the maps out in to
logical threaded sections of processing. Re-arrange the
architecture if necessary to allow for parallel processing. There
may be more smaller components doing individual tasks, however the
throughput will be proportionate to the degree of parallelism that
is applied. A discussion on HOW to perform this task is posted on
the methodologies page, please see this discussion for further
details. 1. BALANCE. Balance between Informatica and the power of
SQL and the database. Try to utilize the DBMS for what it was built
for: reading/writing/sorting/grouping/filtering data en-masse. Use
Informatica for the more complex logic, outside joins, data
integration, multiple source feeds, etc... The balancing act is
difficult without DBA knowledge. In order to achieve a balance, you
must be able to recognize what operations are best in the database,
and which ones are best in Informatica. This does not degrade from
the use of the ETL tool, rather it enhances it - it's a MUST if you
are performance tuning for high-volume throughput. 1. TUNE the
DATABASE. Don't be afraid to estimate: small, medium, large, and
extra large source data set sizes (in terms of: numbers of rows,
average number of bytes per row), expected throughput for each,
turnaround time for load, is it a trickle feed? Give this
information to your DBA's and ask them to tune the database for
"worst case". Help them assess which tables are expected to be high
read/high write, which operations will sort, (order by), etc...
Moving disks, assigning the right table to the right disk space
could make all the difference. Utilize a PERL script to generate
"fake" data for small, medium, large, and extra large data sets.
Run each of these through your mappings - in this manner, the DBA
can watch or monitor throughput as a real load size occurs. 1. Be
sure there is enough SWAP, and TEMP space on your PMSERVER machine.
Not having enough disk space could potentially slow down your
entire server during processing (in an exponential fashion).
Sometimes this means watching the disk space as while your session
runs. Otherwise you may not get a good picture of the space
available during operation. Particularly if your maps contain
aggregates, or lookups that flow to disk Cache directory - or if
you have a JOINER object with heterogeneous sources. 1. Place some
good server load monitoring tools on your PMServer in development -
watch it closely to understand how the resources are being
utilized, and where the hot spots are. Try to follow the
recommendations - it may mean upgrading the hardware to achieve
throughput. Look in to EMC's disk storage array - while expensive,
it appears to be extremely fast, I've heard (but not verified) that
it has improved performance in some cases by up to 50% 1. SESSION
SETTINGS. In the session, there is only so much tuning you can do.
Balancing the throughput is important - by turning on "Collect
Performance Statistics" you can get a good feel for what needs to
be set in the session - or what needs to be changed in the
database. Read the performance section carefully in the Informatica
manuals. Basically what you should try to achieve is: OPTIMAL READ,
OPTIMAL THROUGHPUT, OPTIMAL WRITE. Over-tuning one of these three
pieces can result in ultimately slowing down your session. For
example: your write throughput is governed by your read and
transformation speed, likewise, your read throughput is governed by
your transformation and write speed. The best method to tune a
problematic map, is to break it in to components for testing: 1)
Read Throughput, tune for the reader, see what the settings are,
send the write output to a flat file for less contention - Check
the "ThrottleReader" setting (which is not configured by default),
increase the Default Buffer Size by a factor of 64k each shot -
ignore the warning above 128k. If the Reader still appears to
increase during the session, then stabilize (after a few thousand
rows), then try increasing the Shared Session Memory from 12MB to
24MB. If the reader still stabilizes, then you have a slow source,
slow lookups, or your CACHE directory is not on internal disk. If
the reader's throughput continues to climb above where it
stabilized, make note of the session settings. Check the
Performance Statistics to make sure the writer throughput is NOT
the bottleneck - you are attempting to tune the reader here, and
don't want the writer threads to slow you down. Change the map
target back to the database targets - run the session again. This
time, make note of how much the reader slows down, it's optimal
performance was reached with a flat file(s). This time - slow
targets are the cause. NOTE: if your reader session to flat file
just doesn't ever "get fast", then you've got some basic map tuning
to do. Try to merge expression objects, set your lookups to
unconnected (for re-use if possible), check your Index and Data
cache settings if you have aggregation, or lookups being performed.
Etc... If you have a slow writer, change the map to a single target
table at a time - see which target is causing the "slowness" and
tune it. Make copies of the original map, and break down the
copies. Once the "slower" of the N targets is discovered, talk to
your DBA about partitioning the table, updating statistics,
removing indexes during load, etc... There are many database things
you can do here. 1. Remove all other "applications" on the
PMServer. Except for the database / staging database or Data
Warehouse itself. PMServer plays well with RDBMS (relational
database management system) - but doesn't play well with
application servers, particularly JAVA Virtual Machines, Web
Servers, Security Servers, application, and Report servers. All of
these items should be broken out to other machines. This is
critical to improving performance on the PMServer machine.
BODS Scripting Language
Keywords: begin...end; if...else; while; try...catch.
Special characters: # comment; ; end of statement; () to define a function; {} to take a variable's value as a quoted text string; [] to take the value of an expression; , to separate parameter values; ! not; || concatenation (string representation); \ escape character; $ variable; * multiplication.
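A short sketch combining these keywords with built-in functions (file_exists(), sleep() and raise_exception() are from the built-in list; the path, the $L_WAITED local variable and the 30-minute limit are assumptions):

  # wait up to 30 minutes for a trigger file before continuing the work flow
  $L_WAITED = 0;
  while (file_exists('/data/inbound/trigger.done') = 0 AND $L_WAITED < 1800)
  begin
     sleep(60000);                  # sleep() takes milliseconds
     $L_WAITED = $L_WAITED + 60;
  end
  if (file_exists('/data/inbound/trigger.done') = 0)
  begin
     raise_exception('Trigger file did not arrive in time');
  end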
BODS Recovery Mechanism
Tool-based recovery mechanism: this facility is available in BODS but is switched off by default; we can enable it when required: in the job execution properties choose Enable recovery, and on a re-run choose Recover from last failed execution; Recover as a unit is available at work flow level. Limitations: not available at data flow level; the job should be sequential.
Custom-based recovery: built into the job design; it should be set up so the job can be re-run after a failure, starting from the last run that completed successfully. Automatically recovering jobs: a DI feature that allows you to run unsuccessful jobs in recovery mode. Manually recovering jobs: a design technique that allows you to rerun jobs without regard to partial results in a previous run.
Manually recovering jobs: a job designed for manual recovery must be such that it can be run repeatedly, and must implement special steps to recover data when a step did not complete successfully during a previous run.
BODS Debugging
Debugging is used to trace a job at data level and to identify issues record by record; only data flows can be debugged. Prerequisites: without breakpoints we cannot debug (mandatory); conditions and filters are optional. A breakpoint is the point where interactive debugging starts working. The BODS interactive debugger is available in Designer: Tools -> Options -> Designer -> Environment -> Interactive Debugger.
Interactive Debugger: it allows you to examine and modify data row by row, by placing filters and breakpoints on lines in the data flow, during a debug-mode job execution. Using the debugger we can see what happens to the data after each transform or object in the flow. Designer displays four additional windows: 1. Call Stack, 2. Trace, 3. Variables, 4. View Data. The left View Data pane shows the data in the source table, and the right pane shows one row at a time (the default when you set a breakpoint) that has passed to the query.
Filters: a debug filter functions as a simple Query transform with a WHERE clause; however, complex expressions are not supported in a debug filter. Use a filter if you want to reduce a data set in a debug job execution.
Breakpoints: we can set a breakpoint between a source and a transform or between two transforms. A breakpoint is the location where a debug job execution pauses and returns control to you. A breakpoint condition applies to the after-image for UPDATE, NORMAL and INSERT row types and to the before-image for DELETE row types.
Steps to set filters and breakpoints:
1. Open the job you want to debug with filters and breakpoints in the workspace.
2. Open one of its data flows.
3. Right-click the connecting line that you want to examine and select Set Filter/Breakpoint.
4. In the breakpoint window, under the Filter or Breakpoint columns, select the Set check box.
Complete the Column, Operator, and Value columns accordingly.
TRY CATCH
A try/catch block is a combination of one try object and
one or more catch objects that allow you to specify alternative
work flows if errors occur while DI is executing a job. Try/catch
blocks can: catch classes of exceptions thrown by DI, the DBMS or the operating system; apply solutions that you provide; continue execution. Try and catch objects are single-use objects. Scripts are single-use objects used to call functions and assign values to variables in a work flow; a script can contain the following statements: function calls, if statements, while statements, assignment statements and operators.
MIGRATION AND REPOSITORIES
DI supports a number of environments, including large enterprises with many developers working on multiple projects. DI supports multi-site architectures, whether centralized or not. The development process you use to create your ETL application involves three distinct phases: design, test and production. Each phase may require a different computer in a different environment, and different security settings for each.
Data Integrator provides two migration mechanisms: Export/import
migration works best with small to medium-sized projects where a
small number of developers work on somewhat independent Data
Integrator applications through all phases of development.
Multi-user development works best in larger projects where two or
more developers or multiple teams are working on interdependent
parts of Data Integrator applications through all phases of
development.
Exporting Objects to a Database
You can export objects from the current repository to another
repository. However, the other repository must be the same version
as the current one. The export process allows you to change
environment-specific information defined in datastores and file
formats to match the new environment.
Exporting/Importing Objects to/from a File
You can also export
objects to a file. If you choose a file as the export destination,
DI does not provide options to change environment specific
information. Importing objects or an entire repository from a file
overwrites existing objects with the same names in the destination
repository. You must restart DI after the import process
completes.
Using built-in functions: DI provides over 60 built-in functions.
Aggregate: avg, count, count_distinct, max, min, sum
Conversion: cast, extract_from_xml, interval_to_char, julian_to_date, load_to_xml, long_to_varchar, num_to_interval, to_char, to_date, to_decimal, to_decimal_ext, varchar_to_long
Database: total_rows, key_generation, sql
Date: week_in_year, week_in_month, sysdate, quarter, month, last_date, julian, isweekend, fiscal_day, day_in_year, day_in_week, day_in_month, date_part, date_diff, concat_date_time, add_months
Environment: get_env, get_error_filename, get_monitor_filename, get_trace_filename, is_set_env, set_env
Lookup: lookup, lookup_ext, lookup_seq
Math: ceil, floor, ln, log, mod, power, rand, rand_ext, round, sqrt, trunc
Miscellaneous: current_configuration, current_system_configuration, dataflow_name, datastore_field_value, db_type, db_version, db_database_name, db_owner, decode, file_exists, gen_row_num_by_group, gen_row_num, get_domain_description, get_file_attribute, greatest, host_name, ifthenelse, is_group_changed, isempty, job_name, least, nvl, previous_row_value, pushdown_sql, raise_exception, raise_exception_ext, repository_name, sleep, system_user_name, table_attribute, truncate_table, wait_for_file, workflow_name
DATA TRANSFER
XML_PIPELINE
VALIDATION TRANSFORM
QUERY TRANSFORM
TABLE COMPARISON
KEY GENERATION
HIERARCHY FLATTENING
Lookup_ext
List of functions
PIVOT TRANSFORM
EXECUTION PROPERTIES
SAP BusinessObjects Data Services 4.0 features: Documentation; SAP integration; Security; Text Data Processing; Architecture; Transforms; Operational excellence; Functions; Source and target support; Data Quality (Data Cleanse transform, Geocoder transform, Global Address Cleanse transform, Match transform, USA Regulatory Address Cleanse transform).
What are the benefits of Data Warehousing?
Facilitate integration in an environment characterized by unintegrated applications. Integrate enterprise data across a variety of functions. Integrate external as well as internal data. Support strategic and long-term business planning. Support day-to-day tactical decisions. Enable insight into business trends and business opportunities.
What is the difference between a dimension table and a fact table?
A dimension table consists of tuples of attributes of the dimension. A fact table can be thought of as having tuples, one per recorded fact. Each fact contains some measured or observed variables and identifies them with pointers to dimension tables.
Data Warehousing helps you store the data, while business intelligence helps you control the data for decision making, forecasting, etc.
Dimension Table Examples: Retail -- store name, zip code, product name, product category, day of week; Telecommunications -- call origin, call destination; Banking -- customer name, account number, branch, account officer; Insurance -- policy type, insured party.
Dimension Table Characteristics: Dimension tables contain textual information that represents the attributes of the business; contain relatively static data; are joined to a fact table through a foreign key reference; and are hierarchical in nature, providing the ability to view data at varying levels of detail.
Fact Table Examples: Retail -- number of units sold, sales amount; Telecommunications -- length of call in minutes, average number of calls; Banking -- average monthly balance; Insurance -- claims amount.
Fact Table Characteristics: Fact tables contain numerical metrics of the business; can hold large volumes of data; can grow quickly; and are joined to dimension tables through foreign keys that reference primary keys in the dimension tables.
Identifying Measures and Dimensions: an attribute that is perceived as constant or discrete is a dimension (Product, Location, Time, Size); an attribute that varies continuously is a measure (Balance, Units Sold, Cost, Sales).
Business intelligence usually refers to the information that is available for the enterprise to make decisions on. A data warehousing (or data mart) system is the backend, or infrastructural, component for achieving business intelligence. Business intelligence also includes the insight gained from doing data mining analysis, as well as from unstructured data (thus the need for content management systems). For our purposes here, we will discuss business intelligence in the context of using a data warehouse infrastructure. Business Intelligence tools: Excel, reporting tools, OLAP tools, data mining tools. Business intelligence usage: business operations reporting, forecasting, dashboards, multidimensional analysis, finding correlation among different factors.
When should you use a STAR and when a SNOWFLAKE schema? The star schema is the simplest data warehouse schema. The snowflake schema is similar to the star schema; it normalizes dimension tables to save data storage space, and it can be used to represent hierarchies of information.
What is the difference between a data warehouse and a data mart? This is a heavily debated issue. There are inherent similarities between the basic constructs used to design a data warehouse and a data mart. In general, a data warehouse is used at an enterprise level, while data marts are used at a business division/department level. A data mart only contains the required subject-specific data for local analysis.
Data Warehouse: A data warehouse is a central repository for all
or significant parts of the data that an enterprise's various
business systems collect. Typically, a data warehouse is housed on
an enterprise mainframe server. Data from various online
transaction processing (OLTP) applications and other sources is
selectively extracted and organized on the data warehouse database
for use by analytical applications and user queries. Data
warehousing emphasizes the capture of data from diverse sources for
useful analysis and access. Data Marts: A data mart is a repository
of data gathered from operational data and other sources that is
designed to serve a particular community of knowledge workers. In
scope, the data may derive from an enterprise-wide database or data
warehouse or be more specialized. The emphasis of a data mart is on
meeting the specific demands of a particular group of knowledge
users in terms of analysis, content, presentation, and ease-of-use.
Users of a data mart can expect to have data presented in terms that are familiar.
We have two schema models which are suitable for Data
Warehousing in most of the cases. Star Schema: A star schema is a
set of tables comprised of a single, central fact table surrounded
by de-normalized dimensions. Each dimension is represented in a
single table. Snowflake Schema: If you normalize the star schema
dimensions to separate tables and link them together, you will have
a snowflake schema.
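As a minimal illustrative sketch (the table and column names below are hypothetical, not from any particular system), a star schema in SQL could look like this, with one fact table referencing de-normalized dimensions:

CREATE TABLE dim_product (
    product_key   INTEGER PRIMARY KEY,   -- surrogate key
    product_name  VARCHAR(100),
    category      VARCHAR(50)            -- kept in the dimension (de-normalized)
);
CREATE TABLE dim_date (
    date_key      INTEGER PRIMARY KEY,
    calendar_date DATE,
    month_name    VARCHAR(20),
    year_number   INTEGER
);
CREATE TABLE fact_sales (
    product_key   INTEGER REFERENCES dim_product (product_key),
    date_key      INTEGER REFERENCES dim_date (date_key),
    units_sold    INTEGER,               -- additive measure
    sales_amount  DECIMAL(12,2)          -- additive measure
);

To snowflake this schema, the category column would instead be moved into its own table (say dim_category) and referenced from dim_product by a foreign key.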
Fact Tables -- Types of Measures: additive facts, non-additive facts, semi-additive facts.
Additive Facts: Additive facts are facts that can be summed up through all of the dimensions in the fact table (for example, sales amount).
Non-Additive Facts: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table (for example, a ratio such as profit margin).
Semi-Additive Facts: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others (for example, an account balance can be summed across accounts but not across time).
Initial Load and Incremental/Refresh -- Initial Load: a single event that populates the database with historical data; involves large volumes of data. Refresh/Incremental: performed according to a business cycle; less data to load than the first-time load.
OLTP:
1. Captures transactional information necessary to run business operations.
2. Needs performance.
3. More DML operations.
4. OLTP systems often use fully normalized schemas to optimize update/insert/delete performance and data consistency.
5. A typical OLTP operation accesses only a handful of records.
6. OLTP systems usually store data for only a few weeks or months, if needed.
OLAP:
1. Analyzes transaction information at an aggregate level to improve the decision-making process.
2. Needs flexibility and broad scope.
3. Very rare DML operations.
4. Denormalized or partially denormalized (star schema) to optimize query performance.
5. A typical DW query scans thousands/millions of rows.
6. Considerable historical data is maintained for analysis.
Data mart - A logical subset of the complete data warehouse. A data mart is a complete pie-wedge of the overall data warehouse pie. A data warehouse is made up of the union of all its data marts. A data warehouse is fed from the staging area. Every data mart must be represented by a dimensional model and, within a single data warehouse, all such data marts must be built from dimensions and facts.
Mapping: The definition of the relationship and data flow between source and target objects.
Metadata: Data that describes data and other structures, such as objects, business rules, and processes. For example, the schema design of a data warehouse is typically stored in a repository as metadata, which is used to generate scripts used to build and populate the data warehouse. A repository contains metadata.
Staging Area: A place where data is processed before entering the warehouse.
Cleansing: The process of resolving inconsistencies and fixing the anomalies in source data, typically as part of the ETL process.
Transformation: The process of manipulating data. Any manipulation beyond copying is a transformation. Examples include cleansing, aggregating, and integrating data from multiple sources.
What is Entity Relationship (E-R) Modeling? The Entity Relationship (E-R) model was developed to address the following issues of conventional Data Base Management Systems (DBMS): (i) redundancy of data, (ii) lack of integration, and (iii) lack of flexibility.
This modeling is based on relational theory and abides by the 13 rules proposed by E.F. Codd that a DBMS implementation must follow to be qualified as truly relational. The data in the E-R model is presented in the simple form of two-dimensional tables.
Normalization: Normalization is a process of decomposing tables to prevent redundancy and insert/update anomalies.
First Normal Form (1NF): A table is said to be in First Normal Form (1NF) if it satisfies the following three conditions: 1) the columns of the table contain only atomic values (single, indivisible); 2) a primary key is defined for the table; 3) all the columns of the table are defined on the primary key.
Second Normal Form (2NF): A table is said to be in Second Normal Form if it satisfies the following conditions: 1) it satisfies the conditions for First Normal Form (1NF); 2) it does not include any partial dependencies, where a column is dependent on only a part of a composite primary key.
Third Normal Form (3NF): A table is said to be in Third Normal Form (3NF) if it satisfies the following conditions: 1) it is in 2NF; 2) it does not contain any transitive dependency, which means that any non-key column of the table should not be dependent on another non-key column.
Denormalization: Denormalization can be defined as the process of moving from a higher normal form to a lower normal form in order to speed up database access.
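For illustration (the tables and columns below are hypothetical), removing a transitive dependency to reach 3NF might look like this:

-- Before (violates 3NF): customer_city depends on customer_id, a non-key column, not on the key order_id
CREATE TABLE orders_unnormalized (
    order_id      INTEGER PRIMARY KEY,
    customer_id   INTEGER,
    customer_city VARCHAR(50),
    order_amount  DECIMAL(10,2)
);

-- After (3NF): the transitive dependency is moved into its own table
CREATE TABLE customers (
    customer_id   INTEGER PRIMARY KEY,
    customer_city VARCHAR(50)
);
CREATE TABLE orders (
    order_id      INTEGER PRIMARY KEY,
    customer_id   INTEGER REFERENCES customers (customer_id),
    order_amount  DECIMAL(10,2)
);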
Timestamp: Stores a database-wide unique number that gets updated every time a row gets updated.
SQL Join Types: There are different types of joins available in SQL (a short example follows the list below):
INNER JOIN: returns rows when there is a match in both tables.
LEFT JOIN: returns all rows from the left table, even if there are no matches in the right table.
RIGHT JOIN: returns all rows from the right table, even if there are no matches in the left table.
FULL JOIN: returns rows when there is a match in one of the tables.
SELF JOIN: is used to join a table to itself, as if the table were two tables, temporarily renaming at least one table in the SQL statement.
CARTESIAN JOIN: returns the Cartesian product of the sets of records from the two or more joined tables.
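As a quick sketch using hypothetical EMP and DEPT tables:

-- INNER JOIN: only employees whose dept_id matches a department
SELECT e.emp_name, d.dept_name
FROM   emp e
INNER JOIN dept d ON e.dept_id = d.dept_id;

-- LEFT JOIN: all employees, with NULL dept_name where no department matches
SELECT e.emp_name, d.dept_name
FROM   emp e
LEFT JOIN dept d ON e.dept_id = d.dept_id;

-- SELF JOIN: pair each employee with their manager from the same table
SELECT w.emp_name AS employee, m.emp_name AS manager
FROM   emp w
INNER JOIN emp m ON w.manager_id = m.emp_id;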
TIMESTAMP(expr), TIMESTAMP(expr1, expr2) (MySQL): With a single argument, this function returns the date or datetime expression expr as a datetime value. With two arguments, it adds the time expression expr2 to the date or datetime expression expr1 and returns the result as a datetime value.
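For example (MySQL):

SELECT TIMESTAMP('2003-12-31');                       -- '2003-12-31 00:00:00'
SELECT TIMESTAMP('2003-12-31 12:00:00', '12:00:00');  -- '2004-01-01 00:00:00'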
DELTA LOAD
A delta load, by definition, is loading incremental changes to
the data. When doing a delta load to a fact table, for example, you
perform inserts only... appending the change data to the existing
table.
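A minimal sketch of a delta load (the staging and fact tables here are hypothetical; the cut-off literal stands in for the last successful load timestamp, which would normally come from a job variable or control table):

-- Append only the rows changed since the previous run
INSERT INTO fact_sales (order_id, product_key, sales_amount, load_date)
SELECT order_id, product_key, sales_amount, CURRENT_DATE
FROM   stg_sales
WHERE  last_update_date > '2024-01-01 00:00:00';  -- last successful load timestamp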
Load types: 1) Bulk Load; 2) Normal Load.
Normal load: 1) used in case of less data; 2) we can get its log details; 3) we can rollback and commit; 4) session recovery is possible; 5) performance may be lower.
Bulk load: 1) used in case of large data; 2) no log details are available; 3) cannot rollback and commit; 4) session recovery is not possible; 5) performance improves.
Following are commonly used constraints available in SQL (a combined example follows this list): NOT NULL Constraint: ensures that a column cannot have a NULL value. DEFAULT Constraint: provides a default value for a column when none is specified. UNIQUE Constraint: ensures that all values in a column are different. PRIMARY Key: uniquely identifies each row/record in a database table. FOREIGN Key: uniquely identifies a row/record in another database table. CHECK Constraint: ensures that all values in a column satisfy certain conditions. INDEX: used to create and retrieve data from the database very quickly.
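A short sketch (hypothetical CUSTOMERS and ORDERS tables) showing these constraints together:

CREATE TABLE customers (
    customer_id  INTEGER      NOT NULL PRIMARY KEY,          -- NOT NULL + PRIMARY KEY
    full_name    VARCHAR(100) NOT NULL,
    email        VARCHAR(100) UNIQUE,                        -- UNIQUE constraint
    country      VARCHAR(50)  DEFAULT 'India',               -- DEFAULT constraint
    age          INTEGER      CHECK (age >= 18)              -- CHECK constraint
);

CREATE TABLE orders (
    order_id     INTEGER PRIMARY KEY,
    customer_id  INTEGER REFERENCES customers (customer_id)  -- FOREIGN KEY
);

CREATE INDEX idx_orders_customer ON orders (customer_id);    -- INDEX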
NOT NULL Constraint: By default, a column can hold NULL values. If you do not want a column to have a NULL value, then you need to define such a constraint on the column, specifying that NULL is now not allowed for that column. A NULL is not the same as no data; rather, it represents unknown data.
DEFAULT Constraint: The DEFAULT constraint provides a default value for a column when the INSERT INTO statement does not provide a specific value.
UNIQUE Constraint: The UNIQUE constraint prevents two records from having identical values in a particular column. In the CUSTOMERS table, for example, you might want to prevent two or more people from having an identical age.
PRIMARY Key: A primary key is a field in a table which uniquely identifies each row/record in a database table. Primary keys must contain unique values, and a primary key column cannot have NULL values. A table can have only one primary key, which may consist of single or multiple fields. When multiple fields are used as a primary key, they are called a composite key. If a table has a primary key defined on any field(s), then you cannot have two records having the same value of that field(s).
FOREIGN Key: A foreign key is a key used to link two tables together. It is sometimes called a referencing key. The primary key field from one table is inserted into the other table, where it becomes a foreign key; that is, a foreign key is a column or a combination of columns whose values match a primary key in a different table. The relationship between two tables matches the primary key in one of the tables with a foreign key in the second table.
CHECK Constraint: The CHECK constraint enables a condition to check the value being entered into a record. If the condition evaluates to false, the record violates the constraint and isn't entered into the table.
INDEX: The INDEX is used to create and retrieve data from the database very quickly. An index can be created using a single column or a group of columns in a table. When an index is created, it is assigned a ROWID for each row before it sorts out the data. Proper indexes are good for performance in large databases, but you need to be careful while creating an index; the selection of fields depends on what you are using in your SQL queries.
Database Normalization: Database normalization is the process of efficiently organizing data in a database. There are two reasons for the normalization process: eliminating redundant data (for example, storing the same data in more than one table) and ensuring data dependencies make sense.
DELETE: 1. DELETE is a DML command. 2. A DELETE statement is executed using a row lock; each row in the table is locked for deletion. 3. We can specify filters in the WHERE clause. 4. It deletes specified data if a WHERE condition exists. 5. DELETE activates a trigger because the operations are logged individually. 6. Slower than TRUNCATE because it keeps logs. 7. Rollback is possible.
TRUNCATE: 1. TRUNCATE is a DDL command. 2. TRUNCATE TABLE always locks the table and page, but not each row. 3. Cannot use a WHERE condition. 4. It removes all the data. 5. TRUNCATE TABLE cannot activate a trigger because the operation does not log individual row deletions. 6. Faster performance-wise, because it doesn't keep any logs. 7. Rollback is not possible. (In some databases, such as SQL Server, both DELETE and TRUNCATE can be rolled back when used within an explicit TRANSACTION.)
Primary Key vs Unique Key:
A primary key will not accept NULL values; a unique key accepts one and only one NULL value.
There can be only one primary key in a table; there can be more than one unique key in a table.
A clustered index is created on the primary key; a non-clustered index is created on a unique key.
A primary key allows each row in a table to be uniquely identified and ensures that no duplicate rows exist; a unique key constraint is used to prevent the duplication of key values within the rows of a table and allows null values.
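A quick sketch on a hypothetical EMP table:

DELETE FROM emp WHERE dept_id = 10;  -- DML: removes only matching rows, logged row by row, can fire triggers
TRUNCATE TABLE emp;                  -- DDL: removes all rows, minimal logging, no WHERE clause allowed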
Primary Key: 1) A primary key is used to avoid the duplication of records under the same column. 2) A primary key is used to uniquely identify each record in a table. 3) You can use only one primary key at a time on a single table. 4) It does not allow any null value or duplicate value.
Foreign Key: It is used to give a reference to another table's primary key. When you insert or update a child-table record, the referenced value must already exist in the parent table; the parent table is maintained first, and the change is then reflected in the child table.
Primary Key: it will not allow NULL values or duplicate values. Foreign Key: it will allow NULL values and duplicate values, and it refers to a primary key in another table.
SELECT column1, column2, columnN FROM table_name;
What is a surrogate key? Explain it with an example.
Answer: Data warehouses commonly use a surrogate key to uniquely identify an entity. A surrogate key is not generated by the user but by the system. A primary difference between a primary key and a surrogate key in a few databases is that the PK uniquely identifies a record while the SK uniquely identifies an entity. For example, an employee may be recruited before the year 2000, while another employee with the same name may be recruited after the year 2000. Here, the primary key will uniquely identify the record, while the surrogate key will be generated by the system (say a serial number), since the SK is not derived from the data.
A surrogate key is a unique identifier in the database, either for an entity in the modeled world or for an object in the database. Application data is not used to derive a surrogate key. A surrogate key is an internally generated key produced by the current system and is invisible to the user. As several objects may be available in the database corresponding to a surrogate, the surrogate key cannot be utilized as a primary key. Advantages of surrogate keys include: control over data and reduced fact table size.
22. What is a Stored Procedure? It's nothing but a set of T-SQL statements combined to perform a single task or several tasks. It's basically like a macro, so when you invoke the stored procedure you actually run a set of statements.
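A minimal sketch in T-SQL style (the procedure and table names here are hypothetical):

CREATE PROCEDURE usp_get_customer_orders @customer_id INT
AS
BEGIN
    -- Any number of statements could be combined here; this one simply returns the customer's orders
    SELECT order_id, order_date, sales_amount
    FROM   orders
    WHERE  customer_id = @customer_id;
END;

-- Invoking the stored procedure runs the whole set of statements at once
EXEC usp_get_customer_orders @customer_id = 42;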
To COMMIT means to make changes to data permanent.
COMMIT - save work done.
SAVEPOINT - identify a point in a transaction to which you can later roll back.
ROLLBACK - restore the database to its original state since the last COMMIT.
SET TRANSACTION - change transaction options, such as which rollback segment to use.
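A short sketch of these transaction-control statements against a hypothetical ACCOUNTS table:

UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
SAVEPOINT after_debit;                 -- mark a point we can roll back to
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;
ROLLBACK TO SAVEPOINT after_debit;     -- undo only the second update
COMMIT;                                -- make the remaining work permanent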
What is an ODS (Operational Data Store)? An ODS is an operational data store (also called an operational data source). It contains the most recent data, typically around 30-60 days of data, and it is placed between the source systems and the staging area; reports can also be taken from the ODS. Operational Data Store (ODS): An ODS is an integrated database of operational data. Its sources include legacy systems, and it contains current or near-term data. An ODS may contain 30 to 60 days of information, while a data warehouse typically contains years of data; it sits before staging.
Full Load: completely erasing the contents of one or more tables and reloading them with fresh data. Incremental Load: applying ongoing changes to one or more tables based on a predefined schedule.
The ODS (Operational Data Store) is the first point in the data warehouse; it stores the real-time data of daily transactions as the first instance of the data.
The Staging Area is the later part, which comes after the ODS. Here the data is cleansed and temporarily stored before being loaded into the data warehouse.
Interview Questions
12. TRUNCATE TABLE EMP; DELETE FROM EMP; Will the outputs of the above two commands differ?
Delete command: 1. It is a DML command. 2. Data can be rolled back. 3. It is slower than the TRUNCATE command because it logs each row deletion. 4. With the DELETE command a trigger can be fired.
Truncate command: 1. It is a DDL command. 2. Data cannot be rolled back. 3. It is faster than DELETE because it does not log rows. 4. With the TRUNCATE command a trigger cannot be fired.
In both cases only the table data is removed, not the table structure.
13. What is the use of the DROP option in the ALTER TABLE command? The DROP option in the ALTER TABLE command is used to drop columns you no longer need from the table. The column may or may not contain data. Using the ALTER ... DROP COLUMN statement, only one column can be dropped at a time, and the table must have at least one column remaining in it after it is altered. Once a column is dropped, it cannot be recovered.
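For example (hypothetical EMP table):

ALTER TABLE emp DROP COLUMN commission;   -- the commission column and its data are removed permanently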
20. What is Data Warehousing Hierarchy? Hierarchies: Hierarchies
are logical structures that use ordered levels as a means of
organizing data. A hierarchy can be used to define data
aggregation. For example, in a time dimension, a hierarchy might
aggregate data from the month level to the quarter level to the
year level. A hierarchy can also be used to define a navigational
drill path and to establish a family structure.
Within a hierarchy, each level is logically connected to the
levels above and below it. Data values at lower levels aggregate
into the data values at higher levels. A dimension can be composed
of more than one hierarchy. For example, in the product dimension,
there might be two hierarchies--one for product categories and one
for product suppliers.
Dimension hierarchies also group levels from general to
granular. Query tools use hierarchies to enable you to drill down
into your data to view different levels of granularity. This is one
of the key benefits of a data warehouse.
When designing hierarchies, you must consider the relationships
in business structures. For example, a divisional multilevel sales
organization.
Hierarchies impose a family structure on dimension values. For a
particular level value, a value at the next higher level is its
parent, and values at the next lower level are its children. These
familial relationships enable analysts to access data quickly.
Levels: A level represents a position in a hierarchy. For example,
a time dimension might have a hierarchy that represents data at the
month, quarter, and year levels. Levels range from general to
specific, with the root level as the highest or most general level.
The levels in a dimension are organized into one or more
hierarchies.
Level Relationships: Level relationships specify top-to-bottom
ordering of levels from most general (the root) to most specific
information. They define the parent-child relationship between the
levels in a hierarchy.
Hierarchies are also essential components in enabling more
complex rewrites. For example, the database can aggregate an
existing sales revenue on a quarterly base to a yearly aggregation
when the dimensional dependencies between quarter and year are
known.
22. What is a surrogate key? Where do we use it? Explain with examples. A surrogate key is a substitution for the natural primary key.
It is just a unique identifier or number for each row that can
be used for the primary key to the table. The only requirement for
a surrogate primary key is that it is unique for each row in the
table.
Data warehouses typically use a surrogate, (also known as
artificial or identity key), key for the dimension tables primary
keys. They can use Infa sequence generator, or Oracle sequence, or
SQL Server Identity values for the surrogate key.
It is useful because the natural primary key (i.e. Customer
Number in Customer table) can change and this makes updates more
difficult.
Some tables have columns such as AIRPORT_NAME or CITY_NAME which
are stated as the primary keys (according to the business users)
but, not only can these change, indexing on a numerical value is
probably better and you could consider creating a surrogate key
called, say, AIRPORT_ID. This would be internal to the system and
as far as the client is concerned you may display only the
AIRPORT_NAME.
Another benefit you can get from surrogate keys (SID) is :
Tracking the SCD - Slowly Changing Dimension.
Let me give you a simple, classical example:
On the 1st of January 2002, Employee 'E1' belongs to Business Unit 'BU1' (that's what would be in your Employee dimension). This employee has a turnover allocated to him on the Business Unit 'BU1.' But on the 2nd of June the Employee 'E1' is moved from Business Unit 'BU1' to Business Unit 'BU2.' All the new turnover has to belong to the new Business Unit 'BU2,' but the old turnover should belong to the Business Unit 'BU1.'
If you used the natural business key 'E1' for your employee within your data warehouse, everything would be allocated to Business Unit 'BU2,' even what actually belongs to 'BU1.'
If you use surrogate keys, you could create on the 2nd of June a
new record for the Employee 'E1' in your Employee Dimension with a
new surrogate key.
This way, in your fact table, you have your old data (before 2nd
of June) with the SID of the Employee 'E1' + 'BU1.' All new data
(after 2nd of June) would take the SID of the employee 'E1' +
'BU2.'
You could consider a Slowly Changing Dimension as an enlargement of your natural key: the natural key of the employee was Employee Code 'E1,' but for you it becomes Employee Code + Business Unit - 'E1' + 'BU1' or 'E1' + 'BU2.' The difference with the natural-key enlargement process is that you might not have all parts of the new key within your fact table, so you might not be able to do the join on the new, enlarged key -> so you need another id.
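To make the example concrete (hypothetical dim_employee table and key values), the employee 'E1' would appear twice in the dimension, each version with its own surrogate key that the fact rows reference:

CREATE TABLE dim_employee (
    employee_key   INTEGER PRIMARY KEY,   -- surrogate key (SID)
    employee_code  VARCHAR(10),           -- natural/business key, e.g. 'E1'
    business_unit  VARCHAR(10)
);

-- Version valid before the 2nd of June: fact rows from that period carry employee_key = 101
INSERT INTO dim_employee (employee_key, employee_code, business_unit) VALUES (101, 'E1', 'BU1');

-- Version created on the 2nd of June: fact rows from then on carry employee_key = 102
INSERT INTO dim_employee (employee_key, employee_code, business_unit) VALUES (102, 'E1', 'BU2');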
25. What is the main difference between schemas in an RDBMS and schemas in a Data Warehouse?
RDBMS Schema: used for OLTP systems; traditional and old schema; normalized; difficult to understand and navigate; cannot solve extract and complex problems; poorly modeled.
DWH Schema: used for OLAP systems; new-generation schema; de-normalized; easy to understand and navigate; extract and complex problems can be easily solved; very good model.
26. What is Dimensional Modelling? Dimensional modelling is a design concept used by many data warehouse designers to build their data warehouse. In this design model all the data is stored in two types of tables - fact tables and dimension tables. The fact table contains the facts/measurements of the business, and the dimension table contains the context of the measurements, i.e., the dimensions on which the facts are calculated.
27. What is real-time data warehousing? In real-time data
warehousing, your warehouse contains completely up-to-date data and
is synchronized with the source systems that provide the source
data. In near-real-time data warehousing, there is a minimal delay
between source data being generated and being available in the data
warehouse. Therefore, if you want to achieve real-time or
near-real-time updates to your data warehouse, you'll need to do
three things:
1. Reduce or eliminate the time taken to get new and changed
data out of your source systems. 2. Eliminate, or reduce as much as
possible, the time required to cleanse, transform and load your
data. 3. Reduce as much as possible the time required to update
your aggregates. Starting with version 9i, and continuing with the
latest 10g release, Oracle has gradually introduced features into
the database to support real-time, and near-real-time, data
warehousing. These features include:
Change Data Capture; external tables, table functions, and pipelining; the MERGE command; and fast refresh materialized views.
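As a sketch of the MERGE ('upsert') pattern mentioned above (hypothetical staging and dimension tables):

MERGE INTO dim_customer d
USING stg_customer s
ON (d.customer_id = s.customer_id)
WHEN MATCHED THEN
    UPDATE SET d.customer_name = s.customer_name      -- refresh changed attributes
WHEN NOT MATCHED THEN
    INSERT (customer_id, customer_name)
    VALUES (s.customer_id, s.customer_name);           -- add brand-new customers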
28. What is a lookup table? When a table is used to check for
some data for its presence prior to loading of some other data or
the same data to another table, the table is called a LOOKUP
Table.
40. What are the Different methods of loading Dimension
tables? Conventional Load: Before loading the data, all the table
constraints will be checked against the data. Direct load:(Faster
Loading) All the Constraints will be disabled. Data will be loaded
directly. Later the data will be checked against the table
constraints and the bad data won't be indexed.
Data Warehousing - ETL Project Life Cycle (simple to understand):
Data warehousing projects are categorized into four types: 1) Development projects; 2) Enhancement projects; 3) Migration projects; 4) Production support projects.
The following are the different phases involved in an ETL project development life cycle: 1) Business Requirement Collection (BRD); 2) System Requirement Collection (SRD); 3) Design phase: a) High Level Design Document (HLD), b) Low Level Design Document (LLD), c) Mapping Design; 4) Code Review; 5) Peer Review; 6) Testing: a) Unit Testing, b) System Integration Testing, c) User Acceptance Testing.