National Enterprise Wide Statistical System (NEWSS) Data Migration
Azizah Bt Hashim, Nur Hurriyatul Huda Bt Abdullah Sani
Abstract
This paper examines the benefits of data migration in the National
Enterprise Wide Statistical System (NEWSS) project to the Department
of Statistics Malaysia (DOSM). The migration data are drawn from
NEWSS Phases I and II, including data from the Economic Census 2005
and 2010. The ETL model is a comprehensive way to study NEWSS data
migration because it lets us investigate the effect on the data
across sectors in the Department. The general finding of this paper
is that the contribution of data migration activities to the
operation of the NEWSS system increased significantly in 2014
compared to 2011. This is in line with the Department's mission and
objective to produce national statistics with integrity and
reliability through the use of the best technology, and to improve
and strengthen statistical services and the delivery system.
Keywords: Database migration, Data Migration, ETL, Objective and
Mission
1. Introduction
With the rapidly growing business requirements in DOSM and new
enterprise-wide application integration, organizations reach a stage
where they have to change from working with separate databases on
multiple platforms to a single, integrated one. Migration also
happens when an organization realizes that its existing systems have
performance and scalability limitations that cannot cater to its
ever-expanding business needs.
Data migration is the process of transferring data between storage
types, formats, or computer systems. It is required when
organizations or individuals change computer systems, upgrade to new
systems, or when systems merge. Data migration is usually performed
programmatically to achieve an automated migration.
Figure 1: Data migration flow in DOSM
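The programmatic, automated copying described above can be sketched as
follows. This is a minimal illustration assuming SQLite files on both
sides; the table and column names are hypothetical, not DOSM's actual
schema.

```python
import sqlite3

def migrate_table(source_db, target_db, table, columns):
    """Copy every row of one table from a source database file to a
    target database file, creating the target table if needed."""
    src = sqlite3.connect(source_db)
    tgt = sqlite3.connect(target_db)
    col_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    # Illustrative: target columns are created without type constraints.
    tgt.execute("CREATE TABLE IF NOT EXISTS %s (%s)" % (table, col_list))
    rows = src.execute("SELECT %s FROM %s" % (col_list, table)).fetchall()
    tgt.executemany(
        "INSERT INTO %s (%s) VALUES (%s)" % (table, col_list, placeholders),
        rows)
    tgt.commit()
    src.close()
    tgt.close()
    return len(rows)
```

A real migration would add logging, batching for large tables, and
row-count reconciliation between source and target.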
There is a difference between data migration and database
migration, though database migration encompasses data migration as
well. Database migration essentially means the movement of data and
the conversion of various other structures and objects associated
with the database, including the schema and the applications
associated with the current system, to a different
technology/platform. Database migration is one of the most common
but major tasks in any application migration. Examples of activities
comprised in database migration are:
Business Logic - Stored Procedures, Triggers, Packages, Functions
Schema - Tables, Views, Synonyms, Sequences, Indexes
Physical - Data Security, Users, Roles, Privileges
Database dependencies of applications associated with the database
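As an illustration of the schema portion of such a move, the sketch
below replays the DDL for tables, views and indexes between two SQLite
databases. A real platform change (e.g. Oracle to MySQL) would also
require dialect conversion of each statement; the object names in the
test are invented.

```python
import sqlite3

def migrate_schema(source_db, target_db):
    """Replay the CREATE statements for tables, views and indexes from
    a source SQLite database into a target SQLite database."""
    src = sqlite3.connect(source_db)
    tgt = sqlite3.connect(target_db)
    # sqlite_master keeps the original DDL; emit tables first so that
    # views and indexes depending on them can be created afterwards.
    ddl = src.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE sql IS NOT NULL AND name NOT LIKE 'sqlite_%' "
        "ORDER BY CASE type WHEN 'table' THEN 0 ELSE 1 END"
    ).fetchall()
    for (statement,) in ddl:
        tgt.execute(statement)
    tgt.commit()
    src.close()
    tgt.close()
    return len(ddl)
```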
Data migration is simply the movement of data from one database
(or file system)/platform to another. This may include extraction of
the data, cleansing of the data and loading it into the target
database. For example, when an application is developed, the data is
required for the newly developed application to operate. In this
case only the data is moved from the source database to the database
used by the new application. In simple terms, database migration
refers to a shift from one type of database system to an entirely
new type of database system, or to a database system with entirely
new features and functionality. Hence data migration is a subset of
the work when database migration activities are carried out, though
data migration may also be taken up independently.
An interesting question is why it is necessary to move to another
database while the existing systems run well on the current one. The
reasons why data migration activities were carried out in the NEWSS
project in DOSM are:
1. Avoid business failure
2. Improve corporate performance and deliver competitive advantage
3. Efficient and effective business processes (centralized database)
4. Measurable and accurate view of data
5. Perceived better value in the newer system in terms of
standardization of operational fieldwork and data entry
2. Literature review
A number of studies have been conducted on best practices for data
migration, for example Best Practices for Data Migration:
Methodologies for Assessing, Planning, Moving and Validating Data
Migration by IBM Global Technology Services (October 2009) and by
NetApp Global Services (January 2006). Meanwhile, a study on
Database Migration Approach & Planning was done by Keshav Tripathy,
Pragjnyajeet Mohanty and Biraja Prasad Nath (2002).
In his study Patterns for Data Migration Projects (March 17, 2011),
Martin Wagner concludes that the quality constraints on the data in
the old system may be lower than the constraints in the target
system. Inconsistent or missing data entries that the legacy system
somehow copes with (or ignores) might cause severe problems in the
target system. In addition, the data migration itself might corrupt
the data in a way that is not visible to the software developers but
only to business users.
NetApp Global Services (January 2006), in Data Migration Best
Practices, states that for IT managers, data migration has become
one of the most routine, and challenging, facts of life. With the
increase in the percentage of mission-critical data and the
proportionate increase in data availability demands, downtime, with
its huge impact on a company's financial bottom line, becomes
unacceptable. In addition, business, technical and operational
requirements impose challenging restrictions on the migration
process itself. Resource demands (staff, CPU cycles, and bandwidth)
and risks (application downtime, performance impact on production
environments, technical incompatibilities, and data corruption or
loss) make migration one of IT's biggest challenges. Since the
majority of storage systems purchased by customers are used to store
existing rather than new data, getting these new systems
production-ready requires that data be copied or moved from the old
system being replaced to the new system being deployed. Whether the
migration is performed by internal IT or an external services
provider, the migration methodology is the same.
On the other hand, IBM Global Technology
Services (October 2009) mention that when systems must be taken down
for migration, business operations can be seriously affected. A key
way to minimize the business impact of data migration is to use best
practices that incorporate planning, technology implementation and
validation. Any change in the storage infrastructure, whether it is
a technology refresh, consolidation, relocation or storage
optimization, requires an organization to migrate data. There are a
variety of software products that can be used to migrate data,
including volume-management products, host- or array-based
replication products and relocation utilities, as well as
custom-developed scripts. Each of these has strengths and weaknesses
surrounding performance, operating system support, storage-vendor
platform support and whether or not application downtime is required
to migrate the data. Some of these products enable online migration
of data, so applications don't need to be taken offline during the
migration process. A subset of these provides nondisruptive
migration, which means that applications not only remain online, but
also that application processing continues without interruption or
significant performance delays. Therefore, IT organizations should
carefully explore software options. Specific requirements can help
determine the best software technology to use for each migration.
In addition, Keshav Tripathy, Pragjnyajeet
Mohanty and Biraja Prasad Nath, in Database Migration Approach &
Planning, state that database migration consists of three major
components:
Schema Migration - This consists of mapping and migrating the source
schema to the target schema. For this, the schema needs to be
extracted from the source system and the equivalent needs to be
replicated in the target system.
Data Migration - This is the part where the data is extracted from
the source database. It is then checked for consistency and
accuracy, and cleansed if necessary. Finally, it is loaded into the
target system.
Application Migration - This necessarily consists of changing the
database-dependent areas (function calls, data access methods, etc.)
of the application so that the input/output behavior of the
converted application with the target database is exactly identical
to that of the original application with the source database.
However, this paper focuses on the methodology of data migration
only, covering the data for NEWSS Phases I and II. The problems with
migration in DOSM include the difficulty of obtaining the final
source data right up until the completion of the migration project.
Hence, the applied methodology and approach described here help in
the preparation of quality data for the Department.
3. Methodology
3.1 Definition of Data Migration
According to en.wikipedia.org, data migration is the process of
transferring data between storage types, formats, or computer
systems. Data migration is usually performed programmatically to
achieve an automated migration; performing it programmatically frees
human resources from tedious tasks.
It is also required when organizations or individuals change
computer systems or upgrade to new systems, or when systems merge.
In the DOSM environment, migration happens by transferring databases
from old silo systems to an integrated one.
For the purpose of this study, we considered the impact of data
migration in the National Enterprise Wide Statistical System (NEWSS)
on the Department of Statistics Malaysia.
3.2 Source of Data
The migration data are drawn from NEWSS Phases I and II, including
data from the Economic Census 2005 and 2010.
3.3 Analysis
The processes that occur during migration are analysis, mapping,
planning, designing, testing, loading and verifying. The analysis
happens in the source system; after that, the data is extracted and
transformed into a staging area. The staging area is a workspace
where we clean, apply rules to, and validate the data before loading
it into the target.
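The staging-area work (clean, apply rules, validate before loading)
might be sketched as below. The two rules shown are hypothetical
examples for illustration, not the Department's actual edit checks.

```python
def validate_staging(records, rules):
    """Split staged records into clean rows (ready to load) and
    rejected rows (returned for correction), according to a set of
    named rule functions."""
    clean, rejected = [], []
    for rec in records:
        failures = [name for name, rule in rules.items() if not rule(rec)]
        if failures:
            rejected.append((rec, failures))
        else:
            clean.append(rec)
    return clean, rejected

# Illustrative rules: every record needs an establishment ID and a
# non-negative revenue figure before it may leave the staging area.
rules = {
    "has_est_id": lambda r: bool(r.get("est_id")),
    "revenue_non_negative": lambda r: r.get("revenue", 0) >= 0,
}
```

Keeping the rules as named entries makes the rejection report
self-describing: each rejected record carries the list of rules it
failed.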
Figure 1.1: Data Migration Methodology
3.4 Data Migration Life Cycle
The data migration life cycle comprises six phases: analyze, map,
high-level design, detail design, construct, and test & deploy.
Figure 1.2: Data Migration Life Cycle
3.5 Data Migration Work Flow
Figure 1.3: Data Migration Work Flow
In the data migration process, there are a few steps that need to
be done, as below:
a) Prerequisite - All the requirements for migration need to be
defined properly, and the field list needs to be prepared based on
the NEWSS database format. All the questionnaire forms, sample data
and explanations of the requirements are needed for further
processing.
b) Mapping - This is the phase to identify database fields in the
source file and map them to the fields in the NEWSS databases. The
mapping is important to identify which source field relates to which
field in the NEWSS databases, to identify the fields that change
from year to year in the questionnaire and in the source data, and
to record information about changes in each field, such as a new
field created for the current database, or fields dropped, split out
or combined. All this information is very important for the next
phase to be executed.
c) Designing - In this phase the migration script is developed based
on the mapping gathered in step (b) and by referring to the
questionnaire and field list from step (a). After the script has
been designed, it is first tested with the sample data gathered in
step (a). This testing needs to be performed to ensure that the
script meets the requirements and is free of errors.
d) Data Source - Verification of the data sources and formats is
done in this phase.
e) Cleansing and Loading Data - Verification of the data sources
with the migration script is done to check whether the source data
is clean or not. If the verification shows errors, detailed checking
and correction need to be made by the subject-matter division (SMD).
Otherwise, if the verification is successful, the data goes to final
verification and is prepared for the next step.
f) Production Load - The data with final verification prepared in
step (e) is loaded into the production database. After the data has
been loaded, some testing of the loaded data using the NEWSS system
needs to be performed by SMD. After successful testing, SMD gives
its approval of the migrated data.
3.6 Master Data Management
Master Data Management comprises four main activities, as below:
1. Identify data source - In NEWSS, data
main activity as below:1. Identify data source - In NEWSS, data
source will come from other Statistical system/OLTP, and various
sources such as from MS Excel, csv, MS Access, flat file, My SQL,
MS SQL Server, etc. 2. Create data profiling - the process of
examining the DOSM data such as Economic, Population, Trade,
External Trade, Labour Force , everything is available in existing
data source information. The important of profiling are:
a) to understand your data completely and fully
b) to improve data quality (by clarifying the structure, content and
relationships)
c) to improve users' understanding of anomalies in the data (the
basis for an early go/no-go decision)
d) to discover, register and assess enterprise metadata (profiling
validates metadata when it is available and discovers metadata when
it is not)
e) to improve data accuracy in corporate databases (which helps to
assure that data cleansing and transformations have been done
correctly according to requirements)
3. Check data quality - This process aims to discover what data has
been missed and where things go wrong, enabling confident decisions
and reliable data for further data analysis and analytics. For
example, it checks all relevant data, such as gender, address,
postcode, district, state, and date format, for a given respondent.
Common data problems like misspellings, typing errors and random
abbreviations are cleaned up.
4. Extract, transform and load - Extract refers to extracting data
from multiple sources and formats (MS Excel, CSV, MS Access, flat
files, Oracle, MySQL, MS SQL Server, etc.) into a single
standardized format. Transform involves data mapping, verification,
code generation and data conversion. Load transfers the data from
the historical data into the production database or end target.
However, in the current NEWSS migration process, the data profiling
and data quality activities have not been done because of certain
constraints. To substitute for these unavailable components, data
checking happens within the extract, transform and load process,
although this makes the migration slightly longer to complete. The
practice during these ETL activities is that any data problem is
passed back to SMD for checking, and SMD checks it and takes the
corresponding action. Once completed, SMD sends the data back to
BPM. The new data needs to be validated again, and if there is still
a data problem it must be returned to SMD again for correction. This
process usually drags out the time needed for the ETL activities.
Figure 1.4: Extract, Transform and Load (ETL) diagram
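One pass of the ETL cycle described above, including the pass-back of
problem records to SMD, can be sketched as follows. The function names
and the sample records are illustrative only; the MSIC code shown is a
made-up value.

```python
def run_etl(extract, transform, load, validate):
    """One pass of the ETL cycle: records that fail validation are
    collected for return to the subject-matter division (SMD) instead
    of being loaded into the target."""
    loaded, returned_to_smd = 0, []
    for raw in extract():
        rec = transform(raw)       # mapping / conversion step
        if validate(rec):          # data checking inside ETL
            load(rec)
            loaded += 1
        else:
            returned_to_smd.append(rec)
    return loaded, returned_to_smd

# Illustrative use: strip whitespace from an MSIC code field and
# reject records whose code is empty after cleaning.
target = []
loaded, rejects = run_etl(
    extract=lambda: [{"msic": " 01111 "}, {"msic": "   "}],
    transform=lambda r: {"msic": r["msic"].strip()},
    load=target.append,
    validate=lambda r: r["msic"] != "",
)
```

In the NEWSS practice described above, the `rejects` list would be the
material sent back to SMD for correction before the next pass.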
3.7 Migration Tools
In the market for migration tools, there are many software tools
used for ETL purposes. Each of the available tools has its own
strengths and weaknesses. Below are the tools that were explored
during the migration process conducted in DOSM.
a) Talend
Really powerful, stable and customizable
Quite well embeddable; it produces Java code
The drawback is the learning curve
b) Pentaho
Its ETL tool (named Kettle) is just a component of the Pentaho
Business Intelligence open platform
It is Java-based
The major drawback is that Kettle is much harder to extend than
Talend
c) CloverETL
A mostly younger tool
Light, easily embeddable and easy to learn
But it is really much less powerful than Talend and even than Kettle
3.8 Challenges
The challenges that occur during the migration process in DOSM are:
a) Analysis - Difficulty in the data collection process, because the
data comes from different sources, and misunderstanding of user
requirements.
b) Mapping - Uncontrollable mapping versioning due to frequent
changes to the survey form.
c) Design of Migration Script - Usually a time-constraint problem,
because a large amount of data is notified to be migrated in a short
time frame, sometimes with ad hoc requests from SMD.
d) Data Cleansing - It is hard to get the real final data because of
several revision releases. Sometimes data cleansing needs to be
confirmed several times with the SMD, especially if there is a data
problem during the ETL process.
e) Testing - Checking the data involves many reference tables, i.e.
the Establishment Frame, MSIC Code, Household Frame, locality, etc.
f) Data Loading - Some data patching activities impact the new
version of the final data.
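The reference-table checking mentioned under testing amounts to
verifying that every code in the migrated data exists in the
corresponding lookup table. A minimal sketch; the field name and codes
below are invented for illustration:

```python
def check_reference(records, field, reference_codes):
    """Return the records whose value in `field` does not appear in
    the given reference table (e.g. an MSIC code list, establishment
    frame, or locality table)."""
    valid = set(reference_codes)  # set lookup keeps the scan fast
    return [rec for rec in records if rec.get(field) not in valid]
```

Records flagged by such a check would be among the data problems
passed back to SMD for correction.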
4. Findings and Discussion
4.1 Way Forward
Based on the migration activities done for the NEWSS system in DOSM,
some findings and enhancements are suggested, as below:
CURRENT                                  FUTURE
No profiling activity                    Establish a profiling phase
No centralized final data                StatsDW
Patching                                 Less data patching
Repeated migration runs produce          Final data is fixed and
varying final data                       cannot be altered
Hard to get clean data, which            Fix the data cleansing process
affects the DM process                   at the beginning stage
Table 1: Findings during the migration process
4.2 Contribution of Migration Activities to DOSM
Based on the migration process for NEWSS, the contribution of the
migration activities is reflected in the findings summarized in
Table 1 above.
5. Concluding Remarks
The contribution of data migration to the operation of NEWSS has
been proven by the increasing demand from SMD for migrated data.
However, final clean data is crucial to the contribution of the
migration process. By improving the quality of data in each
division, creating integrity and reliable data for the public in the
National Enterprise Wide Statistical System (NEWSS) will add value
to DOSM workforce productivity, in line with the Department's vision
of becoming a leading statistical organization internationally by
2020. Hence, this study was conducted with the aim of investigating
whether the data migration activities in the National Enterprise
Wide Statistical System (NEWSS) project are in line with the
Department's intention. The migration methodology applied in the ETL
process in DOSM contributes to the preparation of quality data and
expedites operational fieldwork. The migration process reduced the
time needed to update the frame for operational fieldwork from a
manual operation taking one to two weeks to a one-day operation.
References
Best Practices for Data Migration: Methodologies for Assessing,
Planning, Moving and Validating Data Migration, IBM Global
Technology Services, October 2009.
Keshav Tripathy, Pragjnyajeet Mohanty and Biraja Prasad Nath, GDU
Surface Transport, Bhubaneswar, Satyam Technology Center, September
12 and 13, 2002, Database Migration: Approach & Planning.
Data Migration Best Practices, NetApp Global Services, January 2006.
Martin Wagner, Tim Wellhausen, March 17, 2011, Patterns for Data
Migration Projects.
Derek Wilson, Practical Data Migration Strategies, 23 April 2014.