CONFIDENTIAL
File ref: 17/7/4/1/3
Annexure B

Concept Paper on Big Data

Prepared by: CCBG ICT Subcommittee
Date: June 2017
Table of Contents
1. Introduction
   1.1 Background
   1.2 Purpose of the document
   1.3 Big Data and Central Banks
2. What is big data
3. The importance of big data in Central Banks
4. Big Data Analytics and available solutions
5. Data Harmonization
6. Way forward for doing big data analytics
7. Challenges of Big Data
   7.1 Technical
   7.2 Legal
   7.3 Privacy and Security
8. Conclusions and Recommendations
9. References
10. Annexures
1. Introduction
1.1 Background
Following the deliberations of the CCBG in February 2017, this project was incorporated into the SIA Project. The project was defined as a joint undertaking carried out by a single team comprising representatives from the following central banks:
Banco de Moçambique (team leader)
Banco Nacional de Angola
Banque Centrale du Congo
Central Bank of Lesotho
South African Reserve Bank
Reserve Bank of Zimbabwe
1.2 Purpose of the document
The purpose of this document is to provide a conceptual introduction to Big Data and to explore its potential uses in the central banks of the SADC region, as well as the challenges and implications of handling data that exhibit high-volume, high-velocity and high-variety characteristics.
1.3 Big Data and Central Banks
The concern with handling huge volumes of data started seventy-three years ago, when Fremont Rider, a librarian at Wesleyan University, wrote a book about the future challenges of managing American university libraries, estimating that they were doubling in size every sixteen years [1]. With the huge volumes of digital data, both structured and unstructured, created in recent years, this concern is more pressing today than ever, as it is now possible to analyse such data and convert it into new opportunities in ways never before possible.
According to the results of a survey conducted by the Irving Fisher Committee on Central Bank Statistics (IFC) [2], two thirds of central banks have a strong interest in big data, as it is perceived as a potentially effective tool for supporting macroeconomic and financial stability analyses. The structural shift toward the exploitation of big data by other economic agents has also increased central banks' interest. However, despite this strong interest, the survey found that the actual involvement of central banks in the use of big data is currently limited.
The challenges of exploiting big data relate, among other things, to technological complexity and to the high cost of investment in human capital and IT.
By using big data, central banks can benefit from a number of advantages, including making better informed decisions in the following areas:
Economic forecasting
Business cycle analysis
Financial stability analysis
2. What is big data
The term "big data" was first used by John Mashey, who gave a talk entitled "Big Data and the Next Wave of InfraStress" in 1998 [12], stressing the need for a new type of computing that takes new developments in technology into account.
The term "big data" is not clearly defined in the literature on this topic. One common view, however, is that big data basically refers to huge volumes of data, both structured and unstructured, that cannot be stored and processed using traditional approaches within a given time frame. These attributes, namely Volume, Velocity and Variety, characterize the big data paradigm. Veracity [5] and Value [6] were added a few years after the term was coined.
Figure 1: The 3 Vs of Big Data [7]
So, what does "huge volume" mean in terms of the size required to be classified as big data? The term big data can refer to gigabytes (GB), terabytes (TB), petabytes (PB), exabytes (EB) or anything larger. However, data can be smaller than a gigabyte and still be called big data, depending on the context in which it is used. For example, current email systems commonly do not support an email with an attachment of 300 MB. Since the email systems cannot handle an attachment of this size, the data can be regarded as big with respect to these systems. The term big data is therefore not tied to a specific data size but to the infeasibility of processing the data in a traditional computing environment.
On the other hand, popular social network sites such as Facebook, Twitter, LinkedIn, Google+ and YouTube each receive huge volumes of data on a daily basis; LinkedIn alone receives tens of terabytes per day. As the number of users on these sites keeps growing, storing and processing this data becomes a challenging task. Since the data holds a lot of valuable information, it needs to be processed within a short span of time, and traditional computing systems cannot accomplish this within the given time frame. As the resources of traditional computing systems are not sufficient for processing and storing such huge volumes of data, new techniques and tools are being developed and employed to analyse it. The term big data therefore refers not only to large data sets, but also to the frameworks, techniques and tools used to detect patterns in them.
For central banks, the big data inventory is composed of macroeconomic data, survey data, financial institution data, third-party data, micro-level data and unstructured data (examiner reports, social media, etc.) [11].
Figure 2: Traditional and Newly Emerging Data Types Merging to Form Big Data for
Central Banks [11].
3. The importance of big data in Central Banks
Historically, big data tools and techniques have had a bigger impact in certain sectors, such as the information and communications industry, than they have had in financial services. Nowadays the situation is quite different, as researchers and the financial services sector are becoming more and more involved in the use of big data for forecasting in economics and finance.
Although researchers in the field of economics are the major exploiters of big data for forecasting economic variables, some banks around the world are also exploiting big data for the same purpose. Among the various examples of forecasting with big data in economics, Choi and Varian [9] show how Google Trends variables can be used to predict economic indicators, outperforming models that exclude these predictors by 5% to 20%, and Carriero et al. [10] used big data to obtain real-time GDP predictions for the US. In addition, some banks are using big data to, among other things, enhance early detection of fraud [3] and analyse banks' intraday liquidity management [4].
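To make the Google Trends approach concrete, the sketch below follows the general idea in [9]: augment a simple autoregressive model of an economic indicator with a contemporaneous search-volume series and compare the fit with and without it. This is a minimal illustration, not the authors' code; the file name, column names and data are hypothetical assumptions.

```python
# Minimal sketch of Google-Trends-augmented forecasting, in the spirit of [9].
# Assumes a hypothetical monthly CSV with columns 'month', 'retail_sales' and
# 'trends_index' (a Google Trends search-volume index).
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("indicator_and_trends.csv", parse_dates=["month"])
df["sales_lag1"] = df["retail_sales"].shift(1)   # AR(1) term
df = df.dropna()

# Baseline: predict this month's sales from last month's sales only.
baseline = sm.OLS(df["retail_sales"],
                  sm.add_constant(df[["sales_lag1"]])).fit()

# Augmented: add the contemporaneous Google Trends index as a predictor.
augmented = sm.OLS(df["retail_sales"],
                   sm.add_constant(df[["sales_lag1", "trends_index"]])).fit()

print(f"baseline R^2:  {baseline.rsquared:.3f}")
print(f"augmented R^2: {augmented.rsquared:.3f}")  # gain = value of trends data
```

The size of the improvement in fit gives a rough sense of how much predictive content the search data adds over the indicator's own history.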
It is therefore important for central banks to understand how big data, including social media, news feeds and transaction-level trading data, can impact their strategies and operations, including making better informed decisions for macroeconomic and financial stability purposes [2], especially in the following areas:
Economic forecasting
o Inflation
o Housing prices
o Unemployment
o GDP
o Industrial production
o Retail sales
o External sector developments
o Tourism activity
Business cycle analysis
o Sentiment indicators
o Nowcasting techniques
Financial stability analysis
o Construction of risk indicators
o Assessment of investors' behaviour
o Identification of credit and market risk
o Monitoring of capital flows
o Supervisory tasks
4. Big Data Analytics and available solutions
The process of collecting, organizing and analysing huge volumes of data, both structured and unstructured, within a given time frame, with the aim of discovering patterns and other useful information, requires the use of specialized technologies. As stated before, the resources of traditional computing systems are not sufficient for processing and storing such huge volumes of data in a short time frame. Specialized big data technologies comprise a wide variety of tools and applications [13], summed up in the following table.
Technology | Description
Predictive analytics | "Software and/or hardware solutions that allow firms to discover, evaluate, optimize, and deploy predictive models by analyzing big data sources to improve business performance or mitigate risk". See Annex 2.
NoSQL databases | "Key-value, document, and graph databases". A complementary addition to RDBMSs and SQL.
Search and knowledge discovery | "Tools and technologies to support self-service extraction of information and new insights from large repositories of unstructured and structured data that resides in multiple sources such as file systems, databases, streams, APIs, and other platforms and applications".
Stream analytics | "Software that can filter, aggregate, enrich, and analyze a high throughput of data from multiple disparate live data sources and in any data format".
In-memory data fabric | "Provides low-latency access and processing of large quantities of data by distributing data across the dynamic random access memory (DRAM), Flash, or SSD of a distributed computer system".
Distributed file stores | "A computer network where data is stored on more than one node, often in a replicated fashion, for redundancy and performance".
Data virtualization | "A technology that delivers information from various data sources, including big data sources such as Hadoop and distributed data stores in real-time and near-real time".
Data integration | "Tools for data orchestration across solutions such as Amazon Elastic MapReduce (EMR), Apache Hive, Apache Pig, Apache Spark, MapReduce, Couchbase, Hadoop, and MongoDB".
Data preparation | "Software that eases the burden of sourcing, shaping, cleansing, and sharing diverse and messy data sets to accelerate data's usefulness for analytics".
Data quality | "Products that conduct data cleansing and enrichment on large, high-velocity data sets, using parallel operations on distributed data stores and databases".
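As a concrete illustration of the "Stream analytics" row in the table above, the following minimal Python sketch filters, enriches and aggregates a live feed of records over a sliding window. The payment fields, feed and threshold are hypothetical assumptions made for illustration; this does not describe any specific product.

```python
# Minimal stream-analytics sketch: filter, enrich and aggregate records
# arriving one at a time. The 'payment' records and threshold are hypothetical.
from collections import deque

WINDOW = 100          # keep the last 100 payment amounts
LARGE = 1_000_000     # flag payments above this illustrative threshold

window = deque(maxlen=WINDOW)

def on_payment(payment: dict) -> None:
    """Process one incoming payment record from a live source."""
    if payment["amount"] <= 0:                         # filter malformed records
        return
    payment["is_large"] = payment["amount"] >= LARGE   # enrich
    window.append(payment["amount"])
    avg = sum(window) / len(window)                    # aggregate over the window
    if payment["is_large"]:
        print(f"large payment {payment['id']}: "
              f"{payment['amount']} (window avg {avg:,.0f})")

# Example usage with a simulated feed:
for i, amount in enumerate([500, 2_000_000, 750, 1_200_000]):
    on_payment({"id": i, "amount": amount})
```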
Most big data tools and technologies are open source. The Apache Software Foundation, a non-profit organization, has provided valuable open source big data tools and technologies, among them Hadoop, the Hadoop Distributed File System (HDFS), Hadoop YARN and Hadoop MapReduce (see Annex 1). Organizations also have the option of using professional vendor support to help them get their big data platforms running. In addition, there are vendor-specific distributions based on open source big data technologies, such as Cloudera, Hortonworks and MapR (see Annex 2).
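As a brief, self-contained illustration of the MapReduce model that Hadoop provides, the sketch below counts word frequencies with a map phase and a reduce phase. In a real Hadoop Streaming job the mapper and reducer would run as separate scripts over standard input/output across a cluster; this is a generic textbook sketch, not part of the CCBG project.

```python
# Minimal word-count sketch in the MapReduce style used by Hadoop.
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.strip().lower().split():
            yield (word, 1)

def reducer(pairs):
    """Reduce phase: sum the counts for each word.
    Hadoop sorts mapper output by key before the reduce phase,
    which is what makes groupby correct here."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    docs = ["big data and central banks", "big data analytics"]
    for word, count in reducer(mapper(docs)):
        print(word, count)
```

The appeal of the model is that both phases parallelize naturally: the map step runs independently on each block of a distributed file store, and the framework handles the sort and shuffle between the phases.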
5. Data Harmonization
One of the biggest challenges in doing big data analytics is extracting insights from diverse information feeds coming from multiple, often unrelated sources. Before any insights can be extracted from big data, this diverse information needs to be harmonised to a common definition of granularity and common naming conventions, thereby enhancing the quality and utility of the data.
Data harmonization in big data is the process of bringing together a variety of data types, naming conventions and columns and transforming them into one cohesive data set. Without it, the data tends to sit in separate, disparate units, making the process of gathering insights from it much more difficult.
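As a minimal sketch of such harmonization, the pandas snippet below maps two differently named reporting feeds onto one naming convention and one unit of measure before combining them. The source names, column mappings and figures are hypothetical assumptions made purely for illustration.

```python
# Minimal data-harmonization sketch: two hypothetical reporting feeds with
# different column names and units are mapped to one convention and combined.
import pandas as pd

bank_a = pd.DataFrame({"InstCode": ["A1", "A2"],
                       "TotAssets_kUSD": [1200, 3400]})       # thousands of USD
bank_b = pd.DataFrame({"institution": ["B7"],
                       "total_assets_usd": [5_600_000]})       # USD

# Harmonise names and units onto a common definition (assets in USD).
a = bank_a.rename(columns={"InstCode": "institution_id",
                           "TotAssets_kUSD": "total_assets_usd"})
a["total_assets_usd"] *= 1_000                 # thousands of USD -> USD

b = bank_b.rename(columns={"institution": "institution_id"})

combined = pd.concat([a, b], ignore_index=True)   # one cohesive data set
print(combined)
```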
Another aspect of data harmonization concerns the financial firms that central banks regulate. Central banks have typically collected aggregate data from firms through reporting returns structured like standard financial statements. However, standard financial statements by themselves do not provide all the information needed when further questions arise. One way to overcome the gaps left by standard financial statements is to collect granular data from financial firms once, enabling the central bank to better spot systemic risk and manage it with macroprudential policy, as the sketch after this paragraph illustrates. In addition, to enhance the quality and utility of the data, common definitions of granular data attributes need to be harmonised and enforced, both across the organisation and across financial firms.
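The following sketch shows, with entirely hypothetical loan-level figures, why granular data collected once can answer questions that aggregate returns cannot: a new supervisory question becomes a simple aggregation rather than a new reporting return.

```python
# Sketch of answering an ad-hoc question from granular (loan-level) data.
# All banks, sectors and amounts are hypothetical.
import pandas as pd

loans = pd.DataFrame({
    "bank":   ["A", "A", "B", "B", "B"],
    "sector": ["construction", "retail", "construction",
               "construction", "agriculture"],
    "amount": [50, 20, 80, 40, 10],
})

# New question: how exposed is each bank to the construction sector?
# With granular data this is one aggregation; with only standard
# financial statements, a new reporting return would be required.
exposure = (loans[loans["sector"] == "construction"]
            .groupby("bank")["amount"].sum())
print(exposure)
```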
It should be noted that the ongoing SIA project is working on a data harmonization component, which has much in common with the data harmonization component of big data. The most sensible course would be to extend the scope of SIA to include the data harmonization component of big data.
6. Way forward for doing big data analytics
Due to the great potential offered by big data, we believe that it is just a matter of time before big data exploitation takes place in the central banks. However, for this to happen smoothly, the central banks need a strong roadmap, which can be guided by the following steps [14]:
1. Strategic plan
o Identify strategic priorities
2. Identify opportunities
o Brainstorm and ask crunchy questions
3. Determine data sources and assess
o Data and applications landscape, including archives
o Analytics and BI capabilities, including skills
o New technology adoptions
o IT strategy, priorities, policies, budget and investments
o Current projects
o Current data, analytics and BI problems
4. Identify/define use cases
o Based on the assessments and business priorities, identify and prioritize big data use cases
5. Pilots and prototypes
o Identify tools, technologies and processes for the use cases and implement pilots and prototypes
6. Adopt in production
o Prioritize and implement successful high-value initiatives in production
7. Challenges of Big Data
7.1 Technical
1. Scarce availability of data scientists and people with expertise in mathematics, statistics, data engineering, pattern recognition, advanced computing, visualization and modelling to handle and analyse big data [8].
2. Most statisticians are familiar only with traditional statistical techniques, so retraining them to develop the skills required for big data will be a great challenge.
3. Determining how to get value from big data. The nature of big data implies a high and increasing noise-to-signal ratio over time, distorting the accuracy of expected results. Because of this, retrieving useful information is more complex than with traditional statistical techniques.
4. The cost and effort associated with acquiring a new set of tools and technologies to deal with extreme volumes and varieties of data formats.
5. The cost associated with maintaining data quality with regard to completeness, validity, integrity, consistency, timeliness and accuracy.
6. Legacy frameworks and disparate and third-party systems that are difficult to integrate pose additional challenges in implementing big data.
7. Information gaps when collecting financial statements from other firms. The lack of a framework that enforces the delivery of granular information from firms can undermine the results.
8. Lack of data harmonization can jeopardize the utility of the data, as the data will sit in separate, disparate units, making the process of gathering insights from it much more difficult.
7.2 Legal
9. Re-evaluation of internal and external data policies and of the regulatory environment.
10. The use of big data can be limited or complicated by laws that regulate the use of certain types of data.
7.3 Privacy and Security
11. Working with data of huge volume, variety and complexity can increase the risk of data breaches and thereby pose a threat to privacy. To make things worse, traditional security mechanisms such as firewalls and demilitarized zones may not be suitable for protecting the big data environment.
8. Conclusions and Recommendations
Big data can be leveraged to provide central banks with better informed decisions for macroeconomic and financial stability purposes. We therefore believe that it is just a matter of time before big data exploitation takes place in the central banks. The central banks of SADC have an interest in clarifying the benefits of using big data, in the process of collecting, organizing and analysing huge volumes of data, and in managing all the associated challenges. Cooperation between the central banks of SADC can respond to these needs through the following actions:
Creation of a strong big data roadmap in order to establish the foundation and structure for the successful usage and exploitation of big data.
Creation of a data lab with all the necessary resources (computers, software, IT experts) to help the organization and/or counterparties collect, organise and analyse huge volumes of data, both structured and unstructured. The lab can also serve to create small-scale big data prototypes that meet the business goals, which can later evolve into full-fledged big data solutions.
Participation and training in big data workshops, mainly focused on the use of big data in central banks.
Establishment of a data harmonization framework that applies across the organization, across central banks and to financial firms.
9. References
1. Press, G. (2013), "A Very Short History of Big Data", Forbes, web page, available at https://www.forbes.com/sites/gilpress/2013/05/09/a-very-short-history-of-big-data/#51bd5bb865a1
2. Tissot, B., Hülagü, T., Nymand-Andersen, P., & Suarez, L. C. (2015), "Central banks' use of and interest in 'big data'", Bank for International Settlements, 2015.
3. Davenport, T. (2014), "Big data at work: dispelling the myths, uncovering the opportunities", Harvard Business Review Press, Boston.
4. Merrouche, S. (2014), "Banks' intraday liquidity management during operational outages: theory and evidence from the UK payment system", Bank of England Working Paper No. 370, available at http://www.bankofengland.co.uk/research/Documents/workingpapers/2009/wp370.pdf
5. IBM (2016), "The Four V's of Big Data", web page, available at http://www.ibmbigdatahub.com/infographic/four-vs-big-data
6. BBVA (2017), "The five V's of big data", web page, available at https://www.bbva.com/en/five-vs-big-data/
7. EUCLID, "Chapter 6: Scaling up Linked Data", web page, available at http://euclid-project.eu/modules/chapter6.html
8. Poynter, R. (2013), "Big data successes and limitations: what researchers and marketers need to know", web page, available at http://www.visioncritical.com/blog/big-data-successes-and-limitations. Accessed 14 Jul 2017.
9. Choi, H., & Varian, H. (2011), "Predicting the present with Google Trends", available at http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.221.2435&rep=rep1&type=pdf
10. Carriero, A., Clark, T. E., & Marcellino, M. (2012), "Real-time nowcasting with a Bayesian mixed frequency model with stochastic volatility", Working Paper No. 1227, Federal Reserve Bank of Cleveland, available at https://www.ecb.europa.eu/events/pdf/conferences/140407/MarcellinoReal_TimeNowcastingWithABayesianMixedFrequencyModelWithStochasticVolatility.pdf?1349ebe7044626a4f953406dac102015
11. Casey, M. (2014), "Emerging Opportunities and Challenges with Central Bank Data", presentation slides, available at https://www.ecb.europa.eu/events/pdf/conferences/141015/presentations/Emerging_opportunities_and_chalenges_with_Central_Bank_data-presentation.pdf?6074ecbc2e58152dd41a9543b1442849
12. Lohr, S. (2013), "The Origins of 'Big Data': An Etymological Detective Story", web page, available at https://bits.blogs.nytimes.com/2013/02/01/the-origins-of-big-data-an-etymological-detective-story/
13. Press, G. (2016), "Top 10 Hot Big Data Technologies", web page, available at https://www.forbes.com/sites/gilpress/2016/03/14/top-10-hot-big-data-technologies/#750957af65d7
14. Deloitte (2013), "Big Data Challenges and Success Factors", presentation slides, available at https://www2.deloitte.com/content/dam/Deloitte/it/Documents/deloitte-analytics/bigdata_challenges_success_factors.pdf
10. Annexures
Annex 1 – Hadoop Components & Ecosystem
Annex 2 – Oracle Integrated Solution for Big Data and Oracle Big Data Platform