2. INTRODUCTION
Big Data is not about the size of the data; it is about the value within the data. We are awash in a flood of data today. In a broad range of application areas, data is being collected at unprecedented scale. Decisions that previously were based on guesswork, or on painstakingly constructed models of reality, can now be made based on the data itself. Such Big Data analysis now drives nearly every aspect of our modern society, including mobile services, retail, manufacturing, financial services, life sciences, and physical sciences. In 2010 the term Big Data was virtually unknown, but by mid-2011 it was being widely touted as the latest trend, with all the usual hype. Like cloud computing before it, the term has today been adopted by everyone, from product vendors to large-scale outsourcing and cloud service providers keen to promote their offerings.
In short, Big Data is about quickly deriving business value from a range of new and emerging data sources, including social media data, location data generated by smartphones and other roaming devices, public information available online, data from sensors embedded in cars, buildings and other objects, and much more besides.
Much has been written about "Big Data" in the last couple of years, but just what is it? As now commonly
used, the term Big Data refers not just to the explosive growth in
data that almost all organizations are experiencing, but also the
emergence of data technologies that allow that data to be
leveraged. Big Data is a holistic term used to describe the ability
of any company, in any industry, to find advantage in the ever-increasing amount of data that now flows continuously into those enterprises, as well as the semi-structured and unstructured
data that was previously either ignored or too costly to deal with.
The problem is that as the world becomes more connected via
technology, the amount of data flowing into companies is growing
exponentially and identifying value in that data becomes more
difficult - as the data haystack grows larger, the needle becomes
more difficult to find. So Big Data is really about finding the
needles: gathering, sorting and analyzing the flood of data to find
the valuable information on which sound business decisions are
made. Big data is the term for a collection of data sets so large
and complex that it becomes difficult to process using on-hand
database management tools or traditional data processing
applications. The challenges include capture, curation, storage,
search, sharing, transfer, analysis and visualization. The trend to
larger data sets is due to the additional information derivable
from analysis of a single large set of related data, as compared to
separate smaller sets with the same total amount of data, allowing
correlations to be found to "spot business trends, determine
quality of research, prevent diseases, link legal citations, combat
crime, and determine real-time roadway traffic conditions." The term
big data refers to data sets so large and complex that traditional
tools, like relational databases, are unable to process them in an
acceptable time frame or within a reasonable cost range. Problems
occur in sourcing, moving, searching, storing, and analyzing the
data.
Definition of Big Data
Big Data has been defined in various ways by different
organizations over the years. A few of them include:
IBM's definition: Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in
the last two years alone. This data comes from everywhere: sensors
used to gather climate information, posts to social media sites,
digital pictures and videos, purchase transaction records, and cell
phone GPS signals to name a few. This data is big data.
In simple words, Big Data is a set of technology advances that have made
capturing and analyzing data at high scale and speed vastly more
efficient.
Big data is a massive volume of both structured and unstructured
data that is so large that it's difficult to process using
traditional database and software techniques.
Some examples of Big Data:
An airline jet collects 10 terabytes of sensor data for every 30
minutes of flying time.
Twitter has over 500 million registered users.
1. The USA, whose 141.8 million accounts represent 27.4 percent of all Twitter users, is well ahead of Brazil, Japan, the UK and Indonesia.
2. 79% of US Twitter users are more likely to recommend brands
they follow.
3. 67% of US Twitter users are more likely to buy from brands
they follow.
4. 57% of all companies that use social media for business use
Twitter.
How fast data is increasing:
Consider a picture of what happens on the internet every 60 seconds. From it we can understand how much data is being generated in a second, a minute, a day or a year, and how exponentially it is growing. As per analysis by Tech News Daily, we might generate more than 8 zettabytes of data by 2015.
Defining Big Data: the 3V model & Characteristics of Big Data
Many analysts use the 3V model to define Big Data. The three Vs
stand for volume, velocity and variety.
Volume refers to the fact that Big Data involves analysing
comparatively huge amounts of information, typically starting at
tens of terabytes. Velocity reflects the sheer speed at which this
data is generated and changes. For example, the data associated
with a particular hashtag on Twitter often has a high velocity.
Tweets fly by in a blur. In some instances they move so fast that
the information they contain can't easily be stored, yet it still needs to be analysed. Variety describes the fact that Big Data can
come from many different sources, in various formats and
structures. For example, social media sites and networks of sensors
generate a stream of ever-changing data. As well as text, this
might include, for example, geographical information, images,
videos and audio.
Big Data Problem:
Traditional systems built within the company for handling relational databases may not be able to support or scale to data arriving with high volume, velocity and variety.
Volume: As an example, terabytes of posts generated on Facebook or 400 billion annual Twitter tweets could mean Big Data! This enormous amount of data must be stored somewhere so that it can be analyzed to produce data science reports for different solutions and problem-solving approaches.
Velocity: Big Data requires fast processing. The time factor plays a crucial role in many organizations. For instance, millions of records are generated in the stock market, and they need to be stored and processed at the same speed as they come into the system.
Variety: There is no specific format of Big Data. It could be in
any form such as structured, unstructured, text, images, audio,
video, log files, emails, simulations, 3D models, etc. Until now we
have been working with only structured data. It might be difficult
to handle the quality and quantity of unstructured or
semi-structured data that we are generating on a daily basis.
How Big Data handles the above problems:
Distributed File System (DFS): In DFS, we can divide a large set of data files into smaller blocks and load these blocks onto multiple machines, which are then ready for parallel processing. For example, if we have 1 terabyte of data to read with 1 machine and 4 input/output channels, each channel reading at 100 MB/sec, the whole 1 TB of data will be read in roughly 45 minutes. On the other hand, if we have 10 different machines, we can divide the 1 TB of data across the 10 machines and read it in parallel, which reduces the total time to only about 4.5 minutes.
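The arithmetic behind these figures can be sketched in a few lines of Python; the machine count, channel count and per-channel speed are simply the illustrative values used above.

# Back-of-the-envelope read-time estimate for the DFS example above.
# Assumed values: 1 TB of data, 4 I/O channels per machine, 100 MB/sec per channel.
def read_time_minutes(total_mb, machines, channels_per_machine, mb_per_sec):
    # Aggregate throughput grows linearly with machines and channels.
    throughput = machines * channels_per_machine * mb_per_sec
    return total_mb / throughput / 60

TOTAL_MB = 1_000_000  # 1 TB expressed in megabytes
print(read_time_minutes(TOTAL_MB, 1, 4, 100))   # about 42 minutes, the rough 45-minute figure above
print(read_time_minutes(TOTAL_MB, 10, 4, 100))  # about 4.2 minutes, the rough 4.5-minute figure above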
Parallel Processing: When data resides on N servers, the combined power of those N servers can be used to process it in parallel for analysis, which reduces the time the user must wait for the final report or analyzed data.
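As an illustration only, not tied to any particular Big Data framework, the divide-and-process-in-parallel idea can be sketched with Python's standard multiprocessing support:

# Toy illustration of parallel processing: split a data set into chunks
# and let a pool of worker processes analyze the chunks concurrently.
from concurrent.futures import ProcessPoolExecutor

def analyze(chunk):
    # Stand-in for real analysis work, e.g. aggregation or filtering.
    return sum(chunk)

def parallel_sum(data, workers=4):
    # Divide the data into chunks (the "divide" step).
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Process the chunks in parallel and combine the partial results.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(analyze, chunks))

if __name__ == "__main__":
    print(parallel_sum(list(range(1_000_000))))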
Fault Tolerance: The fault-tolerance feature of Big Data frameworks (like Hadoop) is one of the main reasons for using these frameworks to run jobs. Even when running jobs on a large cluster where individual nodes or network components may experience high rates of failure, Big Data frameworks can guide jobs toward successful completion because the data is replicated across multiple nodes.
Use of Commodity Hardware: Most Big Data tools and frameworks run on commodity hardware, which reduces the cost of the total infrastructure and makes it easy to add more nodes as the data size increases.
The Potentials and Difficulties of Big Data
Big data needs to be considered in terms of how the data will be
manipulated. The size of the data set will impact data capture,
movement, storage, processing, presentation, analytics, reporting,
and latency. Traditional tools quickly can become overwhelmed by
the large volume of big data. Latency time it takes to access the
datais as an important a consideration as volume. Suppose you might
need to run an ad hoc query against the large data set or a
predefined report. A large data storage system is not a data
warehouse, however, and it may not respond to queries in a few
seconds. It is, rather, the organization-wide repository that
stores all of its data and is the system that feeds into the data
warehouses for management reporting. One solution to the problems
presented by very large data sets might be to discard parts of the
data so as to reduce data volume, but this isn't always practical.
Regulations might require that data be stored for a number of
years, or competitive pressure could force you to save everything.
Also, who knows what future benefits might be gleaned from historic
business data? If parts of the data are discarded, then the detail
is lost and so too is any potential future competitive advantage.
Instead, a parallel processing approach can do the trick: think
divide and conquer. In this ideal solution, the data is divided
into smaller sets and is processed in a parallel fashion. What
would you need to implement such an environment? For a start, you
need a robust storage platform that's able to scale to a very large
degree as the data grows and one that will allow for system
failure. Processing all this data may take thousands of servers, so
the price of these systems must be affordable to keep the cost per
unit of storage reasonable. In licensing terms, the software must
also be affordable because it will need to be installed on
thousands of servers. Further, the system must offer redundancy in
terms of both data storage and hardware used. It must also operate
on commodity hardware, such as generic, low-cost servers, which
helps to keep costs down. It must additionally be able to scale to
a very high degree because the data set will start large and will
continue to grow. Finally, a system like this should take the
processing to the data, rather than expect the data to come to the
processing. If the latter were to be the case, networks would
quickly run out of bandwidth.
Requirements for a Big Data System
This idea of a big data system requires a tool set that is rich
in functionality. For example, it needs a unique kind of
distributed storage platform that is able to move very large data
volumes into the system without losing data. The tools must include
some kind of configuration system to keep all of the system servers
coordinated, as well as ways of finding data and streaming it into
the system in some type of ETL-based stream. (ETL, or extract,
transform, load, is a data warehouse processing sequence.) Software
also needs to monitor the system and to provide downstream
destination systems with data feeds so that management can view
trends and issue reports based on the data. While this big data
system may take hours to move an individual record, process it, and
store it on a server, it also needs to monitor trends in real
time. In summary, to manipulate big data, a system requires the
following:
A method of collecting and categorizing data
A method of moving data into the system safely and without data loss
A storage system that is distributed across many servers, is scalable to thousands of servers, offers data redundancy and backup, offers redundancy in case of hardware failure, and is cost-effective
A rich tool set and community support
A method of distributed system configuration
Parallel data processing
System-monitoring tools
Reporting tools
ETL-like tools (preferably with a graphic interface) that can be
used to build tasks that process the data and monitor their
progress
Scheduling tools to determine when tasks will run and show task
status
The ability to monitor data trends in real time
Local processing where the data is stored to reduce network
bandwidth usage
3. New technologies in Big Data
Dealing with Big Data, which ranges up to multiple petabytes in size (a single petabyte is a quadrillion bytes of data), requires new technologies and new approaches to efficiently process large quantities of data within tolerable elapsed times. Traditional relational database technologies, like SQL, have proven inadequate in terms of
response times when applied to very large datasets such as those
found in Big Data implementations. To address this shortcoming, these
Big Data implementations are leveraging new technologies that
provide a framework for processing the massive data stores that
define Big Data. The Big Data landscape is dominated by two classes of technology:
1. Operational: Systems that provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored. The focus is on servicing highly concurrent requests while exhibiting low latency for responses operating on highly selective access criteria.
2. Analytical: Systems that provide analytical capabilities for retrospective, complex analysis that may touch most or all of the data. The focus is on high throughput; queries can be very complex and touch most if not all of the data in the system at any time.
Both systems tend to operate over many servers in a cluster, managing tens or hundreds of terabytes of data across billions of records.
Currently trending technologies:
1. Column oriented databases
2. Schema-less / No-SQL databases
3. Map Reduce
4. Hadoop
5. Hive
6. PIG
7. WibiData
8. PLATFORA
9. SkyTree
Column-oriented databases
Traditional, row-oriented databases are excellent for online
transaction processing with high update speeds, but they fall short
on query performance as the data volumes grow and as data becomes
more unstructured. Column-oriented databases store data with a
focus on columns, instead of rows, allowing for huge data
compression and very fast query times. The downside to these
databases is that they will generally only allow batch updates,
having a much slower update time than traditional models.
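To make the row-versus-column distinction concrete, here is a minimal, purely illustrative Python sketch; the field names and values are invented for the example and not drawn from any particular database.

# Row-oriented layout: each record is stored together, convenient for updates.
rows = [
    {"user": "alice", "country": "US", "spend": 120.0},
    {"user": "bob",   "country": "UK", "spend": 80.0},
    {"user": "carol", "country": "US", "spend": 45.5},
]

# Column-oriented layout: each column is stored contiguously, convenient for
# scanning and compressing a single attribute across many records.
columns = {
    "user":    ["alice", "bob", "carol"],
    "country": ["US", "UK", "US"],
    "spend":   [120.0, 80.0, 45.5],
}

# An analytical query such as "total spend" touches only one column, so the
# column store reads far less data than a full scan over whole rows.
total_spend_rows = sum(r["spend"] for r in rows)
total_spend_cols = sum(columns["spend"])
assert total_spend_rows == total_spend_cols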
Schema-less databases, or NoSQL databases
There are several database types that fit into this category,
such as key-value stores and document stores, which focus on the
storage and retrieval of large volumes of unstructured,
semi-structured, or even structured data. They achieve performance
gains by doing away with some (or all) of the restrictions
traditionally associated with conventional databases, such as
read-write consistency, in exchange for scalability and distributed
processing.
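A key-value store, one of the schema-less types mentioned above, can be caricatured in a few lines of Python; real systems (Redis, Riak, DynamoDB and others) add distribution, persistence and replication on top of this basic idea.

# A toy in-memory key-value store: values are opaque objects keyed by a string,
# with no fixed schema imposed on what a value must contain.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # No schema check: structured, semi-structured or raw content all fit.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

store = KeyValueStore()
store.put("user:42", {"name": "alice", "tags": ["big data", "nosql"]})
store.put("page:/home", "<html>...</html>")
print(store.get("user:42"))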
MapReduce
This is a programming paradigm that allows for massive job
execution scalability against thousands of servers or clusters of
servers. Any MapReduce implementation consists of two tasks:
The "Map" task, where an input dataset is converted into a
different set of key/value pairs, or tuples;
The "Reduce" task, where several of the outputs of the "Map"
task are combined to form a reduced set of tuples (hence the
name).
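A minimal, framework-free sketch of the two tasks, using the classic word-count example (the input sentences are purely illustrative):

# Word count expressed as Map and Reduce steps, in plain Python.
from collections import defaultdict

documents = ["big data is big", "data drives decisions"]

# Map: convert each input record into key/value pairs of the form (word, 1).
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle: group all values belonging to the same key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: combine the values for each key into a reduced set of tuples.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'drives': 1, 'decisions': 1}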
Hadoop
Hadoop is by far the most popular implementation of MapReduce,
being an entirely open source platform for handling Big Data. It is
flexible enough to be able to work with multiple data sources,
either aggregating multiple sources of data in order to do large
scale processing, or even reading data from a database in order to
run processor-intensive machine learning jobs. It has several
different applications, but one of the top use cases is for large
volumes of constantly changing data, such as location-based data
from weather or traffic sensors, web-based or social media data, or
machine-to-machine transactional data.
Hive
Hive is a "SQL-like" bridge that allows conventional BI
applications to run queries against a Hadoop cluster. It was
developed originally by Facebook, but has been made open source for
some time now, and it's a higher-level abstraction of the Hadoop
framework that allows anyone to make queries against data stored in
a Hadoop cluster just as if they were manipulating a conventional
data store. It amplifies the reach of Hadoop, making it more
familiar for BI users.
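As an illustration of this "SQL-like bridge" idea, a query might be submitted from Python roughly as follows; the sketch assumes the open-source PyHive client library, a reachable HiveServer2 endpoint, and a hypothetical web_logs table.

# Sketch only: assumes the PyHive package is installed and a HiveServer2
# service is running on the named host; the table and columns are invented.
from pyhive import hive

conn = hive.Connection(host="hadoop-gateway.example.com", port=10000)
cursor = conn.cursor()

# A familiar-looking SQL query that Hive translates into work on the cluster.
cursor.execute(
    "SELECT country, COUNT(*) AS visits "
    "FROM web_logs GROUP BY country ORDER BY visits DESC LIMIT 10"
)
for country, visits in cursor.fetchall():
    print(country, visits)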
PIG
PIG is another bridge that tries to bring Hadoop closer to the
realities of developers and business users, similar to Hive. Unlike
Hive, however, PIG consists of a "Perl-like" language that allows
for query execution over data stored on a Hadoop cluster, instead
of a "SQL-like" language. PIG was developed by Yahoo!, and, just
like Hive, has also been made fully open source.
WibiData
WibiData is a combination of web analytics with Hadoop, being
built on top of HBase, which is itself a database layer on top of
Hadoop. It allows web sites to better explore and work with their
user data, enabling real-time responses to user behavior, such as
serving personalized content, recommendations and decisions.
PLATFORA
Perhaps the greatest limitation of Hadoop is that it is a very
low-level implementation of MapReduce, requiring extensive
developer knowledge to operate. Between preparing, testing and
running jobs, a full cycle can take hours, eliminating the
interactivity that users enjoyed with conventional databases.
PLATFORA is a platform that turns users' queries into Hadoop jobs
automatically, thus creating an abstraction layer that anyone can
exploit to simplify and organize datasets stored in Hadoop.
Storage Technologies
As the data volumes grow, so does the need for efficient and
effective storage techniques. The main evolutions in this space are
related to data compression and storage virtualization.
SkyTree
SkyTree is a high-performance machine learning and data
analytics platform focused specifically on handling Big Data.
Machine learning, in turn, is an essential part of Big Data, since
the massive data volumes make manual exploration, or even
conventional automated exploration methods, unfeasible or too
expensive.
Big Data in the cloud
Big Data and cloud computing go hand-in-hand. Cloud computing
enables companies of all sizes to get more value from their data
than ever before, by enabling blazing-fast analytics at a fraction
of previous costs. This, in turn, drives companies to acquire and store even more data, creating more need for processing power and driving a virtuous circle.
4. Big Data Architecture
Data can be classified into three discrete categories, namely:
1. Structured Data
2. Unstructured Data
3. Semi-structured Data
1. Structured Data: Data that resides in a fixed field within a record or file is called structured data.
This includes data contained in relational databases and
spreadsheets. Structured data first depends on creating a data
model viz. defining what fields of data will be stored and how that
data will be stored: data type and any restrictions on the data
input. Structured data has the advantage of being easily entered,
stored, queried and analyzed. Structured data is organized in a
highly mechanized and manageable way. Structured data is often managed using Structured Query Language (SQL).
2. Unstructured Data: Unstructured data usually refers to information that doesn't reside
in a traditional row column database. Unstructured data files often
include text and multimedia content. Examples include email
messages, word processing documents, videos, photos, audio files,
presentations, web pages and many other kinds of business
documents. Although these sorts of files may have an internal structure, they are still considered "unstructured" because the data they
contain doesn't fit neatly in a database. Experts estimate that 80
to 90 percent of the data in any organization is unstructured. And
the amount of unstructured data in enterprises is growing
significantly. Unstructured data is raw and unorganized. Digging
through such data can be cumbersome and costly. Big Data has
generally to do with this large collection of unstructured data
that is growing in size daily and swiftly.
3. Semi-structured Data: This is a form of structured data that does not conform to the formal
structure of data models associated with relational databases or
other forms of data tables, but nonetheless contains tags or other
markers to separate semantic elements and enforce hierarchies of
records and fields within the data. Therefore, it is also known as
self-describing structure. In semi-structured data, the entities
belonging to the same class may have different attributes even
though they are grouped together, and the attributes' order is not
important. Semi-structured data has become increasingly common since the advent of the Internet, where full-text documents and databases are no longer the only forms of data; many different applications need a medium for exchanging information. In object-oriented databases, one often finds semi-structured data.
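As a small illustration, the JSON records below (with invented fields) are semi-structured: the tags name the semantic elements, and records of the same class need not share the same attributes.

import json

# Two "customer" records of the same class with different attributes; the
# field names describe the data, so the structure is self-describing.
records = [
    '{"name": "Alice", "email": "alice@example.com", "orders": [101, 102]}',
    '{"name": "Bob", "phone": "+44 20 7946 0000"}',
]

for raw in records:
    doc = json.loads(raw)
    print(doc.get("name"), "->", sorted(doc.keys()))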
Big Data architecture is premised on a skill set for developing reliable,
scalable, completely automated data pipelines. That skill set
requires profound knowledge of every layer in the stack, beginning
with cluster design and spanning everything from Hadoop tuning to
setting up the top chain responsible for processing the data. The
following diagram shows the complexity of the stack, as well as how
data pipeline engineering touches every part of it.
The main detail here is that data pipelines take raw data and
convert it into insight (or value). Along the way, the Big Data
engineer has to make decisions about what happens to the data, how
it is stored in the cluster, how access is granted internally, what
tools to use to process the data, and eventually the manner of
providing access to the outside world. The latter could be BI or other analytic tools; the former (for the processing) are likely tools such as Impala or Apache Spark.
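To give a feel for what one step of such a pipeline might look like, here is a hedged sketch using Apache Spark's Python API; the input path, column names and output location are placeholders invented for the example.

# Sketch only: assumes a working PySpark installation and an events.json file
# whose records contain an "action" field (invented for this example).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-to-insight").getOrCreate()

# Ingest raw, semi-structured data from the cluster's storage layer.
events = spark.read.json("hdfs:///data/raw/events.json")

# Convert raw records into an aggregated, analysis-ready result.
summary = events.groupBy("action").count()

# Hand the result to downstream BI or reporting tools.
summary.write.mode("overwrite").parquet("hdfs:///data/insight/action_counts")
spark.stop()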
5. System Design
Companies today already use, and appreciate the value of, business
intelligence. Business data is analyzed for many purposes: a
company may perform system log analytics and social media analytics
for risk assessment, customer retention, brand management, and so
on. Typically, such varied tasks have been handled by separate
systems, even if each system includes common steps of information
extraction, data cleaning, relational-like processing (joins,
group-by, aggregation), statistical and predictive modelling, and
appropriate exploration and visualization tools. With Big Data, the
use of separate systems in this fashion becomes prohibitively
expensive given the large size of the data sets. The expense is due
not only to the cost of the systems themselves, but also the time
to load the data into multiple systems. Big Data has made it necessary to run heterogeneous workloads on a single infrastructure
that is sufficiently flexible to handle all these workloads. The
challenge here is not to build a system that is ideally suited for
all processing tasks. Instead, the need is for the underlying system
architecture to be flexible enough that the components built on top
of it for expressing the various kinds of processing tasks can tune
it to efficiently run these different workloads. Here we focus on the
programmability requirements. If users are to compose and build
complex analytical pipelines over Big Data, it is essential that
they have appropriate high-level primitives to specify their needs
in such flexible systems. The MapReduce framework has been
tremendously valuable, but is only a first step. Even declarative
languages that exploit it, such as Pig Latin, are at a rather low
level when it comes to complex analysis tasks. Similar declarative
specifications are required at higher levels to meet the
programmability and composition needs of these analysis pipelines.
Besides the basic technical need, there is a strong business
imperative as well. Businesses typically will outsource Big Data
processing, or many aspects of it.
Big data classification
6. Algorithms used in Big Data
Big data is data so large that it
does not fit in the main memory of a single machine, and the need
to process big data by efficient algorithms arises in Internet
search, network traffic monitoring, machine learning, scientific
computing, signal processing, and several other areas.
For decades, researchers across different disciplines of computer science have envisioned the need for techniques to handle data-intensive computing. With the boom of the internet and the explosion of data in every socio-economic aspect, what was once futuristic research has now transformed itself into a dire requirement. Big Data comes with immense opportunity, but turning this seriously high-volume, high-velocity, structured or unstructured, heterogeneous, often noisy and high-dimensional data into something one can use is a huge challenge. The main algorithmic approaches include:
1. Streaming: Sampling and Sketching (see the reservoir-sampling sketch after this list)
2. Dimensionality Reduction
3. External Memory and Semi-streaming Algorithms
4. Map-Reduce Framework and Extensions
5. Near Linear Time Algorithm Design
6. Property Testing
7. Metric Embedding
8. Sparse Transformation
9. Crowdsourcing
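As one concrete example from the streaming and sampling family above, reservoir sampling keeps a fixed-size uniform random sample of a stream far too large to hold in memory; the sketch below is the standard textbook version rather than any specific system's implementation.

import random

def reservoir_sample(stream, k):
    # Keep a uniform random sample of k items from a stream of unknown length.
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Replace an existing sample item with probability k / (i + 1).
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Example: sample 5 values from a "stream" of a million records.
print(reservoir_sample(range(1_000_000), 5))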
Among the most commonly used big data techniques are A/B testing and data analytics in general. In all machine learning algorithms, you do much better by understanding the domain of the data well and also understanding the underlying models, their strengths and weaknesses, and so on. It is much the same with A/B testing.
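By way of illustration, a basic A/B comparison of two conversion rates can be checked with a two-proportion z-test; the visit and conversion counts below are invented, and a real analysis would also consider sample-size planning and multiple testing.

# Simple two-proportion z-test for an A/B experiment (illustrative numbers).
from math import sqrt
from statistics import NormalDist

def ab_test(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the normal approximation.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_a, p_b, z, p_value

# Variant A: 200 conversions from 5000 visits; variant B: 260 from 5000 visits.
print(ab_test(200, 5000, 260, 5000))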
7. Features of Big Data
The traditional data warehousing approach of
bringing data into a central repository is costly and time
consuming. Moreover, there is a significant shift in the way
information is managed and consumed in a big data environment.
1. High-capacity, inexpensive storage
2. High-performance, inexpensive processing power
3. High-velocity data stream processing
4. Data integration and quality capabilities
5. Relational database acceleration/scale
6. Unstructured text management and search
8. Drawbacks of Big Data
There is no doubt that big data is a
valuable tool that has already had a critical impact in certain
areas.
The first thing to note is that although big data is very good
at detecting correlations, especially subtle correlations that an
analysis of smaller data sets might miss, it never tells us which
correlations are meaningful. Second, big data can work well as an
adjunct to scientific inquiry but rarely succeeds as a wholesale
replacement. Third, many tools that are based on big data can be
easily gamed. Even Google's celebrated search engine, rightly seen
as a big data success story, is not immune to Google bombing and
spamdexing, wily techniques for artificially elevating website
search placement.
Fourth, even when the results of a big data analysis aren't intentionally gamed, they often turn out to be less robust than they initially seem. Consider Google Flu Trends, once the poster child for big data, whose flu estimates later drifted well away from the actual figures. A fifth concern might be called the
echo-chamber effect, which also stems from the fact that much of
big data comes from the web. Whenever the source of information for
a big data analysis is itself a product of big data, opportunities
for vicious cycles abound.
Sixth, big data is prone to giving scientific-sounding solutions
to hopelessly imprecise questions. In the past few months, for
instance, there have been two separate attempts to rank people in
terms of their historical importance or cultural contributions,
based on data drawn from Wikipedia. Finally, big data is at its best when analyzing things that are extremely common, but often falls short when analyzing things that are less common. For instance, programs that use big data to deal with text, such as search engines and translation programs, perform well on frequent words and phrases but poorly on rare ones.
9. Future advances in Big Data
Big data for all
Currently Big Data is seen predominantly as a business tool.
Increasingly, though, consumers will also have access to powerful
Big Data applications. In a sense, they already do (e.g. Google,
social media search tools, etc). But as the number of public data sources grows and processing power becomes ever faster and cheaper, increasingly easy-to-use tools will emerge that put the power of Big Data analysis into everyone's hands.
Data evolution
It is also certain that the amount of data stored will continue
to grow at an astounding rate. This inevitably means Big Data
applications and their underlying infrastructure will need to keep
pace. More governments will initiate open data projects, further
boosting the variety and value of available data sources. Linked
Data databases will become more popular and could potentially push
traditional relational databases to one side due to their increased
speed and flexibility. This means businesses will be able to
develop and evolve applications at a much faster rate. Data security
will always be a concern, and in future data will be protected at a
much more granular level than it is today. As data increasingly
becomes viewed as a valuable commodity, it will be freely traded,
manipulated, added to and re-sold.
Dawn of the databots
As volumes of stored data continue to grow exponentially and
data becomes more openly accessible, databots will increasingly
crawl organisations' linked data, unearthing new patterns and
relationships in that data over time. These databots will initially
be small applications or programs that follow simple rules, but as
time moves on they will become more sophisticated, self-learning
entities. The artificial intelligence programs they employ will
continue to grow more effective due to the fact that they can
operate over time and learn from ever larger data sets.
The Final Word on Big Data
The vision is about linking people, products and
services with information. In this context, the digital world
offers mechanisms that allow individuals and organisations to share
and collaborate on a planetary scale.
In this fast-moving, connected world, intuition, experience and
training will not be enough to give businesses the insight they
need. They need to apply scientific and data analysis to their
questions and find the answers in the time frame demanded so they
can make the right decisions. And that all requires scalable,
global solutions.
10. Conclusion
Big Data is a huge amount of data. Through better analysis of the large volumes of data that are becoming available, there is the potential for making faster advances in many scientific disciplines and improving the profitability and success of many enterprises. However, many technical challenges described in this paper must be addressed before this potential can be realized fully. The challenges include not just the obvious issues of scale, but also heterogeneity, lack of structure, error-handling, privacy, timeliness, provenance, and visualization, at all stages of the analysis pipeline from data acquisition to result interpretation.
These technical challenges are common across a large variety of
application domains, and therefore not cost-effective to address in
the context of one domain alone. Furthermore, these challenges will
require transformative solutions, and will not be addressed
naturally by the next generation of industrial products. We must
support and encourage fundamental research towards addressing these
technical challenges if we are to achieve the promised benefits of
Big Data.