Big Data: A Classification of Acquisition and Generation Methods
Vijayakumar Nanjappan, Hai-Ning Liang*, Wei Wang, Ka Lok Man
Xi’an Jiaotong-Liverpool University, China
E-Mail: [email protected]
Introduction
Big Data: A Classification
Characteristics of Big Data
Big Data Generation Methods
Data Sources
Data Types
Big Data Acquisition Methods
Interface Methods
Interface Devices
Big Data Management
Data Representation and Organization
File Formats
Data Compression
Databases
NoSQL Types
Data Fusion
Summary
Abstract
Traditionally, data have been stored in securely protected databases for special
purposes, such as satellite imagery data for earth science research or customer transaction
data for business analytics. The usefulness of data lies in the fact that they can be examined
and analyzed to unearth correlations among data items and to discover knowledge that yields
deeper insight into trends. Data analytics has been a key research topic in data mining,
knowledge discovery and machine learning for decades. In recent years, the term “data” has
experienced a major rejuvenation in many aspects of our lives. The rapid development of the
Internet and web technologies allows ordinary users to generate vast amounts of data about
their daily lives. On the Internet of Things, the number of connected devices has grown
exponentially, and each device produces real-time or near real-time streaming data about our
physical world. The resulting data, which is extremely difficult, if not impossible, to
store, process and analyze with conventional computing methodologies and resources, is
referred to as “big data”. In this chapter, we focus on two subsets of big data: digital data
and analog data. These two subsets are further divided into environmental and
personal sources of data. We also highlight the data types and formats as well as
different input mechanisms. These classifications help in understanding the active and
passive ways in which data is collected and produced, with explicit or without (i.e., implicit) human
involvement. This chapter aims to provide enough information to help the reader
understand the role of digital and analog sources and how data is acquired, transmitted and
pre-processed using today’s growing variety of computing devices and sensors.
Keywords: Big data; data generation; data acquisition; data storage; data
management; sensing devices; user interfaces.
Big Data: A Classification
The term “big data” refers to datasets of exceptionally large size
with distinct and intricate structure. They can be extremely difficult to analyze and visualize
with personal computing devices and conventional computational methods [1]. In fact,
enormous datasets with complex structures have been generated and used for a long time. For
example, satellite imagery, raster data, geographical data, and biological and ecological data
used for scientific research can also be considered “big data”. Nowadays, many
different kinds of big data exist in our lives, from social media data, to organization and
enterprise data, to sensor data on the Internet of Things (e.g., meteorological data about our
environment and healthcare data).
Characteristics of Big Data
In 2001, Doug Laney characterized big data from three perspectives: volume, velocity
and variety (the 3Vs) [2]. Volume refers to the magnitude of the data, which usually determines
its potential value. Velocity refers to the speed at which data is generated and
processed according to the requirements of different applications. Variety refers to the nature
and different types of data. Later, the research community proposed two additional Vs:
veracity and value. Veracity indicates the trustworthiness and quality of the data. This is
particularly important because big data is usually collected from a variety of sources, some of
which may not provide reliable, high-quality data. Value indicates the potential
(or hope) that useful information or insight can be extracted or derived from big data,
provided that the data is appropriately processed and analyzed. These characteristics bring
new challenges to the data processing and analytics pipeline. Because the size of the data is
constantly increasing and data is generated faster than it can be
processed, scalable storage and efficient data management methods are needed to enable real-time
or near real-time data processing by analytical tools. To ensure the credibility of
the analytics, the quality of the data must be taken into consideration, for example, by identifying
erroneous processes and uncertain, unreliable, or missing data.
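The data quality concern raised above can be illustrated with a small sketch. It assumes a hypothetical `Reading` record and a simple range check; the field names, the `quality_flags` helper and the plausible range of -40 to 85 are illustrative assumptions, not part of the chapter.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Reading:
    """One sensor measurement (hypothetical record layout)."""
    sensor_id: str
    timestamp: float          # seconds since the start of the recording
    value: Optional[float]    # None models a missing measurement


def quality_flags(reading: Reading, low: float, high: float) -> List[str]:
    """Return the quality problems found in one reading (empty list = none found)."""
    flags = []
    if reading.value is None or math.isnan(reading.value):
        flags.append("missing")
    elif not (low <= reading.value <= high):
        # A value outside the plausible range is treated as unreliable.
        flags.append("out_of_range")
    return flags


readings = [
    Reading("temp-1", 0.0, 21.5),     # plausible value
    Reading("temp-1", 60.0, None),    # missing value
    Reading("temp-1", 120.0, 480.0),  # implausible value
]

# Map each timestamp to the list of problems detected for that reading.
report = {r.timestamp: quality_flags(r, low=-40.0, high=85.0) for r in readings}
print(report)
```

In practice such checks would be only one step in a larger cleaning pipeline; the sketch shows the kind of per-record test that veracity calls for.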
Big Data Generation Methods
In today’s digital era, the term “data” unambiguously denotes digital data, which can be either
born-digital or born-analog and eventually converted into digital form. There are already
large amounts of conventional digital data such as Web documents, social media, and
business transaction data. In recent years, the “Internet of Things” (IoT) has been generating vast
volumes of data about our physical world, captured by sensing devices. Many everyday
objects are embedded with a variety of sensors capable of collecting analog data and
converting it into digital form. Besides conventional data, sensor data is becoming the next big
data source.
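The conversion of born-analog data into digital form can be sketched as sampling followed by quantization. The example below is a minimal illustration under stated assumptions: a uniform quantizer, a 1 Hz sine wave standing in for the analog signal, and a hypothetical helper named `sample_and_quantize`. The chapter does not prescribe any particular converter design.

```python
import math


def sample_and_quantize(signal, duration_s, sample_rate_hz, bits, v_min, v_max):
    """Sample a continuous-time signal and map each sample to an integer code.

    signal: function of time (seconds) returning a voltage-like value.
    bits:   resolution of the assumed converter; it yields 2**bits levels.
    """
    levels = 2 ** bits
    step = (v_max - v_min) / (levels - 1)      # width of one quantization step
    n_samples = int(duration_s * sample_rate_hz)
    codes = []
    for n in range(n_samples):
        t = n / sample_rate_hz                 # sampling instant
        v = min(max(signal(t), v_min), v_max)  # clip to the input range
        codes.append(round((v - v_min) / step))
    return codes


def analog(t):
    """Stand-in for a born-analog quantity: a 1 Hz sine wave in [-1, 1]."""
    return math.sin(2 * math.pi * 1.0 * t)


# Eight samples over one second at 3-bit resolution (codes 0..7).
codes = sample_and_quantize(analog, duration_s=1.0, sample_rate_hz=8,
                            bits=3, v_min=-1.0, v_max=1.0)
print(codes)
```

A higher sample rate or more bits gives a closer digital approximation of the analog source at the cost of a larger data volume.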
Data Sources
Born-Digital Data
Born-digital data is created and managed using computers or other digital
devices. Almost all documents on personal computers are stored in standardized
file formats (e.g., Word or PDF documents). Advances in Internet and World Wide Web
technologies have enabled computers around the world to be connected so that billions of
Web documents can be accessed from anywhere. The emergence of Web 2.0 technologies enriched
data and media types from text only to images, video, and audio, as well as the associated
metadata such as temporal and geographical information. Numerous
images and videos are now uploaded to social media websites,
annotated with location information and tags describing their contents. Other
traditional big data sources include electronic mail, instant messages, medical records
and business transactions.
Sensor Data
Recently, billions of physical objects, such as sensors, smartphones, tablets, wearable
devices, and RFID tags, embedded with identification, sensing, computing, communication, and
actuation capabilities, have been increasingly connected to the Internet, resulting in the next
technological revolution, known as the “Internet of Things” (IoT). The integration of multiple
semiconductor components on a single chip (System on Chip) is key to the success of the
Internet of Things, which has the potential to revolutionize a large array of intelligent
applications and services in many fields.
According to Gartner, the network of connected things will reach nearly 6.4 billion by
2016, with around 5.5 million new devices getting connected every day [3]. It is estimated that by
the end of 2016, worldwide sales of wearable electronic devices will have increased by 18.4
percent [4]. In contrast, worldwide PC shipments declined by 9.6 percent in the first quarter of 2016, which
suggests that the market increasingly prefers smart devices [5]. It is reported that by 2018,
household digital devices that can talk to each other will be common [6]. It is
estimated that nearly 3 trillion gigabytes of data are produced in a single day. The high
volumes of heterogeneous data streams coming from this variety of devices bring great
challenges to traditional data management methods.
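To give a sense of scale, the figures quoted above can be converted into yearly and per-byte terms. The short calculation below takes the quoted estimates at face value and assumes decimal units (1 gigabyte = 10^9 bytes); the variable names are illustrative.

```python
# Figures quoted in the text, taken at face value.
NEW_DEVICES_PER_DAY = 5.5e6   # new connected devices per day [3]
DAYS_IN_2016 = 366            # 2016 was a leap year

# Devices added over the whole year at that daily rate.
new_devices_per_year = NEW_DEVICES_PER_DAY * DAYS_IN_2016

# Assumption: decimal units, i.e. 1 GB = 10**9 bytes and 1 ZB = 10**21 bytes.
GIGABYTE = 10 ** 9
ZETTABYTE = 10 ** 21
daily_gigabytes = 3e12        # "nearly 3 trillion gigabytes" per day

daily_bytes = daily_gigabytes * GIGABYTE
daily_zettabytes = daily_bytes / ZETTABYTE

print(f"New devices per year: {new_devices_per_year:.3e}")
print(f"Data per day: {daily_zettabytes:.1f} ZB")
```

At the quoted rate, roughly two billion devices would be added in a year, and the quoted daily data volume corresponds to about three zettabytes.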
Widespread examples of these portable devices are mobile phones and other smart devices,
such as the Apple Watch, which integrate a variety of sensors such as accelerometers,
gyroscopes, compasses, GPS receivers, and, more recently, sensors that capture biometric information
such as heart rate. Table 1 lists the sensors commonly found on smartphones and tablets.
Sensor                     Function
Microphone                 Converts real-world sound and vibration to digital audio.
Camera                     Senses visible light or other electromagnetic radiation and converts it to a digital image or video.
Gyroscope                  Provides orientation information.
Accelerometer              Measures linear acceleration.
Compass or Magnetometer    Works as a traditional compass; provides orientation relative to the Earth's magnetic field.
Proximity Sensor           Detects how close the phone is to the user.
Ambient Light Sensor       Optimizes the display brightness.
GPS                        Global Positioning System; tracks the device's location and supports map navigation with the help of GPS satellites.
Barometer                  Measures atmospheric pressure.
Table 1: Common sensors integrated in smartphones and tablets.
Sensors built on Micro-Electro-Mechanical Systems (MEMS) are small in size and
have only limited processing and computing capabilities. A wireless sensor network (WSN)
can be built by connecting spatially distributed sensors using wireless interfaces.
Different kinds of sensors can be integrated into a single WSN, such as mechanical,
magnetic, thermal, biological, chemical, and optical sensors. A sensor can be either immobile or
mobile (including wearable). While immobile sensors are installed on an object at a fixed
location [7], mobile sensors are usually installed on a moving object. A wearable sensor is a
special kind of mobile sensor that is worn on the human body; wearable sensors can be used to form a
body sensor network (BSN) or body area network (BAN) [8].
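The classification in this paragraph can be expressed as a small data structure. The sketch below is illustrative only: the class names (`Mobility`, `Modality`, `SensorNode`), the helper `body_sensor_nodes`, and the example node identifiers are assumptions introduced here, not terminology from the cited surveys.

```python
from dataclasses import dataclass
from enum import Enum


class Mobility(Enum):
    IMMOBILE = "immobile"   # installed on an object at a fixed location
    MOBILE = "mobile"       # installed on a moving object
    WEARABLE = "wearable"   # worn on the body, a special kind of mobile sensor


class Modality(Enum):
    MECHANICAL = "mechanical"
    MAGNETIC = "magnetic"
    THERMAL = "thermal"
    BIOLOGICAL = "biological"
    CHEMICAL = "chemical"
    OPTICAL = "optical"


@dataclass(frozen=True)
class SensorNode:
    node_id: str
    modality: Modality
    mobility: Mobility

    @property
    def is_mobile(self) -> bool:
        # Wearable sensors count as mobile, following the classification above.
        return self.mobility in (Mobility.MOBILE, Mobility.WEARABLE)


def body_sensor_nodes(network):
    """Select the wearable nodes, i.e. the candidates for a body sensor network."""
    return [node for node in network if node.mobility is Mobility.WEARABLE]


# A heterogeneous WSN mixing sensor kinds and mobility classes (made-up nodes).
wsn = [
    SensorNode("bridge-strain-01", Modality.MECHANICAL, Mobility.IMMOBILE),
    SensorNode("vehicle-thermo-17", Modality.THERMAL, Mobility.MOBILE),
    SensorNode("wrist-optical-03", Modality.OPTICAL, Mobility.WEARABLE),
]

bsn = body_sensor_nodes(wsn)
print([node.node_id for node in bsn])
```

Keeping mobility and sensing modality as separate fields reflects the text, where the two classifications are independent of each other.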
Fixed sensors can be installed on the earth's surface, such as on terrain [9], submerged under
water [10], or buried underground [11]. In contrast, mobile sensors can move and interact with
the surrounding physical environment. Wearable sensors are worn by users and can measure
physical or environmental parameters of the wearer, such as blood pressure [12,13], heart rate
[14,15], bodily motion [16], brain activity [17], and skin temperature [18]. Table 2
summarizes some of the most commonly used sensors in body sensor networks.
Sensor                     Function
Blood-pressure sensor      Measures human blood pressure.
4. Gartner Says Worldwide Wearable Devices Sales to Grow 18.4 Percent in 2016 [Internet]. [cited 2016 Apr 20]. Available from: http://www.gartner.com/newsroom/id/3198018
5. Gartner Says Worldwide PC Shipments Declined 9.6 Percent in First Quarter of 2016 [Internet]. [cited 2016 Apr 20]. Available from: http://www.gartner.com/newsroom/id/3280626
6. When to Expect Devices and Connected [Internet]. [cited 2016 Apr 20]. Available from: http://www.gartner.com/newsroom/id/3220117
7. Yick J, Mukherjee B, Ghosal D. Wireless sensor network survey. Comput Netw. 2008 Aug 22;52(12):2292–330.
8. Lai X, Liu Q, Wei X, Wang W, Zhou G, Han G. A Survey of Body Sensor Networks. Sensors. 2013 Apr 24;13(5):5406–47.
9. Akyildiz IF, Su W, Sankarasubramaniam Y, Cayirci E. A Survey on Sensor Networks. Comm Mag. 2002 Aug;40(8):102–14.
10. Akyildiz IF, Pompili D, Melodia T. Challenges for Efficient Communication in Underwater Acoustic Sensor Networks. SIGBED Rev. 2004 Jul;1(2):3–8.
11. Li M, Liu Y. Underground Structure Monitoring with Wireless Sensor Networks. In: Proceedings of the 6th International Conference on Information Processing in Sensor Networks [Internet]. New York, NY, USA: ACM; 2007 [cited 2016 Apr 22]. p. 69–78. (IPSN ’07). Available from: http://doi.acm.org/10.1145/1236360.1236370
12. Espina J, Falck T, Muehlsteff J, Aubert X. Wireless Body Sensor Network for Continuous Cuff-less Blood Pressure Monitoring. In: 2006 3rd IEEE/EMBS International Summer School on Medical Devices and Biosensors. 2006. p. 11–5.
13. Teng XF, Zhang YT, Poon CCY, Bonato P. Wearable Medical Systems for p-Health. IEEE Rev Biomed Eng. 2008;1:62–74.
14. Paradiso R, Loriga G, Taccini N. A Wearable Health Care System Based on Knitted Integrated Sensors. IEEE Trans Inf Technol Biomed. 2005 Sep;9(3):337–44.
15. Rienzo MD, Rizzo F, Parati G, Brambilla G, Ferratini M, Castiglioni P. MagIC System: a New Textile-Based Wearable Device for Biological Signal Monitoring. Applicability in Daily Life and Clinical Setting. In: 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference. 2005. p. 7167–9.
16. Mattmann C, Clemens F, Tröster G. Sensor for Measuring Strain in Textile. Sensors. 2008 Jun 3;8(6):3719–32.
17. Devot S, Bianchi AM, Naujoka E, Mendez MO, Braurs A, Cerutti S. Sleep Monitoring Through a Textile Recording System. In: 2007 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. 2007. p. 2560–3.
18. Jung S, Ji T, Varadan VK. Point-of-care temperature and respiration monitoring sensors for smart fabric applications. Smart Mater Struct. 2006;15(6):1872.
19. Verma P. Gracoli: A Graphical Command Line User Interface. In: CHI ’13 Extended Abstracts on Human Factors in Computing Systems [Internet]. New York, NY, USA: ACM; 2013 [cited 2016 Mar 30]. p. 3143–6. (CHI EA ’13). Available from: http://doi.acm.org/10.1145/2468356.2479631
20. Garzotto F, Valoriani M. Touchless gestural interaction with small displays: a case study. In ACM Press; 2013 [cited 2015 Jul 2]. p. 1–10. Available from: http://dl.acm.org/citation.cfm?doid=2499149.2499154
21. White FE. Data Fusion Lexicon. Joint Directors of Laboratories, Technical Panel for C3, Data Fusion Sub-Panel. San Diego: Naval Ocean Systems Center; 1991.
22. Hall DL, Llinas J. An introduction to multisensor data fusion. Proc IEEE. 1997 Jan;85(1):6–23.