“A term used to refer to the study and applications of data sets that are too complex for traditional data-processing application software to adequately deal with” [Wik18a]
“Extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions” The highlighted answer of Google Search as of Sept. 2018
Volume, Velocity, and Variety, Veracity, Value (a.k.a. Three V’s, Five V’s) [Lan01, Goe14, Wik18a]
Users can provide the data on a Web 2.0 site and exercise some control over that data. Web 2.0 is bidirectional, i.e., users are creators of user-generated content as well as consumers
Examples of Web 2.0 include Social networking sites (e.g., Facebook, Twitter) Blogs (e.g., Tumblr) Wikis (e.g., Wikipedia) Photo/video sharing sites (e.g., Flickr, YouTube) …
Social networking sites are attracting significant interest worldwide and producing big data. The data are modeled using a graph, where a node is a person and an edge is a relationship between two people (followers, friends, etc.)
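As a minimal sketch of this graph model (all names and the helper function are hypothetical, not from the slides), a social network can be kept as an adjacency structure:

```python
# Minimal sketch: a social network as a graph (adjacency sets).
# People are nodes; an edge means a follower/friend relationship.
social_graph = {
    "alice": {"bob", "carol"},
    "bob": {"alice"},
    "carol": {"alice", "bob"},
}

def mutual_connections(graph, a, b):
    """People connected to both a and b (set intersection)."""
    return graph.get(a, set()) & graph.get(b, set())

print(mutual_connections(social_graph, "bob", "carol"))  # {'alice'}
```

Queries such as "friends in common" reduce to simple set operations over the adjacency structure; at social-network scale the same graph is partitioned across many machines.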
Big Data from Scientific Experiments
Many scientists are using and producing vast amounts of data through scientific simulations and observations [Gra02]
Scientists try to discover patterns, trends, hidden messages, or even truth from this vast amount of scientific data through intensive analysis
Here comes the notion of data science that “uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from data in various forms”[Wik18g]
(Data science deals with any type of data, not only scientific data.) Turing Award winner Jim Gray envisioned data science as a fourth paradigm of science, following the empirical, theoretical, and computational sciences in human history [Gra02]
However, there are several problems in working with multiple machines:
Coordination among multiple nodes
Dealing with frequent hardware failures when we work with a large number of inexpensive processors and storage devices → replication
Nevertheless, programmers do not want to think about these complexities
“A programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster” [Wik18c]
Typically for batch-oriented large-scale parallelization
Inspired by functional programming’s map() and reduce() functions
Proposed by Jeffrey Dean and Sanjay Ghemawat [Dea04] at Google in 2004 Cited more than 25,000 times as of Sept. 2018
MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Map usually performs filtering and sorting; Reduce usually performs a summary operation. The output of map is provided as the input of reduce. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer
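A single-process sketch of the two phases may help; the function names here are illustrative, not Hadoop's API. It runs the classic word-count example: map emits (word, 1) pairs, a shuffle step groups pairs by key, and reduce sums each group:

```python
from collections import defaultdict

def map_phase(document):
    # Map: transform each record into (key, value) pairs.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group values by key, as the framework does between the two phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: a summary operation (here, a sum) over one key's values.
    return (key, sum(values))

docs = ["big data big insights", "data beats intuition"]
pairs = [p for d in docs for p in map_phase(d)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'insights': 1, 'beats': 1, 'intuition': 1}
```

In a real cluster the map calls run in parallel on different data splits and the shuffle moves pairs between machines; the programming model stays exactly this simple.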
A software framework for distributed storage and processing of big data using the MapReduce programming model The most popular open-source implementation of MapReduce Being developed as a top-level Apache project
Significance: Why Hadoop? Because Hadoop takes care of these complexities, programmers do not have to worry about the mechanics of parallel distributed processing
Hadoop is optimized for a single batch job, whereas applications such as machine learning typically need iterative computation that repeats until convergence
Limitations of Hadoop for iterative processing: Repeatedly writing the intermediate output (the result of the i-th iteration) to disk and reading it from disk again at the next iteration causes excessive disk I/Os, degrading performance
Spark for Iterative Processing
Spark's solution: a main architectural component of Spark is the Resilient Distributed Dataset (RDD), a main-memory structure representing a working set (i.e., intermediate results). It is distributed over a cluster of nodes and is fault-tolerant.
By keeping RDDs in memory across iterations, we can eliminate expensive disk I/Os.
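The benefit can be illustrated with a toy iterative job; this is only a sketch of the idea (keep the working set in memory until convergence), not Spark's API. The loop repeats x ← x/2 + 1, whose fixed point is x = 2:

```python
# Sketch of why keeping the working set in memory helps iterative jobs.
# Hadoop-style execution would write the intermediate result to disk
# after every iteration; a Spark-style RDD keeps it in memory and only
# materializes the final answer.
def iterate_in_memory(x, tol=1e-9, max_iters=1000):
    for _ in range(max_iters):
        new_x = x / 2 + 1      # intermediate result stays in memory
        if abs(new_x - x) < tol:
            return new_x       # converged to the fixed point x = 2
        x = new_x
    return x

print(iterate_in_memory(10.0))
```

Each iteration here costs a few arithmetic operations; in the disk-based pattern, every iteration would additionally pay a full write and read of the working set, which dominates the runtime of real iterative workloads such as gradient descent or PageRank.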
Hadoop is designed for a job that requires (almost) all input data to be ready before processing starts, whereas real-time applications need to return results as soon as data arrive on the input stream
Limitation of Hadoop for stream processing: The entire input should reside on HDFS (disks) before processing. The reducers will start only after all the mappers are completed, i.e., after all data splits (each of 128 MB) are read in and processed by Map
Storm for Stream Processing
Storm's solution for stream processing: Storm is a streaming engine, or data stream management system (DSMS). In Storm, data are processed in real time as they arrive
Topology: defines a (continuous) query, in the form of a directed acyclic graph (DAG), consisting of Spouts, Bolts, and Streams (edges)
Spout: defines a stream source
Bolt: defines the processing logic executed for each record (default) or each microbatch
(A microbatch is a set of data records collected over a very short period of time)
Evolution of Data Management Systems [Wha18]
When MapReduce (or NoSQL) initially came about in 2004, we lost much of the high-level functionality of the relational DBMS—such as SQL, indexing, schemas, and transactions—in return for scalability
Since then, there have been many efforts to restore them. Two distinct trends: SQL-on-Hadoop and NewSQL initiatives
“…provide the same scalable performance of NoSQL systems for on-line transaction processing (OLTP) read-write workloads while still maintaining the ACID guarantees of a traditional database system” [Wik18d]
Providing high-level functionality (e.g., SQL, transactions, schemas, and secondary indexes) of conventional DBMSs
Base architecture: “shared-nothing parallel DBMS”
Examples [Asl11][Kat13]:
Parallel DBMSs can be as good as or even better than MapReduce in performance. Stonebraker et al. [Sto10] have shown that parallel DBMSs are (linearly) scalable and capable of processing petabyte-scale databases and large-scale query loads
Floratou et al. [Flo11] have shown that parallel DBMSs outperform MapReduce, providing high performance and scalability by partitioning and storing tables across multiple nodes configured in a shared-nothing manner
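A toy sketch of shared-nothing partitioning (node count and data are illustrative) shows the idea behind this scalability: hash each row's key to a node, then let each node aggregate its own partition with no shared storage:

```python
# Shared-nothing sketch: rows are routed to nodes by hashing the key,
# each "node" computes a local partial result, and a coordinator
# combines the partials. NUM_NODES is an arbitrary illustrative value.
NUM_NODES = 4

def node_for(key):
    return hash(key) % NUM_NODES

rows = [("user%d" % i, i * 10) for i in range(100)]
partitions = {n: [] for n in range(NUM_NODES)}
for key, value in rows:
    partitions[node_for(key)].append((key, value))

# Each node sums its own partition in parallel (simulated sequentially).
partial_sums = [sum(v for _, v in partitions[n]) for n in range(NUM_NODES)]
print(sum(partial_sums))  # equals the sum over all rows
```

Adding nodes shrinks each partition, so scans and aggregates speed up roughly linearly — the same property both the parallel-DBMS and MapReduce camps rely on.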
Drawbacks of parallel DBMSs
Expensive
Too heavyweight, carrying functionality that is not needed in practical large-scale applications, including the capability of processing global transactions over general workloads
Not suitable where faults occur frequently
Hard to set up and use
An Object-Relational DBMS developed at KAIST for over 26 years (1990 – 2016) [Wha02, 03, 05, 07, 10, 12, 13, 15]
An earlier version of this technology played a vital role in starting up NaverCom Co. (currently, Naver Co.) in 1996-2000, which has been the number one portal in Korea
Best Demonstration Award at the IEEE 21st Int’l Conf. on Data Engineering (ICDE), Tokyo, Japan, Apr. 5-8, 2005 [Wha05]
Tight integration of IR features (U.S. patented [Wha02]) as well as spatial database features with the DBMS Being a DBMS and, at the same time, a search engine [Wha02, Wha03, Wha05, Wha15] Being a DBMS and, at the same time, a GIS engine [Wha07, Wha10]
Concurrency control and recovery Coarse granularity locking version: the shadow-page deferred-update recovery method (US patented
[Wha12]) Fine granularity locking version: the ARIES recovery method [Moh92]
Having many commercial applications
Consisting of approximately 600,000 lines of C/C++ code
Open source released (600,000+ lines of C, C++). (Only the coarse granularity locking version has been released as of Aug. 2016)
Structure of the IR Index for DB-IR Tight Integration (U.S. Patented) [Wha02]
One Linux machine (one Quad-Core 3.0GHz CPU, 6GB RAM)
Slaves ‡ (10 slaves) Four Linux machines (two Dual-Core 3.0GHz CPUs, 4GB RAM) One Linux machine (one Quad-Core 2.5GHz CPU, 4GB RAM) Five Linux machines (one Quad-Core 2.4GHz CPU, 8GB RAM) Four disk arrays (AS-2400~AS-2500, 0.9TB~3.9TB, RAID5, 200MB/s bandwidth, 512MB~1GB cache, average
59.5 MB/s disk transfer rate, 13 disks (arms) + 1 parity disk + 1 hot spare ) One disk array (TN-6416S, 13TB, RAID5, 4Gbit/s bandwidth, 512MB cache, average 83.3MB/s disk transfer
rate, 13 disks (arms) + 1 parity disk) Five internal disk arrays (B110i, 5TB, 768MB/s bandwidth, 81.2MB/s disk transfer rate, 10 disks (arms) + 1 parity disk)
Network †‡
Eleven gigabit LAN cards(Intel 82574L dual-port(1), Intel 82541GI single-port(5), HP NC326i dual-port(5)) A gigabit hub (HP 1410-24G, 1000Mbps, 24port)
Data: 114 million Web documents × 2 (duplicated) = 228 million documents
Size of loaded data: Web pages (1.55TB) and IR index (1.84TB) for 228 million Web documents. Each slave indexes 22.8 million Web documents (Note: a slave is capable of indexing 100 million documents)
† The master (ODYS Parallel-IR) consists of 58,000 lines of C and C++ code
‡ The slave (Odysseus DBMS) consists of 600,000 lines of C and C++ code
†‡ We use socket-based RPC consisting of 17,000 lines of C, C++, and Python code developed by the authors
Performance Projection for 300-Node Real-World-Scale ODYS
300-Node Real-World-Scale ODYS: One ODYS set consisting of 4 masters and 300 slaves; 300 slaves capable of indexing 30 billion Web pages; Performance projection through performance modelling
Estimated Average Total Query Response Time (300-node ODYS) [Wha13] (measured with a 10-node ODYS and extended to a 300-node one through performance modelling)
− Web pages indexed: 6.84 (30) billion, at 22.8 (100) million Web pages/slave; () indicates max capacity
− Nodes required for 194 ms/query: 143 sets of 304 nodes = 43,472 nodes
− Nodes required for 148 ms/query: 286 sets of 304 nodes = 86,944 nodes
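Assuming the figures quoted above, the projected numbers follow from simple arithmetic, which can be checked directly:

```python
# Sanity-check the ODYS projection arithmetic quoted above.
pages_per_slave = 22.8e6     # Web pages indexed per slave (measured config)
slaves_per_set = 300         # slaves in one ODYS set
nodes_per_set = 304          # 4 masters + 300 slaves

pages_per_set = pages_per_slave * slaves_per_set
print(pages_per_set / 1e9)   # 6.84 billion pages per set

print(143 * nodes_per_set)   # 43472 nodes (194 ms/query)
print(286 * nodes_per_set)   # 86944 nodes (148 ms/query)
```

Note that halving the query latency target (194 ms → 148 ms is the quoted pair) corresponds here to doubling the number of sets from 143 to 286.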
† Google Search Statistics [Goo18] indicates Google Search handles 3 billion queries/day as of May 30, 2018. Nielsenwire [Nie10] reports that Google handled 214 million queries/day in the U.S. in Feb. 2010
2018. 10 KAIST/DGIST
Summary of ODYS
We have shown that a massively-parallel search engine can be implemented using a DB-IR tightly-integrated parallel DBMS
Capable of handling real-world-scale data and query loads
Providing high-level functionality
We have shown the detailed implementation of the ODYS search engine
Being capable of indexing 100 million Web pages/node with a shared-nothing architecture → high scalability
Having tightly integrated DB-IR capability → high performance
Having SQL, schemas, and indexes → high-level functionality
However, it was also reported that IBM's Watson gave unsafe recommendations for treating cancer “IBM’s Watson Hasn’t Beaten Cancer, But A.I. Still Has Promise”
“…But in the documents obtained by STAT (medical web site), doctors who had tried to use Watson to help them design treatment complained that the system wasn’t ready to practice medicine.”
<Source: Bloomberg, August 25, 2018, https://www.bloomberg.com/view/articles/2018-08-24/ibm-s-watson-failed-against-cancer-but-a-i-still-has-promise >
Possible problems: quality of the information sources
A platform or engine that suggests items to users by predicting how they would rate the items. 35% of consumer purchases on Amazon and 75% of video watches on Netflix come from recommendations [Mac13]
Example: Amazon.com's recommender system
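As a hedged illustration of the idea (toy data, a deliberately crude similarity measure, not Amazon's actual algorithm), a user-based collaborative filter predicts a rating from the ratings of similar users:

```python
# Toy user-based collaborative filtering. All users/items are hypothetical.
ratings = {
    "u1": {"book": 5, "lamp": 1, "desk": 4},
    "u2": {"book": 4, "lamp": 1},
    "u3": {"book": 1, "lamp": 5, "desk": 2},
}

def similarity(a, b):
    # Crude similarity: fraction of co-rated items on which the two
    # users agree to within one rating point.
    common = set(ratings[a]) & set(ratings[b])
    if not common:
        return 0.0
    agree = sum(1 for i in common if abs(ratings[a][i] - ratings[b][i]) <= 1)
    return agree / len(common)

def predict(user, item):
    # Similarity-weighted average of other users' ratings for the item.
    num = den = 0.0
    for other in ratings:
        if other != user and item in ratings[other]:
            w = similarity(user, other)
            num += w * ratings[other][item]
            den += w
    return num / den if den else None

print(predict("u2", "desk"))  # u2 resembles u1, so the prediction tracks u1
```

Production systems replace the crude similarity with cosine or matrix-factorization models and precompute neighborhoods over billions of interactions, but the prediction structure is the same.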
Intelligent Personal Assistant
“A mobile software agent that can perform tasks, or services, on behalf of an individual based on a combination of user input, location awareness, and the ability to access information from a variety of online sources…” [Wik18e]
Example: Google Assistant
“Google Assistant” exploits Big Data, such as search keywords, locations visited, e-mails, and calendar entries, to provide suitable answers and recommendations Example: It prompts you about half an
hour before you leave to let you know the approximate drive time based on current traffic conditions (by looking at your calendar entries!)
“Echo” is a device connected to Amazon’s intelligent personal assistant, Alexa
Amazon is selling Echo at a very low price ($30) to customers. Over 5M Echo devices have been sold in the last 2 years. All those people asking Alexa to order kitchen supplies, turn on the lights, or play music give Amazon a valuable stockpile of data (adding to Big Data)
By using this Big Data, Amazon builds a “360-degree view” of their customers’ buying habits.
Big data are the most valuable assets. Facebook knows the people you know and the places you go; Google knows the things you use and search for on the Internet; Amazon knows the items you buy online (or even offline, through Amazon Go, etc.)
AlphaGo
A computer program that plays Go, developed by Google DeepMind. In March 2016, AlphaGo beat Lee Sedol by a score of 4 to 1. In May 2017, AlphaGo beat Ke Jie, the world's No. 1 ranked player
“Configuration and Strength” [Wik18f], “Power Consumption” [Dee18]
Learning [Wik18f]: AlphaGo was initially trained to mimic human play by attempting to match the moves of expert players from recorded historical games. It was trained using a KGS Go Server database (Big Data) of around 30 million moves from 160,000 games played by 6- to 9-dan human players. Supervised learning + reinforcement learning. This version would not have been possible without the Big Data of human games
AlphaGo Zero: a version created without using data from human games, yet stronger than all previous versions [Sil17]
By playing games against itself (self-play), AlphaGo Zero surpassed the strength of AlphaGo Lee in three days by winning 100 games to 0, reached the level of AlphaGo Master in 21 days, and exceeded all the old versions in 40 days[Dee18]
Training was done solely based on reinforcement learning, without recorded moves from human games
Where is the role of Big Data? Answer: it generates its own Big Data through an enormous number of self-plays
Many distributed data processing platforms for Big Data have been actively developed in industry and academia
The ODYS search engine, developed at KAIST, has shown that a massively-parallel search engine with higher functionality can be implemented using a DB-IR tightly-integrated parallel DBMS.
Emerging applications are realizing big data intelligence
The boom of artificial intelligence is fueled by recent Big Data technologies Big Data is essential for training the deep neural network
[Abo15] Daniel Abadi et al., “Tutorial: SQL-on-Hadoop Systems,” In Proc. 41st Int’l Conf. on Very Large Data Bases, pp. 2050-2051, Kohala Coast, Hawaii, Aug. 2015.
[Abi05] Serge Abiteboul, et al., “The Lowell Database Research Self-Assessment,” Comm. of ACM, Vol. 48, No. 5, pp. 111-118, May 2005.
[Asl11] Matt Aslett, "How Will the Database Incumbents Respond to NoSQL and NewSQL?," Technical Report, the 451 Group, Apr. 2011. (available at https://451research.com/report-short?entityId=66963)
[CRW05] Surajit Chaudhuri, Raghu Ramakrishnan, and Gerhard Weikum, “Integrating DB and IR Technologies: What is the Sound of One Hand Clapping?,” In Proc. 2nd Biennial Conf. on Innovative Data Systems Research, Asilomar, California, pp. 1-12, Jan. 2005.
[Dea04] Dean, J. and Ghemawat, S., “MapReduce: Simplified Data Processing on Large Clusters,” In Proc. 6th Symposium on Operating System Design and Implementation (OSDI), pp. 137-150, Dec. 2004.
[Fer10] Ferrucci, D. et al., "Building Watson: An Overview of the DeepQA Project,” AI Magazine, Vol.31, No. 3, pp. 59-79, July 2010.
[Flo11] Floratou, A., Patel, J. M., Shekita, E. J., and Tata, S., “Column-oriented Storage Techniques for MapReduce,” In Proc. of the VLDB Endowment, Vol. 4, No. 7, pp. 419-429, 2011.
[Goe14] Goes, P., "Design Science Research in Top Information Systems Journals,” MIS Quarterly: Management Information Systems, Vol. 38, No. 1, 2014.
[Goo18] Google Search Statistics - Internet Live Stats, www.internetlivestats.com, retrieved 2018-05-30.
[Gra02] Gray, J. and Szalay, A., “The World Wide Telescope: An Archetype for Online Science,” Comm. ACM, Vol. 45, No. 11, pp. 50-54, Nov. 2002.
[Kat13] Katarina Grolinger, Wilson A Higashino, Abhinav Tiwari and Miriam AM Capretz, “Data Management in Cloud Environments: NoSQL and NewSQL Data Stores,” Journal of Cloud Computing: Advances, Systems and Applications, Vol. 2, No. 22, 2013.
[Lan01] Laney, D.,"3D Data Management: Controlling Data Volume, Velocity and Variety,” META Group Research Note, Vol. 6, No. 70, 2001.
[Len04] Lentz, A., “MySQL Storage Engine Architecture,” In MySQL Developer Articles, MySQL AB, May 2004.
[Mac13] MacKenzie, I., Meyer, C., and Noble, S., "How Retailers Can Keep up with Consumers," McKinsey&CompanyReport, Oct. 2013 (https://www.mckinsey.com/industries/retail/our-insights/how-retailers-can-keep-up-with-consumers).
[Moh92] Mohan, C. et al., “ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging,” ACM Trans. Database Systems, Vol. 17, No. 1, pp. 94-162, 1992.
[Nie10] Nielsenwire, “Nielsen Reports February 2010 U.S. Search Rankings,” Technical Report, Mar. 15, 2010 (available at http://blog.nielsen.com/nielsenwire/online_mobile/nielsen-reports-february-2010-u-s-search-rankings/).
[Sil16] Silver D. et al, "Mastering the Game of Go with Deep Neural Networks and Tree Search," Nature, Vol. 529, pp. 484-489, Jan. 2016.
[Sil17] Silver D. et al., "Mastering the Game of Go without Human Knowledge," Nature, Vol. 550, pp. 354-359, Oct. 2017.
[Sto10] Stonebraker, M. et al., “MapReduce and Parallel DBMSs: Friends or Foes?,” Communications of the ACM (CACM), pp. 64-71, Jan. 2010.
[Weik07] Gerhard Weikum, “DB&IR: Both Sides Now,” In Proc. 2007 ACM SIGMOD Int’l Conf. on Management of Data, pp. 25-30, Beijing, China, June 12-14, 2007.
[Wha02] Whang, K. et al., An Inverted Index Storage Structure Using Subindexes and Large Objects for Tight Coupling of
Information Retrieval with Database Management Systems, U.S. Patent No. 6,349,308, Feb. 19, 2002, Application No. 09/250,487, Feb. 15, 1999.
[Wha03] Whang, K., “Tight Coupling: A Way of Building High-Performance Application Specific Engines,” a presentation at the panel Next-Generation Web Technology and Database Issues, the 8th International Conference on Database Systems for Advanced Applications (DASFAA 2003), Kyoto, Japan, URL:http://db-www.aist-nara.ac.jp/dasfaa2003/ppt.html, Mar. 2003.
[Wha05] Whang, K., Lee, M., Lee, J., Kim, M. and Han, W., “Odysseus: a High-Performance ORDBMS Tightly-Coupled with IR Features,” In Proc. IEEE Int'l Conf. on Data Engineering, Tokyo, Japan, pp. 1104-1105, Apr. 2005. This paper received the Best Demonstration Award.
[Wha07] Whang, K., Lee, J., Kim, M., Lee, M., Lee, K., “Odysseus: a High-Performance ORDBMS Tightly-Coupled with Spatial Database Features,” In Proc. IEEE Int'l Conf. on Data Engineering, Istanbul, Turkey, p.1493-1494, Apr. 2007.
[Wha10] Whang, K., Lee, J., Kim, M., Lee, M., Lee, K., Han, W., Kim, J., “Tightly-Coupled Spatial Database Features in the Odysseus/OpenGIS DBMS for High-Performance,” GeoInformatica, Vol. 14, No. 4, pp. 425-446, 2010.
[Wha12] Whang, K. et al., “A Method for Recovering Data in a Storage System,” U.S. Patent No. 8,108,356, Jan. 31, 2012, Application No. 12/208,014, Sept. 10, 2008.
[Wha13] Kyu-Young Whang, Tae-Seob Yun, Yeon-Mi Yeo, Il-Yeol Song, Hyuk-Yoon Kwon, and In-Joong Kim, “ODYS: an Approach to Building a Massively-Parallel Search Engine Using a DB-IR Tightly-Integrated Parallel DBMS for Higher-Level Functionality,” In Proc. 2013 ACM Int’l Conf. on Management of Data (SIGMOD), pp. 313-324, June 2013.
[Wha15] Whang, K., Lee, J., Lee, M., Han, W., Kim, M., Kim, J., “DB-IR Integration Using Tight-Coupling in the Odysseus DBMS,” World Wide Web, Vol. 18, No. 3, pp. 491-520, 2015.
[Wha18] Whang, K., Yun, T., Park, J., Cho, K., Kim, S., Yi, I., Na, I., and Lee, B., Building Social Networking Service Systems Using the Relational Shared-Nothing Parallel DBMS, Tech. Report CS-TR-2018-419, School of Computing, KAIST, August 2018.