Big Data: Data Wrangling Boot Camp
BD Tools and Techniques
Chuck Cartledge, PhD
23 February 2018
1/30
Intro. Amdahl BD Processing Languages Q & A Conclusion References Concepts
Big Data: Data Wrangling Boot Camp
BD Tools and Techniques
23 February 2018
2/30
Table of contents (1 of 1)
1 Intro.
2 Amdahl
    A little math
3 BD Processing
    Programming paradigms
    An overview
4 Languages
    Each is different for a reason
5 Q & A
6 Conclusion
7 References
8 Concepts
    Chaos
    Self referential curves
    Koch curves
    Mandelbrot curves
    How applicable to Big Data?
3/30
What are we going to cover?
We’re going to talk about:
Why it is important to understand your problem
What are single and multithreaded programs
What are different tools and frameworks to support BD processing
What languages and programming paradigms fit the BD world
A passing appreciation of BD and Chaos concepts
4/30
A little math
Amdahl’s Law [2]
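Time for serial execution $\overset{\text{def}}{=} T(1)$
Portion that can NOT be parallelized $\overset{\text{def}}{=} B \in [0, 1]$
Number of parallel resources $\overset{\text{def}}{=} n$
$$T(n) = T(1)\left(B + \frac{1}{n}(1 - B)\right)$$
Speed up $\overset{\text{def}}{=} S(n)$
$$S(n) = \frac{T(1)}{T(n)} = \frac{1}{B + \frac{1}{n}(1 - B)}$$
Image: Dr. Gene Amdahl (circa 1960)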
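A minimal sketch of the two formulas on this slide, to make the numbers concrete. The function names and the sample value B = 0.1 are illustrative assumptions, not part of the original slides.

```python
def amdahl_time(t1: float, b: float, n: int) -> float:
    """T(n) = T(1) * (B + (1 - B) / n): run time on n parallel resources."""
    if not 0.0 <= b <= 1.0:
        raise ValueError("B must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return t1 * (b + (1.0 - b) / n)


def amdahl_speedup(b: float, n: int) -> float:
    """S(n) = T(1) / T(n) = 1 / (B + (1 - B) / n)."""
    return 1.0 / (b + (1.0 - b) / n)


if __name__ == "__main__":
    # Hypothetical workload: 10% of the work is strictly serial (B = 0.1).
    for n in (1, 2, 8, 64, 1024):
        print(f"n = {n:5d}  speed up = {amdahl_speedup(0.1, n):.2f}")
```

As n grows, the speed up approaches 1/B, so with B = 0.1 no number of resources gives more than a tenfold gain.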
5/30
A little math
Amdahl’s Law (A summary)
Division and measurement of serial and parallel operations appears time and again. (Shades of Mandelbrot.)
“Make the common fast.”
“Make the fast common.”
Understand what parts have to be done serially
Understand what parts can be done in parallel
Need to factor in “overhead” costs when computing speed up.
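One way to see the effect of overhead is to add a cost term to Amdahl's formula. The sketch below assumes, purely for illustration, that each extra resource adds a fixed coordination cost `c`, expressed as a fraction of T(1). This linear model and the parameter `c` are assumptions; the slides do not specify an overhead model.

```python
def speedup_with_overhead(b: float, n: int, c: float) -> float:
    """Speed up when each extra resource costs c * T(1) in coordination.

    Assumed model (hypothetical): T(n) = T(1) * (B + (1 - B)/n + c * (n - 1)).
    With c = 0 this reduces to Amdahl's Law.
    """
    return 1.0 / (b + (1.0 - b) / n + c * (n - 1))


if __name__ == "__main__":
    # Hypothetical values: B = 0.1, overhead of 1% of T(1) per extra resource.
    for n in (1, 2, 5, 10, 20, 100):
        print(f"n = {n:4d}  speed up = {speedup_with_overhead(0.1, n, 0.01):.2f}")
```

Under this assumed model the speed up rises, peaks, and then falls as n grows, because the coordination cost eventually outweighs the parallel gain.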
6/30
A little math
Some questions are easily stated, . . .
Which of these are parallelizable (and why)?
1 a[i] = b[i] + c[i]
2 a[i] = f(b)
3 a[i] = a[i − 1] + b[i − 1]
4 a = b + c
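A hedged way to probe the question is to evaluate an update in different index orders: if every order gives the same answer, the iterations are independent and can run in parallel. The sketch below does this for items 1 and 3 only. The helper names and sample data are illustrative assumptions.

```python
def elementwise_sum(b, c, order):
    """Item 1: a[i] = b[i] + c[i], evaluated in the given index order."""
    a = [0] * len(b)
    for i in order:
        a[i] = b[i] + c[i]
    return a


def recurrence(a0, b, order):
    """Item 3: a[i] = a[i - 1] + b[i - 1], evaluated in the given index order."""
    a = [a0] + [0] * (len(b) - 1)
    for i in order:
        a[i] = a[i - 1] + b[i - 1]
    return a


if __name__ == "__main__":
    b = [1, 2, 3, 4]
    c = [10, 20, 30, 40]

    # Item 1: order does not matter, so each index could go to its own worker.
    print(elementwise_sum(b, c, [0, 1, 2, 3]))
    print(elementwise_sum(b, c, [3, 1, 0, 2]))

    # Item 3: each step reads the previous result, so order changes the answer.
    print(recurrence(0, b, [1, 2, 3]))
    print(recurrence(0, b, [3, 2, 1]))
```

Item 3 carries a dependence from one iteration to the next, which is why a naive split across workers gives wrong results.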
7/30
Programming paradigms
Single thread vs. multithreads
Single-threaded process – has full access to CPU and RAM
Multithreaded process – shares access to CPU and RAM
Multithreaded makes sense with independent tasks
Multithreaded may share the same memory space (language dependent)
Image from [3].
Coordination across multiple threads can be tricky.
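A small sketch of why coordination matters: several threads update one shared counter, and a lock makes each update atomic. The function name and the thread and iteration counts are illustrative assumptions. In Python all threads of a process share the same memory; other languages may differ.

```python
import threading


def count_with_threads(num_threads: int, increments_per_thread: int) -> int:
    """Increment one shared counter from several threads, guarded by a lock."""
    total = 0
    lock = threading.Lock()

    def worker() -> None:
        nonlocal total
        for _ in range(increments_per_thread):
            # The lock serializes the read-modify-write on the shared value.
            with lock:
                total += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every worker before reading the result
    return total


if __name__ == "__main__":
    print(count_with_threads(4, 10_000))
```

The time spent waiting on the lock is serial work, so it counts toward B in Amdahl's Law.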
8/30
Programming paradigms
Hadoop multithreading hidden from view.
Image from [5].
Hadoop infrastructure hides lots of complexity.
9/30
An overview
Vocabulary
Data Sources – where data comes from
Ingestion – how data is pre-processed for acceptance
Data Sea/Lake – where data lives
Processing – how data is processed prior to storage
Data warehouse – transition from SQL to NoSQL
Analysis – extracting information from data
User interface – how the user interacts with the information
Image from [1].
10/30
An overview
Same image.
Image from [1].
11/30
An overview
Another collection of Open Source BD tools
Tools partitioned differently:
Big Data search
Business Intelligence
Data aggregation
Data Analysis & Platforms
Databases / Data warehousing
Data Mining
Document Store
Graph databases
Grid Solutions
Each is different for a reason
Vocabulary (1 of 2) [12].
array – generalize operations on scalars to apply transparently to vectors, matrices, and higher-dimensional arrays.
client side – languages are limited by the abilities of the browser or intended client.
compiled – languages typically processed by compilers, though theoretically any language can be compiled or interpreted.
concurrent – languages provide language constructs for concurrency.
curly-bracket – languages have a syntax that defines statement blocks using the curly bracket or brace characters.
declarative – languages describe a problem rather than defining a solution.
extension – languages embedded into another program and used to harness its features in extension scripts.
functional – languages define programs and subroutines as mathematical functions.
generic – language is applicable to many domains.
imperative – languages may be multi-paradigm and appear in other classifications.
impure – languages containing imperative features.
interactive mode – languages act as a kind of shell.
16/30
Each is different for a reason
Vocabulary (2 of 2) [12].
interpreted – languages are programming languages in which programs may be executed from source code form, by an interpreter.
iterative – languages are built around or offer generators.
list-based – languages are a type of data-structured language that are based upon the list data structure.
metaprogramming – languages that write or manipulate other programs (or themselves) as their data, or that do part of the work at compile time that is otherwise done at run time.
object-oriented (class-based) – languages support objects defined by their class.
object-oriented prototype-based – languages are object-oriented languages where the distinction between classes and instances has been removed.
procedural – languages are based on the concept of the unit and scope.
reflective – languages let programs examine and possibly modify their high level structure at runtime.
scripting – another term for interpreted.
17/30
Each is different for a reason
Same image.
Image from [13].
Each paradigm is a result of a problem domain.
19/30
Each is different for a reason
What does the future hold?
“If languages are not defined by taxonomies, how are they constructed? They are aggregations of features. Rather than study extant languages as a whole, which conflates the essential with the accidental, it is more instructive to decompose them into constituent features, which in turn can be studied individually. The student then has a toolkit of features that they can re-compose per their needs.”
S. Krishnamurthi [6]
New languages will be created all the time to fit needs.
20/30
Q & A time.
Q: Do you know what the death rate around here is?
A: One per person.
21/30
What have we covered?
Looked at how Amdahl’s Law can improve performance
Looked at single and multithreaded programs
Looked at some of the many Open Source Big Data tools that are available
Looked at how and why some languages are better than others for a particular application
Next: Getting Twitter developer accounts
22/30
References (1 of 4)
[1] Josef Adersberger, Big Data Landscape Q3/2016, email, 2016.
[2] Gene M. Amdahl, Validity of the single processor approach to achieving large scale computing capabilities, Proceedings of the Spring Joint Computer Conference, ACM, 1967, pp. 483–485.
[3] John T. Bell, Threads, https://www.cs.uic.edu/~jbell/CourseNotes/OperatingSystems/4_Threads.html, 2013.
[4] James Gleick, Chaos: Making a New Science, Random House, 1997.
References (4 of 4)
[13] Peter Van Roy et al., Programming Paradigms for Dummies: What Every Programmer Should Know, New Computational Paradigms for Computer Music 104 (2009).
26/30
Chaos
How long is the coast of Britain?
Question raised by Richardson [9]
Popularized by Mandelbrot [7]
Foundational question in Chaos Theory [4]
Varies from ≈ 2,400 to ≈ 3,400 km depending on your yardstick [7]
27/30
Self referential curves
Curves that look like themselves.
Richardson derived: $L(G) = M G^{1-D}$
It was ignored
D is the dimensional characteristic [7]
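A short sketch of Richardson's relation, reading G as the yardstick length and M as a positive constant. The function name and the sample values of M and D are illustrative assumptions and are not fitted to any real coastline.

```python
def richardson_length(g: float, m: float, d: float) -> float:
    """Measured length L(G) = M * G**(1 - D) for yardstick length G."""
    if g <= 0 or m <= 0:
        raise ValueError("G and M must be positive")
    return m * g ** (1.0 - d)


if __name__ == "__main__":
    # Hypothetical constants chosen only to show the trend.
    m, d = 5000.0, 1.25
    for g in (200.0, 100.0, 50.0, 25.0):
        print(f"yardstick {g:6.1f} -> length {richardson_length(g, m, d):8.1f}")
```

When D = 1 the measured length does not depend on the yardstick. When D > 1, halving the yardstick multiplies the measured length by 2^(D − 1), which is why the coastline figure depends on how finely you measure.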
28/30
Koch curves
Simple algorithms yield things of beauty.
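As an illustration of how simple the algorithm is, the sketch below builds a Koch curve by repeatedly replacing every segment with four segments one third as long. Points are stored as complex numbers; the function names are illustrative assumptions.

```python
import cmath
import math


def koch_points(depth: int, start: complex = 0 + 0j, end: complex = 1 + 0j):
    """Return the vertices of a Koch curve after `depth` refinement steps."""
    points = [start, end]
    rot = cmath.exp(1j * math.pi / 3)  # rotation by 60 degrees
    for _ in range(depth):
        refined = [points[0]]
        for p, q in zip(points, points[1:]):
            d = (q - p) / 3  # one third of the current segment
            # Replace p->q with four segments; the middle two form the bump.
            refined += [p + d, p + d + d * rot, p + 2 * d, q]
        points = refined
    return points


def polyline_length(points) -> float:
    """Sum of the straight-line distances between consecutive vertices."""
    return sum(abs(q - p) for p, q in zip(points, points[1:]))


if __name__ == "__main__":
    for depth in range(6):
        pts = koch_points(depth)
        print(depth, len(pts) - 1, round(polyline_length(pts), 4))
```

Each step multiplies the number of segments by 4 and the total length by 4/3, so the length grows without bound as the yardstick shrinks.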
29/30
Mandelbrot curves
In 2 and 3D.
Mandelbrot’s equation: $z_{n+1} = z_n^2 + c$, where c is complex
Mandelbrot curve is self referential
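A minimal escape-time sketch of the iteration z_{n+1} = z_n^2 + c, starting from z_0 = 0. The function names, the iteration cap of 100, and the grid bounds are illustrative assumptions. The escape radius of 2 is the standard test: once |z| exceeds 2 the sequence is known to diverge.

```python
def escape_count(c: complex, max_iter: int = 100) -> int:
    """Number of iterations before |z| exceeds 2, or max_iter if it never does."""
    z = 0j
    for n in range(max_iter):
        if abs(z) > 2.0:
            return n
        z = z * z + c
    return max_iter


def render(width: int = 60, height: int = 24, max_iter: int = 50) -> str:
    """Crude text rendering: '#' marks points that stay bounded."""
    rows = []
    for row in range(height):
        im = 1.2 - 2.4 * row / (height - 1)
        line = ""
        for col in range(width):
            re = -2.0 + 3.0 * col / (width - 1)
            line += "#" if escape_count(complex(re, im), max_iter) == max_iter else "."
        rows.append(line)
    return "\n".join(rows)


if __name__ == "__main__":
    print(render())
```

Every point is computed independently of its neighbors, so this is the same situation as item 1 on the earlier slide about which updates are parallelizable.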
30/30
How applicable to Big Data?
Big data problems are addressed in your computer.
With Koch and Mandelbrot, we were looking deeper and deeper. What happens if we go higher instead of deeper?
Concept          Computer               Big Data
Parallelizable   Cores                  Processing nodes
Data locality    Cache (L1, L2, etc.)   HDFS
Coordination     OS                     Hadoop
Output           RAM                    HDFS
We will be bringing these ideas out into the open.