Map(), flatMap() and reduce() are your new best friends: simpler collections, concurrency, and big data

Chris Richardson
Author of POJOs in Action
Founder of the original CloudFoundry.com
@crichardson
http://plainoldobjects.com

Map, Flatmap and Reduce are Your New Best Friends: Simpler Collections, Concurrency, and Big Data (#oscon)

Aug 27, 2014



Higher-order functions such as map(), flatMap(), filter() and reduce() have their origins in mathematics and in early functional programming languages such as Lisp. Today they have entered the mainstream and are available in languages such as JavaScript, Scala and Java 8. They are well on their way to becoming an essential part of every developer’s toolbox.

In this talk you will learn how these and other higher-order functions enable you to write simple, expressive and concise code that solves problems in a diverse set of domains. We will describe how you use them to process collections in Java and Scala. You will learn how functional Futures and Rx (Reactive Extensions) Observables simplify concurrent code. We will even talk about how to write big data applications in a functional style using libraries such as Scalding.
Transcript
Page 1: Map, Flatmap and Reduce are Your New Best Friends: Simpler Collections, Concurrency, and Big Data (#oscon)

Map(), flatMap() and reduce() are your new best friends:

simpler collections, concurrency, and big data

Chris Richardson

Author of POJOs in Action

Founder of the original CloudFoundry.com

@crichardson

http://plainoldobjects.com

Page 2:

Presentation goal

How functional programming simplifies your code

Show that map(), flatMap() and reduce() are remarkably versatile functions

Page 3:

About Chris

Page 4:

About Chris

Founder of a buzzword compliant (stealthy, social, mobile, big data, machine learning, ...) startup

Consultant helping organizations improve how they architect and deploy applications using cloud, micro services, polyglot applications, NoSQL, ...

Page 5:

Agenda

Why functional programming?

Simplifying collection processing

Simplifying concurrency with Futures and Rx Observables

Tackling big data problems with functional programming

Page 6:

Functional programming is a programming paradigm

Functions are the building blocks of the application

Best done in a functional programming language

Page 7:

Functions as first class citizens

Assign functions to variables

Store functions in fields

Use and write higher-order functions:

Pass functions as arguments

Return functions as values
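A minimal Java 8 sketch of the points on this slide: a function stored in a variable, passed as an argument, and returned as a value. The class and method names (HigherOrderFunctions, applyTwice, multiplyBy) are made up for illustration and do not come from the talk.

```java
import java.util.function.Function;

class HigherOrderFunctions {

    // Higher-order function: takes a function as an argument and applies it twice.
    static <T> T applyTwice(Function<T, T> f, T value) {
        return f.apply(f.apply(value));
    }

    // Higher-order function: returns a function as its result.
    static Function<Integer, Integer> multiplyBy(int factor) {
        return x -> x * factor;
    }

    public static void main(String[] args) {
        // A function assigned to a variable.
        Function<Integer, Integer> triple = multiplyBy(3);
        // Prints 18, because 3 * (3 * 2) = 18.
        System.out.println(applyTwice(triple, 2));
    }
}
```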

Page 8:

Avoids mutable state

Use:

Immutable data structures

Single assignment variables

Some functional languages such as Haskell don’t allow side-effects

Page 9:

Why functional programming?

"the highest goal of programming-language design to enable good ideas to be elegantly expressed"

http://en.wikipedia.org/wiki/Tony_Hoare

Page 10:

Why functional programming?

More expressive

More intuitive - declarative code matches problem definition

Functional code is usually much more composable

Immutable state:

Less error-prone

Easy parallelization and concurrency

But be pragmatic

Page 11:

An ancient idea that has recently become popular

Page 12:

Mathematical foundation:

λ-calculus

Introduced by Alonzo Church in the 1930s

Page 13:

Lisp = an early functional language invented in 1958

http://en.wikipedia.org/wiki/Lisp_(programming_language)

[Timeline: 1940 to 2010]

Ideas pioneered by Lisp: garbage collection, dynamic typing, self-hosting compiler, tree data structures

(defun factorial (n) (if (<= n 1) 1 (* n (factorial (- n 1)))))

Page 14:

My final year project in 1985: Implementing SASL

sieve (p:xs) = p : sieve [x | x <- xs, rem x p > 0];

primes = sieve [2..]

A list of integers starting with 2

Filter out multiples of p

Page 15:

Mostly an Ivory Tower technology

Lisp was used for AI

FP languages: Miranda, ML, Haskell, ...

“Side-effects kill kittens and puppies”

Page 16:

http://steve-yegge.blogspot.com/2010/12/haskell-researchers-announce-discovery.html


Page 17:

But today FP is mainstream

Clojure - a dialect of Lisp

Scala - a hybrid OO/functional language

F# - a hybrid OO/FP language for .NET

Java 8 has lambda expressions

Page 18:

Java 8 lambda expressions are functions

x -> x * x

x -> {
  // <= rather than <, so that perfect squares such as 4 and 9 are rejected
  for (int i = 2; i <= Math.sqrt(x); i = i + 1) {
    if (x % i == 0) return false;
  }
  return true;
};

(x, y) -> x * x + y * y

An instance of an anonymous inner class that implements a functional interface (kinda)
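As a rough illustration of "implements a functional interface", the sketch below assigns lambdas like the three above to standard java.util.function types. The class and constant names are invented for this example. The primality lambda is written with a `<=` loop bound and a guard for values below 2, which are assumptions made here so that the checks behave as expected.

```java
import java.util.function.BinaryOperator;
import java.util.function.IntPredicate;
import java.util.function.IntUnaryOperator;

class LambdaTargets {

    // x -> x * x, typed as a function from int to int.
    static final IntUnaryOperator SQUARE = x -> x * x;

    // A block-bodied lambda, typed as a predicate on int.
    static final IntPredicate IS_PRIME = x -> {
        if (x < 2) return false;
        for (int i = 2; i <= Math.sqrt(x); i = i + 1) {
            if (x % i == 0) return false;
        }
        return true;
    };

    // A two-argument lambda, typed as a binary operator on Integer.
    static final BinaryOperator<Integer> SUM_OF_SQUARES = (x, y) -> x * x + y * y;

    public static void main(String[] args) {
        System.out.println(SQUARE.applyAsInt(4));       // 16
        System.out.println(IS_PRIME.test(7));           // true
        System.out.println(SUM_OF_SQUARES.apply(3, 4)); // 25
    }
}
```

The same lambda text can target different interfaces; the compiler picks the method to implement from the declared type of the variable or parameter.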

Page 19:

Agenda

Why functional programming?

Simplifying collection processing

Simplifying concurrency with Futures and Rx Observables

Tackling big data problems with functional programming

Page 20:

Lots of application code = collection processing:

Mapping, filtering, and reducing

Page 21:

Social network example

public class Person {

  enum Gender { MALE, FEMALE }

  private Name name;
  private LocalDate birthday;
  private Gender gender;
  private Hometown hometown;

  private Set<Friend> friends = new HashSet<Friend>();
  ....

public class Friend {

  private Person friend;
  private LocalDate becameFriends;
  ...
}

public class SocialNetwork {
  private Set<Person> people;
  ...

Page 22:

Typical iterative code - e.g. filtering

public class SocialNetwork {

  private Set<Person> people;

  ...

  public Set<Person> lonelyPeople() {
    Set<Person> result = new HashSet<Person>();   // Declare result variable
    for (Person p : people) {                     // Iterate
      if (p.getFriends().isEmpty())
        result.add(p);                            // Modify result
    }
    return result;                                // Return result
  }

Page 23:

Problems with this style of programming

Low level

Imperative (how to do it) NOT declarative (what to do)

Verbose

Mutable variables are potentially error prone

Difficult to parallelize

Page 24:

Java 8 streams to the rescue

A sequence of elements

“Wrapper” around a collection (and other types: e.g. JarFile.stream(), Files.lines())

Streams can also be infinite

Provides a functional/lambda-based API for transforming, filtering and aggregating elements

Much simpler, cleaner and declarative code
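To make "streams can also be infinite" concrete, here is a small sketch using only the standard java.util.stream API. The class and method names are hypothetical. The pipeline is lazy: nothing is computed until collect() runs, and limit() stops the otherwise endless sequence.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class InfiniteStreamSketch {

    // Returns the first `count` even perfect squares, taken from an infinite stream.
    static List<Integer> firstEvenSquares(int count) {
        return Stream.iterate(1, n -> n + 1)   // 1, 2, 3, ... without end
                .map(n -> n * n)               // transform
                .filter(sq -> sq % 2 == 0)     // filter
                .limit(count)                  // cut the infinite stream short
                .collect(Collectors.toList()); // aggregate
    }

    public static void main(String[] args) {
        System.out.println(firstEvenSquares(3)); // [4, 16, 36]
    }
}
```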

Page 25:

Using Java 8 streams - filtering

Iterative version:

public class SocialNetwork {

  private Set<Person> people;

  ...

  public Set<Person> peopleWithNoFriends() {
    Set<Person> result = new HashSet<Person>();
    for (Person p : people) {
      if (p.getFriends().isEmpty())
        result.add(p);
    }
    return result;
  }

Stream version:

public class SocialNetwork {

  private Set<Person> people;

  ...

  public Set<Person> lonelyPeople() {
    return people.stream()
        .filter(p -> p.getFriends().isEmpty())   // predicate lambda expression
        .collect(Collectors.toSet());
  }

Page 26:

The filter() function

s1 a b c d e ...

s2 a c d ...

s2 = s1.filter(f)

Elements that satisfy predicate f

Page 27:

Using Java 8 streams - mapping

class Person ..

private Set<Friend> friends = ...;

  public Set<Hometown> hometownsOfFriends() {
    return friends.stream()
        .map(f -> f.getPerson().getHometown())
        .collect(Collectors.toSet());
  }

Page 28:

The map() function

s1 a b c d e ...

s2 f(a) f(b) f(c) f(d) f(e) ...

s2 = s1.map(f)

Page 29:

Using Java 8 streams - friend of friends using flatMap

class Person ..

  public Set<Person> friendOfFriends() {
    return friends.stream()
        .flatMap(friend -> friend.getPerson().friends.stream())   // maps and flattens
        .map(Friend::getPerson)
        .filter(f -> f != this)
        .collect(Collectors.toSet());
  }

Page 30:

The flatMap() function

s1 a b ...

s2 f(a)₀ f(a)₁ f(b)₀ f(b)₁ f(b)₂ ...

s2 = s1.flatMap(f)
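The difference between map() and flatMap() is easiest to see side by side. The following sketch, with invented class and method names, applies both to the same list of lines: map() yields one (nested) result per input element, while flatMap() concatenates the per-element streams into a single flat stream.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class MapVersusFlatMap {

    // map(): one output element per input element, so the result is a list of lists.
    static List<List<String>> wordsPerLine(List<String> lines) {
        return lines.stream()
                .map(line -> Arrays.asList(line.split(" ")))
                .collect(Collectors.toList());
    }

    // flatMap(): each line becomes a stream of words and the streams are joined end to end.
    static List<String> allWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a b", "c d e");
        System.out.println(wordsPerLine(lines)); // [[a, b], [c, d, e]]
        System.out.println(allWords(lines));     // [a, b, c, d, e]
    }
}
```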

Page 31:

Using Java 8 streams - reducing

public class SocialNetwork {

  private Set<Person> people;

  ...

  public long averageNumberOfFriends() {
    return people.stream()
        .map(p -> p.getFriends().size())
        .reduce(0, (x, y) -> x + y)
        / people.size();
  }

reduce(0, (x, y) -> x + y) corresponds to:

  int x = 0;
  for (int y : inputStream)
    x = x + y;
  return x;

Page 32:

The reduce() function

s1 a b c d e ...

x = s1.reduce(initial, f)

f(f(f(f(f(f(initial, a), b), c), d), e), ...)
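The nested expression above is a left fold. As a sketch, the hand-written foldLeft below (a hypothetical helper, not part of the JDK) spells out that evaluation order, and sumWithStream shows the equivalent Stream.reduce call. Note that Stream.reduce expects an associative function and an identity value, which is what lets parallel streams split the work; addition with 0 satisfies both.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

class ReduceSketch {

    // Evaluates f(...f(f(initial, a), b)..., last) from left to right.
    static <T> T foldLeft(List<T> items, T initial, BinaryOperator<T> f) {
        T result = initial;
        for (T item : items) {
            result = f.apply(result, item);
        }
        return result;
    }

    // The same computation expressed with the Java 8 streams API.
    static int sumWithStream(List<Integer> items) {
        return items.stream().reduce(0, (x, y) -> x + y);
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4);
        System.out.println(foldLeft(numbers, 0, (x, y) -> x + y)); // 10
        System.out.println(sumWithStream(numbers));                // 10
    }
}
```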

Page 33:

Adopting FP with Java 8 is straightforward

Simply start using streams and lambdas

Eclipse can refactor anonymous inner classes to lambdas

Page 34:

Agenda

Why functional programming?

Simplifying collection processing

Simplifying concurrency with Futures and Rx Observables

Tackling big data problems with functional programming

Page 35:

Let’s imagine that you are writing code to display the

products in a user’s wish list

Page 36:

The need for concurrency

Step #1

Web service request to get the user profile including wish list (list of product Ids)

Step #2

For each productId: web service request to get product info

But

Getting products sequentially ⇒ terrible response time

Need to fetch productInfo concurrently

Composing sequential + scatter/gather-style operations is very common

Page 37:

Futures are a great abstraction for composing concurrent operations

http://en.wikipedia.org/wiki/Futures_and_promises

Page 38:

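Composition with futures

[Diagram: a client on the main thread initiates asynchronous operation 1 and asynchronous operation 2, which run on worker threads or in event-driven code. Each operation sets the outcome of its future (Future 1, Future 2), and the client gets the outcomes from the futures.]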

Page 39:

But composition with basic futures is difficult

Java 7 future.get([timeout]):

Blocking API ⇒ client blocks thread

Difficult to compose multiple concurrent operations

Futures with callbacks:

e.g. Guava ListenableFutures, Spring 4 ListenableFuture

Attach callbacks to all futures and asynchronously consume outcomes

But callback-based code = messy code

See http://techblog.netflix.com/2013/02/rxjava-netflix-api.html

We need functional futures!

Page 40:

Functional futures - Scala, Java 8 CompletableFuture

def asyncPlus(x : Int, y : Int) : Future[Int] = ... x + y ...

val future2 = asyncPlus(4, 5).map{ _ * 3 }

assertEquals(27, Await.result(future2, 1 second))

Asynchronously transforms future

def asyncSquare(x : Int) : Future[Int] = ... x * x ...

val f2 = asyncPlus(5, 8).flatMap { x => asyncSquare(x) }

assertEquals(169, Await.result(f2, 1 second))

Calls asyncSquare() with the eventual outcome of asyncPlus()
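The slide title also names Java 8's CompletableFuture. A rough Java counterpart of the two Scala snippets might look like the sketch below, where thenApply() plays the role of map() and thenCompose() the role of flatMap(). The bodies of asyncPlus and asyncSquare are elided on the slide; running them with supplyAsync() is an assumption made here for illustration.

```java
import java.util.concurrent.CompletableFuture;

class FunctionalFutures {

    // Assumed implementation: computes the sum on a background thread.
    static CompletableFuture<Integer> asyncPlus(int x, int y) {
        return CompletableFuture.supplyAsync(() -> x + y);
    }

    // Assumed implementation: computes the square on a background thread.
    static CompletableFuture<Integer> asyncSquare(int x) {
        return CompletableFuture.supplyAsync(() -> x * x);
    }

    public static void main(String[] args) {
        // map(): asynchronously transforms the eventual outcome.
        CompletableFuture<Integer> future2 = asyncPlus(4, 5).thenApply(sum -> sum * 3);

        // flatMap(): chains a second asynchronous operation onto the first.
        CompletableFuture<Integer> f2 = asyncPlus(5, 8).thenCompose(FunctionalFutures::asyncSquare);

        // join() blocks; it is used here only to print the results of the demo.
        System.out.println(future2.join()); // 27
        System.out.println(f2.join());      // 169
    }
}
```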

Page 41:

Functions like map() are asynchronous

f2 = f1 map (someFn)

[Diagram: when f1 completes with outcome1, f2 completes with someFn(outcome1)]

Implemented using callbacks

Page 42:

Scala wish list service

class WishListService(...) {
  def getWishList(userId : Long) : Future[WishList] = {
    userService.getUserProfile(userId).                       // Future[UserProfile]
      map { userProfile => userProfile.wishListProductIds }.  // Future[List[Long]]
      flatMap { productIds =>
        val listOfProductFutures =                            // List[Future[ProductInfo]]
          productIds map productInfoService.getProductInfo
        Future.sequence(listOfProductFutures)                 // Future[List[ProductInfo]]
      }.
      map { products => WishList(products) }                  // Future[WishList]
  }
}

Java 8 Completable Futures let you write similar code
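As a sketch of what "similar code" could look like with CompletableFuture: the two service methods below are hypothetical stand-ins for the remote calls (they return canned data, and product info is reduced to a String), and sequence() is a hand-written counterpart of Scala's Future.sequence, since the JDK offers allOf() but no direct equivalent.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

class WishListSketch {

    // Hypothetical stand-in for the user profile service call.
    static CompletableFuture<List<Long>> getWishListProductIds(long userId) {
        return CompletableFuture.supplyAsync(() -> Arrays.asList(1L, 2L, 3L));
    }

    // Hypothetical stand-in for the product info service call.
    static CompletableFuture<String> getProductInfo(long productId) {
        return CompletableFuture.supplyAsync(() -> "product-" + productId);
    }

    // Turns a list of futures into a future of a list, preserving order.
    static <T> CompletableFuture<List<T>> sequence(List<CompletableFuture<T>> futures) {
        return CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
                .thenApply(ignored -> futures.stream()
                        .map(CompletableFuture::join) // already complete, so this does not block
                        .collect(Collectors.toList()));
    }

    // Sequential step (get the ids) followed by a scatter/gather step (get each product).
    static CompletableFuture<List<String>> getWishList(long userId) {
        return getWishListProductIds(userId).thenCompose(productIds -> {
            List<CompletableFuture<String>> productFutures = productIds.stream()
                    .map(WishListSketch::getProductInfo)
                    .collect(Collectors.toList());
            return sequence(productFutures);
        });
    }

    public static void main(String[] args) {
        System.out.println(getWishList(42L).join()); // [product-1, product-2, product-3]
    }
}
```

All product requests are started before any of them is awaited, which is what gives the concurrent fetch.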

Page 43:

Your mouse is your database

Erik Meijer

http://queue.acm.org/detail.cfm?id=2169076

Page 44:

Introducing Reactive Extensions (Rx)

The Reactive Extensions (Rx) is a library for composing asynchronous and event-based programs using observable sequences and LINQ-style query operators. Using Rx, developers represent asynchronous data streams with Observables, query asynchronous data streams using LINQ operators, and .....

https://rx.codeplex.com/

Page 45:

About RxJava

Reactive Extensions (Rx) for the JVM

Original motivation for Netflix was to provide rich Futures

Implemented in Java

Adaptors for Scala, Groovy and Clojure

Embraced by Akka and Spring Reactor: http://www.reactive-streams.org/

https://github.com/Netflix/RxJava

Page 46:

RxJava core concepts

// An asynchronous stream of items
trait Observable[T] {
  // The returned Subscription is used to unsubscribe
  def subscribe(observer : Observer[T]) : Subscription
  ...
}

// The Observable notifies the Observer through these methods
trait Observer[T] {
  def onNext(value : T)
  def onCompleted()
  def onError(e : Throwable)
}
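To show how these two interfaces fit together, here is a deliberately simplified push-based sketch in plain Java. SimpleObservable and SimpleObserver are hypothetical types written for this example; they are not the RxJava API, they run synchronously, and they leave out subscriptions and error propagation from the source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical, simplified counterpart of Observer[T].
interface SimpleObserver<T> {
    void onNext(T value);
    void onCompleted();
    void onError(Throwable e);
}

// Hypothetical, simplified counterpart of Observable[T].
interface SimpleObservable<T> {
    void subscribe(SimpleObserver<T> observer);

    // An observable that pushes the given items to each subscriber, then completes.
    static <T> SimpleObservable<T> items(List<T> items) {
        return observer -> {
            for (T item : items) observer.onNext(item);
            observer.onCompleted();
        };
    }

    // map(): a new observable that transforms every item before passing it downstream.
    default <R> SimpleObservable<R> map(Function<T, R> f) {
        return downstream -> subscribe(new SimpleObserver<T>() {
            @Override public void onNext(T value) { downstream.onNext(f.apply(value)); }
            @Override public void onCompleted() { downstream.onCompleted(); }
            @Override public void onError(Throwable e) { downstream.onError(e); }
        });
    }
}

class ObservableSketch {

    // Subscribes and records every notification in the order it arrives.
    static List<String> collect(SimpleObservable<String> source) {
        List<String> received = new ArrayList<>();
        source.subscribe(new SimpleObserver<String>() {
            @Override public void onNext(String value) { received.add(value); }
            @Override public void onCompleted() { received.add("<completed>"); }
            @Override public void onError(Throwable e) { received.add("<error>"); }
        });
        return received;
    }

    public static void main(String[] args) {
        SimpleObservable<String> labels =
                SimpleObservable.items(Arrays.asList(1, 2, 3)).map(i -> "item-" + i);
        System.out.println(collect(labels)); // [item-1, item-2, item-3, <completed>]
    }
}
```

The producer calls the observer (push), in contrast to an iterator, where the consumer calls next() (pull).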

Page 47:

Comparing Observable to...

Observer pattern - similar but adds

Observer.onCompleted()

Observer.onError()

Iterator pattern - mirror image

Push rather than pull

Futures - similar

Can be used as Futures

But Observables = a stream of multiple values

Collections and Streams - similar

Functional API supporting map(), flatMap(), ...

But Observables are asynchronous

Page 48:

Fun with observables

val every10Seconds = Observable.interval(10 seconds)

val oneItem = Observable.items(-1L)

val ticker = oneItem ++ every10Seconds

// ticker emits:  -1    0     1     ...
//          at:   t=0   t=10  t=20  ...

val subscription = ticker.subscribe { (value: Long) => println("value=" + value) }
...
subscription.unsubscribe()

Page 49:

Connecting observables to the outside world

def getTableStatus(tableName: String) : Observable[DynamoDbStatus] =
  // Called once per subscriber
  Observable { subscriber: Subscriber[DynamoDbStatus] =>

    // Asynchronously gets information about DynamoDB table
    amazonDynamoDBAsyncClient.describeTableAsync(new DescribeTableRequest(tableName),
      new AsyncHandler[DescribeTableRequest, DescribeTableResult] {

        override def onSuccess(request: DescribeTableRequest, result: DescribeTableResult) = {
          subscriber.onNext(DynamoDbStatus(result.getTable.getTableStatus))
          subscriber.onCompleted()
        }

        override def onError(exception: Exception) = exception match {
          case t: ResourceNotFoundException =>
            subscriber.onNext(DynamoDbStatus("NOT_FOUND"))
            subscriber.onCompleted()
          case _ =>
            subscriber.onError(exception)
        }
      })
  }

Page 50:

Transforming observables

val tableStatus : Observable[DynamoDbMessage] = ticker.flatMap { i =>
  logger.info("{}th describe table", i + 1)
  getTableStatus(name)
}

Status1 Status2 Status3 ...

t=0 t=10 t=20 ...

+ Usual collection methods: map(), filter(), take(), drop(), ...

Page 51:

Calculating rolling average

class AverageTradePriceCalculator {

  def calculateAverages(trades: Observable[Trade]): Observable[AveragePrice] = { ... }

case class Trade(symbol : String, price : Double, quantity : Int ...)

case class AveragePrice(symbol : String, price : Double, ...)

Page 52:

Calculating average prices

def calculateAverages(trades: Observable[Trade]): Observable[AveragePrice] = {

  // Create an Observable of per-symbol Observables
  trades.groupBy(_.symbol).
    map { case (symbol, tradesForSymbol) =>
      val openingEverySecond =
        Observable.items(-1L) ++ Observable.interval(1 seconds)
      def closingAfterSixSeconds(opening: Any) =
        Observable.interval(6 seconds).take(1)

      tradesForSymbol.window(openingEverySecond, closingAfterSixSeconds _).map {
        windowOfTradesForSymbol =>
          windowOfTradesForSymbol.fold((0.0, 0, List[Double]())) { (soFar, trade) =>
            val (sum, count, prices) = soFar
            (sum + trade.price, count + trade.quantity, trade.price +: prices)
          } map { case (sum, length, prices) =>
            AveragePrice(symbol, sum / length, prices)
          }
      }.flatten
    }.flatten
}

Page 53:

Agenda

Why functional programming?

Simplifying collection processing

Simplifying concurrency with Futures and Rx Observables

Tackling big data problems with functional programming

Page 54:

Let’s imagine that you want to count word frequencies

Page 55:

Scala Word Count

val frequency : Map[String, Int] =
  Source.fromFile("gettysburgaddress.txt").getLines()
    .flatMap { _.split(" ") }.toList   // Map
    .groupBy(identity)                 // Reduce
    .mapValues(_.length)

frequency("THE") should be(11)
frequency("LIBERTY") should be(1)
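For comparison, the same map-then-reduce shape can be sketched with Java 8 streams. The class and method names are hypothetical, and the input is a list of lines rather than a file so that the example is self-contained. Converting words to upper case is an assumption made here to mirror the upper-case keys in the assertions above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class WordCountSketch {

    static Map<String, Long> wordFrequencies(List<String> lines) {
        return lines.stream()
                // "Map" step: split every line into words and flatten the result.
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .filter(word -> !word.isEmpty())
                .map(String::toUpperCase)
                // "Reduce" step: group identical words and count each group.
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> frequency = wordFrequencies(Arrays.asList("the cat", "The dog the"));
        System.out.println(frequency.get("THE")); // 3
    }
}
```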

Page 56:

But how to scale to a cluster of machines?

Page 57:

Apache Hadoop

Open-source software for reliable, scalable, distributed computing

Hadoop Distributed File System (HDFS)

Efficiently stores very large amounts of data

Files are partitioned and replicated across multiple machines

Hadoop MapReduce

Batch processing system

Provides plumbing for writing distributed jobs

Handles failures

...

Page 58:

Overview of MapReduce

[Diagram: the input data is read as (K,V) pairs by three Mappers. Each Mapper emits (K,V)* pairs. The shuffle groups the values by key into (K1,V, ....)*, (K2,V, ....)* and (K3,V, ....)*, one group per Reducer. Each Reducer emits (K,V) pairs to the output data.]
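The three phases in the diagram can be imitated in memory on a single machine. The sketch below is only an illustration of the data flow for word count; it does not use the Hadoop API, and all names in it are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class MapReduceSketch {

    // Map phase: one input line in, zero or more (word, 1) pairs out.
    static List<Map.Entry<String, Integer>> mapper(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) pairs.add(new SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Shuffle phase: collect all values that share a key, giving (key, [values]).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: one key and its values in, one aggregated value out.
    static int reducer(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) sum += value;
        return sum;
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = lines.stream()
                .flatMap(line -> mapper(line).stream())
                .collect(Collectors.toList());
        Map<String, Integer> result = new TreeMap<>();
        shuffle(mapped).forEach((key, values) -> result.put(key, reducer(key, values)));
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = java.util.Arrays.asList("four score and", "seven and four and");
        System.out.println(wordCount(lines)); // {and=3, four=2, score=1, seven=1}
    }
}
```

In a real cluster the mappers and reducers run on different machines and the shuffle moves data across the network; the function shapes stay the same.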

Page 59:

MapReduce Word count - mapper

class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // context.write() throws checked exceptions, so map() must declare them
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one);
    }
  }
}

Four score and seven years ⇒ (“Four”, 1), (“score”, 1), (“and”, 1), (“seven”, 1), ...

http://wiki.apache.org/hadoop/WordCount

Page 60:

Hadoop then shuffles the key-value pairs...

Page 61:

MapReduce Word count - reducer

class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

  // context.write() throws checked exceptions, so reduce() must declare them
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    context.write(key, new IntWritable(sum));
  }
}

(“the”, (1, 1, 1, 1, 1, 1, ...)) ⇒ (“the”, 11)

http://wiki.apache.org/hadoop/WordCount

Page 62:

About MapReduce

Very simple programming abstraction yet incredibly powerful

By chaining together multiple map/reduce jobs you can process very large amounts of data in interesting ways

e.g. Apache Mahout for machine learning

But

Mappers and Reducers = verbose code

Development is challenging, e.g. unit testing is difficult

It’s disk-based, batch processing ⇒ slow

Page 63:

Scalding: Scala DSL for MapReduce

class WordCountJob(args : Args) extends Job(args) {
  TextLine( args("input") )
    .flatMap('line -> 'word) { line : String => tokenize(line) }
    .groupBy('word) { _.size }
    .write( Tsv( args("output") ) )

  def tokenize(text : String) : Array[String] = {
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "")
      .split("\\s+")
  }
}

https://github.com/twitter/scalding

Expressive and unit testable

Each row is a map of named fields

Page 64:

Apache Spark

Part of the Hadoop ecosystem

Key abstraction = Resilient Distributed Datasets (RDD)

Collection that is partitioned across cluster members

Operations are parallelized

Created from either a Scala collection or a Hadoop supported datasource - HDFS, S3 etc

Can be cached in-memory for super-fast performance

Can be replicated for fault-tolerance

REPL for executing ad hoc queries

http://spark.apache.org

Page 65:

Spark Word Count

val sc = new SparkContext(...)

sc.textFile("s3n://mybucket/...")
  .flatMap { _.split(" ") }
  .groupBy(identity)
  .mapValues(_.length)
  .toArray.toMap

Expressive, unit testable and very fast

Page 66:

Summary

Functional programming enables the elegant expression of good ideas in a wide variety of domains

map(), flatMap() and reduce() are remarkably versatile higher-order functions

Use FP and OOP together

Java 8 has taken a good first step towards supporting FP

Page 67:

Questions?

@crichardson

http://plainoldobjects.com