Intelligent System Optimizations

MANCHESTER LONDON NEW YORK

Martin Zapletal @zapletal_martin @ScaleByTheBay

@cakesolutions

https://twitter.com/zapletal_martin

val in: Array[Int]

var i = 0
var result = 0
// Sum the array imperatively; the bound must be strict to stay within the array
while (i < in.length) {
  result += in(i)
  i += 1
}

Motivation

val in: Array[Int]

in.foldLeft(0)(_ + _)

Distributed System Optimizations

● Performance, cost, reliability, uptime
● Logical, physical, intelligence, ...

[Diagram: three "Optimized single machine" boxes and one "Optimized distributed system"]

Machine learning

● Stock market predictions
● Product recommendations
● Facial recognition
● Object recognition
● Speech understanding
● Self-driving cars
● Distributed system optimizations
● ...

Observability

● Observability
  ○ Logs
  ○ Metrics
  ○ Traces
● Analytics
● Understanding
● Actions
● Reactive

[9]

Economies of scale

[13, 14]

● Find the perfect balance
● Align the risk taken by a service with the risk the business is willing to bear
● Explicit decisions

[15]

ML in large scale systems

● Node assignment
● Cluster scheduling
● Resource management
● Serverless instance sizes
● Automated scaling
● Anomaly detection
● Topology search
● Failure domain identification
● Data mining
● Configuration
● Optimizations
● ...

[15]

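Configuration

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Load the sample data in LIBSVM format
val data = spark.read.format("libsvm")
  .load("sample_multiclass_classification_data.txt")

// Split into training (60%) and test (40%) sets
val splits = data.randomSplit(Array(0.6, 0.4), seed = 1234L)
val train = splits(0)
val test = splits(1)

// Layer sizes: 4 inputs, hidden layers of 5 and 4 units, 3 output classes
val layers = Array[Int](4, 5, 4, 3)

val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)

val model = trainer.fit(train)

// Measure accuracy on the held-out test set
val result = model.transform(test)
val predictionAndLabels = result.select("prediction", "label")
val evaluator = new MulticlassClassificationEvaluator()
  .setMetricName("accuracy")

evaluator.evaluate(predictionAndLabels)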

import org.apache.spark.sql.SparkSession

// Every value below is a tunable parameter of the same Spark job
val spark = SparkSession
  .builder()
  .config("spark.reducer.maxSizeInFlight", maxSizeInFlight)
  .config("spark.reducer.maxReqsInFlight", maxReqsInFlight)
  .config("spark.shuffle.file.buffer", shuffleFileBuffer)
  .config("spark.memory.fraction", sparkMemoryFraction)
  .config("spark.shuffle.service.index.cache.entries", shuffleServiceIndexCacheEntries)
  .config("spark.memory.storageFraction", sparkMemoryStorageFraction)
  .config("spark.shuffle.memoryFraction", sparkShuffleMemoryFraction)
  .config("spark.storage.memoryFraction", sparkStorageMemoryFraction)
  .config("spark.storage.unrollFraction", sparkStorageUnrollFraction)
  .config("spark.broadcast.blockSize", sparkBroadcastBlockSize)
  .config("spark.executor.cores", sparkExecutorCores)
  .config("spark.default.parallelism", defaultParallelism)
  .config("spark.files.maxPartitionBytes", sparkFilesMaxPartitionBytes)
  .config("spark.files.openCostInBytes", sparkFilesOpenCostInBytes)
  .config("spark.rpc.message.maxSize", sparkRpcMessageMaxSize)
  .config("spark.storage.memoryMapThreshold", sparkStorageMemoryMapThreshold)
  .config("spark.cores.max", sparkCoresMax)
  .config("spark.speculation", "true")
  .config("spark.speculation.interval", sparkSpeculationInterval)
  .config("spark.speculation.multiplier", sparkSpeculationMultiplier)
  .config("spark.speculation.quantile", sparkSpeculationQuantile)
  .config("spark.task.cpus", sparkTaskCpus)
  .getOrCreate()
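One way to collect data about such a configuration space is to run the job repeatedly with randomly drawn settings and record the outcome of each run. The sketch below is only an illustration: the candidate values in SEARCH_SPACE, the helper names, and fake_job (a stand-in for submitting and timing a real Spark job) are all assumptions, not part of the talk.

```python
import random

# Hypothetical search space: a few of the Spark settings named above, with
# candidate values chosen purely for illustration.
SEARCH_SPACE = {
    "spark.reducer.maxSizeInFlight": ["1mb", "10mb", "100mb"],
    "spark.shuffle.file.buffer": ["1kb", "100kb", "1mb"],
    "spark.memory.fraction": [0.1, 0.5, 0.75],
    "spark.executor.cores": [1, 2, 4, 8],
    "spark.default.parallelism": [2, 8, 32, 128],
}


def sample_configuration(rng):
    """Draw one random value for every setting in the search space."""
    return {key: rng.choice(values) for key, values in SEARCH_SPACE.items()}


def run_trials(run_job, n_trials, seed=0):
    """Run the job once per sampled configuration; return (config, runtime) pairs."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        config = sample_configuration(rng)
        trials.append((config, run_job(config)))
    return trials


def fake_job(config):
    # Stand-in for a real measurement: pretend runtime depends only on cores.
    return 100.0 / config["spark.executor.cores"]


trials = run_trials(fake_job, n_trials=5)
```

Each recorded pair can then serve as one training example: the configuration is the feature vector and the measured runtime is the label.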

Random trials

● Labelled training examples (feature vector, label)
● Classification, regression, clustering, ...
● Find an algorithm that, for a given feature vector, finds the correct label
● Optimization of an objective function with respect to model parameters
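The last point, optimizing an objective function with respect to model parameters, can be illustrated with a deliberately small sketch: plain gradient descent on the mean squared error of a linear model. The toy data, learning rate, and function names are assumptions chosen for illustration, and the model is far simpler than the neural network used later in the talk.

```python
def predict(weights, bias, features):
    """Linear model: weighted sum of the features plus a bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def mean_squared_error(weights, bias, examples):
    """Objective function: average squared difference to the labels."""
    return sum((predict(weights, bias, x) - y) ** 2 for x, y in examples) / len(examples)


def train(examples, learning_rate=0.05, steps=2000):
    """Minimize the mean squared error by batch gradient descent."""
    n = len(examples)
    n_features = len(examples[0][0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(steps):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in examples:
            error = predict(weights, bias, x) - y
            for j in range(n_features):
                grad_w[j] += 2 * error * x[j] / n
            grad_b += 2 * error / n
        # Step against the gradient of the objective
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


# Toy labelled examples generated from label = 2 * x0 - 1 * x1 + 0.5
examples = [
    ((0.0, 0.0), 0.5),
    ((1.0, 0.0), 2.5),
    ((0.0, 1.0), -0.5),
    ((1.0, 1.0), 1.5),
    ((2.0, 1.0), 3.5),
]
weights, bias = train(examples)
```

After training, the learned weights and bias should be close to the values used to generate the toy data.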

Features | Label
100mb,10,100kb,0.75,0.1,0.001,0.75,0.75,0.1,1g,2,8,10000,1000000,100,1g,1ms,1.5,0.01,1g | 61.9799
1kb,2147483647,10mb,0.75,0.001,0.1,0.001,0.5,0.001,10mb,8,2,10000,1000000,100,1kb,1ms,1.1,0.75,512mb | 12.2511
100mb,10,1mb,0.75,0.5,0.001,0.1,0.001,0.5,100mb,4,1,100000000,1000,1000,10mb,100ms,1.5,0.01,8g | 20.7461
... | ...
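Before rows like these can be fed to a model, the size and time strings have to become numbers. The following is a minimal sketch of one possible encoding, assuming sizes are converted to bytes and times to milliseconds; the unit tables and the function names to_number and encode_row are hypothetical and cover only the suffixes that appear in the rows above.

```python
import re

# Assumed unit conversions (only the suffixes seen in the sample rows)
SIZE_UNITS_BYTES = {"kb": 1024, "mb": 1024 ** 2, "g": 1024 ** 3}
TIME_UNITS_MS = {"ms": 1}

TOKEN = re.compile(r"^([0-9.]+)([a-z]*)$")


def to_number(token):
    """Convert one feature token such as '100mb', '1ms' or '0.75' to a float."""
    match = TOKEN.match(token)
    if match is None:
        raise ValueError("unrecognized feature value: " + token)
    value, unit = float(match.group(1)), match.group(2)
    if unit == "":
        return value
    if unit in SIZE_UNITS_BYTES:
        return value * SIZE_UNITS_BYTES[unit]
    if unit in TIME_UNITS_MS:
        return value * TIME_UNITS_MS[unit]
    raise ValueError("unknown unit: " + unit)


def encode_row(row):
    """Turn a comma-separated feature row into a numeric feature vector."""
    return [to_number(token) for token in row.split(",")]


row = "100mb,10,100kb,0.75,0.1,0.001,0.75,0.75,0.1,1g,2,8,10000,1000000,100,1g,1ms,1.5,0.01,1g"
vector = encode_row(row)
```

In practice values spanning many orders of magnitude would usually also be rescaled (for example by taking logarithms) before training, which is left out here.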

Supervised Learning

import math
import tensorflow as tf

def deep_nn(x, hidden1_units, hidden2_units):
    # First hidden layer: 20 input features -> hidden1_units
    with tf.name_scope('hidden1'):
        weights1 = tf.Variable(
            tf.truncated_normal([20, hidden1_units],
                                stddev=1.0 / math.sqrt(float(20))),
            name='weights')
        biases1 = tf.Variable(tf.zeros([hidden1_units]), name='biases')
        hidden1 = tf.nn.relu(tf.matmul(x, weights1) + biases1)

    # Second hidden layer: hidden1_units -> hidden2_units
    with tf.name_scope('hidden2'):
        weights2 = tf.Variable(
            tf.truncated_normal([hidden1_units, hidden2_units],
                                stddev=1.0 / math.sqrt(float(hidden1_units))),
            name='weights')
        biases2 = tf.Variable(tf.zeros([hidden2_units]), name='biases')
        hidden2 = tf.nn.relu(tf.matmul(hidden1, weights2) + biases2)

    # Linear output layer: a single regression output
    with tf.name_scope('linear'):
        weights3 = tf.Variable(
            tf.truncated_normal([hidden2_units, 1],
                                stddev=1.0 / math.sqrt(float(hidden2_units))),
            name='weights')
        biases3 = tf.Variable(tf.zeros([1]), name='biases')
        logits = tf.matmul(hidden2, weights3) + biases3
    return logits

# Read batches of (features, label) examples from the data file
example_batch, label_batch = read_example_batch(filename=FLAGS.data_file, batch_size=batch_size)

x = tf.placeholder(tf.float32, [None, 20])    # feature vectors
y = deep_nn(x, hidden1_units, hidden2_units)  # predicted labels
y_ = tf.placeholder(tf.float32, [None, 1])    # true labels

# Squared-error objective (half the mean squared error), minimized by gradient descent
mse = tf.reduce_sum(tf.pow(y - y_, 2)) / (2 * batch_size)
train_step = tf.train.GradientDescentOptimizer(learning_rate=0.25).minimize(mse)

sess = tf.Session()
with sess.as_default():
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    for i in range(1000):
        examples, labels = sess.run([example_batch, label_batch])
        sess.run(train_step, feed_dict={x: examples, y_: labels})

Problem solved

Data
● What about training data?
● Random runs
● In-line optimization
● Synthetic data
● Generative models
● Unsupervised learning

General case

[Diagram: example system architecture with Load Balancer, Akka-http, Kafka, Microservice, Akka Cluster, CockroachDB, Dynamo, Redis Cache, Lambda]

● Complex
● Long running
● Large number of variables (many unknown)
● Many possible actions
● Temporal reward

Reinforcement Learning
● Trial and error (positive reinforcement)
● Set of states S
● Set of actions A
● Reward r
● Transition probability P(s' | a, s)
● Trajectory: s0, a0, r1 -> s1, a1, r2 -> ...
● Q(s, a) = r + γ * max_a' Q(s', a')

[19, 20]

[Diagram: reinforcement learning loop between Agent and Environment, with Action, Interpreter, Reward, and State]
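To make the update rule Q(s, a) = r + γ * max_a' Q(s', a') concrete, here is a minimal tabular Q-learning sketch on a toy five-state chain. The environment, the learning rate alpha (which blends each new target into the old estimate), and all names are assumptions for illustration; the talk itself uses a neural network rather than a table.

```python
import random

N_STATES = 5       # states 0..4; reaching state 4 ends the episode
ACTIONS = (0, 1)   # 0 = move left, 1 = move right


def step(state, action):
    """Deterministic toy transition: returns (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes=500, gamma=0.9, alpha=0.5, max_steps=100, seed=0):
    """Learn a Q table by trial and error with purely random exploration."""
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _episode in range(episodes):
        state = 0
        for _step in range(max_steps):
            action = rng.choice(ACTIONS)
            next_state, reward, done = step(state, action)
            # Target: r + gamma * max_a' Q(s', a'), with no future value at the end
            future = 0.0 if done else max(q[next_state])
            target = reward + gamma * future
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q = q_learning()
# Greedy policy: the action with the largest Q-value in each non-terminal state
policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
```

Because the reward only arrives at the far end of the chain, the discounted values propagate backwards through the table until moving right becomes the preferred action in every state.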

Deep Reinforcement Learning
● Complex hierarchical actions and planning
● Temporal reward

[21]

Deep Reinforcement Learning - Gym

Reward: number of errors, cost

State:
Concurrent user load | 10,000 | 12,000 | 20,000 | ...
Instances and sizes | 1,1,2 | ? | ? | ?

Action:
● Nothing
● Increase instance size
● Decrease instance size
● Add instance
● Remove instance
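The state, action, and reward described above could be wrapped in an environment with a Gym-style reset/step interface. The sketch below is a hypothetical toy version with no dependency on the gym package: the capacity per size unit, the cost and error weights, and the choice to resize only the last instance are assumptions made up for illustration, not the environment used in the talk.

```python
class ScalingEnv:
    """Toy autoscaling environment with a Gym-style reset/step interface."""

    NOTHING, INC_SIZE, DEC_SIZE, ADD_INSTANCE, REMOVE_INSTANCE = range(5)

    def __init__(self, loads=(10000, 12000, 20000), users_per_size_unit=5000,
                 cost_per_size_unit=1.0, error_penalty=0.001):
        self.loads = loads
        self.users_per_size_unit = users_per_size_unit
        self.cost_per_size_unit = cost_per_size_unit
        self.error_penalty = error_penalty
        self.reset()

    def _state(self):
        # State: current concurrent user load plus the instance sizes
        load = self.loads[min(self.t, len(self.loads) - 1)]
        return (load, tuple(self.instances))

    def reset(self):
        self.t = 0
        self.instances = [1, 1, 2]
        return self._state()

    def step(self, action):
        # Apply the action (resizing affects the last instance, by assumption)
        if action == self.INC_SIZE:
            self.instances[-1] += 1
        elif action == self.DEC_SIZE and self.instances[-1] > 1:
            self.instances[-1] -= 1
        elif action == self.ADD_INSTANCE:
            self.instances.append(1)
        elif action == self.REMOVE_INSTANCE and len(self.instances) > 1:
            self.instances.pop()

        # Reward: negative of the weighted number of errors plus the cost
        load = self.loads[self.t]
        capacity = sum(self.instances) * self.users_per_size_unit
        errors = max(0, load - capacity)
        cost = sum(self.instances) * self.cost_per_size_unit
        reward = -(self.error_penalty * errors + cost)

        self.t += 1
        done = self.t >= len(self.loads)
        return self._state(), reward, done, {}


env = ScalingEnv()
state = env.reset()
state, reward, done, _ = env.step(ScalingEnv.NOTHING)
```

Removing capacity lowers the cost term but raises the error term once the load exceeds what the remaining instances can serve, which is the trade-off an agent would have to learn.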

Deep Reinforcement Learning - Agent

class agent():
    def __init__(self, s_size, a_size):
        # Single linear layer mapping a state vector to one Q-value per action
        self.inputs = tf.placeholder(shape=[None, s_size], dtype=tf.float32)
        W = tf.Variable(tf.random_uniform([s_size, a_size], 0, 0.01))
        self.Q_values = tf.matmul(self.inputs, W)

        # Greedy action and its Q-value
        self.largest_Q_value_index = tf.argmax(self.Q_values, 1)
        self.largest_Q_value = tf.reduce_max(self.Q_values)

        # Squared error between the target and the predicted Q-values
        self.next_Q_values = tf.placeholder(shape=[1, a_size], dtype=tf.float32)
        loss = tf.reduce_sum(tf.square(self.next_Q_values - self.Q_values))
        trainer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
        self.updateModel = trainer.minimize(loss)

[22, 23]





with tf.Session() as sess:
    sess.run(init)
    for i in range(num_episodes):
        state = env.reset()
        d = False
        j = 0
        while j < env.episode_length() + 1:
            j += 1
            fixed_in = fixed_length_history(episode_history[:, 0])
            # Q-values and the greedy action for the current input
            chosen_action, all_Q_values = sess.run(
                [myAgent.largest_Q_value_index, myAgent.Q_values],
                feed_dict={myAgent.inputs: fixed_in})
            # Occasionally explore by taking a random action instead
            if np.random.rand(1) < random_action_probability:
                chosen_action[0] = env.action_space.sample()


            next_state, r, d, _ = env.step(chosen_action[0])

            episode_history.append([state, chosen_action, r, next_state])
            # next_fixed_in is not defined on the slide; presumably the input built from the updated history
            next_largest_Q_value = sess.run(
                myAgent.largest_Q_value,
                feed_dict={myAgent.inputs: next_fixed_in})
            # Q-learning target for the chosen action: r + gamma * max Q(s', a')
            all_Q_values[0, chosen_action[0]] = r + gamma * next_largest_Q_value
            _ = sess.run(
                myAgent.updateModel,
                feed_dict={myAgent.inputs: fixed_in, myAgent.next_Q_values: all_Q_values})
            state = next_state


Learn

[24, 25]

Conclusion

● Optimize your code and architecture
● Measure, gather data
● Use the data!
● Continuously improve and evolve

Questions


0845 617 1200

@zapletal_martin @cakesolutions



We are hiring: http://www.cakesolutions.net/careers

References
[0] https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-whole-stage-codegen.html
[1] https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/293651311471490/5382278320999420/latest.html
[2] http://hydronitrogen.com/in-the-code-spark-sql-query-planning-and-execution.html
[3] https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
[4] http://www.vldb.org/pvldb/vol4/p539-neumann.pdf
[5] https://blog.acolyer.org/2016/05/23/efficiently-compiling-efficient-query-plans-for-modern-hardware/
[6] https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-lineage.html
[7] https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html
[8] https://www.slideshare.net/SparkSummit/costbased-optimizer-framework-for-spark-sql-spark-summit-east-talk-by-ron-hu-and-zhenhua-wang
[9] https://medium.com/@copyconstruct/monitoring-in-the-time-of-cloud-native-c87c7a5bfa3e
[10] https://www.lightbend.com/blog/lightbend-monitoring-now-integrates-with-datadog-for-monitoring-akka-based-reactive-applications
[11] https://www.lightbend.com/blog/how-to-get-monitoring-right-for-streaming-and-fast-data-systems-built-with-spark-mesos-akka-cassandra-and-kafka?utm_content=buffera2917&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
[12] http://zipkin.io/
[13] https://www.theverge.com/2016/7/21/12246258/google-deepmind-ai-data-center-cooling
[14] http://www.npr.org/sections/thetwo-way/2017/03/03/518322734/amazon-and-the-150-million-typo.
[15] http://money.cnn.com/2016/09/07/technology/delta-computer-outage-cost/
[16] https://www.slideshare.net/SparkSummit/ernest-efficient-performance-prediction-for-advanced-analytics-on-apache-spark-spark-summit-east-talk-by-shivaram-venkataraman
[17] https://www.slideshare.net/SparkSummit/auto-scaling-systems-with-elastic-spark-streaming-spark-summit-east-talk-by-phuduc-nguyen
[18] https://cloud.google.com/blog/big-data/2016/03/comparing-cloud-dataflow-autoscaling-to-spark-and-hadoop
[19] https://www.youtube.com/watch?v=URWXG5jRB-A
[20] https://en.wikipedia.org/wiki/Reinforcement_learning
[21] https://stats.stackexchange.com/questions/234891/difference-between-convolution-neural-network-and-deep-learning
[22] https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0
[23] https://github.com/awjuliani/DeepRL-Agents/blob/master/Q-Network.ipynb
[24] https://www.youtube.com/watch?v=22g14GtVhXk
[25] https://www.youtube.com/watch?v=C-BY3JhXTiE


Thank you
