
Page 1: 1. Spark DataFrames + SQL - Systems Group · 2019-06-11 · Big Data for Engineers – Exercises Spring 2019 – Week 9 – ETH Zurich Spark + MongoDB 1. Spark DataFrames + SQL 1.1

Big Data for Engineers – Exercises

Spring 2019 – Week 9 – ETH Zurich

Spark + MongoDB

1. Spark DataFrames + SQL

1.1 Setup the Spark cluster on Azure

Create a cluster

1. Sign in to the Azure portal and search for "HDInsight clusters" using the search box at the top.
2. Click on "+ Add".
3. Give the cluster a unique name.
4. In "Select Cluster Type", choose Spark and a standard Cluster Tier (finish by pressing "Select").
5. In step 2, the container name will be filled in for you automatically. If you want to do the exercise sheet in several sittings, change it to something you can remember or write it down.
6. Set up a Spark cluster with the default configuration. It should cost around 3.68 sFR/h.
7. Wait about 20 minutes until your cluster is ready.


Remember to delete the cluster once you are done. If you want to stop doing the exercises at any point, delete it and recreate it using the same container name as you used the first time, so that the resources are still there.


Access your cluster

Make sure you can access your cluster (the NameNode) via SSH:

$ ssh <ssh_user_name>@<cluster_name>

If you are using Linux or MacOSX, you can use your standard terminal. If you are using Windows you can use:

- PuTTY SSH client and PSCP tool (get them here).
- This notebook server's terminal (click on the Jupyter logo and then go to New -> Terminal).
- Azure Cloud Terminal (see the HBase exercise sheet for details).

The cluster has its own Jupyter server. We will use it. You can access it through the following link:


You can access cluster's YARN in your browser


The Spark UI can be accessed via Azure Portal, see Spark job debugging

You need to upload this notebook to your cluster's Jupyter in order to execute Python code blocks. To do this, just open Jupyter through the link given above and use the "Upload" button.

1.2. The Great Language Game

This week you will again be using the language confusion dataset. You will write queries with Spark DataFrames and SQL. You will have to submit the results of this exercise to Moodle to obtain the weekly bonus. You will need four things:

- The query you wrote
- Something related to its output (which you will be graded on)
- The time it took you to write it
- The time it took you to run it

As you might have observed in the sample queries above, the time a job took to run is displayed in the rightmost column of its output. If it consists of several stages, however, you will need the sum of them. The easiest option is to take the execution time of the whole query:


Of course, you will not be evaluated on the time it took you to write the queries (nor on the time it took them to run), but this is useful to us in order to measure the increase in performance when using Sparksoniq. There is a cell that outputs the time you started working before every query. Use this if you find it useful.

For this exercise, we strongly suggest that you use the Azure cluster as described above.

Log in to your cluster using SSH as explained above and run the following commands:


tar -jxvf confusion-2014-03-02.tbz2 -C /tmp

hdfs dfs -copyFromLocal /tmp/confusion-2014-03-02/confusion-2014-03-02.json /confusion.json

This downloads the archive file to the cluster, decompresses it and uploads it to HDFS when using a cluster. Now, create an RDD from the file containing the entries:

In [ ]:

data = sc.textFile('wasb:///confusion.json')

Last week you loaded the json data with the following snippet:

In [ ]:

import json
entries =

This week you will use DataFrames:

In [ ]:

entries_df =

You can check the schema by executing the following code:

In [ ]:


You can place the data to a temporary table with the following code:

In [ ]:


Now, you can use normal SQL, with sql magic (%%sql), to perform queries on the table entries. For example:

In [ ]:

%%sql
SELECT *
FROM entries
WHERE country == "CH"

Good! Let's get to work. A few last things:

- This week, you should not have issues with the output being too long, since sql magic limits its size automatically.
- Remember to delete the cluster if you want to stop working! You can recreate it using the same container name and your resources will still be there.

And now to the actual queries:

1. Find all games such that the guessed language is correct (=target), and such that this language is Russian.

In [ ]:

from datetime import datetime


# Started working:print(

In [ ]:

%%sql
SELECT *
FROM entries
WHERE target == guess AND target == "Russian"

2. List all chosen answers to games where the guessed language is correct (=target).

In [ ]:

# Started working:print(

In [ ]:

%%sql
SELECT guess
FROM entries
WHERE target == guess

3. Find all distinct values of languages (the target field).

In [ ]:

# Started working:print(

In [ ]:

%%sql
SELECT DISTINCT(target)
FROM entries

4. Return the top three games where the guessed language is correct (=target) ordered by language (ascending), then country (ascending), then date (ascending).

In [ ]:

# Started working:print(

In [ ]:

%%sql
SELECT *
FROM entries
WHERE guess == target
ORDER BY target, country, date ASC
LIMIT 3

5. Aggregate all games by country and target language, counting the number of guesses for each pair (country, target).

In [ ]:

# Started working:print(

In [ ]:

%%sql
SELECT country, target, COUNT(*) AS count
FROM entries
GROUP BY country, target

6. Find the overall percentage of correct guesses when the first answer (amongst the array of possible answers) was the correct one.

In [ ]:


# Started working:print(

In [ ]:

%%sql
SELECT (
    SELECT float(COUNT(*)) AS first_choice_count
    FROM entries
    WHERE target == guess AND target == choices[0]
) / (
    SELECT float(COUNT(*)) AS correct_count
    FROM entries
    WHERE target == guess
) AS percentage

7. Sort the languages by increasing overall percentage of correct guesses.

In [ ]:

# Started working:print(

In [ ]:

%%sql
SELECT target, float(correct_count) / float(overall_count) AS percentage
FROM (
    SELECT target, COUNT(*) AS correct_count
    FROM entries
    WHERE target == guess
    GROUP BY target
)
NATURAL JOIN (
    SELECT target, COUNT(*) AS overall_count
    FROM entries
    GROUP BY target
)
ORDER BY percentage ASC
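One property of the join-based query is worth checking: NATURAL JOIN is an inner join, so a language that is never guessed correctly has no row in the first subquery and drops out of the result. The sketch below illustrates this with Python's built-in sqlite3 module as a stand-in for Spark SQL (an assumption made so that it runs without a cluster); the rows are invented, and only the column names target and guess come from the dataset. It compares the join with a conditional aggregation, which is standard SQL and should carry over to Spark SQL.

```python
import sqlite3

# Hypothetical (target, guess) pairs; the real rows come from confusion.json.
rows = [
    ("French", "French"), ("French", "Italian"),
    ("Russian", "Russian"), ("Russian", "Russian"),
    ("Maori", "Samoan"),  # a language that is never guessed correctly
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (target TEXT, guess TEXT)")
conn.executemany("INSERT INTO entries VALUES (?, ?)", rows)

# Same shape as the join-based query: inner join of correct and overall counts.
join_query = """
SELECT target, CAST(correct_count AS REAL) / overall_count AS percentage
FROM (SELECT target, COUNT(*) AS correct_count
      FROM entries WHERE target = guess GROUP BY target) AS c
NATURAL JOIN
     (SELECT target, COUNT(*) AS overall_count
      FROM entries GROUP BY target) AS o
ORDER BY percentage ASC
"""

# Conditional aggregation: one pass over the table, no join.
avg_query = """
SELECT target,
       AVG(CASE WHEN target = guess THEN 1.0 ELSE 0.0 END) AS percentage
FROM entries
GROUP BY target
ORDER BY percentage ASC
"""

join_result = conn.execute(join_query).fetchall()
avg_result = conn.execute(avg_query).fetchall()
print(join_result)  # Maori is missing
print(avg_result)   # Maori appears with 0.0
```

Whether languages with a zero success rate should appear is a modelling decision; the point is only that the two formulations differ on exactly those rows.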

8. Group the games by the index of the correct answer in the choices array and output all counts.

The following code snippet will create a user-defined SQL function, which you can use in your SQL queries. You may call it in your queries as array_position(x, y), where x is an array (for example an entry of the column choices) and y is the value whose position (index) you want to find in the array.

In [ ]:

spark.udf.register("array_position", lambda x,y: x.index(y))
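The registered lambda relies on Python's list.index, which returns a 0-based position and raises ValueError when the value is not in the list. The following plain-Python sketch (no Spark involved) shows that behaviour and a defensive variant that returns -1 instead; the function is a hypothetical helper and the games are invented, with only the field names target and choices taken from the dataset.

```python
from collections import Counter


def array_position(choices, target):
    """Return the 0-based index of target in choices, or -1 if it is absent."""
    try:
        return choices.index(target)
    except ValueError:
        # list.index raises ValueError when the value is not in the list.
        return -1


# Hypothetical games; values are invented for illustration.
games = [
    {"target": "French", "choices": ["French", "Italian"]},
    {"target": "Russian", "choices": ["Polish", "Russian", "Czech"]},
    {"target": "Maori", "choices": ["Maori", "Samoan"]},
    {"target": "Dutch", "choices": ["German", "Danish"]},  # target not offered
]

# Group the games by the index of the correct answer and count each group.
index_counts = Counter(array_position(g["choices"], g["target"]) for g in games)
print(sorted(index_counts.items()))
```

If every game in the dataset lists its target among the choices, the defensive branch never fires and the variant behaves like the original lambda.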

In [ ]:

# Started working:print(

In [ ]:

%%sql
SELECT index, COUNT(index) AS count
FROM (
    SELECT array_position(choices, target) AS index
    FROM entries
)
GROUP BY index

9. What is the language of the sample that has the highest successful guess rate?

In [ ]:

# Started working:print(


In [ ]:

%%sql
-- If you interpreted "sample" as the whole dataset
SELECT target, float(correct_count) / float(overall_count) AS percentage
FROM (
    SELECT target, COUNT(*) AS correct_count
    FROM entries
    WHERE target == guess
    GROUP BY target
)
NATURAL JOIN (
    SELECT target, COUNT(*) AS overall_count
    FROM entries
    GROUP BY target
)
ORDER BY percentage DESC
LIMIT 1

In [ ]:

%%sql
-- If you interpreted "sample" as the sample field/attribute
SELECT sample, target, float(correct_count) / float(overall_count) AS percentage
FROM (
    SELECT sample, target, COUNT(*) AS correct_count
    FROM entries
    WHERE target == guess
    GROUP BY sample, target
)
NATURAL JOIN (
    SELECT sample, target, COUNT(*) AS overall_count
    FROM entries
    GROUP BY sample, target
)
ORDER BY percentage DESC
LIMIT 1

10. Return all games played on the latest day.

In [ ]:

# Started working:print(

In [ ]:

%%sql
SELECT *
FROM entries
WHERE date == (
    SELECT MAX(date) AS last_date
    FROM entries
)

2. Document stores

A record in a document store is a document. Document encoding schemes include XML, YAML, JSON, and BSON, as well as binary forms like PDF and Microsoft Office documents (MS Word, Excel, and so on). MongoDB documents are similar to JSON objects. Documents are composed of field-value pairs and have the following structure:

The values of fields may include other documents, arrays, and arrays of documents. Data in MongoDB has a flexible schema: documents in the same collection do not need to have the same set of fields or structure, and common fields in a collection's documents may hold different types of data.

2.1 General Questions

1. What are advantages of document stores over relational databases?
2. Can the data in document stores be normalized?
3. How does denormalization affect performance?

Solution

1) Flexibility. Not every record needs to store the same properties. New properties can be added on the fly (flexible schema).

2) Yes. References can be used for data normalization.

3) All data for an object is stored in a single record. In general, this provides better performance for read operations (since expensive joins can be omitted), as well as the ability to request and retrieve related data in a single database operation. In addition, embedded data models make it possible to update related data in a single atomic write operation.
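The difference between embedding and referencing can be sketched with plain Python dictionaries standing in for documents. The embedded values below mirror the "Vella" restaurant document used later in this sheet; the referenced layout (a separate grades collection pointing back via restaurant_id) and the extra grade row are hypothetical, chosen only to show what a normalized design could look like.

```python
# Embedded (denormalized): grades live inside the restaurant document.
embedded = {
    "restaurant_id": "41704620",
    "name": "Vella",
    "grades": [{"grade": "A", "score": 11}, {"grade": "A", "score": 17}],
}

# Referenced (normalized): grades sit in their own collection and point back
# to the restaurant. This layout is an assumption made for illustration.
restaurants = [{"restaurant_id": "41704620", "name": "Vella"}]
grades = [
    {"restaurant_id": "41704620", "grade": "A", "score": 11},
    {"restaurant_id": "41704620", "grade": "A", "score": 17},
    {"restaurant_id": "99999999", "grade": "B", "score": 20},
]


def scores_embedded(doc):
    # One document read returns everything.
    return [g["score"] for g in doc["grades"]]


def scores_referenced(restaurant_id):
    # A second pass over another collection plays the role of a join.
    return [g["score"] for g in grades if g["restaurant_id"] == restaurant_id]


print(scores_embedded(embedded))
print(scores_referenced(restaurants[0]["restaurant_id"]))
```

Both layouts return the same scores; the embedded one gets them from a single record, while the referenced one avoids duplicating data at the price of a second lookup.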

2.2 True/False Questions

Say if the following statements are true or false.

1. Document stores expose only a key-value interface.
2. Different relationships between data can be represented by references and embedded documents.
3. MongoDB does not support schema validation.
4. MongoDB stores documents in the XML format.
5. In document stores, you must determine and declare a table's schema before inserting data.
6. MongoDB performance degrades when the number of documents increases.
7. Document stores are column stores with flexible schema.
8. There are no joins in MongoDB.


Solution

1. (False) Document stores expose only a key-value interface.
2. (True) Different relationships between data can be represented by references and embedded documents.
3. (False) MongoDB does not support schema validation.
4. (False) MongoDB stores documents in the XML format.
5. (False) In document stores, you must determine and declare a table's schema before inserting data.
6. (True) MongoDB performance degrades when the number of documents increases.
7. (False) Document stores are column stores with flexible schema.
8. (True) There are no joins in MongoDB. Nonetheless, starting in version 3.2, MongoDB supports aggregations with the "$lookup" operator, which can perform a LEFT OUTER JOIN.

3. MongoDB

3.1 Install MongoDB

MongoDB is an open-source document database. The next step is to install it on your local machine. For that, you can follow the general instructions on the MongoDB web page (Install MongoDB).

The short instructions for Ubuntu are outlined here:

1. Import the public key used by the package management system.

sudo apt-key adv --keyserver hkp:// --recv


2. Create the /etc/apt/sources.list.d/mongodb-org-3.6.list list file using the command appropriate for your version of Ubuntu:

Ubuntu 14.04

echo "deb [ arch=amd64 ] trusty/mongodb-org/3.6 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-3.6.list

Ubuntu 16.04

echo "deb [ arch=amd64,arm64 ] xenial/mongodb-org/3.6 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-3.6.list

3. Install the MongoDB packages.

sudo apt-get update

sudo apt-get install -y mongodb-org

To run MongoDB, execute the following command:

sudo service mongod start

To stop MongoDB, execute the following command:

sudo service mongod stop

3.2 Import the dataset

Retrieve the "restaurants" dataset from

Use mongoimport to insert the documents into the restaurants collection in the test database. If the collection already exists in the test database, the operation will drop the restaurants collection first.

mongoimport --db test --collection restaurants --drop --file ./primer-dataset.json

3.3 Mongo shell

The mongo shell is an interactive JavaScript interface to MongoDB. You can use the mongo shell to query and update data as well as to perform administrative operations.

To start mongo use:

mongo --shell

In the mongo shell connected to a running MongoDB instance, switch to the test database:

use test

Try to insert a document into the restaurants collection. In addition, you can see the structure of the documents in the collection.

db.restaurants.insert(
   {
      "address" : {
         "street" : "2 Avenue",
         "zipcode" : "10075",
         "building" : "1480",
         "coord" : [ -73.9557413, 40.7720266 ]
      },
      "borough" : "Manhattan",
      "cuisine" : "Italian",
      "grades" : [
         {
            "date" : ISODate("2014-10-01T00:00:00Z"),
            "grade" : "A",
            "score" : 11
         },
         {
            "date" : ISODate("2014-01-16T00:00:00Z"),
            "grade" : "A",
            "score" : 17
         }
      ],
      "name" : "Vella",
      "restaurant_id" : "41704620"
   }
)

Query all documents in a collection:

db.restaurants.find()

Query one document in a collection:

db.restaurants.findOne()

To format the printed result, you can add .pretty() to the operation, as in the following:

db.restaurants.find().pretty()


Query Documents

For the db.collection.find() method, you can specify the following optional fields:

- a query filter to specify which documents to return,
- a query projection to specify which fields from the matching documents to return (the projection limits the amount of data that MongoDB returns to the client over the network),
- optionally, a cursor modifier to impose limits, skips, and sort orders.
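How the three parts combine can be imitated in plain Python on a list of dictionaries. This is only a sketch under simplifying assumptions: the find function below is a hypothetical helper (not a MongoDB or driver API), it supports equality filters on top-level fields only, and it does not add the _id field that MongoDB returns by default. The restaurant documents are invented.

```python
def find(docs, query_filter=None, projection=None, sort_key=None, limit=None):
    """Plain-Python imitation of filter -> sort -> limit -> projection."""
    query_filter = query_filter or {}
    # Filter: keep documents whose fields equal every value in the filter.
    matched = [d for d in docs
               if all(d.get(k) == v for k, v in query_filter.items())]
    # Cursor modifiers: sort first, then cut the result down.
    if sort_key is not None:
        matched.sort(key=lambda d: d[sort_key])
    if limit is not None:
        matched = matched[:limit]
    # Projection: return only the requested fields.
    if projection:
        matched = [{k: d[k] for k in projection if k in d} for d in matched]
    return matched


# Hypothetical documents, loosely shaped like the restaurants collection.
docs = [
    {"name": "Zed's", "borough": "Brooklyn", "cuisine": "Hamburgers"},
    {"name": "Alma", "borough": "Brooklyn", "cuisine": "Italian"},
    {"name": "Vella", "borough": "Manhattan", "cuisine": "Italian"},
    {"name": "Mona", "borough": "Brooklyn", "cuisine": "Hamburgers"},
]

result = find(docs, {"borough": "Brooklyn"}, ["name"], sort_key="name", limit=2)
print(result)
```

The call corresponds roughly to a find with a borough filter and a name projection, followed by a sort on name and a limit of 2.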

3.4 Questions

Write queries in MongoDB that return the following:

1. All restaurants in borough (a town) "Brooklyn" and cuisine (a style of cooking) "Hamburgers".
2. The number of restaurants in the borough "Brooklyn" and cuisine "Hamburgers".
3. All restaurants with zipcode 11225.
4. Names of restaurants with zipcode 11225 that have at least one grade "C".
5. Names of restaurants with zipcode 11225 that have as first grade "C" and as second grade "A".
6. Names and streets of restaurants that don't have an "A" grade.
7. All restaurants with a grade C and a score greater than 50.
8. All restaurants with a grade C or a score greater than 50.
9. All restaurants that have only A grades.

You can read more about MongoDB here:

3.4 Solution

1.
db.restaurants.find({"borough" : "Brooklyn", "cuisine" : "Hamburgers" })

2.
db.restaurants.find({"borough" : "Brooklyn", "cuisine" : "Hamburgers" }).count()

3.
db.restaurants.find({"address.zipcode" : "11225" })

4.
db.restaurants.find({"address.zipcode" : "11225" , "grades.grade" : "C" } , {"name" : 1 })

5.
db.restaurants.find({"address.zipcode" : "11225" , "grades.0.grade" : "C", "grades.1.grade" : "A" }, {"name" : 1 })

6.
db.restaurants.find({"grades.grade" : { $ne : "A"}} , {"name" : 1 , "address.street": 1})

7.
db.restaurants.find({grades : {$elemMatch : {grade : "C", score : {$gt : 50}}}})

8.
db.restaurants.find({$or: [{"grades.score" : {$gt : 50}}, { "grades.grade" : "C"}]})

9.
db.restaurants.find({ "grades": {"$not": { "$elemMatch": {"grade" :{$ne : "A" }}}}})
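The queries on the grades array are easy to confuse, so here is a plain-Python sketch of what each predicate means when applied to one document. The functions are hypothetical stand-ins for the MongoDB operators (named in the comments), and the four documents are invented. Note in particular that $elemMatch requires a single array element to satisfy all conditions, whereas separate dotted-field conditions may be satisfied by different elements.

```python
def grades_of(doc):
    return doc.get("grades", [])


def elem_match_c_over_50(doc):
    # {grades: {$elemMatch: {grade: "C", score: {$gt: 50}}}}
    return any(g["grade"] == "C" and g["score"] > 50 for g in grades_of(doc))


def dotted_c_and_over_50(doc):
    # {"grades.grade": "C", "grades.score": {$gt: 50}}
    gs = grades_of(doc)
    return any(g["grade"] == "C" for g in gs) and any(g["score"] > 50 for g in gs)


def no_a_grade(doc):
    # {"grades.grade": {$ne: "A"}}: no element has grade "A"
    return all(g["grade"] != "A" for g in grades_of(doc))


def only_a_grades(doc):
    # {grades: {$not: {$elemMatch: {grade: {$ne: "A"}}}}}
    return not any(g["grade"] != "A" for g in grades_of(doc))


# Hypothetical documents covering the interesting cases.
docs = [
    {"name": "r1", "grades": [{"grade": "C", "score": 60}]},
    {"name": "r2", "grades": [{"grade": "C", "score": 10},
                              {"grade": "A", "score": 70}]},
    {"name": "r3", "grades": [{"grade": "A", "score": 5},
                              {"grade": "A", "score": 9}]},
    {"name": "r4", "grades": []},
]


def names(pred):
    return [d["name"] for d in docs if pred(d)]


print(names(elem_match_c_over_50))
print(names(dotted_c_and_over_50))
print(names(no_a_grade))
print(names(only_a_grades))
```

Document r2 separates the two "C and score above 50" variants, and the empty grades array in r4 satisfies both the "no A grade" and the "only A grades" predicates, which may or may not be what a question intends.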

4. Indexing in MongoDB

Indexes support the efficient resolution of queries. Without indexes, MongoDB must scan every document of a collection to select those documents that match the query statement. Such a scan can be highly inefficient and require MongoDB to process a large volume of data.

Indexes are special data structures that store a small portion of the data set in an easy-to-traverse form. The index stores the value of a specific field or set of fields, ordered by the value of the field as specified in the index.
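The idea of an ordered index can be sketched in a few lines of Python. This is a simplified model, not MongoDB's implementation: a sorted list of (field value, position) pairs with binary search stands in for the real index structure, and the documents are invented.

```python
import bisect

# Hypothetical collection; only the borough field matters here.
docs = [
    {"_id": 0, "borough": "Queens"},
    {"_id": 1, "borough": "Brooklyn"},
    {"_id": 2, "borough": "Manhattan"},
    {"_id": 3, "borough": "Brooklyn"},
    {"_id": 4, "borough": "Bronx"},
]


def collection_scan(value):
    # Without an index every document has to be inspected.
    return [d["_id"] for d in docs if d["borough"] == value]


# Ascending "index": (field value, position in docs), kept sorted by value.
index = sorted((d["borough"], pos) for pos, d in enumerate(docs))


def index_scan(value):
    # Binary search finds the contiguous run of entries with this value.
    lo = bisect.bisect_left(index, (value,))
    hi = bisect.bisect_right(index, (value, float("inf")))
    return [docs[pos]["_id"] for _, pos in index[lo:hi]]


print(collection_scan("Brooklyn"))
print(index_scan("Brooklyn"))
```

Both functions return the same documents; the indexed lookup touches only the matching run of entries, and because the entries are stored in order, the same structure can also return documents sorted by borough without an extra sort.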

MongoDB supports single-field indexes and compound indexes over multiple fields; which one is appropriate depends on the operations the index should support.

By default, MongoDB creates the _id index, which is an ascending unique index on the _id field, for all collections when the collection is created. You cannot remove the index on the _id field.

Managing indexes in MongoDB

The explain() method provides information on the query plan. It returns a document that describes the process and indexes used to return the query results. This may provide useful insight when attempting to optimize a query.

db.restaurants.find({"borough" : "Brooklyn"}).explain()

In the mongo shell, you can create an index by calling the createIndex() method.

db.restaurants.createIndex({"borough" : 1})

Now you can retrieve a new query plan for the indexed data.

db.restaurants.find({"borough" : "Brooklyn"}).explain()

The value of the field in the index specification describes the kind of index for that field. For example, a value of 1 specifies an index that orders items in ascending order. A value of -1 specifies an index that orders items in descending order.

To remove all indexes, you can use db.collection.dropIndexes() . To remove a specific index you can use db.collection.dropIndex() , such as db.restaurants.dropIndex({ borough : 1 }) .

4.1 Questions

Please answer questions 1 and 2 in Moodle.

1) Which queries will use the following index:

db.restaurants.createIndex({"borough" : 1})

A. db.restaurants.find({"" : "Boston"})
B. db.restaurants.find({}, {"borough" : 1})
C. db.restaurants.find().sort({"borough" : 1})
D. db.restaurants.find({"cuisine" : "Italian" }, {"borough" : 1})

2) Which queries will use the following index:

db.restaurants.createIndex({"address" : -1})

A. db.restaurants.find({"address.zipcode" : "11225"})
B. db.restaurants.find({"" : "Boston"})
C. db.restaurants.find({"" : "Boston"}, {"address" : 1 })
D. db.restaurants.find({"address" : 1 })

3) Write a command for creating an index on the "zipcode" field.

4) Write an index to speed up the following query:

db.restaurants.find({"grades.grade" : { $ne : "A"}}, {"name" : 1 , "address.street": 1})

5) Write an index to speed up the following query:

db.restaurants.find({"grades.score" : {$gt : 50}, "grades.grade" : "C"})

Solution

1) Only query C would benefit from the index.

2) Only query D would benefit from the index.

3) db.restaurants.createIndex({"address.zipcode" : 1 })

4) Just db.restaurants.createIndex({"grades.grade": 1}), since {"name" : 1 , "address.street": 1} is a projection.

5) db.restaurants.createIndex({"grades.score": 1 , "grades.grade": 1})


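Because the order of fields inside a query document does not affect index selection, the same index also serves the equivalent query: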

db.restaurants.find({"grades.grade" : "C", "grades.score" : {$gt : 50}})
