Avalanche By: Matthew Levandowski, Travis Fisher, Erik Vavro, Eric Nelson, Jonathan Hoatlin
Page 1: AvalancheProject2012

Avalanche By: Matthew Levandowski, Travis Fisher, Erik Vavro, Eric Nelson, Jonathan Hoatlin

Page 2: AvalancheProject2012

Ideas & Definitions

Workbench / Interface

- A sandbox environment for developing workflows that can later be used in implementations (e.g., our beer restaurant). The workbench acts as a secure entry point to the remote framework (RESTful cloud service).

Block

- A single step of data manipulation. Blocks are commonly chained together, and each block usually depends on the output of the blocks that precede it. Blocks can accept data from MongoDB, the UI, or other blocks. Blocks inherit the behavior of Celery tasks.

Connection

- An identifying route between a source block and target block.

Group Block

- A block that encapsulates sub-blocks, used to provide a basic sense of hierarchy. Group blocks do not perform any data manipulation themselves; they simply forward incoming data to their sub-blocks.

Workflow

- A user-owned collection of blocks and/or group blocks, plus the connections between them, described by a JSON schema. Workflows are built in the UI (workbench), displayed as a graph with Graphiti, and passed as JSON to the remote framework, which converts them into an executable sequence of blocks (Celery tasks).
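
As a rough illustration of the workflow definition above, a workflow document and its conversion into a dependency map might look like the sketch below. The slides do not show the actual schema, so every field name here (`owner`, `blocks`, `connections`, `source`, `target`, `type`, `params`) and the helper `parse_workflow` are assumptions made for illustration only.

```python
import json

# Hypothetical workflow document; all field names are illustrative assumptions.
WORKFLOW_JSON = """
{
  "owner": "demo_user",
  "blocks": [
    {"id": "b0", "type": "load_dataset", "params": {"dataset": "orders"}},
    {"id": "b1", "type": "mean", "params": {"field": "price"}}
  ],
  "connections": [
    {"source": "b0", "target": "b1"}
  ]
}
"""


def parse_workflow(text):
    """Return (blocks by id, dependency map) for a workflow document."""
    doc = json.loads(text)
    blocks = {block["id"]: block for block in doc["blocks"]}
    # Every block starts with no dependencies; connections add them.
    depends_on = {block_id: set() for block_id in blocks}
    for conn in doc["connections"]:
        if conn["source"] not in blocks or conn["target"] not in blocks:
            raise ValueError(f"connection references unknown block: {conn}")
        depends_on[conn["target"]].add(conn["source"])
    return blocks, depends_on


blocks, depends_on = parse_workflow(WORKFLOW_JSON)
print(depends_on)
```

Keeping connections as a separate list of source/target pairs mirrors the Connection definition above and makes it straightforward to derive which blocks each block must wait for.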

Page 3: AvalancheProject2012

Ideas & Definitions (cont.)

Framework

- A RESTful cloud-based framework that mines data, serializes workflows, and performs various statistical/analytical tasks (powered by Celery).

Celery

- An asynchronous task queue/job queue library based on distributed message passing.

Celery Worker

- An external process, connected to the MongoDB database, that executes tasks from the task queue and returns results to other tasks or to the main workflow task.

Task

- A unit of execution in Celery. Blocks inherit from Task so that they can be run in Celery.

Page 4: AvalancheProject2012

Workbench/Features

• Administrator page allows the user to create workflows

• Each block has metadata so that the front end knows which connections and parameters the block needs

• After the user creates a block, a dynamic form is generated to collect its parameters

• Restaurant allows the user to generate data by ordering beers and wines

• History of Results

• Upload Datasets

General Use Case

1. User logs in and creates a new dataset upload (the server parses it as JSON)

2. Dataset file is uploaded to the server, which generates a unique filename

3. User creates a new block: the front end requests the block's parameters and builds a form

4. Form and data are validated and the new block is created

5. Before saving, the block model generates a unique block id and is added to the Graphiti canvas

6. The block model JSON is saved to the workflow field

7. User clicks the ‘Run’ button, which serializes the blocks and workflow and sends them to the backend
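
Steps 2 through 5 can be sketched in plain Python as follows. This is only a minimal illustration, not the project's Django code: the metadata layout in `BLOCK_METADATA` and the helpers `validate_params`, `new_block`, and `unique_filename` are hypothetical names, and the use of `uuid` for unique ids and filenames is an assumption.

```python
import uuid

# Hypothetical metadata for one block type; the real format is not shown in the slides.
BLOCK_METADATA = {
    "k_means": {
        "inputs": 1,
        "params": {
            "k": {"type": int, "required": True},
            "max_iterations": {"type": int, "required": False, "default": 100},
        },
    }
}


def validate_params(block_type, raw):
    """Coerce and validate user-submitted form values against the metadata."""
    spec = BLOCK_METADATA[block_type]["params"]
    cleaned = {}
    for name, rule in spec.items():
        if name in raw and raw[name] != "":
            cleaned[name] = rule["type"](raw[name])  # raises ValueError on bad input
        elif rule["required"]:
            raise ValueError(f"missing required parameter: {name}")
        else:
            cleaned[name] = rule.get("default")
    return cleaned


def new_block(block_type, raw):
    """Create a block model with a unique id, ready to add to the canvas."""
    return {
        "id": uuid.uuid4().hex,
        "type": block_type,
        "params": validate_params(block_type, raw),
    }


def unique_filename(original_name):
    """Prefix an uploaded file's name so that uploads do not collide."""
    return f"{uuid.uuid4().hex}_{original_name}"


block = new_block("k_means", {"k": "3"})
print(block["params"])
```

Because the form fields are derived from the metadata, adding a new block type in this sketch only requires a new metadata entry rather than a new hand-written form.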

Page 5: AvalancheProject2012

Framework/Features

• Uses Celery, a distributed task queue that runs tasks concurrently across workers, which increases performance

• MongoDB is a flexible, schema-free, BSON-based document database (NoSQL)

• Parses workflows into blocks and creates Celery tasks for them

Concepts and Paradigms

• Distributed, message-based computing

• Metadata-based

• Choose between duck and static typing

• Data confidence

• Scalability

• Modularity

• Cloud-based RESTful service

General Use Case

1. Workflow JSON is sent to the backend to be executed

2. Backend parses the workflow data and creates an executable sequence of blocks

3. Celery automagically handles and optimizes block queueing and saves results into MongoDB

4. Backend returns the ids of the results to the frontend

5. Frontend accesses the MongoDB API to get the result data and parses it into a visually pleasing format

6. Django displays views of the results using the Highcharts JavaScript library

Page 6: AvalancheProject2012

Example Workflow

Page 7: AvalancheProject2012

Celery Constructs

• Chain

• Chord
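
The two constructs named on this slide can be modeled in plain Python to show what each one does. This is an analogy only and does not use the Celery API: `run_chain` and `run_chord` are hypothetical helpers. In Celery, a chain passes each task's result to the next task, and a chord runs a group of tasks in parallel and then calls a callback with the list of their results.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def run_chain(initial, *steps):
    """Chain: run steps in order, feeding each result into the next step."""
    return reduce(lambda value, step: step(value), steps, initial)


def run_chord(header, callback):
    """Chord: run the header tasks in parallel, then pass all results to callback."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(task) for task in header]
        results = [future.result() for future in futures]  # keeps header order
    return callback(results)


chained = run_chain(2, lambda x: x + 3, lambda x: x * 10)
chorded = run_chord([lambda: 1, lambda: 2, lambda: 3], sum)
print(chained, chorded)
```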

Page 8: AvalancheProject2012

What we need

• Common Dependencies

• Multiple Inputs

Page 9: AvalancheProject2012

Solution:

Parallel Topological Sort

Page 10: AvalancheProject2012

Parallel Topological Sort

Blocks without dependencies are started

Page 11: AvalancheProject2012

Parallel Topological Sort

b0 finishes, b3 is started

Page 12: AvalancheProject2012

Parallel Topological Sort

b1 finishes, b2 and b4 are started

Page 13: AvalancheProject2012

Parallel Topological Sort

b2, b3, b4 finish, b5 is started

Page 14: AvalancheProject2012

Parallel Topological Sort

b5 finishes

• Result ids are returned when all blocks finish

• The data stays in mongo
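
The walkthrough above can be sketched with a thread pool standing in for Celery workers. The dependency map is inferred from the order in which the slides start each block (b3 after b0; b2 and b4 after b1; b5 after b2, b3, and b4). The names `DEPENDS_ON`, `run_block`, and `run_workflow` are hypothetical, and the returned strings are placeholders for the result ids that would point at data kept in MongoDB.

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

# Dependencies inferred from the walkthrough: block -> blocks it waits on.
DEPENDS_ON = {
    "b0": set(),
    "b1": set(),
    "b2": {"b1"},
    "b3": {"b0"},
    "b4": {"b1"},
    "b5": {"b2", "b3", "b4"},
}


def run_block(block_id):
    """Stand-in for a block's work; returns a hypothetical result id."""
    return f"result-{block_id}"


def run_workflow(depends_on, run=run_block, max_workers=4):
    """Start each block as soon as all of its dependencies have finished."""
    remaining = {block: set(deps) for block, deps in depends_on.items()}
    result_ids = {}
    running = {}  # future -> block id

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while remaining or running:
            # Submit every block whose dependencies are all satisfied.
            ready = [b for b, deps in remaining.items() if not deps]
            for block in ready:
                del remaining[block]
                running[pool.submit(run, block)] = block
            if not running:
                raise ValueError("cycle detected: no block can be started")
            # Wait for at least one running block, then release its dependents.
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                block = running.pop(future)
                result_ids[block] = future.result()
                for deps in remaining.values():
                    deps.discard(block)
    return result_ids


result_ids = run_workflow(DEPENDS_ON)
print(sorted(result_ids))
```

Unlike a plain chain or chord, this scheduling handles both a block shared by several dependents (b1 feeds b2 and b4) and a block with several inputs (b5), which are the two needs listed earlier.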

Page 15: AvalancheProject2012

Framework/Algorithms

• Basic Statistics

o Mean, Median, Mode

o Standard Deviation

o Variance

o Maximum, Minimum

• Set Theory

o Union

o Intersection

o Difference

o Sorting

• Apriori Algorithm

• K-Means Clustering

• Outlier Detection (Density-Based Clustering)
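
The simpler entries in this list can be illustrated with the Python standard library alone. This is a sketch on made-up sample data, not the project's implementation (which, going by its technology list, likely relied on NumPy and SciPy); the variable names and the choice of population rather than sample statistics are assumptions.

```python
import statistics

# Hypothetical sample data: drink prices from two restaurant orders.
order_a = [4.5, 6.0, 6.0, 7.5, 9.0]
order_b = [6.0, 7.5, 12.0]

summary = {
    "mean": statistics.mean(order_a),
    "median": statistics.median(order_a),
    "mode": statistics.mode(order_a),
    "stdev": statistics.pstdev(order_a),        # population standard deviation
    "variance": statistics.pvariance(order_a),  # population variance
    "max": max(order_a),
    "min": min(order_a),
}

# Set operations on the distinct prices; sorted() covers the sorting entry.
set_a, set_b = set(order_a), set(order_b)
set_results = {
    "union": sorted(set_a | set_b),
    "intersection": sorted(set_a & set_b),
    "difference": sorted(set_a - set_b),
}

print(summary)
print(set_results)
```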

Page 16: AvalancheProject2012

Demo

Page 17: AvalancheProject2012

Workbench Technology

• Django - Python-based web framework

• jQuery - multi-browser JavaScript library designed to simplify client-side scripting of HTML, with Ajax support

• Twitter Bootstrap Framework - HTML- and CSS-based design templates for typography, forms, buttons, charts, navigation, and other interface components, as well as optional JavaScript extensions

• Gargoyle - toggleable feature switches for the administrator interface

• HTML5 Canvas - dynamic, scriptable rendering of 2D shapes and bitmap images

Problems Encountered?

• HTML5 Canvas GUI front end does not work correctly in all browsers

• Getting jQuery UI drag and drop to work with Django

• Django has a steep learning curve

Page 18: AvalancheProject2012

Framework Technology

• Celery

• MongoDB

• NumPy

• SciPy

• scikit-learn

• Flask

Problems Encountered?

• Celery has a steep initial learning curve

• Spent a lot of time revising the structure of workflows and blocks

• Machine learning algorithms are difficult

• Coordination of data formats was difficult to address between the front and back end