CSE 662 – Languages and Databases Project Overview Part 1 Oliver Kennedy & Lukasz Ziarek
Jan 17, 2016

Jacob Holt
CSE 662 – Languages and Databases

Project Overview Part 1

Oliver Kennedy & Lukasz Ziarek

Announcements

• Scientista Brunch

• Today: 12 – 1, 113A Davis Hall

• The Scientista Foundation is a national organization that empowers pre-professional women in STEM.

Types of Projects

• Benchmarking and Evaluation

• Replication of Research Results

• Building of Specialized Databases

• Research Oriented

Checkpoint Expectations

• 10+ page report

• Design / algorithms

• Limitations

• Work plan

• Related work

• Current state of the implementation brief

• Code / scripts written thus far

• Preliminary numbers

Final Project Expectations

• 15+ page report

• Implementation and Documentation

• Unit tests and benchmarks

• Build scripts

• Automated evaluation scripts

• Automated plotting scripts

Embedded Database Benchmark

• Lightweight, embedded databases are increasingly being used to store structured application state.  Commonly used embedded databases include SQLite, BerkeleyDB, Derby, HSQLDB, and H2.  Each database engine targets different workload/usage patterns.  In this project, you will design a micro-benchmark workload and then perform a comparison between several embedded database engines, and analyze the results in a detailed report.

Embedded Database Benchmark

• Languages Required: Java, some C/C++, a language like Python or R

• First Steps: Install all 5 embedded databases and evaluate them using YCSB's 6 default workloads (workloads A-F)

• Expected Outcomes: (1) A report detailing the strengths and weaknesses of each system evaluated.  (2) The software used to perform the benchmarks.

• Preliminary Questions: What types of workloads do you expect embedded databases to be used on?  What kinds of access patterns do these workloads create?  How would you evaluate each database's performance on these access patterns?  
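
As a rough illustration of what a micro-benchmark might look like, the sketch below times a bulk-insert phase and a random point-lookup phase against an in-memory SQLite database using Python's built-in sqlite3 module. The table name `kv`, the function name `run_microbenchmark`, and the workload sizes are assumptions made for this example, not part of the project description; a real comparison would run the same workload against each engine and repeat the measurements.

```python
import random
import sqlite3
import time


def run_microbenchmark(num_rows=1000, num_reads=1000, seed=0):
    """Time bulk inserts and random point lookups on in-memory SQLite.

    The schema and sizes are hypothetical; they only illustrate the shape
    of a write-phase / read-phase micro-benchmark.
    """
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE kv (k INTEGER PRIMARY KEY, v TEXT)")

    # Write phase: insert num_rows records inside one transaction.
    start = time.perf_counter()
    with conn:
        conn.executemany(
            "INSERT INTO kv (k, v) VALUES (?, ?)",
            ((i, f"value-{i}") for i in range(num_rows)),
        )
    insert_seconds = time.perf_counter() - start

    # Read phase: uniformly random point lookups by primary key.
    start = time.perf_counter()
    hits = 0
    for _ in range(num_reads):
        key = rng.randrange(num_rows)
        row = conn.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        if row is not None:
            hits += 1
    read_seconds = time.perf_counter() - start

    conn.close()
    return {
        "insert_seconds": insert_seconds,
        "read_seconds": read_seconds,
        "hits": hits,
    }


if __name__ == "__main__":
    print(run_microbenchmark())
```

Uniform random keys are only one possible access pattern; skewed or sequential key choices would exercise the engines differently.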

PocketData Benchmark

• Lightweight, embedded databases are increasingly being used to store structured application state, especially on mobile devices.  We have permission to analyze traces of (anonymized) query logs from PhoneLab phones.  In this project, you will use these traces as the basis for a macro-benchmark workload- and data-generator that simulates query patterns on a mobile phone.  

PocketData Benchmark

• Languages Required: Probably a scripting language like Ruby/Python

• First Steps: Pick 3-4 high-volume applications in the 11-phone PhoneLab trace, label each of the queries issued by each application by the query's intent, and define a query "pattern" for each intent.

• Expected Outcomes: (1) A workload- and data-generator simulating the behavior of a mobile app's query workload.  (2) A report outlining the design of the data and workload generators, and a lightweight evaluation of one or more database engines on this workload.

• Preliminary Questions: What are features of the query workload that need to be emulated?  What constitutes a "realistic" simulation (i.e., how do you judge whether your generators are successful or not)?
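
One way to picture a pattern-based workload generator is sketched below: each query intent gets a SQL template and a relative weight, and the generator samples intents by weight and fills in parameters. The intents, templates, table names, and weights here are invented for illustration; in the project they would be derived from the labeled PhoneLab traces.

```python
import random

# Hypothetical query patterns: (intent label, SQL template, relative weight).
# None of these come from the PhoneLab trace; they are placeholders.
PATTERNS = [
    ("lookup_message", "SELECT * FROM messages WHERE id = {id}", 70),
    ("recent_messages", "SELECT * FROM messages ORDER BY ts DESC LIMIT {n}", 20),
    ("mark_read", "UPDATE messages SET read = 1 WHERE id = {id}", 10),
]


def generate_workload(num_queries, max_id=1000, seed=0):
    """Return a list of (intent, sql) pairs sampled according to the weights."""
    rng = random.Random(seed)
    weights = [weight for _, _, weight in PATTERNS]
    workload = []
    for _ in range(num_queries):
        intent, template, _ = rng.choices(PATTERNS, weights=weights, k=1)[0]
        # Every template draws from the same parameter pool; unused
        # parameters are simply ignored by str.format.
        params = {"id": rng.randint(1, max_id), "n": rng.choice([10, 20, 50])}
        workload.append((intent, template.format(**params)))
    return workload


if __name__ == "__main__":
    for intent, sql in generate_workload(5):
        print(intent, "->", sql)
```

Fixing the seed makes a generated workload reproducible, which helps when comparing engines on the same query sequence.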

Replicate the Dittrich Paper

• "The Uncracked Pieces in Database Cracking" by Schuhknecht et al. ( http://dl.acm.org/citation.cfm?id=2732229 ) describes a thorough evaluation of several different index types, including, among others, Cracker Indexes.  In this project, you will implement a variety of simple in-memory index structures such as Cracker Indexes, Hybrid Cracker Indexes, BTree Indexes, and possibly others.  Using these index structures, you will attempt to replicate Schuhknecht et al.'s results and generalize them further.

Replicate the Dittrich Paper

• Languages Required: A compiled, non-GC language like C, C++ or Rust

• First Steps: Read "The Uncracked Pieces in Database Cracking".  Implement a trivial in-memory BTree, Hash Table, and Cracker Index.

• Expected Outcomes: (1) A report outlining how you validated Schuhknecht et al.'s results, and detailing any additional findings you discovered in the process.

• Preliminary Questions: How did Schuhknecht et al evaluate cracker indexes?  What are their main takeaways?  Do these findings create any new questions to be answered?
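
For readers new to cracking, here is a minimal sketch of the basic idea: each range query partitions ("cracks") the piece of the column that contains its bounds, so the column becomes progressively more ordered as queries arrive. The class name `CrackerColumn` and its layout are assumptions for illustration only; the sketch is written in Python for brevity, whereas the project itself calls for a compiled language, and it does not attempt to reproduce the specific variants evaluated in the paper.

```python
import bisect


class CrackerColumn:
    """Toy cracker index over a list of integers (illustrative only)."""

    def __init__(self, values):
        self.data = list(values)  # cracker column, reorganized in place
        self.pivots = []          # sorted pivot values seen so far
        self.positions = []       # positions[i]: first index with value >= pivots[i]

    def _crack(self, pivot):
        """Partition the piece containing pivot; return the split position."""
        i = bisect.bisect_left(self.pivots, pivot)
        if i < len(self.pivots) and self.pivots[i] == pivot:
            return self.positions[i]  # already cracked on this value
        # The piece that may contain the pivot spans [lo, hi).
        lo = self.positions[i - 1] if i > 0 else 0
        hi = self.positions[i] if i < len(self.positions) else len(self.data)
        # In-place two-pointer partition: values < pivot move to the left.
        left, right = lo, hi - 1
        while left <= right:
            if self.data[left] < pivot:
                left += 1
            else:
                self.data[left], self.data[right] = self.data[right], self.data[left]
                right -= 1
        self.pivots.insert(i, pivot)
        self.positions.insert(i, left)
        return left

    def range_query(self, low, high):
        """Return all values v with low <= v < high (in no particular order)."""
        start = self._crack(low)
        end = self._crack(high)
        return self.data[start:end]


if __name__ == "__main__":
    column = CrackerColumn([13, 16, 4, 9, 2, 12, 7, 1, 19, 3])
    print(sorted(column.range_query(5, 13)))
```

After a few queries, later lookups touch only the small piece between two existing pivots rather than the whole column, which is the behavior a replication study would measure.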

LLVM Query Runtime

• Several recent academic database systems such as HyPer ( http://hyper-db.com ) and DBToaster ( http://www.dbtoaster.org ) use compilers to accelerate query processing by translating query plans into machine code before executing them.  In this project, you will do the same, creating a simple compiler (along the lines of the CSE 562 query processor) that generates machine code using LLVM (a highly extensible compiler toolchain).

LLVM Query Runtime

• Languages Required: Java, Scala, or C/C++

• First Steps: Write a program that constructs a toy program in LLVM Bytecode (e.g., print out a hardcoded tuple)

• Expected Outcomes: (1) A JIT-compiled query engine.  (2) A report evaluating the performance of your engine.
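
The difference between interpreting and compiling a query can be sketched without LLVM. The example below is a stand-in, not the project's approach: it walks a tiny predicate tree either by interpreting it per row, or by generating Python source for a specialized scan function and compiling that once. The tuple-based tree format and the names `interpret`, `to_source`, and `compile_filter` are all assumptions for this illustration; an LLVM-based engine would emit LLVM IR at the point where this sketch emits Python source.

```python
import operator

# Supported binary operators, for the interpreter and the code generator.
OPS = {
    "<": operator.lt,
    ">": operator.gt,
    "=": operator.eq,
    "+": operator.add,
    "and": lambda a, b: a and b,
}
PY_OPS = {"<": "<", ">": ">", "=": "==", "+": "+", "and": "and"}


def interpret(expr, row):
    """Evaluate an expression tree against one row by walking the tree."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "const":
        return expr[1]
    return OPS[kind](interpret(expr[1], row), interpret(expr[2], row))


def to_source(expr):
    """Translate an expression tree into a Python expression string."""
    kind = expr[0]
    if kind == "col":
        return f"row[{expr[1]!r}]"
    if kind == "const":
        return repr(expr[1])
    return f"({to_source(expr[1])} {PY_OPS[kind]} {to_source(expr[2])})"


def compile_filter(expr):
    """Generate and compile a scan function specialized to one predicate."""
    source = (
        "def scan(rows):\n"
        f"    return [row for row in rows if {to_source(expr)}]\n"
    )
    namespace = {}
    exec(compile(source, "<query>", "exec"), namespace)
    return namespace["scan"]


if __name__ == "__main__":
    predicate = ("and", (">", ("col", "a"), ("const", 1)),
                 ("<", ("col", "b"), ("const", 10)))
    rows = [{"a": 0, "b": 5}, {"a": 2, "b": 5}, {"a": 3, "b": 20}]
    print(compile_filter(predicate)(rows))
```

The compiled version pays the tree walk once per query instead of once per row, which is the effect a JIT-compiled engine aims for at the machine-code level.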

Lightweight Runtimes

• Small, lightweight database engines are becoming increasingly important as data migrates to low-power devices like smartphones.  In this project, you will implement a simple query evaluation engine on an Intel Galileo, a low-powered computing platform.  

Lightweight Runtimes

• Languages Required: Java (?), C++ or C

• First Steps: Get to the point where you can SSH into the Galileo board and run code.

• Expected Outcomes: (1) A query processor that runs on the Galileo board.  (2) Documentation of what changes need to be made.

• Preliminary Questions: How will the low-CPU and low-memory capabilities of the Galileo board affect the query processor's design?  The board has limited internal memory; where should the data be stored?

Policy Exploration for JITDs

• Just-in-Time Data Structures assemble simple building-blocks into more complex data structures using policies (typically defined as a set of rewrite rules).  A key feature is that the policy can change in response to changing workloads.  In this project, you will design and evaluate one or more JITD policies.  This project will be available to two groups: One group will work with the Java implementation, and the other group will work with the C implementation.

Policy Exploration for JITDs

• Languages Required: Java (group A) and C (group B)

• First Steps: Familiarize yourself with the implementation by coming up with a simple policy and implementing it (e.g., an LSM tree, a prefix trie, or a hash table)

• Expected Outcomes: (1) One or more policy implementations in the JITD framework.  (2) A report detailing the design of your policy, and discussing your evaluation of the policy/policies.

• Preliminary Questions: What types of workloads will your policy (or policies) target?  How does your policy change in response to changes in the workload?
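
To make "building blocks plus rewrite rules" concrete, the sketch below models three node types and one policy that reorganizes nodes along the path of each lookup: small unsorted arrays are sorted, and large ones are split around a pivot. This is a toy model under stated assumptions; the class names, the `crack_or_sort` rule, and the `contains` function are invented for this example and are not the API of either JITD implementation.

```python
import bisect
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Array:
    """Unsorted leaf."""
    keys: List[int]


@dataclass
class SortedArray:
    """Sorted leaf, searchable by binary search."""
    keys: List[int]


@dataclass
class BTree:
    """Binary separator node: keys in left are < sep, keys in right are >= sep."""
    sep: int
    left: "Cog"
    right: "Cog"


Cog = Union[Array, SortedArray, BTree]


def crack_or_sort(cog, threshold=4):
    """Hypothetical rewrite rule: sort small arrays, split large ones."""
    if not isinstance(cog, Array):
        return cog
    if len(cog.keys) <= threshold:
        return SortedArray(sorted(cog.keys))
    sep = cog.keys[len(cog.keys) // 2]  # arbitrary pivot choice
    left = [k for k in cog.keys if k < sep]
    if not left:  # pivot was the minimum; splitting would make no progress
        return SortedArray(sorted(cog.keys))
    return BTree(sep, Array(left), Array([k for k in cog.keys if k >= sep]))


def contains(cog, key, policy):
    """Return (found, rewritten_cog), applying the policy on the access path."""
    cog = policy(cog)
    if isinstance(cog, BTree):
        if key < cog.sep:
            found, cog.left = contains(cog.left, key, policy)
        else:
            found, cog.right = contains(cog.right, key, policy)
        return found, cog
    if isinstance(cog, SortedArray):
        i = bisect.bisect_left(cog.keys, key)
        return i < len(cog.keys) and cog.keys[i] == key, cog
    return key in cog.keys, cog  # unsorted leaf left untouched by the policy


if __name__ == "__main__":
    root = Array([5, 3, 9, 1, 7, 2, 8, 6, 4, 0])
    found, root = contains(root, 7, crack_or_sort)
    print(found, root)
```

Because the policy is passed in as a function, swapping it for a different rule (or switching rules as the workload changes) leaves the lookup code unchanged.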

JITDs on Disk

• Just-in-Time Data Structures use a DAG-based data representation that creates many random read accesses (assuming sequential writes).  When used as an in-memory data structure, this is fine.  However, when written directly to a sequential medium like a hard disk (or even an SSD), this can significantly degrade performance.  In this project, you will modify either JITD implementation to persist the index to disk in a way that makes it possible to efficiently reconstruct the index as needed.

JITDs on Disk

• Languages Required: Java, C++ or C

• First Steps: Using raw IO, or a storage layer like BerkeleyDB or similar, add support for paging out Cogs to one of the

• Expected Outcomes: (1) An implementation of a persistent JITD.  (2) A report evaluating the persistent data structure against existing data structures.

• Preliminary Questions: How can a JITD's random accesses be converted into sequential accesses?  What are efficient ways of compacting the log?  What happens when the data structure gets bigger than memory?
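
One possible starting point for the first question, offered only as a sketch, is an append-only log: nodes are written children-first, so every write goes to the end of the file, and each parent record refers to its children by byte offset. The dictionary-based node format, the JSON-lines encoding, and the names `append_node` and `load_node` are assumptions for this example and do not reflect either JITD implementation.

```python
import io
import json


def append_node(log, node):
    """Append a node (children first) to the log; return its byte offset.

    Nodes are plain dicts in a hypothetical format:
      {"type": "leaf", "keys": [...]} or
      {"type": "btree", "sep": int, "left": node, "right": node}
    """
    if node["type"] == "btree":
        record = {
            "type": "btree",
            "sep": node["sep"],
            "left": append_node(log, node["left"]),
            "right": append_node(log, node["right"]),
        }
    else:
        record = {"type": "leaf", "keys": node["keys"]}
    log.seek(0, io.SEEK_END)  # all writes are sequential appends
    offset = log.tell()
    log.write(json.dumps(record).encode("utf-8") + b"\n")
    return offset


def load_node(log, offset):
    """Rebuild the subtree whose record starts at the given byte offset."""
    log.seek(offset)
    record = json.loads(log.readline())
    if record["type"] == "btree":
        record["left"] = load_node(log, record["left"])
        record["right"] = load_node(log, record["right"])
    return record


if __name__ == "__main__":
    tree = {
        "type": "btree",
        "sep": 5,
        "left": {"type": "leaf", "keys": [1, 3]},
        "right": {"type": "leaf", "keys": [5, 8, 9]},
    }
    log = io.BytesIO()  # stands in for a file opened in binary mode
    root_offset = append_node(log, tree)
    print(load_node(log, root_offset) == tree)
```

Reads in this layout still jump between offsets, and rewritten nodes leave stale records behind, which is where the questions about sequential access and log compaction come in.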