CSE 662 – Languages and Databases Project Overview Part 1 Oliver Kennedy & Lukasz Ziarek.


Announcements

• Scientista Brunch

• Today: 12 – 1, 113A Davis Hall

• The Scientista Foundation is a national organization that empowers pre-professional women in STEM.

Types of Projects

• Benchmarking and Evaluation

• Replication of Research Results

• Building of Specialized Databases

• Research Oriented

Checkpoint Expectations

• 10+ page report

• Design / algorithms

• Limitations

• Work plan

• Related work

• Current state of the implementation brief

• Code / scripts written thus far

• Preliminary numbers

Final Project Expectations

• 15+ page report

• Implementation and Documentation

• Unit tests and benchmarks

• Build scripts

• Automated evaluation scripts

• Automated plotting scripts

Embedded Database Benchmark

• Lightweight, embedded databases are increasingly being used to store structured application state. Commonly used embedded databases include SQLite, BerkeleyDB, Derby, HSQLDB, and H2. Each database engine targets different workload/usage patterns. In this project, you will design a micro-benchmark workload and then perform a comparison between several embedded database engines, and analyze the results in a detailed report.

Embedded Database Benchmark

• Languages Required: Java, some C/C++, a language like Python or R

• First Steps: Install all 5 embedded databases and evaluate them using YCSB's 6 default workloads (workloads A-F)

• Expected Outcomes: (1) A report detailing the strengths and weaknesses of each system evaluated. (2) The software used to perform the benchmarks.

• Preliminary Questions: What types of workloads do you expect embedded databases to be used on? What kinds of access patterns do these workloads create? How would you evaluate each database's performance on these access patterns?
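As a rough illustration of what a micro-benchmark workload might look like, the sketch below runs a mixed read/update workload against an in-memory SQLite table, loosely modeled on the 50/50 read/update mix of YCSB workload A. The column names, key distribution (uniform), and parameter defaults are assumptions made for this example; a real comparison would go through each engine's own bindings and use YCSB's workload definitions.

```python
import random
import sqlite3
import time


def run_workload(num_records=1000, num_ops=5000, read_fraction=0.5, seed=0):
    """Run a read/update mix against an in-memory SQLite table.

    Returns (operations per second, number of reads, number of updates).
    """
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    # Illustrative schema: one integer key and one text field.
    conn.execute("CREATE TABLE usertable (ykey INTEGER PRIMARY KEY, field0 TEXT)")
    conn.executemany(
        "INSERT INTO usertable VALUES (?, ?)",
        [(k, "v%d" % k) for k in range(num_records)],
    )
    conn.commit()

    reads = updates = 0
    start = time.perf_counter()
    for _ in range(num_ops):
        key = rng.randrange(num_records)  # uniform keys; an assumption of this sketch
        if rng.random() < read_fraction:
            conn.execute(
                "SELECT field0 FROM usertable WHERE ykey = ?", (key,)
            ).fetchone()
            reads += 1
        else:
            conn.execute(
                "UPDATE usertable SET field0 = ? WHERE ykey = ?", ("u", key)
            )
            updates += 1
    conn.commit()
    elapsed = time.perf_counter() - start
    conn.close()
    return num_ops / elapsed, reads, updates


if __name__ == "__main__":
    throughput, reads, updates = run_workload()
    print("ops/sec: %.0f (reads=%d, updates=%d)" % (throughput, reads, updates))
```

Varying `read_fraction`, the key distribution, and the record size is one way to probe the access patterns raised in the preliminary questions.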

PocketData Benchmark

• Lightweight, embedded databases are increasingly being used to store structured application state, especially on mobile devices. We have permission to analyze traces of (anonymized) query logs from PhoneLab phones. In this project, you will use these traces as the basis for a macro-benchmark workload- and data-generator that simulates query patterns on a mobile phone.

PocketData Benchmark

• Languages Required: Probably a scripting language like Ruby/Python

• First Steps: Pick 3-4 high-volume applications in the 11-phone PhoneLab trace, label each of the queries issued by each application by the query's intent, and define a query "pattern" for each intent.

• Expected Outcomes: (1) A workload- and data-generator simulating the behavior of a mobile app's query workload. (2) A report outlining the design of the data and workload generators, and a lightweight evaluation of one or more database engines on this workload.

• Preliminary Questions: What are features of the query workload that need to be emulated? What constitutes a "realistic" simulation (i.e., how do you judge whether your generators are successful or not)?
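One possible starting point for grouping trace queries into "patterns" is to strip the literals out of each SQL string so that queries differing only in their constants collapse to the same template. The sketch below is a regex-based approximation, not a SQL parser, and the example queries are invented for illustration; they are not taken from the PhoneLab trace.

```python
import re
from collections import Counter

# Single-quoted SQL string literals, allowing '' as an escaped quote.
_STRING = re.compile(r"'(?:[^']|'')*'")
# Standalone integer or decimal literals (digits inside identifiers are kept).
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")
_SPACE = re.compile(r"\s+")


def query_pattern(sql):
    """Replace literals with '?' and normalize case and whitespace."""
    s = _STRING.sub("?", sql)
    s = _NUMBER.sub("?", s)
    s = _SPACE.sub(" ", s).strip()
    return s.upper()


def pattern_counts(queries):
    """Count how many queries fall under each pattern."""
    return Counter(query_pattern(q) for q in queries)


if __name__ == "__main__":
    trace = [
        "SELECT name FROM contacts WHERE id = 42",
        "select name  from contacts where id = 7",
        "UPDATE msgs SET body = 'it''s here' WHERE id = 3",
    ]
    for pattern, count in pattern_counts(trace).most_common():
        print(count, pattern)
```

The resulting pattern frequencies could then seed a workload generator that samples patterns in proportion to how often they occur and fills in fresh literal values.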

Replicate the Dittrich Paper

• "The Uncracked Pieces in Database Cracking" by Schuhknecht et al. ( http://dl.acm.org/citation.cfm?id=2732229 ) describes a thorough evaluation of several different index types, including, among others, Cracker Indexes. In this project, you will implement a variety of simple in-memory index structures such as Cracker Indexes, Hybrid Cracker Indexes, BTree Indexes, and possibly others. Using these index structures, you will attempt to replicate Schuhknecht et al.'s results and generalize them further.


Replicate the Dittrich Paper

• Languages Required: A compiled, non-GC language like C, C++ or Rust

• First Steps: Read "The Uncracked Pieces in Database Cracking". Implement a trivial in-memory BTree, Hash Table, and Cracker Index.

• Expected Outcomes: (1) A report outlining how you validated Schuhknecht et al.'s results, and detailing any additional findings you discovered in the process.

• Preliminary Questions: How did Schuhknecht et al. evaluate cracker indexes? What are their main takeaways? Do these findings create any new questions to be answered?
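To make the idea of a Cracker Index concrete, here is a minimal sketch of standard "crack-in-two" cracking: each range query partitions only the piece of the column that contains its bounds, so the column becomes incrementally more ordered as queries arrive. It is written in Python for readability only (the project itself calls for a compiled, non-GC language), and the class and method names are hypothetical, not taken from the paper's code.

```python
import bisect


def crack_in_two(column, lo, hi, pivot):
    """Partition column[lo:hi] in place so values < pivot come first.

    Returns the index of the first value >= pivot within the piece.
    """
    i, j = lo, hi - 1
    while i <= j:
        if column[i] < pivot:
            i += 1
        else:
            column[i], column[j] = column[j], column[i]
            j -= 1
    return i


class CrackerIndex:
    def __init__(self, values):
        self.column = list(values)
        self.pivots = []     # sorted pivot values the column has been cracked on
        self.positions = []  # positions[k]: first index holding a value >= pivots[k]

    def _crack(self, pivot):
        k = bisect.bisect_left(self.pivots, pivot)
        if k < len(self.pivots) and self.pivots[k] == pivot:
            return self.positions[k]  # already cracked on this value
        # Only the piece between the neighboring pivots needs to be partitioned.
        lo = self.positions[k - 1] if k > 0 else 0
        hi = self.positions[k] if k < len(self.pivots) else len(self.column)
        pos = crack_in_two(self.column, lo, hi, pivot)
        self.pivots.insert(k, pivot)
        self.positions.insert(k, pos)
        return pos

    def range_query(self, low, high):
        """Return all values v with low <= v < high (in no particular order)."""
        start = self._crack(low)
        end = self._crack(high)
        return self.column[start:end]


if __name__ == "__main__":
    index = CrackerIndex([13, 16, 4, 9, 2, 12, 7, 1, 19, 3])
    print(sorted(index.range_query(5, 13)))
    print(index.pivots, index.positions)
```

A replication would compare the per-query cost of this kind of structure against a fully sorted or BTree index over a sequence of queries.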

LLVM Query Runtime

• Several recent academic database systems such as HyPer ( http://hyper-db.com ) and DBToaster ( http://www.dbtoaster.org ) use compilers to accelerate query processing by translating query plans into machine code before executing them. In this project, you will do the same, creating a simple compiler (along the lines of the CSE 562 query processor) that generates machine code using LLVM (a highly extensible compiler toolchain).


LLVM Query Runtime

• Languages Required: Java, Scala, or C/C++

• First Steps: Write a program that constructs a toy program in LLVM Bytecode (e.g., print out a hardcoded tuple)

• Expected Outcomes: (1) A JIT-compiled query engine. (2) A report evaluating the performance of your engine.
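The general shape of query compilation can be sketched without LLVM: instead of interpreting a plan operator by operator, generate source code specialized to one query and compile it once. The example below is only an analogy. It targets Python source via the built-in `compile` and `exec` rather than LLVM, and the function name and plan format (a single filter followed by a projection) are assumptions of this sketch.

```python
def compile_filter_project(columns, predicate, projection):
    """Generate and compile a function specialized to one query:

        SELECT <projection> FROM rows WHERE <column> <op> <constant>

    columns:    list of column names, in tuple order
    predicate:  (column, op, constant)
    projection: non-empty list of column names to output
    Returns (compiled function, generated source).
    """
    col, op, const = predicate
    if op not in ("<", "<=", "==", "!=", ">=", ">"):
        raise ValueError("unsupported operator: %r" % (op,))
    pos = {name: i for i, name in enumerate(columns)}
    out = ", ".join("row[%d]" % pos[c] for c in projection)
    # Column positions, the operator, and the constant are baked into the code,
    # so no plan needs to be inspected while rows are being processed.
    source = (
        "def query(rows):\n"
        "    for row in rows:\n"
        "        if row[%d] %s %r:\n"
        "            yield (%s,)\n" % (pos[col], op, const, out)
    )
    namespace = {}
    exec(compile(source, "<generated query>", "exec"), namespace)
    return namespace["query"], source


if __name__ == "__main__":
    query, source = compile_filter_project(
        ["a", "b", "c"], ("b", ">", 10), ["a", "c"]
    )
    print(source)
    print(list(query([(1, 5, "x"), (2, 20, "y"), (3, 11, "z")])))
```

In the actual project, the generated text would be LLVM code handed to LLVM's JIT instead of Python source handed to `exec`.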

Lightweight Runtimes

• Small, lightweight database engines are becoming increasingly important as data migrates to low-power devices like smartphones. In this project, you will implement a simple query evaluation engine on an Intel Galileo, a low-powered computing platform.

Lightweight Runtimes

• Languages Required: Java (?), C++ or C

• First Steps: Get to the point where you can SSH into the Galileo board and run code.

• Expected Outcomes: (1) A query processor that runs on the Galileo board. (2) Documentation of what changes need to be made.

• Preliminary Questions: How will the low-CPU and low-memory capabilities of the Galileo board affect the query processor's design? The board has limited internal memory; where should the data be stored?

Policy Exploration for JITDs

• Just-in-Time Data Structures assemble simple building-blocks into more complex data structures using policies (typically defined as a set of rewrite rules). A key feature is that the policy can change in response to changing workloads. In this project, you will design and evaluate one or more JITD policies. This project will be available to two groups: One group will work with the Java implementation, and the other group will work with the C implementation.

Policy Exploration for JITDs

• Languages Required: Java (group A) and C (group B)

• First Steps: Familiarize yourself with the implementation by coming up with a simple policy and implementing it (e.g., an LSM tree, a prefix trie, or a hash table)

• Expected Outcomes: (1) One or more policy implementations in the JITD framework. (2) A report detailing the design of your policy, and discussing your evaluation of the policy/policies.

• Preliminary Questions: What types of workloads will your policy (or policies) target? How does your policy change in response to changes in the workload?
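To illustrate what "a policy defined as a set of rewrite rules" could look like, the sketch below models a structure built from three building blocks and two rules that rewrite it before each read. The class names, rule names, and the `read`/`insert` helpers are hypothetical and are not the API of either JITD implementation; the point is only that swapping the list of rules changes how the structure evolves.

```python
import bisect


class Array:
    """Unsorted block of values."""
    def __init__(self, values):
        self.values = list(values)


class Sorted:
    """Sorted block of values."""
    def __init__(self, values):
        self.values = sorted(values)


class Concat:
    """Union of two sub-structures."""
    def __init__(self, left, right):
        self.left, self.right = left, right


def insert(root, values):
    """Writes are cheap: wrap the new values in an Array next to the old root."""
    return Concat(root, Array(values))


def sort_arrays(node):
    """Rewrite rule: Array -> Sorted, applied everywhere in the structure."""
    if isinstance(node, Array):
        return Sorted(node.values)
    if isinstance(node, Concat):
        return Concat(sort_arrays(node.left), sort_arrays(node.right))
    return node


def merge_sorted(node):
    """Rewrite rule: Concat(Sorted, Sorted) -> Sorted, applied bottom-up."""
    if isinstance(node, Concat):
        left, right = merge_sorted(node.left), merge_sorted(node.right)
        if isinstance(left, Sorted) and isinstance(right, Sorted):
            return Sorted(left.values + right.values)
        return Concat(left, right)
    return node


def contains(node, key):
    if isinstance(node, Array):
        return key in node.values  # linear scan
    if isinstance(node, Sorted):
        i = bisect.bisect_left(node.values, key)  # binary search
        return i < len(node.values) and node.values[i] == key
    return contains(node.left, key) or contains(node.right, key)


def read(root, key, policy):
    """Apply each rule in the policy, then answer the lookup."""
    for rule in policy:
        root = rule(root)
    return root, contains(root, key)


if __name__ == "__main__":
    root = insert(Array([5, 3, 8]), [7, 1])
    root, found = read(root, 7, policy=[sort_arrays, merge_sorted])
    print(type(root).__name__, root.values, found)
```

Passing an empty policy leaves the structure untouched, while `[sort_arrays]` alone sorts blocks without merging them, so the same reads produce different layouts under different policies.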

JITDs on Disk

• Just-in-Time Data Structures use a DAG-based data representation that creates many random read accesses (assuming sequential writes). When used as an in-memory data structure, this is OK. However, when written directly to a sequential medium like hard disks (or even SSDs), this can significantly hurt performance. In this project, you will modify either JITD implementation to persist the index to disk in a way that makes it possible to efficiently reconstruct the index as needed.

JITDs on Disk

• Languages Required: Java, C++ or C

• First Steps: Using raw IO, or a storage layer like BerkeleyDB or similar, add support for paging out Cogs to one of the

• Expected Outcomes: (1) An implementation of a persistent JITD. (2) A report evaluating the persistent data structure against existing data structures.

• Preliminary Questions: How can a JITD's random accesses be converted into sequential accesses? What are efficient ways of compacting the log? What happens when the data structure gets bigger than memory?
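One way to think about turning a DAG into sequential I/O is an append-only log in which every node is written after its children and refers to them by record number; a single front-to-back scan can then rebuild the structure. The sketch below assumes a toy node format (tuples tagged `"leaf"` or `"concat"`) and a JSON-lines log, both invented for this example; it does not address compaction or data larger than memory.

```python
import io
import json


def persist(root, log):
    """Append the DAG rooted at `root` to `log`, children before parents.

    Nodes are ("leaf", [values]) or ("concat", left, right). Shared nodes are
    written once. Returns the record number of the root.
    """
    ids = {}   # id(node) -> record number, so shared sub-DAGs are not repeated
    count = 0

    def visit(node):
        nonlocal count
        if id(node) in ids:
            return ids[id(node)]
        if node[0] == "leaf":
            record = {"kind": "leaf", "values": node[1]}
        else:
            record = {
                "kind": "concat",
                "left": visit(node[1]),
                "right": visit(node[2]),
            }
        log.write(json.dumps(record) + "\n")  # strictly sequential writes
        ids[id(node)] = count
        count += 1
        return ids[id(node)]

    return visit(root)


def load(log):
    """Rebuild the DAG with one sequential scan; the last record is the root."""
    nodes = []
    for line in log:
        record = json.loads(line)
        if record["kind"] == "leaf":
            nodes.append(("leaf", record["values"]))
        else:
            # Children always appear earlier in the log than their parents.
            nodes.append(("concat", nodes[record["left"]], nodes[record["right"]]))
    return nodes[-1]


if __name__ == "__main__":
    shared = ("leaf", [1, 2])
    root = ("concat", shared, ("concat", shared, ("leaf", [3])))
    log = io.StringIO()  # stands in for a file on disk
    persist(root, log)
    log.seek(0)
    print(load(log) == root)
```

Because parents only point backward in the log, new versions of the structure can be appended without rewriting older records, which is what makes log compaction a natural follow-up question.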
