2015 Intel Big Data Software Summit
http://goto/bigdatasoftware

Enabling Near-Data Accelerators in Datacenters
Dave Ojika, Jayson Strayer, Gaurav Kaul, Prashanth Thinakaran, Darin Acosta

Highlights
• Motivation
  • Bring unconventional compute cores (especially FPGAs) into mainstream big data use
  • Abstract software complexity by introducing an efficient accelerator programming model
  • Enable a data-oriented framework for near-memory, distributed processing
• Approach
  • Accelerate computation on the FPGA; transfer data over a low-latency DDR bus
  • Provide in-memory storage using the open-source Tachyon framework
  • Offload Spark workloads to the accelerator
• Method
  • Use a Compute-Near-Memory (CNM) architecture for design-space exploration
  • Map data to cores with affinity to specific memory regions
  • Integrate a Java-OpenCL middleware to support scheduling of tasks on the accelerator

Accelerator Overview
• Memory-speed data access
• Memory-centric buffer synchronizes with the underlying file system
• Data affinity
  • Cores cooperate with each other for shared data accesses
• Shared Virtual Memory (SVM) accelerator model
  • CPU and device both access shared data using the same virtual addresses
  • No explicit data marshaling
• Compute Near Memory (CNM) prototype with PCI, DDR and Direct I/O interfaces
[Figure: register interface. Labels: Connect Object, Write Method, Read Method, Copy Method, Data Register, Cmd Register (write_bit, read_bit, copy_bit), 4 TB Image]

Workload Analysis
• Boosted Decision Tree (BDT)
  • Latency-sensitive
  • Poor data locality
  • Fits in 4 TB memory
  • 7-fold cross-validation
• Classification task: Higgs or no-Higgs, from a high-energy physics experiment at CERN’s LHC (collaboration with UF Physics)

Cache hit rates by dataset size (quad-core i7 CPU, 8 GB RAM)
Hit     1 GB      10 GB
L1      94.06%    89.75%
DRAM    0.74%     5.74%

• The fraction of store-bound stalls increases with the size of the dataset; the memory bandwidth requirement is too high for the CPU
• The workload can be trivially parallelized across DIMMs
• 1st place, ATLAS ’14 Higgs ML Challenge: deep learning from 0xdata’s H2O
• Where do FPGA accelerators stand?
• Explore BDT on the CNM accelerator

BDT on CPU (2nd place, ATLAS ’14 Higgs ML Challenge)
BDT on MATLAB, Xeon E5-2680 @ 2.8 GHz
• Prediction time: 370 ms
  • Okay for online, real-time prediction
• Training time: 5 hours
  • Grew with increasing data size
Training time for 11 million events: 5 hours!

In-Memory Framework
Tachyon File System (local or HDFS)
• In-memory data exchange
• Reliable file sharing at memory speed
• Caching of working-set files in memory
• Fault-tolerant and distributed API
Tachyon utilizes memory aggressively, leveraging data lineage.
JOC: Java-to-OpenCL Component
[Figure: software stack. Labels: Application, API, Big Data Framework, JOC, Task, Queue, Scheduler, Host Middleware, Driver, FPGA]

Data and Compute Layer
Application to architecture transformation
• Utilize parallelism on the FPGA
• Leverage low-latency DDR and 100 GB optical links
• Level 1: FPGA receives and pre-processes data in real time
• Level 2: FPGA as high-performance accelerator
QSFP
• Direct I/O
• Real-time
• Low latency
• Low power
Compute Engine
• Up to 3 TFLOPS
• OpenCL kernel
• Bandwidth of host memory
Datastore
• In-memory data store
• Memory-centric distributed storage
• Reliable data sharing at memory speed
[Figure: Altera Arria 10 FPGA, generic implementation. Labels: 100 GB Transceiver IP, FIFO, Decoder (data filter), Data Reassembly, DRAM, Pre-processed dataset, To Datastore, Synchronize]

Current Developments
• Simics simulator
  • Functional model
  • Software stack
  • Apps and workload exploration
• OpenCL driver integration
• Container enablement
• Cloud orchestration
• NVM support and NFV
[Figure labels: Development kit, Cloud Orchestration]

Authors
• Dave Ojika: Cloud Infrastructure
• Jayson Strayer: Platform Silicon
• Gaurav Kaul: Health and Life Sciences
• Prashanth Thinakaran: Big Data
• Darin Acosta: Physics Professor, UF
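
Illustration: data affinity
The poster proposes mapping data to cores with affinity to specific memory regions, and notes that the workload parallelizes trivially across DIMMs. The Python sketch below shows one way such a mapping could look. It is a minimal illustration under stated assumptions, not the poster's implementation: the names `Partition`, `partition_rows` and `owner_of`, the one-core-per-region pairing, and the contiguous row layout are all hypothetical.

```python
from typing import List, NamedTuple


class Partition(NamedTuple):
    region: int  # hypothetical memory-region (e.g. DIMM) index
    core: int    # hypothetical compute core with affinity to that region
    start: int   # first row (inclusive)
    stop: int    # last row (exclusive)


def partition_rows(num_rows: int, num_regions: int) -> List[Partition]:
    """Split rows into contiguous, near-equal ranges, one per memory region.

    Assumption: core i is the core closest to region i, so each core only
    touches rows stored in its own region.
    """
    if num_rows < 0 or num_regions <= 0:
        raise ValueError("num_rows must be >= 0 and num_regions must be > 0")
    base, extra = divmod(num_rows, num_regions)
    partitions = []
    start = 0
    for region in range(num_regions):
        # The first `extra` regions take one additional row each.
        size = base + (1 if region < extra else 0)
        partitions.append(Partition(region, region, start, start + size))
        start += size
    return partitions


def owner_of(row: int, partitions: List[Partition]) -> Partition:
    """Return the partition (and hence the core) that owns a given row."""
    for p in partitions:
        if p.start <= row < p.stop:
            return p
    raise IndexError(f"row {row} is outside all partitions")
```

For example, `partition_rows(11_000_000, 8)` would split the 11 million events across eight hypothetical regions, and `owner_of(row, parts)` tells a scheduler which core should process a given row. Contiguous ranges were chosen here only because they are the simplest layout to reason about; a real design might interleave rows instead.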
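
Illustration: 7-fold cross-validation
The workload analysis lists 7-fold cross-validation for the BDT. The sketch below shows how the folds could be formed in plain Python. The function name `k_fold_indices`, the fixed seed, and the shuffle-then-stride strategy are illustrative assumptions; the poster does not say how the folds were actually built.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(num_samples: int, k: int = 7, seed: int = 0
                   ) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= num_samples:
        raise ValueError("k must satisfy 2 <= k <= num_samples")
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation
```

Each sample lands in exactly one validation fold, so a model is trained seven times and every event is scored once by a model that never saw it during training. Because training time grew with data size, each of those seven runs works on roughly six sevenths of the data.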