CUDA PARALLEL PROGRAMMING AND GAMING MEETUP, RIGA, LATVIA, JANUARY 12, 2015

Cuda meetup presentation 5

Jul 16, 2015


Rihards Gailums
Transcript
Page 1: Cuda meetup presentation 5

CUDA PARALLEL PROGRAMMING AND

GAMING MEETUP

RIGA, LATVIA, JANUARY 12, 2015

Page 2: Cuda meetup presentation 5

Rihards Gailums

Twitter: @RihardsGailums

[email protected]

Page 3: Cuda meetup presentation 5

• Personal Introductions;

• CES Nvidia news;

• Jetson TK1;

• Announcements;

• Summary

Agenda

Page 4: Cuda meetup presentation 5

Who am I

What I am doing

What is my parallel programming [CUDA] experience

Introduction

Page 5: Cuda meetup presentation 5

Exponential Technologies

Drones AUTOMOTIVE 3D Printing Artificial intelligence

Biotechnology Design ROBOTICS Computer Vision

Page 6: Cuda meetup presentation 5

CES 2015

Page 7: Cuda meetup presentation 5
Page 8: Cuda meetup presentation 5

Parrot drone + Jetson TK1

Page 9: Cuda meetup presentation 5
Page 10: Cuda meetup presentation 5
Page 11: Cuda meetup presentation 5
Page 12: Cuda meetup presentation 5

Featured events

Page 13: Cuda meetup presentation 5
Page 14: Cuda meetup presentation 5

The Startup University

Latvia Tech and Entrepreneurship meetup network

Page 15: Cuda meetup presentation 5

Riga BioTechnology Meetup - Kick-off Meetup, December 12, THE Mill

UI/UX Riga Meetup - December Meetup, December 15, THE Mill

Riga Drone Meetup - Drone Kick-off Meetup, December 16, THE Mill

Riga Mobile App Developer Meetup - Kick-off Meetup, December 17, THE Mill

Page 16: Cuda meetup presentation 5

Riga Startup: Idea to IPO - Find your Co-founder, December 18, THE Mill

CUDA parallel programming and gaming meetup Riga - December CUDA Meetup, December 22, THE Mill

3D Printing Riga Meetup - January 3D Printing, January 8, RTU Design Factory

Bitcoin and Cryptocurrencies Meetup - January Meetup, January 26, THE Mill

Page 17: Cuda meetup presentation 5

Sponsors

Page 18: Cuda meetup presentation 5

Technology and Entrepreneurship education

Page 19: Cuda meetup presentation 5

Parallel Computing Research Center

CUDA LABORATORY

Page 20: Cuda meetup presentation 5

Jetson TK1

Page 21: Cuda meetup presentation 5
Page 22: Cuda meetup presentation 5
Page 23: Cuda meetup presentation 5

http://www.anandtech.com/show/7905/nvidia-announces-jetson-tk1-dev-board-adds-erista-to-tegra-roadmap

Page 24: Cuda meetup presentation 5

https://www.youtube.com/watch?v=aw01HwTN1MM

Page 25: Cuda meetup presentation 5

What's in the Box

Page 26: Cuda meetup presentation 5
Page 27: Cuda meetup presentation 5

Login Credentials

Username: ubuntu
Password: ubuntu

Page 28: Cuda meetup presentation 5

Install the NVIDIA Linux driver binary release on your target located in:

${HOME}/NVIDIA-INSTALLER

Step 1)

Change directories into the NVIDIA installation directory:

cd ${HOME}/NVIDIA-INSTALLER

Step 2)

Run the installer script to extract and install the Linux driver binary release:

sudo ./installer.sh

Step 3)

Reboot the system to have the graphical desktop UI come up.

Page 29: Cuda meetup presentation 5

CUDA SDK Demo Samples

• Particles

• Nbody

• Smoke particles

• Waves

Page 30: Cuda meetup presentation 5
Page 31: Cuda meetup presentation 5

http://elinux.org/Jetson_TK1

Page 32: Cuda meetup presentation 5
Page 33: Cuda meetup presentation 5

Jetson/Installing CUDA

http://elinux.org/Jetson/Installing_CUDA
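Once CUDA is installed per the link above, a quick way to confirm that the driver and toolkit work together is to query the on-board GPU with the CUDA runtime API. This is a minimal sketch, not from the slides; the expected output is an assumption based on the Tegra K1 specs.

// query_gpu.cu - build with: nvcc query_gpu.cu -o query_gpu
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("No CUDA device found - is the driver installed?\n");
        return 1;
    }
    // On the Jetson TK1 this should report a Kepler-class GPU (compute capability 3.2).
    std::printf("Device 0: %s, compute capability %d.%d, %d multiprocessor(s)\n",
                prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    return 0;
}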

Page 34: Cuda meetup presentation 5

Jetson/Installing OpenCV

http://elinux.org/Jetson/Installing_OpenCV

Page 35: Cuda meetup presentation 5

OpenCV

http://docs.opencv.org/doc/tutorials/tutorials.html

Page 36: Cuda meetup presentation 5

http://docs.opencv.org/doc/tutorials/objdetect/cascade_classifier/cascade_classifier.html#cascade-classifier

Object Detection
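The tutorial linked above boils down to loading a trained cascade and calling detectMultiScale. Here is a minimal sketch of that flow using OpenCV's plain CPU API (not the CUDA-accelerated classes); the cascade path and image file names are placeholders.

#include <opencv2/objdetect/objdetect.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <vector>

int main() {
    // Haar cascade shipped with OpenCV; adjust the path to your installation (placeholder).
    cv::CascadeClassifier face_cascade;
    if (!face_cascade.load("haarcascade_frontalface_alt.xml")) return 1;

    cv::Mat img = cv::imread("input.jpg");        // placeholder input image
    cv::Mat gray;
    cv::cvtColor(img, gray, CV_BGR2GRAY);
    cv::equalizeHist(gray, gray);

    // Run the cascade classifier over the grayscale image.
    std::vector<cv::Rect> faces;
    face_cascade.detectMultiScale(gray, faces, 1.1, 3);

    // Draw a rectangle around each detection and save the result.
    for (size_t i = 0; i < faces.size(); ++i)
        cv::rectangle(img, faces[i], cv::Scalar(0, 255, 0), 2);
    cv::imwrite("detected.jpg", img);
    return 0;
}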

Page 37: Cuda meetup presentation 5

Jetson TK1 / Tegra K1 SoC

• Kepler GPU with 192 CUDA cores

• 4-Plus-1 quad-core ARM Cortex A15 CPU

• 2 GB x16 memory with 64-bit width

• 16 GB eMMC 4.51 memory

• 1 Half mini-PCIE slot

• 1 Full size SD/MMC connector

• 1 Full-size HDMI port

• 1 USB 2.0 port, micro AB

• 1 USB 3.0 port, A

• 1 RS232 serial port

• 1 ALC5639 Realtek Audio codec with Mic in and Line out

• 1 RTL8111GS Realtek GigE LAN

• 1 SATA data port

• SPI 4MByte boot flash

Page 38: Cuda meetup presentation 5

Research projects?

Page 39: Cuda meetup presentation 5

Introduction

• The IT industry shows growing demand for high-resolution display surfaces

• Use cases for such surfaces include satellite and map data, x-ray and microscope images, multimedia, CCTV, etc.

• Existing solutions are not scalable, do not offer hardware abstraction, and suffer from wiring limitations

Proposed Virtual Machine Based Monitor Wall Architecture

Conclusions

• The current experiments show that this architecture is very feasible for non-FPS-intensive use cases where the display wall can be driven by a single physical GPU

• The total resolution provided by this architecture, even using currently available compression technology, greatly exceeds the resolutions of existing solutions, and the resolution is expected to grow in the future

• The architecture itself scales very well; it is limited mainly by OS support for multiple monitors (this can be overcome by simulating a single high-resolution display in the virtual machine that spans the whole resolution of the physical wall) and by the possibility to stack multiple GPUs in the host system

• Future work should focus on the ability to virtualize OpenGL and Direct3D to remove the advantages of non-virtualized architectures

[Diagrams: two traditional display wall setups, one where each GPU output drives its own monitor and one where a single GPU output goes through a splitter/scaler to the monitors]

Why NVENC?

Currently there are two main alternatives to be used as the H.264 encoder in this architecture: Intel Quick Sync and NVENC. NVENC is more feasible because:

• The total encoding power can be increased by stacking multiple GPUs that support NVENC without penalties, while not all Intel Quick Sync GPUs have built-in video memory, so scaling those cards introduces the performance penalty of using system memory

• NVENC does not put any limitations on other components of the system, while Intel Quick Sync supports only a limited set of CPUs

• Current benchmarks seem to show that the overall FPS performance for a single GPU (the main criterion for this architecture) is better for NVENC than for Intel Quick Sync

Currently Popular Monitor Wall Architectures

Pro: The OS can natively manage the displays
Con: Power consumption; the supported monitor count is limited by the output count of the GPUs and by the expansion slots for the GPUs on the motherboard; deployment is limited by wiring

Pro: Software complexity is reduced since it does not have to be multiple-monitor aware
Con: Small resolution and DPI; visualization is not displayed in its native resolution
Con: Expensive

Proposed architecture:
Pro: Scalable; the host machine can run multiple virtual machines, and multiple virtualized GPUs map to physical GPUs to maximize efficiency
Pro: A LAN connection to the display wall removes the wire length limitations forced by DVI/HDMI cables
Pro: The total resolution of the wall goes beyond what can be achieved using physical hardware
Con: Lossy compression
Con: No Direct3D or OpenGL support

• The host machine collects the frame buffer data from the virtual machine GPUs and performs H.264 encoding of the video stream on the physical host GPU, so the architecture relies heavily on a fast hardware H.264 encoder and leaves the hosted virtual machines free to fully use the CPU

• Non-FPS-intensive use cases allow a great number of virtual monitors to be hosted on a single physical GPU, thus reducing power consumption

Scalability

[Chart: maximum number of connected monitors, traditional architecture vs. the proposed video-streaming architecture]

The graph demonstrates the scalability possibilities in terms of the maximum number of connected monitors for the traditional architecture versus the proposed one on a Quadro K4000 card that has 4 outputs.

Display wall architecture where each output of the GPU maps to a tile on the display wall

Display wall architecture where each output of the GPU is split/upscaled among the tiles on the display wall

[Diagram: a host machine runs virtual machines with virtual GPUs; each virtual GPU output is encoded on the physical host GPU and sent as H.264/RTP over LAN to a tile of the display wall]

The proposed display wall architecture where each output of a virtual GPU maps to a tile on the display wall and is transmitted as an H.264 stream over LAN

Virtual machine based monitor wall running Google maps inside Chrome web browser on 16 tiles at 1920x1080 pixels each giving a total resolution of 32 megapixels
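As a quick arithmetic check of the caption above: 16 tiles × 1920 × 1080 pixels = 33,177,600 pixels, which is roughly 32 megapixels.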

Each tile has a dedicated LAN connection and H.264 decoder

Scalability of supported monitor count

NVENC Based H.264 Encoding for Virtual Machine Based Monitor Wall Architecture

R. Bundulis ([email protected]), G. Arnicans ([email protected]), and R. Gailums ([email protected])

University of Latvia / Riga High Tech University, Latvia

Page 40: Cuda meetup presentation 5

For Startups by Meetup members:

$1800 per year of free Azure cloud services

Free Microsoft software and tools

Page 41: Cuda meetup presentation 5

Latvijas Garantiju Aģentūra

A government corporation that supports Latvian entrepreneurs and helps them realise their business ideas.

Page 42: Cuda meetup presentation 5

Featured Resources

Page 43: Cuda meetup presentation 5

Featured Startup

Page 44: Cuda meetup presentation 5

Speakers

Page 45: Cuda meetup presentation 5

Mobile App Startup Pitch

Page 46: Cuda meetup presentation 5

Why GPU Computing

Page 47: Cuda meetup presentation 5


Add GPUs: Accelerate Science Applications

© NVIDIA 2013

Page 48: Cuda meetup presentation 5

Small Changes, Big Speed-up

[Diagram: the application code is split so that compute-intensive functions are parallelized on the GPU while the rest of the sequential code keeps running on the CPU]

© NVIDIA 2013
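To make the pattern on this slide concrete, here is a minimal CUDA C sketch (not from the deck; names and sizes are illustrative): one compute-intensive loop is moved into a kernel, while the surrounding sequential CPU code stays as it was.

#include <cuda_runtime.h>

// The compute-intensive function, now a CUDA kernel: one thread per element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Host-side wrapper: copy data over, launch the kernel, copy the result back.
void add_on_gpu(const float *a, const float *b, float *c, int n) {
    size_t bytes = n * sizeof(float);
    float *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMalloc((void**)&d_c, bytes);
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
}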

Page 49: Cuda meetup presentation 5

Fastest Performance on Scientific Applications

Tesla K20X Speed-Up over Sandy Bridge CPUs

CPU results: dual socket E5-2687w, 3.10 GHz; GPU results: dual socket E5-2687w + 2 Tesla K20X GPUs

*MATLAB results comparing one i7-2600K CPU vs. one Tesla K20 GPU

Disclaimer: Non-NVIDIA implementations may not have been fully optimized

[Chart: Tesla K20X speed-up over Sandy Bridge CPUs, from 0x to 20x, for AMBER (Molecular Dynamics), SPECFEM3D (Earth Science), Chroma (Physics), and MATLAB FFT* (Engineering)]

© NVIDIA 2013

Page 50: Cuda meetup presentation 5

Why Computing Perf/Watt Matters?

Traditional CPUs are not economically feasible

2.3 PFlops at 7.0 megawatts, the power consumption of about 7,000 homes

CPU: optimized for serial tasks

GPU accelerator: optimized for many parallel tasks

10x performance/socket

> 5x energy efficiency

Era of GPU-accelerated computing is here

© NVIDIA 2013

Page 51: Cuda meetup presentation 5

World’s Fastest, Most Energy Efficient Accelerator

[Chart: SGEMM vs. DGEMM throughput in TFLOPS for Tesla K20X, Tesla K20, Xeon CPU E5-2690, and Xeon Phi 225W]

Tesla K20X vs Xeon CPU

8x Faster SGEMM

6x Faster DGEMM

Tesla K20X vs Xeon Phi

90% Faster SGEMM

60% Faster DGEMM

© NVIDIA 2013

Page 52: Cuda meetup presentation 5

Introduction to the CUDA Platform

Page 53: Cuda meetup presentation 5

CUDA Parallel Computing Platform

Hardware Capabilities

GPUDirect, SMX, Dynamic Parallelism, Hyper-Q

Programming Approaches

Libraries ("drop-in" acceleration), OpenACC directives (easily accelerate apps), programming languages (maximum flexibility)

Development Environment

Nsight IDE (Linux, Mac and Windows), GPU debugging and profiling with the CUDA-GDB debugger and the NVIDIA Visual Profiler

Open Compiler Tool Chain

Enables compiling new languages to the CUDA platform, and CUDA languages to other architectures

www.nvidia.com/getcuda

© NVIDIA 2013

Page 54: Cuda meetup presentation 5

3 Ways to Accelerate Applications

Applications

Libraries: "drop-in" acceleration

OpenACC directives: easily accelerate applications

Programming languages: maximum flexibility

© NVIDIA 2013

Page 55: Cuda meetup presentation 5

3 Ways to Accelerate Applications

Applications

Libraries: "drop-in" acceleration

OpenACC directives: easily accelerate applications

Programming languages: maximum flexibility

© NVIDIA 2013

Page 56: Cuda meetup presentation 5

Libraries: Easy, High-Quality Acceleration

• Ease of use: using libraries enables GPU acceleration without in-depth knowledge of GPU programming

• "Drop-in": many GPU-accelerated libraries follow standard APIs, thus enabling acceleration with minimal code changes

• Quality: libraries offer high-quality implementations of functions encountered in a broad range of applications

• Performance: NVIDIA libraries are tuned by experts

© NVIDIA 2013

Page 57: Cuda meetup presentation 5

Some GPU-accelerated Libraries

NVIDIA cuBLAS, NVIDIA cuRAND, NVIDIA cuSPARSE, NVIDIA NPP

Vector Signal Image Processing

GPU Accelerated Linear Algebra

Matrix Algebra on GPU and Multicore

NVIDIA cuFFT

C++ STL Features for CUDA

IMSL Library

Building-block Algorithms for CUDA

ArrayFire Matrix Computations

Sparse Linear Algebra

© NVIDIA 2013

Page 58: Cuda meetup presentation 5

3 Steps to CUDA-accelerated application

• Step 1: Substitute library calls with equivalent CUDA library calls

saxpy ( … ) → cublasSaxpy ( … )

• Step 2: Manage data locality

- with CUDA: cudaMalloc(), cudaMemcpy(), etc.

- with CUBLAS: cublasAlloc(), cublasSetVector(), etc.

• Step 3: Rebuild and link the CUDA-accelerated library

nvcc myobj.o -l cublas

© NVIDIA 2013
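To make the three steps concrete, here is a minimal sketch using the legacy cuBLAS API the slide names (cublasAlloc, cublasSetVector); the vector length, values, and file names are illustrative, not from the deck.

#include <cublas.h>   // legacy cuBLAS API, as named on the slide
#include <vector>

int main() {
    const int n = 1 << 20;          // illustrative vector length
    const float alpha = 2.0f;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    cublasInit();

    // Step 2: manage data locality - move x and y to the GPU
    float *d_x = 0, *d_y = 0;
    cublasAlloc(n, sizeof(float), (void**)&d_x);
    cublasAlloc(n, sizeof(float), (void**)&d_y);
    cublasSetVector(n, sizeof(float), &x[0], 1, d_x, 1);
    cublasSetVector(n, sizeof(float), &y[0], 1, d_y, 1);

    // Step 1: the drop-in substitution - saxpy(...) becomes cublasSaxpy(...)
    cublasSaxpy(n, alpha, d_x, 1, d_y, 1);

    // Copy the result back and release GPU memory
    cublasGetVector(n, sizeof(float), d_y, 1, &y[0], 1);
    cublasFree(d_x);
    cublasFree(d_y);
    cublasShutdown();
    return 0;
}

Step 3 is then just the build line from the slide, for example nvcc saxpy.cpp -l cublas.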

Page 59: Cuda meetup presentation 5

Explore the CUDA (Libraries) Ecosystem

• CUDA Tools and Ecosystem described in detail on NVIDIA Developer Zone:developer.nvidia.com/cuda-tools-ecosystem

© NVIDIA 2013

Page 60: Cuda meetup presentation 5

3 Ways to Accelerate Applications

Applications

Libraries: "drop-in" acceleration

OpenACC directives: easily accelerate applications

Programming languages: maximum flexibility

© NVIDIA 2013

Page 61: Cuda meetup presentation 5

OpenACC Directives

© NVIDIA 2013

Program myscience

... serial code ...

!$acc kernels

do k = 1,n1

do i = 1,n2

... parallel code ...

enddo

enddo

!$acc end kernels

...

End Program myscience

Your original Fortran or C code plus simple compiler hints: the OpenACC compiler parallelizes the code, and it works on many-core GPUs and multicore CPUs.
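For reference, a minimal C version of the same idea; this is a sketch, not from the deck, and assumes an OpenACC compiler such as PGI invoked with its accelerator flag (e.g. -acc).

// Scale a 2-D array; the directive asks the compiler to offload the loop nest.
// Without OpenACC support the pragma is ignored and this compiles as plain C.
void scale(int n1, int n2, float a, float *x) {
    #pragma acc kernels
    for (int k = 0; k < n1; ++k)
        for (int i = 0; i < n2; ++i)
            x[k * n2 + i] *= a;
}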

Page 62: Cuda meetup presentation 5

• Easy: Directives are the easy path to accelerate compute intensive applications

• Open: OpenACC is an open GPU directives standard, making GPU programming straightforward and portable across parallel and multi-core processors

• Powerful: GPU Directives allow complete access to the massive parallel power of a GPU

OpenACC: The Standard for GPU Directives

© NVIDIA 2013

Page 63: Cuda meetup presentation 5

Directives: Easy & Powerful

Real-Time Object Detection, Global Manufacturer of Navigation Systems: 5x in 40 hours

Valuation of Stock Portfolios using Monte Carlo, Global Technology Consulting Company: 2x in 4 hours

Interaction of Solvents and Biomolecules, University of Texas at San Antonio: 5x in 8 hours

"Optimizing code with directives is quite easy, especially compared to CPU threads or writing CUDA kernels. The most important thing is avoiding restructuring of existing code for production applications." -- Developer at the Global Manufacturer of Navigation Systems

© NVIDIA 2013

Page 64: Cuda meetup presentation 5

Start Now with OpenACC Directives

Free trial license to PGI Accelerator, tools for a quick ramp-up

Sign up for a free trial of the directives compiler now: www.nvidia.com/gpudirectives

© NVIDIA 2013

Page 65: Cuda meetup presentation 5

3 Ways to Accelerate Applications

Applications

Libraries: "drop-in" acceleration

OpenACC directives: easily accelerate applications

Programming languages: maximum flexibility

© NVIDIA 2013

Page 66: Cuda meetup presentation 5

GPU Programming Languages

Fortran: OpenACC, CUDA Fortran

C: OpenACC, CUDA C

C++: Thrust, CUDA C++

Python: PyCUDA, Copperhead

F#: Alea.cuBase

Numerical analytics: MATLAB, Mathematica, LabVIEW

© NVIDIA 2013

Page 67: Cuda meetup presentation 5

// generate 32M random numbers on host
thrust::host_vector<int> h_vec(32 << 20);
thrust::generate(h_vec.begin(), h_vec.end(), rand);

// transfer data to device (GPU)
thrust::device_vector<int> d_vec = h_vec;

// sort data on device
thrust::sort(d_vec.begin(), d_vec.end());

// transfer data back to host
thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

Rapid Parallel C++ Development

• Resembles C++ STL

• High-level interface

• Enhances developer productivity

• Enables performance portability between GPUs and multicore CPUs

• Flexible

• CUDA, OpenMP, and TBB backends

• Extensible and customizable

• Integrates with existing software

• Open source

http://developer.nvidia.com/thrust or http://thrust.googlecode.com
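Usage note (not on the slide): the snippet above needs the Thrust headers (thrust/host_vector.h, thrust/device_vector.h, thrust/generate.h, thrust/sort.h, thrust/copy.h) plus cstdlib for rand, and, saved as a .cu file, it builds with the regular CUDA compiler, e.g. nvcc sort.cu -o sort. Thrust ships with the CUDA Toolkit, so no extra libraries are required.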

Page 68: Cuda meetup presentation 5

MATLAB

http://www.mathworks.com/discovery/matlab-gpu.html

Learn More

These languages are supported on all CUDA-capable GPUs. You might already have a CUDA-capable GPU in your laptop or desktop PC!

CUDA C/C++

http://developer.nvidia.com/cuda-toolkit

Thrust C++ Template Library

http://developer.nvidia.com/thrust

CUDA Fortran

http://developer.nvidia.com/cuda-toolkit

GPU.NET

http://tidepowerd.com

PyCUDA (Python)

http://mathema.tician.de/software/pycuda

Mathematica

http://www.wolfram.com/mathematica/new-in-8/cuda-and-opencl-support/

© NVIDIA 2013

Page 69: Cuda meetup presentation 5

Getting Started

© NVIDIA 2013

• Download CUDA Toolkit & SDK: www.nvidia.com/getcuda

• Nsight IDE (Eclipse or Visual Studio): www.nvidia.com/nsight

• Programming Guide/Best Practices: docs.nvidia.com

• Questions:

• NVIDIA Developer forums: devtalk.nvidia.com

• Search or ask on: www.stackoverflow.com/tags/cuda

• General: www.nvidia.com/cudazone

Page 70: Cuda meetup presentation 5

© NVIDIA 2013

https://www.youtube.com/watch?v=IzU4AVcMFys

Intro to CUDA - An introduction, how-to, to NVIDIA's GPU parallel

Mythbusters Demo GPU versus CPU: http://youtu.be/-P28LKWTzrI

Page 71: Cuda meetup presentation 5

Research projects?

Page 72: Cuda meetup presentation 5
Page 73: Cuda meetup presentation 5
Page 74: Cuda meetup presentation 5

Jetson TK1 / Tegra K1 SoC

• Kepler GPU with 192 CUDA cores

• 4-Plus-1 quad-core ARM Cortex A15 CPU

• 2 GB x16 memory with 64-bit width

• 16 GB eMMC 4.51 memory

• 1 Half mini-PCIE slot

• 1 Full size SD/MMC connector

• 1 Full-size HDMI port

• 1 USB 2.0 port, micro AB

• 1 USB 3.0 port, A

• 1 RS232 serial port

• 1 ALC5639 Realtek Audio codec with Mic in and Line out

• 1 RTL8111GS Realtek GigE LAN

• 1 SATA data port

• SPI 4MByte boot flash

Page 75: Cuda meetup presentation 5

NVIDIA GTX 750Ti

• Nvidia MAXWELL technology

• Cost – 170 USD

• Only 60 W of power, no dedicated power connector required

• 250 MHash/s

Vs

• Nvidia GTX 780: 350 MHash/s, at higher power consumption

• Nvidia Tesla K40: 560 MHash/s, at higher power consumption

Page 76: Cuda meetup presentation 5

Latvian CUDA & parallel programming ecosystem

Next meetups, frequency

Speakers

Topics

Group marketing channels