Introduction to gpu architecture

Introduction to GPU Architectures for Graphics

Ludan 滷蛋

Outline● GPU Pipeline● Rendering (CPU vs GPU)

GPU Pipeline⬅ Rendering (CPU vs GPU)

RasterizationRasterizing 較ray tracing, radiosity等適合被實作於硬體(相依性的考量)，在GPU中負責將向量圖形sample到pixel space，非常花運算，現在由GPU來做這件事

Ref : http://www.zedalos.org/2014/05/introduction-into-zedalos-computer-graphics-the-rendering-pipeline/


PipelineFragment generation是由SIMD處理器內的Raster engine專門來執行

Ref : https://queue.acm.org/detail.cfm?id=1365498

Vertex generation

Vertex processing

Primitive generation

Primitive processing

Fragment generation

Fragment processing

Pixel operation


PipelineRasterization就是在Fragment generation，以FG為分界，在FG之前負責計算圖形的坐標，FG之後負責圖形著色及貼圖(texture)以及輸出到framebuffer


Vertex generation

Vertex processing



Fragment generation

Fragment processing

Pixel operation


Pipeline黃色的階段是可以被程式的，例如GL, Dx.. 為什麼pipline要區分灰色跟黃色的階段? 因為灰色的階段硬體隱藏了分配shader core平行的工作，讓黃色可程式的階段可以清鬆寫(像單執行緒，不用考慮多執行緒的分配)


Vertex generation

Vertex processing



Fragment generation

Fragment processing

Pixel operation


Pipeline所以黃色可程式的部份會被安排到SIMD的處理器中CUDA core執行，現代GPU中最少有一個或多個SIMD的處理器，Nvidia稱作SMX，每個SMX中含有多個CUDA core，程式的部份就是由CUDA core來執行


Vertex generation

Vertex processing



Fragment generation

Fragment processing

Pixel operation


Pipeline灰色VG、PG、FG會由SMX裡的PolyMorph Engine來處理，這樣一來簡化程式GL/DX繪圖語言的複雜度

Vertex generation

Vertex processing



Fragment generation

Fragment processing

Pixel operation


Vertex generation先來看第一個階段: Vertex generation 這個階段GPU會將欲繪製的點坐標(CPU傳來的)，組成input stream分配給shader core

Vertex generation

Vertex processing



Fragment generation

Fragment processing

Pixel operation


Vertex generation而負責處理這件事是在PolyMorph engine中的Vertex Fetch負責

Vertex generation

Vertex processing



Fragment generation

Fragment processing

Pixel operation


Vertex processingVP這個階段是可程式，以3D物件來舉例，主要是負責運算坐標的轉換與投影，每一個點坐標都是可獨立運算的

Ref : http://www.eecs.berkeley.edu/~sequin/CS184/LECT_2011/L4.html

Vertex generation

Vertex processing



Fragment generation

Fragment processing

Pixel operation


Vertex processing圖1是model space，每個物件都是獨立被設計，彼此間是沒有相對關係

Vertex generation

Vertex processing



Fragment generation

Fragment processing

Pixel operation


Vertex processing把所有的物件合在一個坐標系World space，會利用modeling transformation，在vertex processing階段運算

Vertex generation

Vertex processing



Fragment generation

Fragment processing

Pixel operation


Vertex processinggl_Position = mP*mV*mM*s_vPosition;

將Model space坐標乘上Model Matrix

轉換至World space

Vertex generation

Vertex processing



Fragment generation

Fragment processing

Pixel operation



經由前一步轉換到world space後，想要轉換不同視角來觀看，就再乘上一個View Matrix來轉換視角的方向與遠近

Vertex generation

Vertex processing



Fragment generation

Fragment processing

Pixel operation



最後再乘上Projection Matrix取得三維投影到平面的結果，得到的坐標就是最終要顯

示的坐標

Vertex generation

Vertex processing



Fragment generation

Fragment processing

Pixel operation


Primitive generationVP後坐標都確定了，PG就是為了群組坐標成為不同的集合(primitive)，分配給shader core，例如下面兩個三角形就是不同的兩個集合

Vertex generation

Vertex processing



Fragment generation

Fragment processing

Pixel operation


Primitive generation經過Tessellator輸出有順序性的新頂點，Tessellator而具備"無中生有"的能力，可讓原本的3D模組產生更多三角形。用白話來說就是一種讓多邊形細分為碎片的方法，藉此獲得更細緻的物體表面

Vertex generation

Vertex processing



Fragment generation

Fragment processing

Pixel operation


Primitive generation讓原本稜稜角角, 變得細緻減輕shader運算(shader做需計算更多的點)

ref : http://maxpgiganti.com/tutorials/using-tessellation-and-dispacement-maps/

Vertex generation

Vertex processing



Fragment generation

Fragment processing

Pixel operation


Primitive processingPP階段程式出最終的primitives，以線為例子，可程式不同類的線

Ref : http://tkengo.github.io/blog/2015/01/03/opengl-es-2-2d-knowledge-2/

Vertex generation

Vertex processing



Fragment generation

Fragment processing

Pixel operation


Primitive processingPP階段程式出最終的primitives，以三角形為例，可程式不同類的三角形

Ref : http://tkengo.github.io/blog/2015/01/03/opengl-es-2-2d-knowledge-2/

Vertex generation

Vertex processing



Fragment generation

Fragment processing

Pixel operation


Primitive processingPP階段完成了就會狠多的mesh物件

Ref : https://protonassets.wordpress.com/

Vertex generation

Vertex processing



Fragment generation

Fragment processing

Pixel operation


Fragment generation在PolyMorph engine中Viewport Transform會將投影平面超出能顯示範圍的去掉，將坐標normolize到能顯示的範圍

Vertex generation

Vertex processing



Fragment generation

Fragment processing

Pixel operation


Fragment generation接著Attribute Setup，會將這些頂點屬性轉換為方程式，最後Stream Output則是將這些數據傳送到記憶體內，接著讓Raster Engine來處理

Vertex generation

Vertex processing



Fragment generation

Fragment processing

Pixel operation


Fragment generationRasterization 硬體會將每個primitive的向量坐標sample到pixel space，分配給shader core

Ref: http://www.cs.ucr.edu/~shinar/courses/cs130-spring-2012/content/Lecture2.pdf

Vertex generation

Vertex processing



Fragment generation

Fragment processing

Pixel operation


Fragment generationRaster Engine只有3個階段，分別是Edge Setup、Rasterizer和Z-Cull。Raster Engine運作首先在Edge Setup階段，頂點位置被取出，做rasterize，最後將不需要在視角顯示的頂點與三角形剔除

Vertex generation

Vertex processing



Fragment generation

Fragment processing

Pixel operation


Fragment generation每個Edge Setup單元，能夠在1個周期內處理1個點、1條線，或是1個三角形。接著在Rasterizer運作的過程中，會計算每個方程式的像素覆蓋率，如果開啟反鋸齒功能，則會在這個階段計算覆蓋率以進行反鋸齒採樣

Vertex generation

Vertex processing



Fragment generation

Fragment processing

Pixel operation


Fragment generation最後在著色之前，Z-Cull會擷取Pixel Tile（就把照片分一塊塊），將已經運算完畢的Pixel Tile與處理中的Pixel Tile進行空間深度比對，將位於其他Pixel Tile背面的Pixel Tile刪除（也就是螢幕中不可見的部分），避免進入著色器浪費運算效能(Early-Z)

Vertex generation

Vertex processing



Fragment generation

Fragment processing

Pixel operation


Fragment generation最後，每個primitive就會被rasterize成各別的fragment

Vertex generation

Vertex processing



Fragment generation

Fragment processing

Pixel operation


Fragment processingFP階段可依光照的方向程式每個pixel的顏色的強度

Vertex generation

Vertex processing



Fragment generation

Fragment processing

Pixel operation


Fragment processing舉例光線角度會影響到pixel的顏色強弱，可程式光對顏色的不同表現

vec3 N = normalize(fN);vec3 L = normalize(fL);

float diffuse_intensity = max(dot(N, L), 0.0);

if (diffuse_intensity == 0) { fColor = vec4(0.05, 0.05, 0.05, 1.0);} else { fColor = vec4(diffuse_intensity, diffuse_intensity, diffuse_intensity, 0.5);}

Ref : http://marcinignac.com/blog/pragmatic-pbr-setup-and-gamma/

Vertex generation

Vertex processing



Fragment generation

Fragment processing

Pixel operation


Fragment processingFP階段也可程式Texture mapping為圖形貼上需要的材質

Ref : https://www.cs.unc.edu/~rademach/xroads-RT/RTarticle.html

Vertex generation

Vertex processing



Fragment generation

Fragment processing

Pixel operation


Fragment processingTexture 狠耗運算，在GPU中也有一個專門的硬體負責，TA負責取值，TS負責處理放大、縮小

Ref : https://www.beyond3d.com/content/articles/118/5

Vertex generation

Vertex processing



Fragment generation

Fragment processing

Pixel operation


Pixel operation最後的階段就是要把畫好的圖貼進 frame buffer 了. 比方說要畫個眼睛在臉上, 眼睛總不能大過臉, 因此要在可以畫眼睛的範圍以外塗上框框, 這個框就是模板 (stencil). 當然後景不蓋前景, 需要更新的區域才更新等基本限制, 在真正畫出去之前, 要先做這類的測試. 例如 scissor test, alpha test, depth and stencil test 等等.

ref: https://open.gl/depthstencils

Vertex generation

Vertex processing



Fragment generation

Fragment processing

Pixel operation


Pixel operation每個pixel的位置要貼在framebuffer哪裡確定完，會處理某個pixel是由多個fragment合成的像是半透明作alpha blending，千辛萬之後才能進到framebuffer

Vertex generation

Vertex processing



Fragment generation

Fragment processing

Pixel operation


Cross-task communication不同的階段的task是透過L1/L2 Cache在通訊的，而對於GPU繪圖來说，一張圖都是通過很小的色塊和紋理組成，不壓縮的話會占用傳輸頻寬，nvidia對於在bus傳輸圖片也有自己的Delta Color Compression無損壓縮

GPU Pipeline Rendering (CPU vs GPU)⬅

CPU-Style看完GPU pipeline，我們可以知道每個坐標或是pixel都是可視為獨立不相關的問題處理，而且shader coding在pipeline的設計下，不需要特別去程式平行的問題，舉例來說，每個fragment都是在執行自已的指令

ref : http://haifux.org/lectures/267/Introduction-to-GPUs.pdf


CPU-Style不同的fragment都是執行相同的指令，只是在不同的位置，舉例來說用雙核心來同時處理兩個fragment



CPU-Style我們也可以狠容易想像，今天要處理的更快，就是讓核心數量變更多，所以我們可以往上加核心數，例如四核心



CPU-Style還是不夠快，再往上加到16核心，拿16核心的來處理16個pexel，好像有點貴



CPU-Style當我們狠聰明的狂加核心數的同時，我們發現每個fragment都是被同樣的指令所處理，那是不是可以share咧



CPU-Style回想一下，我們剛才都是一直加核心數，可以節省一點嗎，不要加整個核心



SIMD processing有了這個想法，就進化了



SIMD processing我們不直接加核心的數量，而是增加運算單元，讓ALU變狠多，但是原來的指令也需要配合這個想法做修改

要改


SIMD processing我們重新編譯出新的shader，將原來的問題，當成向量來處理



SIMD processing於是一個指令就可以處理不同的分向量，在同一個核心中，這就是single instruction multiple data


SIMD processing假設每個核心可以處理向量的寬度是8，有16個核心就可以同時處理128個fragment，越高階的顯卡，核心數就越多


SIMD processing也可以說16個核心可以同時接受16個不同的instruction streams，每個核心可以處理pipeline中不同階段的問題或運算，像是verices, primitives, fragments, cuda/opencl


SIMD processingBranches 在SIMD架構下，注意要避免複雜多層的branches，以下圖來說ALU1,2,4會執行黃色部份的指令，其它的3,5,6,7,8休息，黃色指令執行完才輪到 ALU3,5,6,7,8執行藍灰色的指令，1,2,4則休息


SIMD processing所以要避免錯綜複雜的if esle，屁如說你的branch有八種分支的可能，那最壞的情形就是每種都做一次，這樣就8個ALU每次都只有一個會動，其它都睡著了。


Hiding shader stalls因為存取cache, memory, texture… I/O會有latency，每當發生I/O會讓核心閒置，想辦法增加核心的利用率，避免閒置，想法上就是當有這次處理的向量等I/O，就切換另一組向量來執行，但是現在向量寬度是8，只有8組context在記憶，是做不到的


Hiding shader stalls簡單的想法就是增加記憶context的組數，如下圖我們context由一組增加到四組，我們就可以切換四組來hide I/O lantency


Hiding shader stalls舉例走一下，當我們遇到第一組fragment存取I/O，我們就可以馬上切換到第二組執行，因為現在有四組context，依此類推，每當發生stall，就切換到ready的那組執行，避免核心閒置


Hiding shader stalls所以記憶context的硬體越多，越能hide latency，但成本就提升


Hiding shader stallsHide lantency在GPU來說是狠重要的，所以可以想像如果今天你的所要運算的資料量不是狠大，在GPU上來說沒有足夠的運算量來切換來hide lantency，這樣用GPU是不會得到好處地，講到這裡就講完了，希望有幫助溜。

Introduction to gpu architecture

Technology