
CUDA core pipeline

I am following this article about a prediction model for GPUs. Near the end of the second column on page 5 they say:

One has to finally take care of the fact that each of the Nc cores(SPs) in an SM on the GPU has a D-deep pipeline that has the effect of executing D threads in parallel.

My question concerns this D-deep pipeline. What does this pipeline look like? Is it similar to a CPU pipeline (I say "similar" only because GPU and CPU architectures are completely different) in the sense of fetch, decode, execute, write-back?

Is there any documentation that describes this?

Yes, the pipeline of a GPU SM looks somewhat like a CPU pipeline. The difference lies in the front-end/back-end proportions of the pipeline: a GPU has a single fetch/decode stage and many small ALUs (think of 32 parallel execution sub-pipelines), grouped inside the SM as "CUDA cores". This is similar to a superscalar CPU (e.g. a Core i7 has 6-8 issue ports, one per independent ALU pipeline).

Here is a GTX 460 SM (picture from destructoid.com; we can even see that each CUDA core has two pipelines inside: a Dispatch port, then an Operand collector, then two parallel units, one for Int and the other for FP, and a Result queue):

(Or a better-quality image from http://www.legitreviews.com/article/1193/2/: http://www.legitreviews.com/images/reviews/1193/sm.jpg)

We can see that this SM has a single instruction cache, two warp schedulers and four dispatch units, and a single register file. So the first stages of the GPU SM pipeline are resources shared by the whole SM. After instruction scheduling, instructions are dispatched to the CUDA cores, and each core may have its own multi-stage (pipelined) ALU, especially for complex operations.
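To make that warp-based scheduling concrete, here is a minimal CUDA sketch (my own illustration, not taken from the answer above): each thread works out which 32-thread warp it belongs to, i.e. the scheduling unit that the warp schedulers pick from before the 32 lanes are handed to the CUDA cores.

```
// Minimal sketch: make the grouping of threads into 32-wide warps visible.
#include <cstdio>

__global__ void show_warps()
{
    int lane = threadIdx.x % 32;   // position of this thread inside its warp
    int warp = threadIdx.x / 32;   // warp index inside the block
    if (lane == 0)                 // print once per warp
        printf("block %d, warp %d starts at thread %d\n",
               blockIdx.x, warp, threadIdx.x);
}

int main()
{
    // 2 blocks of 128 threads -> 4 warps per block for the schedulers to juggle
    show_warps<<<2, 128>>>();
    cudaDeviceSynchronize();
    return 0;
}
```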

The exact pipeline length is hidden inside the architecture, but I assume the total pipeline depth is far more than 4 stages. (There clearly are instructions with a 4-cycle latency, so the ALU pipeline has at least 4 stages, and the total SM pipeline depth is assumed to be more than 20 stages: https://devtalk.nvidia.com/default/topic/390366/instruction-latency/)

Some additional information about full instruction latency: https://devtalk.nvidia.com/default/topic/419456/how-to-schedule-warps-/ – 24-28 clocks for SP and 48-52 clocks for DP.
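For readers who want to estimate such latency numbers themselves, below is a rough sketch of the usual clock()-based microbenchmark (my own illustration; the exact figures vary by GPU and include a small loop overhead). A single thread runs a chain of dependent FMAs, so the measured cycles per iteration approximate the ALU pipeline latency rather than its throughput.

```
// Rough dependent-instruction latency estimate with one thread (no latency hiding).
#include <cstdio>

__global__ void fma_latency(float *out, long long *cycles, int iters)
{
    float x = 1.0f;
    long long start = clock64();
    for (int i = 0; i < iters; ++i)
        x = x * 0.999f + 0.001f;   // each FMA depends on the previous result
    long long stop = clock64();
    *out = x;                       // keep the compiler from removing the loop
    *cycles = stop - start;
}

int main()
{
    float *d_out;  long long *d_cycles;
    cudaMalloc(&d_out, sizeof(float));
    cudaMalloc(&d_cycles, sizeof(long long));

    const int iters = 1 << 16;
    fma_latency<<<1, 1>>>(d_out, d_cycles, iters);

    long long cycles;
    cudaMemcpy(&cycles, d_cycles, sizeof(cycles), cudaMemcpyDeviceToHost);
    printf("~%.1f cycles per dependent FMA (upper bound, loop overhead included)\n",
           (double)cycles / iters);

    cudaFree(d_out); cudaFree(d_cycles);
    return 0;
}
```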

Anandtech published some pictures of AMD GPUs, and we may assume that the main ideas of pipelining are similar for both vendors: http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute/4

So the fetch, decode and branch units are common to all SIMD cores, and there are many ALU pipelines. In AMD's design the register file is segmented across ALU groups; in Nvidia's it is shown as a single unit (but it could be implemented as segmented and accessed through an interconnection network).

As stated in this work:

Fine-grained parallelism, however, is what sets GPUs apart. Recall that threads execute synchronously in bundles known as warps. GPUs run most efficiently when the number of warps-in-flight is large. Although only one warp can be serviced per cycle (Fermi technically services two half-warps per shader cycle), the SM’s scheduler will immediately switch to another active warp when a hazard is encountered. If the instruction stream generated by the CUDA compiler expresses an ILP of 3.0 (that is, an average of three instructions can be executed before a hazard), and the instruction pipeline depth is 22 stages, as few as eight active warps (22 / 3) may be sufficient to completely hide instruction latency and achieve max arithmetic throughput. GPU latency hiding delivers good utilization of the GPU’s vast execution resources with little burden on the programmer.

So only one warp per clock is dispatched from the pipeline front-end (the SM scheduler), and there is some latency between the scheduler's dispatch and the moment the ALU finishes the computation.
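As a back-of-the-envelope check of the quoted rule, the number of active warps needed to hide instruction latency is roughly the pipeline depth divided by the ILP. The small host-side sketch below just reproduces the 22 / 3 ≈ 8 arithmetic from the quote (the 22-stage depth and ILP of 3.0 are the quoted figures, not measurements).

```
// Warps needed to hide instruction latency, per the quoted rule of thumb.
#include <cstdio>
#include <math.h>

int main()
{
    double pipeline_depth = 22.0;  // instruction pipeline depth in stages (quoted figure)
    double ilp            = 3.0;   // independent instructions before a hazard (quoted figure)

    double warps_needed = ceil(pipeline_depth / ilp);   // 22 / 3 -> 8 warps
    printf("~%.0f active warps are enough to hide the latency\n", warps_needed);
    return 0;
}
```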

There are pictures from Realworldtech, http://www.realworldtech.com/cayman/5/ and http://www.realworldtech.com/cayman/11/, showing part of the Fermi pipeline. Note the [16] annotation in each ALU/FPU: it means there are physically 16 identical ALUs.
