tag:blogger.com,1999:blog-11078740724377674662024-03-13T06:45:51.908-07:00Yuhao's BlogYuhaohttp://www.blogger.com/profile/08569555359590748704noreply@blogger.comBlogger13125tag:blogger.com,1999:blog-1107874072437767466.post-28013220625661599332011-01-13T09:40:00.000-08:002011-01-13T09:40:59.250-08:00Vector Processors (9)<div style="font-size: 14.0pt; font-weight: bold; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">4.4 Multiple Lanes</span></div><div style="font-size: 11.0pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">One of the greatest advantages of a vector instruction set is that it lets software pass a large amount of parallel work to the hardware with a single short instruction. One vector instruction can include tens to hundreds of independent operations, yet it is encoded in the same number of bits as an ordinary scalar instruction. The parallel semantics of a vector instruction allow an implementation to execute these element operations on a deeply pipelined functional unit, as in the VMIPS implementation studied so far, on an array of parallel functional units, or on a combination of parallel and pipelined units. Figure F.11 shows how vector performance can be improved by using parallel pipelines to execute a vector add instruction.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11.0pt; margin: 0in;"><br />
</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEih5IZLq2uNw3jhkjhJF7aN91j5WksgNH83jN70bqb1WdqX9GlAEdGOSM9q0v2ToAUEzLA78cgm8CUR9E9gCH8czZ1QYmr8zmDtW-sF39cBI9Qem2XX9dVwulcjtLaaqQ5hGxhsguUFvTkm/s1600/F11.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="277" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEih5IZLq2uNw3jhkjhJF7aN91j5WksgNH83jN70bqb1WdqX9GlAEdGOSM9q0v2ToAUEzLA78cgm8CUR9E9gCH8czZ1QYmr8zmDtW-sF39cBI9Qem2XX9dVwulcjtLaaqQ5hGxhsguUFvTkm/s400/F11.PNG" width="400" /></a></div><div style="margin: 0in;"><span class="Apple-style-span" style="font-size: 15px; font-style: italic;"><span lang="en-US" style="font-family: Calibri; font-weight: bold;">Figure F.11 Using multiple functional units to improve the performance of a single vector add instruction, C = A + B. </span></span><span class="Apple-style-span" style="font-size: 15px; font-style: italic;"><span lang="en-US" style="font-family: Calibri;">The machine in (a) has a single add pipeline and can complete one addition per cycle. The machine in (b) has four add pipelines and can complete four additions per cycle. The elements of a single vector add instruction are interleaved across the four pipelines. The set of elements that move through the pipelines together is called an element group.</span></span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11.0pt; margin: 0in;"><br />
</div><div style="font-size: 11.0pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">The VMIPS instruction set is designed so that every vector arithmetic instruction only allows element N of one vector register to take part in operations with element N of other vector registers. This greatly simplifies the design of a highly parallel vector unit, which can be organized as multiple parallel lanes. As with a highway, we can raise the peak throughput of a vector unit by adding more lanes, as shown in Figure F.12.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11.0pt; margin: 0in;"><br />
</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgorPSi7Di3urLUQ1sRTPKOfSjTQy1oGStH3zS1ZaVfLeeaW_3w-fOu4noNu6TDbBf1GZXvVvOznzeDHkfY8Lbo6ZkpSrinWaOz3gm_IF0qta6-w4S3D9wGCM8o5uIi61x2eb7vMOSYXYvF/s1600/F12.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="263" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgorPSi7Di3urLUQ1sRTPKOfSjTQy1oGStH3zS1ZaVfLeeaW_3w-fOu4noNu6TDbBf1GZXvVvOznzeDHkfY8Lbo6ZkpSrinWaOz3gm_IF0qta6-w4S3D9wGCM8o5uIi61x2eb7vMOSYXYvF/s400/F12.PNG" width="400" /></a></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11.0pt; margin: 0in;"><span class="Apple-style-span" style="font-family: 'Times New Roman'; font-style: italic;"><span lang="en-US" style="font-family: Calibri; font-weight: bold;">Figure F.12 Structure of a vector unit containing four lanes. </span><span lang="en-US" style="font-family: Calibri;">The vector-register storage is divided across the lanes, with each lane holding every fourth element of each vector register. The figure shows three vector functional units: a floating-point add unit, a floating-point multiply unit, and a load-store (L/S) unit. Each vector arithmetic unit contains four execution pipelines, one per lane, which work together to complete a single vector instruction. Note that each section of the vector-register file only needs enough ports for the pipelines local to its lane, which greatly reduces the cost of providing multiple ports. The path that supplies the scalar operand of vector-scalar instructions is not shown in this figure, but the scalar value must be broadcast to all lanes.</span></span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11.0pt; margin: 0in;"><br />
</div><div style="font-size: 11.0pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">Each lane contains one portion of the vector-register file and one execution pipeline from each vector functional unit. Each vector functional unit executes vector instructions at a rate of one element group per cycle, using multiple pipelines, one per lane. The first lane holds the first element of every vector register, so the source and destination operands for the first element of any vector instruction are all located in the first lane. This lets the arithmetic pipeline local to a lane complete its operation without communicating with other lanes. Interlane wiring is needed only for memory accesses. Avoiding interlane communication reduces both the wiring cost and the number of register-file ports needed for a highly parallel execution unit, and it helps explain why current vector supercomputers can complete up to 64 operations per cycle (16 lanes, each with 2 arithmetic units and 2 L/S units).</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11.0pt; margin: 0in;"><br />
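<div style="font-size: 11.0pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">The interleaved layout described above can be sketched in a few lines of Python. This is only an illustrative model, not VMIPS code: the function names are hypothetical, elements are numbered from 0, and each functional unit is assumed to accept one element group per cycle.</span></div>

```python
from math import ceil


def lane_of(element_index, num_lanes):
    """Lane that holds (and processes) a given element under interleaving."""
    return element_index % num_lanes


def element_groups(vector_length, num_lanes):
    """Element indices that travel through the lanes together, cycle by cycle."""
    return [
        list(range(start, min(start + num_lanes, vector_length)))
        for start in range(0, vector_length, num_lanes)
    ]


def occupancy_cycles(vector_length, num_lanes):
    """Cycles a functional unit is occupied, at one element group per cycle."""
    return ceil(vector_length / num_lanes)


if __name__ == "__main__":
    for cycle, group in enumerate(element_groups(10, 4)):
        print(cycle, group)
```

<div style="font-size: 11.0pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">Because element i of every register lives in lane i mod (number of lanes), both operands and the result of an element operation stay in one lane, which is why no interlane wiring is needed for arithmetic.</span></div>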
</div><div style="font-size: 11.0pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">Adding lanes is a popular way to improve vector performance because it requires little extra control complexity and no changes to existing code. Several vector supercomputers are sold as a range of models that differ in the number of lanes, which lets users trade price against peak performance. The Cray SV1 and X1 allow four 2-lane vector CPUs to be ganged together into a single, larger 8-lane CPU, as discussed in Section 7.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11.0pt; margin: 0in;"><br />
</div><div style="font-size: 14.0pt; font-weight: bold; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">4.5 Pipelined Instruction Start-Up</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11.0pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">Adding lanes raises peak performance but does not change the start-up latency, so it becomes critical to reduce start-up overhead by letting the start of one vector instruction overlap with the completion of the preceding one. The simplest case is when two vector instructions access different vector registers. For example, in the following code sequence:</span></div><div style="font-family: Calibri; font-size: 11.0pt; margin: 0in;"><span style="mso-spacerun: yes;"> </span>ADDV.D<span style="mso-spacerun: yes;"> </span>V1, V2, V3</div><div style="font-family: Calibri; font-size: 11.0pt; margin: 0in;"><span style="mso-spacerun: yes;"> </span>ADDV.D<span style="mso-spacerun: yes;"> </span>V4, V5, V6</div><div lang="zh-CN" style="font-family: 宋体; font-size: 11.0pt; margin: 0in;"><br />
</div><div style="font-size: 11.0pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">An implementation can let the first element of the second instruction follow the last element of the first instruction directly down the floating-point adder pipeline. To reduce the complexity of the control logic, some vector machines require some recovery time, or dead time, between two vector instructions dispatched to the same vector unit. Figure F.13 shows the start-up latency and dead time for a single pipeline.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11.0pt; margin: 0in;"><br />
</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXI7YZreCvUJEbkOHIrFPHS7_wqBBoRb5FViLHH_CYnmujvN3tCy0NX4_Vx191PaAMc9vTepGBLGwhPqj1f81Df7D5IWbSMSJ-lH6bt3zBx0i3Rf2jrWqGUF37CNrQ4rfQ_FQn3Wg8LEMJ/s1600/F13.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="230" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXI7YZreCvUJEbkOHIrFPHS7_wqBBoRb5FViLHH_CYnmujvN3tCy0NX4_Vx191PaAMc9vTepGBLGwhPqj1f81Df7D5IWbSMSJ-lH6bt3zBx0i3Rf2jrWqGUF37CNrQ4rfQ_FQn3Wg8LEMJ/s400/F13.PNG" width="400" /></a></div><div style="margin: 0in;"><span class="Apple-style-span" style="font-size: 15px; font-style: italic;"><span lang="en-US" style="font-family: Calibri; font-weight: bold;">Figure F.13 Start-up latency and dead time for a single vector pipeline. </span></span><span class="Apple-style-span" style="font-size: 15px; font-style: italic;"><span lang="en-US" style="font-family: Calibri;">Each element operation has a latency of 5 cycles: 1 cycle to read the vector-register file, 3 cycles of execution, and 1 cycle to write back to the register file. Elements of the same vector instruction can follow one another down the pipeline, but this machine inserts 4 cycles of dead time between two different vector instructions. The dead time can be reduced with more complex control logic.</span></span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11.0pt; margin: 0in;"><br />
</div><div lang="zh-CN" style="font-family: 宋体; font-size: 11.0pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">The following example shows the effect of dead time on vector processor performance.</span></div><div style="font-family: Calibri; font-size: 11.0pt; font-weight: bold; margin: 0in;">_____________________________________________________</div><div style="font-size: 11.0pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri; font-weight: bold;">Example</span><span lang="en-US" style="font-family: Calibri;">: The Cray C90 has two lanes but requires 4 cycles of dead time between any two vector instructions executed on the same functional unit, even if they have no data dependences. For the maximum vector length of 128 elements, how much does dead time reduce the achievable peak performance? What would the reduction be if the number of lanes were increased to 16?</span></div><div style="font-size: 11.0pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri; font-weight: bold;">Answer</span><span lang="en-US" style="font-family: Calibri;">: A maximum-length vector of 128 elements is divided over the 2 lanes and occupies a vector functional unit for 64 clock cycles. The dead time adds another 4 cycles of occupancy, reducing peak performance to 64/(64 + 4) = 94.1% of the value without dead time. If the number of lanes is increased to 16, the 128 elements occupy a functional unit for only 128/16 = 8 cycles, and the dead time reduces peak performance to 8/(8 + 4) = 66.7% of the value without dead time. In this second case, the vector units can never be busy more than 2/3 of the time.</span></div><div style="font-family: Calibri; font-size: 11.0pt; font-weight: bold; margin: 0in;">_____________________________________________________</div><div lang="zh-CN" style="font-family: 宋体; font-size: 11.0pt; margin: 0in;"><br />
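<div style="font-size: 11.0pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">The arithmetic in the example generalizes to any vector length, lane count, and dead time. The helper below is a small sketch of that calculation; the function name is hypothetical, and it assumes back-to-back independent instructions on one functional unit, each processing one element group per cycle.</span></div>

```python
from math import ceil


def peak_fraction(vector_length, num_lanes, dead_time):
    """Fraction of peak throughput left after dead time between instructions."""
    busy_cycles = ceil(vector_length / num_lanes)  # cycles doing useful work
    return busy_cycles / (busy_cycles + dead_time)


if __name__ == "__main__":
    for lanes in (2, 16):
        print(lanes, "lanes:", format(peak_fraction(128, lanes, 4), ".1%"))
```

<div style="font-size: 11.0pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">The busy time shrinks as lanes are added while the dead time stays fixed, so the same 4 cycles cost proportionally more on a wider machine.</span></div>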
</div><div style="font-size: 11.0pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">Pipelining instruction start-up becomes more complicated when multiple instructions may read and write the same vector register, or when some instructions stall unpredictably, for example when a load encounters a bank conflict. However, as the number of lanes and the pipeline latencies increase, fully pipelined instruction start-up becomes more and more important.</span></div>Yuhaohttp://www.blogger.com/profile/08569555359590748704noreply@blogger.com0tag:blogger.com,1999:blog-1107874072437767466.post-10183122221891444782011-01-08T21:22:00.000-08:002011-01-08T21:23:08.408-08:00Vector Processors (8)<div style="font-size: 14pt; font-weight: bold; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">4.3 Sparse Matrices</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">There are techniques that allow programs based on sparse matrices to run in vector mode. In a sparse matrix, the elements of a vector are usually stored in a compacted form and accessed indirectly. Assuming a simplified sparse structure, we might see code like this:</span></div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> <span style="font-weight: bold;">do</span> 100 i = 1, n</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> 100 A(K(i)) = A(K(i)) + C(M(i))</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">This code implements a sparse vector sum on the arrays A and C, using the index vectors K and M to designate the nonzero elements of A and C. (A and C must have the same number of nonzero elements, n in this example.) Another common representation for sparse matrices uses a bit vector to mark which positions hold nonzero elements, together with a dense vector that contains all the nonzero elements. Both representations often appear in the same program. Sparse matrices show up in many codes, and there are many ways to implement them, depending on the data structure used.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">The primary mechanism for supporting sparse matrices is scatter-gather operations using index vectors. The goal of these operations is to support moving between a dense representation (that is, zeros are not included) and a normal representation (that is, zeros are included) of a sparse matrix. A gather operation takes an index vector and fetches the vector whose elements are at the addresses formed by adding the offsets in the index vector to a base address. The result is a dense vector in a vector register. After these elements have been processed in dense form, the sparse vector can be stored back to memory in expanded form by a scatter operation, using the same index vector. Hardware support for these two operations is called scatter-gather and appears on nearly all modern vector processors. VMIPS provides the instructions LVI (load vector indexed) and SVI (store vector indexed) for these operations. For example, assuming that Ra, Rc, Rk, and Rm hold the starting addresses of the four vectors in the previous example, the inner loop of that code can be implemented with the following vector instructions:</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> LV Vk, Rk ;load K</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> LVI Va, (Ra + Vk) ;load A(K(I))</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> LV Vm, Rm ;load M</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> LVI Vc, (Rc + Vm) ;load C(M(I))</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> ADDV.D Va, Va, Vc ;add them</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> SVI (Ra + Vk), Va ;store A(K(I))</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
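<div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">The same gather, add, scatter pattern can be modeled in plain Python. This is a rough sketch rather than a description of the hardware: the helper names are hypothetical, Python lists stand in for memory and vector registers, and the index vectors hold 0-based element indices instead of byte offsets.</span></div>

```python
def gather(memory, index_vector):
    """Model of LVI: load the elements selected by index_vector into a dense vector."""
    return [memory[i] for i in index_vector]


def scatter(memory, index_vector, values):
    """Model of SVI: store a dense vector back to the selected positions."""
    for i, value in zip(index_vector, values):
        memory[i] = value


def sparse_add(a, c, k, m):
    """A(K(i)) = A(K(i)) + C(M(i)); assumes the indices in k are distinct."""
    va = gather(a, k)                     # LVI    Va, (Ra + Vk)
    vc = gather(c, m)                     # LVI    Vc, (Rc + Vm)
    va = [x + y for x, y in zip(va, vc)]  # ADDV.D Va, Va, Vc
    scatter(a, k, va)                     # SVI    (Ra + Vk), Va


if __name__ == "__main__":
    a = [0.0, 1.0, 0.0, 2.0, 0.0, 3.0]
    c = [10.0, 0.0, 20.0, 0.0, 30.0, 0.0]
    sparse_add(a, c, k=[1, 3, 5], m=[0, 2, 4])
    print(a)
```

<div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">If k contained a repeated index, the gathered copies would both start from the old value and one update would be lost, which is exactly the dependence problem a vectorizing compiler has to rule out.</span></div>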
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">This technique allows code with sparse matrices to run in vector mode. A simple vectorizing compiler cannot automatically vectorize the source code above, because the compiler has no way of knowing that the elements of K are distinct values and therefore that no dependences exist [1]. Instead, the programmer has to tell the compiler that the loop can run in vector mode.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">More sophisticated vectorizing compilers can vectorize the loop automatically, without programmer intervention, by inserting run-time checks for data dependences. These run-time checks are implemented with a vectorized software version of the Advanced Load Address Table (ALAT) hardware found in the Itanium processor; the ALAT is described in Appendix G. The ALAT hardware is replaced by a software hash table that detects whether two element accesses within the same strip-mined iteration refer to the same address. If no dependence is detected, the strip-mined iteration can complete using the maximum vector length (MVL). If a dependence is detected, the vector length is reset to a smaller value that avoids the dependence, and the remaining elements are left for the next iteration of the loop. Although this scheme adds considerable software overhead to the loop, the overhead is amortized over the far more common case in which there are no dependences, so the loop still runs much faster than scalar code (although much slower than when the programmer states directly that the loop can be vectorized).</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
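<div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">The control flow of such a run-time check can be illustrated with a scalar Python sketch. It is not the compiler's actual generated code: the function names are hypothetical, a Python set plays the role of the software hash table, and the check itself is written as a scalar loop rather than in vectorized form. Only the index vector K is checked, since C is only read.</span></div>

```python
def safe_vector_length(index_vector, start, mvl):
    """Length of the longest prefix of the strip with no repeated index."""
    seen = set()  # stands in for the software hash table
    length = 0
    for idx in index_vector[start:start + mvl]:
        if idx in seen:  # dependence: same address already touched in this strip
            break
        seen.add(idx)
        length += 1
    return length


def strip_mined_update(a, c, k, m, mvl):
    """A[K[i]] += C[M[i]] in strips of at most mvl conflict-free elements."""
    strips = []
    i, n = 0, len(k)
    while i < n:
        vl = safe_vector_length(k, i, mvl)  # always >= 1 while i < n
        # One "vector" step: gather and add all operands, then scatter.
        sums = [a[k[j]] + c[m[j]] for j in range(i, i + vl)]
        for j, value in zip(range(i, i + vl), sums):
            a[k[j]] = value
        strips.append(vl)
        i += vl  # leftover elements go to the next strip
    return strips


if __name__ == "__main__":
    a = [0, 0, 0, 0]
    print(strip_mined_update(a, [1, 2, 3, 4, 5], [0, 1, 1, 2, 0], [0, 1, 2, 3, 4], mvl=4))
    print(a)
```

<div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">When K has no repeats, every strip runs at the full length; a repeat simply ends the current strip early, so the result matches the scalar loop.</span></div>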
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Most recent supercomputers support scatter-gather. These operations are slower than strided accesses because they are more complex to implement and more prone to bank conflicts, but they are still much faster than the scalar version. If the sparsity pattern of a matrix changes, the index vector must be recomputed. Many processors provide a fast way to compute the index vector. The CVI instruction in VMIPS creates an index vector from a given stride (m); its elements are 0, m, 2 × m, ..., 63 × m. Some processors provide an instruction that creates a compressed index vector, whose entries correspond to the positions where the mask register holds a 1. Others provide an instruction that compresses a vector. In VMIPS, we define the CVI instruction to always create a compressed index vector according to the vector mask. When every mask bit is 1, it creates a standard index vector.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
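<div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">The behaviour of CVI described above can be sketched in a few lines of Python. This is only a model of the semantics, not of the hardware; the function name cvi and the list-of-bits mask representation are assumptions made for illustration.</span></div>

```python
def cvi(stride, mask):
    """Model of CVI: build a compressed index vector under a vector mask.

    `mask` is a list of 0/1 values standing in for the vector-mask register
    (an assumed representation). Element i contributes the index i * stride
    only when its mask bit is 1, and the surviving indices are packed together.
    """
    return [i * stride for i, bit in enumerate(mask) if bit]


# With an all-ones mask the result is the standard index vector 0, m, 2m, ...
full = cvi(8, [1] * 64)

# With a sparse mask only the selected positions survive.
packed = cvi(8, [0, 1, 1, 0, 1])
```

<div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Here full holds 64 entries running from 0 to 63 × 8, while packed holds only the three indices whose mask bit is set.</span></div>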
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Indexed loads and stores (L/S) together with the CVI instruction provide another way to support conditional vector execution. Here is an alternative implementation of the loop from the previous subsection:</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> LV V1, Ra ;load vector A into V1</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> L.D F0, #0 ;load FP zero into F0</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> SNEVS.D V1, F0 ;sets the VM to 1 if V1(i) != F0</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> CVI V2, #8 ;generates indices in V2</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> POP R1, VM ;find the number of 1's in VM</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> MTC1 VLR, R1 ;load vector-length register</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> CVM ;clears the mask</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> LVI V3, (Ra + V2) ;load the nonzero A elements</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> LVI V4, (Rb + V2) ;load corresponding B elements</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> SUBV.D V3, V3, V4 ;do the subtract</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; 
margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> SVI (Ra + V2), V3 ;store A back</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
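<div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">As a rough cross-check of the sequence above, the following Python sketch walks through the same steps on ordinary lists. The function name is hypothetical, and indices are in element units rather than the byte offsets (#8) used by the assembly.</span></div>

```python
def sub_nonzero_scatter_gather(a, b):
    """A(i) = A(i) - B(i) wherever A(i) != 0, using gather / compute / scatter."""
    mask = [x != 0 for x in a]                      # SNEVS.D: set VM(i) if A(i) != 0
    idx = [i for i, bit in enumerate(mask) if bit]  # CVI: compressed index vector
    vlr = len(idx)                                  # POP + MTC1: new vector length
    va = [a[i] for i in idx]                        # LVI: gather the nonzero A elements
    vb = [b[i] for i in idx]                        # LVI: gather the matching B elements
    vr = [x - y for x, y in zip(va, vb)]            # SUBV.D on vlr elements only
    out = list(a)
    for i, r in zip(idx, vr):                       # SVI: scatter the results back
        out[i] = r
    return out, vlr


result, active = sub_nonzero_scatter_gather([3.0, 0.0, 5.0, 0.0],
                                            [1.0, 9.0, 2.0, 9.0])
```

<div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Only the two nonzero elements of A take part in the subtraction; the zero elements are never loaded, computed, or stored.</span></div>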
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Whether the scatter-gather version or the conditionally executed version is better depends on how often the condition holds and on the cost of the operations. Ignoring chaining, the running time of the first version (previous subsection) is 5n + c1. The running time of the second version, assuming indexed L/S that processes one element per clock cycle, is 4n + 4fn + c2 [2], where f is the fraction of elements for which the condition is true (that is, A(i) != 0). If we assume that c1 and c2 are comparable, or that both are much smaller than n, we can work out when the second version is better.</span></div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> Time1 = 5n</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> Time2 = 4n + 4fn</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">We want Time1 > Time2, which requires</span></div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> 5n > 4n + 4fn</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> 1/4 > f</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">In other words, the second version is better when fewer than 1/4 of the elements are nonzero. In many cases the fraction that satisfies the condition is far lower. If the index vector can be reused, or if the number of vector statements inside the if statement [3] grows, the advantage of scatter-gather increases sharply.</span></div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
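<div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">The break-even analysis can be checked numerically with a small sketch. The function names are made up for illustration, and the constants c1 and c2 default to zero, matching the simplification used in the text.</span></div>

```python
def time_masked(n, c1=0):
    """Running time of the vector-mask version: 5n + c1."""
    return 5 * n + c1


def time_scatter_gather(n, f, c2=0):
    """Running time of the scatter-gather version: 4n + 4fn + c2.

    f is the fraction of elements that satisfy the condition A(i) != 0.
    """
    return 4 * n + 4 * f * n + c2


def scatter_gather_wins(n, f):
    """True when Time1 > Time2, i.e. when f is below 1/4 (with c1 = c2 = 0)."""
    return time_masked(n) > time_scatter_gather(n, f)


sparse_case = scatter_gather_wins(64, 0.10)      # f well below 1/4
break_even_case = scatter_gather_wins(64, 0.25)  # exactly at the threshold
dense_case = scatter_gather_wins(64, 0.50)       # f above 1/4
```

<div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">At f = 0.25 both versions take 320 cycles for n = 64, so neither wins; below that threshold the scatter-gather version comes out ahead.</span></div>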
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: center;"><span lang="en-US" style="font-family: Calibri;">---------------Hello everyone, I am the dividing line---------------</span></div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">[1] If two elements of K have the same value, they refer to operations on the same element of A. A simple compiler will assume that the two operations carry a data dependence and will not parallelize them automatically.</span></div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">[2] Why 4n + 4fn? In the second version, the vector instructions that must execute in full (that is, traverse every element of the vector) are LV, SNEVS.D, CVI, and POP; the instructions that execute only partially (that is, operate only on the elements that satisfy the condition) are the two LVIs, SUBV.D, and SVI. The fraction of elements that satisfy the condition is f.</span></div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">[3] That is, the statements that execute once the condition test is satisfied. In this example, if the A(i) != 0 case performed a series of complex operations rather than a simple subtraction, the advantage of indexed L/S would be more pronounced. The reason is that the benefit of indexed L/S over the mask-only version lies precisely in shortening the execution time of the instructions under the if statement (only the elements that satisfy the condition are operated on). As the number of instructions under the if statement grows, so does this advantage. Readers can verify this qualitatively with the analytical model given above.</span></div>Yuhaohttp://www.blogger.com/profile/08569555359590748704noreply@blogger.com0tag:blogger.com,1999:blog-1107874072437767466.post-11062999456668567762011-01-06T14:51:00.000-08:002011-01-06T14:51:19.397-08:00Vector Processors (7)<div style="font-size: 18pt; font-weight: bold; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">4. 
</span><span lang="en-US" style="font-family: Calibri;">Improving Vector Processor Performance</span></div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">In this section we present five techniques for improving the performance of a vector processor. The first, called chaining, makes a sequence of dependent vector operations run faster. It originated in the Cray-1, but most vector processors now support it. The next two techniques expand the class of loops that can be vectorized by introducing new types of vector instructions that handle conditional execution and sparse matrices. The fourth raises the peak performance of a vector processor by adding more parallel execution units in the form of additional lanes. The fifth reduces start-up overhead by pipelining and overlapping instruction start-up.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-size: 14pt; font-weight: bold; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">4.1 Chaining: extending the concept of forwarding to vector registers</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Consider the following simple vector sequence:</span></div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> MULV.D V1, V2, V3</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> ADDV.D V4, V1, V5</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">In VMIPS, as we have seen, these two instructions must be placed in two separate convoys because the second depends on the first. On the other hand, if the vector register, V1 in this case, is treated not as a single entity but as a group of individual registers, then the concept of forwarding can be extended to work on each element of the vector. This idea, which lets ADDV.D start earlier, is called chaining. Chaining allows a vector operation to begin on an element as soon as that element of its source vector is ready: within a chain, the results of the first functional unit are forwarded to the next one. In practice, chaining is usually implemented by allowing the processor to read and write different elements of the same vector register at the same time. Early implementations of chaining did work much like forwarding, but this restricted the timing of the source and destination instructions in the chain. Recent implementations use flexible chaining, which lets a vector instruction chain to any other active vector instruction as long as no structural hazard arises [1]. Flexible chaining requires several instructions to access the same vector register simultaneously, which can be implemented either by adding read and write ports or by organizing the vector-register file into banks, much like the banks of a memory system. We assume this type of chaining throughout this appendix.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Even though a group of operations depend on one another, chaining lets the operations on different elements proceed in parallel. This allows the group to be scheduled into a single convoy, which reduces the number of chimes. For the previous example, a sustained rate of 2 floating-point operations per clock cycle, or one chime, can be achieved (ignoring start-up overhead), even though the operations are dependent! The total running time is:</span></div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: center;"><span lang="en-US" style="font-family: Calibri;">Vector length + start-up time of ADDV + start-up time of MULV</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Figure F.10 shows the timing of the chained and unchained versions of the example above, with a vector length of 64. The new convoy still needs only one chime, but because it uses chaining, the start-up time shows up clearly in the actual timing. In Figure F.10, the chained version takes a total of 77 clock cycles, or an average of 1.2 cycles per result. Since 128 floating-point operations are performed in that time, we achieve 1.7 FLOPS per clock cycle. The unchained version takes 141 clock cycles in total, or 0.9 FLOPS per clock cycle.</span></div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="zh-CN" style="font-family: 宋体;"><br />
</span></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLSDdJQ810y4_EWfP0ndGVaS1JaREabt70x8WYCzS11k5NwspmsqDprcUqpUiQPzOjM-XIYgY4cprZ9R96Ynd9OWt99ODeuxfK1dyDSOJRZfVuE4rfG5upX_qpngcsJBMIOurmmu8MTvhv/s1600/F10.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="170" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLSDdJQ810y4_EWfP0ndGVaS1JaREabt70x8WYCzS11k5NwspmsqDprcUqpUiQPzOjM-XIYgY4cprZ9R96Ynd9OWt99ODeuxfK1dyDSOJRZfVuE4rfG5upX_qpngcsJBMIOurmmu8MTvhv/s400/F10.PNG" width="400" /></a></div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="zh-CN" style="font-family: 宋体;"><br />
</span></div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="zh-CN" style="font-family: 宋体;"></span></div><div style="font-size: 11.0pt; font-style: italic; margin: 0in;"><span lang="en-US" style="font-family: Calibri; font-weight: bold;">Figure F.10 Timing of a dependent ADDV and MULV vector sequence, chained and unchained.</span><span lang="en-US" style="font-family: Calibri;"> The 6- and 7-clock-cycle delays in the figure are the latencies of the add and the multiply, respectively.</span></div><br />
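<div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">The cycle counts quoted for Figure F.10 follow directly from the running-time formula. The sketch below reproduces them under the assumption that the unchained version simply runs the two instructions back to back; the function names are illustrative only.</span></div>

```python
def unchained_cycles(n, mul_startup, add_startup):
    """Each instruction pays its start-up and then streams n elements in turn."""
    return (mul_startup + n) + (add_startup + n)


def chained_cycles(n, mul_startup, add_startup):
    """Vector length + ADDV start-up + MULV start-up (the formula in the text)."""
    return n + add_startup + mul_startup


n = 64           # vector length used in Figure F.10
mul_startup = 7  # multiply latency from the figure caption
add_startup = 6  # add latency from the figure caption
flops = 2 * n    # one multiply and one add per element

chained = chained_cycles(n, mul_startup, add_startup)
unchained = unchained_cycles(n, mul_startup, add_startup)

cycles_per_result = round(chained / n, 1)
chained_flops_per_cycle = round(flops / chained, 1)
unchained_flops_per_cycle = round(flops / unchained, 1)
```

<div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">With these inputs the chained sequence takes 77 cycles and the unchained one 141, matching the figures in the text.</span></div>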
<div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span class="Apple-style-span" style="font-family: 'Times New Roman';"><span lang="en-US" style="font-family: Calibri;">Although chaining reduces the execution time measured in chimes by placing two dependent operations in the same convoy, it does not reduce start-up latency. If we want an accurate running time, we must account for the start-up overhead. With chaining, the number of chimes for a vector sequence is determined by the number of different functional units available and the number the program actually requires. Note in particular that no convoy may contain a structural hazard. This means that on a processor like VMIPS, which has only one vector load-store unit, a sequence containing two vector memory-access instructions must occupy at least two convoys and therefore take at least two chimes.</span></span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">We will see in Section 6 that chaining plays a major role in boosting performance. In fact, chaining is so important that every modern vector processor supports flexible chaining.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-size: 14pt; font-weight: bold; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">4.2 Conditionally Executed Statements</span></div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">From Amdahl's law, we know that the speedup is very limited when only a small or moderate part of a program can be vectorized. Two things that limit vectorization are conditional code inside loops and the use of sparse matrices. Programs with if statements inside loops cannot run in vector mode, because the if statements introduce control dependences into the loop. Likewise, sparse matrices cannot be implemented efficiently with the techniques we have discussed so far. We discuss how to handle conditional execution in this subsection and sparse matrices in the next.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Consider the following loop:</span></div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> <span style="font-weight: bold;">do</span> 100 i = 1, 64</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> <span style="font-weight: bold;">if</span> (A(i) .ne. 0) <span style="font-weight: bold;">then</span></div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> A(i) = A(i) - B(i)</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> <span style="font-weight: bold;">end if</span></div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> 100 <span style="font-weight: bold;">continue</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">This loop cannot normally be vectorized because its body executes conditionally. If the inner loop could be run over only the iterations for which A(i) != 0, however, the subtraction could be vectorized. In Appendix G, we saw that conditionally executed instructions are not part of the normal instruction set. They can turn control dependences into data dependences, which increases the parallelism available in a loop. Vector processors can benefit in a similar way.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">The extension commonly used for this purpose is called vector-mask control. Vector-mask control uses a Boolean vector of length MVL to control the execution of a vector instruction, just as a conditionally executed instruction uses a Boolean condition to decide whether it executes. When the vector-mask register is enabled, any vector instruction operates only on the vector elements whose corresponding entries in the vector-mask register are 1. Entries in the destination vector register that correspond to a 0 in the mask register are unaffected by the vector operation. If the vector-mask register is set by the result of a condition, only the elements that satisfy the condition are affected. Clearing the register sets every entry to 1, so that subsequent vector instructions operate on all vector elements. The following code implements the loop above, assuming that the starting addresses of A and B are in Ra and Rb, respectively.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> LV V1, Ra ;load vector A into V1</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> LV V2, Rb ;load vector B</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> L.D F0, #0 ;load FP zero into F0</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> SNEVS.D V1, F0 ;set VM(i) to 1 if V1(i) != F0</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> SUBV.D V1, V1, V2 ;subtract under vector mask [2]</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> CVM ;set the vector mask to all 1s</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> SV Ra, V1 ;store the result in A</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
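<div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">To make the masked subtraction concrete, here is a small Python model of the sequence above. It is a sketch of the semantics only; the function name and the use of plain lists for vector registers are assumptions.</span></div>

```python
def masked_sub(a, b):
    """A(i) = A(i) - B(i) wherever A(i) != 0, using a vector mask."""
    v1 = list(a)                     # LV V1, Ra
    v2 = list(b)                     # LV V2, Rb
    vm = [x != 0 for x in v1]        # SNEVS.D: VM(i) = 1 if V1(i) != 0
    # SUBV.D under the mask: every element position is visited, but positions
    # whose mask bit is 0 keep their old destination value.
    v1 = [x - y if m else x for x, y, m in zip(v1, v2, vm)]
    vm = [True] * len(v1)            # CVM: reset the mask to all 1s
    return v1                        # SV Ra, V1


masked_result = masked_sub([3.0, 0.0, 5.0, 0.0], [1.0, 9.0, 2.0, 9.0])
```

<div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Unlike a gather-based approach, this version still steps through all elements, including those whose mask bit is 0.</span></div>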
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Most recent vector processors provide vector-mask control. The form of vector masking described here can be found on most processors, but some others allow the vector mask to apply to only a subset of the vector instructions.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Using a vector-mask register does have disadvantages, however. First, when we discussed conditionally executed instructions, we saw that such an instruction still takes time even when its condition is not satisfied. Even so, eliminating the branch and the associated control dependences can make a conditional instruction faster, although it sometimes does useless work. Similarly, a vector instruction executed under a vector mask still takes execution time for the elements whose mask value is 0. By the same reasoning, even when most of the mask values are 0, vector-mask control may still be much faster than scalar mode. In fact, the large potential performance gap between vector mode and scalar mode makes it critical to include vector-mask instructions.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Second, in some vector processors the vector mask only disables the write of the result into the destination register, while the operation itself is still performed. If the operation in the preceding example were a divide rather than a subtract, and the test were on B rather than A, a floating-point exception caused by a division by 0 could therefore be raised. Processors that mask the operation as well as the write-back avoid this problem.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
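<div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">The difference between the two masking styles can be sketched as follows. Both function names are hypothetical, and Python's ZeroDivisionError merely stands in for the hardware floating-point exception; real hardware may behave differently in detail.</span></div>

```python
def divide_writeback_masked(a, b, vm):
    # Every element operation is performed; the mask only blocks the write-back.
    quotients = [x / y for x, y in zip(a, b)]  # raises if any b(i) is 0
    return [q if m else x for q, x, m in zip(quotients, a, vm)]


def divide_fully_masked(a, b, vm):
    # Masked-off elements are never computed, so a zero divisor is harmless.
    return [x / y if m else x for x, y, m in zip(a, b, vm)]


a = [8.0, 5.0, 9.0]
b = [2.0, 0.0, 3.0]
vm = [1 if y != 0.0 else 0 for y in b]  # the test is on B, not A

print(divide_fully_masked(a, b, vm))  # [4.0, 5.0, 3.0]
try:
    divide_writeback_masked(a, b, vm)
except ZeroDivisionError:
    print("exception raised by a masked-off element")
```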
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: center;"><span lang="en-US" style="font-family: Calibri;">---------------Hello, I am the dividing line---------------</span></div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">[1] Flexible chaining, put simply, relaxes the restrictions on chaining. The original form of chaining applied only to two adjacent instructions. Because of the way a program happens to be written, however, two instructions that could be chained may end up separated and so cannot use chaining. Flexible chaining allows such nonadjacent instructions to overlap their execution.</span></div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">[2] Think about how the hardware might execute this instruction.</span></div>Yuhaohttp://www.blogger.com/profile/08569555359590748704noreply@blogger.com0tag:blogger.com,1999:blog-1107874072437767466.post-58133102946733995172011-01-06T08:00:00.000-08:002011-01-06T08:02:00.064-08:00Vector Processors (6)<div style="font-size: 14pt; font-weight: bold; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">3.2 Vector Stride</span></div><div lang="en-US" style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;">The second problem this section addresses is that adjacent elements of a vector are not necessarily sequential in memory. Consider the following straightforward code for matrix multiplication:</div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> <b>do</b> 10 i = 1, 100</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> <b>do</b> 10 j = 1, 100</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> A(i, j) = 0.0</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> <b>do</b> 10 k = 1, 100</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"> 10 A(i, j) = A(i, j) + B(i, k) * C(k, j)</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">At the statement labeled 10, we can vectorize the multiplication of each row of B with each column of C and strip-mine the inner loop with k as the index variable.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">To do so, we must consider how adjacent elements in B and adjacent elements in C are addressed. When an array is allocated in memory it is linearized, and it must be laid out in either row-major or column-major order. This linearization means that either the elements of a row or the elements of a column are not adjacent in memory. For example, if the loop above is written in FORTRAN, which lays out arrays in column-major order, the elements of B accessed by successive iterations of the inner loop are separated by the row size (100 elements) times 8 (the number of bytes per element), a total of 800 bytes. In Chapter 5 we saw that blocking can be used to improve locality in cache-based systems. A vector processor has no cache, so we need another technique to fetch elements of a vector that are not adjacent in memory.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">The distance separating elements that are to be gathered into a single register is called the stride. In this example, column-major layout means that matrix C has a stride of 1, or 1 double word (8 bytes), between successive elements, and matrix B has a stride of 100, or 100 double words (800 bytes).</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
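<div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">The two strides can be checked with a little address arithmetic. The sketch below assumes 100 x 100 matrices of 8-byte elements in column-major order; column_major_offset is a helper invented for this illustration.</span></div>

```python
ELEMENT_BYTES = 8   # one double word
ROWS = 100          # leading dimension of the 100 x 100 matrices


def column_major_offset(i, j, rows=ROWS, elem=ELEMENT_BYTES):
    """Byte offset of element (i, j), 1-based indices, column-major layout."""
    return ((j - 1) * rows + (i - 1)) * elem


# Inner loop over k: B(i, k) walks along a row, C(k, j) walks down a column.
stride_b = column_major_offset(1, 2) - column_major_offset(1, 1)
stride_c = column_major_offset(2, 1) - column_major_offset(1, 1)
print(stride_b, stride_c)  # 800 8
```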
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Once a vector is loaded into a vector register, it behaves as if its elements were logically adjacent. A vector-register processor can therefore handle strides greater than 1, called nonunit strides, using only vector load and store operations with stride capability. This ability to access nonsequential memory locations and reshape them into a dense structure is one of the major advantages of a vector processor over a cache-based processor. Caches inherently deal with unit-stride data [1], so although increasing the block size can lower miss rates for large data sets accessed with unit stride, it can hurt data accessed with a nonunit stride. Blocking techniques can solve some of these problems (see Section 5.2), but the ability to access nonadjacent data efficiently remains an advantage of vector processors on such problems.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">On VMIPS, where the addressable unit is a byte, the stride in our example is 800. The value must be computed dynamically, because the size of the matrix may not be known at compile time or, just like the vector length, may change across different executions of the same statement. The vector stride, like the vector starting address, can be placed in a general-purpose register. The VMIPS instruction LVWS (load vector with stride) then fetches the vector into a vector register. Likewise, a nonunit-stride vector is stored with the instruction SVWS (store vector with stride). On some vector processors the load and store instructions always take a stride value held in a register, so a single load instruction and a single store instruction suffice. Unit-stride accesses are far more frequent than nonunit-stride accesses and can benefit from special handling in the memory system, so they are often separated from nonunit-stride accesses, as in VMIPS.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
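<div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">As a rough software model of strided loads and stores, the sketch below gathers and scatters elements of a flat list. It is an assumption-laden illustration: the functions lvws and svws only borrow the instruction names, and memory is indexed by element rather than by byte to keep the example short.</span></div>

```python
def lvws(memory, base, stride, vl=64):
    """Gather vl elements starting at index base, stride elements apart."""
    return [memory[base + i * stride] for i in range(vl)]


def svws(memory, base, stride, vreg):
    """Scatter the elements of vreg back to memory with the given stride."""
    for i, value in enumerate(vreg):
        memory[base + i * stride] = value


# A 4 x 4 column-major matrix stored as a flat list of element slots.
rows = 4
memory = [float(n) for n in range(16)]
row0 = lvws(memory, base=0, stride=rows, vl=4)  # first row: stride = rows
print(row0)  # [0.0, 4.0, 8.0, 12.0]
```

<div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Once gathered, the row sits densely in the list, which mirrors how a strided vector looks contiguous inside a vector register.</span></div>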
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Supporting strides greater than 1 complicates the memory system. In the earlier example we saw that unit-stride accesses can run at full speed if the number of memory banks is at least as large as the bank busy time in clock cycles. Once nonunit strides are introduced, however, requests to the same bank may arrive more often than the bank busy time allows. When multiple accesses contend for one bank, a memory bank conflict occurs and one access must stall. A bank conflict, and hence a stall, occurs if:</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Number of banks / Greatest common divisor (Stride, Number of banks) &lt; Bank busy time. The quotient on the left is the number of distinct banks the access stream visits before it returns to the same bank.</span></div><div style="font-family: Calibri; font-size: 11pt; font-weight: bold; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;">_____________________________________________________</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri; font-weight: bold;">Example</span><span lang="en-US" style="font-family: Calibri;">: Suppose we have 8 memory banks with a bank busy time of 6 clock cycles and a total memory latency of 12 cycles. How long does it take to complete a 64-element vector load with a stride of 1? With a stride of 32? [2]</span></div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri; font-weight: bold;">Answer</span><span lang="en-US" style="font-family: Calibri;">: With a stride of 1, the number of banks is greater than the bank busy time, so the load takes 12 + 64 = 76 clock cycles, or about 1.2 clock cycles per element. The worst possible stride is a multiple of the number of memory banks, as in this case with a stride of 32 and 8 memory banks. Every access to memory after the first collides with the previous access and must wait for the 6-cycle bank busy time. The total time is 12 + 1 + 6 × 63 = 391 clock cycles [3], or about 6.1 clock cycles per element.</span><br />
<span lang="zh-CN" style="font-family: 宋体;"></span><span class="Apple-style-span" style="font-family: Calibri; font-weight: bold;">_____________________________________________________</span><br />
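<div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">The two totals in the worked example can be reproduced with a small timing model. It is only one model that happens to match the numbers, under these assumptions: banks are word-interleaved, the stride is measured in words, at most one address is issued per cycle, the first address is issued in cycle 1, and the last word arrives one memory latency after its address is issued. The function name vector_load_cycles is invented for this sketch.</span></div>

```python
def vector_load_cycles(n_elems, stride, n_banks=8, busy=6, latency=12):
    """Cycles to load n_elems words, issuing at most one address per cycle."""
    bank_free = [0] * n_banks      # first cycle at which each bank is idle again
    issue = 0                      # cycle in which the previous address issued
    for i in range(n_elems):
        bank = (i * stride) % n_banks
        issue = max(issue + 1, bank_free[bank])   # stall while the bank is busy
        bank_free[bank] = issue + busy
    return issue + latency         # the last word arrives `latency` cycles later


print(vector_load_cycles(64, stride=1))   # 76
print(vector_load_cycles(64, stride=32))  # 391
```

<div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Under this model the extra 1 in the stride-32 total is simply the cycle in which the first address is issued; the 63 later accesses each wait out the 6-cycle busy time.</span></div>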
<span class="Apple-style-span" style="font-family: Calibri; font-weight: bold;"><br />
</span></div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Memory bank conflicts do not occur if the stride and the number of banks are relatively prime and there are enough banks to avoid conflicts in the unit-stride case. When there are no bank conflicts, multiword and unit-stride accesses run at the same rate. Increasing the number of memory banks beyond the minimum needed to prevent stalls with a stride of 1 reduces the stall frequency for some other strides. For example, with 64 banks a stride of 32 stalls on every other access rather than on every access. If we originally had a stride of 8 and 16 banks, every other access would stall; with 64 banks, a stride of 8 stalls on every eighth access. If we have multiple memory pipelines [4] or multiple processors sharing the same memory system, we need still more banks to prevent conflicts. Even a machine with a single memory pipeline and unit-stride accesses can experience conflicts between the last few elements of one instruction and the first few elements of the next, and increasing the number of banks reduces the probability of these inter-instruction conflicts. In 2006, most vector supercomputers spread the accesses from each CPU across hundreds of memory banks. Because bank conflicts can still occur in nonunit-stride cases, programmers favor unit-stride accesses whenever possible.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
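<div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">The stall frequencies quoted above follow from counting how many distinct banks a strided stream visits before it returns to the same bank, which is the number of banks divided by the greatest common divisor of the stride and the number of banks. The sketch assumes word-interleaved banks and a stride measured in words; both helper names are illustrative.</span></div>

```python
from math import gcd


def distinct_banks(stride, n_banks):
    """Number of different banks a strided access stream touches."""
    return n_banks // gcd(stride, n_banks)


def may_stall(stride, n_banks, busy):
    """True when the stream returns to a bank before its busy time has passed."""
    return busy > distinct_banks(stride, n_banks)


for stride, n_banks in [(32, 8), (32, 64), (8, 16), (8, 64), (1, 8)]:
    print(stride, n_banks, distinct_banks(stride, n_banks),
          may_stall(stride, n_banks, busy=6))
```

<div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">A stream that cycles through only 2 banks stalls on every other access, and one that cycles through 8 banks has no stall at all when the busy time is 6 cycles.</span></div>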
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">A modern supercomputer may have dozens of CPUs, each with multiple memory pipelines connected to thousands of memory banks. Providing a dedicated path between every memory pipeline and every memory bank would be impractical, so a multistage switching network is typically used to connect the memory pipelines to the memory banks. Congestion can arise in this network when different vector accesses contend for the same circuit paths, causing additional stalls in the memory system.</span></div><div style="font-family: Calibri; font-size: 11pt; font-weight: bold; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><div style="text-align: center;"><span lang="en-US" style="font-family: Calibri;">---------------Hello, I am the dividing line---------------</span></div></div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">[1] Unit stride means a stride of 1. The gist of this passage is that a cache performs very poorly when the data being accessed lacks good locality.</span></div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">[2] Think about it: under what circumstances would the memory access latency be twice the bank busy time?</span></div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">[3] Why the extra 1 here? And why is the previous case 12 + 64 rather than 12 + 63? The extra 1 may be the time the data takes to travel from the memory system through the interconnection network to the processor.</span></div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">[4] I consider the memory pipeline an important but vague concept; see </span><a href="http://yuhaozhu.blogspot.com/2010/12/terminology.html"><span lang="en-US" style="font-family: Calibri;">this dedicated page</span></a><span lang="en-US" style="font-family: Calibri;">.</span></div>Yuhaohttp://www.blogger.com/profile/08569555359590748704noreply@blogger.com0tag:blogger.com,1999:blog-1107874072437767466.post-85980069310931756172011-01-04T21:18:00.000-08:002011-01-05T09:28:19.624-08:00Vector Processors (5)<div style="font-size: 18pt; font-weight: bold; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">3. Two Real-World Issues: Vector Length and Stride</span></div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">This section deals with two issues that arise in real programs: What do we do when the vector length in a program is not exactly 64? And how do we deal with vector elements that are not adjacent in memory? Let us first consider the issue of vector length.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-size: 14pt; font-weight: bold; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">3.1 Vector-Length Control</span></div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">A vector-register processor has a natural vector length determined by the number of elements in each vector register. This length, which is 64 for VMIPS, is unlikely to match the real vector length in a program. Moreover, the length of a particular vector operation in a real program is often unknown at compile time. In fact, a single piece of code may require different vector lengths. For example, consider this code:</span></div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span class="Apple-style-span" style="font-family: 宋体;"> </span><span class="Apple-style-span" style="font-family: Calibri;"><b>do</b> 10 i = 1, n</span></div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span class="Apple-style-span" style="font-family: Calibri;"> 10 Y(i) = a * X(i) + Y(i)</span></div><div><div style="text-align: justify;"><span class="Apple-style-span" style="font-family: Calibri; font-size: 15px;"></span><span class="Apple-style-span" style="font-size: 15px;"><span lang="en-US" style="font-family: Calibri;">The size of all the vector operations depends on n, which may not even be known until run time! The value of n might also be a parameter to a procedure containing the loop above and therefore be subject to change during execution.</span></span></div><div><div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">The solution to these problems is to add a vector-length register (VLR). The VLR controls the length of any vector operation, including a vector load or store. The value in the VLR, however, cannot be greater than the length of the vector registers. This solves our problem as long as the real length is less than or equal to the maximum vector length (MVL) defined by the processor.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
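<div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">A minimal sketch of a length-controlled vector operation follows, assuming an MVL of 64 as on VMIPS. The function daxpy_one_vector and its vlr parameter are made up for illustration; they model the VLR as an ordinary argument rather than a hardware register.</span></div>

```python
MVL = 64  # maximum vector length of the modeled machine


def daxpy_one_vector(a, x, y, vlr):
    """Y = a * X + Y on the first vlr elements; the rest are left untouched."""
    if vlr > MVL:
        raise ValueError("VLR cannot exceed MVL")
    for i in range(vlr):
        y[i] = a * x[i] + y[i]
    return y


y = daxpy_one_vector(2.0, [1.0] * 64, [1.0] * 64, vlr=10)
print(y[:12])
```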
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">If the value of n is not known at compile time, it may be greater than MVL. To handle this case we use a technique called strip mining. Strip mining is a code-generation technique in which each vector operation is carried out as a series of vector sub-operations, each of length no greater than MVL. We can strip-mine a loop in much the same way that we unroll loops (see Appendix G): create one loop that iterates over pieces of length MVL and another loop that handles the remaining elements, which must number fewer than MVL. In practice, compilers usually create a single parameterized loop that handles both cases by changing the length dynamically. The strip-mined version of the DAXPY loop is shown below, written in FORTRAN (the major language of scientific computing) with C-style comments:</span></div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 1.5in; margin-right: 0in; margin-top: 0in; text-align: justify;">low = 1</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 1.5in; margin-right: 0in; margin-top: 0in; text-align: justify;">VL = (n mod MVL) /*find the odd-size piece*/</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 1.5in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span style="font-weight: bold;">do</span> 1 j = 0, (n / MVL) /*outer loop*/</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 1.875in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span style="font-weight: bold;">do</span> 10 i = low, low + VL - 1 /*runs for length VL*/</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 2.25in; margin-right: 0in; margin-top: 0in; text-align: justify;">Y(i) = a * X(i) + Y(i) /*main operation*/</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 1.125in; margin-right: 0in; margin-top: 0in; text-align: justify;">10<span style="font-weight: bold;"> continue</span></div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 1.875in; margin-right: 0in; margin-top: 0in; text-align: justify;">low = low + VL /*start of next vector*/</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 1.875in; margin-right: 0in; margin-top: 0in; text-align: justify;">VL = MVL /*reset the length to max*/</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 1.125in; margin-right: 0in; margin-top: 0in; text-align: justify;">1<span style="font-weight: bold;"> continue</span></div>
<div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">The term n / MVL denotes truncating integer division (which is what FORTRAN does) and is used throughout this section. The effect of this loop is to split the vector into segments, which are then processed by the inner loop. The first segment has length (n mod MVL), and every subsequent segment has length MVL. Figure F.8 illustrates this.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
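The strip-mined loop can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name daxpy_strip_mined and the parameter mvl are hypothetical, indexing is 0-based rather than FORTRAN's 1-based, and a list slice stands in for one vector operation of length VL.

```python
def daxpy_strip_mined(a, x, y, mvl=64):
    """Compute y = a*x + y in place, one strip of at most `mvl` elements at a time.

    Returns the lengths of the non-empty strips so the blocking can be inspected.
    """
    n = len(x)
    low = 0            # 0-based start of the current strip
    vl = n % mvl       # odd-size first piece (0 when n is a multiple of mvl)
    strips = []
    for _ in range(n // mvl + 1):          # outer loop, as in the FORTRAN version
        # The slice below plays the role of one vector operation of length vl.
        y[low:low + vl] = [a * xi + yi
                           for xi, yi in zip(x[low:low + vl], y[low:low + vl])]
        if vl:
            strips.append(vl)
        low += vl      # start of the next strip
        vl = mvl       # every later strip uses the full length
    return strips


x = [float(i) for i in range(200)]
y = [1.0] * 200
strips = daxpy_strip_mined(2.0, x, y)
```

With n = 200 and MVL = 64 the first strip covers 8 elements and the remaining three cover 64 each, matching the segment layout described in the text.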
</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_fUKBJ0WqvaOUKtFWMDCOFPp_jqg4prV5hrjih9Bxfwv_RSTorEIkTuIVA7kCHzKXt4J3j3vvQKf6mpwQ3Vz1o13yY-4tAuteN1-Knk8KjxgUL9z-T0O0WNlOh0MGCvIMBKZ-4wn9Xcuo/s1600/F8.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="102" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_fUKBJ0WqvaOUKtFWMDCOFPp_jqg4prV5hrjih9Bxfwv_RSTorEIkTuIVA7kCHzKXt4J3j3vvQKf6mpwQ3Vz1o13yY-4tAuteN1-Knk8KjxgUL9z-T0O0WNlOh0MGCvIMBKZ-4wn9Xcuo/s400/F8.PNG" width="400" /></a></div><div style="margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-size: 11pt; font-style: italic; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri; font-weight: bold;">Figure F.8 A vector of arbitrary length processed with strip mining. </span><span lang="en-US" style="font-family: Calibri;">All segments except the first are of length MVL, making full use of the vector processor. In this figure, the variable m stands for the expression (n mod MVL).</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">The inner loop of the code above is vectorized with length VL, which equals either (n mod MVL) or MVL. The VLR must be set twice, once at each place where VL is assigned in the code. When multiple vector operations execute in parallel, the hardware must copy the value of the VLR to each vector functional unit at the moment a vector operation issues, in case the VLR is changed for a subsequent vector operation.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Some vector instruction sets already allow implementations with different MVLs. For example, the vector extension of the IBM 370 series mainframes supports any MVL between 8 and 512. It provides a load vector count and update (VLVCU) instruction to control strip-mined loops. VLVCU takes a single scalar register operand that specifies the desired vector length. The VLR is set to the smaller of the desired length and MVL, and that value is also subtracted from the scalar register, setting the condition codes to indicate whether the loop should terminate. In this way, object code can be moved unchanged between two different implementations while still making full use of the largest available vector length.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
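The behaviour described for VLVCU can be modelled in a few lines of Python. This is a rough sketch of the semantics as stated in the text, not of the real IBM instruction encoding: the function names vlvcu and strip_lengths are hypothetical, and the condition code is reduced to a single boolean.

```python
def vlvcu(remaining, mvl):
    """Model one VLVCU step.

    Sets the vector length to min(remaining, mvl), subtracts it from the
    scalar count, and reports whether the loop should terminate.
    """
    vlr = min(remaining, mvl)
    remaining -= vlr
    done = remaining == 0          # stands in for the condition code
    return vlr, remaining, done


def strip_lengths(n, mvl):
    """Return the vector length used by each pass of a VLVCU-controlled loop."""
    lengths = []
    remaining = n
    done = n == 0
    while not done:
        vlr, remaining, done = vlvcu(remaining, mvl)
        lengths.append(vlr)        # one strip of the vector body would run here
    return lengths


on_mvl_64 = strip_lengths(200, 64)
on_mvl_128 = strip_lengths(200, 128)
```

The same loop runs unchanged for either MVL, which is the portability property the text describes. Note that in this model the short strip comes last, whereas the FORTRAN strip-mined loop handles it first.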
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">In addition to start-up latency, we need to account for the overhead of strip mining. Assuming that a convoy cannot overlap with any other instructions, the strip-mining overhead, which comes from restarting the vector sequence and setting the VLR, adds to the effective start-up time. If that overhead is 10 cycles for a convoy, then the cost per 64 vector elements rises by 10 cycles, or 10 / 64 ≈ 0.16 cycles per element.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Two key factors determine the running time of a strip-mined loop consisting of a sequence of convoys:</span></div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">1. The number of convoys in the loop, which determines the number of chimes. We write Tchime for the execution time measured in chimes.</span></div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">2. The overhead of executing each sequence of convoys. This consists of the cost of the scalar code executed for each segment, Tloop, plus the start-up time of each convoy, Tstart.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">There may also be a fixed overhead for setting up the vector sequence the first time. In recent vector processors this overhead has become very small, so we ignore it.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">We write Tn for the total running time of a vector sequence operating on a vector of length n:</span></div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Tn = ⌈n / MVL⌉ × (Tloop + Tstart) + n × Tchime</span></div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify; vertical-align: sub;"><br />
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">The values of Tloop, Tstart, and Tchime depend on both the compiler and the processor. Based on a variety of measurements of the Cray-1, we choose 15 as the value of Tloop. At first glance, you might think this value is too small. The overhead in each loop iteration includes setting up the vector starting addresses and strides, incrementing the loop counter, and executing the loop branch. In practice, these scalar instructions can be totally or partially overlapped with the vector instructions, which minimizes this overhead. The value of Tloop of course depends on the loop structure, but that dependence is slight compared with the connection between the vector code and the values of Tstart and Tchime.</span></div><div style="font-family: Calibri; font-size: 11pt; font-weight: bold; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;">_____________________________________________________</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri; font-weight: bold;">Example: </span><span lang="en-US" style="font-family: Calibri;">What is the execution time on VMIPS of the vector operation A = B × s, where s is a scalar and the vectors A and B have length 200?</span></div>
<div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri; font-weight: bold;">Answer: </span><span lang="en-US" style="font-family: Calibri;">Assume that the starting addresses of A and B are initially in Ra and Rb and that s is in Fs. Recall also that in MIPS (and therefore VMIPS) R0 always holds 0. Since (200 mod 64) = 8, the first iteration of the strip-mined loop operates on a vector of length 8, and the following iterations operate on vectors of length 64. Each element is an 8-byte double-precision value, so the starting byte address of the next segment of each vector advances by eight times the vector length. Because the vector length is either 8 or 64, we add 8 × 8 = 64 to the address registers after the first segment and 8 × 64 = 512 after each later segment. The total number of bytes in each vector is 8 × 200 = 1600, and we test for completion by comparing the address of the next vector segment with the initial address plus 1600. Here is the actual code:</span></div>
<div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="zh-CN"><span class="Apple-style-span" style="font-family: Calibri;"><br />
</span></span></div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="zh-CN"></span><span lang="zh-CN"><span class="Apple-style-span" style="font-family: 宋体;"> </span><span class="Apple-style-span" style="font-family: Calibri;">DADDUI</span></span><span lang="en-US" style="font-family: Calibri;"> </span><span lang="zh-CN" style="font-family: Calibri;">R2,R0,#1600 </span><span lang="en-US" style="font-family: Calibri;"> </span><span lang="zh-CN" style="font-family: Calibri;">;total # bytes in vector</span></div><div lang="zh-CN" style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="zh-CN" style="font-family: Calibri;"></span><span class="Apple-style-span" style="font-family: Calibri;"><span lang="zh-CN"><span class="Apple-style-span" style="font-family: 宋体;"> </span>DADDU </span></span><span class="Apple-style-span" style="font-family: Calibri;"><span lang="en-US"> </span></span><span class="Apple-style-span" style="font-family: Calibri;"><span lang="zh-CN">R2,R2,Ra </span></span><span class="Apple-style-span" style="font-family: Calibri;"><span lang="en-US"> </span></span><span class="Apple-style-span" style="font-family: Calibri;"><span lang="zh-CN">;address of the end of A vector</span></span></div><div lang="zh-CN" style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span class="Apple-style-span" style="font-family: Calibri;"><span lang="zh-CN"></span></span><span class="Apple-style-span" style="font-family: Calibri;"><span lang="zh-CN"><span class="Apple-style-span" style="font-family: 宋体;"> </span>DADDUI </span><span lang="en-US"> </span><span lang="zh-CN">R1,R0,#8 </span><span lang="en-US"> </span><span lang="zh-CN">;loads length of 1st segment</span></span></div><div lang="zh-CN" style="font-size: 11pt; margin-bottom: 0in; 
margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span class="Apple-style-span" style="font-family: Calibri;"><span lang="zh-CN"></span></span><span class="Apple-style-span" style="font-family: Calibri;"><span lang="zh-CN"><span class="Apple-style-span" style="font-family: 宋体;"> </span>MTC1 </span><span lang="en-US"> </span><span lang="zh-CN">VLR,R1 </span><span lang="en-US"> </span><span lang="zh-CN">;load vector length in VLR</span></span></div><div lang="zh-CN" style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span class="Apple-style-span" style="font-family: Calibri;"><span lang="zh-CN"></span></span><span class="Apple-style-span" style="font-family: Calibri;"><span lang="zh-CN"><span class="Apple-style-span" style="font-family: 宋体;"> </span>DADDUI </span><span lang="en-US"> </span><span lang="zh-CN">R1,R0,#64 </span><span lang="en-US"> </span><span lang="zh-CN">;length in bytes of 1st segment</span></span></div><div lang="zh-CN" style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span class="Apple-style-span" style="font-family: Calibri;"><span lang="zh-CN"></span></span><span class="Apple-style-span" style="font-family: Calibri;"><span lang="zh-CN"> DADDUI </span><span lang="en-US"> </span><span lang="zh-CN">R3,R0,#64 </span><span lang="en-US"> </span><span lang="zh-CN">;vector length of other segments</span></span></div><div lang="zh-CN" style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span class="Apple-style-span" style="font-family: Calibri;"><span lang="zh-CN"></span></span><span class="Apple-style-span" style="font-family: Calibri;"><span lang="zh-CN">Loop: </span><span lang="en-US"> </span><span lang="zh-CN">LV </span><span lang="en-US"> </span><span lang="zh-CN">V1,Rb </span><span lang="en-US"> </span><span 
lang="zh-CN">;load B</span></span></div><div lang="zh-CN" style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span class="Apple-style-span" style="font-family: Calibri;"><span lang="zh-CN"></span></span><span class="Apple-style-span" style="font-family: Calibri;"><span lang="zh-CN"><span class="Apple-style-span" style="font-family: 宋体;"> </span>MULVS.D </span><span lang="en-US"> </span><span lang="zh-CN">V2,V1,Fs </span><span lang="en-US"> </span><span lang="zh-CN">;vector * scalar</span></span></div><div lang="zh-CN" style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span class="Apple-style-span" style="font-family: Calibri;"><span lang="zh-CN"><span class="Apple-style-span" style="font-family: 宋体;"> </span>SV </span><span lang="en-US"> </span><span lang="zh-CN">Ra,V2 </span><span lang="en-US"> </span><span lang="zh-CN">;store A</span></span></div><div lang="zh-CN" style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span class="Apple-style-span" style="font-family: Calibri;"><span lang="zh-CN"></span></span><span class="Apple-style-span" style="font-family: Calibri;"><span lang="zh-CN"><span class="Apple-style-span" style="font-family: 宋体;"> </span>DADDU </span><span lang="en-US"> </span><span lang="zh-CN">Ra,Ra,R1 </span><span lang="en-US"> </span><span lang="zh-CN">;address of next segment of A</span></span></div><div lang="zh-CN" style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span class="Apple-style-span" style="font-family: Calibri;"><span lang="zh-CN"></span></span><span class="Apple-style-span" style="font-family: Calibri;"><span lang="zh-CN"><span class="Apple-style-span" style="font-family: 宋体;"> </span>DADDU </span><span lang="en-US"> </span><span lang="zh-CN">Rb,Rb,R1 
</span><span lang="en-US"> </span><span lang="zh-CN">;address of next segment of B</span></span></div><div lang="zh-CN" style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span class="Apple-style-span" style="font-family: Calibri;"><span lang="zh-CN"></span></span><span class="Apple-style-span" style="font-family: Calibri;"><span lang="zh-CN"><span class="Apple-style-span" style="font-family: 宋体;"> </span>DADDUI </span><span lang="en-US"> </span><span lang="zh-CN">R1,R0,#512 </span><span lang="en-US"> </span><span lang="zh-CN">;load byte offset next segment</span></span></div><div lang="zh-CN" style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span class="Apple-style-span" style="font-family: Calibri;"><span lang="zh-CN"></span></span><span class="Apple-style-span" style="font-family: Calibri;"><span lang="zh-CN"><span class="Apple-style-span" style="font-family: 宋体;"> </span>MTC1 </span><span lang="en-US"> </span><span lang="zh-CN">VLR,R3 </span><span lang="en-US"> </span><span lang="zh-CN">;set length to 64 elements</span></span></div><div lang="zh-CN" style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span class="Apple-style-span" style="font-family: Calibri;"><span lang="zh-CN"></span></span><span class="Apple-style-span" style="font-family: Calibri;"><span lang="zh-CN"><span class="Apple-style-span" style="font-family: 宋体;"> </span>DSUBU </span><span lang="en-US"> </span><span lang="zh-CN">R4,R2,Ra </span><span lang="en-US"> </span><span lang="zh-CN">;at the end of A?</span></span></div><div lang="zh-CN" style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span class="Apple-style-span" style="font-family: Calibri;"><span lang="zh-CN"></span></span><span class="Apple-style-span" 
style="font-family: Calibri;"><span lang="zh-CN"><span class="Apple-style-span" style="font-family: 宋体;"> </span>BNEZ </span><span lang="en-US"> </span><span lang="zh-CN">R4,Loop </span><span lang="en-US"> </span><span lang="zh-CN">;if not, go back</span></span></div><div lang="zh-CN" style="font-family: Arial; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
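The address arithmetic of this loop can be traced with a short Python model. It is a sketch under stated assumptions, not a VMIPS simulator: the function name trace_segments and its base_a parameter are hypothetical, elements are taken to be 8-byte doubles, and the first segment length is computed as n mod MVL where the assembly hard-codes 8.

```python
ELEM_BYTES = 8   # one double-precision element
MVL = 64         # maximum vector length on VMIPS


def trace_segments(n, base_a=0):
    """Return (byte offset into A, vector length) for each pass of the loop."""
    end = base_a + ELEM_BYTES * n   # R2: address just past the end of A
    vlr = n % MVL                   # length of the first segment
    step = ELEM_BYTES * vlr         # R1: byte length of the first segment
    ra = base_a                     # Ra: address of the current segment of A
    trace = []
    while True:
        trace.append((ra - base_a, vlr))   # LV / MULVS.D / SV would run here
        ra += step                  # DADDU Ra,Ra,R1
        step = ELEM_BYTES * MVL     # DADDUI R1,R0,#512
        vlr = MVL                   # MTC1 VLR,R3
        if end - ra == 0:           # DSUBU R4,R2,Ra and BNEZ R4,Loop
            break
    return trace


segments = trace_segments(200)
```

For n = 200 the trace shows one pass of 8 elements followed by three passes of 64, with the address register reaching byte offset 1600 after the fourth pass.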
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">The three vector instructions in this loop depend on one another and therefore must go into three separate convoys, so Tchime = 3. Let us apply our basic formula:</span></div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Tn = ⌈n / MVL⌉ × (Tloop + Tstart) + n × Tchime</span></div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">T200 = 4 × (15 + Tstart) + 200 × 3</span></div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">T200 = 60 + (4 × Tstart) + 600 = 660 + (4 × Tstart)</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">The value of Tstart is the sum of the following:</span></div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"></div><ul><li><span lang="en-US" style="font-family: Calibri;">the vector load start-up of 12 clock cycles</span></li>
<li><span lang="en-US" style="font-family: Calibri;">a 7-clock-cycle start-up for the multiply</span></li>
<li><span lang="en-US" style="font-family: Calibri;">a 12-clock-cycle start-up for the store</span></li>
</ul><br />
<div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Thus, the value of Tstart is given by:</span></div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;">Tstart = 12 + 7 + 12 = 31</div><div style="font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div lang="en-US" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;">So the overall execution time is:</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">T200 = 660 + 4 × 31 = 784</span></div><div style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
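The arithmetic above can be replayed with a short Python sketch. The function name and parameter names are illustrative only (they are not from the original text); the default values are the ones used in this example: MVL = 64, Tloop = 15, Tstart = 31, Tchime = 3.

```python
import math

def vector_loop_cycles(n, mvl=64, t_loop=15, t_start=31, t_chime=3):
    """Tn = ceil(n / MVL) * (Tloop + Tstart) + n * Tchime."""
    # Each pass over at most MVL elements pays the loop and start-up overhead once.
    passes = math.ceil(n / mvl)
    return passes * (t_loop + t_start) + n * t_chime

t200 = vector_loop_cycles(200)
print(t200)         # 4 * (15 + 31) + 200 * 3 = 784
print(t200 / 200)   # 3.92 clock cycles per element
```

Changing n shows how the fixed overhead is amortized: the per-element cost approaches Tchime plus (Tloop + Tstart) / MVL as the vector gets longer.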
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">The average execution time per element is therefore 784 / 200 ≈ 3.9 clock cycles, compared with the chime estimate of 3. In Section 4 we will be more ambitious and allow the execution of different convoys to overlap.</span></div><div style="font-family: Calibri; font-size: 11pt; font-weight: bold; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;">_____________________________________________________</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="zh-CN" style="font-family: 宋体;"><br />
</span></div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Figure F.9 shows, for the previous example (A = B × s), how the average execution cost per element varies with the vector length. The chime model predicts 3 clock cycles per element, but the two sources of overhead described above add about 0.9 clock cycles per element.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg3uJQj_RBY1Xh7cxNe0qovgRZbESAC8vDzfQIWj5qmGWTsWAxIX0ybR6MCarBJC4lgkgCqG01Yy8DI4QGJSQzJ990XrjXf5mYjMm8ntPIjqGniPoKb2-E-gjl0BDp3WGKoKdlbAn2_1pKg/s1600/F9.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="242" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg3uJQj_RBY1Xh7cxNe0qovgRZbESAC8vDzfQIWj5qmGWTsWAxIX0ybR6MCarBJC4lgkgCqG01Yy8DI4QGJSQzJ990XrjXf5mYjMm8ntPIjqGniPoKb2-E-gjl0BDp3WGKoKdlbAn2_1pKg/s400/F9.PNG" width="400" /></a></div><div style="margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span class="Apple-style-span" style="font-size: 15px; font-style: italic;"><span lang="en-US" style="font-family: Calibri; font-weight: bold;">Figure F.9 Average execution time per element and average overhead per element as a function of vector length. </span></span><span class="Apple-style-span" style="font-size: 15px; font-style: italic;"><span lang="en-US" style="font-family: Calibri;">For short vectors the start-up time is more than half of the total time, but for long vectors it drops to about one-third. The curve jumps each time the vector length crosses a multiple of 64, because this forces another loop iteration and another set of vector instructions to execute. These operations add Tloop + Tstart to Tn.</span></span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><br />
</div><div style="font-size: 11pt; margin-bottom: 0in; margin-left: 0in; margin-right: 0in; margin-top: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">The next few sections introduce several techniques that reduce this overhead. We will see how a technique called chaining reduces the number of convoys and hence the number of chimes. The per-iteration overhead Tloop can be reduced by further overlapping vector and scalar execution, so that the scalar work of one iteration runs while the vector instructions of the previous iteration are completing. Finally, the vector start-up latency can be eliminated by allowing vector instructions in different convoys to overlap.</span></div></div></div></div>Yuhaohttp://www.blogger.com/profile/08569555359590748704noreply@blogger.com0tag:blogger.com,1999:blog-1107874072437767466.post-80460141105445066502011-01-03T12:56:00.000-08:002011-01-03T12:56:17.525-08:00Vector Processors (4)<div style="font-size: 14pt; font-weight: bold; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">2.3 Vector Processor Load-Store Units and Memory Systems</span></div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">The behavior of the load-store unit in a vector processor is considerably more complicated than that of the arithmetic functional units. The start-up time of a load is the time needed to get the first word from memory into a register. If the rest of the vector can be supplied from memory without stalling, the initiation rate equals the rate at which new words are fetched or stored. Unlike the simpler functional units, the load-store (LS) unit does not necessarily have an initiation rate of 1 clock cycle, because memory bank stalls can reduce the effective throughput.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in;"><br />
</div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">In general, the start-up penalty of the LS unit is higher than that of the arithmetic functional units: more than 100 clock cycles on some processors. For VMIPS we assume a start-up time of 12 clock cycles, the same as the Cray-1. Figure F.6 summarizes the start-up penalties of the VMIPS vector operations.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in;"><br />
</div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">To sustain an initiation rate of one word fetched or stored per clock cycle, the memory system must be able to produce or accept that much data. This is usually achieved by spreading accesses across multiple independent memory banks. As we will see in the next subsection, having many banks is useful for vector accesses that load or store a row or a column of data.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in;"><br />
</div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">Most vector processors use memory banks rather than simple interleaving [1] for three reasons:</span></div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">1. Most vector processors support multiple loads or stores per clock cycle, and the memory bank cycle time is often several times longer than the CPU cycle time. To support multiple simultaneous accesses, the memory system needs multiple banks and must be able to control the address sent to each bank independently.</span></div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">2. As we will see in the next subsection, many vector processors must support loads and stores of data words that are not contiguous in memory. In such cases independent bank addressing, rather than simple interleaving, is required.</span></div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">3. Many vector computers support multiple processors sharing one memory system, so each processor generates its own independent stream of addresses.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in;"><br />
</div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">Taken together, these features lead to a large number of independent memory banks in vector processors, as the following example shows.</span></div><div style="font-family: Calibri; font-size: 11pt; font-weight: bold; margin: 0in;">_____________________________________________________</div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri; font-weight: bold;">Example</span><span lang="en-US" style="font-family: Calibri;">: The Cray T90 has a CPU clock cycle of 2.167 ns, and its largest configuration (the Cray T932) has 32 processors, each able to generate 4 loads and 2 stores per clock cycle. The SRAMs used in the memory system have a cycle time of 15 ns. Calculate the minimum number of memory banks required for all CPUs to run at full memory bandwidth.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in;"><br />
</div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri; font-weight: bold;">Answer</span><span lang="en-US" style="font-family: Calibri;">: The maximum number of memory references per clock cycle is 192 (32 CPUs, each issuing 6 references). Each SRAM bank is busy for 15 / 2.167 = 6.92 CPU clock cycles, which we round up to 7. We therefore need at least 192 × 7 = 1344 memory banks!</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in;"><br />
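As a quick check of the answer above, here is a minimal Python sketch of the same calculation. The function and parameter names are made up for illustration; the numbers are the ones given in the example.

```python
import math

def min_banks(cpus, refs_per_cpu_per_cycle, cpu_cycle_ns, bank_cycle_ns):
    """Minimum bank count so every reference can find a free bank each cycle."""
    # A bank stays busy for a whole number of CPU cycles, so round up.
    busy_cycles = math.ceil(bank_cycle_ns / cpu_cycle_ns)
    refs_per_cycle = cpus * refs_per_cpu_per_cycle
    return refs_per_cycle * busy_cycles

# Cray T932: 32 CPUs, 4 loads + 2 stores per cycle, 2.167 ns CPU clock, 15 ns SRAM
print(min_banks(32, 4 + 2, 2.167, 15.0))   # 192 * 7 = 1344
```

The model assumes the references are spread perfectly evenly over the banks, so it gives a lower bound rather than a guarantee.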
</div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">The Cray T932 actually has 1024 memory banks, so it cannot sustain full bandwidth to all CPUs simultaneously. A subsequent memory upgrade replaced the 15 ns asynchronous SRAMs with pipelined synchronous SRAMs that halved the memory cycle time, thereby providing sufficient bandwidth.</span></div><div style="font-family: Calibri; font-size: 11pt; font-weight: bold; margin: 0in;">_____________________________________________________</div><div style="font-family: Calibri; font-size: 11pt; font-weight: bold; margin: 0in;"><br />
</div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">The required access rate and the access time of each bank together determine how many memory banks are needed to avoid stalls. The next example shows how these timings fit together in a vector processor.</span></div><div style="font-family: Calibri; font-size: 11pt; font-weight: bold; margin: 0in;">_____________________________________________________</div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri; font-weight: bold;">Example</span><span lang="en-US" style="font-family: Calibri;">: Suppose we want to fetch a vector of 64 elements starting at address 136, and each memory access takes 6 clock cycles. How many memory banks are needed to support one fetch per clock cycle on average? With which addresses are the banks accessed? When does each vector element arrive at the CPU?</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in;"><br />
</div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri; font-weight: bold;">Answer</span><span lang="en-US" style="font-family: Calibri;">: Six clock cycles per access means we need at least 6 banks; because we want the number of banks to be a power of 2, we choose 8 banks. Figure F.7 shows the first few accesses for a memory system with 8 banks and an access latency of 6 clock cycles [2].</span></div><div style="font-family: Calibri; font-size: 11pt; font-weight: bold; margin: 0in;">_____________________________________________________</div><div style="font-family: Calibri; font-size: 11pt; font-weight: bold; margin: 0in;"><br />
</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhjVHDtYKTswWB0HrF92bZcq4Aou_pgOkFz7lt1ErAT8ncuBrbf0X8Dd4xWJJivjfF4_aklbUXpb3yrW7m88HolutDva8HhM4POd6DtZGDdzOV9s1I92GWfv_EZ2muGjH0hI60G37EuiEz5/s1600/F7.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="343" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhjVHDtYKTswWB0HrF92bZcq4Aou_pgOkFz7lt1ErAT8ncuBrbf0X8Dd4xWJJivjfF4_aklbUXpb3yrW7m88HolutDva8HhM4POd6DtZGDdzOV9s1I92GWfv_EZ2muGjH0hI60G37EuiEz5/s400/F7.PNG" width="400" /></a></div><div style="font-size: 11pt; font-style: italic; margin: 0in;"><span lang="en-US" style="font-family: Calibri; font-weight: bold;">Figure F.7 Memory addresses by bank number and the time at which each access begins. </span><span lang="en-US" style="font-family: Calibri;">Each memory bank latches the address at the start of an access and then takes 6 clock cycles to return a value to the CPU. Note that the CPU cannot keep all 8 banks busy at once, because it can supply only one address and accept only one data item per clock cycle.</span></div><div style="font-family: Calibri; font-size: 11pt; margin: 0in;"><br />
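The access pattern in this example can be tabulated with a small Python sketch. It rests on assumptions that the text does not spell out and that may differ from the conventions used in the figure: elements are 8 bytes wide and contiguous, banks are interleaved on 8-byte words, the first address is issued at cycle 0, one address is issued per cycle, and data arrives exactly 6 cycles after issue. All names are illustrative.

```python
def bank_schedule(start_addr=136, n=64, num_banks=8, word_bytes=8, latency=6):
    """Return one (element, byte address, bank, issue cycle, arrival cycle) tuple per element."""
    rows = []
    for i in range(n):
        addr = start_addr + i * word_bytes        # contiguous 8-byte elements
        bank = (addr // word_bytes) % num_banks   # word-interleaved bank mapping (assumed)
        issue = i                                 # one new address per clock cycle
        rows.append((i, addr, bank, issue, issue + latency))
    return rows

rows = bank_schedule()
print(rows[0])    # (0, 136, 1, 0, 6): the first element lives in bank 1
print(rows[7])    # (7, 192, 0, 7, 13): address 192 wraps around to bank 0
print(rows[63])   # (63, 640, 0, 63, 69): the last element arrives at cycle 69
```

Under these assumptions each bank is revisited only every 8 cycles, which is longer than the 6-cycle access time, so no access has to wait.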
</div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">The timing of real memory banks is usually split into two components: the access latency and the bank cycle time (or bank busy time). The access latency is the time from when the address arrives at the bank until the bank returns a data value. The busy time is the time a bank is occupied by one request. The access latency counts toward the start-up cost of fetching a vector from memory (the total memory latency also includes the time to traverse the pipelined interconnection networks that carry addresses and data between the CPU and the memory banks). The bank busy time determines the effective bandwidth of a memory system, because the processor cannot issue a second request to the same bank until the bank busy time has elapsed.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in;"><br />
</div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">For the simple unpipelined SRAM banks used in the previous examples, the access latency and the busy time are almost the same. For a pipelined SRAM bank, the access latency is longer than the busy time, because each element access occupies only one stage of the memory bank pipeline. For a DRAM bank, the access latency is usually shorter than the busy time, because a DRAM needs extra time to restore the value after a destructive read. A memory system that supports multiple simultaneous vector accesses or nonsequential accesses should have more memory banks than the minimum; otherwise memory bank conflicts will occur. We discuss this issue in detail in the next section.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in;"><br />
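To make the role of the bank busy time concrete, the following Python sketch simulates a single stream of accesses and counts how many cycles it takes to issue them. This is a simplified model, not taken from the original text: it assumes one request is issued per cycle at most, that request i goes to bank (i × stride) mod num_banks, and that a bank accepts a new request only after its busy time has elapsed. All names are illustrative.

```python
def cycles_to_issue(num_banks, busy_time, num_accesses, stride=1):
    """Cycles needed to issue all accesses, stalling whenever the target bank is busy."""
    free_at = [0] * num_banks   # first cycle at which each bank can accept a request
    cycle = 0                   # earliest cycle at which the next request may issue
    for i in range(num_accesses):
        bank = (i * stride) % num_banks
        issue = max(cycle, free_at[bank])    # wait if the bank is still busy
        free_at[bank] = issue + busy_time
        cycle = issue + 1
    return cycle

print(cycles_to_issue(8, 6, 64))   # 64: with 8 banks there are no stalls
print(cycles_to_issue(4, 6, 64))   # 94: with only 4 banks, every group of 4 accesses takes 6 cycles
```

With fewer banks than the busy time in cycles, the stream can no longer sustain one access per clock, which is the bank conflict effect described above.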
</div><div style="font-size: 11pt; margin: 0in; text-align: center;"><span lang="en-US" style="font-family: Calibri;">---------------Hi everyone, I am a divider line---------------</span></div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">[1] Simple interleaving probably refers to interleaving addresses across multiple memory chips within a single memory bank.</span></div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">[2] Note that each vector element is 64 bits, that is, 8 bytes.</span></div>Yuhaohttp://www.blogger.com/profile/08569555359590748704noreply@blogger.com0tag:blogger.com,1999:blog-1107874072437767466.post-5914874017014374942011-01-02T15:15:00.000-08:002011-01-06T12:52:31.469-08:00Vector Processors (3)<div style="font-size: 14pt; font-weight: bold; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">2.1 How Vector Processors Work: An Example</span></div><div style="font-size: 11pt; margin: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">The best way to understand how a vector processor works is to study how a vector loop runs on VMIPS. Let us take the following typical vector problem, which is also used throughout the rest of this appendix:</span></div><div style="font-size: 11pt; margin: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Y = a × X + Y</span></div><div style="font-size: 11pt; margin: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">X and Y are vectors, initially resident in memory; a is a scalar. This is the so-called SAXPY or DAXPY loop. (SAXPY stands for </span><span lang="en-US" style="font-family: Calibri; text-decoration: underline;">s</span><span lang="en-US" style="font-family: Calibri;">ingle-precision </span><span lang="en-US" style="font-family: Calibri; text-decoration: underline;">a</span><span lang="en-US" style="font-family: Calibri;"> × </span><span lang="en-US" style="font-family: Calibri; text-decoration: underline;">X</span><span lang="en-US" style="font-family: Calibri;"> </span><span lang="en-US" style="font-family: Calibri; text-decoration: underline;">p</span><span lang="en-US" style="font-family: Calibri;">lus </span><span lang="en-US" style="font-family: Calibri; text-decoration: underline;">Y</span><span lang="en-US" style="font-family: Calibri;">; DAXPY stands for </span><span lang="en-US" style="font-family: Calibri; text-decoration: underline;">d</span><span lang="en-US" style="font-family: Calibri;">ouble-precision </span><span lang="en-US" style="font-family: Calibri; text-decoration: underline;">a</span><span lang="en-US" style="font-family: Calibri;"> × </span><span lang="en-US" style="font-family: Calibri; text-decoration: underline;">X</span><span lang="en-US" style="font-family: Calibri;"> </span><span lang="en-US" style="font-family: Calibri; text-decoration: underline;">p</span><span lang="en-US" style="font-family: Calibri;">lus </span><span lang="en-US" style="font-family: Calibri; text-decoration: underline;">Y</span><span lang="en-US" style="font-family: Calibri;">.) Linpack is a collection of linear algebra routines, and the routines that perform Gaussian elimination are known as the Linpack Benchmark. The DAXPY routine here represents only a small part of the Linpack Benchmark, but it accounts for most of the benchmark's running time.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in; text-align: justify;"><br />
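For reference, the DAXPY loop written as plain scalar code looks like the following Python sketch. It is only meant to show the semantics of Y = a × X + Y one element at a time; the function name and the in-place update of y are choices made here, not something prescribed by the text.

```python
def daxpy(a, x, y):
    """Compute y[i] = a * x[i] + y[i] for every element, updating y in place."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    for i in range(len(x)):
        # One multiply and one add per element; iterations are independent.
        y[i] = a * x[i] + y[i]
    return y

print(daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))   # [12.0, 24.0, 36.0]
```

Because no iteration depends on another, the whole loop body can be expressed as a handful of vector instructions, which is what makes it a good example for a vector processor.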
</div><div style="font-size: 11pt; margin: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">From now on, let us assume that the number of elements in a vector register, that is, its length (64), matches the length of the vector operation we are interested in. (We will relax this restriction later.)</span><br />
<span lang="zh-CN" style="font-family: 宋体; font-weight: bold;">————————————————————</span></div><div style="font-family: Calibri; font-size: 11pt; font-weight: bold; margin: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Example: Show the code for MIPS and VMIPS for the DAXPY loop. Assume that the starting addresses of X and Y are in Rx and Ry, respectively.</span></div><div style="font-size: 11pt; margin: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri; font-weight: bold;">Answer</span><span lang="en-US" style="font-family: Calibri;">: Here is the MIPS code:</span><br />
<span lang="zh-CN" style="font-family: 宋体;"> </span><span lang="zh-CN"><span style="font-family: Arial, Helvetica, sans-serif;">L.D</span></span><span style="font-family: Arial, Helvetica, sans-serif;"> </span><span lang="zh-CN" style="font-family: Arial, Helvetica, sans-serif;">F0,a</span><span style="font-family: Arial, Helvetica, sans-serif;"> </span><span lang="zh-CN" style="font-family: Arial, Helvetica, sans-serif;">;load scalar a</span><br />
<div style="font-family: Arial,Helvetica,sans-serif;"><span lang="zh-CN"> DADDIU </span><span lang="en-US"></span><span lang="zh-CN"> R4,Rx,#512</span><span lang="en-US"> </span><span lang="zh-CN">;last address to load</span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span lang="zh-CN"> Loop: </span><span lang="en-US"></span><span lang="zh-CN">L.D</span><span lang="en-US"> </span><span lang="zh-CN">F2,0(Rx) </span><span lang="en-US"> </span><span lang="zh-CN">;load X(i)</span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span lang="zh-CN"> MUL.D</span><span lang="en-US"> </span><span lang="zh-CN">F2,F2,F0 </span><span lang="en-US"> </span><span lang="zh-CN">;a </span><span lang="zh-CN">× X(i)</span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span lang="zh-CN"> L.D</span><span lang="en-US"> </span><span lang="zh-CN">F4,0(Ry) </span><span lang="en-US"> </span><span lang="zh-CN">;load Y(i)</span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span lang="zh-CN"> ADD.D</span><span lang="en-US"> </span><span lang="zh-CN">F4,F4,F2 </span><span lang="en-US"> </span><span lang="zh-CN">;a </span><span lang="zh-CN">× X(i) + Y(i)</span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span lang="zh-CN"> S.D</span><span lang="en-US"> </span><span lang="zh-CN">0(Ry),F4 </span><span lang="en-US"> </span><span lang="zh-CN">;store into Y(i)</span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span lang="zh-CN"> DADDIU</span> <span lang="zh-CN">Rx,Rx,#8 </span><span lang="en-US"> </span><span lang="zh-CN">;increment index to X</span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span lang="zh-CN"> DADDIU </span><span lang="en-US"> </span><span lang="zh-CN">Ry,Ry,#8 </span><span lang="en-US"> </span><span lang="zh-CN">;increment index to Y</span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span lang="zh-CN"> DSUBU </span><span lang="en-US"> </span><span 
lang="zh-CN">R20,R4,Rx </span><span lang="en-US"> </span><span lang="zh-CN">;compute bound</span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span lang="zh-CN"> BNEZ </span><span lang="en-US"> </span><span lang="zh-CN">R20,Loop </span><span lang="en-US"> </span><span lang="zh-CN">;check if done</span></div></div><div style="font-size: 11pt; margin: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Here is the VMIPS code:</span><br />
<span lang="zh-CN"><span style="font-family: Arial, Helvetica, sans-serif;"> L.D </span></span><span lang="en-US" style="font-family: Arial, Helvetica, sans-serif;"> </span><span lang="zh-CN" style="font-family: Arial, Helvetica, sans-serif;">F0,a </span><span lang="en-US" style="font-family: Arial, Helvetica, sans-serif;"> </span><span lang="zh-CN" style="font-family: Arial, Helvetica, sans-serif;">;load scalar a</span><br />
<div style="font-family: Arial,Helvetica,sans-serif;"><span lang="zh-CN"> LV </span><span lang="en-US"> </span><span lang="zh-CN">V1,Rx </span><span lang="en-US"> </span><span lang="zh-CN">;load vector X</span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span lang="zh-CN"> MULVS.D </span><span lang="en-US"> </span><span lang="zh-CN">V2,V1,F0 </span><span lang="en-US"> </span><span lang="zh-CN">;vector-scalar multiply</span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span lang="zh-CN"> </span><span lang="zh-CN">LV </span><span lang="en-US"> </span><span lang="zh-CN">V3,Ry </span><span lang="en-US"> </span><span lang="zh-CN">;load vector Y</span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span lang="zh-CN"> </span><span lang="zh-CN">ADDV.D </span><span lang="en-US"> </span><span lang="zh-CN">V4,V2,V3 </span><span lang="en-US"> </span><span lang="zh-CN">;add</span></div><div style="font-family: Arial,Helvetica,sans-serif;"><span lang="zh-CN"> </span><span lang="zh-CN">SV </span><span lang="en-US"> </span><span lang="zh-CN">Ry,V4 </span><span lang="en-US"> </span><span lang="zh-CN">;store the result</span></div></div>
<div style="font-size: 11pt; margin: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Comparing the two versions is instructive. The most striking difference is that the vector processor greatly reduces the dynamic instruction[1] bandwidth: it executes only 6 instructions, versus almost 600 for MIPS. This is because each vector instruction operates on all 64 elements, and because the loop overhead instructions[2], which make up nearly half of the MIPS loop, are no longer present.</span></div><div style="font-family: Calibri; font-size: 11pt; font-weight: bold; margin: 0in; text-align: justify;"><span lang="zh-CN" style="font-family: 宋体; font-weight: bold;">————————————————————</span></div>
<div style="font-size: 11pt; margin: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Another important difference is the frequency of pipeline interlocks. In the straightforward MIPS code, every ADD.D must wait for a MUL.D, and every S.D must wait for an ADD.D. On the vector processor, each vector instruction stalls only for the first element of the vector, and the following elements then flow smoothly down the pipeline. Pipeline stalls are therefore needed only once per vector operation rather than once per vector element. In this example, the pipeline stall frequency on MIPS is about 64 times higher than on VMIPS. On MIPS the stalls can be avoided with software pipelining or loop unrolling (see Appendix G). The large difference in instruction bandwidth, however, remains.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in; text-align: justify;"><br />
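<div style="font-size: 11pt; margin: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">The instruction counts behind this comparison can be checked by hand. The sketch below simply tallies the instructions in the two listings above for a vector length of 64; the constant names are made up for this illustration.</span></div>

```python
VECTOR_LENGTH = 64      # elements per vector, as assumed in the text

# Scalar MIPS listing: 2 setup instructions (L.D, DADDIU), then 9 per iteration.
MIPS_SETUP = 2
MIPS_LOOP_BODY = 9      # L.D, MUL.D, L.D, ADD.D, S.D, DADDIU, DADDIU, DSUBU, BNEZ
MIPS_LOOP_OVERHEAD = 4  # DADDIU, DADDIU, DSUBU, BNEZ

# VMIPS listing: L.D, LV, MULVS.D, LV, ADDV.D, SV.
VMIPS_TOTAL = 6

mips_total = MIPS_SETUP + MIPS_LOOP_BODY * VECTOR_LENGTH
overhead_share = MIPS_LOOP_OVERHEAD / MIPS_LOOP_BODY

print(mips_total)                          # 578
print(round(mips_total / VMIPS_TOTAL, 1))  # 96.3
print(round(overhead_share, 2))            # 0.44
```

<div style="font-size: 11pt; margin: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">The scalar loop executes 578 dynamic instructions, which is the "almost 600" figure, and 4 of the 9 instructions in each iteration are loop overhead.</span></div>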
</div><div style="font-size: 14pt; font-weight: bold; margin: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">2.2 Vector Execution Time</span></div><div style="font-size: 11pt; margin: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">The execution time of a sequence of vector operations depends mainly on three factors: the length of the operand vectors, structural hazards among the operations, and data dependences. Given the vector length and the initiation rate, which is the rate at which a vector unit consumes new operands and produces new results, we can compute the time for a single vector instruction. All modern supercomputers have vector functional units made up of multiple parallel pipelines (or lanes). They can produce two or more results per clock cycle, but they may also have some functional units that are not fully pipelined. For simplicity, our VMIPS implementation has one lane, with an initiation rate of one element per clock cycle for individual operations. The execution time of a single vector instruction is then approximately equal to the vector length.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in; text-align: justify;"><br />
</div><div style="font-size: 11pt; margin: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">To simplify the discussion of vector execution and its timing, we will use the notion of a </span><span lang="en-US" style="font-family: Calibri; font-style: italic;">convoy</span><span lang="en-US" style="font-family: Calibri;">, which is the set of vector instructions that could potentially begin execution together in one clock period. (Although the concept of a convoy is used in vector compilers, no standard terminology exists, so we created the term </span><span lang="en-US" style="font-family: Calibri; font-style: italic;">convoy</span><span lang="en-US" style="font-family: Calibri;">.) The instructions in a convoy must not contain any structural or data hazards (though we will relax this later). If such hazards are present, the instructions must be serialized and placed in different convoys. Placing vector instructions into a convoy is analogous to placing scalar operations into a VLIW instruction. To keep the analysis simple, we assume that a convoy of instructions must complete execution before any other instructions, whether scalar instructions or the vector instructions of the next convoy, can begin execution. We will relax this in Section 4 by using a less restrictive, but more complex, method for issuing instructions.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in; text-align: justify;"><br />
</div><div style="font-size: 11pt; margin: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Accompanying the notion of a convoy is a timing metric called a chime, which can be used to estimate the performance of a vector sequence made up of convoys. A chime is the unit of time taken to execute one convoy. It is an approximate measure of the execution time of a vector sequence, and a chime count is independent of vector length. A vector sequence that consists of m convoys therefore executes in m chimes, and for a vector length of n this is approximately m × n clock cycles. The chime approximation ignores some processor-specific overheads, many of which depend on vector length, so measuring time in chimes is a better approximation for long vectors. We will use chimes rather than clock cycles to estimate execution time, to make it explicit that those overheads are being ignored.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in; text-align: justify;"><br />
</div><div style="font-size: 11pt; margin: 0in; text-align: justify;"><span lang="en-US" style="font-family: Calibri;">If we know the number of convoys in a vector sequence, we know its execution time in chimes. One source of overhead ignored when counting chimes is any limit on initiating multiple vector instructions in a single clock cycle. If only one vector instruction can be initiated per clock cycle (the reality in most vector processors), the chime count underestimates the actual execution time of a convoy. Because the vector length is typically much greater than the number of instructions in a convoy, we will simply assume that a convoy executes in one chime.</span><br />
<span lang="zh-CN" style="font-family: 宋体; font-weight: bold;">————————————————————</span></div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri; font-weight: bold;">Example</span><span lang="en-US" style="font-family: Calibri;">: Show how the following code sequence lays out in convoys, assuming a single copy of each vector functional unit.</span><br />
<div style="font-family: Arial,Helvetica,sans-serif; text-align: justify;"><span lang="zh-CN"> LV </span><span lang="en-US"> </span><span lang="zh-CN">V1,Rx </span><span lang="en-US"> </span><span lang="zh-CN">;load vector X</span></div><div style="font-family: Arial,Helvetica,sans-serif; text-align: justify;"><span lang="zh-CN"> MULVS.D </span><span lang="en-US"> </span><span lang="zh-CN">V2,V1,F0 </span><span lang="en-US"> </span><span lang="zh-CN">;vector-scalar multiply</span></div><div style="font-family: Arial,Helvetica,sans-serif; text-align: justify;"><span lang="zh-CN"> LV </span><span lang="en-US"> </span><span lang="zh-CN">V3,Ry</span><span lang="en-US"> </span><span lang="zh-CN">;load vector Y</span></div><div style="font-family: Arial,Helvetica,sans-serif; text-align: justify;"><span lang="zh-CN"> ADDV.D </span><span lang="en-US"> </span><span lang="zh-CN">V4,V2,V3 </span><span lang="en-US"> </span><span lang="zh-CN">;add</span></div><div style="font-family: Arial,Helvetica,sans-serif; text-align: justify;"><span lang="zh-CN"> SV </span><span lang="en-US"> </span><span lang="zh-CN">Ry,V4 </span><span lang="en-US"> </span><span lang="zh-CN">;store the result</span></div></div>
<div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">How many chimes will this vector sequence take? How many clock cycles per FLOP (floating-point operation) are needed, ignoring vector instruction issue overhead?</span></div>
<div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri; font-weight: bold;">Answer</span><span lang="en-US" style="font-family: Calibri;">: The first convoy is occupied by the first LV instruction. The MULVS.D depends on the first LV, so it cannot be in the same convoy. The second LV instruction can share a convoy with the MULVS.D. The ADDV.D also depends on the second LV, so it must go into a third convoy. Finally, the SV depends on the ADDV.D, so it must go into the following convoy. This gives the following layout of vector instructions into convoys:</span></div><div style="font-family: Calibri; font-size: 11pt; margin: 0in;">1. LV</div><div style="font-family: Calibri; font-size: 11pt; margin: 0in;">2. MULVS.D LV</div><div style="font-family: Calibri; font-size: 11pt; margin: 0in;">3. ADDV.D</div><div style="font-family: Calibri; font-size: 11pt; margin: 0in;">4. SV</div>
<div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">The sequence requires 4 convoys and hence takes 4 chimes. Since the sequence takes 4 chimes and there are 2 floating-point operations per result, the number of clock cycles per FLOP is 2 (ignoring any vector instruction issue overhead)[3]. Note that although we allow the MULVS.D and the LV to execute together in convoy 2, most vector machines take 2 clock cycles to initiate the two instructions.</span></div><div style="font-family: Calibri; font-size: 11pt; font-weight: bold; margin: 0in;"><span lang="zh-CN" style="font-family: 宋体; font-weight: bold;">————————————————————</span></div>
<div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">The chime approximation is reasonable for long vectors. For example, for 64-element vectors the example above takes 4 chimes, so the sequence takes about 256 clock cycles. By comparison, the overhead of issuing a convoy over 2 clock cycles is negligible.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in;"><br />
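<div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">The convoy layout in the example above can be reproduced with a small greedy pass over the instruction list. This is only a sketch under the simplifying assumptions of this section (a single copy of each functional unit, and dependent instructions never sharing a convoy). It checks structural hazards and read-after-write dependences only, and the VInstr record, the function name and the unit names are hypothetical.</span></div>

```python
from collections import namedtuple

# unit: functional unit used; dest: vector register written (None for a store);
# srcs: vector registers read.
VInstr = namedtuple("VInstr", "name unit dest srcs")


def build_convoys(instrs):
    """Greedily pack instructions, in program order, into convoys."""
    convoys = []
    current, units, written = [], set(), set()
    for ins in instrs:
        structural = ins.unit in units                  # unit already busy
        data = any(src in written for src in ins.srcs)  # reads a pending result
        if current and (structural or data):
            convoys.append(current)
            current, units, written = [], set(), set()
        current.append(ins.name)
        units.add(ins.unit)
        if ins.dest is not None:
            written.add(ins.dest)
    if current:
        convoys.append(current)
    return convoys


program = [
    VInstr("LV", "load-store", "V1", ()),
    VInstr("MULVS.D", "multiply", "V2", ("V1",)),
    VInstr("LV", "load-store", "V3", ()),
    VInstr("ADDV.D", "add", "V4", ("V2", "V3")),
    VInstr("SV", "load-store", None, ("V4",)),
]

convoys = build_convoys(program)
chimes = len(convoys)
print(convoys)      # [['LV'], ['MULVS.D', 'LV'], ['ADDV.D'], ['SV']]
print(chimes * 64)  # 256 clock cycles for 64-element vectors
print(chimes / 2)   # 2.0 clock cycles per FLOP (2 FLOPs per result)
```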
</div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">Another source of overhead is far more significant than the issue limitation. The most important overhead ignored by the chime model is vector start-up time. The start-up time comes from the pipelining latency of the vector operation and is determined mainly by the depth of the pipeline. It increases the effective time to execute a convoy to more than one chime. Because we assume that convoys do not overlap in time, the start-up time also delays the execution of subsequent convoys. Since the instructions in successive convoys have structural or data hazards with the current convoy anyway, the no-overlap assumption is reasonable. The actual time to complete a convoy is the sum of the vector length and the start-up time. If vector lengths were infinite, the start-up overhead would be amortized, but finite vector lengths expose it, as the following example shows.</span><br />
<span lang="zh-CN" style="font-family: 宋体; font-weight: bold;">————————————————————</span></div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri; font-weight: bold;">Example</span><span lang="en-US" style="font-family: Calibri;">: Assume that the start-up overhead for each functional unit is as shown in Figure F.4.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in;"><br />
</div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">Show the time at which each convoy can begin and the total number of cycles needed. For a vector of length 64, how does this time compare with the chime approximation?</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in;"><br />
</div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri; font-weight: bold;">Answer</span><span lang="en-US" style="font-family: Calibri;">: Figure F.5 gives the answer convoy by convoy, assuming that the vector length is n. One tricky question is when we consider the vector sequence to be done, because this determines whether the start-up time of the SV is counted or not. We assume that the instructions following the SV cannot fit into its convoy and that different convoys do not overlap. The total time is then measured up to the completion of the last vector instruction in the last convoy. This is only an approximation: the start-up time of the last vector instruction is visible in some sequences and hidden in others. For simplicity, we always include it.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in;"><br />
</div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">For a vector of length 64, the time per result is 4 + (42 / 64) ≈ 4.66 clock cycles, whereas the chime approximation gives 4. With start-up overhead included, the execution time is about 1.16 times higher.</span></div><div style="font-family: Calibri; font-size: 11pt; font-weight: bold; margin: 0in;"><span lang="zh-CN" style="font-family: 宋体; font-weight: bold;">———————————————————— </span></div><div style="margin: 0in;"></div><div style="font-size: 11pt; font-style: italic; font-weight: bold; margin: 0in;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh1Zl56m2MC0ToSVHmhMPjssi_AnPtnv9kS2OyVaYTy4LYwF_rIdrcpSWHWsIhLnnM36Lo0q43eKTu8fduyOWTF0pCtVmAy6-6Gf59a4ukBj1XAkhRt2AvPfpjd1SfV1TQ9boSXFrsYAjHa/s1600/F4.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="90" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh1Zl56m2MC0ToSVHmhMPjssi_AnPtnv9kS2OyVaYTy4LYwF_rIdrcpSWHWsIhLnnM36Lo0q43eKTu8fduyOWTF0pCtVmAy6-6Gf59a4ukBj1XAkhRt2AvPfpjd1SfV1TQ9boSXFrsYAjHa/s400/F4.PNG" width="400" /></a></div><span lang="en-US" style="font-family: Calibri;">Figure F.4 Start-up overhead.</span><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgv3Tb9XurOlCBGEiAjthe8rqEtMK43ADTu6423cs58Lsy0w-YgO-VuicZvQwogD8hvGl43TBvFdsYEDTH34_13A9FF0B9_VaSjuJZwCvwBID1W7Pnfuasiah8WFWkJh0UN66dEwwyQB66H/s1600/F5.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="103" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgv3Tb9XurOlCBGEiAjthe8rqEtMK43ADTu6423cs58Lsy0w-YgO-VuicZvQwogD8hvGl43TBvFdsYEDTH34_13A9FF0B9_VaSjuJZwCvwBID1W7Pnfuasiah8WFWkJh0UN66dEwwyQB66H/s400/F5.PNG" width="400" /></a></div><span lang="en-US" style="font-family: Calibri; font-weight: bold;">Figure F.5 Starting times and first- and last-result times for convoys 1 through 4. </span><span lang="en-US" style="font-family: Calibri;">The vector length is n.</span></div><div style="margin: 0in 0in 0in 0.375in;"></div><div style="font-size: 11pt; font-style: italic; margin: 0in;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj4G-wLovOln2rFqgUsmA4ices33j2n6e5L-TST8NH4xd76zmhT7ud-k9hJU6kOE-7tLLfWWnlrY-7DO76EKx0CwLkElRbFsAbI5Bm6bqfNAYtJ9xJ2nmN70pboMMD22tiyB2dfjI9IDGU5/s1600/F6.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="105" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj4G-wLovOln2rFqgUsmA4ices33j2n6e5L-TST8NH4xd76zmhT7ud-k9hJU6kOE-7tLLfWWnlrY-7DO76EKx0CwLkElRbFsAbI5Bm6bqfNAYtJ9xJ2nmN70pboMMD22tiyB2dfjI9IDGU5/s400/F6.PNG" width="400" /></a></div><span lang="en-US" style="font-family: Calibri; font-weight: bold;">Figure F.6 Start-up penalties on VMIPS. </span><span lang="en-US" style="font-family: Calibri;">These are the start-up penalties, in clock cycles, for VMIPS vector operations.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in;"><br />
</div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">To keep the discussion simple, we use the chime count as the estimate of execution time, and account for start-up time only when detailed performance figures are needed to illustrate a particular optimization. For long vectors the start-up overhead is small. Later sections of this appendix discuss how to reduce it.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in;"><br />
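<span lang="en-US" style="font-family: Calibri;">As a rough check on the arithmetic, here is a small Python sketch that compares the chime estimate with the per-result time once start-up overhead is included. The helper name cycles_per_result is made up for this illustration; the inputs (4 chimes, 42 cycles of total start-up overhead, vector length 64) are the ones from the example above.</span>

```python
def cycles_per_result(chimes, startup_cycles, vector_length):
    """Average clock cycles per result, with the start-up overhead
    spread over the whole vector (hypothetical helper, not part of VMIPS)."""
    return chimes + startup_cycles / vector_length


chimes = 4            # chime count of the example sequence
startup_cycles = 42   # total start-up overhead of its convoys

with_startup = cycles_per_result(chimes, startup_cycles, vector_length=64)
slowdown = with_startup / chimes

print(with_startup)  # 4.65625
print(slowdown)      # 1.1640625

# With a short vector the same overhead is spread over fewer elements,
# so it dominates the chime estimate.
short_vector = cycles_per_result(chimes, startup_cycles, vector_length=8)
print(short_vector)  # 9.25
```

<span lang="en-US" style="font-family: Calibri;">The chime count alone would report 4 cycles per result in both cases, which is why it is a reasonable estimate only when vectors are long.</span>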
</div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">The start-up time of an instruction comes from the pipeline depth of the functional unit that executes it. To sustain an initiation rate of one result per clock cycle, we need:</span></div><div style="font-size: 11pt; margin: 0in 0in 0in 0.75in;"><span lang="en-US" style="font-family: Calibri;">Pipeline depth = ⌈total functional-unit time / clock cycle time⌉</span></div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">For example, an operation that takes 10 clock cycles must be split into 10 pipeline stages to sustain an initiation rate of one per clock cycle. Pipeline depth therefore depends on the complexity of the operation and on the processor's clock cycle time. Functional-unit pipeline depths vary widely (anything from 2 to 20 stages is not uncommon), although the most heavily used units have 4 to 8 stages.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in;"><br />
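<span lang="en-US" style="font-family: Calibri;">The depth formula is just a ceiling division, sketched below in Python. The function name and the choice of picoseconds as the time unit are assumptions made for this illustration only.</span>

```python
import math


def pipeline_depth(total_unit_time_ps, clock_cycle_ps):
    """Stages needed to start one operation per clock cycle:
    ceil(total functional-unit time / clock cycle time)."""
    return math.ceil(total_unit_time_ps / clock_cycle_ps)


# An operation that takes 10 clock cycles (10 x 1000 ps) needs 10 stages.
print(pipeline_depth(10_000, 1_000))  # 10

# A partial cycle still costs a whole stage: 2.5 cycles round up to 3.
print(pipeline_depth(2_500, 1_000))   # 3
```

<span lang="en-US" style="font-family: Calibri;">Shortening the clock cycle while the operation time stays fixed raises the depth, and with it the start-up latency.</span>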
</div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">For VMIPS we use the same pipeline depths as the Cray-1, although latencies in more modern processors have tended to increase, especially for loads. All functional units are fully pipelined. As Figure F.6 shows, the pipeline depth is 6 clock cycles for floating-point add and 7 clock cycles for floating-point multiply. On VMIPS, as on most vector processors, independent vector operations that use different functional units can issue in the same convoy.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in;"><br />
</div><div style="font-size: 11pt; margin: 0in; text-align: center;"><span lang="en-US" style="font-family: Calibri;">---------------Hello everyone, I am the divider line---------------</span></div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">[1] “Dynamic instructions” here means the instructions actually executed at run time. In the MIPS code, because of the loop, the instructions are fetched again on every iteration.</span></div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">[2] Loop-overhead instructions are those needed only to maintain the loop's control flow: decrementing the loop index, testing for the loop bound, branching, and so on. They do not appear in the vector code, because there is no loop at all.</span></div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">[3] Suppose the vector length is n. The code performs 2 × n floating-point operations (two per vector element) in 4 × n clock cycles (taking each chime to last n cycles), which gives 2 clock cycles per FLOP. Put more simply: executing all the instructions takes 4 chimes, which is equivalent to each vector element needing 4 clock cycles to complete its own operations.</span></div>Yuhaohttp://www.blogger.com/profile/08569555359590748704noreply@blogger.com0tag:blogger.com,1999:blog-1107874072437767466.post-74524296498719822822010-12-19T20:56:00.000-08:002010-12-19T20:56:39.930-08:00After Liu Xiaobo Won the Nobel Peace Prize<div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">I am writing this because I posted a few criticisms of Liu Xiaobo on Twitter and promptly lost quite a few followers. I do not know whether the two are related, but it made me notice something interesting: there may once have been room to debate Liu Xiaobo and his record, but since he won this year's Nobel Peace Prize that room seems to have vanished entirely. Whatever the reason, whether people have not read the history and do not know how Liu once went on China Central Television to testify that not a single person died in the Tiananmen massacre, or do not follow current affairs and do not know how Liu <a href="http://www.peacehall.com/news/gb/china/2006/08/200608191041.shtml">kicked a rights lawyer like Gao Zhisheng when he was down</a> and <a href="http://www.bullogger.com/blogs/tuna/archives/262359.aspx">smeared</a> Yang Jia, the crowd is all applause. Of course there are also plenty of people who know the truth, or seem to. Some say, for instance, that Liu understood the clearing of Tiananmen to mean only what happened around the Monument, and that he genuinely did not see the killing on the periphery. But according to Wu Renhua's book 《天安门血腥清场内幕》 (an inside account of the bloody clearing of Tiananmen Square), people died even around the Monument. And even if Liu saw nothing, how could he, amid such terror and madness, flatly assert that nobody died at Tiananmen? Someone else told me on Twitter: “You reject everything merely because of his early statements; a person's morality is judged by conduct over the long run.” Evidently he means that Liu, after giving false testimony, came to his senses and went through an anguished repentance, and that Charter 08 is enough to wipe away the stains on his record. But I say that I am not rejecting him merely for his early statements; his long-term conduct is exactly what I am looking at! What do I mean? I mean precisely his kicking of persecuted rights lawyers such as Gao Zhisheng, and of a defender of primal justice such as Yang Jia, when they were down. When I say “kicking them when they were down” I am making a judgment of fact, not necessarily a judgment of value, although I increasingly think it is a judgment of value as well. Some will argue that this follows from Xiaobo's principle of nonviolent resistance. So, setting aside Liu's brand of so-called nonviolent resistance, let us look at what he actually said about the Gao Zhisheng and Yang Jia cases, and ask how he could bring himself to say it. On the case of the lawyer Gao Zhisheng, Liu said: “If the Gao Zhisheng case enters judicial proceedings, we urge the judicial authorities to try it in open court and to ensure judicial fairness and a full legal defense for the defendant.” This is simply absurd. Suppose that now, with Liu himself behind bars, I stood up and said, “I believe Liu Xiaobo should receive an open and fair trial; I urge that his case be brought into normal judicial proceedings at once; I trust that the Party and the government will give him a fair verdict.” What would you all think, and how would you react? On Yang Jia, he said: “In fact, in the China of the Internet age, Yang Jia's individual rights defense had not exhausted every nonviolent means. At least one nonviolent path remained: seeking relief from public opinion by making his grievance and his demands public in the media. If traditional media would not do, he could still have used the Internet; a case like Yang Jia's would probably not have been censored. Imagine: had Yang Jia kept disclosing his experience and his rights-defense efforts online, it might have drawn attention and become a public issue; he would surely have won the support of online opinion, encouraging him to keep defending his rights through the law; the Shanghai police would have come under pressure from online opinion; and he might not have resorted to the extreme of violent revenge.” Let me ask you one thing on Yang Jia's behalf: with a reputation like yours, and with nearly every citizen of conscience in the country behind you, you still got eleven years in prison; how could you have the heart to tell Yang Jia such nonsense about media channels and Internet channels? We do not even need to debate nonviolent resistance. Just consider: given that, of Gao and Yang, one's fate is unknown and the other has been put to death, how could Liu say things that serve the oppressor? Does it stand up, either in feeling or in reason?</span></div><div lang="zh-CN" style="font-family: Calibri; font-size: 11pt; margin: 0in;"><br />
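</div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in;">Of course, amid all the celebration, a few dissenting voices can still be heard. Professor Fang Lizhi's <a href="http://www.voanews.com/chinese/news/20101214-olso-by-fang-111866974.html">Oslo diary</a> recalls how Mr. Xiaobo once used his freedom of speech to “lay into” Fang himself. Mr. Wei Jingsheng's article <a href="http://lihlii.posterous.com/36599317">《如今的诺贝尔和平奖给人们提供了什么》</a> (“What the Nobel Peace Prize offers people today”) is rather critical of Liu's award. <a href="http://www.duping.net/XHC/show.php?bbs=11&post=1110695">安魂曲</a> and <a href="http://peacehall.com/forum/201012/boxun2010/158712.shtml">Xu Shuiliang</a> have each written articles questioning Liu's character and morals. Yet Fang Lizhi and Wei Jingsheng alike, whether through veiled sarcasm or open contempt for Liu and his record, leave no doubt that they support his receiving the prize. That is precisely what sets them apart from Liu Xiaobo. Exercising freedom of speech presupposes that both parties are free. Liu is now in prison and cannot converse on equal terms, so the first task is to protest the CCP's outrage and to work for his release. After that we can talk over the past with him and settle accounts. None of this means that Liu's past can be wiped clean, or that his comments on Gao, Yang and others can be wiped away. When that time comes, we might also discuss what nonviolent resistance really is, and what Liu's brand of it is. For now, what we can do is remember Liu Xiaobo and his record, and then sing the praises of the Nobel Committee.</div>Yuhaohttp://www.blogger.com/profile/08569555359590748704noreply@blogger.com1tag:blogger.com,1999:blog-1107874072437767466.post-37228195886782288042010-12-19T09:46:00.000-08:002010-12-19T09:46:18.382-08:00Vector Processors (2)<div lang="zh-CN" style="font-family: 宋体; margin: 0in;"><b><span style="font-size: large;">2. Basic Vector Processor Architecture</span></b></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in;"><br />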
</div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">A vector processor typically consists of an ordinary pipelined scalar unit plus a vector unit. Every functional unit in the vector unit has a latency of several clock cycles. This permits a shorter clock cycle, and it suits long-running vector operations, which can be deeply pipelined without creating data hazards. Most vector processors support vector operations on floating-point, integer, or logical data. Here we focus on floating point. The scalar unit is essentially the same as the advanced pipelined CPUs discussed in Chapters 2 and 3, and commercial vector processors have shipped with both out-of-order scalar units (NEC SX/5) and VLIW scalar units (Fujitsu VPP5000).</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in;"><br />
</div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">Vector processors come in two main types: vector-register processors and memory-memory vector processors. In a vector-register processor, all vector operations except load and store work on the vector registers. These processors are the vector counterpart of the load-store architectures discussed for scalar processors. Vector-register machines include the Cray Research processors (Cray-1, Cray-2, X-MP, YMP, C90, T90, SV1 and X1), the Japanese supercomputers (NEC SX/2 through SX/8, Fujitsu VP200 through VPP5000, and the Hitachi S820 and S-8300), and the mini-supercomputers (Convex C-1 through C-4); nearly every vector computer shipped since the late 1980s uses this organization. In a memory-memory vector processor, all vector operations go from memory to memory. The first vector processors were of this type, as were CDC's machines. From here on we concentrate on vector-register processors. At the end of this appendix (Section 10) we briefly discuss why memory-memory processors were less successful than vector-register processors.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in;"><br />
</div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">The vector-register processor we discuss consists mainly of the components shown in Figure F.1. This processor, which is loosely based on the Cray-1, is the foundation for the discussion throughout this appendix. We call it VMIPS: its scalar unit is MIPS, and its vector unit is a vector extension of MIPS. The rest of this section examines how VMIPS relates to other processors.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in;"><br />
</div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTabKL7w_U8BBqlFZxkZGecDz0IjxBREdwVPFKeiLdCWsfD-JTygeb0m5THc1TaZ36zPYhz8cuuEHAC2FdhSamxlZFcIJy6MLF40zvFM-QiPwDJkMzxdEdSvb5_maXm_So6GVhyphenhyphent_xOzmI/s1600/CropperCapture%255B1%255D.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="380" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTabKL7w_U8BBqlFZxkZGecDz0IjxBREdwVPFKeiLdCWsfD-JTygeb0m5THc1TaZ36zPYhz8cuuEHAC2FdhSamxlZFcIJy6MLF40zvFM-QiPwDJkMzxdEdSvb5_maXm_So6GVhyphenhyphent_xOzmI/s400/CropperCapture%255B1%255D.png" width="400" /></a></div><div style="margin: 0in 0in 0in 1.125in;"><br />
</div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">   </span><i><span lang="en-US" style="font-family: Calibri; font-weight: bold;">Figure F.1 The basic structure of VMIPS. </span><span lang="en-US" style="font-family: Calibri;">This processor has a MIPS-like scalar unit. It also has eight vector registers of 64 elements each, and all of its functional units are vector functional units. Special vector instructions are defined for both arithmetic and memory access. The figure includes units for logical and integer operations, which standard vector processors also have, but we will not discuss them except in the exercises. The scalar and vector registers have a large number of read and write ports so that several vector operations can proceed at once. These ports are connected to the inputs and outputs of the vector functional units by a set of crossbars (shown in gray in the figure). In Section 4 we add chaining, which requires more interconnect capability.</span></i></div><div style="font-size: 11pt; margin: 0in;"></div><div style="font-size: 11pt; margin: 0in;"> <br />
<div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">The main components of VMIPS are:</span></div><ul><li><span lang="en-US" style="font-family: Calibri;">Vector registers: Each vector register is a fixed-length bank holding a single vector. VMIPS has eight vector registers, each holding 64 elements. Each vector register has at least two read ports and one write port, which lets vector operations that use different vector registers overlap to a high degree [1]. (We do not consider problems caused by a shortage of register ports; in a real machine this would cause a structural hazard.) The 16 read ports and 8 write ports in total are connected to the functional-unit inputs and outputs by a pair of crossbars. (This description of the register file of a vector-register processor is simplified. Real machines exploit the regular access pattern within a vector instruction to simplify the register-file circuitry [Asanovic 1998]; the Cray-1, for example, manages with only a single port per register.)</span></li>
</ul><ul><li><span lang="en-US" style="font-family: Calibri;">Vector functional units: Each unit is fully pipelined and can start an operation on a new vector element every clock cycle. A control unit is needed to detect hazards, both conflicts over the functional units (structural hazards) and conflicts over register accesses (data hazards) [2]. As Figure F.1 shows, VMIPS has five functional units. For simplicity we discuss only the floating-point units. Depending on the design, scalar operations either use the vector functional units or have a dedicated set of their own. Here we assume the functional units are shared and, once again, ignore any possible conflicts.</span></li>
</ul><ul><li><span lang="en-US" style="font-family: Calibri;">Vector load-store unit: This unit loads a vector from main memory or stores one to it. VMIPS vector loads and stores are fully pipelined, so after an initial latency, data can move between the vector registers and memory at a bandwidth of one word per clock cycle. This unit normally handles scalar loads and stores as well.</span></li>
</ul><ul><li><span lang="en-US" style="font-family: Calibri;">A set of scalar registers: Scalar registers can provide input data to the vector functional units [3] and can supply addresses to the vector load-store unit. These are simply the usual 32 general-purpose registers and 32 floating-point registers of MIPS. Scalar values are read out of the scalar registers and then latched at one input of the vector functional units.</span></li>
</ul><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">Figure F.2 lists the characteristics of some typical vector processors, including the size and number of registers, the number and types of functional units, and the number of load-store units. The last column gives the number of lanes in the machine, that is, the number of parallel pipelines that can execute operations on the elements of one vector at the same time. Section 4 describes lanes in detail. Here we assume that VMIPS has a single pipeline (one lane) per functional unit.</span></div><div style="font-size: 11pt; margin: 0in;"><span lang="zh-CN" style="font-family: 宋体;"><br />
</span></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgevXQhrmm_BcvRizQFUPeVfNKtW3ISfGmR9kQK3plOXaA7gc7RuggAnCz0QxUYoofizE2BbeJgYhJoj2rd2khOvcrQEhxfFQKTynousNCchDgXxF6jyCYdY5_Sg9roLNXMfYRcvs9vkzS3/s1600/F2.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="382" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgevXQhrmm_BcvRizQFUPeVfNKtW3ISfGmR9kQK3plOXaA7gc7RuggAnCz0QxUYoofizE2BbeJgYhJoj2rd2khOvcrQEhxfFQKTynousNCchDgXxF6jyCYdY5_Sg9roLNXMfYRcvs9vkzS3/s400/F2.PNG" width="400" /></a></div><div style="font-size: 11pt; margin: 0in;"><span lang="zh-CN" style="font-family: 宋体;"><br />
</span></div><div style="font-size: 11pt; margin: 0in;"> </div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri; font-style: italic; font-weight: bold;">Figure F.2 Characteristics of several vector-register architectures. </span><span lang="en-US" style="font-family: Calibri; font-style: italic;">If a machine is a multiprocessor, the entries describe a single processor. Several machines clock their scalar and vector units at different rates; the clock rates shown are for the vector units. The Fujitsu machines' vector registers are configurable: the 8K 64-bit entries </span><span lang="en-US" style="font-family: Calibri;">[4]</span><span lang="en-US" style="font-family: Calibri; font-style: italic;"> can be organized in different ways (on the VP200, for example, anywhere from 8 registers of 1K elements each to 256 registers of 32 elements each). The NEC machines have 8 foreground vector registers connected to the arithmetic units, plus 32 to 64 background registers that sit between memory and the foreground registers. Add pipelines perform both addition and subtraction. The multiply/divide-add unit on the Hitachi S810/820 performs a floating-point multiply or divide followed by an add or subtract (a multiply-add unit performs a floating-point multiply followed by an add or subtract). Note that most processors use the vector floating-point units for vector integer operations, and that some use the same functional units for both vector and scalar operations. Each vector load-store unit can carry out one independent, overlapped load or store. The number of lanes, as discussed in Section 4, is the number of parallel pipelines in each functional unit. For example, the NEC SX/5 multiply unit can complete 16 multiplies per clock cycle </span><span lang="en-US" style="font-family: Calibri;">[5]</span><span lang="en-US" style="font-family: Calibri; font-style: italic;">. Some machines can split a 64-bit lane into two 32-bit lanes to improve performance for programs that do not need full precision. The Cray SV1 and Cray X1 can group four CPUs of two lanes each so that they act as a single CPU with eight lanes; Cray calls this a Multi-Streaming Processor (MSP).</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in;"><br />
</div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">In VMIPS, vector instructions take the same names as the corresponding MIPS instructions, with a “V” appended. Thus ADDV.D is an operation that adds two double-precision vectors. The inputs of a vector instruction are either a pair of vector registers (ADDV.D) or one vector register and one scalar register (ADDVS.D). In the latter case the scalar value is used as an input for every vector element: ADDVS.D adds the contents of a scalar register to each element of a vector register. The scalar value is replicated to the functional units when the instruction issues. Most vector operations have a vector destination register, although a few store their result in a scalar register. LV and SV denote vector load and vector store; they load or store an entire vector of double-precision data. One operand is the vector register to be loaded or stored; the other is a MIPS general-purpose register that holds the starting address of the vector in memory. Figure F.3 lists the VMIPS vector instructions. In addition to the vector registers, we need two special-purpose registers: the vector-length register and the vector-mask register. We discuss these two registers and their uses in Sections 3 and 4, respectively.</span></div><div style="font-size: 11pt; margin: 0in;"><span lang="zh-CN" style="font-family: 宋体;"><br />
</span></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVtHOCePZcP8YmdZMEKjf7EYA98d8rWW335CWO4BoTf5AT8SvIXerJmUN-IsJUbz13ZaKWN6H1eDOABLyCNqD-WAoeurEFlRTs5N9sxhSi5M5zJQT7Ei8VZ9EMor13docDrUE6F7Tf0PKk/s1600/F3.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="332" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVtHOCePZcP8YmdZMEKjf7EYA98d8rWW335CWO4BoTf5AT8SvIXerJmUN-IsJUbz13ZaKWN6H1eDOABLyCNqD-WAoeurEFlRTs5N9sxhSi5M5zJQT7Ei8VZ9EMor13docDrUE6F7Tf0PKk/s400/F3.PNG" width="400" /></a></div><div style="font-size: 11pt; margin: 0in;"><span lang="zh-CN" style="font-family: 宋体;"><br />
</span></div><div style="font-size: 11pt; margin: 0in;"> </div><div style="font-size: 11pt; font-style: italic; margin: 0in;"><span lang="en-US" style="font-family: Calibri; font-weight: bold;">Figure F.3 The VMIPS vector instructions. </span><span lang="en-US" style="font-family: Calibri;">Only the double-precision instructions are listed. In addition to the vector registers, there are two special registers, VLR (discussed in Section 3) and VM (discussed in Section 4). These special registers are assumed to live in MIPS coprocessor 1, that is, in the same space as the FPU registers. The operations with stride are covered in Section 3. The index-creation and indexed memory operations are covered in Section 4.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; font-style: italic; margin: 0in;"><br />
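<div style="font-size: 11pt; font-style: normal; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">As a rough illustration of the semantics described above, the following Python sketch models ADDV.D, ADDVS.D, LV and SV on plain lists. It is only a behavioural model under assumptions that are not part of VMIPS itself: the function names (addv_d, addvs_d, lv, sv), the flat list used as memory, addressing in units of elements rather than bytes, and the fixed vector length of 64 are all made up for this example.</span></div>
<pre style="font-style: normal;">
```python
VLEN = 64  # assumed number of elements per vector register


def addv_d(v1, v2):
    """ADDV.D: element-wise sum of two vector registers."""
    return [a + b for a, b in zip(v1, v2)]


def addvs_d(v1, f0):
    """ADDVS.D: add one scalar value to every element of a vector register."""
    return [a + f0 for a in v1]


def lv(memory, base):
    """LV: load VLEN consecutive elements starting at element address base."""
    return memory[base:base + VLEN]


def sv(memory, base, v):
    """SV: store a whole vector register starting at element address base."""
    memory[base:base + VLEN] = v


# Toy memory holding two vectors A and B back to back, plus room for C.
memory = [float(i) for i in range(2 * VLEN)] + [0.0] * VLEN
a = lv(memory, 0)         # A = 0.0 .. 63.0
b = lv(memory, VLEN)      # B = 64.0 .. 127.0
c = addv_d(a, b)          # C[i] = A[i] + B[i]
sv(memory, 2 * VLEN, c)   # write C back to memory
d = addvs_d(a, 10.0)      # D[i] = A[i] + 10.0, the scalar is broadcast
```
</pre>
<div style="font-size: 11pt; font-style: normal; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">Note that the second operand of each load or store in this model is just a starting address; the rest of the access pattern is implied by the vector length.</span></div>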
</div><div style="font-size: 11pt; margin: 0in; text-align: center;"><span lang="en-US" style="font-family: Calibri;">---------------Hello everyone, I am the divider line---------------</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in;"><br />
</div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">[1] We will meet the concept of a convoy later. A convoy is a group of vector instructions that can execute at the same time. They may access different registers, so we must provide enough ports to make sure we never run short of them.</span></div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">[2] The register-access conflicts meant here are not conflicts caused by a shortage of ports. They are incorrect register accesses caused by mishandling data dependences.</span></div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">[3] Think about it: in what situations does a vector operation need scalar data? In what situations does the result of a vector operation need to be written to a scalar register?</span></div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">[4] An entry here means one element of a vector register. 64-bit means that each element is 64 bits long.</span></div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">[5] More precisely, it completes 16 operations that come from different vectors.</span></div></div>Yuhaohttp://www.blogger.com/profile/08569555359590748704noreply@blogger.com0tag:blogger.com,1999:blog-1107874072437767466.post-6781048532395037082010-12-18T09:27:00.000-08:002011-01-06T07:59:22.854-08:00Terminology<div></div><div style="text-align: justify;">This is physically a separate page, but logically it belongs to the series of posts that starts with <a href="http://yuhaozhu.blogspot.com/2010/12/1.html">this post</a>. Its main purpose is to clarify the daunting and inconsistently used computer architecture terminology that I ran into while translating.<br />
<span class="Apple-style-span" style="font-family: Calibri; font-size: 15px;">1.</span> <span lang="en-US" style="color: red; font-family: Calibri; font-size: 11pt;">dependency/hazard</span><span lang="en-US" style="font-family: Calibri; font-size: 11pt;">: Strictly speaking these two terms can be told apart, but most of the time people use them interchangeably. Strictly, a dependency is a property of the program itself, whereas a hazard is the incorrect behaviour that arises in the processor when a dependency is handled improperly.</span></div><div style="text-align: justify;"><span lang="en-US" style="font-family: Calibri; font-size: 11pt;">2. <span class="Apple-style-span" style="color: red;">in flight</span></span><span lang="en-US" style="font-family: Calibri; font-size: 11pt;">: In-flight instructions are not necessarily limited to the instructions currently executing in an FU. More broadly, the term also covers instructions that have finished executing but have not yet retired, and instructions that are waiting in a reservation station. In this sense, the instruction window contains exactly the in-flight instructions.</span></div><div style="text-align: justify;"><span lang="en-US" style="font-family: Calibri; font-size: 11pt;">3. <span class="Apple-style-span" style="color: red;">instruction window</span></span><span lang="en-US" style="font-family: Calibri; font-size: 11pt;">: There are two quite different usages. In the more common one, the instructions in the instruction window are those currently executing in the functional units or, more broadly, those in the ROB that have not yet retired. In this usage the instruction window is purely a logical concept. This appears to be the meaning used throughout the Quantitative Approach book. The other usage refers to a buffer between instruction fetch and decode/renaming. When decode or rename cannot proceed for some reason, the fetch unit keeps fetching instructions into this buffer instead of stalling. This buffer, however, is usually called the instruction queue. In this usage the instruction window is a storage structure that physically exists.</span></div><div style="text-align: justify;"><span lang="en-US" style="font-family: Calibri; font-size: 11pt;">4. 
<span class="Apple-style-span" style="color: red;">interlock/stall/bubble</span></span><span lang="en-US" style="font-family: Calibri; font-size: 11pt;">: interlock and stall can be regarded as the same thing, namely making the pipeline stop. The usual way to stall a pipeline is clock gating, or, put another way, turning off the load enable signal of the pipeline registers. However, it is in the nature of combinational logic that even when nothing is written into a register, data is still read out of it every cycle. That data is invalid, because the front half of the pipeline is stalled. We therefore need a valid bit that tells the later pipeline stages that the data now being presented is invalid and must not be processed until the valid bit is set to 1. Setting the valid bit to 0 is usually described as inserting a bubble.</span></div><div style="text-align: justify;"><span class="Apple-style-span" style="font-family: Calibri; font-size: 15px;"><span lang="en-US" style="font-family: Calibri; font-size: 11pt;">5. 
<span class="Apple-style-span" style="color: red;">memory pipeline</span></span><span lang="en-US" style="font-family: Calibri; font-size: 11pt;">: Conceptually, a memory pipeline is a path over which memory requests can be issued independently. Normally a memory system has one memory pipeline, because there is only one address bus and one data bus, so only one memory request can be issued at any given moment (although several banks can be busy at the same time). If we add another set of buses, we can say that we have added a memory pipeline. There are two ways to add memory pipelines. One is what is now called multi-channel memory: there are two independent memory controllers, each controlling its own physically separate memory system. The other is the one used in vector processors: there is still a single physical memory system, but each memory bank now has multiple ports, and each port is paired with its own address and data bus. If each bank has two ports, we need two sets of address/data buses. We can then issue two memory requests at once, to two different banks, and receive two data items in one cycle. </span><a href="http://www.eecg.toronto.edu/~corinna/"><span lang="en-US" style="font-family: Calibri; font-size: 11pt;">Corinna Grace Lee</span></a><span lang="en-US" style="font-family: Calibri; font-size: 11pt;">'s </span><a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.53.5235"><span lang="en-US" style="font-family: Calibri; font-size: 11pt;">Ph.D. thesis</span></a><span lang="en-US" style="font-family: Calibri; font-size: 11pt;"> notes that, compared with a superscalar processor, a vector processor needs noticeably fewer pins between the CPU and the memory system, because even with multiple memory ports a single set of buses can be enough, whereas in a superscalar processor every port needs its own set of buses. What she means, I believe, is that in a vector processor the memory addresses are computed inside the memory system, and the CPU only has to supply, once, the basic information that describes the access pattern. So although the memory controller is complex to design, a single use of the address bus can trigger multiple (simultaneous) memory requests. In a superscalar processor, by contrast, the address of every memory request must first be computed on the CPU side and then sent to the memory system, so performing several accesses at the same time requires several address buses. Lee also points out that gather/scatter is a special case in vector processors: the addresses of these operations follow no fixed pattern, so the CPU has to compute them and send them to the memory system one by one. Today's vector processors, however, are fully capable of letting the memory system carry out this computation.</span></span></div><ol style="direction: ltr; font-family: Calibri; font-size: 11pt; margin-bottom: 0in; margin-left: 0.375in; margin-top: 0in; text-align: justify; unicode-bidi: embed;" type="1"></ol>Yuhaohttp://www.blogger.com/profile/08569555359590748704noreply@blogger.com0tag:blogger.com,1999:blog-1107874072437767466.post-52608557563379012272010-12-18T08:41:00.000-08:002010-12-19T09:50:37.202-08:00Vector Processors (1)<div style="text-align: justify;"><span lang="en-US" style="font-family: Calibri;">I have decided to translate Appendix F of Computer Architecture: A Quantitative Approach, 4th edition. The reason is that vector processors have become extremely important. One could say that vector processors are the forerunners of the GPU. To understand GPUs at the architecture level, there is no reason not to first understand the vector processors of the past. Many techniques used in vector processors have been miraculously reborn in today's most advanced microprocessors. For example, the T Register File and B Register File in the Cray-1 are essentially the same idea as the Stream Register File (SRF) in a Stream Processor, and the Shared Memory in modern general-purpose GPUs has much the same design motivation as the S Register File and A Register File. Another reason, of course, is that I have not seen any translation of the appendices of the book.</span></div><div></div><div lang="zh-CN" style="font-family: Calibri; font-size: 11pt; margin: 0in 0in 0in 0.375in; text-align: justify;"><br />
</div><div style="text-align: justify;"><span lang="en-US" style="font-family: Calibri; font-size: 11pt;">The author of this appendix is </span><a href="http://www.eecs.berkeley.edu/%7Ekrste"><span lang="en-US" style="font-family: Calibri; font-size: 11pt;">Krste Asanović</span></a><span lang="en-US" style="font-family: Calibri; font-size: 11pt;">. His own </span><a href="http://www.eecs.berkeley.edu/%7Ekrste/thesis.html"><span lang="en-US" style="font-family: Calibri; font-size: 11pt;">Ph.D. Thesis</span></a><span lang="en-US" style="font-family: Calibri; font-size: 11pt;"> was itself the design of a vector processor. It is said that, as David Patterson gradually steps back from day-to-day matters, he has become the de facto leader of the ParLab. I am translating this appendix purely in the spirit of learning it, and any questions or objections are welcome in the comments below.</span></div><div style="text-align: justify;"></div><div lang="zh-CN" style="font-family: Calibri; font-size: 11pt; margin: 0in 0in 0in 0.375in; text-align: justify;"><br />
</div><div style="text-align: justify;"><span lang="en-US" style="font-family: Calibri;">One last point: nothing in computer architecture is more of a headache than its bewildering variety of terminology, and switching these terms between Chinese and English only makes matters worse. My strategy is twofold. First, where a very good, well-established and widely used Chinese translation exists, I use the Chinese term (I am translating, after all), but even then I give the English term in parentheses. Second, some terms are themselves highly contested. Take issue and dispatch: x86 and MIPS use them in exactly opposite ways. I have therefore set up <a href="http://yuhaozhu.blogspot.com/2010/12/terminology.html">a separate page</a> to explain these maddeningly confusing terms, so that readers do not completely misread the intended meaning of the original. </span>In addition, wherever I think the original is not especially clear or is hard to follow, I give my own understanding in the form of a note.</div><div style="text-align: justify;"></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in 0in 0in 0.375in; text-align: justify;"><br />
</div><div style="text-align: justify;"></div><div style="font-size: 11pt; margin: 0in 0in 0in 0.375in; text-align: center;"><span lang="en-US" style="font-family: Calibri;">---------------Hello everyone, I am the divider line---------------</span></div><div style="text-align: justify;"></div><div lang="zh-CN" style="font-family: Calibri; font-size: 11pt; margin: 0in 0in 0in 0.375in; text-align: justify;"><br />
</div><div style="text-align: justify;"><i><span lang="en-US" style="font-family: Calibri;">I'm certainly not inventing vector processors. There are three kinds that I know of existing today. They are represented by the Illiac-IV, the (CDC) Star processor, and the TI (ASC) processor. Those three were all pioneering processors… One of the problems of being a pioneer is you always make mistakes and I never, never want to be a pioneer. It's always best to come second when you can look at the mistakes the pioneers made.</span></i></div><div style="text-align: justify;"></div><div style="font-family: Calibri; font-size: 11pt; font-weight: bold; margin: 0in 0in 0in 0.375in; text-align: right;">Seymour Cray</div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in 0in 0in 0.375in; text-align: justify;"><br />
<br />
</div><div style="text-align: justify;"><span style="font-size: large;"><b><span style="font-family: 宋体;">1. Why Vector Processors?</span></b></span><br />
<br />
<span lang="en-US" style="font-family: Calibri;">In Chapters 2 and 3 we saw how to increase performance significantly by issuing multiple instructions per clock cycle and by using more deeply pipelined execution units to exploit instruction-level parallelism (ILP). (This appendix assumes that you have read Chapters 2 and 3 completely, as well as Appendix G. In addition, the discussion of vector memory systems assumes that you have read Appendix C and Chapter 5.) Unfortunately, we also saw that a variety of difficulties arise as we try to exploit ever larger degrees of ILP.</span></div><div style="text-align: justify;"></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in 0in 0in 0.375in; text-align: justify;"><br />
</div><div style="text-align: justify;"><span lang="en-US" style="font-family: Calibri;">As we increase the issue width and the number of pipeline stages, we also need more independent instructions to keep the pipeline busy. This means that the number of instructions that can be in flight at the same time grows. For a dynamically scheduled processor, it means that hardware structures such as the instruction window, the ROB, and the renaming register file must grow accordingly so that they have enough capacity to hold information about all in-flight instructions. Worse, the number of ports on each of these structures must grow with the issue width. The logic that tracks dependencies among all in-flight instructions grows quadratically with the number of instructions. Even a statically scheduled VLIW processor, which shifts more of the scheduling work to the compiler, still needs more registers, more register ports, and more hazard interlock logic (we assume the hardware detects at issue time whether an interlock is needed) to support more in-flight instructions. This likewise leads to quadratic growth in circuit size and complexity[1]. Such rapid growth in circuit complexity makes it very hard to design a processor that can control a large number of in-flight instructions, and in practice this in turn limits the issue width and the pipeline depth.</span></div><div style="text-align: justify;"></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in 0in 0in 0.375in; text-align: justify;"><br />
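<div style="text-align: justify;"><span lang="en-US" style="font-family: Calibri;">To get a feel for the quadratic growth mentioned above, here is a minimal sketch that counts how many pairs of in-flight instructions a naive dependency checker would have to consider. The all-pairs model and the function name pairwise_checks are simplifying assumptions made for illustration; they do not describe the logic of any real processor.</span></div>
<pre>
```python
def pairwise_checks(in_flight):
    """Number of distinct instruction pairs among in_flight instructions."""
    return in_flight * (in_flight - 1) // 2


# Each doubling of the window roughly quadruples the number of pairs.
for n in (4, 8, 16, 32, 64):
    print(n, pairwise_checks(n))
```
</pre>
<div style="text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Under this assumed model, 4 in-flight instructions give 6 pairs while 64 give 2016, which is the kind of growth the paragraph above is warning about.</span></div>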
</div><div style="text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Vector processors were successfully commercialized long before ILP processors. They take a different approach to controlling multiple deeply pipelined functional units. Vector processors provide high-level operations on vectors, that is, linear arrays of numbers. A typical vector operation adds two vectors of 64 floating-point elements to produce a new 64-element vector. This one vector instruction is equivalent to an entire loop in which each iteration computes one element of the result, updates the loop variable, and branches back to the top of the loop.</span></div><div style="text-align: justify;"></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in 0in 0in 0.375in; text-align: justify;"><br />
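<div style="text-align: justify;"><span lang="en-US" style="font-family: Calibri;">The equivalence between one vector instruction and a whole loop can be sketched as follows. The instruction counts are a back-of-the-envelope assumption (one add, one loop-variable update and one branch per scalar iteration, with loads and stores ignored), and the names scalar_add_loop and vector_add are invented for this example.</span></div>
<pre>
```python
N = 64  # elements per vector, matching the 64-element example in the text


def scalar_add_loop(a, b):
    """Scalar version: one element per iteration, counting dynamic instructions."""
    c = [0.0] * N
    executed = 0
    for i in range(N):
        c[i] = a[i] + b[i]  # the add itself
        executed += 3       # add + loop-variable update + branch back
    return c, executed


def vector_add(a, b):
    """Vector version: the whole loop is expressed by a single instruction."""
    return [x + y for x, y in zip(a, b)], 1


a = [float(i) for i in range(N)]
b = [2.0 * i for i in range(N)]
c_scalar, n_scalar = scalar_add_loop(a, b)
c_vector, n_vector = vector_add(a, b)
```
</pre>
<div style="text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Both versions produce the same result, but under the stated assumptions the scalar loop has to fetch and decode 192 instructions where the vector version needs one.</span></div>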
</div><div style="text-align: justify;">Vector processors have several important properties that let them solve most of the problems mentioned above:<span lang="zh-CN" style="font-family: 宋体; font-size: 11pt;"> </span><br />
<ul><li><span lang="en-US" style="font-family: Calibri; font-size: 11pt;">A single vector instruction does a great deal of work: it is equivalent to an entire loop. Each instruction represents tens or hundreds of operations, so the instruction fetch and instruction decode bandwidth needed to keep multiple deeply pipelined functional units busy drops dramatically.</span></li>
</ul><ul><li><span lang="en-US" style="font-family: Calibri; font-size: 11pt;">By using a vector instruction, the compiler or programmer states explicitly that the computations on the individual elements of a vector are independent of one another, so the hardware does not need to check for data hazards within a vector. The elements can be computed by an array of parallel functional units, by one very deeply pipelined functional unit, or by any combination of the two.</span></li>
</ul><ul><li><span lang="en-US" style="font-family: Calibri; font-size: 11pt;">The hardware only needs to check for data hazards between two vector instructions, not between the elements inside each vector. The dependency-checking logic is therefore about the same size as in a scalar processor, yet many more (element) operations can be in flight at once.</span></li>
</ul><ul><li><span lang="en-US" style="font-family: Calibri; font-size: 11pt;">Vector memory instructions have a fixed, known access pattern. If the elements of a vector are adjacent, fetching the vector from a set of heavily interleaved memory banks works very well. The latency of accessing main memory, which is higher than that of accessing a cache, is amortized because one vector memory operation is issued for all the elements of the vector rather than for a single element.</span></li>
</ul><ul><li><span lang="en-US" style="font-family: Calibri; font-size: 11pt;">Because an entire loop is replaced by one vector instruction whose behavior is predictable, the control hazards normally caused by the loop branch no longer exist.</span></li>
</ul></div><div style="text-align: justify;"><ul style="direction: ltr; margin-bottom: 0in; margin-left: 0.375in; margin-top: 0in; text-align: justify; unicode-bidi: embed;" type="disc"></ul></div><div style="text-align: justify;"></div><div style="text-align: justify;">For these reasons, vector operations run much faster than the equivalent sequence of scalar instructions on the same amount of data, so designers include vector units when they find that their applications perform vector operations frequently.</div><div style="text-align: justify;"></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in 0in 0in 0.375in; text-align: justify;"><br />
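<div style="text-align: justify;"><span lang="en-US" style="font-family: Calibri;">The point about adjacent elements and interleaved memory banks can be illustrated with a small sketch. It assumes simple word-level (low-order) interleaving and a vector that starts at word address 0; the function names and the bank count of 8 are hypothetical choices for illustration. The stride-8 contrast at the end is my own addition, not something the text discusses.</span></div>

```python
def bank_of(element_index: int, stride: int, num_banks: int) -> int:
    """Bank that holds a given vector element, assuming word-interleaved
    banks and a vector starting at word address 0."""
    return (element_index * stride) % num_banks


def banks_touched(vlen: int, stride: int, num_banks: int) -> list:
    """Sequence of banks visited while loading a vector of vlen elements."""
    return [bank_of(i, stride, num_banks) for i in range(vlen)]


if __name__ == "__main__":
    # Adjacent elements (stride 1) rotate through all banks in turn, so each
    # bank has num_banks - 1 accesses' worth of time to recover.
    print(banks_touched(16, 1, 8))
    # Contrast: a stride equal to the bank count hits the same bank every time.
    print(banks_touched(16, 8, 8))
```

<div style="text-align: justify;"><span lang="en-US" style="font-family: Calibri;">With unit stride the accesses spread evenly over all banks, which is why a single vector load can keep an interleaved memory system busy.</span></div>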
</div><div style="text-align: justify;"><span lang="en-US" style="font-family: Calibri;">Vector processors are especially effective for large-scale scientific and engineering computation, including car crash simulation and weather forecasting. Such applications typically keep a supercomputer busy for dozens of hours processing gigabytes of data. High-speed scalar processors rely on caches to reduce main memory access latency, but large, long-running scientific programs usually have very large working sets and often very low locality, which makes the memory hierarchy perform very poorly. Scalar processors therefore provide mechanisms to bypass the cache when software knows that memory accesses have poor locality. But saturating main memory requires the hardware to track hundreds or thousands of in-flight scalar memory operations, and this has proven very expensive for scalar ISAs. In contrast, a vector ISA can start the memory accesses for all the elements of a vector with a single vector instruction, so very simple logic can deliver high bandwidth[2].</span></div><div style="text-align: justify;"></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in 0in 0in 0.375in; text-align: justify;"><br />
</div><div style="text-align: justify;"><span lang="en-US" style="font-family: Calibri;">When this appendix was last written in 2001, exotic vector supercomputers were slowly fading from the supercomputing arena, replaced by superscalar processors. But in 2002 Japan built what was then the world's fastest supercomputer, the Earth Simulator. It was designed to create a "virtual planet" for analyzing and predicting changes in the world's environment and climate. It was 5 times faster than the previous fastest supercomputer, and faster than the next 12 supercomputers combined. This caused an upheaval in high-performance computing, especially in the United States, which was shocked to have lost such strategically important ground so quickly. The Earth Simulator has fewer processors than competing machines, but each node is a single-chip vector microprocessor. It delivers very high performance on many important supercomputing codes, for the reasons given above. The impact of the Earth Simulator and of the new generation of vector processors announced by Cray has led to renewed interest in and respect for vector processors.</span><br />
<div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in;"><span lang="zh-CN" style="font-family: 宋体;"> </span> </div><div style="font-size: 11pt; margin: 0in; text-align: center;"><span lang="en-US" style="font-family: Calibri;">---------------Hi everyone, I'm the divider line---------------</span></div><div style="font-size: 11pt; margin: 0in;"><br />
</div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">[1] For why the growth is quadratic, see Chapter 3, Section 2 of the original book. Put simply, it is a combinatorics problem.</span></div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">[2] Put simply, this passage means that the memory system of a high-end scalar processor is usually very complex. It needs structures such as MSHRs, sophisticated memory scheduling algorithms, and even compiler optimizations to improve memory access efficiency and make full use of the already limited bandwidth of the memory interface. For a vector machine, because the access pattern is regular, a very simple memory system paired with a single vector instruction is enough to saturate the memory bandwidth.</span></div></div>Yuhaohttp://www.blogger.com/profile/08569555359590748704noreply@blogger.com0tag:blogger.com,1999:blog-1107874072437767466.post-31377693414491352402010-12-14T21:46:00.000-08:002010-12-16T07:23:20.812-08:00SC10: Green500 and Booth "Awards"<div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">All right, I am going back on my word: I am going to translate one more of Steve Keckler's SC'10 </span><a href="http://cacm.acm.org/blogs/blog-cacm/101908-sc10-green500-and-booth-awards/fulltext"><span lang="en-US" style="font-family: Calibri;">blog</span></a><span lang="en-US" style="font-family: Calibri;"> posts, this one about green computing. On Chinese BBS forums some people jokingly call it "harmonious computing", a very timely theme. You can see that NVIDIA has its reasons for using green computing as a hook to attract attention. Given its poor single-thread performance, NVIDIA puts forward three claims. First, most future applications will be throughput-oriented. Second, GPUs can achieve very high throughput at the expense of single-thread performance. Third, for a given level of throughput, the energy a GPU spends is negligible compared with a CPU. The three points follow from one another logically and are almost unassailable.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in;"><br />
</div><div style="font-size: 11pt; margin: 0in; text-align: center;"><span lang="en-US" style="font-family: Calibri;">---------------Hi everyone, I'm the divider line---------------</span></div><div style="font-family: Calibri; font-size: 11pt; margin: 0in;"><br />
</div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">In the spirit of the Olympic motto "Faster, Higher, Stronger" (Citius, Altius, Fortius), SC10 features a whole series of rankings of high-performance computers, not just the original Top500. The Green500 was started in 2007 by Wu Feng of Virginia Tech and others. Its goal is to raise awareness of power efficiency in high-performance computing systems. The list has evolved into a tool that computer vendors and supercomputing centers use to promote themselves. When introducing the list, Wu expressed concern that it could be abused and gamed. He proposed several ways to make the list more meaningful, such as extending the benchmarks beyond LINPACK and setting stricter standards for measuring and reporting power data.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in;"><br />
</div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">Wu invited Craig Steffen of the National Center for Supercomputing Applications (NCSA) to give a talk on the methodology they used to measure the data that the Green500 requires for their machine (which they call EcoG). Craig showed some cool photos of how they fitted a clamp-style current probe inside a PDU (power distribution unit). Its output is connected directly to a data logger that records instantaneous power at one-second intervals. Don't try this at home, though: those PDUs run at 208V. The PDU under test accounts for 8 of the 128 units in operation, which complies with the Green500 rules: they allow whole-system numbers to be reported by measuring a subsystem and scaling the result up to the full system. Craig showed some plots of power over time across multiple LINPACK runs, which were very cool. Even within a single run, power fluctuates by 15% (peak to peak), and the average power actually decreases as the run progresses. Craig said they would like to capture data at a finer granularity, starting at 200 milliseconds (the highest resolution of their current sampler), so that they can better correlate changes in power with application behavior. Another interesting point is that EcoG decided to report performance/Watt over 80% of a LINPACK run (starting at the 10% mark) rather than over just the 20% minimum that the Green500 requires. They believe that the middle 80%, which excludes the startup and wrap-up phases, is more representative than picking the best 20%. I am curious how power varies over time on the other machines.</span></div><div lang="zh-CN" style="font-family: 宋体; font-size: 11pt; margin: 0in;"><br />
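<div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">The difference between averaging over the middle 80% of a run and picking the most favorable 20% can be sketched as follows. The power trace is made up for illustration, the function names are hypothetical, and the sketch assumes that the 20% may be any contiguous window of the run, which the text does not spell out.</span></div>

```python
def window_mean(samples, start_pct, end_pct):
    """Mean of the samples between start_pct and end_pct of the run."""
    n = len(samples)
    lo = n * start_pct // 100
    hi = n * end_pct // 100
    window = samples[lo:hi]
    return sum(window) / len(window)


def lowest_window_mean(samples, width_pct):
    """Lowest mean over any contiguous window covering width_pct of the run,
    i.e. the window that would make the machine look most efficient."""
    n = len(samples)
    width = max(1, n * width_pct // 100)
    return min(sum(samples[i:i + width]) / width
               for i in range(n - width + 1))


# Hypothetical power trace in kW, one sample per equal time slice of the run.
POWER_KW = [100, 90, 80, 80, 80, 80, 80, 80, 80, 50]

if __name__ == "__main__":
    print("middle 80%:", window_mean(POWER_KW, 10, 90), "kW")
    print("best 20%:  ", lowest_window_mean(POWER_KW, 20), "kW")
```

<div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">On this made-up trace the middle 80% averages 81.25 kW while the most favorable 20% window averages 65 kW, so the same LINPACK result would yield a noticeably higher performance/Watt figure under the cherry-picked window.</span></div>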
</div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">After keeping everyone in suspense, Wu finally announced the Top10 of the </span><a href="http://www.green500.org/"><span lang="en-US" style="font-family: Calibri;">Green500</span></a><span lang="en-US" style="font-family: Calibri;">. Eight of the ten are heterogeneous systems (based on either Cell or GPUs). He also presented the following three Green500 awards:</span></div><ul style="direction: ltr; margin-bottom: 0in; margin-left: 0.375in; margin-top: 0in; unicode-bidi: embed;" type="disc"><li style="margin-bottom: 0pt; margin-top: 0pt; vertical-align: middle;"><span lang="en-US" style="font-family: Calibri; font-size: 11pt;">The "Most Harmonious Supercomputer in the World" (Greenest Supercomputer in the World) award went to the IBM BlueGene/Q prototype at IBM Research. It tops the list at 1684 MFlops/Watt (38 KW in total). I later stopped by the IBM booth to take a look at their hardware. It is not as imposing as the Blue Waters system I mentioned a few days ago, but BlueGene/Q still uses a set of custom technologies, including a custom node card made up of BlueGene chips and up to 16GB of local memory. The system is also water cooled, which eliminates overheads such as fan power and very likely reduces leakage power by running at a lower temperature.</span></li>
<li style="margin-bottom: 0pt; margin-top: 0pt; vertical-align: middle;"><span lang="en-US" style="font-family: Calibri; font-size: 11pt;">The "Most Harmonious Production Supercomputer in the World" (Greenest Production Supercomputer in the World) award went to Tsubame 2.0 at the Tokyo Institute of Technology. It ranks second on the list at 958 MegaFlops/Watt (1244 KW in total). Tsubame 2.0 is actually deployed. It pairs 3 NVIDIA Tesla 20-series GPUs with every two Intel Westmere CPUs, a GPU/CPU ratio higher than that of any supercomputer further down the list.</span></li>
<li style="margin-bottom: 0pt; margin-top: 0pt; vertical-align: middle;"><span lang="en-US" style="font-family: Calibri; font-size: 11pt;">The "Most Harmonious Self-Built Computer in the World" (Greenest Self-Built Computer in the World) award went to EcoG at NCSA. It ranks third at 933 MegaFlops/Watt (36KW in total). EcoG is a student project in collaboration with NVIDIA Research (I mentioned this machine a few days ago). EcoG uses a 1:1 ratio of GPU to CPU chips, but it uses Core i3 parts instead of higher-end CPUs, trading serial performance for better CPU energy efficiency. It is worth noting that EcoG is built from everyday components that anyone can buy online.</span></li>
</ul><div style="font-family: Calibri; font-size: 11pt; margin: 0in;"><br />
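<div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">As a quick sanity check on the figures in the list above, efficiency multiplied by total power gives the LINPACK performance each entry implies. The sketch below only does that arithmetic on the numbers quoted in the post; the function name is hypothetical, and the results are derived values, not figures reported in the original.</span></div>

```python
def implied_tflops(mflops_per_watt: float, total_kw: float) -> float:
    """LINPACK performance implied by an efficiency and a total power draw."""
    watts = total_kw * 1_000
    return mflops_per_watt * watts / 1_000_000  # MFLOPS -> TFLOPS


# (name, MFlops/Watt, total KW) exactly as quoted in the award list above.
AWARDS = [
    ("IBM BlueGene/Q prototype", 1684, 38),
    ("Tsubame 2.0", 958, 1244),
    ("EcoG", 933, 36),
]

if __name__ == "__main__":
    for name, eff, kw in AWARDS:
        print(f"{name}: about {implied_tflops(eff, kw):.1f} TFLOPS")
```

<div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">The arithmetic shows how different the three machines are in scale: roughly 64 TFLOPS and 34 TFLOPS for the two small systems versus about 1.19 PFLOPS for Tsubame 2.0, even though their efficiencies sit in the same range.</span></div>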
</div><div style="font-size: 11pt; margin: 0in; text-align: center;"><span lang="en-US" style="font-family: Calibri;">---------------Hi everyone, I'm the divider line---------------</span></div><div style="font-family: Calibri; font-size: 11pt; margin: 0in;"><br />
</div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">I have left out Steve's closing small talk. Note that Tianhe-1A, which sits undeservedly at the top of the Top500, is nowhere to be found in the Green500 Top10. A big heap of scrap metal that cannot run real applications and guzzles power like a monster: truly a rare specimen.</span></div>Yuhaohttp://www.blogger.com/profile/08569555359590748704noreply@blogger.com0tag:blogger.com,1999:blog-1107874072437767466.post-14550746400325697352010-12-13T21:56:00.000-08:002010-12-16T07:31:52.008-08:00SC10: Dally Keynote, Heterogeneous Computing Systems<div style="font-size: 11pt; margin: 0in;"><span style="font-family: Calibri;">I am going to translate an article that </span><a href="http://www.blogger.com/www.cs.utexas.edu/%7Eskeckler/"><span style="font-family: Calibri;">Steve Keckler</span></a><span style="font-family: Calibri;"> wrote for NVIDIA. The original is </span><a href="http://cacm.acm.org/blogs/blog-cacm/101867-sc10-dally-keynote-heterogeneous-computing-systems/fulltext"><span style="font-family: Calibri;">here</span></a><span style="font-family: Calibri;">. I plan to translate and introduce a series of popular articles on computer architecture over this winter break. Besides spreading the word, I have a selfish motive: forcing myself to read some articles carefully, because skimming really is a crime. Only after I finished translating this article did I realize that it is an advertorial. But how would I have known without translating it?</span></div><div style="font-family: Calibri; font-size: 11pt; margin: 0in;"><br />
</div><div style="font-size: 11pt; margin: 0in; text-align: center;"><span lang="en-US" style="font-family: Calibri;">---------------Hi everyone, I'm the divider line---------------</span></div><div style="font-family: Calibri; font-size: 11pt; margin: 0in;"><br />
</div><div style="font-size: 11pt; margin: 0in;"><span style="font-family: Calibri;">Spending time at Supercomputing'10, the world's largest supercomputing conference, is a bit like drinking from a fire hydrant: besides the countless technical talks, it is also a great opportunity to meet colleagues. I have only about 45 minutes a day that are not scheduled! Today I would like to report on three events focused on heterogeneous high-performance computing.</span></div><div style="font-family: Calibri; font-size: 11pt; margin: 0in;"><br />
</div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">The highlight of this morning was the keynote by Bill Dally, Chief Scientist and Senior Vice President of NVIDIA. He is also my boss (and, in the distant past, was my Ph.D. advisor), so don't expect any public criticism of his talk from me! The title of Bill's talk was (hold on tight, fans from light-years away): "GPU Computing to Exascale and Beyond." He began with the energy efficiency of GPUs, showing the performance per unit of power of today's Top500 systems. For example, Tianhe-1A, a heterogeneous GPU machine and No. 1 on the Top500, is about 2.5 times as efficient in flops/watt as the No. 2 system, the homogeneous Jaguar machine. The heterogeneous Tsubame 2.0 built by the Tokyo Institute of Technology, that is, the No. 4 TiTech machine, is about 50% more efficient than Tianhe-1A. The fundamental reason is that the CPUs in homogeneous machines are designed to optimize single-thread performance. They use a whole range of modern microprocessor techniques, including branch prediction, various forms of speculation, register renaming, dynamic scheduling, and caches designed to minimize single-thread latency. GPUs, in contrast, are designed to optimize throughput. By leaving out many optimizations that are common on CPUs but extremely power-hungry, a GPU saves about a factor of 10 in energy per operation on average. But this efficiency comes at a price: the performance of a single thread on a GPU is very, very poor.</span></div><div style="font-family: Calibri; font-size: 11pt; margin: 0in;"><br />
</div><div style="font-size: 11pt; margin: 0in;"><span style="font-family: Calibri;">This difference in efficiency will be magnified in the design of exascale computers targeted at 2018. Consider just the following scenario: suppose process technology (smaller transistors) gives us a 4x improvement in energy efficiency, and architecture gives another 4x improvement on both CPUs and GPUs. The inherent relative inefficiency of the CPU would still leave it 6 times less efficient than the GPU. That factor of 6 is the difference between a 20-megawatt computer (yes, I did say "mega" watt) and a 120-megawatt computer. At 1 million dollars per megawatt per year, that is an enormous sum.</span></div><div style="font-family: Calibri; font-size: 11pt; margin: 0in;"><br />
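<div style="font-size: 11pt; margin: 0in;"><span style="font-family: Calibri;">The cost argument in this paragraph reduces to a few multiplications, sketched below. The sketch assumes that the 20-megawatt figure refers to the GPU-based machine and the 120-megawatt figure to the CPU-based one, which is how I read the text; the constant and function names are hypothetical.</span></div>

```python
GPU_MACHINE_MW = 20             # power of the more efficient design (from the text)
EFFICIENCY_GAP = 6              # CPU design is 6x less efficient (from the text)
DOLLARS_PER_MW_YEAR = 1_000_000  # cost quoted in the text


def annual_power_cost(megawatts: int) -> int:
    """Yearly electricity cost in dollars at the quoted price per megawatt."""
    return megawatts * DOLLARS_PER_MW_YEAR


# Same work at 6x lower efficiency needs 6x the power.
cpu_machine_mw = GPU_MACHINE_MW * EFFICIENCY_GAP

if __name__ == "__main__":
    gpu_cost = annual_power_cost(GPU_MACHINE_MW)
    cpu_cost = annual_power_cost(cpu_machine_mw)
    print(f"{GPU_MACHINE_MW} MW machine:  ${gpu_cost:,} per year")
    print(f"{cpu_machine_mw} MW machine: ${cpu_cost:,} per year")
    print(f"difference:      ${cpu_cost - gpu_cost:,} per year")
```

<div style="font-size: 11pt; margin: 0in;"><span style="font-family: Calibri;">Under these assumptions the gap works out to 100 million dollars per year in electricity alone. Note that the common 4x process and 4x architecture gains cancel out of the comparison, since both designs receive them.</span></div>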
</div><div style="font-size: 11pt; margin: 0in;"><span style="font-family: Calibri;">The challenge, then, is how to use the computing power of heterogeneous systems to solve real problems that matter to science and society, rather than just chasing LINPACK scores. Bill said that NVIDIA's investment in programming languages such as CUDA is enabling programmers to port legacy single-threaded code to the new platforms, and he gave several examples. One is in medicine, where GPUs are used to lower the radiation dose of a CT scan by reducing the number of X-ray passes, which will ultimately lower the incidence of cancer. Another is the use of GPUs in molecular dynamics simulation to model the chemical properties of surfactants and develop better products such as fragrances. Although these applications currently run on only a small number of GPUs, I expect that over the next few months we will see real scientific codes running on the large heterogeneous GPU machines in the Top500 with exciting results (the Tsubame folks have already reported some work along these lines).</span></div><div style="font-family: Calibri; font-size: 11pt; margin: 0in;"><br />
</div><div style="font-size: 11pt; margin: 0in;"><span style="font-family: Calibri;">Finally, Bill talked about Echelon, a new exascale computing project under way at NVIDIA. It is funded in part by DARPA's Ubiquitous High Performance Computing (UHPC) program. NVIDIA has teamed up with Cray, Oak Ridge National Labs, and six leading universities to develop high-performance, low-power, and reliable architectures and programming systems (translator's note: Mattan Erez, who is teaching our computer architecture course this semester, will lead the UT branch of the project). Bill presented his vision for future heterogeneous computing systems. He expects to address some shortcomings of today's GPU systems, such as separate memory spaces and the relatively low-bandwidth I/O bus connecting the CPU and the GPU. The Echelon design integrates a large number of throughput-optimized cores and a small number of latency-optimized cores on a single chip, all sharing one memory system. Such a chip delivers 20 TeraFLOPs, and a group of these chips can be assembled into a 2.6 PetaFLOP rack. Reaching exascale would then take only a few hundred such racks, which is about the size of today's high-end clusters. Once again, this is only a research project at this point, although I expect some people in my corner of the blogosphere will take Bill's talk for an NVIDIA product announcement. Ha!</span></div><div style="font-family: Calibri; font-size: 11pt; margin: 0in;"><br />
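<div style="font-size: 11pt; margin: 0in;"><span style="font-family: Calibri;">The Echelon figures quoted above can be sanity-checked with a few lines of Python. This is only a back-of-the-envelope sketch: it takes "exascale" to mean exactly 1 exaFLOP of peak performance (an assumption, not something stated in the talk), ignores all overheads, and uses made-up variable names.</span></div>

```python
import math

# Figures quoted in the post, expressed in TeraFLOPs so the math stays integral.
CHIP_TFLOPS = 20             # per-chip performance quoted for Echelon
RACK_TFLOPS = 2_600          # 2.6 PetaFLOP per rack
TARGET_TFLOPS = 1_000_000    # assumption: exascale taken as 1 exaFLOP of peak

# Number of chips implied by one rack.
chips_per_rack = RACK_TFLOPS // CHIP_TFLOPS

# Racks needed to reach the target, rounded up to whole racks.
racks_needed = math.ceil(TARGET_TFLOPS / RACK_TFLOPS)

print(f"Chips per rack: {chips_per_rack}")
print(f"Racks needed for 1 exaFLOP: {racks_needed}")
```

<div style="font-size: 11pt; margin: 0in;"><span style="font-family: Calibri;">With these assumptions a rack holds 130 chips and an exaFLOP takes 385 racks, which matches the "few hundred racks" estimate.</span></div>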
</div><div style="font-size: 11pt; margin: 0in;"><span style="font-family: Calibri;">My personal take is that Bill painted a very convincing picture of an achievable future exascale system, but I do not think everyone in the audience was persuaded. Some of the chatter I overheard expressed doubts about the programmability of heterogeneous systems. My own analysis (which I also presented at the "Round 2" panel mentioned below) is that we really have no other choice. Energy constraints force systems to pair energy-efficient, throughput-oriented cores with latency-optimized cores. For better or worse, everyone will have to work together to define a programming model that can exploit such systems, and to make sure the hardware includes the mechanisms needed to support that model.</span></div><div style="font-family: Calibri; font-size: 11pt; margin: 0in;"><br />
</div><div style="font-size: 11pt; margin: 0in;"><a href="http://sc10.supercomputing.org/schedule/event_detail.php?evid=pan129"><span lang="en-US" style="font-family: Calibri;">Round 2</span></a><span lang="en-US" style="font-family: Calibri;"> was a panel on heterogeneous computing systems organized by Jeff Vetter of Oak Ridge National Labs. Chuck Moore of AMD offered what I thought was a very insightful definition of what heterogeneous computing is and is not. It must not be a so-called "Frankensystem", that is, a system that casually bolts together assorted hardware and software. Nor is it a magic bullet for power or performance, and it will not effortlessly solve the power, performance, and programmability challenges facing high-performance computing. Rather, it is a system framework that facilitates communication among different subsystems. Beyond that, it must let novice programmers easily take advantage of specialized or programmable functionality. I sensed that the panel reached a consensus: the next step for heterogeneous platforms is to promote the "accelerator" to a first-class citizen of the computing system. AMD has clearly embraced the hardware side of this problem with its recently announced "Fusion" chips. Plenty of challenges nevertheless remain in software and programming. Kathy Yelick, who heads NERSC, underscored these challenges with her own observations. She believes that the road to exascale will require a programming model that moves on from MPI/MPI+OpenMP. She does not think the community will tolerate two programming models coexisting, so we had better settle on this model, and the sooner the better.</span></div><div style="font-family: Calibri; font-size: 11pt; margin: 0in;"><br />
</div><div style="font-size: 11pt; margin: 0in;"><a href="http://sc10.supercomputing.org/schedule/event_detail.php?evid=pan130"><span style="font-family: Calibri;">Round 3</span></a><span style="font-family: Calibri;"> was a panel organized by Wu Feng, an associate professor of computer science at Virginia Tech who also maintains the Green500 list. Its main topic was the "3 P's" of heterogeneous computing: Performance, Power, and Programmability. Mike Houston of AMD made a strong case that applications typically exploit braided parallelism, that is, data parallelism with a lot of conditionals, and most of the panel agreed. The liveliest panelist was Tim Mattson of Intel. He argued forcefully that programmability matters far more than performance or power: if you cannot program a machine, who cares about its peak performance or theoretical efficiency? He also made no secret of his position on open-source software standards: no single company, Intel included, should control a programming language. That remark later turned into an OpenCL versus CUDA debate. I would rather not name the protagonists one by one, but quite a few people spoke passionately about the dangers Mattson raised. Others held that the technology is still in its infancy, and that rushing to standardize before it matures would only stifle innovation. Despite the disagreements, the participants enthusiastically agreed that software is the biggest challenge (is anyone sensing a theme here?).</span></div><div style="font-family: Calibri; font-size: 11pt; margin: 0in;"><br />
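</div><div style="font-family: Calibri; font-size: 11pt; margin: 0in;">Both panels were hugely popular: the rooms were packed to bursting, with standing room only, and the audience participation was excellent. Jeff Vetter tried out an audience feedback system that let people answer survey questions by text message from their phones. Once everyone gets used to this kind of audience participation, I think it will be a nice addition to future panels.</div><div style="font-family: Calibri; font-size: 11pt; margin: 0in;"><br />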
</div><div style="font-family: Calibri; font-size: 11pt; margin: 0in;">That's all for today.</div><div style="font-family: Calibri; font-size: 11pt; margin: 0in;"><br />
</div><div style="font-size: 11pt; margin: 0in; text-align: center;"><span lang="en-US" style="font-family: Calibri;">---------------Hello everyone, I am the dividing line---------------</span></div><div style="font-family: Calibri; font-size: 11pt; margin: 0in;"><br />
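</div><div style="font-size: 11pt; margin: 0in;"><span lang="en-US" style="font-family: Calibri;">1. Looking back now that the translation is done, this is not an especially meaty article technically; it is, after all, one person's account of a conference, and the topics are all old news. Still, if you are interested in the subject, it is an enjoyable read. Because it goes into so little detail, this kind of high-level discussion may be useful for getting a sense of the overall direction. For the specifics, go read the papers.</span></div><div style="font-family: Calibri; font-size: 11pt; margin: 0in;">2. I will mostly avoid articles like this one when picking what to translate next. I had expected a popular-science piece, since it appeared in CACM, but it turned out to be a trip report. I have also dropped my plan to translate Steve's other two trip reports. I will probably still pick from CACM, though.</div><div style="font-family: Calibri; font-size: 11pt; margin: 0in;">3. I did some work on heterogeneous architectures in the last semester of my senior year, and I will very likely work in this area next semester too. Let's keep our eyes open and see how things evolve. </div>Yuhaohttp://www.blogger.com/profile/08569555359590748704noreply@blogger.com0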