<html><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><span class="Apple-style-span" style="font-family: monospace; font-size: small; "><div class="pre" style="font-family: monospace; font-size: 12px; white-space: pre; "><span class="Apple-style-span" style="font-family: Times; font-size: medium; white-space: normal; "><pre>> 16x16 SATD (I have no idea what you mean by "sub blocks") takes 170
> cycles on a Nehalem CPU. A SAD takes something around 42, but
> normally 4 are batched up(SAD_X4) and that takes 152 clocks total. On
> Phenom it's around 110 or so.
</pre></span></div><div class="pre" style="font-family: monospace; font-size: 12px; white-space: pre; ">For example, consider an FPGA core that takes a 16x16 macroblock and, in a single cycle, computes and returns the SAD for every sub-block in that 16x16 area, all the way down to 4x4 (some articles call this a systolic array). The same applies to SATD.</div><div class="pre" style="font-family: monospace; font-size: 12px; white-space: pre; "><br></div><div class="pre" style="font-family: monospace; font-size: 12px; white-space: pre; ">16x16 16x8 8x16 8x8 8x4 4x8 4x4</div><div class="pre" style="font-family: monospace; font-size: 12px; white-space: pre; "><br></div><div class="pre" style="font-family: monospace; font-size: 12px; white-space: pre; ">David.</div></span></body></html>