<div style="line-height:1.7;color:#000000;font-size:14px;font-family:arial"><DIV>>+INIT_XMM sse4<BR>>+cglobal intra_pred_ang4_3, 3,4,8<BR>>+ cmp r4m, byte 33<BR>>+ cmove r2, r3mp<BR>>+ lea r3, [ang_table + 20 * 32]</DIV>
<DIV>Why is the stride 32? Without AVX2 the registers are only 16 bytes wide.</DIV>
<DIV><BR>>+ movu m0, [r2 + 2] ; [8 7 6 5 4 3 2 1]<BR>>+ palignr m1, m0, 2 ; [x 8 7 6 5 4 3 2]<BR>>+ punpcklwd m2, m0, m1 ; [5 4 4 3 3 2 2 1]<BR>>+ palignr m5, m0, 4 ; [x x 8 7 6 5 4 3]<BR>>+ punpcklwd m3, m1, m5 ; [6 5 5 4 4 3 3 2]<BR>>+ palignr m1, m0, 6 ; [x x x 8 7 6 5 4]<BR>>+ punpcklwd m4, m5 ,m1 ; [7 6 6 5 5 4 4 3]<BR>>+ movhps m0, [r2 + 2] ; [x x x x 8 7 6 5]<BR>Use movhlps here instead to avoid the memory access: the upper half of m0 already holds samples 5..8.</DIV>
<DIV> </DIV>
<DIV>>+ punpcklwd m5, m1, m0 ; [8 7 7 6 6 5 5 4]<BR>>+<BR>>+ mova m0, [r3 + 6 * 32] ; [26]<BR>>+ mova m1, [r3] ; [20]<BR>>+ mova m6, [r3 - 6 * 32] ; [14]<BR>>+ mova m7, [r3 - 12 * 32] ; [ 8]<BR>>+<BR>>+ALIGN 32<BR>>+.do_filter4x4:<BR>>+ pmaddwd m2, m0<BR>>+ paddd m2, [pd_16]<BR>>+ psrld m2, 5<BR>>+<BR>>+ pmaddwd m3, m1<BR>>+ paddd m3, [pd_16]<BR>>+ psrld m3, 5<BR>>+ packusdw m2, m3<BR>>+<BR>>+ pmaddwd m4, m6<BR>>+ paddd m4, [pd_16]<BR>>+ psrld m4, 5<BR>>+<BR>>+ pmaddwd m5, m7<BR>>+ paddd m5, [pd_16]<BR>>+ psrld m5, 5<BR>>+ packusdw m4, m5<BR>>+<BR>>+ jz .store<BR>>+<BR>>+ ; transpose 4x4<BR>>+ punpckhwd m0, m2, m4<BR>>+ punpcklwd m2, m4<BR>>+ punpckhwd m4, m2, m0<BR>>+ punpcklwd m2, m0<BR>>+<BR>>+.store:<BR>>+ add r1, r1<BR>>+ movh [r0], m2<BR>>+ movhps [r0 + r1], m2<BR>>+ movh [r0 + r1 * 2], m4<BR>>+ lea r1, [r1 * 3]<BR>>+ movhps [r0 + r1], m4<BR>>+ RET<BR></DIV></div>
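<DIV>For checking the lane comments and the table offsets, here is a rough scalar model in C of what the quoted kernel appears to compute before the optional transpose. It is only a sketch under assumptions: the name ang4_filter_ref is hypothetical, the per-row step of 26 is inferred from the [26] [20] [14] [8] table comments, and each ang_table entry is assumed to hold the weight pair (32 - frac, frac). ref[0] stands for the corner sample, so ref[1..8] are the samples labelled 1..8 in the asm comments.</DIV>

```c
#include <assert.h>
#include <stdint.h>

/* Scalar sketch of the 4x4 filter, rows as produced before the transpose.
 * Hypothetical helper, not part of the patch. ref must hold 9 samples. */
static void ang4_filter_ref(uint16_t dst[4][4], const uint16_t *ref)
{
    assert(dst != 0 && ref != 0);
    for (int y = 0; y < 4; y++) {
        int pos  = (y + 1) * 26;  /* assumed angle step per row */
        int idx  = pos >> 5;      /* whole-sample offset: 0, 1, 2, 3 */
        int frac = pos & 31;      /* fractional weight: 26, 20, 14, 8 */
        for (int x = 0; x < 4; x++) {
            int a = ref[x + idx + 1];
            int b = ref[x + idx + 2];
            /* mirrors pmaddwd with (32 - frac, frac), paddd pd_16, psrld 5 */
            dst[y][x] = (uint16_t)(((32 - frac) * a + frac * b + 16) >> 5);
        }
    }
}
```

<DIV>Row y of this model pairs samples (x + y + 1, x + y + 2), which matches the punpcklwd comments, e.g. [5 4 4 3 3 2 2 1] for the first row. Comparing its output against the asm on random input would also show whether the movhps line produces the lanes its comment claims.</DIV>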