[x265] [PATCH x265] Add AVX2 assembly code for normFactor primitive.

Akil akil at multicorewareinc.com
Thu Mar 7 12:10:01 CET 2019


Hi Chen,
Thanks for the feedback. I will make the changes where possible.

On Thu, Mar 7, 2019 at 4:37 PM Niranjankumar Balasubramanian <
niranjan at multicorewareinc.com> wrote:

> Hi Chen,
> Thanks for your suggestions. Your feedback is noted.
>
> On Thu, Mar 7, 2019 at 3:41 PM chen <chenm003 at 163.com> wrote:
>
>> It works, but it can be improved.
>>
>> First of all,
>> the expected algorithm is the square of (x >> shift).
>> It is an 8-bit by 8-bit multiply (I assume we are talking about 8bpp;
>> 16bpp is similar), so the result is 16 bits.
>> The function works at the CU level, so blockSize is at most 64, i.e.
>> 6 bits.
>> So the maximum dynamic range is 16 + 6 + 6 = 28 bits.
>>
>> Therefore, the uint64_t output is unnecessary in 8bpp mode.
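To make the dynamic-range argument concrete, here is a minimal C sketch assuming 8bpp (pixels stored as uint8_t) and blockSize <= 64. normFact8_u32 is a hypothetical name, not an existing x265 function: it keeps the loop of normFact_c but accumulates into a uint32_t. In the worst case (every pixel 255, shift 0, 64x64 block) the sum is 64 * 64 * 255 * 255 = 266342400, which is still below 2^28 = 268435456.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 8bpp variant of normFact_c with a 32-bit accumulator.
 * With blockSize <= 64 there are at most 64 * 64 = 2^12 terms, and each
 * (8-bit value)^2 is below 2^16, so the sum stays below 2^28. */
static uint32_t normFact8_u32(const uint8_t *src, uint32_t blockSize, int shift)
{
    assert(blockSize <= 64); /* CU-level block, per the range argument */
    uint32_t z_k = 0;
    for (uint32_t y = 0; y < blockSize; y++)
    {
        for (uint32_t x = 0; x < blockSize; x++)
        {
            uint32_t temp = (uint32_t)(src[y * blockSize + x] >> shift);
            z_k += temp * temp;
        }
    }
    return z_k;
}
```

For example, calling normFact8_u32(block, 64, 0) on a 64x64 block filled with 255 returns 266342400, so a 32-bit result register would be enough for 8bpp.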
>>
>> Moreover, PMOVZXBD+VPMULDQ can be replaced by PMOVZXBW+PMADDWD (please
>> remember that PMADDUBSW treats only one of its inputs as unsigned);
>> this may improve processing throughput by 3 to 4 times.
>> I don't know why VPMULLD was not used; it would almost double performance.
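A scalar C model of the PMOVZXBW+PMADDWD idea, assuming 8bpp input and a pixel count that is a multiple of 8. The helper names are hypothetical and only emulate the instruction semantics on one 128-bit register at a time; this is not SIMD code and not the patch's implementation. The point it illustrates: each 32-bit PMADDWD result already holds the sum of two squares, so no 64-bit products are needed.

```c
#include <assert.h>
#include <stdint.h>

/* PMOVZXBW (128-bit form): zero-extend 8 unsigned bytes to 8 16-bit words. */
static void pmovzxbw(const uint8_t *src, int16_t w[8])
{
    for (int i = 0; i < 8; i++)
        w[i] = (int16_t)src[i];
}

/* PMADDWD: multiply signed 16-bit pairs and add adjacent products
 * into 4 signed 32-bit results. */
static void pmaddwd(const int16_t a[8], const int16_t b[8], int32_t d[4])
{
    for (int i = 0; i < 4; i++)
        d[i] = (int32_t)a[2 * i] * b[2 * i] + (int32_t)a[2 * i + 1] * b[2 * i + 1];
}

/* Hypothetical helper: sum of (src[i] >> shift)^2 over n pixels, using
 * 4 x 32-bit lane accumulators as a SIMD version would. */
static uint32_t sum_sq_pmaddwd_model(const uint8_t *src, int n, int shift)
{
    int32_t acc[4] = {0, 0, 0, 0};
    assert(n % 8 == 0);
    for (int off = 0; off < n; off += 8)
    {
        int16_t w[8];
        int32_t d[4];
        pmovzxbw(src + off, w);
        for (int i = 0; i < 8; i++)
            w[i] = (int16_t)(w[i] >> shift); /* like PSRLW */
        pmaddwd(w, w, d);                    /* x0*x0 + x1*x1, ... */
        for (int i = 0; i < 4; i++)
            acc[i] += d[i];                  /* like PADDD */
    }
    /* horizontal reduction of the four lanes */
    return (uint32_t)acc[0] + (uint32_t)acc[1] + (uint32_t)acc[2] + (uint32_t)acc[3];
}
```

Even for a full 64x64 block each lane collects at most 1024 squares of at most 65025, which fits comfortably in a signed 32-bit lane.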
>>
>> Further, the VPSRLDQ is unnecessary; it is only needed because we chose VPMULDQ:
>>
>> +    vpmuldq        m2,          m1,        m1
>> +    vpsrldq        m1,          m1,        4
>> +    vpmuldq        m1,          m1,        m1
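A scalar C model of the NORM_FACT_COL dataflow on one ymm register of 8 dwords (values assumed already shifted by SSIMRD_SHIFT, 8bpp, so small and non-negative). VPMULDQ multiplies only the even dwords, which is why the patch needs VPSRLDQ to move the odd dwords into even slots before the second VPMULDQ; a single VPMULLD squares all 8 dwords. The function names are hypothetical emulations, not intrinsics.

```c
#include <assert.h>
#include <stdint.h>

/* VPMULDQ: signed multiply of the EVEN dwords only (0, 2, 4, 6),
 * giving 4 x 64-bit products. */
static void vpmuldq(const uint32_t a[8], const uint32_t b[8], int64_t q[4])
{
    for (int i = 0; i < 4; i++)
        q[i] = (int64_t)(int32_t)a[2 * i] * (int32_t)b[2 * i];
}

/* VPSRLDQ by 4 bytes: shift right by one dword inside each 128-bit lane,
 * shifting in zeros. */
static void vpsrldq4(const uint32_t in[8], uint32_t out[8])
{
    for (int lane = 0; lane < 2; lane++)
        for (int j = 0; j < 4; j++)
            out[4 * lane + j] = (j < 3) ? in[4 * lane + j + 1] : 0;
}

/* VPMULLD: low 32 bits of every dword product (all 8 lanes). */
static void vpmulld(const uint32_t a[8], const uint32_t b[8], uint32_t d[8])
{
    for (int i = 0; i < 8; i++)
        d[i] = a[i] * b[i];
}

/* Patch's approach: two VPMULDQ plus one VPSRLDQ to reach the odd dwords. */
static uint64_t sum_sq_vpmuldq(const uint32_t x[8])
{
    int64_t even[4], odd[4];
    uint32_t shifted[8];
    vpmuldq(x, x, even);            /* squares of x[0], x[2], x[4], x[6] */
    vpsrldq4(x, shifted);           /* odd dwords move into even slots */
    vpmuldq(shifted, shifted, odd); /* squares of x[1], x[3], x[5], x[7] */
    uint64_t s = 0;
    for (int i = 0; i < 4; i++)
        s += (uint64_t)(even[i] + odd[i]);
    return s;
}

/* Suggested alternative: one VPMULLD squares all 8 dwords; 32-bit
 * results are enough for 8bpp. */
static uint32_t sum_sq_vpmulld(const uint32_t x[8])
{
    uint32_t d[8], s = 0;
    vpmulld(x, x, d);
    for (int i = 0; i < 8; i++)
        s += d[i];
    return s;
}
```

For example, with x = {0, 1, 2, 3, 255, 128, 64, 7} both sum_sq_vpmuldq and sum_sq_vpmulld return 85568, the same as a plain scalar loop; the VPMULLD path just gets there with one multiply and no shuffle.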
>>
>>
>> Regards,
>> Min
>>
>> At 2019-03-07 17:36:19, "Dinesh Kumar Reddy" <dinesh at multicorewareinc.com>
>> wrote:
>>
>>> +static void normFact_c(const pixel* src, uint32_t blockSize, int shift, uint64_t *z_k)
>>> +{
>>> +    *z_k = 0;
>>> +    for (uint32_t block_yy = 0; block_yy < blockSize; block_yy += 1)
>>> +    {
>>> +        for (uint32_t block_xx = 0; block_xx < blockSize; block_xx += 1)
>>> +        {
>>> +            uint32_t temp = src[block_yy * blockSize + block_xx] >> shift;
>>> +            *z_k += temp * temp;
>>> +        }
>>> +    }
>>> +}
>>> +
>>> diff -r d12a4caf7963 -r 19f27e0c8a6f source/common/x86/pixel-a.asm
>>> --- a/source/common/x86/pixel-a.asm Wed Feb 27 12:35:02 2019 +0530
>>> +++ b/source/common/x86/pixel-a.asm Mon Mar 04 15:36:38 2019 +0530
>>> @@ -388,6 +388,16 @@
>>>      vpaddq         m7,         m6
>>>  %endmacro
>>>
>>> +%macro NORM_FACT_COL 1
>>> +    vpsrld         m1,          m0,        SSIMRD_SHIFT
>>> +    vpmuldq        m2,          m1,        m1
>>> +    vpsrldq        m1,          m1,        4
>>> +    vpmuldq        m1,          m1,        m1
>>> +
>>> +    vpaddq         m1,          m2
>>> +    vpaddq         m3,          m1
>>> +%endmacro
>>> +
>>>  ; FIXME avoid the spilling of regs to hold 3*stride.
>>>  ; for small blocks on x86_32, modify pixel pointer instead.
>>>
>>> @@ -16303,3 +16313,266 @@
>>>      movq           [r4],         xm4
>>>      movq           [r6],         xm7
>>>      RET
>>> +
>>> +
>>> +;static void normFact_c(const pixel* src, uint32_t blockSize, int shift, uint64_t *z_k)
>>> +;{
>>> +;    *z_k = 0;
>>> +;    for (uint32_t block_yy = 0; block_yy < blockSize; block_yy += 1)
>>> +;    {
>>> +;        for (uint32_t block_xx = 0; block_xx < blockSize; block_xx += 1)
>>> +;        {
>>> +;            uint32_t temp = src[block_yy * blockSize + block_xx] >> shift;
>>> +;            *z_k += temp * temp;
>>> +;        }
>>> +;    }
>>> +;}
>>> +;--------------------------------------------------------------------------------------
>>> +; void normFact_c(const pixel* src, uint32_t blockSize, int shift, uint64_t *z_k)
>>> +;--------------------------------------------------------------------------------------
>>> +INIT_YMM avx2
>>> +cglobal normFact8, 4, 5, 6
>>> +    mov            r4d,       8
>>> +    vpxor          m3,        m3                               ;z_k
>>> +    vpxor          m5,        m5
>>> +.row:
>>> +%if HIGH_BIT_DEPTH
>>> +    vpmovzxwd      m0,        [r0]                             ;src
>>> +%elif BIT_DEPTH == 8
>>> +    vpmovzxbd      m0,        [r0]
>>> +%else
>>> +    %error Unsupported BIT_DEPTH!
>>> +%endif
>>>
>>> _______________________________________________
>> x265-devel mailing list
>> x265-devel at videolan.org
>> https://mailman.videolan.org/listinfo/x265-devel
>>
>
>
> --
> *Regards,*
> *Akil*
>


-- 
*Regards,*
*Akil R*