[Pixman] [ssse3]Optimization for fetch_scanline_x8r8g8b8

Fri Aug 27 03:00:04 PDT 2010

On Friday 27 August 2010 05:59:00 Xu, Samuel wrote:
> Hi Siarhei Siamashka,
> 
> Here is a new patch, can you review it? Thank you!
> It address following suggestions:
> 1: SSSE3 file is split to a new file.

Thanks.

> Comparing with to duplicate every
> content from SSE2 file, I added a way to merge new fast path function
> pointer table runtimely by CPUID check, to avoid huge code expansion and a
> bug you pointed out.(Without runtime merging, I have to duplicate every
> line from SSE2 file to SSSE3 file, it is really not graceful and
> redundant.

Just implementing everything in the same way as done in SSE2 code, you would 
get a proper delegate chain trying SSSE3 fast paths first, then trying SSE2, 
MMX, C fast paths (compositing in one pass) and finally fallback to a slow 
general path code (fetch, combine and store steps done in separate passes).

The runtime merging is not strictly needed, even though it could probably 
improve performance somewhat. But this clearly belongs to a separate patch.

> 2: Copy right information added
> 3: Makefile fix
> 4: author name fix
> 
> For detect_cpu_features(), we possibly can remove the TWO lines asm code to
> copy edx to "result", to avoid SSE2 and MMX checking, since most preparing
> code still need. If we keep this 2 lines, it still give us some
> possibility to check other edx information of CPUID in the future. It is
> only executed once when pixman is loaded.

Wouldn't it be just as simple as doing the following in the block of code under 
__GNUC__ and __amd64__ or __x86_64__ ifdefs?

int have_ssse3_support ()
{
    uint32_t cpuid_ecx;
    __asm__ (
        "mov $1, %%eax\n"
        "cpuid\n"
        "mov %%ecx, %0\n"
        : "=r" (cpuid_ecx)
        :
        : "%rax", "%rbx", "%rcx", "%rdx"
    );
    return cpuid_ecx & (1 << 9);
}

And we already know that we have MMX and SSE2 on all x86_64 hardware even 
without checking.

MSVC and any compilers other than GNU C are probably not so interesting 
initially. They just should not fail when building pixman.

> For ASM code of 'bk_write' and 'table_48_bytes_bwd', here is the reason of
> "why we use backward copy mode in the patch ", from Ling: All read
> operations after allocation stage can run speculatively, all write
> operation will run in program order, and if addresses are different read
> may run before older write operation, otherwise wait until write commit.
> However CPU don't check each address bit, so read could fail to recognize
> different address even theyare in different page. For example if rsi is
> 0xf004, rdi is 0xe008,in following operation there will generate big
> performance latency. 1. movq (%rsi),	%rax
> 2. movq %rax,		(%rdi)
> 3. movq 8(%rsi),	%rax
> 4. movq %rax,		8(%rdi)
> 
> If %rsi and rdi were in really the same meory page, there are TRUE
> read-after-write dependence because instruction 2 write 0x008 and
> instruction 3 read 0x00c, the two address are overlap partially. Actually
> there are in different page and no any issues, but without checking each
> address bit CPU could think they are in the same page, and instruction 3
> have to wait for instruction 2 to write data into cache from write buffer,
> then load data from cache, the cost time read spent is equal to mfence
> instruction. We may avoid it by tuning operation sequence as follow.
> 1. movq 8(%rsi), %rax
> 2. movq %rax,	8(%rdi)
> 3. movq (%rsi),	%rax
> 4. movq %rax,	(%rdi)
> Instruction 3 read 0x004, instruction 2 write address 0x010, no any
> dependence. In this patch we first handle small size(less 48bytes) by this
> way, and the approach help us improve up to 2X on Atom.

Thanks for the explanation. Some comments describing why it was done this way 
in the code for such non-obvious cases would be generally nice. 

So it is basically a store forwarding aliasing problem which is described in 
"12.3.3.1 Store Forwarding" section from "Intel® 64 and IA-32 Architectures 
Optimization Reference Manual", right?

Wouldn't just the use of MOVD/MOVSS instructions here also solve this problem? 
Store forwarding does not seem to be used for SIMD according to the manual. I 
haven't benchmarked anything yet though.

Moreover, if you have plans to generalize and extend SSSE3 optimizations to 
accelerate more stuff later, then something like 'src_8888_8888' and 
'add_8888_8888' operations would be the really easy ones. But it would be very 
nice to avoid dumb copy/paste and 3x increase of assembly sources size. For 
this to go smoothly, preferably you would need to separate the loop code itself 
from the pixel processing logic. And in order to make pixel processing more 
simple and consistent, it's better to do all the pixel processing using SIMD
instructions (both inner loop and also leading/trailing pixels). Leading and 
trailing unaligned pixels may be probably handled by using MOVD/MOVSS/MOVHPS or 
other similar instructions to do partial SSE register load/store operations. 
It's not like this needs to be done right now, but it is better to plan ahead.

Something similar is implemented for ARM NEON optimizations.

-- 
Best regards,
Siarhei Siamashka
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: This is a digitally signed message part.
URL: <http://lists.freedesktop.org/archives/pixman/attachments/20100827/c2b74a67/attachment.pgp>