Adding 16bpp Pixels

I wrote this article as a response to Rawhed's article in Hugi #16 on the topic. I want to present a better way of adding 16bpp pixels. First I will show two routines, one that does simple adding with saturation and the other which creates an average of pixels. Then I will draw a description of these routines.

Routines

The two routines start in the same way, but end up differently. By right of each instruction I marked a clock cycle in which it is issued. '-' means that an instruction is issued, due to pairing, in the same cycle as the previous one. The cycles correspond to Pentium processor.

The routines process two pairs of pixels in parallel, as described later. Registers esi and edi both contain pointers to pixels. A resulting pair of pixels is put at [edi]. Registers eax, ecx, edx and ebx get destroyed.

; Load pixels and decompose them mov eax, [esi] ; 1 mov ecx, [edi] ; - mov edx, eax ; 2, cache hit assumed mov ebx, ecx ; - and eax, 07E0F81Fh ; 3 and ecx, 07E0F81Fh ; - and edx, 0F81F07E0h ; 4 and ebx, 0F81F07E0h ; - ; Sum up the pixels ; Calculate average add eax, ecx ; 5 add eax, ecx ; 5 add edx, ebx ; - rcr eax, 1 ; 6 add edx, ebx ; - ; Check carries and saturate edx rcr edx, 1 ; 7 sbb ecx, ecx ; 6 xor ebx, ebx ; - ; Compose and store pixels and ecx, 0F8000000h ; 7 or eax, edx ; 8 test edx, 00200000h ; - mov [edi], eax ; 9 setz bl ; 8 or edx, ecx ; 9 xor ecx, ecx ; - dec ebx ; 10 test edx, 00000800h ; - setz cl ; 11 and ebx, 001F0000h ; 12 dec ecx ; - or edx, ebx ; 13 and ecx, 000007E0h ; - or edx, ebx ; 14 ; Check carries and saturate eax xor ecx, ecx ; - and edx, 0F81F07E0h ; 15 test eax, 08000000h ; - setz cl ; 16 xor ebx, ebx ; 17 dec ecx ; - and ecx, 07E00000h ; 18 test eax, 00010000h ; - setz bl ; 19 or eax, ecx ; - xor ecx, ecx ; 20 dec ebx ; - and ebx, 0000F800h ; 21 test eax, 00000020h ; - setz cl ; 22 or eax, ebx ; 23 dec ecx ; - and ecx, 0000001Fh ; 24 and eax, 07E0F81Fh ; - or eax, ecx ; 25 ; Compose and store pixels or eax, edx ; 26 mov [edi], eax ; 27

In the end, the procedure that adds pixels with saturation (left) takes 27 clocks, and as it processes pixels in parallel, it takes 13.5 cycles per pixel. The other procedure that calculates average of two pixels (right) takes 9 clocks, with 4.5 clock per pixel.

How does it work

Generally, pixels are formed in the following way: 5 bits for red color component, 6 for green and 5 for blue. An example layout:

15 0 0010110011000110 Àred-Àgren-Àblu-

As a pixel takes exactly 16 bits, two pixels fit in a 32-bit register. So if we have two pixels from image A in register eax, and two pixels from image B (of corresponding position) in register ecx, we can simply do 'add eax, ecx' to get the two sums of two pixel pairs. The only problem are carries, which are added to adjacent components. So we just smartly split the components of the pixels, do the addition, then either saturate or shift it right, and compose the pixels back. See the example:

31 0 eax from image A: 00101100110001100110001111011111 ecx from image B: 10010011011010000001000001011111 Ãred-Àgren-Àblu´Ãred-Àgren-Àblu´ À----pixel-2----À----pixel-1----

Code that splits the pixels:

; copy the pixels mov edx, eax mov ebx, ecx ; mask off even components and eax, 00000111111000001111100000011111b ; 07E0F81Fh and ecx, 07E0F81Fh ; mask off odd components and edx, 11111000000111110000011111100000b ; 0F81F07E0h and ebx, 0F81F07E0h

And contents of registers resulting from the splitting:

31 0 eax: -----100110-----01100------11111 edx: 00101------00110-----011110----- ecx: -----011011-----00010------11111 ebx: 10010------01000-----000010----- Ãred-Àgren-Àblu´Ãred-Àgren-Àblu´ À----pixel-2----À----pixel-1----

'-' indicates 0 put by the splitting routine with 'and'.

The rest is simple. Maybe the carry checking in the saturation code is the only thing that requires explanation. Watch this:

xor ebx, ebx ; ebx = 00000000h test eax, 00200000h ; zf=1 if bit21 is clear, otherwise zf=0 setz bl ; ebx=1 if bit21=0, otherwise ebx=0 dec ebx ; ebx=0 if bit21=0, otherwise ebx=0FFFFFFFFh and ebx, 001F0000h ; mask off pix2.red (bit21 was the carry) or eax, ebx ; set pix2.red to max value if overflowed

This way of carry checking has the advantage of taking 5 clocks with no branch. Intermixed with other code theoretically takes 3.5 clocks. A branch version takes 2 (no carry, predicted) to 10 (carry, mispredicted) with an average of 6 clocks on Pentium and 1 to 18(!) with an average of 9 clocks on Pentium II, thus this no-branch version is better.

I do not have any idea whether the given routines are the best. I only know that on Pentium the adding with saturation takes 13.5 clocks per pixel and the pixel average takes 4.5 clocks per pixel.