Rules of SIMD Optimisation
Chris Dragan

I had written this article for myself as a quick reference, but finally I have decided to release it. It deals with a somewhat forgotten coding technique called: "code optimisation" *grin*, and mainly concerns MMX and SIMD-FP (3DNow! and SSE) instructions, so the basics you should have known before starting reading this. As a nice startup I recommend Rawhed's tutorial in Hugi19.

All stuff presented herein you can find in Intel and AMD manuals, described in detail. Nevertheless this article, or portions of it, after being saved in HTML (F2 key) can serve as a good quick reference.

Many topics aren't presented here. The best source of all optimisation hints is the Agner Fog's paper, and of course the manuals from the CPU vendors.

If you are not familiar with 3DNow! and SSE, take a look at the appendices. This should give you some basic clue how they work.

Pentium MMX

Using the pure MMX instruction set ensures that the code will run on every MMX-equipped machine. Optimising for Pentium MMX gives good results, and is better than no optimisation at all, even on the latest out-of-order executing monsters.

Pentium MMX has an in-order architecture (instructions are executed sequentially) and contains two pipes, called U and V. Decoding two subsequent instructions, the processor feeds the first of them to the U pipe, and if certain requirements are met, the other one to the V pipe. This means that two instructions are executed at the same time in the two pipes.

The basic coding issues for MMX are the same as for integer instructions. This concerns preventing using the same register or memory location as was written in the previous instruction that went to U pipe. This also concerns such topics as correct data alignment for avoiding stalls (pauses in execution) and so on. The list is quite long, but somewhat clear; again, I can recommend Agner Fog's manual.

Two ALUs, one shifter and one multiplier - these are the MMX execution resources of Pentium MMX. Instructions fed to ALUs and shifter execute in one cycle, while the multiplication instructions execute in three cycles, but are fully pipelined, which means that a multiplication can be started and finished up every cycle. The MMX instructions can go to either pipe, U or V, provided that other conditions are met. There can be only one shifter and one multiplier instruction executed in a cycle (in either pipe), as these resources are single (unlike ALUs - there are two of these).

The instructions and their units:
2 ALUs (movq padd* psub* pand pandn pcmp* por pxor) 1 cycle
1 Shifter (pack* punpck* psll* psrl* psra*) 1 cycle
1 Multiplier (pmullw pmulhw pmaddwd) 3 cycles but fully pipelined

The emms instruction is not pairable. It causes validation of FPU stack registers (which are covered by MMX registers); in other words it "enables" the FPU, which is disabled while MMX code is executing.

Normally all MMX instructions can go down either the U or V pipe. They are also pairable with integer instructions. But there is an exception. All MMX instructions that access memory or integer registers (in case of movd) can be executed only in the U pipe, and they do not pair with integer instructions.

movq and movd instructions which move data from MMX registers to memory or integer registers need one clock more after the data is ready - this comes from the construction of the pipeline.

The following examples do not do anything special, but show how the above rules apply.

	; These two do pair (i.e. are executed in the same cycle)
		paddd	mm0, mm1 ; goes down the U pipe
		paddd	mm2, mm1 ; goes down the V pipe

	; Do not pair - resource constraint
		paddd	mm0, mm1 ; U
		paddd	mm2, mm0 ; U (waits until mm0 result ready)

	; Pair
		pmullw	mm3, mm7 ; result ready after 3 cycles
		psllq	mm4, 3	 ; result ready in next cycle

	; Do not pair - only one shifter
		packssdw mm0, mm2
		psraw	mm1, 6

	; Multiplication chain
		pmaddwd mm0, mm7 ; in first cycle (takes 3 cycles)
		pmaddwd mm1, mm7 ; in second cycle (one multiplier)
		pmaddwd mm2, mm7 ; in third cycle
		paddd	mm0, mm6 ; in fourth cycle, mm0 was now ready

	; Stall due to pmulhw latency
		pmulhw	mm1, mm2 ; 1 (the first cycle, U pipe)
		paddw	mm3, mm0 ; - (the same cycle, V pipe)
		psubw	mm7, mm0 ; 2
		pxor	mm6, mm6 ; -
		punpcklwd mm6, mm1 ; 4 (pmulhw executes for 3 cycles)

	; A piece of code (byte=>word routine, badly optimised)
		movq	mm0, [eax] ; 1
		pxor	mm1, mm1   ; -
		punpckhbw mm1, mm0 ; 2
		pxor	mm2, mm2   ; -
		punpcklbw mm2, mm0 ; 3
		add	eax, 8	   ; -
		movq	[edx], mm2 ; 5 (mm2 modified in previous pair)
		movq	[edx+8], mm1 ; 6 (cannot go to V pipe)
		add	edx, 16    ; 7 (cannot pair with the above)
		dec	ecx	   ; -
		jnz	_again	   ; 8

Pentium II

Optimising pure MMX code especially for Pentium II is a bad idea, unless it is really a key code that takes up 90% of processor time. Pentium II executes code out of the order, which means that a latter instruction can be executed earlier. The decoding of instructions, however, is performed in-order. In fact instruction decoding is the most vulnerable bottleneck.

It is enough to realize that Pentium II decodes MMX instructions that access memory (other than movq and movd) into two internal sub-instructions (called micro-ops). movq and movd that store to memory or integer register also generate two u-ops. Other instructions generate one u-op. There are three decoders on Pentium II, decoding in parallel, first of which can issue up to four u-ops, and the two others can issue up to... one u-op. This is the so-called 4-1-1 scheme, which indicates what types of instructions can be scheduled for parallel decoding. An additional requirement is that the three subsequent instructions occupy no more than 16 bytes. The following piece of code tries to shed some light on this topic. The clocking corresponds to decoding, NOT the execution of instructions, which is unpredictable from the out-of-orderness reason.

		movq	mm0, [edx]    ; 1 (1uop)
		paddd	mm0, mm1      ; - (1uop)
		psubw	mm1, mm2      ; - (1uop)
		paddsw	mm1, [edx+8]  ; 2 (2uops)
		pxor	mm2, mm3      ; - (1uop)
		psubusw mm1, mm3      ; - (1uop)
		psubw	mm1, mm4      ; 3 (1uops)
		movq	mm4, [eax]    ; - (1uop)
		pandn	mm1, [edx+8]  ; 4 (2uops) <= next triplet
		pxor	mm2, mm1      ; - (1uop)
		movq	[eax], mm0    ; 5 (2uops)

In most cases optimising for Pentium MMX gives ideal results for Pentium II. It is also important to keep so-called dependency chains short. Dependency chain is when the data travels through registers with subsequent instructions.

Summing up:
- First of all the code should be optimised for Pentium MMX.
- If additional optimisation is needed for Pentium II, the manner of decoding instructions should be taken into account to provide maximum decode buffers saturation and thus to take advantage of out-of-order execution. Further optimisation is done by the processor itself, but this does not free the programmer from using as fast and short algorithms as possible.
- For Pentium II, dependency chains should be kept short. If a loop is one long dependency chain, it should be cloned and intermixed with the copy, like there were two loops executed in parallel. This approach is the best, though it increases registers pressure.

It is worth to mention that generally loop unrolling is not recommended on Pentium II. Again, the out-of-order architecture along with branch prediction makes the jump instruction latency hidden, thus when the loops are short and fit well in the cache, they are executed fast. Of course one of the principal optimisation rules tells that all unnecessary branches should be replaced with adequate sequences - the branch miss prediction penalty on the latest Intel processors can reach from 10 to even 20 cycles.

K6-2

The biggest advantage of this processor and its successors is the 3DNow! instruction set, which allows issuing floating point operations in parallel. MMX registers keep two 32-bit floating point numbers then.

The K6 architecture decodes instructions into RISC-like micro-ops storing them in some kind of reservation station, but further pipeline stages follow the two-piped Pentium scheme, rather than the out-of-order solution found in P6. The K6 pipes are called X and Y.

The instructions on K6(-2) are usually executed in order, but not always. Some kind of out-of-order execution occurs when one of the pipes is blocked (stalled), for example because of a resource constraint. The other pipe is then still available and accepts and executes new instructions, which can be finished before the stalled instruction.

The MMX execution resources of K6-2 consist of two MMX ALUs, one MMX shifter, one 3DNow! ALU, one generic multiplier (both for MMX and 3DNow!), one load and one store unit. This means that e.g. no two 3DNow! ALU instructions can be executed in one cycle, just like it was in case of Pentium MMX's shared shifter and multiplier units (but one 3DNow! ALU operation and one 3DNow! multiplication can be executed in parallel).

Generally all 3DNow! instructions and MMX multiplication instructions are executed in two cycles and are pipelined (can be issued every cycle). All other MMX instructions execute in one cycle.

Instructions that load data from memory execute for additional two cycles. On Pentium MMX the load latency was hidden by the structure of the pipeline (it was one cycle in fact). On K6(-2) the load latency must be taken into account, because of the decoding instructions into u-ops, and because there is a separate load unit. But it is important to note that the load and store units are not connected to the X and Y pipes, thus they do not cause any stalls. Still a pand instruction that loads source argument from memory executes in 3 cycles. Thus the so-called RISC-like coding, i.e. loading data using movq prior to using it, is often recommended.

	; Deprecated
		pmullw	mm3, [eax]

	; Recommended
		movq	mm7, [eax]
		; perhaps some instructions inbetween
		pmullw	mm3, mm7

A nice feature of the K6 architecture is that pack/unpack instructions are executed in ALU units instead of the shifter. Therefore two of them can go in parallel, or even can be issued along with a shift.

K6-2 execution resource summary:
2 MMX ALUs (padd* psub* pand pandn pcmp* por pxor pack* punpck* pavgusb) 1 cycle
1 Shifter (psll* psrl* psra*) 1 cycle
1 3DNow! ALU (pfacc pfadd pfsub pfsubr pf2id pi2fd pfcmp* pfmax pfmin pfrcp pfrsqrt) 2 cycles, fully pipelined
1 Multiplier (pmullw pmulhw pmulhrw pmaddwd pfmul pfrcpit* pfrsqit1) 2 cycles, fully pipelined
1 Load unit (all with source memory or intreg operands) 2 cycles, fully pipelined
1 Store unit (movq, movd with destination memory or intreg operands) also 2 cycles and fully pipelined

As the 3DNow! instructions have 2-cycle latency, it is sometimes good to do a little trick and instead of using float compare, do integer compare on dwords (pcmp*d). This works if both numbers are known to be greater than 0 or one is greater and the other less than 0 (beware of +0 / -0). In case of both numbers less than 0, the result is opposite.

Operations like abs or neg can be done with pand and pxor instructions, simply modifying the most significant bits of the floats, which are the signs.

As an addition, there are two very useful instructions included with 3DNow!: femms and prefetch. femms allows very fast switching between MMX and FPU. It is recommended that this instruction is used not only on the end of MMX/3DNow! code, but also at the beginning, where it adds a latency of three cycles, but the transition between FPU and MMX is reduced by a half.

The prefetch instruction is very useful when a streaming code is prepared, like digital signal processing, sample mixing, image conversion, etc., where data is processed word by word. As the operand a memory location is given. This address refers to some forward distance relative to the current position of the pointer in the loop. The referenced data is loaded from memory to cache, and thus when its time comes, it is accessed very quickly with minimal latency. The prefetch distance is usually calculated from the formula: 200*DS/C, where DS is the number of bytes processed in a loop (data stride) and C is the approximate number of cycles taken by the loop to execute. (NOTE: This formula was in the Athlon docs, but it can be used for K6-2 as well.) Another very interesting thing about this instruction is that it is not treated as a MMX/3DNow! instruction, and can be used with regular integer or FPU code.

When using instructions that reference memory, it is important to avoid the following addressing modes:
- the use of index register, which forces existence of the SIB opcode byte that increases the size of the instruction (for example [eax*2] or [ebx+ecx]);
- long (32-bit) offsets (e.g. [eax+128] 128 is beyond signed byte);
- using esi register alone: [esi] (strange, but neither K6s nor Athlons like it).

If one of the above addressing modes is encountered, the instruction is decoded alone (does not take advantage of the two parallel decoders) in one or two cycles.

The following example converts a stream of words into 32-bit floats. The instruction latency is marked with parentheses. The prefetch distance was calculated as 200*8/7 = 229.

@@loop: 	movq	mm0, [eax]	; 1     [d c b a]
		add	eax, 8		; - 	     
		movq	mm1, mm0	; 2     [d c b a]
		punpcklwd mm0, mm0	; -     [b b a a]
		punpckhwd mm1, mm1	; 3     [d d c c]
		psrad	mm0, 16 	; -     [ b | a ]
		psrad	mm1, 16 	; 4     [ d | c ]
		pi2fd	mm0, mm0	; - (2) [ b | a ]
		pi2fd	mm1, mm1	; 5 (2) [ d | c ]
		prefetch [eax+229]	; - (3)
		movq	[edx], mm0	; 6 (latency not important)
		movq	[edx+8], mm1	; 7
		add	edx, 16 	; -
		sub	ecx, 4		; 8 (4 words per loop)
		ja	@@loop		; -

The routine takes approximately 8 cycles per loop (2 cycles per converted word), but the cycle marking doesn't have to be the same each loop. Interesting that the integer instruction is paired with a movq from memory. Even if the prefetch instruction executes for three cycles in the load unit, it is almost invisible to the loop code. The following movq could thus be executed in parallel with the preceding pi2fd instruction. The loop count bases on decrementing ecx register. The AMD docs say that the latency of the loop instruction is the same as the latency of dec/jnz pair on K6-2, but on Athlon loop executes slowly like on Intel processors, thus is deprecated.

Pentium III

This was the first time when Intel was after competition, with its own SIMD-FP solution. The difference between SSE (Streaming SIMD Extensions, official name of Intel's SIMD-FP issue) and 3DNow! is significant. 3DNow! performs on the MMX register set, while Intel developed a brand new set of eight registers, which can contain four, not two, 32-bit floating point numbers. This however forced inclusion of a new register save/restore mechanism, what caused a long delay before new software (especially OSes) was available.

Intel's SIMD-FP operates, just like 3DNow!, on four units: ALU, multiplier, load and store unit. MMX multiplier is contained within a separate 3-cycle latency unit. The latencies of MMX instructions are the same on Pentium II and III as on Pentium MMX, and the same numbers and types of execution resources exist; the only difference is that Pentium II and III execute instructions out of order.

SSE includes not only the SIMD-FP logic, but also enhanced set of MMX instructions, which provide more flexibility to the MMX "technology".

As pointed earlier, the new SSE registers (called xmm0..xmm7) are 128-bit wide and contain four independent 32-bit floating point numbers. There are two types of SIMD-FP instructions: ones that do all four operations on the four 32-bit FP numbers, and ones that affect only the first 32-bit number. The first ones' mnemonics end with ps (Packed Single precision FP), the others' with ss (Scalar Single precision FP), for example addps and addss.

The SSE instruction latencies vary, unlike on 3DNow! where all have two cycle latency (SIMD-FP). The throughput of many SSE instructions is 2 cycles (one can be issued every 2 cycles).

Appendix C at the end of this article summarizes and compares 3DNow! and SSE SIMD-FP instructions, their latencies and throughputs, keeping in mind that SSE registers are twice as wide as MMX registers used by 3DNow!.

At first glance, programming with SSE isn't as easy as with 3DNow!. But when I've tried it, it appeared that it isn't more difficult than 3DNow!. The big number of move/unpack instructions can scare off, but they are really useful. The move instructions give much more flexibility with the new SIMD registers than we have with plain MMX. Additionally, the MMX instruction set has also been enhanced with useful move/shuffle instructions.

SSE includes a set of common arithmetic instructions - add*s, sub*s, min*s, max*s, cmp*s, mul*s, div*s, sqrt*s, rcp*s, rcqrt*s. The * in this case stands for p (packed) or s (scalar). rcp*s and rcqrt*s help to quickly calculate reciprocal and reciprocal square root with low precision. To increase precision of the result, a Newton-Raphson method can be used (described in Intel manual). The AMD's approach in this case is simpler, and with the use of special pfrc*it* instructions the result is calculated faster.

More SIMD-FP instructions include comiss and ucomiss - comparison which stores result in regular EFLAGS, andps, xorps, orps used mainly for manipulating signs, a set of move instructions (e.g. shufps, movaps, unpcklps), conversion instructions cvt*2* and ldmxcsr/stmxcsr for managing the precision and rounding control register.

The MMX extensions include pmin*, pmax* and pavg* for min, max and average operations on bytes or words, psadbw - sum of absolute differences, a set of extended move instructions and pmulhuw for unsigned word multiplication.

Additional sfence and prefetch* instructions help to reduce store and load latencies. The prefetch instruction comes in four flavors, which are currently executed in the same manner.

Athlon

Now, this is a real monster. In most fields it is more efficient than Pentium III, though its MMX is theoretically slower (I haven't tested this). One advantage of this processor is that its three decoders are equally limited with micro-op output (which is max two u-ops per decoder, i.e. 2-2-2 scheme). Athlon supports also the new MMX instructions introduced with SSE, along with its own small extension of 3DNow!.

The greatest feature of Athlon are its two fully pipelined FP units, which are responsible for all FPU, MMX and 3DNow! instructions. They are named FADD and FMUL, according to what they actually do. The list of stuff accepted by them:
FADD - MMX ALU, MMX shifter, 3DNow! adder, FPU adder
FMUL - MMX ALU, 3DNow!/MMX multiplier/reciprocal unit, FPU multiplier/divider/square root
The third unit listed in AMD docs is FSTORE. It performs some auxiliary tasks related to FPU, such as loading constants (fldz) and intermediate stuff needed for long and complex FPU instructions, like fsin.

If AMD docs are correct, the reciprocal and square root reciprocal table lookup instructions are executed in the FMUL unit on Athlon, and not in the adder, like it was on K6-2/III.

The only disadvantage of Athlon's SIMD solution is the latency of instructions, which doubled since K6-2, and now equals 4 cycles for MMX multiply and all 3DNow! instructions and 2 for other MMX instructions. Fortunately they are fully pipelined. This is because these instructions execute on the same units as regular FPU instructions.

In fact the SIMD-FP instructions aren't that slow. My small investigation (the clockspd proggie included in bonus pack) proves that though a single 3DNow! FP instruction executes for 4 cycles, the intermediate result can be fetched and processed earlier by other FP instruction. For example, if a pfrcp is followed by a pfmul on the result, the two instructions together take approximately 7 cycles, instead of 8 (4+4). See also the test in appendix D.

As I have mentioned before, RISC-like coding is preferred for 3DNow!, especially on Athlon; i.e. splitting instructions that access memory into movq/exec pairs. This way of handling loads can reduce overall execution time, because the loads can be taken earlier.

There are three additional instructions on Athlon (except those SSE MMX extensions): pswapd (swaps dwords), pfnacc and pfpnacc (accumulation with negation of certain elements), pf2iw and pi2fw (conversion between single-FP and words).

Athlon example

As an example let's take what we find in 3D engines:

	     [ x']   [ a b c d ]     [ X ]
	     [ y'] = [ e f g h ]  x  [ Y ]
	     [ z']   [ i j k l ]     [ Z ]
	     [ 1 ]   [ 0 0 0 1 ]     [ 1 ]

The final result is [ x y ] = [ x'/z' y'/z']. This is a typical geometry transformation algorithm that includes perspective division.

Let's assume that the matrix is stored as [ a b c d e f ... 0 1 ] and the point is stored as [ X Y Z 1 ]. We get the matrix pointer in eax, the point pointer in ecx, and the result [ x y ] we put at edx. All numbers are 32-bit floats.

	; First the z', as we want to do perspective division later
		movq	mm2, [eax+8*4]	; 1 (2) [ j i ]
		movq	mm0, [ecx]	; 2 (2) [ Y X ]
		movq	mm1, [ecx+8]	; 3 (2) [ 1 Z ]

		pfmul	mm2, mm0	; 4 (4) [ j*Y i*X ]

		movq	mm3, [eax+10*4] ; - (2) [ l k ]
		movq	mm4, [eax]	; 5 (2) [ b a ]

		pfmul	mm3, mm1	; 6 (4) [ l k*Z ]
		movq	mm5, [eax+2*4]	; - (2) [ d c ]

	; Now we start calculating x' and load for y'
		pfmul	mm4, mm0	; 7 (4) [ b*Y a*X ]
		movq	mm6, [eax+4*4]	; - (2) [ f e ]
		pfmul	mm5, mm1	; 8 (4) [ d c*Z ]
		movq	mm7, [eax+6*4]	; - (2) [ h g ]

	; Now we start calculating y' while z' is being finished
		pfmul	mm6, mm0	; 9 (4) [ f*Y e*X ]
		pfmul	mm7, mm1	; 10 (4) [ h g*Z ]
		pfacc	mm2, mm3	; -  (4) [ k*Z+l  i*X+j*Y ]

	; Ouch ! Now we loose a lot of cycles...
		pfacc	mm4, mm5	; 12 (4) [ c*Z+d  a*X+b*Y ]
		pfacc	mm2, mm2	; 14 (4) [ z' z']
		pfacc	mm6, mm7	; 15 (4) [ g*Z+h  e*X+f*Y ]
		pfrcp	mm3, mm2	; 18 (4) [ 1/z' 1/z'] (approx.)
		pfacc	mm4, mm6	; 19 (4) [ y' x' ]
		pfmul	mm4, mm3	; 23 (4) [ y'/z' x'/z']
		movq	[edx], mm4	; 27 (2)

The result we have accurate to 15 bits. Full precision would require additional six cycles (with the iteration instructions).

The clock marking in this example is purely theoretical, and in fact wrong: the results are fetched after 3, not 4 cycles, the loads and stores are issued off-path, etc. But the clock marking shows us that the beginning of the routine is quite well balanced, while after 10th cycle we get a lot of spare cycles. Putting this code in a loop, we could theoretically unroll it and put some code from the cloned piece in between the instructions which cause so big cycle losses. But this would be a wrong, needless approach.

The unrolling and "interpolating" the loop has exactly the same effect as simply optimising the loop for K6-2. Athlon, just like Pentium II, executes instructions out of order. This means that when we have a loop and all instructions are decoded pretty well (thus the decode buffers, a.k.a. reservation stations stay saturated), the processor optimises the code for us! If you are an owner of an Athlon, you can check what I am talking about with the test included in the bonus pack.

There is another approach in optimisation of the above code. It bases on executing two loops in parallel. For example:

	; Instead of doing
@@loop: 	movq	mm0, [eax]
		add	eax, 8
		paddusw mm0, mm7
		movq	[edx], mm0
		add	edx, 8
		dec	ecx
		jnz	@@loop

	; We do like follows
@@loop: 	movq	mm0, [eax]
		movq	mm1, [eax+8]
		add	eax, 16
		paddusw mm0, mm7
		paddusw mm1, mm7
		movq	[edx], mm0
		movq	[edx+8], mm1
		add	edx, 16
		sub	ecx, 2
		ja	@@loop

This technique can improve our code scheduled for Athlon processor, but only if the code is complex (the dependency chains are very long, like 20 or more instructions in a row) and we use about half of the registers. In regular, simplier cases, we get a small boost when the loop is executed quickly and the data is in L2 cache, but we do not get anything when we process much more data.

The test in the bonus pack illustrates this very well. It includes optimised transformation routines (point by matrix multiplication and perspective division, the result is int). On Athlon the code optimised for K6-2 gives 19 cycles per point, as well as the first approach described above, and the "parallel" approach gives 16 cycles per point. But this is while all the data is in L1 cache - for 1000 vertices. For 100,000 vertices the results for all routines are the same and give 35 cycles per point. Prefetching works for the large amount of vertices and speeds up the code twice (some 57 cycles per point without prefetching). For low vertex count prefetching neither improves nor reduces performance.

Summing up, 3DNow! code should be optimised for K6-2. On Athlon it is then executed quite efficiently. For complex algorithms with very long dependency chains one could think of additional optimisation for Athlon, but this approach could also be taken when optimising for K6-2.

Conclusion

Pentium III is slightly better performing in MMX than Athlon (theoretically), the tests prove it. The advantage in field of SIMD-FP, however, is questionable. Pentium III provides more instructions that add more flexibility, Athlon, on the other hand, gives obvious simplicity.

It is recommended that MMX code would be optimised for Pentium MMX. This would ensure that it executes fast enough on weak Pentiums, and the newer processors also take advantage of this optimisation.

3DNow! code should be optimised for K6-2, still the coder should carefully consider all dependency chains.

Let's hope that Pentium-XXX and Athlon brands are not the last that do a major optimisation with their out-of-order architecture. IA-64 requires that all optimisations are performed by the coder, who has complete control over instructions execution. Is it really better ?

Appendix A - 3DNow! instructions summary
Appendix B - SSE instructions summary
Appendix C - Comparison of SSE on Pentium III and 3DNow! on Athlon
Appendix D - Comparison test

Chris Dragan