Sunday, February 15, 2009

View frustum culling optimization – Structures and arrays

Last week I've tried my best at optimizing the underlying functions without touching the essence of algorithm (if there was a function initially that filled a 8-vector array with AABB points, optimizations from previous post could be done in math library). It seems the strategy has to be changed.

There are several reasons why the code is still slow. One is branching. We'll cover that in the next issue though. Another one has already been discussed on this blog related to shaders – we have 4-way SIMD instructions, but we are not using them properly. For example, our point transformation function wastes 1 scalar operation per each si_fma, and requires additional .w component fixup after that. Our dot product function is simply horrible. Once again we're going to switch layout for intermediate data from Array of Structures to Structure of Arrays.

We have 8 AABB points, so we'll need 6 vectors – 2 vectors per each component. Do we need all 6? Nah. Since it's an AABB, we can organize stuff so that we need only 4 like this:

x X x X
y y Y Y
z z z z

Note that vectors for x and y components are shared between two 4-point groups. Of course this sharing will go away after we transform our points to world space – but that makes it easier to generate SoA points from min/max vectors.

How do we generate them? Well, we already know the solution for Z – there is some magical si_shufb instruction, that worked for us before. It's time to know what exactly it does, as it can be used to generate x/y vectors too.

What si_shufb(a, b, c) does is it takes a/b registers, and permutes their contents using c as pattern, yielding a new value. Permutation is done at a byte level - each byte of c corresponds to a resulting byte, which is computed one of the following ways:

1.0x0v corresponds to a byte of left operand with index v
2.0x1v corresponds to a byte of right operand with index v
3.0x80 corresponds to a constant 0x00
4.0xC0 corresponds to a constant 0xFF
5.0xE0 corresponds to a constant 0x80
6.other values result in one of the above, the exact treatment is out of the scope

This is a superset of Altivec vec_perm instruction, and can be used to do very powerful things, as we'll realize soon enough. For example, you can implement usual GPU-style swizzling like so:

src_zxxx = si_shufb(src, src, ((qword)(vec_uint4){0x08090a0b, 0x00010203, 0x00010203, 0x00010203}));

First four bytes of my pattern correspond to bytes 8-11 of left argument, all other four-byte groups correspond to bytes 0-3 of left argument. This is equal to applying .zxxx swizzle. As you can probably see, the code can get very obscure if you use shuffles a lot, so I've made some helper macros:

// shuffle helpers
#define L0 0x00010203
#define L1 0x04050607
#define L2 0x08090a0b
#define L3 0x0c0d0e0f

#define R0 0x10111213
#define R1 0x14151617
#define R2 0x18191a1b
#define R3 0x1c1d1e1f

#define SHUFFLE(l, r, x, y, z, w) si_shufb(l, r, ((qword)(vec_uint4){x, y, z, w}))

// splat helper
#define SPLAT(v, idx) si_shufb(v, v, (qword)(vec_uint4)(L ## idx))

SHUFFLE is for general shuffling, SPLAT is for component replication (.yyyy-like swizzles). Note that in previous post SPLAT was used in transform_point to generate .xxxx, .yyyy and .zzzz swizzles from AABB point.

Let's generate AABB points then.

    // get aabb points (SoA)
    qword minmax_x = SHUFFLE(min, max, L0, R0, L0, R0); // x X x X
    qword minmax_y = SHUFFLE(min, max, L1, L1, R1, R1); // y y Y Y
    qword minmax_z_0 = SPLAT(min, 2); // z z z z
    qword minmax_z_1 = SPLAT(max, 2); // Z Z Z Z

That was easy. Now if we want first 4 points, we use minmax_x, minmax_y, minmax_z_0; for the second group, we use minmax_x, minmax_y, minmax_z_1.

Now, we have 2 groups of 4 points in each, SoA style – we have to transform them to world space. It's actually quite easy – remember the first scalar version? If you've glanced at the code, you've seen a macro for computing single resulting component:

#define COMP(c) p->c = op.x * mat->row0.c + op.y * mat->row1.c + op.z * mat->row2.c + mat->row3.c

As it turns out, this can be converted to SoA style multiplication almost literally – you just need to think of op.x, op.y, op.z as of vectors with 4 values of some component; mat->rowi.c has to be splatted over all components. The resulting function becomes:

static inline void transform_points_4(qword* dest, qword x, qword y, qword z, const struct matrix43_t* mat)
#define COMP(c) \
    qword res_ ## c = SPLAT((qword)mat->row3, c); \
    res_ ## c = si_fma(z, SPLAT((qword)mat->row2, c), res_ ## c); \
    res_ ## c = si_fma(y, SPLAT((qword)mat->row1, c), res_ ## c); \
    res_ ## c = si_fma(x, SPLAT((qword)mat->row0, c), res_ ## c); \
    dest[c] = res_ ## c;

#undef COMP

Note that it's not really that much different from the scalar version, only now it transforms 4 points in 9 si_fma and 12 si_shufb instructions. We're going to transform 2 groups of points, so we'll need 18 si_fma instructions, si_shufb can be shared – luckily, the compiler does it for us so we just need to call transform_points_4 twice:

    // transform points to world space
    qword points_ws_0[3];
    qword points_ws_1[3];

    transform_points_4(points_ws_0, minmax_x, minmax_y, minmax_z_0, transform);
    transform_points_4(points_ws_1, minmax_x, minmax_y, minmax_z_1, transform);

Previous vectorized version required 24 si_fma and 24 si_shufb, plus 8 correcting si_selb (to be fair, it could be actually optimized to require 6 si_shufb + 8 si_selb, but it's still not a win over SoA). Note that 18 si_fma + 12 si_shufb does not mean 30 cycles. SPUs are capable of dual-issuing some instructions – there are two groups of instructions, one group runs at even pipeline, another one – at odd. si_fma and si_shufb run on different pipelines, so the net throughput will be closer to 18 cycles (slightly larger than that if si_shufb latency can't be hidden).

Now all that's left is to calculate dot products with a plane. Of course we'll have to calculate them 4 at a time. But wait – in our case execution of inner loop terminated after the first iteration. So previously we were doing only one (albeit ugly) dot product, and now we're doing 4, or even 8! Isn't that a little bit excessive? Well, that's not – but we'll save a more detailed explanation for the later post, for now let the results speak for themselves.

In order to calculate 4 dot products, we'll make a helper function:

static inline qword dot4(qword v, qword x, qword y, qword z)
    qword result = SPLAT(v, 3);

    result = si_fma(SPLAT(v, 2), z, result);
    result = si_fma(SPLAT(v, 1), y, result);
    result = si_fma(SPLAT(v, 0), x, result);

    return result;

And call it twice. Again, we'll be doing four splats twice, but compiler is smart enough to eliminate this. After that we'll have to compare all 8 dot products with zero, and return false if all of those are negative.

    // for each plane...
    for (int i = 0; i < 6; ++i)
        qword plane = (qword)frustum->planes[i];

        // calculate 8 dot products
        qword dp0 = dot4(plane, points_ws_0[0], points_ws_0[1], points_ws_0[2]);
        qword dp1 = dot4(plane, points_ws_1[0], points_ws_1[1], points_ws_1[2]);

        // get signs
        qword dp0neg = si_fcgt((qword)(0), dp0);
        qword dp1neg = si_fcgt((qword)(0), dp1);

        if (si_to_uint(si_gb(si_and(dp0neg, dp1neg))) == 15)
            return false;

si_fcgt is just a floating-point greater comparison; I'm abusing the fact that 0.0f is represented as a vector with all bytes equal to zero here. si_fcgt operates like SSE comparisons and returns 0xffffffff for elements where the comparison result is true, and 0 for others. After that I and the results together, and then use si_gb instruction to gather bits of results. si_gb takes least significant bit from each element and inserts it into corresponding bit of the result; we get a 4-bit value in preferred slot, everything else is zeroed out. If it's equal to 15, then si_and returned a mask where all elements are 0xffffffff, which means that all dot products are less than zero, so the box is outside.

Note that si_gb is like _mm_movemask_ps, only it takes least significant bits instead of most significant – in case of SSE, we don't need to do comparisons. We can avoid comparisons here by anding dot products directly, and then moving the sign bit to least significant bit (it can be done by rotating each element 1 bit to the left, that's achieved by si_roti(v, 1)), but this is slightly slower, so we won't do it.

Now, the results. The code runs at 376 cycles, which is more than 2 times faster than the previous version, and almost 4 times faster than the original. This speedup is partially because we're doing things more efficiently, partially because we got rid of branches; we'll discuss this the next week. A million calls takes 117 msec, which is still worse than x86 results – but it's not the end of the story. Astonishingly, applying exactly the same optimizations to SSE code results in 81 msec for gcc (which is 30% faster than naively vectorized version), and in 104 msec for msvc8 (which is 40% slower!).

The fastest version is still produced by msvc8 from previous version. This should not be very surprising, as we changed inner loop from performing one dot-product to performing 8 at once, so that shows. We can optimize it in this case by adding early out – after we compute first 4 dot products, we'll check if all of them are positive; if some of them are not, we can safely skip additional 4 dot products and continue to the next iteration. It results in 87 ms for msvc8 and 65 ms for gcc, with gcc-compiled SoA finally being faster than all previous approaches. Of course, this is a worst case for SoA – in case inner loops actually did not terminate after first iteration the performance gain would be greater. Adding the same optimization to SPU code makes it slightly (by 3 cycles) slower; the penalty is tens of cycles if the early out does not happen and we have to compute all 8 dot products, so it's definitely not worth it.

The current source can be grabbed here.

That's all for now – stay tuned for the next weekend's post! Read more!

Sunday, February 8, 2009

View frustum culling optimization – Vectorize me

Last week I've posted some teaser code that will be transformed several times, each time yielding a faster one - “faster” in terms of “taking less cycles for the test case on SPU”. A lot of you probably looked at my admittedly lame excuse for, uhm, math library and want to ask – why the hell do you use scalar code? We're going to address the problem in this issue. This is probably a no-brainer for most of my readers, but this is a good opportunity to introduce some important points about SPUs and introduce some actual vector code before diving further.

But first, we need some background information on SPUs. For the todays post, there is a single important thing to know about SPUs – they are vector processors. Unlike most common architectures (PowerPC, x86, etc.), SPUs have only one register set, which consists of 128-bit vectors. The current implementation has 128 of them, and each register is treated differently in different instructions (you have different instructions for adding two registers as if they contained 4 single precision floats or 16 8-bit integers). The important point is that, while you can compile a piece of scalar code for SPU, it's going to use vector registers and vector instructions; the scalar values are assumed to reside in so called preferred slot – for our current needs, we only care about preferred slot for 32-bit scalars, which is the first one (index 0). Register components are numbered from least address in memory onwards, which is really refreshing after SSE little-endian madness.

This actually goes slightly further – not only all registers are 16-byte, but all memory accesses (I'm obviously talking about local storage access here – though the same mostly applies to DMA; I'll be probably discussing something DMA-related after VFC series ends) should be – you can only load/store a full register's worth of data from/to 16b-aligned location. Of course, you can implement a workaround for scalar values – for loading, load the 16 byte chunk the value is in, and then shift it in the register so that it resides in preferred slot; for saving, load the destination 16 byte chunk, insert desired value in it via shifting/masking, and then store the whole chunk back. In fact, this is exactly what compiler does. Moreover, for our struct vector3_t, loading three components in registers will generate such load/shift code for every component, since compiler does not know the alignment (the whole vector could be in one 16 byte chunk, or it could be split in half between any two components).

In order to leverage available power, we have to use available vector instructions. SPUs have a custom instruction set, which is well documented. For now, it's important to know that there is a fused multiply-add instruction, which computes a*b+c, and there is no dot product instruction (or floating-point horizontal sum, for that matter). In fact, on current generation of consoles, XBox360 is pretty unique in that it does have a dot product instruction.

So, our code is bad because we have lots of scalar memory accesses and lots of scalar operations, which are not using available processing power properly. Let's change this!

One option is to code in assembly; this has obvious benefits and obvious pitfalls, and we'll use intrinsics instead. For SPUs, we have three intrinsics sets to choose from – Altivec emulated (vec_*, the same as we use on PPU), generic type-aware (spu_*) and low-level (si_*). GCC compiler provides several vector types as language extensions (some examples are 'vector float' and 'vector unsigned char', which correspond to 4 32-bit floats and 16 8-bit unsigned integers, respectively); a single spu_* instruction translates to different assembly instructions depending on a type, while si_* instructions operate on abstract register (it has type 'qword', which corresponds to 'vector signed char') – i.e. to add two vectors, you can use spu_add(v1, v2) with typed registers, or one of si_a, si_ah, si_fa, si_dfa to add registers as 32-bit integer, 16-bit integer, 32-bit floating point or 64-bit floating point, respectively. We'll be using si_* family for several reasons – one, they map to assembly exactly, so getting used to si_* instructions make it much easier to read (and possibly write) actual assembly, which is very useful when debugging or optimizing code, two, spu_* family is not available in C, as it uses function overloading. I'll explain specific intrinsics as we start using them.

First thing we'll do is dispose of redundant vector3_t/plane_t structures (in a real math library, we won't do this of course, but this is a sample), and replace them with qwords. This way, everything will be properly aligned, and we won't need to write load/store instructions ourselves (as opposed to something like struct vector3_t { float v[4]; }).

Then, we have to generate an array of points. Each resulting point is a combination of aabb->min and aabb->max – for each component we select either minimum or maximum value. As it turns out, there is the instruction that does exactly that – it accepts two registers with actual values and a third one with pattern; for each bit in pattern, it takes left bit for 0 and right bit for 1 – it's equivalent to (a & ~c) | (b & c), only in one instruction.

The code becomes

    // get aabb points
    qword points[] =
        min,                                                   // x y z
        si_selb(min, max, ((qword)(vec_uint4){~0, 0, 0, 0})),  // X y z
        si_selb(min, max, ((qword)(vec_uint4){~0, ~0, 0, 0})), // X Y z
        si_selb(min, max, ((qword)(vec_uint4){0, ~0, 0, 0})),  // x Y z

        si_selb(min, max, ((qword)(vec_uint4){0, 0, ~0, 0})),  // x y Z
        si_selb(min, max, ((qword)(vec_uint4){~0, 0, ~0, 0})), // X y Z
        max,                                                   // X Y Z
        si_selb(min, max, ((qword)(vec_uint4){0, ~0, ~0, 0})), // x Y Z

Note that I'm using another gcc extension to form vector constants. This is very convenient and does not exhibit any unexpected penalties (the expected ones being additional constant storage and additional instructions to load them).

Then we have transform_point; we'll have to transform a given vector by matrix, and additionally to stuff a 1.0f in .w component of the result in order for the following dot product to work (I sort of hacked this in scalar version by using dot(vector3, vector4)). Vector-matrix SIMD multiplication is very well-known – we'll need add/multiply instructions, and ability to replicate a vector element across the whole vector. For this we'll use a si_shufb instruction – I'll leave the detailed explanation for the next issue, for now just assume that it works as desired :)

static inline qword transform_point(qword p, const struct matrix43_t* mat)
    qword px = si_shufb(p, p, (qword)(vec_uint4)(0x00010203));
    qword py = si_shufb(p, p, (qword)(vec_uint4)(0x04050607));
    qword pz = si_shufb(p, p, (qword)(vec_uint4)(0x08090a0b));

    qword result = (qword)mat->row3;

    result = si_fma(pz, (qword)mat->row2, result);
    result = si_fma(py, (qword)mat->row1, result);
    result = si_fma(px, (qword)mat->row0, result);

    result = si_selb(result, ((qword)(vec_float4){0, 0, 0, 1}), ((qword)(vec_uint4){0, 0, 0, ~0}));

    return result;

We replicate point components, yielding three vectors, and then compute transformation result using si_fma (fused multiply-add; returns a * b + c) instruction. After that we combine it via selb to get 1.0f in the last component.

Note that in this case we are fortunate to have our matrix laid out as it is – another layout would force us to transpose it prior to further computations to make vectorization possible. In scalar case, the layout does not make any difference.

Finally, we'll have to compute dot product. As there is no dedicated dot product instruction, we'll have to emulate it, which is not pretty.

static inline float dot(qword lhs, qword rhs)
    qword mul = si_fm(lhs, rhs);

    // two pairs of sums
    qword mul_zwxy = si_rotqbyi(mul, 8);
    qword sum_2 = si_fa(mul, mul_zwxy);

    // single sum
    qword sum_2y = si_rotqbyi(sum_2, 4);
    qword sum_1 = si_fa(sum_2, sum_2y);

    // return result
    return si_to_float(sum_1);

First we get a component-wise multiplication result by using fm; then we'll have to compute horizontal sum. First we sum odd/even components together separately. For that, we rotate our register to the left by 8 bytes (si_rotqbyi) and add with the original. After that, we rotate the result left by 4 bytes (to get the second sum at the preferred slot) and add with the original.

For mul = (1, 2, 3, 4), we get the following values:
mul_zwxy = 3 4 1 2
sum_2 = 4 6 4 6
sum_2y = 6 4 6 4
sum_1 = 10 10 10 10

The result is converted to float via si_to_float cast intrinsic – it just tells the compiler to reinterpret result as if it was a float (actual scalar value is assumed to be in preferred slot), this usually does not generate any additional instructions.

Note that in case of SPU, there is only one register set – thus there is no penalty for such vector/scalar conversion. This code will not perform very well for other architectures – for example, on PowerPC converting vector to float in this way causes a LHS (Load Hit Store; it occurs when you read from the same address you just wrote into) because vector should be stored to stack to load vector element into float register; LHS causes a huge stall (40-50 cycles) and thus performance can be compromised here. For this reason, if your PPU/VMX math library has an optimized dot product function that returns float, don't use it in performance critical code – find another approach. It's interesting that if you think about it, you don't need dot products that much, as I'll show in the next issue.

Anyway, the current code runs at 820 cycles, which is 50% faster than scalar code. This equals to approximately 256 msec per million calls, the corresponding numbers for x86 being 136 msec for gcc and 74 msec for msvc8. Once x86 code is changed so that dot() function returns its result in a vector register, and resulting sign is then analyzed via _mm_movemask_ps instruction, timings change to 126/68, respectively. We've made some progress there, but our SPU implementation is still far from x86 in terms of speed though we're using the same techniques. I promise that the end result will be much more pleasing though :)

The current source can be grabbed here.

That's all for now – stay tuned for the next weekend's post! Read more!