What your mother never told you about graphics development

Sunday, February 15, 2009

View frustum culling optimization – Structures and arrays

Last week I've tried my best at optimizing the underlying functions without touching the essence of algorithm (if there was a function initially that filled a 8-vector array with AABB points, optimizations from previous post could be done in math library). It seems the strategy has to be changed.

There are several reasons why the code is still slow. One is branching. We'll cover that in the next issue though. Another one has already been discussed on this blog related to shaders – we have 4-way SIMD instructions, but we are not using them properly. For example, our point transformation function wastes 1 scalar operation per each si_fma, and requires additional .w component fixup after that. Our dot product function is simply horrible. Once again we're going to switch layout for intermediate data from Array of Structures to Structure of Arrays.

We have 8 AABB points, so we'll need 6 vectors – 2 vectors per each component. Do we need all 6? Nah. Since it's an AABB, we can organize stuff so that we need only 4 like this:

x X x X
y y Y Y
z z z z
Z Z Z Z

Note that vectors for x and y components are shared between two 4-point groups. Of course this sharing will go away after we transform our points to world space – but that makes it easier to generate SoA points from min/max vectors.

How do we generate them? Well, we already know the solution for Z – there is some magical si_shufb instruction, that worked for us before. It's time to know what exactly it does, as it can be used to generate x/y vectors too.

What si_shufb(a, b, c) does is it takes a/b registers, and permutes their contents using c as pattern, yielding a new value. Permutation is done at a byte level - each byte of c corresponds to a resulting byte, which is computed one of the following ways:

1.0x0v corresponds to a byte of left operand with index v
2.0x1v corresponds to a byte of right operand with index v
3.0x80 corresponds to a constant 0x00
4.0xC0 corresponds to a constant 0xFF
5.0xE0 corresponds to a constant 0x80
6.other values result in one of the above, the exact treatment is out of the scope

This is a superset of Altivec vec_perm instruction, and can be used to do very powerful things, as we'll realize soon enough. For example, you can implement usual GPU-style swizzling like so:

src_zxxx = si_shufb(src, src, ((qword)(vec_uint4){0x08090a0b, 0x00010203, 0x00010203, 0x00010203}));

First four bytes of my pattern correspond to bytes 8-11 of left argument, all other four-byte groups correspond to bytes 0-3 of left argument. This is equal to applying .zxxx swizzle. As you can probably see, the code can get very obscure if you use shuffles a lot, so I've made some helper macros:

// shuffle helpers
#define L0 0x00010203
#define L1 0x04050607
#define L2 0x08090a0b
#define L3 0x0c0d0e0f

#define R0 0x10111213
#define R1 0x14151617
#define R2 0x18191a1b
#define R3 0x1c1d1e1f

#define SHUFFLE(l, r, x, y, z, w) si_shufb(l, r, ((qword)(vec_uint4){x, y, z, w}))

// splat helper
#define SPLAT(v, idx) si_shufb(v, v, (qword)(vec_uint4)(L ## idx))

SHUFFLE is for general shuffling, SPLAT is for component replication (.yyyy-like swizzles). Note that in previous post SPLAT was used in transform_point to generate .xxxx, .yyyy and .zzzz swizzles from AABB point.

Let's generate AABB points then.

    // get aabb points (SoA)
    qword minmax_x = SHUFFLE(min, max, L0, R0, L0, R0); // x X x X
    qword minmax_y = SHUFFLE(min, max, L1, L1, R1, R1); // y y Y Y
    qword minmax_z_0 = SPLAT(min, 2); // z z z z
    qword minmax_z_1 = SPLAT(max, 2); // Z Z Z Z

That was easy. Now if we want first 4 points, we use minmax_x, minmax_y, minmax_z_0; for the second group, we use minmax_x, minmax_y, minmax_z_1.

Now, we have 2 groups of 4 points in each, SoA style – we have to transform them to world space. It's actually quite easy – remember the first scalar version? If you've glanced at the code, you've seen a macro for computing single resulting component:

#define COMP(c) p->c = op.x * mat->row0.c + op.y * mat->row1.c + op.z * mat->row2.c + mat->row3.c

As it turns out, this can be converted to SoA style multiplication almost literally – you just need to think of op.x, op.y, op.z as of vectors with 4 values of some component; mat->rowi.c has to be splatted over all components. The resulting function becomes:

static inline void transform_points_4(qword* dest, qword x, qword y, qword z, const struct matrix43_t* mat)
{
#define COMP(c) \
    qword res_ ## c = SPLAT((qword)mat->row3, c); \
    res_ ## c = si_fma(z, SPLAT((qword)mat->row2, c), res_ ## c); \
    res_ ## c = si_fma(y, SPLAT((qword)mat->row1, c), res_ ## c); \
    res_ ## c = si_fma(x, SPLAT((qword)mat->row0, c), res_ ## c); \
    dest[c] = res_ ## c;

    COMP(0);
    COMP(1);
    COMP(2);
    
#undef COMP
}

Note that it's not really that much different from the scalar version, only now it transforms 4 points in 9 si_fma and 12 si_shufb instructions. We're going to transform 2 groups of points, so we'll need 18 si_fma instructions, si_shufb can be shared – luckily, the compiler does it for us so we just need to call transform_points_4 twice:

    // transform points to world space
    qword points_ws_0[3];
    qword points_ws_1[3];

    transform_points_4(points_ws_0, minmax_x, minmax_y, minmax_z_0, transform);
    transform_points_4(points_ws_1, minmax_x, minmax_y, minmax_z_1, transform);

Previous vectorized version required 24 si_fma and 24 si_shufb, plus 8 correcting si_selb (to be fair, it could be actually optimized to require 6 si_shufb + 8 si_selb, but it's still not a win over SoA). Note that 18 si_fma + 12 si_shufb does not mean 30 cycles. SPUs are capable of dual-issuing some instructions – there are two groups of instructions, one group runs at even pipeline, another one – at odd. si_fma and si_shufb run on different pipelines, so the net throughput will be closer to 18 cycles (slightly larger than that if si_shufb latency can't be hidden).

Now all that's left is to calculate dot products with a plane. Of course we'll have to calculate them 4 at a time. But wait – in our case execution of inner loop terminated after the first iteration. So previously we were doing only one (albeit ugly) dot product, and now we're doing 4, or even 8! Isn't that a little bit excessive? Well, that's not – but we'll save a more detailed explanation for the later post, for now let the results speak for themselves.

In order to calculate 4 dot products, we'll make a helper function:

static inline qword dot4(qword v, qword x, qword y, qword z)
{
    qword result = SPLAT(v, 3);

    result = si_fma(SPLAT(v, 2), z, result);
    result = si_fma(SPLAT(v, 1), y, result);
    result = si_fma(SPLAT(v, 0), x, result);

    return result;
}

And call it twice. Again, we'll be doing four splats twice, but compiler is smart enough to eliminate this. After that we'll have to compare all 8 dot products with zero, and return false if all of those are negative.

    // for each plane...
    for (int i = 0; i < 6; ++i)
    {
        qword plane = (qword)frustum->planes[i];

        // calculate 8 dot products
        qword dp0 = dot4(plane, points_ws_0[0], points_ws_0[1], points_ws_0[2]);
        qword dp1 = dot4(plane, points_ws_1[0], points_ws_1[1], points_ws_1[2]);

        // get signs
        qword dp0neg = si_fcgt((qword)(0), dp0);
        qword dp1neg = si_fcgt((qword)(0), dp1);

        if (si_to_uint(si_gb(si_and(dp0neg, dp1neg))) == 15)
        {
            return false;
        }
    }

si_fcgt is just a floating-point greater comparison; I'm abusing the fact that 0.0f is represented as a vector with all bytes equal to zero here. si_fcgt operates like SSE comparisons and returns 0xffffffff for elements where the comparison result is true, and 0 for others. After that I and the results together, and then use si_gb instruction to gather bits of results. si_gb takes least significant bit from each element and inserts it into corresponding bit of the result; we get a 4-bit value in preferred slot, everything else is zeroed out. If it's equal to 15, then si_and returned a mask where all elements are 0xffffffff, which means that all dot products are less than zero, so the box is outside.

Note that si_gb is like _mm_movemask_ps, only it takes least significant bits instead of most significant – in case of SSE, we don't need to do comparisons. We can avoid comparisons here by anding dot products directly, and then moving the sign bit to least significant bit (it can be done by rotating each element 1 bit to the left, that's achieved by si_roti(v, 1)), but this is slightly slower, so we won't do it.

Now, the results. The code runs at 376 cycles, which is more than 2 times faster than the previous version, and almost 4 times faster than the original. This speedup is partially because we're doing things more efficiently, partially because we got rid of branches; we'll discuss this the next week. A million calls takes 117 msec, which is still worse than x86 results – but it's not the end of the story. Astonishingly, applying exactly the same optimizations to SSE code results in 81 msec for gcc (which is 30% faster than naively vectorized version), and in 104 msec for msvc8 (which is 40% slower!).

The fastest version is still produced by msvc8 from previous version. This should not be very surprising, as we changed inner loop from performing one dot-product to performing 8 at once, so that shows. We can optimize it in this case by adding early out – after we compute first 4 dot products, we'll check if all of them are positive; if some of them are not, we can safely skip additional 4 dot products and continue to the next iteration. It results in 87 ms for msvc8 and 65 ms for gcc, with gcc-compiled SoA finally being faster than all previous approaches. Of course, this is a worst case for SoA – in case inner loops actually did not terminate after first iteration the performance gain would be greater. Adding the same optimization to SPU code makes it slightly (by 3 cycles) slower; the penalty is tens of cycles if the early out does not happen and we have to compute all 8 dot products, so it's definitely not worth it.

The current source can be grabbed here.

That's all for now – stay tuned for the next weekend's post! Read more!

Sunday, February 8, 2009

View frustum culling optimization – Vectorize me

Last week I've posted some teaser code that will be transformed several times, each time yielding a faster one - “faster” in terms of “taking less cycles for the test case on SPU”. A lot of you probably looked at my admittedly lame excuse for, uhm, math library and want to ask – why the hell do you use scalar code? We're going to address the problem in this issue. This is probably a no-brainer for most of my readers, but this is a good opportunity to introduce some important points about SPUs and introduce some actual vector code before diving further.

But first, we need some background information on SPUs. For the todays post, there is a single important thing to know about SPUs – they are vector processors. Unlike most common architectures (PowerPC, x86, etc.), SPUs have only one register set, which consists of 128-bit vectors. The current implementation has 128 of them, and each register is treated differently in different instructions (you have different instructions for adding two registers as if they contained 4 single precision floats or 16 8-bit integers). The important point is that, while you can compile a piece of scalar code for SPU, it's going to use vector registers and vector instructions; the scalar values are assumed to reside in so called preferred slot – for our current needs, we only care about preferred slot for 32-bit scalars, which is the first one (index 0). Register components are numbered from least address in memory onwards, which is really refreshing after SSE little-endian madness.

This actually goes slightly further – not only all registers are 16-byte, but all memory accesses (I'm obviously talking about local storage access here – though the same mostly applies to DMA; I'll be probably discussing something DMA-related after VFC series ends) should be – you can only load/store a full register's worth of data from/to 16b-aligned location. Of course, you can implement a workaround for scalar values – for loading, load the 16 byte chunk the value is in, and then shift it in the register so that it resides in preferred slot; for saving, load the destination 16 byte chunk, insert desired value in it via shifting/masking, and then store the whole chunk back. In fact, this is exactly what compiler does. Moreover, for our struct vector3_t, loading three components in registers will generate such load/shift code for every component, since compiler does not know the alignment (the whole vector could be in one 16 byte chunk, or it could be split in half between any two components).

In order to leverage available power, we have to use available vector instructions. SPUs have a custom instruction set, which is well documented. For now, it's important to know that there is a fused multiply-add instruction, which computes a*b+c, and there is no dot product instruction (or floating-point horizontal sum, for that matter). In fact, on current generation of consoles, XBox360 is pretty unique in that it does have a dot product instruction.

So, our code is bad because we have lots of scalar memory accesses and lots of scalar operations, which are not using available processing power properly. Let's change this!

One option is to code in assembly; this has obvious benefits and obvious pitfalls, and we'll use intrinsics instead. For SPUs, we have three intrinsics sets to choose from – Altivec emulated (vec_*, the same as we use on PPU), generic type-aware (spu_*) and low-level (si_*). GCC compiler provides several vector types as language extensions (some examples are 'vector float' and 'vector unsigned char', which correspond to 4 32-bit floats and 16 8-bit unsigned integers, respectively); a single spu_* instruction translates to different assembly instructions depending on a type, while si_* instructions operate on abstract register (it has type 'qword', which corresponds to 'vector signed char') – i.e. to add two vectors, you can use spu_add(v1, v2) with typed registers, or one of si_a, si_ah, si_fa, si_dfa to add registers as 32-bit integer, 16-bit integer, 32-bit floating point or 64-bit floating point, respectively. We'll be using si_* family for several reasons – one, they map to assembly exactly, so getting used to si_* instructions make it much easier to read (and possibly write) actual assembly, which is very useful when debugging or optimizing code, two, spu_* family is not available in C, as it uses function overloading. I'll explain specific intrinsics as we start using them.

First thing we'll do is dispose of redundant vector3_t/plane_t structures (in a real math library, we won't do this of course, but this is a sample), and replace them with qwords. This way, everything will be properly aligned, and we won't need to write load/store instructions ourselves (as opposed to something like struct vector3_t { float v[4]; }).

Then, we have to generate an array of points. Each resulting point is a combination of aabb->min and aabb->max – for each component we select either minimum or maximum value. As it turns out, there is the instruction that does exactly that – it accepts two registers with actual values and a third one with pattern; for each bit in pattern, it takes left bit for 0 and right bit for 1 – it's equivalent to (a & ~c) | (b & c), only in one instruction.

The code becomes

    // get aabb points
    qword points[] =
    {
        min,                                                   // x y z
        si_selb(min, max, ((qword)(vec_uint4){~0, 0, 0, 0})),  // X y z
        si_selb(min, max, ((qword)(vec_uint4){~0, ~0, 0, 0})), // X Y z
        si_selb(min, max, ((qword)(vec_uint4){0, ~0, 0, 0})),  // x Y z

        si_selb(min, max, ((qword)(vec_uint4){0, 0, ~0, 0})),  // x y Z
        si_selb(min, max, ((qword)(vec_uint4){~0, 0, ~0, 0})), // X y Z
        max,                                                   // X Y Z
        si_selb(min, max, ((qword)(vec_uint4){0, ~0, ~0, 0})), // x Y Z
    };

Note that I'm using another gcc extension to form vector constants. This is very convenient and does not exhibit any unexpected penalties (the expected ones being additional constant storage and additional instructions to load them).

Then we have transform_point; we'll have to transform a given vector by matrix, and additionally to stuff a 1.0f in .w component of the result in order for the following dot product to work (I sort of hacked this in scalar version by using dot(vector3, vector4)). Vector-matrix SIMD multiplication is very well-known – we'll need add/multiply instructions, and ability to replicate a vector element across the whole vector. For this we'll use a si_shufb instruction – I'll leave the detailed explanation for the next issue, for now just assume that it works as desired :)

static inline qword transform_point(qword p, const struct matrix43_t* mat)
{
    qword px = si_shufb(p, p, (qword)(vec_uint4)(0x00010203));
    qword py = si_shufb(p, p, (qword)(vec_uint4)(0x04050607));
    qword pz = si_shufb(p, p, (qword)(vec_uint4)(0x08090a0b));

    qword result = (qword)mat->row3;

    result = si_fma(pz, (qword)mat->row2, result);
    result = si_fma(py, (qword)mat->row1, result);
    result = si_fma(px, (qword)mat->row0, result);

    result = si_selb(result, ((qword)(vec_float4){0, 0, 0, 1}), ((qword)(vec_uint4){0, 0, 0, ~0}));

    return result;
}

We replicate point components, yielding three vectors, and then compute transformation result using si_fma (fused multiply-add; returns a * b + c) instruction. After that we combine it via selb to get 1.0f in the last component.

Note that in this case we are fortunate to have our matrix laid out as it is – another layout would force us to transpose it prior to further computations to make vectorization possible. In scalar case, the layout does not make any difference.

Finally, we'll have to compute dot product. As there is no dedicated dot product instruction, we'll have to emulate it, which is not pretty.

static inline float dot(qword lhs, qword rhs)
{
    qword mul = si_fm(lhs, rhs);

    // two pairs of sums
    qword mul_zwxy = si_rotqbyi(mul, 8);
    qword sum_2 = si_fa(mul, mul_zwxy);

    // single sum
    qword sum_2y = si_rotqbyi(sum_2, 4);
    qword sum_1 = si_fa(sum_2, sum_2y);

    // return result
    return si_to_float(sum_1);
}

First we get a component-wise multiplication result by using fm; then we'll have to compute horizontal sum. First we sum odd/even components together separately. For that, we rotate our register to the left by 8 bytes (si_rotqbyi) and add with the original. After that, we rotate the result left by 4 bytes (to get the second sum at the preferred slot) and add with the original.

For mul = (1, 2, 3, 4), we get the following values:
mul_zwxy = 3 4 1 2
sum_2 = 4 6 4 6
sum_2y = 6 4 6 4
sum_1 = 10 10 10 10

The result is converted to float via si_to_float cast intrinsic – it just tells the compiler to reinterpret result as if it was a float (actual scalar value is assumed to be in preferred slot), this usually does not generate any additional instructions.

Note that in case of SPU, there is only one register set – thus there is no penalty for such vector/scalar conversion. This code will not perform very well for other architectures – for example, on PowerPC converting vector to float in this way causes a LHS (Load Hit Store; it occurs when you read from the same address you just wrote into) because vector should be stored to stack to load vector element into float register; LHS causes a huge stall (40-50 cycles) and thus performance can be compromised here. For this reason, if your PPU/VMX math library has an optimized dot product function that returns float, don't use it in performance critical code – find another approach. It's interesting that if you think about it, you don't need dot products that much, as I'll show in the next issue.

Anyway, the current code runs at 820 cycles, which is 50% faster than scalar code. This equals to approximately 256 msec per million calls, the corresponding numbers for x86 being 136 msec for gcc and 74 msec for msvc8. Once x86 code is changed so that dot() function returns its result in a vector register, and resulting sign is then analyzed via _mm_movemask_ps instruction, timings change to 126/68, respectively. We've made some progress there, but our SPU implementation is still far from x86 in terms of speed though we're using the same techniques. I promise that the end result will be much more pleasing though :)

The current source can be grabbed here.

That's all for now – stay tuned for the next weekend's post! Read more!

Saturday, January 31, 2009

View frustum culling optimization - Introduction

Here I come again, back from almost a year long silence – and for some weird reason a visitor counter shows that people are still reading my blog! This was an eventful year for me – I worked on lots of things at work and on some at home, got 3 more shipped titles to put in my CV, started really programming on PS3 (including many RSX-related adventures, optimizations and, recently, SPU coding, which I happen to enjoy a lot), and, as some of you will probably guess from the code below, started using Vim. Some other (good) changes gave me more free time, so this post is a first one in a new one-post-in-a-week series (which will hopefully not be the, uh, last one also).

I have a small piece of code at work, which performs simple frustum culling for a given OBB. Initially it was written in an unoptimized (and cross-platform) way, later it was rewritten for Altivec with interesting optimizations, which yielded 3x performance boost, IIRC, and recently it was rewritten again for SPU. I considered this series of code transformation an interesting one, so I thought I'd expand it slightly (adding more intermediate stages), pretend it always was on SPU, and write several posts about it.

This post is a teaser, featuring testing methodology, source data, algorithm and initial (unoptimized) version of code, with unoptimized underlaying math library. Each code snippet will be short and self-contained, since the problem at hand is simple enough. Later posts in series will each feature some performance-related transformation (which can be obviously applied to lots of algorithms), with the last post giving more or less the current version of my code at work.

So, let's get started!

It starts simple - we have a handful of meshes, with each mesh having some kind of bounding volume and a local-to-world transformation. The task at hand is to determine, for a given frustum, whether the mesh is potentially visible (inside/intersecting with the frustum). The test has to be conservative – i.e. it can answer “visible” for meshes which are actually invisible – but it does not have to be exact. In fact, for many meshes the bounding volume itself is inexact, but additionally the algorithm can sacrifice some accuracy in order to be faster.

For our case, the bounding volume is AABB (axis-aligned bounding box) - “axis-aligned” here means that box axes are aligned to mesh local space axes, so in world space this is OBB. We're testing it against an arbitrary frustum – it can be a usual perspective/orthogonal one, or something more fancy (for example, our reflection/refraction rendering passes use perspective projection with oblique clipping, so near/far planes are not perpendicular to the viewing direction, and in fact they are not parallel at all!). Frustum is defined by a 4x4 matrix, though obviously we're free to convert it to any other representation we like.

There are two common approaches to testing if the box is inside frustum or not. First, equations of all 6 frustum planes are extracted. Then, for each plane it's determined if the box is completely outside (i.e. in the negative half-space, if planes' normals are pointing inside the frustum) or not; if the box is not completely outside for all planes, it's reported visible, otherwise it's reported invisible. This can be extended to differentiate “completely inside” and “partially inside” results, though we don't need it, since we're going to render both groups of meshes anyway.

Two approaches differ in the methodology of box-plane testing – one (bruteforce) is testing all 8 vertices of the box, another one is taking a single point (the p-vertex) and testing it, which is enough to give the correct answer if the correct p-vertex is chosen. The series will concentrate on the bruteforce approach, at least for now.

The naïve version of the algorithm first extracts the plane equations, using frustum combined view projection matrix (this is done once for each frustum, so it is not performance sensitive; as such, the code for this is omitted and frustum is assumed to have 6 computed plane equations initially). Then it applies the described bruteforce algorithm as follows:

bool is_visible(struct matrix43_t* transform, struct aabb_t* aabb, struct frustum_t* frustum)
{
    // get aabb points
    struct vector3_t points[] =
    {
        { aabb->min.x, aabb->min.y, aabb->min.z },
        { aabb->max.x, aabb->min.y, aabb->min.z },
        { aabb->max.x, aabb->max.y, aabb->min.z },
        { aabb->min.x, aabb->max.y, aabb->min.z },

        { aabb->min.x, aabb->min.y, aabb->max.z },
        { aabb->max.x, aabb->min.y, aabb->max.z },
        { aabb->max.x, aabb->max.y, aabb->max.z },
        { aabb->min.x, aabb->max.y, aabb->max.z }
    };

    // transform points to world space
    for (int i = 0; i < 8; ++i)
    {
        transform_point(points + i, transform);
    }

    // for each plane...
    for (int i = 0; i < 6; ++i)
    {
        bool inside = false;

        for (int j = 0; j < 8; ++j)
        {
            if (dot(points + j, frustum->planes + i) > 0)
            {
                inside = true;
                break;
            }
        }

        if (!inside)
        {
            return false;
        }
    }

    return true;
}

It uses five predefined data structures (vector3_t, matrix43_t, aabb_t, frustum_t and plane_t which is the type of frustum->planes array elements); those, and the (again, naïve) code of two used functions is available here (along with the rest of the code).

Note that matrix43_t is laid out so that three translation components are adjacent to each other in memory (I don't use the term “whatever major” here because it's very misleading); in our real code, rows actually consist of four components, with the fourth one being undefined for matrix43_t (of course, all operations should proceed as if the column was filled with 0 0 0 1). Similarly, vector3_t has four components, with the fourth one being undefined (this affects aabb_t). This is something that is assumed to stay this way forever, so all of our discussed code will somehow work around that when needed. From the next post and onwards, data layout of the sample code will be exactly the same as in our engine, I've omitted padding fields in today's version for simplicity.

The testing methodology is simple – I compile the code for SPU using a compiler from Sony's toolchain with -O3 level of optimization (the code is C99, by the way), and then run it via SPU simulator. This is an extremely useful tool that's provided by Sony for PS3 developers, which can run a SPU program and, for example, report run statistics – cycles elapsed, various stalls, etc. As far as I know, IBM has a public simulation suite with comparable capabilities, but since it requires Linux I never bothered to test it out. The number of cycles that the tool reports is then slightly reduce to account for some startup overhead (which is 34 cycles), so the number that I present here is the number of cycles for non-inlined call, with included function call overhead. For the reference, I'll include expected run time on a million OBBs (on one SPU, obviously), and corresponding run times of more or less the same code for PC (on my Core2 Duo 2.13 GHz), compiled with gcc 4.3.0 and MSVC 8.0 (SPU intrinsics will be replaced with SSE1 code). Those are only for reference, don't quote me on them :)

The cycle count for this naïve code is 1204, which (given a 3.2 GHz SPU) translates to 376 msec per million calls. The same code gives me roughly 84 msec when compiled with gcc (switches -O3 -msse) and 117 msec when compiled with cl (switches /O2 /fp:fast /arch:SSE). The test runs on the same data each time (taking care that processing is actually being performed), which excludes cache misses from the picture; the actual data is the same for SPU and PC tests and consists of a small box being completely inside the frustum – so that's a worst case for the outer loop (we have to test the box against all planes only to report that it's actually visible), although that's a best case for the inner loop (the first vertex tested for each plane is inside, so we can bail out early). I don't have any real-world average statistics on iteration counts of those loops; anyway, eventually we're going to eliminate some, and then all of those branches, so that the code will perform at constant speed.

That's all for now – stay tuned for the next weekend's post!
Read more!

Wednesday, March 12, 2008

COLLADA: quick update

Time's running fast. Two weeks has passed since my post about COLLADA, and I've found a killer bug in FCollada TBN generation code.

As 3dsmax native API does not provide support for returning TBN (I do not know about Maya, perhaps it does not too), Feeling Software implemented their own algorithm for TBN calculation, based on source found in Maya 7.0 documentation, "Appendix A: Tangent and binormal vectors". Of course, relying on NVMeshMender would be too easy.

And after three years of Feeling Software's Collada plugins, there is a bug in TBN generation code. You can read the full details here: http://sourceforge.net/forum/forum.php?thread_id=1966038&forum_id=460918 (the poster is me), but to keep it simple - returned tangent/binormal are opposite to the correct ones because of incorrect sign in equations (proof with asset files and comparison between Maya reference code and FCollada is also in the post). Well, perhaps it's just that I misunderstand something, but I definitely think it is a bug - there's just too many things to back it up.

And suddenly I can't post a bug report on Feeling Software forum, and through I get to know that Collada free support is discontinued. Given that other alternatives to DAE export from Max/Maya are just not worth the trouble, this means that suddenly COLLADA starts to feel much less attractive than before.

I'm even considering writing a small (geometry, node hierarchy, skin controller and sampled & baked animation - should not be that hard) plugin for 3dsmax/Maya...

P.S. Don't click the "Read more" link, it's here because Blogger maintainers can't implement the most useful feature about blogs - ability to hide parts of the message for brief version that's displayed on the main page. Read more!

Sunday, February 24, 2008

COLLADA: Best thing since sliced bread or not?

So, uh, the “post in a week” idea did not quite work. Sorry about that.

About half a year ago, our team at work that develops the engine decided to try and switch from the proprietary Maya export plugin (it exported geometry, animation and materials) to COLLADA. The old export pipeline was somewhat obscure, lacked some useful optimizations, and (what’s most important) lacked any convenient way to setup materials. That was not a problem for platforms with more or less fixed functionality, but with next-generation consoles (or should I say current-generation already?) it’s quite different.

So the switch has been made (it did not take half a year, it’s just that I’m writing about it only now), and I’m going to share some experience gained throughout the process.

What is COLLADA exactly? It’s an asset interchange format, based on XML (with complete XML Schema and specification), and a series of tools – exporters from popular DCC software (Maya, 3d Studio Max, XSI, Blender, etc.), viewers, libraries for loading/saving/manipulating COLLADA DOM tree.

This means several important things. First, it’s an asset interchange format, which means that it is not supposed to be used as a format for the retail assets. DCC saves COLLADA file, the custom tool loads it, reads useful information from it, applies optimizations (possibly platform-specific), and saves to some binary format. Second, you don’t have to write the export plugin for any DCC tool you use – in theory, all you do is write the said tool that converts .dae to your format and it magically works with all possible tools. Third, it’s slowly becoming something like an industry standard – every popular DCC has an export plugin, some well-known tools can read DAE files (i.e. FXComposer), it has support of well-known companies like Sony, and more and more engines are adopting it.

But that, of course, does not mean that it is a perfect solution.

So, what exactly are COLLADA advantages (why do you want to use it)?

You get a more or less DCC-independent pipeline. Even if your artists only ever use Maya, it does not mean that you’ll never need 3dsmax support (our engine is now being used by a company which only has 3dsmax-aware artists, so the task “support 3dsmax as geometry/animation export tool” has appeared – and it took a day or two).

It is an additional layer of abstraction between DCC and your builder. This means that tedious work with DCC APIs is now inside the exporter, which is (ideally) the code you shouldn’t even know about. As a result, export pipeline is much simpler.

There is a built-in custom material support (ColladaFX). Basically, it allows you to specify a material created from hardware shader (Cg/CgFX), and supplies the artist with convenient way of tweaking the shader parameters (with viewport preview as an added bonus).

DCC plugins usually support importing DAE files. Why is this important? In old pipeline, we had the proprietary plugin export the .sb file, then a platform-independent tool applied some optimizations (reducing scene graph, removing redundant stuff from scene, merging meshes, etc.), and then the platform-specific exporter read that file and converted it to platform-specific format (stripifying, cache optimization, vertex packing, etc.). Obviously, any kind of visual feedback is lost at the moment you export .sb from Maya, so a special viewing/introspection tool was developed. If you use COLLADA and manage to write some export tools such that they only modify .dae file, you can later import it in your DCC tool. If your pipeline is made of a series of such builders, and (!) you save result of each builder, debugging the pipeline becomes much easier.

Well, so I’ve told all the good things about COLLADA I know of. Unfortunately, there is a number of things that are not so good.

Scheme is complex and redundant in many ways. Writing a complete (able to parse any compliant COLLADA file) parser is hard, so either use an existing library (FCollada?) or parse only the subset of scheme your DCC tool exports. I prefer the latter approach, because it’s simpler for me and also is much faster in terms of performance.

DCC export is sometimes quite slow (in case of Maya, for example, the export usually takes two times as long as the code that parses the file, builds the platform-specific structures and saves them). So cache your .dae files (we’re using SCons as a build system, and a network cache, so it’s not as frustrating as it could be).

It is an additional layer of abstraction between DCC and your builder. This means that every time you encounter a bug, it could be either the export plugin or your builder (or, uh, a series of your builders). And if you use some DAE-parsing library, it could be that it is the source of problem. Fortunately, such cases are rare.

Export plugins sometimes are not very top-quality. For example, lack of pivot animation export in ColladaMaya, ColladaFX support, bugs, etc.

It is just an export plugin, do not expect any miracles. For example, if 3dsmax gives you TBN that does not make sense, COLLADA is not going to fix it.

ColladaFX is very bad from the usability standpoint.
- It’s hard for artists to create a new material and to correctly setup binding to geometry (for example, Maya TBN shader binding is not quite clear because of Cg).
- It’s much harder for them to use it in 3dsmax because of even less convenient interface and some problems with parameter binding – just ask your artist to setup an existing model with ColladaFX materials in 3dsmax and you’ll know why
- Perhaps it’s slightly better with CgFX, but since we don't use it, I can't say for sure.

ColladaFX implementation is quite bad.
- There are frequent crashes in 3dsmax (we fixed some of them and are considering submitting a patch).
- Cg materials did not even export correctly because of exporter bug in 3dsmax! We submitted a patch that should already be in trunk.
- ColladaFX materials export from Maya did not work with batch build.

So, generally, ColladaFX seems great on paper, but requires a lot of work, both in technical implementation and usability areas. We are considering rewriting the Maya interface part from scratch.

Fortunately, COLLADA exporter plugins we're using are open-source, so we debug them if they do not work, fix bugs (isn't it exciting?!) and add functionality as we feel appropriate (though of course this complicates the process of updating plugin versions).

Let’s summarize the above. If you do not have any established and well-working export pipeline and are not planning a custom DCC plugin for material setup or things like that – I’d definitely recommend COLLADA, because it’ll be easier than a custom plugin if you don’t have the relevant experience, and it will make it possible to support several DCC tools, which is a good thing. If you have a well established export pipeline that you’re happy about, there is obviously no need to use COLLADA. In other cases the answer is more complex. I myself am quite happy because of transition to COLLADA, because it made everything better, and the major disappointment of COLLADA was ColladaFX, which we did not have an equivalent for anyway (and export of default materials like Phong/Blinn/Lambert works just fine), but of course your mileage may vary.

If you are using COLLADA and have different experience about any of the enlisted areas, please write a comment! For example, do you use ColladaFX? Do you use FCollada and/or ColladaDOM and does it help you? Perhaps you use Feeling Software proprietary export plugins and have something good (or bad) to say about them?
Read more!

Sunday, October 28, 2007

My own lighting shader with blackjack and hookers

I am sorry for the delay in posts; I’ll try to keep it more regular, one post in a week or so.

So, yesterday we were discussing various lighting approaches at local IRC chat, and someone said that it was impossible to make a ps.2.0 single-pass lighting shader for 8 lights with diffuse, specular and attenuation, of course with normal map. Well, I told him he was actually wrong, and that I can prove it. Proving it turned out to be a very fun and challenging task. I am going to share the resulting shader and the lessons learned with you.

From the start I knew it was not going to be easy – ps.2.0 means that no arbitrary swizzles are available (which can in some cases restrict possible optimizations), and there is a limit of 64 arithmetic instructions. Add to that the fact that there is no instruction pairing as in ps.1.x (which is rather strange, as all hardware I know of is capable of pairing vector and scalar instructions).

I decided to compute lighting in world space, as I thought that passing all lighting data from VS to PS is going to consume too much interpolators. So I passed world-space view vector, world-space position and tangent to world space conversion matrix from VS to PS, and had PS transform normal to world space and compute everything else. This proved to be insufficient to reach target result, but I am going to show you the shader anyway, as it has some interesting ideas, some of which are still in the final shader.

Here is the shader: link. Ignore stuff like bindings.h, TRILINEAR_SAMPLER, etc. – these are to allow FX Composer bindings of parameters.

Interesting things to note:

For world-space lighting, there is no need to do normal expansion (normal * 2 – 1), you can encode it into tangent space -> world space conversion matrix. Saves 1 instruction.

Vectorize like crazy. You can do a lot of stuff cheaper if you vectorize your calculations. For example, here we have 8 lights, that’s 2 groups, 4 lights in each one. You can save instructions on computing attenuation for 4 lights at once (you get 1 mad instruction for 4 lights, instead of 1 mad instruction per light), you can save a lot of instructions for specular computations (read below), etc.

If you want to add a lot of scalar values, dot is your friend. dot(value, 1) will add all components of value.

Be smart about replacing equations with equivalent ones. For example, naïve Phong specular equation is dot(R, V), where R is reflected light vector. In this case, you have to do reflect() for each light. Reflect is not quite cheap, and, as we have 8 lights and only 64 instructions, every instruction that’s executed per light is very expensive. But dot(reflect(L, N), V) is equal to dot(reflect(V, N), L) (V is a view vector), which requires only a single reflect().

Specular is expensive, because pow() call is expensive. Why? Because, sadly, pow() can’t be vectorized. pow() is interpreted as three instructions, log, mul, exp (pow(a, b) is equal to exp(log(a) * b)), and neither log, nor exp have vectorized form. The result is that you’ll waste at least two instructions per light only to compute specular power.

There are several solutions. The first one is to fix specular power to some convenient value. For example, pow(a, 16) can be implemented as 4 muls (and in fact HLSL compiler does it automatically), and muls can be vectorized – so you can compute specular power for all 4 lights for 4 instructions, which is much better.

However, there is a better solution, which is described here: A Non-Integer Power Function on the Pixel Shader. Basically, you can approximate the function pow(x, N) with pow(max(A*x + B, 0), M). A and B can be tuned such that the maximum error is low enough for the approximation to be useable in practice, and M can be made very small, for example I use M=2, and there are no artifacts for N=18 (the results ARE different, and it can be seen by switching between formulas in realtime, but the difference is small and no one will be able to tell that you use an approximation). Alternatively, you can have your artists tune A and B directly.

The net effect is that instead of several instructions per light, we compute specular power in 2 instructions – mad_sat to compute A*x+B, and mul to compute the result (remember, M=2).

Okay, we have a bunch of cool optimizations here. Are we done? Sadly, no, we are not. The presented shader compiles into 75 instructions. The problem is that per-light cost is still too high. We have to compute a unnormalized light vector (1 sub), compute its squared length (1 dp3), normalize it (1 rsq + 1 mul), compute dot(N, L) for diffuse lighting (1 dp3), compute dot(R, L) for specular lighting (1 dp3), and combine diffuse lighting with specified colors (1 mad). This is 7*8=56 instructions, which leaves 8 instructions, and we need much more. What can we do?

Well, we can simplify our calculations and replace light colors with light intensities. This will strip 8 instructions, but add 2 instructions for multiplying NdotL vector by intensity, and 1 instruction for summing all diffuse components together (dp3), which is still not enough – we need to reserve instructions for other things (for example, normal transformation takes 3 instructions, specular computation is 4 instructions for 8 lights, attenuation is 4 instructions for 8 lights, and there are still several other things left).

Yesterday, I gave up and went to sleep. But today, after some thinking, I felt like a barrier in my head collapsed – I knew why it did not work out, and I knew the better solution.

See, it is true that we have a lot of overhead per light source. The problem is not in the fact that we need to do a lot of calculations – the problem is in that we are using computation power inefficiently.

Let’s look at the instruction counts I presented earlier. Yes, it’s true that we do 1 sub per light – but sub is capable of processing 4-float vectors, and we are using it to subtract 3-component vectors, so we lose some power here. The same stays true for everything else – all dot products and mul for example.

And the barrier that collapsed in my head had an inscription “You have the mighty dot product, use it”. As it turned out, there is not a lot of sense in treating GPU assembly differently from for example SSE instructions. We do not have dot product in SSE. Trying to compute 4 dot products at once in a straightforward way in SSE is subject to miserable failure – most likely FPU will be faster. But if you change the way your data is organized, and instead of laying it in AoS order:

v1.x v1.y v1.z (unused)
v2.x v2.y v2.z (unused)
v3.x v3.y v3.z (unused)
v4.x v4.y v4.z (unused)

lay it out in SoA order:

v1.x v2.x v3.x v4.x
v1.y v2.y v3.y v4.y
v1.z v2.z v3.z v4.z

And lo and behold, a lot of slow dot product instructions are now just 3 muls and 2 adds – simple and fast.

This was a triumph. I’m making a note here: HUGE SUCCESS.

I decided to try the same thing with my shader code. It solved all problems like magic. The resulting code (link) while doing exactly the same calculations, now compiles in 54 instructions.

The reason is simple, of course. For example, where in the previous shader we computed squared lengths in 1 instruction per light, here we do it for 3 instructions for 4 lights, effectively using 25% less ALU. The new layout also made it possible to pass lights via interpolators (light data fits into 6 interpolators), which allowed to remove 1 sub instruction per light, and also 3 instructions for transforming normal into tangent space (at the expense of adding 1 expand instruction, of course).

Apart from the SoA data layout, which effectively is the reason why the new shader is so much smaller, there is only one trick - instead of normalizing each vector, we correct dot product results. This saves a couple of instructions for the entire shader.

The old shader did not fit into the instruction limit, the new one does, and it has 10 spare instructions. There is a bunch of things you could do with it. For example, you can implement parallax mapping – 10 instructions should be enough for several parallax steps. Note that one interpolator can be freed (view vector can be stored in COLOR interpolator at the expense of 1 additional instruction (expand from [0,1] to [-1,1])), so you can implement shadow mapping (for example, make first light source a directional one – this is straightforward, you just have to modify vertex shader to supply correct tangent-space direction, and place 0 in light_radius_inv to disable attenuation – and add shadows for it).

There is also space for some small tweaks – i.e. disable specular for triangles with dot(N, L) < 0, make wrap-around lighting, add ambient lighting, have colored specular (use a per-light color instead of the global one), add specular attenuation, etc.

Note, that a couple of instructions can still be saved if you do not want light colors, only intensities (read above).

So, this was a pleasant experience, and I am glad that I decided to write this shader. Sadly, all articles I’ve read about shader optimizations (save for one, “Bump My Shiny Metal” by Andrew Aksyonoff in ShaderX 4) are about common and trivial stuff – vectorize your computations, inspect compiler output... I hope that this post was more interesting. Read more!

Saturday, October 6, 2007

Render state rant

Today we’ll talk about render state management in D3D10 a little.

While designing D3D10, a number of decisions were made to improve runtime efficiency (to reduce batch cost, basically). It’s no secret that D3D9 runtime is not exactly lean & mean – there is a lot of magic going on behind the scenes, a lot of validation, caching, patching…

For example, D3D9 has the ability to set any render or sampler state at any time. However, hardware does not really work that way. The states are separated into groups, and you can only set the whole group at a time (of course, the exact structure of groups is hardware dependent). What this means for runtime/driver is that SetRenderState/SetSamplerState often do not actually set anything. Instead, they modify a so called shadow state – they update the changed states in a local shadow copy, and mark the respective groups as dirty. Then, when you call DrawIndexedPrimitive, the changed state groups are being set, and the dirty flags – cleared.

Also there could be some cross-state checking going on, I don’t know for sure.

So, D3D10 designers decided to replace hundreds of states by 4 simple state objects. Behold, ID3D10BlendState, ID3D10DepthStencilState, ID3D10RasterizerState and ID3D10SamplerState. These are immutable driver state objects – you create them once, you can’t modify them, and driver knows about them, which means it can do smart things (for example, store a chunk of push buffer that sets the respective state inside each state object) and thus optimize state setting. Also all possible validation is made at creation time only.

Sounds cool, right? Yeah, it did first time I’ve read about state objects. Except…

Problem #1. State objects are relatively expensive to construct. Which means 10-100 thousand CPU cycles on my Core 2 Duo. You know, given that we have a user mode driver, given that the smartest thing I can think of here is to 1. hash the state to check if the same state has already been created (it is actually done, you can check it by creating the same state object several times), 2. If the hash lookup failed, validate the state (perform some sanity checks), 3. construct a chunk with push buffer commands, 4. store it in allocated memory. And that should be performed by user mode code. Something is horribly wrong here, I swear.

Note, that even creating the already existing state object (hash state, hash lookup, compare actual state values, remember?) takes 10 thousand cycles. The caching should be performed in D3D10 runtime (the pointer returned is exactly the same - i.e. for input layouts, the caching is performed by the driver, as the pointer to the InputLayout is different, but the underlying driver object is the same; for state objects, pointers to runtime objects are equal), so this means that computing a hash of a 40-byte description object, doing a table lookup, and then comparing two 40-byte objects to test for equality in case of hash collision takes 10 thousand cycles.

Problem #2. You can’t create more than 4096 state objects. Which means that you can’t just forget about it and create a state for each object, even for static ones. Well, you can try, but one day it’ll fail.

Problem #3. The separation of states into groups is outrageous. I did not tell one small thing – not all states are actually immutable. There are two things you can change without constructing a new state object (they act as parameters for state object setting functions). Those are… stencil reference value and blend factor. Everything else is immutable, for example, depth bias and slope scaled depth bias.

How many of you have used blend factor with pixel shader (let’s say ps.2.0)-capable hardware?

How many of you have used a constantly changing stencil reference value?

I’d like to do things like progressively loading textures. What do I mean? Let’s load our texture from smaller mip levels to larger ones. Let’s interpolate MinLOD parameter from N to N-1 once N-1-th mip level is completely loaded – let’s do it over some fixed time. This way there will be no mip level popping, but instead we’ll see gradually improving quality as new levels are loaded – trilinear filtering will do proper interpolation for us. That’s easy, right? No. Not in D3D10.

Yes, I know I could cache render states inside objects, and perform some lifetime management (LRU?..). Though this won’t help in case of constantly changing parameters.

Yes, I know I could separate render states into groups how I like it best, have a 4096-entry hash table, and do lookups in it. And this is actually what I am doing now.

But it does not make me happy.
Read more!

What your mother never told you about graphics development

Sunday, February 15, 2009

View frustum culling optimization – Structures and arrays

Sunday, February 8, 2009

View frustum culling optimization – Vectorize me

Saturday, January 31, 2009

View frustum culling optimization - Introduction

Wednesday, March 12, 2008

COLLADA: quick update

Sunday, February 24, 2008

COLLADA: Best thing since sliced bread or not?

Sunday, October 28, 2007

My own lighting shader with blackjack and hookers

Saturday, October 6, 2007

Render state rant

About Me

Blog Archive

Visitors

Sunday, February 15, 2009

Sunday, February 8, 2009

Saturday, January 31, 2009

Wednesday, March 12, 2008

Sunday, February 24, 2008

Sunday, October 28, 2007

Saturday, October 6, 2007

About Me

Subscribe To

Blog Archive

Visitors