What your mother never told you about graphics development

Sunday, March 8, 2009

Fighting against CRT heap and winning

Memory management is one of (many) cornerstones of tech quality for console games. Proper memory management can decrease amount of bugs, increase product quality (for example, by eliminating desperate pre-release asset shrinking) and generally make life way easier – long term, that is. Improper memory management can wreak havoc. For example, any middleware without means to control/override memory management is, well, often not an option; any subsystem that uncontrollably allocates memory can and will lead to problems and thus needs redesigning/reimplementing. While you can tolerate more reckless memory handling on PC, it often results in negative user experience as well.

In my opinion, there are two steps to proper memory management. First one is global and affects all code – it's memory separation and budgeting. Every subsystem has to live in its own memory area of fixed size (of course, size can be fixed for the whole game or vary per level, this is not essential). This has several benefits:

memory fragmentation is now local – subsystems don't fragment each other's storage, thus fragmentation problems happen less frequently and can be reproduced and fixed faster

fixed sizes mean explicit budgets – because of them out of memory problems are again local and easily tracked to their source. For example, there is no more “game does not fit in video memory, let's resize some textures” - instead, you know that i.e. level textures fit in their budget perfectly, but the UI artists added several screen-size backgrounds, overflowing UI texture budget

because each subsystem lives in its own area, we have detailed memory statistics for no additional work, which again is a good thing for obvious reasons

if memory areas have fixed sizes, they either have fixed addresses or it's easy to trace address range for each of them – this helps somewhat in debugging complex bugs

Second one is local to each subsystem – once you know that your data lives in a fixed area, you have to come up with a way to lay your data in this area. The exact decisions are specific to the nature of data and are up to the programmer; this is out of this post's scope.

Memory is divided into regions, each region is attributed to a single subsystem/usage type – if we accept this, it becomes apparent that any unattributed allocations (i.e. any allocations into global heap) are there either because nobody knows where they should belong or because the person who coded those does not want to think about memory – which is even worse (strict separation and budgeting makes things more complicated in short term by forcing people to think about memory usage – but that's a good thing!). Because of this global heap contains junk by definition and thus should ideally be eliminated altogether, or if this is not possible for some reason, should be of limited and rather small size.

Now that we know the goal, it's necessary to implement it – i.e. we want to have a way to replace allocations in global heap with either fatal errors or allocations in our own small memory area. On different platforms there are different ways to do it – for example, on PS3 there is a documented (and easy) way to override CRT memory management functions (malloc/free/etc.); on other platforms with GNU-based toolchain there is often a --wrap linker switch – however, on some platforms, like Windows (assuming MSVC), there does not seem to be a clean way to do it. In fact, it seems that the only known solution is to modify the CRT code. I work with statically linked CRT, so this would mean less distribution problems, but more development ones – I'd have to either replace prebuilt CRT libraries (which is out of the question because it makes working with other projects impossible) or ignore them and link my own, which is better – but still, the process required building my own (hacked) version of CRT. I did not like the approach, so I came up with my own.

First, some disclaimers. This code is tested for statically linked Win32 CRT only – it requires some modifications to work on Win64 or with dynamically linked CRT – I might do the Win64 part some day, but not DLL CRT. Also I'm not too clear on EULA issues; because of this, I'll post my entire code except for one function that's essentially ripped from CRT and fixed so that it compiles – read further for more details. Finally, there may be some unresolved issues with CRT functions I don't currently use (though I think my solution covers most of them) – basically, this is a demonstration of approach with proof-of-concept code, and if you decide to use it you're expected to fix the problems if they arise :)

Our first priority is to replace CRT allocation functions without modifying libraries. There are basically two ways to do something like it – link time and run time. Link time approach involves telling the linker somehow that instead of existing functions it should use the ones supplied by us. Unfortunately, there does not seem to be a way to do this except /FORCE:MULTIPLE, which results in annoying linker warnings and disables incremental linking. Run time way involves patching code after the executable is started – hooking libraries like Detours do it, but we don't need such a heavyweight solution here. In fact, all that's needed is a simple function:

static inline void patch_with_jump(void* dest, void* address)
{
    // get offset for relative jmp
    unsigned int offset = (unsigned int)((char*)address - (char*)dest - 5);
    
    // unprotect memory
    unsigned long old_protect;
    VirtualProtect(dest, 5, PAGE_READWRITE, &old_protect);
    
    // write jmp
    *(unsigned char*)dest = 0xe9;
    *(unsigned int*)((char*)dest + 1) = offset;
    
    // protect memory
    VirtualProtect(dest, 5, old_protect, &old_protect);
}

This function replaces first 5 bytes of code contained in dest with jump to address (the jump is a relative one so we need to compute relative offset; also, the code area is read-only by default, so we unprotect it for the duration of patching). The primitive for stubbing CRT functions is in place – now we need to figure out where to invoke it. At first I thought that a static initializer (specially tagged so that it's guaranteed to execute before other initializers) would be sufficient, but after looking inside CRT source it became apparent that heap is initialized and (which is more critical) used before static initialization. Thus I had to define my own entry point:

    int entrypoint()
    {
        int mainCRTStartup();
        
        patch_memory_management_functions();
        
        return mainCRTStartup();
    }

Now to patch the functions. We're interested in heap initialization, heap termination and various (de)allocation utilities. There is _heap_init, _heap_term and lots of variants of malloc/free and friends – they are all listed in source code. Note that I stubbed all _aligned_* functions with BREAK() (__asm int 3), because neither CRT code nor my code uses them – of course, you can stub them if you need.

There are several highlights here. First one I stumbled upon is that _heap_term is not getting called! At least not in static CRT. After some CRT source digging I decided to patch __crtCorExitProcess – it's useful only for managed C++, and it's the last thing that gets called before ExitProcess. The second one is in function _recalloc, that's specific to the allocator you're using to replace the default one. The purpose of _recalloc is to reallocate the memory as realloc does, but cleaning any additional memory – so if you do malloc(3) and then _recalloc(4), ((char*)ptr)[3] is guaranteed to be 0. My allocator aligns everything to 4 bytes and has a minimal allocation size limit; the original size that was passed to allocation function is not stored anywhere. It's easy to fix it for CRT because _recalloc is used in CRT only for blocks allocated with calloc, and I hope _recalloc is not used anywhere else. By the way, there is a bug in CRT related to _recalloc – malloc(0) with subsequent _recalloc(1) does not clear first byte (because for malloc(0) block with size 1 is created); moreover, more bugs of such nature are theoretically possible on Win64. Personally I find calloc weird and _recalloc disgusting; luckily it's Windows-only.

Ok, now we're done – are we? Well, everything went well until I turned leak detection on. It turns out that there are lots of allocations left unfreed by CRT – amazingly, there is a __freeCrtMemory function that frees some of those, but it's compiled in only in _DEBUG, and it's called only if CRT debugging facilities are configured to dump memory leaks on exit. Because of this I needed to copy the code, modify it slightly so that it compiles and invoke the function before heap termination. However, this function does not free everything – there were some more allocations left, that I needed to handle myself. You can see the code in cleanup_crt_leaks(). After cleaning up leaks printf(), which was used to output leaks to console, became unusable (oh, horror!), so I came up with the following function:

void debug_printf(const char* format, ...)
{
    char buf[4096];
    
    va_list arglist;
    va_start(arglist, format);
    wvsprintfA(buf, format, arglist);
    va_end(arglist);
    
    // console output
    HANDLE handle = GetStdHandle(STD_OUTPUT_HANDLE);
    WriteFile(handle, buf, (unsigned long)strlen(buf), NULL, NULL);

    // debug output
    OutputDebugStringA(buf);
}

Finally, the last problem is that some CRT code checks global variable _crtheap prior to allocation, so we have to initialize it to something (that affects fopen() and other functions that use dynamically created critical sections).

Well, now it works and I'm quite happy with the results. Of course it's slightly hackish, but CRT code is such a mess that it blends in nicely. The more or less complete source code is here. Note that if you're using C++ new/delete and you have not overridden them globally for some reason, you might want to patch _nh_malloc/_heap_alloc with malloc_stub as well. Read more!

Sunday, March 1, 2009

View frustum culling optimization – Never let me branch

In previous iteration we converted the code to SoA instead of AoS, which enabled us to transform OBB points to world space relatively painlessly, and eliminated ugly and slow dot product, thus making the code faster. Still, the code is slow. Why?

Well, as it appears, the problem is branching.

I wanted to write a long post about branches and why they are often a bad idea for PPU/SPU, but it turns out that Mike Acton beat me to it – be sure to read his articles for detailed explanation: part 1 part 2 part 3 - so I'll make it short. For our case, there are two problems with branching:

First, code performance depends on input data. Visible boxes are worst case (this is the one the cycle count is for); invisible boxes are faster, with the fastest case (where the box is behind the first plane) taking 128 cycles. Because of this, it's hard to estimate the run time of culling, given the number of objects – upper bound is three times bigger than lower bound.

Second, branches divide the code in blocks, and compiler has problems performing optimizations between blocks. We have a constant-length loop, inside we compute 8 dot products for a single plane, then check if all of them are negative, and in this case we early-out. Note that there are a lot of dependencies in computation of dot products – si_fmas in dot4 depend on the result of previous si_fmas, si_fcgt depends on the result of dot4, etc. Here is an example of disassembly for performing a single dot4 operation, assuming that we already have SPLAT(v, i) in registers:

fma res, v2, z, v3
fma res, v1, y, res
fma res, v0, x, res

Pretty reasonable? Well, not exactly. While we have 3 instructions, each one depends on the result of the previous one, so we can use our result in 18 cycles instead of 3 (fma latency is 6 cycles). If we need to compute 6 dot4, and we have some sort of branching after each one, like we had in the code for previous attempt, we'll pay the cost of 18 cycles for each iteration (of course, there'll also be some cost associated with comparison and branching). On the other hand, if we computed all 6 dot4 without any branches, the code could've looked like:

fma res[0], v2[0], z[0], v3[0]
fma res[1], v2[1], z[1], v3[1]
...
fma res[5], v2[5], z[5], v3[5]

fma res[0], v1[0], y[0], res[0]
...
fma res[5], v1[5], y[5], res[5]

fma res[0], v0[0], x[0], res[0]
...
fma res[5], v0[5], x[5], res[5]

This code has 18 instructions, and all results are computed in 24 cycles – but we're computing 6 dot4 instead of 1! Also 24 cycles is the latency for res[5] – we can start working on res[0] immediately after last fma gets issued.

The problem is not only related to instruction latency (in our case, register dependencies), but also to pipeline stalls – SPU has two pipelines (even and odd), and can issue one instruction per pipeline per cycle for, uhm, perfect code – each type of instruction can be issued only on one of the pipes, for example arithmetic instructions belong to even pipe, load/store/shuffle instructions belong to odd one. Because of this shuffles can be free if they dual-issue with arithmetics and do not cause subsequent dependency stalls.

Compiler tries to rearrange instructions in order to minimize all stalls – register dependencies, pipeline stalls and some other types – but it is often not allowed to do it between branches. Because of this it's best to eliminate all branches – compiler will be left with a single block of instructions and will be able to do a pretty good job hiding latencies/dual-issuing instructions. This is often critical – for example, our current version wastes almost half of cycles while waiting for results because of register dependency.

Of course, eliminating branches is often a tradeoff – sometimes it makes worst-case run faster, but best-case now runs slower, as we observed last time with x86 code. The decision depends on your goals and on frequency of various cases – remember that branchless code will give you a guaranteed (and usually acceptable) lower bound on performance.

So, in order to eliminate branches, we'll restructure our code a bit – instead of checking for each plane if all points are outside, we'll check if any point is inside, i.e. if the box is not outside of the plane:

static inline qword is_not_outside(qword plane, const qword* points_ws_0, const qword* points_ws_1)
{
    qword dp0 = dot4(plane, points_ws_0[0], points_ws_0[1], points_ws_0[2]);
    qword dp1 = dot4(plane, points_ws_1[0], points_ws_1[1], points_ws_1[2]);

    qword dp0pos = si_fcgt(dp0, (qword)(0));
    qword dp1pos = si_fcgt(dp1, (qword)(0));

    return si_orx(si_or(dp0pos, dp1pos));
}

si_orx is a horizontal or (or across) instruction, which ors 4 32-bit components of source register together and returns the result in preferred slot, filling the rest of vector with zeroes. Thus is_not_outside will return 0xffffffff in preferred slot if box is not outside of plane, and 0 if it's outside.

Now all we have to do is to call this function for all planes, and combine the results – we can do it with si_and, since the box is not outside of the frustum only if it's not outside of all planes; if any is_not_outside call returns 0, we have to return 0.

    // for each plane...
    qword nout0 = is_not_outside((qword)frustum->planes[0], points_ws_0, points_ws_1);
    qword nout1 = is_not_outside((qword)frustum->planes[1], points_ws_0, points_ws_1);
    qword nout2 = is_not_outside((qword)frustum->planes[2], points_ws_0, points_ws_1);
    qword nout3 = is_not_outside((qword)frustum->planes[3], points_ws_0, points_ws_1);
    qword nout4 = is_not_outside((qword)frustum->planes[4], points_ws_0, points_ws_1);
    qword nout5 = is_not_outside((qword)frustum->planes[5], points_ws_0, points_ws_1);

    // merge "not outside" flags
    qword nout01 = si_and(nout0, nout1); 
    qword nout012 = si_and(nout01, nout2); 

    qword nout34 = si_and(nout3, nout4); 
    qword nout345 = si_and(nout34, nout5); 

    qword nout = si_and(nout012, nout345);

    return si_to_uint(nout);

I changed return type for is_visible to unsigned int, with 0 meaning false and 0xffffffff meaning true; this won't change client code, but slightly improves performance.

Now when we compute everything in a single block, compiler schedules instructions in a way that we waste close to zero cycles because of latency. The new branchless version runs at 119 cycles, which is more than 3 times faster than the previous version, and 10 times faster than initial scalar version. This results in 37 msec for million calls, which is almost 2 times faster than fastest result on x86 (finally!). Moreover, this is slightly faster than the best case of previous version – so there is no tradeoff here, new version is always faster than old one. Note that eliminating branches is not worth it for x86 code (i.e. it does not make worst case faster, which is expected, if you remember that we had to do 2 checks per plane in order to make SoA approach faster than AoS).

The current source can be grabbed here.

That's all for now – stay tuned for the next weekend's post! I plan to post something not VFC-related the next week, then another VFC post the week after that. If you're starting to hate frustums, SPU, me and my blog - sorry about that, but we'll be done with VFC some day, I swear! :) Read more!

Sunday, February 15, 2009

View frustum culling optimization – Structures and arrays

Last week I've tried my best at optimizing the underlying functions without touching the essence of algorithm (if there was a function initially that filled a 8-vector array with AABB points, optimizations from previous post could be done in math library). It seems the strategy has to be changed.

There are several reasons why the code is still slow. One is branching. We'll cover that in the next issue though. Another one has already been discussed on this blog related to shaders – we have 4-way SIMD instructions, but we are not using them properly. For example, our point transformation function wastes 1 scalar operation per each si_fma, and requires additional .w component fixup after that. Our dot product function is simply horrible. Once again we're going to switch layout for intermediate data from Array of Structures to Structure of Arrays.

We have 8 AABB points, so we'll need 6 vectors – 2 vectors per each component. Do we need all 6? Nah. Since it's an AABB, we can organize stuff so that we need only 4 like this:

x X x X
y y Y Y
z z z z
Z Z Z Z

Note that vectors for x and y components are shared between two 4-point groups. Of course this sharing will go away after we transform our points to world space – but that makes it easier to generate SoA points from min/max vectors.

How do we generate them? Well, we already know the solution for Z – there is some magical si_shufb instruction, that worked for us before. It's time to know what exactly it does, as it can be used to generate x/y vectors too.

What si_shufb(a, b, c) does is it takes a/b registers, and permutes their contents using c as pattern, yielding a new value. Permutation is done at a byte level - each byte of c corresponds to a resulting byte, which is computed one of the following ways:

1.0x0v corresponds to a byte of left operand with index v
2.0x1v corresponds to a byte of right operand with index v
3.0x80 corresponds to a constant 0x00
4.0xC0 corresponds to a constant 0xFF
5.0xE0 corresponds to a constant 0x80
6.other values result in one of the above, the exact treatment is out of the scope

This is a superset of Altivec vec_perm instruction, and can be used to do very powerful things, as we'll realize soon enough. For example, you can implement usual GPU-style swizzling like so:

src_zxxx = si_shufb(src, src, ((qword)(vec_uint4){0x08090a0b, 0x00010203, 0x00010203, 0x00010203}));

First four bytes of my pattern correspond to bytes 8-11 of left argument, all other four-byte groups correspond to bytes 0-3 of left argument. This is equal to applying .zxxx swizzle. As you can probably see, the code can get very obscure if you use shuffles a lot, so I've made some helper macros:

// shuffle helpers
#define L0 0x00010203
#define L1 0x04050607
#define L2 0x08090a0b
#define L3 0x0c0d0e0f

#define R0 0x10111213
#define R1 0x14151617
#define R2 0x18191a1b
#define R3 0x1c1d1e1f

#define SHUFFLE(l, r, x, y, z, w) si_shufb(l, r, ((qword)(vec_uint4){x, y, z, w}))

// splat helper
#define SPLAT(v, idx) si_shufb(v, v, (qword)(vec_uint4)(L ## idx))

SHUFFLE is for general shuffling, SPLAT is for component replication (.yyyy-like swizzles). Note that in previous post SPLAT was used in transform_point to generate .xxxx, .yyyy and .zzzz swizzles from AABB point.

Let's generate AABB points then.

    // get aabb points (SoA)
    qword minmax_x = SHUFFLE(min, max, L0, R0, L0, R0); // x X x X
    qword minmax_y = SHUFFLE(min, max, L1, L1, R1, R1); // y y Y Y
    qword minmax_z_0 = SPLAT(min, 2); // z z z z
    qword minmax_z_1 = SPLAT(max, 2); // Z Z Z Z

That was easy. Now if we want first 4 points, we use minmax_x, minmax_y, minmax_z_0; for the second group, we use minmax_x, minmax_y, minmax_z_1.

Now, we have 2 groups of 4 points in each, SoA style – we have to transform them to world space. It's actually quite easy – remember the first scalar version? If you've glanced at the code, you've seen a macro for computing single resulting component:

#define COMP(c) p->c = op.x * mat->row0.c + op.y * mat->row1.c + op.z * mat->row2.c + mat->row3.c

As it turns out, this can be converted to SoA style multiplication almost literally – you just need to think of op.x, op.y, op.z as of vectors with 4 values of some component; mat->rowi.c has to be splatted over all components. The resulting function becomes:

static inline void transform_points_4(qword* dest, qword x, qword y, qword z, const struct matrix43_t* mat)
{
#define COMP(c) \
    qword res_ ## c = SPLAT((qword)mat->row3, c); \
    res_ ## c = si_fma(z, SPLAT((qword)mat->row2, c), res_ ## c); \
    res_ ## c = si_fma(y, SPLAT((qword)mat->row1, c), res_ ## c); \
    res_ ## c = si_fma(x, SPLAT((qword)mat->row0, c), res_ ## c); \
    dest[c] = res_ ## c;

    COMP(0);
    COMP(1);
    COMP(2);
    
#undef COMP
}

Note that it's not really that much different from the scalar version, only now it transforms 4 points in 9 si_fma and 12 si_shufb instructions. We're going to transform 2 groups of points, so we'll need 18 si_fma instructions, si_shufb can be shared – luckily, the compiler does it for us so we just need to call transform_points_4 twice:

    // transform points to world space
    qword points_ws_0[3];
    qword points_ws_1[3];

    transform_points_4(points_ws_0, minmax_x, minmax_y, minmax_z_0, transform);
    transform_points_4(points_ws_1, minmax_x, minmax_y, minmax_z_1, transform);

Previous vectorized version required 24 si_fma and 24 si_shufb, plus 8 correcting si_selb (to be fair, it could be actually optimized to require 6 si_shufb + 8 si_selb, but it's still not a win over SoA). Note that 18 si_fma + 12 si_shufb does not mean 30 cycles. SPUs are capable of dual-issuing some instructions – there are two groups of instructions, one group runs at even pipeline, another one – at odd. si_fma and si_shufb run on different pipelines, so the net throughput will be closer to 18 cycles (slightly larger than that if si_shufb latency can't be hidden).

Now all that's left is to calculate dot products with a plane. Of course we'll have to calculate them 4 at a time. But wait – in our case execution of inner loop terminated after the first iteration. So previously we were doing only one (albeit ugly) dot product, and now we're doing 4, or even 8! Isn't that a little bit excessive? Well, that's not – but we'll save a more detailed explanation for the later post, for now let the results speak for themselves.

In order to calculate 4 dot products, we'll make a helper function:

static inline qword dot4(qword v, qword x, qword y, qword z)
{
    qword result = SPLAT(v, 3);

    result = si_fma(SPLAT(v, 2), z, result);
    result = si_fma(SPLAT(v, 1), y, result);
    result = si_fma(SPLAT(v, 0), x, result);

    return result;
}

And call it twice. Again, we'll be doing four splats twice, but compiler is smart enough to eliminate this. After that we'll have to compare all 8 dot products with zero, and return false if all of those are negative.

    // for each plane...
    for (int i = 0; i < 6; ++i)
    {
        qword plane = (qword)frustum->planes[i];

        // calculate 8 dot products
        qword dp0 = dot4(plane, points_ws_0[0], points_ws_0[1], points_ws_0[2]);
        qword dp1 = dot4(plane, points_ws_1[0], points_ws_1[1], points_ws_1[2]);

        // get signs
        qword dp0neg = si_fcgt((qword)(0), dp0);
        qword dp1neg = si_fcgt((qword)(0), dp1);

        if (si_to_uint(si_gb(si_and(dp0neg, dp1neg))) == 15)
        {
            return false;
        }
    }

si_fcgt is just a floating-point greater comparison; I'm abusing the fact that 0.0f is represented as a vector with all bytes equal to zero here. si_fcgt operates like SSE comparisons and returns 0xffffffff for elements where the comparison result is true, and 0 for others. After that I and the results together, and then use si_gb instruction to gather bits of results. si_gb takes least significant bit from each element and inserts it into corresponding bit of the result; we get a 4-bit value in preferred slot, everything else is zeroed out. If it's equal to 15, then si_and returned a mask where all elements are 0xffffffff, which means that all dot products are less than zero, so the box is outside.

Note that si_gb is like _mm_movemask_ps, only it takes least significant bits instead of most significant – in case of SSE, we don't need to do comparisons. We can avoid comparisons here by anding dot products directly, and then moving the sign bit to least significant bit (it can be done by rotating each element 1 bit to the left, that's achieved by si_roti(v, 1)), but this is slightly slower, so we won't do it.

Now, the results. The code runs at 376 cycles, which is more than 2 times faster than the previous version, and almost 4 times faster than the original. This speedup is partially because we're doing things more efficiently, partially because we got rid of branches; we'll discuss this the next week. A million calls takes 117 msec, which is still worse than x86 results – but it's not the end of the story. Astonishingly, applying exactly the same optimizations to SSE code results in 81 msec for gcc (which is 30% faster than naively vectorized version), and in 104 msec for msvc8 (which is 40% slower!).

The fastest version is still produced by msvc8 from previous version. This should not be very surprising, as we changed inner loop from performing one dot-product to performing 8 at once, so that shows. We can optimize it in this case by adding early out – after we compute first 4 dot products, we'll check if all of them are positive; if some of them are not, we can safely skip additional 4 dot products and continue to the next iteration. It results in 87 ms for msvc8 and 65 ms for gcc, with gcc-compiled SoA finally being faster than all previous approaches. Of course, this is a worst case for SoA – in case inner loops actually did not terminate after first iteration the performance gain would be greater. Adding the same optimization to SPU code makes it slightly (by 3 cycles) slower; the penalty is tens of cycles if the early out does not happen and we have to compute all 8 dot products, so it's definitely not worth it.

The current source can be grabbed here.

That's all for now – stay tuned for the next weekend's post! Read more!

Sunday, February 8, 2009

View frustum culling optimization – Vectorize me

Last week I've posted some teaser code that will be transformed several times, each time yielding a faster one - “faster” in terms of “taking less cycles for the test case on SPU”. A lot of you probably looked at my admittedly lame excuse for, uhm, math library and want to ask – why the hell do you use scalar code? We're going to address the problem in this issue. This is probably a no-brainer for most of my readers, but this is a good opportunity to introduce some important points about SPUs and introduce some actual vector code before diving further.

But first, we need some background information on SPUs. For the todays post, there is a single important thing to know about SPUs – they are vector processors. Unlike most common architectures (PowerPC, x86, etc.), SPUs have only one register set, which consists of 128-bit vectors. The current implementation has 128 of them, and each register is treated differently in different instructions (you have different instructions for adding two registers as if they contained 4 single precision floats or 16 8-bit integers). The important point is that, while you can compile a piece of scalar code for SPU, it's going to use vector registers and vector instructions; the scalar values are assumed to reside in so called preferred slot – for our current needs, we only care about preferred slot for 32-bit scalars, which is the first one (index 0). Register components are numbered from least address in memory onwards, which is really refreshing after SSE little-endian madness.

This actually goes slightly further – not only all registers are 16-byte, but all memory accesses (I'm obviously talking about local storage access here – though the same mostly applies to DMA; I'll be probably discussing something DMA-related after VFC series ends) should be – you can only load/store a full register's worth of data from/to 16b-aligned location. Of course, you can implement a workaround for scalar values – for loading, load the 16 byte chunk the value is in, and then shift it in the register so that it resides in preferred slot; for saving, load the destination 16 byte chunk, insert desired value in it via shifting/masking, and then store the whole chunk back. In fact, this is exactly what compiler does. Moreover, for our struct vector3_t, loading three components in registers will generate such load/shift code for every component, since compiler does not know the alignment (the whole vector could be in one 16 byte chunk, or it could be split in half between any two components).

In order to leverage available power, we have to use available vector instructions. SPUs have a custom instruction set, which is well documented. For now, it's important to know that there is a fused multiply-add instruction, which computes a*b+c, and there is no dot product instruction (or floating-point horizontal sum, for that matter). In fact, on current generation of consoles, XBox360 is pretty unique in that it does have a dot product instruction.

So, our code is bad because we have lots of scalar memory accesses and lots of scalar operations, which are not using available processing power properly. Let's change this!

One option is to code in assembly; this has obvious benefits and obvious pitfalls, and we'll use intrinsics instead. For SPUs, we have three intrinsics sets to choose from – Altivec emulated (vec_*, the same as we use on PPU), generic type-aware (spu_*) and low-level (si_*). GCC compiler provides several vector types as language extensions (some examples are 'vector float' and 'vector unsigned char', which correspond to 4 32-bit floats and 16 8-bit unsigned integers, respectively); a single spu_* instruction translates to different assembly instructions depending on a type, while si_* instructions operate on abstract register (it has type 'qword', which corresponds to 'vector signed char') – i.e. to add two vectors, you can use spu_add(v1, v2) with typed registers, or one of si_a, si_ah, si_fa, si_dfa to add registers as 32-bit integer, 16-bit integer, 32-bit floating point or 64-bit floating point, respectively. We'll be using si_* family for several reasons – one, they map to assembly exactly, so getting used to si_* instructions make it much easier to read (and possibly write) actual assembly, which is very useful when debugging or optimizing code, two, spu_* family is not available in C, as it uses function overloading. I'll explain specific intrinsics as we start using them.

First thing we'll do is dispose of redundant vector3_t/plane_t structures (in a real math library, we won't do this of course, but this is a sample), and replace them with qwords. This way, everything will be properly aligned, and we won't need to write load/store instructions ourselves (as opposed to something like struct vector3_t { float v[4]; }).

Then, we have to generate an array of points. Each resulting point is a combination of aabb->min and aabb->max – for each component we select either minimum or maximum value. As it turns out, there is the instruction that does exactly that – it accepts two registers with actual values and a third one with pattern; for each bit in pattern, it takes left bit for 0 and right bit for 1 – it's equivalent to (a & ~c) | (b & c), only in one instruction.

The code becomes

    // get aabb points
    qword points[] =
    {
        min,                                                   // x y z
        si_selb(min, max, ((qword)(vec_uint4){~0, 0, 0, 0})),  // X y z
        si_selb(min, max, ((qword)(vec_uint4){~0, ~0, 0, 0})), // X Y z
        si_selb(min, max, ((qword)(vec_uint4){0, ~0, 0, 0})),  // x Y z

        si_selb(min, max, ((qword)(vec_uint4){0, 0, ~0, 0})),  // x y Z
        si_selb(min, max, ((qword)(vec_uint4){~0, 0, ~0, 0})), // X y Z
        max,                                                   // X Y Z
        si_selb(min, max, ((qword)(vec_uint4){0, ~0, ~0, 0})), // x Y Z
    };

Note that I'm using another gcc extension to form vector constants. This is very convenient and does not exhibit any unexpected penalties (the expected ones being additional constant storage and additional instructions to load them).

Then we have transform_point; we'll have to transform a given vector by matrix, and additionally to stuff a 1.0f in .w component of the result in order for the following dot product to work (I sort of hacked this in scalar version by using dot(vector3, vector4)). Vector-matrix SIMD multiplication is very well-known – we'll need add/multiply instructions, and ability to replicate a vector element across the whole vector. For this we'll use a si_shufb instruction – I'll leave the detailed explanation for the next issue, for now just assume that it works as desired :)

static inline qword transform_point(qword p, const struct matrix43_t* mat)
{
    qword px = si_shufb(p, p, (qword)(vec_uint4)(0x00010203));
    qword py = si_shufb(p, p, (qword)(vec_uint4)(0x04050607));
    qword pz = si_shufb(p, p, (qword)(vec_uint4)(0x08090a0b));

    qword result = (qword)mat->row3;

    result = si_fma(pz, (qword)mat->row2, result);
    result = si_fma(py, (qword)mat->row1, result);
    result = si_fma(px, (qword)mat->row0, result);

    result = si_selb(result, ((qword)(vec_float4){0, 0, 0, 1}), ((qword)(vec_uint4){0, 0, 0, ~0}));

    return result;
}

We replicate point components, yielding three vectors, and then compute transformation result using si_fma (fused multiply-add; returns a * b + c) instruction. After that we combine it via selb to get 1.0f in the last component.

Note that in this case we are fortunate to have our matrix laid out as it is – another layout would force us to transpose it prior to further computations to make vectorization possible. In scalar case, the layout does not make any difference.

Finally, we'll have to compute dot product. As there is no dedicated dot product instruction, we'll have to emulate it, which is not pretty.

static inline float dot(qword lhs, qword rhs)
{
    qword mul = si_fm(lhs, rhs);

    // two pairs of sums
    qword mul_zwxy = si_rotqbyi(mul, 8);
    qword sum_2 = si_fa(mul, mul_zwxy);

    // single sum
    qword sum_2y = si_rotqbyi(sum_2, 4);
    qword sum_1 = si_fa(sum_2, sum_2y);

    // return result
    return si_to_float(sum_1);
}

First we get a component-wise multiplication result by using fm; then we'll have to compute horizontal sum. First we sum odd/even components together separately. For that, we rotate our register to the left by 8 bytes (si_rotqbyi) and add with the original. After that, we rotate the result left by 4 bytes (to get the second sum at the preferred slot) and add with the original.

For mul = (1, 2, 3, 4), we get the following values:
mul_zwxy = 3 4 1 2
sum_2 = 4 6 4 6
sum_2y = 6 4 6 4
sum_1 = 10 10 10 10

The result is converted to float via si_to_float cast intrinsic – it just tells the compiler to reinterpret result as if it was a float (actual scalar value is assumed to be in preferred slot), this usually does not generate any additional instructions.

Note that in case of SPU, there is only one register set – thus there is no penalty for such vector/scalar conversion. This code will not perform very well for other architectures – for example, on PowerPC converting vector to float in this way causes a LHS (Load Hit Store; it occurs when you read from the same address you just wrote into) because vector should be stored to stack to load vector element into float register; LHS causes a huge stall (40-50 cycles) and thus performance can be compromised here. For this reason, if your PPU/VMX math library has an optimized dot product function that returns float, don't use it in performance critical code – find another approach. It's interesting that if you think about it, you don't need dot products that much, as I'll show in the next issue.

Anyway, the current code runs at 820 cycles, which is 50% faster than scalar code. This equals to approximately 256 msec per million calls, the corresponding numbers for x86 being 136 msec for gcc and 74 msec for msvc8. Once x86 code is changed so that dot() function returns its result in a vector register, and resulting sign is then analyzed via _mm_movemask_ps instruction, timings change to 126/68, respectively. We've made some progress there, but our SPU implementation is still far from x86 in terms of speed though we're using the same techniques. I promise that the end result will be much more pleasing though :)

The current source can be grabbed here.

That's all for now – stay tuned for the next weekend's post! Read more!

Saturday, January 31, 2009

View frustum culling optimization - Introduction

Here I come again, back from almost a year long silence – and for some weird reason a visitor counter shows that people are still reading my blog! This was an eventful year for me – I worked on lots of things at work and on some at home, got 3 more shipped titles to put in my CV, started really programming on PS3 (including many RSX-related adventures, optimizations and, recently, SPU coding, which I happen to enjoy a lot), and, as some of you will probably guess from the code below, started using Vim. Some other (good) changes gave me more free time, so this post is a first one in a new one-post-in-a-week series (which will hopefully not be the, uh, last one also).

I have a small piece of code at work, which performs simple frustum culling for a given OBB. Initially it was written in an unoptimized (and cross-platform) way, later it was rewritten for Altivec with interesting optimizations, which yielded 3x performance boost, IIRC, and recently it was rewritten again for SPU. I considered this series of code transformation an interesting one, so I thought I'd expand it slightly (adding more intermediate stages), pretend it always was on SPU, and write several posts about it.

This post is a teaser, featuring testing methodology, source data, algorithm and initial (unoptimized) version of code, with unoptimized underlaying math library. Each code snippet will be short and self-contained, since the problem at hand is simple enough. Later posts in series will each feature some performance-related transformation (which can be obviously applied to lots of algorithms), with the last post giving more or less the current version of my code at work.

So, let's get started!

It starts simple - we have a handful of meshes, with each mesh having some kind of bounding volume and a local-to-world transformation. The task at hand is to determine, for a given frustum, whether the mesh is potentially visible (inside/intersecting with the frustum). The test has to be conservative – i.e. it can answer “visible” for meshes which are actually invisible – but it does not have to be exact. In fact, for many meshes the bounding volume itself is inexact, but additionally the algorithm can sacrifice some accuracy in order to be faster.

For our case, the bounding volume is AABB (axis-aligned bounding box) - “axis-aligned” here means that box axes are aligned to mesh local space axes, so in world space this is OBB. We're testing it against an arbitrary frustum – it can be a usual perspective/orthogonal one, or something more fancy (for example, our reflection/refraction rendering passes use perspective projection with oblique clipping, so near/far planes are not perpendicular to the viewing direction, and in fact they are not parallel at all!). Frustum is defined by a 4x4 matrix, though obviously we're free to convert it to any other representation we like.

There are two common approaches to testing if the box is inside frustum or not. First, equations of all 6 frustum planes are extracted. Then, for each plane it's determined if the box is completely outside (i.e. in the negative half-space, if planes' normals are pointing inside the frustum) or not; if the box is not completely outside for all planes, it's reported visible, otherwise it's reported invisible. This can be extended to differentiate “completely inside” and “partially inside” results, though we don't need it, since we're going to render both groups of meshes anyway.

Two approaches differ in the methodology of box-plane testing – one (bruteforce) is testing all 8 vertices of the box, another one is taking a single point (the p-vertex) and testing it, which is enough to give the correct answer if the correct p-vertex is chosen. The series will concentrate on the bruteforce approach, at least for now.

The naïve version of the algorithm first extracts the plane equations, using frustum combined view projection matrix (this is done once for each frustum, so it is not performance sensitive; as such, the code for this is omitted and frustum is assumed to have 6 computed plane equations initially). Then it applies the described bruteforce algorithm as follows:

bool is_visible(struct matrix43_t* transform, struct aabb_t* aabb, struct frustum_t* frustum)
{
    // get aabb points
    struct vector3_t points[] =
    {
        { aabb->min.x, aabb->min.y, aabb->min.z },
        { aabb->max.x, aabb->min.y, aabb->min.z },
        { aabb->max.x, aabb->max.y, aabb->min.z },
        { aabb->min.x, aabb->max.y, aabb->min.z },

        { aabb->min.x, aabb->min.y, aabb->max.z },
        { aabb->max.x, aabb->min.y, aabb->max.z },
        { aabb->max.x, aabb->max.y, aabb->max.z },
        { aabb->min.x, aabb->max.y, aabb->max.z }
    };

    // transform points to world space
    for (int i = 0; i < 8; ++i)
    {
        transform_point(points + i, transform);
    }

    // for each plane...
    for (int i = 0; i < 6; ++i)
    {
        bool inside = false;

        for (int j = 0; j < 8; ++j)
        {
            if (dot(points + j, frustum->planes + i) > 0)
            {
                inside = true;
                break;
            }
        }

        if (!inside)
        {
            return false;
        }
    }

    return true;
}

It uses five predefined data structures (vector3_t, matrix43_t, aabb_t, frustum_t and plane_t which is the type of frustum->planes array elements); those, and the (again, naïve) code of two used functions is available here (along with the rest of the code).

Note that matrix43_t is laid out so that three translation components are adjacent to each other in memory (I don't use the term “whatever major” here because it's very misleading); in our real code, rows actually consist of four components, with the fourth one being undefined for matrix43_t (of course, all operations should proceed as if the column was filled with 0 0 0 1). Similarly, vector3_t has four components, with the fourth one being undefined (this affects aabb_t). This is something that is assumed to stay this way forever, so all of our discussed code will somehow work around that when needed. From the next post and onwards, data layout of the sample code will be exactly the same as in our engine, I've omitted padding fields in today's version for simplicity.

The testing methodology is simple – I compile the code for SPU using a compiler from Sony's toolchain with -O3 level of optimization (the code is C99, by the way), and then run it via SPU simulator. This is an extremely useful tool that's provided by Sony for PS3 developers, which can run a SPU program and, for example, report run statistics – cycles elapsed, various stalls, etc. As far as I know, IBM has a public simulation suite with comparable capabilities, but since it requires Linux I never bothered to test it out. The number of cycles that the tool reports is then slightly reduce to account for some startup overhead (which is 34 cycles), so the number that I present here is the number of cycles for non-inlined call, with included function call overhead. For the reference, I'll include expected run time on a million OBBs (on one SPU, obviously), and corresponding run times of more or less the same code for PC (on my Core2 Duo 2.13 GHz), compiled with gcc 4.3.0 and MSVC 8.0 (SPU intrinsics will be replaced with SSE1 code). Those are only for reference, don't quote me on them :)

The cycle count for this naïve code is 1204, which (given a 3.2 GHz SPU) translates to 376 msec per million calls. The same code gives me roughly 84 msec when compiled with gcc (switches -O3 -msse) and 117 msec when compiled with cl (switches /O2 /fp:fast /arch:SSE). The test runs on the same data each time (taking care that processing is actually being performed), which excludes cache misses from the picture; the actual data is the same for SPU and PC tests and consists of a small box being completely inside the frustum – so that's a worst case for the outer loop (we have to test the box against all planes only to report that it's actually visible), although that's a best case for the inner loop (the first vertex tested for each plane is inside, so we can bail out early). I don't have any real-world average statistics on iteration counts of those loops; anyway, eventually we're going to eliminate some, and then all of those branches, so that the code will perform at constant speed.

That's all for now – stay tuned for the next weekend's post!
Read more!

Wednesday, March 12, 2008

COLLADA: quick update

Time's running fast. Two weeks has passed since my post about COLLADA, and I've found a killer bug in FCollada TBN generation code.

As 3dsmax native API does not provide support for returning TBN (I do not know about Maya, perhaps it does not too), Feeling Software implemented their own algorithm for TBN calculation, based on source found in Maya 7.0 documentation, "Appendix A: Tangent and binormal vectors". Of course, relying on NVMeshMender would be too easy.

And after three years of Feeling Software's Collada plugins, there is a bug in TBN generation code. You can read the full details here: http://sourceforge.net/forum/forum.php?thread_id=1966038&forum_id=460918 (the poster is me), but to keep it simple - returned tangent/binormal are opposite to the correct ones because of incorrect sign in equations (proof with asset files and comparison between Maya reference code and FCollada is also in the post). Well, perhaps it's just that I misunderstand something, but I definitely think it is a bug - there's just too many things to back it up.

And suddenly I can't post a bug report on Feeling Software forum, and through I get to know that Collada free support is discontinued. Given that other alternatives to DAE export from Max/Maya are just not worth the trouble, this means that suddenly COLLADA starts to feel much less attractive than before.

I'm even considering writing a small (geometry, node hierarchy, skin controller and sampled & baked animation - should not be that hard) plugin for 3dsmax/Maya...

P.S. Don't click the "Read more" link, it's here because Blogger maintainers can't implement the most useful feature about blogs - ability to hide parts of the message for brief version that's displayed on the main page. Read more!

Sunday, February 24, 2008

COLLADA: Best thing since sliced bread or not?

So, uh, the “post in a week” idea did not quite work. Sorry about that.

About half a year ago, our team at work that develops the engine decided to try and switch from the proprietary Maya export plugin (it exported geometry, animation and materials) to COLLADA. The old export pipeline was somewhat obscure, lacked some useful optimizations, and (what’s most important) lacked any convenient way to setup materials. That was not a problem for platforms with more or less fixed functionality, but with next-generation consoles (or should I say current-generation already?) it’s quite different.

So the switch has been made (it did not take half a year, it’s just that I’m writing about it only now), and I’m going to share some experience gained throughout the process.

What is COLLADA exactly? It’s an asset interchange format, based on XML (with complete XML Schema and specification), and a series of tools – exporters from popular DCC software (Maya, 3d Studio Max, XSI, Blender, etc.), viewers, libraries for loading/saving/manipulating COLLADA DOM tree.

This means several important things. First, it’s an asset interchange format, which means that it is not supposed to be used as a format for the retail assets. DCC saves COLLADA file, the custom tool loads it, reads useful information from it, applies optimizations (possibly platform-specific), and saves to some binary format. Second, you don’t have to write the export plugin for any DCC tool you use – in theory, all you do is write the said tool that converts .dae to your format and it magically works with all possible tools. Third, it’s slowly becoming something like an industry standard – every popular DCC has an export plugin, some well-known tools can read DAE files (i.e. FXComposer), it has support of well-known companies like Sony, and more and more engines are adopting it.

But that, of course, does not mean that it is a perfect solution.

So, what exactly are COLLADA advantages (why do you want to use it)?

You get a more or less DCC-independent pipeline. Even if your artists only ever use Maya, it does not mean that you’ll never need 3dsmax support (our engine is now being used by a company which only has 3dsmax-aware artists, so the task “support 3dsmax as geometry/animation export tool” has appeared – and it took a day or two).

It is an additional layer of abstraction between DCC and your builder. This means that tedious work with DCC APIs is now inside the exporter, which is (ideally) the code you shouldn’t even know about. As a result, export pipeline is much simpler.

There is a built-in custom material support (ColladaFX). Basically, it allows you to specify a material created from hardware shader (Cg/CgFX), and supplies the artist with convenient way of tweaking the shader parameters (with viewport preview as an added bonus).

DCC plugins usually support importing DAE files. Why is this important? In old pipeline, we had the proprietary plugin export the .sb file, then a platform-independent tool applied some optimizations (reducing scene graph, removing redundant stuff from scene, merging meshes, etc.), and then the platform-specific exporter read that file and converted it to platform-specific format (stripifying, cache optimization, vertex packing, etc.). Obviously, any kind of visual feedback is lost at the moment you export .sb from Maya, so a special viewing/introspection tool was developed. If you use COLLADA and manage to write some export tools such that they only modify .dae file, you can later import it in your DCC tool. If your pipeline is made of a series of such builders, and (!) you save result of each builder, debugging the pipeline becomes much easier.

Well, so I’ve told all the good things about COLLADA I know of. Unfortunately, there is a number of things that are not so good.

Scheme is complex and redundant in many ways. Writing a complete (able to parse any compliant COLLADA file) parser is hard, so either use an existing library (FCollada?) or parse only the subset of scheme your DCC tool exports. I prefer the latter approach, because it’s simpler for me and also is much faster in terms of performance.

DCC export is sometimes quite slow (in case of Maya, for example, the export usually takes two times as long as the code that parses the file, builds the platform-specific structures and saves them). So cache your .dae files (we’re using SCons as a build system, and a network cache, so it’s not as frustrating as it could be).

It is an additional layer of abstraction between DCC and your builder. This means that every time you encounter a bug, it could be either the export plugin or your builder (or, uh, a series of your builders). And if you use some DAE-parsing library, it could be that it is the source of problem. Fortunately, such cases are rare.

Export plugins sometimes are not very top-quality. For example, lack of pivot animation export in ColladaMaya, ColladaFX support, bugs, etc.

It is just an export plugin, do not expect any miracles. For example, if 3dsmax gives you TBN that does not make sense, COLLADA is not going to fix it.

ColladaFX is very bad from the usability standpoint.
- It’s hard for artists to create a new material and to correctly setup binding to geometry (for example, Maya TBN shader binding is not quite clear because of Cg).
- It’s much harder for them to use it in 3dsmax because of even less convenient interface and some problems with parameter binding – just ask your artist to setup an existing model with ColladaFX materials in 3dsmax and you’ll know why
- Perhaps it’s slightly better with CgFX, but since we don't use it, I can't say for sure.

ColladaFX implementation is quite bad.
- There are frequent crashes in 3dsmax (we fixed some of them and are considering submitting a patch).
- Cg materials did not even export correctly because of exporter bug in 3dsmax! We submitted a patch that should already be in trunk.
- ColladaFX materials export from Maya did not work with batch build.

So, generally, ColladaFX seems great on paper, but requires a lot of work, both in technical implementation and usability areas. We are considering rewriting the Maya interface part from scratch.

Fortunately, COLLADA exporter plugins we're using are open-source, so we debug them if they do not work, fix bugs (isn't it exciting?!) and add functionality as we feel appropriate (though of course this complicates the process of updating plugin versions).

Let’s summarize the above. If you do not have any established and well-working export pipeline and are not planning a custom DCC plugin for material setup or things like that – I’d definitely recommend COLLADA, because it’ll be easier than a custom plugin if you don’t have the relevant experience, and it will make it possible to support several DCC tools, which is a good thing. If you have a well established export pipeline that you’re happy about, there is obviously no need to use COLLADA. In other cases the answer is more complex. I myself am quite happy because of transition to COLLADA, because it made everything better, and the major disappointment of COLLADA was ColladaFX, which we did not have an equivalent for anyway (and export of default materials like Phong/Blinn/Lambert works just fine), but of course your mileage may vary.

If you are using COLLADA and have different experience about any of the enlisted areas, please write a comment! For example, do you use ColladaFX? Do you use FCollada and/or ColladaDOM and does it help you? Perhaps you use Feeling Software proprietary export plugins and have something good (or bad) to say about them?
Read more!

What your mother never told you about graphics development

Sunday, March 8, 2009

Fighting against CRT heap and winning

Sunday, March 1, 2009

View frustum culling optimization – Never let me branch

Sunday, February 15, 2009

View frustum culling optimization – Structures and arrays

Sunday, February 8, 2009

View frustum culling optimization – Vectorize me

Saturday, January 31, 2009

View frustum culling optimization - Introduction

Wednesday, March 12, 2008

COLLADA: quick update

Sunday, February 24, 2008

COLLADA: Best thing since sliced bread or not?

About Me

Blog Archive

Visitors

Sunday, March 8, 2009

Sunday, March 1, 2009

Sunday, February 15, 2009

Sunday, February 8, 2009

Saturday, January 31, 2009

Wednesday, March 12, 2008

Sunday, February 24, 2008

About Me

Subscribe To

Blog Archive

Visitors