Tuesday, March 31, 2009


I'm sorry for the lack of real post - it was a busy week, and a somewhat busy month lies ahead - I'm attending a local game conference in May and giving a speech about the process of porting our rendering subsystem to SPU (I hope to cover this topic here some day), so some time is spent preparing slides/etc.; my pet projects demand more attention than usual; there's some weird but nevertheless interesting stuff at work... I'll try to keep up, but you should really expect some more weeks without any posts. Don't beat me.

Anyway, a bunch of slides from GDC09 Tutorial sessions are finally uploaded; there is some good stuff in "Advanced Visual Effects with Direct3D", and there's some awesome stuff in "Insomniac Games' Secrets of Console and Playstation 3 Programming". I mean, finally someone told people who compute view-space normal Z as sqrt(1 - x^2 - y^2) that they don't know what they're doing! Not to mention SPU stuff, like KISS SPU scheduler (we have a simple enough custom scheduler at work, but it's still far), SPU debugging stories and other SPU talks. By the way, if you're interested in SPU-related topics and have not read everything here, then you don't take SPU seriously.

There are also Khronos' slides here - don't read them unless you have absolutely nothing to do. Read more!

Sunday, March 22, 2009


There is a bunch of small notes I'd like to share – none of them deserves a post, but I don't want them to disappear forever.

Using hash table as a fixed-size cache

When I worked with Direct3D 10, I found state objects quite cumbersome to work with – they're very slow to create (or at least were back then) and the exact separation of states into objects was sometimes inconvenient from design point of view. Also I've already had a set of classes that divided states into groups and functions like setDepthState with redundancy checking, so I needed to write an implementation for existing interface. The solution I came up with was very simple and elegant, so I'd like to outline it once more (although I sort of mentioned it in the original post).

The natural thing to do here is to cache state object pointer inside state class, and recompute it if necessary (when binding newly created/modified state). There are two issues to solve here – 1. state object creation is expensive (even if you're creating an object with the same data several times in a row – in which case D3D10 runtime returns the same pointer – the call takes 10k cycles), 2. there is a limit on the amount of state objects (4096 for each type). Solving the first one is easy – just make a cache with key being state object description and value being the actual pointer; solving the second one is slightly harder because you'll have to evict entries from your cache based on some policy.

The way I went with was to create a fixed size array (the size should be a power of two less or equal than 4096 and depends on the state usage pattern), make a hash function for state description and use this array as a cache indexed by hash. In case of cache collision the old state object got released.

I often use simple non-resizable hash tables (recent examples include path lookup table in flat file system and vertex data hash to compute index buffer from COLLADA streams), but I always insert collision handling code – but, as it turns out, in this case you can omit it and get some benefit at the same time.

Direct3D 10 Read/Write hazard woes

While in some ways Direct3D 10 is clearly an improvement over Direct3D 9, in lots of areas they screwed it. I surely can't count all deficiencies using my both hands, but some problems annoy me more than the others. One of the things that leads to possible performance/memory compromises is resource Read/Write hazard. There are several inputs for various stages (shader resources (textures, buffers), constant buffers, vertex/index buffers) and several output ones (render targets, depth surfaces, stream out buffers), and there are resources that can be bound to both input and output stage; for example, you can bind a texture to output stage as a render target, then render something to it, and then bind the same texture as a shader resource so that shader can sample rendered data from it. However, Direct3D 10 Runtime does not allow a resource to be bound to both input and output stage at the same time.

One disadvantage is that sometimes you'd like to do an in-place update of render target – for example, to do color correction or some other transformation. In fact, this is a perfectly well-defined operation – at least on NVidia hardware – if you're always reading the same pixel you're writing to; otherwise you'll get old value for some pixels and new one for others. Here there is an actual read/write hazard, but due to the specific hardware knowledge we can exploit it to save memory.

Another disadvantage is that a resource being bound to output pipeline stage does not mean it's being written to! A common example is soft particles – Direct3D 10 introduced cross-platform unified depth textures so that you can apply postprocessing effects that require scene depth without extra pass to output depth in a texture or MRT – you can use the same depth buffer you were using for scene render as a texture input to the shader. While this works perfectly for post processing (except for the fact that you can't read depth from MSAA surfaces – ugh...), it fails miserably for soft particles. You usually disable depth writes for particles so there is no real read/write hazard, but because the runtime thinks there is one you can't bind depth buffer so that HW performs depth test – you can only do depth testing in pixel shader yourself via discard. This disables early coarse/fine Z culling, which results in abysmal performance.

Luckily MSAA depth readback is supported in D3D10.1, and in D3D11 you can bind resources to output pipeline stages as read-only. Too bad there is no D3D11 HW yet, and D3D10.1 is not supported by NVidia...

Knowing the class layout – vfptr

There are two weird points regarding class layout and vfptr (virtual function table pointer) that I'd like to note here – they are related to very simple cases, I'm not going to talk about multiple or god forbid virtual inheritance here.

Why do you need to know class layout? Well, it's useful while writing code so your classes can occupy less space, it's extremely useful while debugging obscure bugs, and you can't even start doing in-place loading/saving without such knowledge (I think I'll make a special post regarding in-place stuff soon). And don't even get me started on debuggers that can't display anything except registers and (if you're lucky) primitive locals – we used to have such debugger on PSP, and CodeWarrior for Wii is only slightly better.

Anyway, the first weird point is related to CodeWarrior – it had been like this on PS2, and it's like this on Wii – I doubt that'll ever change. You see, while on normal compilers there is no way to control vfptr placement – for simple classes without inheritance it always goes in the first word – on CodeWarrior it lies in the place of declaration – except that you can't declare vfptr in C++, so it lies in the place where the first virtual function is declared. Some examples follow:

// layout is vfptr, a, b
struct Foo { virtual void foo1(); unsigned int a; unsigned int b; };

// layout is a, vfptr, b
struct Foo { unsigned int a; virtual void foo1(); unsigned int b; };

// layout is a, vfptr, b
struct Foo { unsigned int a; virtual void foo1(); unsigned int b; virtual void foo2(); };

// layout is a, b, vfptr
struct Foo { unsigned int a; unsigned int b; virtual void foo2(); };

Marvelous, isn't it? Now there is an entry in our coding standard at work which says “first virtual function declaration has to appear before any member declarations”.

The second point was discovered only recently and appears to happen with MSVC. Let's look at the following classes:

struct Foo1 { virtual void foo1(); unsigned int a; float b; };
struct Foo2 { virtual void foo1(); unsigned int a; double b; };

Assuming sizeof(unsigned int) == 4, sizeof(float) == 4, sizeof(double) == 8, what are the layouts of the classes? A couple of days ago I'd say that:

Foo1: 4 bytes for vfptr, 4 bytes for a, 4 bytes for b; alignof(Foo1) == 4
Foo2: 4 bytes for vfptr, 4 bytes for a, 8 bytes for b; alignof(Foo2) == 8

And in fact this is exactly the way these classes are laid out in GCC (PS3/Win32), CodeWarrior (Wii) and other relatively sane compilers; MSVC however chooses the following layout for Foo2:

Foo2: 4 bytes for vfptr, 4 bytes of padding, 4 bytes for a, 4 bytes of padding, 8 bytes for b; alignof(Foo2) == 8

Of course the amount of padding increases if we replace double with i.e. __m128. I don't see any reason for such memory wastage, but that's the way things are implemented, and again I doubt this will ever change.

Optimizing build times with Direct3D 9

Yesterday after making some finishing touches to D3D9 implementation of some functions in my pet project (which is coincidentally a game engine wannabe), I hit rebuild and could not help noticing the difference in compilation speed for different files. The files that did not include any heavy platform-specific headers (such as windows.h or d3d9.h) were compiled almost immediately, files with windows.h included were slightly slower (don't forget to define WIN32_LEAN_AND_MEAN!), and files with d3d9.h were slow as hell compared to them – the compilation delay was clearly visible. Upon examination I understood that including windows.h alone gets you 651 Kb of preprocessed source (all numbers are generated via cl /EP, so the source doesn't include #line directives; also WIN32_LEAN_AND_MEAN is included in compilation flags), and including d3d9.h results in a 1.5 Mb source.

Well, I care about compilation times, so I decided to make things right – after all, d3d9.h can't require EVERYTHING in windows.h and other headers it includes. After half an hour of work, I arrived with minid3d9.h (which can be downloaded here).

Including minid3d9.h gets you 171 Kb of preprocessed source, which is much better. This file defines everything that's necessary for d3d9.h and also a couple of things my D3D9 code used (i.e. SUCCEDED/FAILED macros); you might need to add something else – it's not always a drop-in replacement. Also I've taken some measures that enable safe inclusion of this file after CRT/Platform SDK headers, but don't include it before them – generally, include it after everything else.

This decreased the full rebuild time by 30% for me (even though D3D9 code is less than 15% in terms of code size and less than 10% in terms of translation unit count) – I certainly expected less benefit! You're free to use this at your own risk; remember that I did not test in on 64-bit platform so perhaps it needs more work there. Read more!

Sunday, March 15, 2009

View frustum culling optimization – Representation matters

Before getting into professional game development I've spent a fair amount of time doing it for fun (in fact, I still do it now, although less intensively). The knowledge came from a variety of sources, but the only way that I knew and used to calculate frustum planes equations was as follows – get the equations in clip space (they're really simple – (1, 0, 0, 1), (0, -1, 0, 1), etc.) and then get world space ones by transforming the planes with inverse transpose of view projection camera matrix [correction: in fact, you need to transform with inverse transpose of inverse view projection matrix, which equals to just transpose of view projection matrix]. It's very simple and intuitive – if you know a simple way to express what you need in some space, and a simple way to transform things from that space to your target one, you're good to go.

Imagine my surprise when I started doing game development as a day job and after some time accidentally stumbled upon a piece of our codebase that calculated frustum planes in a completely different way. Given a usual perspective camera setup, it calculated frustum points via some trigonometry (utilizing knowledge about vertical/horizontal FOV angles, near/far distances and the fact that it's a perspective camera without any unusual alterations), and then used them to obtain the equations. I thought it to be very weird – after all, it's more complex and is constrained to specific camera representation, whereas clip space method works for any camera that can be set up for rendering (orthographic projection, oblique-clipping, etc.).

But as it turns out, the same thing can be said about our culling code. It's quite good at culling given box against an arbitrary set of planes (i.e. if you use it for portal/anti-portal culling with arbitrary shape of portals/occluders), but since we have a usual frustum, maybe we can improve it by going to clip space, entirely skipping world space? Let's try it.

We're going to transform AABB points to clip space, and then test them against frustum planes in clip space. Note that we can't divide by w after transforming – that will lead to culling bugs because post-projective space exhibits a discontinuity at the plane with equation z = 0 in view space; however, this is not needed – the frustum plane equations in clip space are as follows:

x >= -w, x <= w: left/right planes
y >= -w, y <= w: top/bottom planes
z >= 0, z <= w: near/far planes

Note that if you're using OpenGL clip space convention, the near plane equation is z >= -w; this is a minor change to the culling procedure.

First, to transform points to clip space, we're going to need a world view projection matrix – I hope the code does not require any additional explanations:

static inline void transform_matrix(struct matrix_t* dest, const struct matrix_t* lhs, const struct matrix_t* rhs)
#define COMP_0(c) \
    qword res_ ## c = si_fm((qword)lhs->row2, SPLAT((qword)rhs->row ## c, 2)); \
    res_ ## c = si_fma((qword)lhs->row1, SPLAT((qword)rhs->row ## c, 1), res_ ## c); \
    res_ ## c = si_fma((qword)lhs->row0, SPLAT((qword)rhs->row ## c, 0), res_ ## c); \
    dest->row ## c = (vec_float4)res_ ## c;

#define COMP_1(c) \
    qword res_ ## c = si_fma((qword)lhs->row2, SPLAT((qword)rhs->row ## c, 2), (qword)lhs->row3); \
    res_ ## c = si_fma((qword)lhs->row1, SPLAT((qword)rhs->row ## c, 1), res_ ## c); \
    res_ ## c = si_fma((qword)lhs->row0, SPLAT((qword)rhs->row ## c, 0), res_ ## c); \
    dest->row ## c = (vec_float4)res_ ## c;


#undef COMP_0
#undef COMP_1

After that we'll transform the points to clip space, yielding 2 groups with 4 vectors (x, y, z, w) in each one; the code is almost the same as in the previous post, only we now have 4 components:

static inline void transform_points_4(qword* dest, qword x, qword y, qword z, const struct matrix_t* mat)
#define COMP(c) \
    qword res_ ## c = SPLAT((qword)mat->row3, c); \
    res_ ## c = si_fma(z, SPLAT((qword)mat->row2, c), res_ ## c); \
    res_ ## c = si_fma(y, SPLAT((qword)mat->row1, c), res_ ## c); \
    res_ ## c = si_fma(x, SPLAT((qword)mat->row0, c), res_ ## c); \
    dest[c] = res_ ## c;

#undef COMP

    // transform points to clip space
    qword points_cs_0[4];
    qword points_cs_1[4];

    transform_points_4(points_cs_0, minmax_x, minmax_y, minmax_z_0, &clip);
    transform_points_4(points_cs_1, minmax_x, minmax_y, minmax_z_1, &clip);

If all 8 clip-space points are outside left plane, i.e. if for all 8 points p.x <= -p.w, then the box is completely outside. Since we're going to use SoA layout, such tests are very easy to perform. We'll need a vector which contains -w for 4 points; SPU do not have a special negation instruction, but you can easily emulate it either by subtracting from zero or by xoring with 0x80000000. Theoretically xor is better (it has 2 cycles of latency, subtract has 6), but in our case there is no difference in speed; I'll use xor nonetheless:

    // calculate -w
    qword points_cs_0_negw = si_xor(points_cs_0[3], (qword)(vec_uint4)(0x80000000));
    qword points_cs_1_negw = si_xor(points_cs_1[3], (qword)(vec_uint4)(0x80000000));

Now we'll calculate “not outside” flags for each plane; the method is exactly the same as in previous post (as is the final result computation), only now we're not doing dot products.

    // for each plane...
    #define NOUT(a, b, c, d) si_orx(si_or(si_fcgt(a, b), si_fcgt(c, d)))

    qword nout0 = NOUT(points_cs_0[0], points_cs_0_negw, points_cs_1[0], points_cs_1_negw);
    qword nout1 = NOUT(points_cs_0[3], points_cs_0[0], points_cs_1[3], points_cs_1[0]);
    qword nout2 = NOUT(points_cs_0[1], points_cs_0_negw, points_cs_1[1], points_cs_1_negw);
    qword nout3 = NOUT(points_cs_0[3], points_cs_0[1], points_cs_1[3], points_cs_1[1]);
    qword nout4 = NOUT(points_cs_0[2], (qword)(0), points_cs_1[2], (qword)(0));
    qword nout5 = NOUT(points_cs_0[3], points_cs_0[2], points_cs_1[3], points_cs_1[2]);

    #undef NOUT

This is the final version. It runs at 104 cycles per test, so it's slightly faster than the last version. But this method is better for another reason – we've calculated clip space positions of box vertices as a by-product (also we've calculated world view projection matrix, but it's likely to be of little further use, because usually shader constant setup happens at a later point in another module). Some things you can do with them:

Feed them as an input to the rasterizer to do rasterization-based occlusion culling (simple depth buffer, HOM, etc.). This is the road I have not taken yet, though I hope I will do it some day.
Use them for screen size culling – if your bounding box when projected to the screen is small enough (i.e. less than 3x3 pixels), you usually can safely throw it away. This is what I do in our production code; it involves dividing positions by w (don't forget to discard the size culling results if any point has w < epsilon!), computing min/max x/y for the results, subtracting min from max and checking if the difference along each axis is less than threshold. The actual implementation is left as an exercise to the reader.

This concludes the computation part of view-frustum culling. There is still something to do here – there is p/n-vertex approach which I did not implement (but I'm certain that it won't be a win over my current methods on SPUs); there are minor potential improvements to the current code that are not worth the trouble for me; all implemented tests return only binary result (outside / not outside), a ternary version can help in hierarchical culling (though this can be achieved with a minor modification to all presented code). There might be something else I can't think of now – post a comment if you'd like to hear about other VFC-related topics!

I'm going to write one final post regarding VFC, which deals with the code that will use is_visible to perform culling of the given batch array – the topics include DMA and double buffering; after that the whole VFC series will be over and I'm going to switch to something different.

The complete source for this post can be grabbed here. Read more!

Sunday, March 8, 2009

Fighting against CRT heap and winning

Memory management is one of (many) cornerstones of tech quality for console games. Proper memory management can decrease amount of bugs, increase product quality (for example, by eliminating desperate pre-release asset shrinking) and generally make life way easier – long term, that is. Improper memory management can wreak havoc. For example, any middleware without means to control/override memory management is, well, often not an option; any subsystem that uncontrollably allocates memory can and will lead to problems and thus needs redesigning/reimplementing. While you can tolerate more reckless memory handling on PC, it often results in negative user experience as well.

In my opinion, there are two steps to proper memory management. First one is global and affects all code – it's memory separation and budgeting. Every subsystem has to live in its own memory area of fixed size (of course, size can be fixed for the whole game or vary per level, this is not essential). This has several benefits:

  • memory fragmentation is now local – subsystems don't fragment each other's storage, thus fragmentation problems happen less frequently and can be reproduced and fixed faster

  • fixed sizes mean explicit budgets – because of them out of memory problems are again local and easily tracked to their source. For example, there is no more “game does not fit in video memory, let's resize some textures” - instead, you know that i.e. level textures fit in their budget perfectly, but the UI artists added several screen-size backgrounds, overflowing UI texture budget

  • because each subsystem lives in its own area, we have detailed memory statistics for no additional work, which again is a good thing for obvious reasons

  • if memory areas have fixed sizes, they either have fixed addresses or it's easy to trace address range for each of them – this helps somewhat in debugging complex bugs

Second one is local to each subsystem – once you know that your data lives in a fixed area, you have to come up with a way to lay your data in this area. The exact decisions are specific to the nature of data and are up to the programmer; this is out of this post's scope.

Memory is divided into regions, each region is attributed to a single subsystem/usage type – if we accept this, it becomes apparent that any unattributed allocations (i.e. any allocations into global heap) are there either because nobody knows where they should belong or because the person who coded those does not want to think about memory – which is even worse (strict separation and budgeting makes things more complicated in short term by forcing people to think about memory usage – but that's a good thing!). Because of this global heap contains junk by definition and thus should ideally be eliminated altogether, or if this is not possible for some reason, should be of limited and rather small size.

Now that we know the goal, it's necessary to implement it – i.e. we want to have a way to replace allocations in global heap with either fatal errors or allocations in our own small memory area. On different platforms there are different ways to do it – for example, on PS3 there is a documented (and easy) way to override CRT memory management functions (malloc/free/etc.); on other platforms with GNU-based toolchain there is often a --wrap linker switch – however, on some platforms, like Windows (assuming MSVC), there does not seem to be a clean way to do it. In fact, it seems that the only known solution is to modify the CRT code. I work with statically linked CRT, so this would mean less distribution problems, but more development ones – I'd have to either replace prebuilt CRT libraries (which is out of the question because it makes working with other projects impossible) or ignore them and link my own, which is better – but still, the process required building my own (hacked) version of CRT. I did not like the approach, so I came up with my own.

First, some disclaimers. This code is tested for statically linked Win32 CRT only – it requires some modifications to work on Win64 or with dynamically linked CRT – I might do the Win64 part some day, but not DLL CRT. Also I'm not too clear on EULA issues; because of this, I'll post my entire code except for one function that's essentially ripped from CRT and fixed so that it compiles – read further for more details. Finally, there may be some unresolved issues with CRT functions I don't currently use (though I think my solution covers most of them) – basically, this is a demonstration of approach with proof-of-concept code, and if you decide to use it you're expected to fix the problems if they arise :)

Our first priority is to replace CRT allocation functions without modifying libraries. There are basically two ways to do something like it – link time and run time. Link time approach involves telling the linker somehow that instead of existing functions it should use the ones supplied by us. Unfortunately, there does not seem to be a way to do this except /FORCE:MULTIPLE, which results in annoying linker warnings and disables incremental linking. Run time way involves patching code after the executable is started – hooking libraries like Detours do it, but we don't need such a heavyweight solution here. In fact, all that's needed is a simple function:

static inline void patch_with_jump(void* dest, void* address)
    // get offset for relative jmp
    unsigned int offset = (unsigned int)((char*)address - (char*)dest - 5);
    // unprotect memory
    unsigned long old_protect;
    VirtualProtect(dest, 5, PAGE_READWRITE, &old_protect);
    // write jmp
    *(unsigned char*)dest = 0xe9;
    *(unsigned int*)((char*)dest + 1) = offset;
    // protect memory
    VirtualProtect(dest, 5, old_protect, &old_protect);

This function replaces first 5 bytes of code contained in dest with jump to address (the jump is a relative one so we need to compute relative offset; also, the code area is read-only by default, so we unprotect it for the duration of patching). The primitive for stubbing CRT functions is in place – now we need to figure out where to invoke it. At first I thought that a static initializer (specially tagged so that it's guaranteed to execute before other initializers) would be sufficient, but after looking inside CRT source it became apparent that heap is initialized and (which is more critical) used before static initialization. Thus I had to define my own entry point:

    int entrypoint()
        int mainCRTStartup();
        return mainCRTStartup();

Now to patch the functions. We're interested in heap initialization, heap termination and various (de)allocation utilities. There is _heap_init, _heap_term and lots of variants of malloc/free and friends – they are all listed in source code. Note that I stubbed all _aligned_* functions with BREAK() (__asm int 3), because neither CRT code nor my code uses them – of course, you can stub them if you need.

There are several highlights here. First one I stumbled upon is that _heap_term is not getting called! At least not in static CRT. After some CRT source digging I decided to patch __crtCorExitProcess – it's useful only for managed C++, and it's the last thing that gets called before ExitProcess. The second one is in function _recalloc, that's specific to the allocator you're using to replace the default one. The purpose of _recalloc is to reallocate the memory as realloc does, but cleaning any additional memory – so if you do malloc(3) and then _recalloc(4), ((char*)ptr)[3] is guaranteed to be 0. My allocator aligns everything to 4 bytes and has a minimal allocation size limit; the original size that was passed to allocation function is not stored anywhere. It's easy to fix it for CRT because _recalloc is used in CRT only for blocks allocated with calloc, and I hope _recalloc is not used anywhere else. By the way, there is a bug in CRT related to _recalloc – malloc(0) with subsequent _recalloc(1) does not clear first byte (because for malloc(0) block with size 1 is created); moreover, more bugs of such nature are theoretically possible on Win64. Personally I find calloc weird and _recalloc disgusting; luckily it's Windows-only.

Ok, now we're done – are we? Well, everything went well until I turned leak detection on. It turns out that there are lots of allocations left unfreed by CRT – amazingly, there is a __freeCrtMemory function that frees some of those, but it's compiled in only in _DEBUG, and it's called only if CRT debugging facilities are configured to dump memory leaks on exit. Because of this I needed to copy the code, modify it slightly so that it compiles and invoke the function before heap termination. However, this function does not free everything – there were some more allocations left, that I needed to handle myself. You can see the code in cleanup_crt_leaks(). After cleaning up leaks printf(), which was used to output leaks to console, became unusable (oh, horror!), so I came up with the following function:

void debug_printf(const char* format, ...)
    char buf[4096];
    va_list arglist;
    va_start(arglist, format);
    wvsprintfA(buf, format, arglist);
    // console output
    HANDLE handle = GetStdHandle(STD_OUTPUT_HANDLE);
    WriteFile(handle, buf, (unsigned long)strlen(buf), NULL, NULL);

    // debug output

Finally, the last problem is that some CRT code checks global variable _crtheap prior to allocation, so we have to initialize it to something (that affects fopen() and other functions that use dynamically created critical sections).

Well, now it works and I'm quite happy with the results. Of course it's slightly hackish, but CRT code is such a mess that it blends in nicely. The more or less complete source code is here. Note that if you're using C++ new/delete and you have not overridden them globally for some reason, you might want to patch _nh_malloc/_heap_alloc with malloc_stub as well. Read more!

Sunday, March 1, 2009

View frustum culling optimization – Never let me branch

In previous iteration we converted the code to SoA instead of AoS, which enabled us to transform OBB points to world space relatively painlessly, and eliminated ugly and slow dot product, thus making the code faster. Still, the code is slow. Why?

Well, as it appears, the problem is branching.

I wanted to write a long post about branches and why they are often a bad idea for PPU/SPU, but it turns out that Mike Acton beat me to it – be sure to read his articles for detailed explanation: part 1 part 2 part 3 - so I'll make it short. For our case, there are two problems with branching:

First, code performance depends on input data. Visible boxes are worst case (this is the one the cycle count is for); invisible boxes are faster, with the fastest case (where the box is behind the first plane) taking 128 cycles. Because of this, it's hard to estimate the run time of culling, given the number of objects – upper bound is three times bigger than lower bound.

Second, branches divide the code in blocks, and compiler has problems performing optimizations between blocks. We have a constant-length loop, inside we compute 8 dot products for a single plane, then check if all of them are negative, and in this case we early-out. Note that there are a lot of dependencies in computation of dot products – si_fmas in dot4 depend on the result of previous si_fmas, si_fcgt depends on the result of dot4, etc. Here is an example of disassembly for performing a single dot4 operation, assuming that we already have SPLAT(v, i) in registers:

fma res, v2, z, v3
fma res, v1, y, res
fma res, v0, x, res

Pretty reasonable? Well, not exactly. While we have 3 instructions, each one depends on the result of the previous one, so we can use our result in 18 cycles instead of 3 (fma latency is 6 cycles). If we need to compute 6 dot4, and we have some sort of branching after each one, like we had in the code for previous attempt, we'll pay the cost of 18 cycles for each iteration (of course, there'll also be some cost associated with comparison and branching). On the other hand, if we computed all 6 dot4 without any branches, the code could've looked like:

fma res[0], v2[0], z[0], v3[0]
fma res[1], v2[1], z[1], v3[1]
fma res[5], v2[5], z[5], v3[5]

fma res[0], v1[0], y[0], res[0]
fma res[5], v1[5], y[5], res[5]

fma res[0], v0[0], x[0], res[0]
fma res[5], v0[5], x[5], res[5]

This code has 18 instructions, and all results are computed in 24 cycles – but we're computing 6 dot4 instead of 1! Also 24 cycles is the latency for res[5] – we can start working on res[0] immediately after last fma gets issued.

The problem is not only related to instruction latency (in our case, register dependencies), but also to pipeline stalls – SPU has two pipelines (even and odd), and can issue one instruction per pipeline per cycle for, uhm, perfect code – each type of instruction can be issued only on one of the pipes, for example arithmetic instructions belong to even pipe, load/store/shuffle instructions belong to odd one. Because of this shuffles can be free if they dual-issue with arithmetics and do not cause subsequent dependency stalls.

Compiler tries to rearrange instructions in order to minimize all stalls – register dependencies, pipeline stalls and some other types – but it is often not allowed to do it between branches. Because of this it's best to eliminate all branches – compiler will be left with a single block of instructions and will be able to do a pretty good job hiding latencies/dual-issuing instructions. This is often critical – for example, our current version wastes almost half of cycles while waiting for results because of register dependency.

Of course, eliminating branches is often a tradeoff – sometimes it makes worst-case run faster, but best-case now runs slower, as we observed last time with x86 code. The decision depends on your goals and on frequency of various cases – remember that branchless code will give you a guaranteed (and usually acceptable) lower bound on performance.

So, in order to eliminate branches, we'll restructure our code a bit – instead of checking for each plane if all points are outside, we'll check if any point is inside, i.e. if the box is not outside of the plane:

static inline qword is_not_outside(qword plane, const qword* points_ws_0, const qword* points_ws_1)
    qword dp0 = dot4(plane, points_ws_0[0], points_ws_0[1], points_ws_0[2]);
    qword dp1 = dot4(plane, points_ws_1[0], points_ws_1[1], points_ws_1[2]);

    qword dp0pos = si_fcgt(dp0, (qword)(0));
    qword dp1pos = si_fcgt(dp1, (qword)(0));

    return si_orx(si_or(dp0pos, dp1pos));

si_orx is a horizontal or (or across) instruction, which ors 4 32-bit components of source register together and returns the result in preferred slot, filling the rest of vector with zeroes. Thus is_not_outside will return 0xffffffff in preferred slot if box is not outside of plane, and 0 if it's outside.

Now all we have to do is to call this function for all planes, and combine the results – we can do it with si_and, since the box is not outside of the frustum only if it's not outside of all planes; if any is_not_outside call returns 0, we have to return 0.

    // for each plane...
    qword nout0 = is_not_outside((qword)frustum->planes[0], points_ws_0, points_ws_1);
    qword nout1 = is_not_outside((qword)frustum->planes[1], points_ws_0, points_ws_1);
    qword nout2 = is_not_outside((qword)frustum->planes[2], points_ws_0, points_ws_1);
    qword nout3 = is_not_outside((qword)frustum->planes[3], points_ws_0, points_ws_1);
    qword nout4 = is_not_outside((qword)frustum->planes[4], points_ws_0, points_ws_1);
    qword nout5 = is_not_outside((qword)frustum->planes[5], points_ws_0, points_ws_1);

    // merge "not outside" flags
    qword nout01 = si_and(nout0, nout1);
    qword nout012 = si_and(nout01, nout2);

    qword nout34 = si_and(nout3, nout4);
    qword nout345 = si_and(nout34, nout5);

    qword nout = si_and(nout012, nout345);

    return si_to_uint(nout);

I changed return type for is_visible to unsigned int, with 0 meaning false and 0xffffffff meaning true; this won't change client code, but slightly improves performance.

Now when we compute everything in a single block, compiler schedules instructions in a way that we waste close to zero cycles because of latency. The new branchless version runs at 119 cycles, which is more than 3 times faster than the previous version, and 10 times faster than initial scalar version. This results in 37 msec for million calls, which is almost 2 times faster than fastest result on x86 (finally!). Moreover, this is slightly faster than the best case of previous version – so there is no tradeoff here, new version is always faster than old one. Note that eliminating branches is not worth it for x86 code (i.e. it does not make worst case faster, which is expected, if you remember that we had to do 2 checks per plane in order to make SoA approach faster than AoS).

The current source can be grabbed here.

That's all for now – stay tuned for the next weekend's post! I plan to post something not VFC-related the next week, then another VFC post the week after that. If you're starting to hate frustums, SPU, me and my blog - sorry about that, but we'll be done with VFC some day, I swear! :) Read more!