Tuesday, September 29, 2009

On joys and sorrows of library development – part 1

This may come as a surprise, but I am not dead. In fact, what you see is a new post! As usual I have a lot of interesting themes to cover, and barely enough time to spare. While I'm at it, let me tell you about NDAs. I hate NDAs with a passion – I've got some things to blog about that are partially covered by NDA (of course, the interesting parts are NOT); also I've been thinking that this is a non-issue and basically that I can blog about things that are not quite critical, but half a year ago or so I was forced to remove a blog post; the reasons are not exactly clear but it seems that it was because of a single sentence that mentioned something that's NOT secret in my point of view and was NOT relevant to post contents. For this reason I'm hesitant to write about some topics so I'll either skip them altogether (which is a shame) or find a way to omit all details that might seem sensitive to people. Also I'm not sure if blogging about post removal due to NDA is an NDA violation?..

Anyway, the topic for today is something different – I'll write a bit about library development. In past few years I've developed and maintained a C++ XML parser PugiXML. This is a tiny library which focuses on performance and ease of use. We've had tremendous speedups of export process after converting from TinyXML, and I know lots of other success stories. PugiXML is portable (lots of platforms and compilers are supported, I've gone through special efforts to support MSVC6 and old CodeWarriors), console-aware (i.e. you can toggle off STL/exception support, override memory management, etc.), small, robust, etc. It even features an XPath evaluator!

PugiXML was born as a project to clean up pugxml – initial idea was to strip pugxml header from sources (thus reducing compilation/linking times), slightly cleanup interface and use it. What followed was an almost complete rewrite of the code, bringing the parser closer to standard compliance, adding useful features for DOM inspection, and greatly improving speed. There are bits of code left from pugxml, and interface is very similar, but it's quite a different project now. As far as I know, the only parser in use that beats PugiXML at parsing speed is RapidXML, and the only major problem with PugiXML is that it's Unicode support is pretty much limited by UTF8. Though both of those may change at some point in the future :)

I'm going to write some stuff here that may be of interest to other people.

1. Interface

The initial API was taken as-is from pugxml; in the hindsight, this was both a good (since it offered a very simple transition for pugxml users) and bad thing. It's a bad thing because the interface is seriously cluttered.

For example, there are at least four methods of traversing children nodes: you can use the next_sibling() function (the DOM is structured as a graph of nodes, with nodes connected via pointers; each node contains a pointer to both right and left siblings, the function gets the right one), you can use the node iterator, you can use xml_tree_walker (which is a Visitor-like interface), and finally you can grab all child elements via an insert iterator with all_elements_by_name(). Oh, and you can use XPath, which makes five methods.

As another example, every method for string-based queries (i.e. xml_node::attribute(const char*), which means “give me the first attribute with the following name) has a corresponding method which uses wildcard matching instead of string comparison (i.e. node.attribute_w(“foo*ba?”) will match foobar or fooatbaz).

Overall, it's not that much (I have a friend who's been working with a codebase that has an interface with 760+ virtual functions, so I'm not easily scared) and it does not stand in the way while you're using the library, but it certainly does not help maintaining and developing it.

But the worst part is that I can't remove any of those functions. For example, I consider tree walker to be a bad abstraction; it's rarely usable, and if it is, it's easy to write it outside the library. If I had a full API usage statistics, I could've made a conscious decision – either nobody uses it and I remove it, or there are very few who do and I extract it into an external helper class in an external header (possibly changing the interface slightly), or it's a feature that is used in every second application that uses my library and I can't do anything. The problem is I have no statistics, so I can't do anything.

Other than that, I feel the interface to be good (I use it relatively often both in my pet projects and at work, so if there was something that annoyed me I would've fixed that); the best decision for me is pointer abstraction – in pugixml you don't work with pointers to node (as with TinyXML), you work with tiny pointer wrapper class (the size is equal to that of a pointer) that's passed by value; the point is that there is no null pointer exception, all operations on “null” nodes/attributes are perfectly defined. Of course, the same could be done with a pointer API by using a dummy object instead of null pointer, what matters is the decision to protect the user. Also I find that this makes parsing code much more concise – you don't have to do error handling for every API call!

2. Performance

The parsing performance is very good, on COLLADA files it's hundreds of megabytes per second (probably closer to gigabyte); the bottleneck is always HDD read speed unless the file is cached. Of course, it's still slightly slower than it could be; also the performance comes for a price of not being fully standard compliant – it manifests in allowing certain XML standard violations, such as disallowed Unicode symbols in attribute/node names, multiple node attributes with the same name, etc. This means that while any correct XML file will be parsed, some malformed ones will not be rejected. Up to some point there even were flags to make parser pass certain standard violations (i.e. there was a mode that could handle HTML-style unclosed tags by scanning for matching open tag and automatically closing all descendants), but I removed them to reduce clutter (that was at the point when parser was used by me and a couple of friends so no harm done).

The memory consumption is also good enough (when we switched from TinyXML at work, we got ~2x improvement in terms of memory required to parse a DOM tree), although it could be better. Surprisingly this was achieved without any tricks that I love (take the pointer, take lower N bits, stuff something useful in there, pretend that everything was that way) and almost without any bit-packing.

All good things come at a price – the parser currently requires that the whole XML file is a large contiguous chunk of memory (i.e. if you have a 200 Mb file to parse, you have to have a 200 Mb chunk of address space); also, this chunk dies with the document so in the worst case PugiXML can lose in peak memory consumption if you modify your tree too much (i.e. load a 200 Mb document from file, remove all nodes, add an equivalent amount of contents by hand – the memory overhead of PugiXML will be i.e. 400 Mb (larger than that because nodes take some space too), the memory overhead of a typical parser will be 200 Mb). Of course this is almost never a problem in practice.

Next time: performance highlights (tricks to make parsing fast, saving performance), user requests, documentation, portability concerns Read more!

Monday, June 8, 2009

Implementing Direct3D for fun and profit

I can't believe I'm writing this, it's been what, 2 months? During that time a lot of things happened – I've been to the conference and gave an hour-long talk about our SPU rendering stuff (which was more or less well received), I've almost completed an occlusion subsystem (rasterization-based), which is giving good results; and the financial crisis has finally hit the company I work at – some projects are freezed due to the lack of funding, and some people are fired. It's kind of sad walking through half-empty offices... Anyway, I know I promised to write often but as I am actively developing my pet engine at home and there is a lot of stuff to work on at my day job, so time is a scarce resource for me. My blog/todo.txt file is already 20 entries long, where some things are too small to deserve a post, and others demand a lengthy series. I'll try to select something interesting from time to time and blog about it. As for todays topic,

Every object in core Direct3D (I'll be talking about 9 today, but the same thing should apply to 10 and 11) is an interface. This means that the details of actual implementation is hidden from us, but this also means that we can implement those interfaces. Why could we want to do that?

Reverse engineering
If you work in game industry/computer graphics, or, well, any other IT-related field, I suppose, then you should be constantly gaining new knowledge; otherwise your qualification as a specialist will decrease very fast. There are lots of ways to learn, and one of the best is to learn from others experience. Unfortunately, while there is a lot of information on the technology of some titles, most are not described at all. Also sometimes the descriptions are inaccurate – after all, devil is in the details. So what you can do is take an existing title and reverse-engineer it – that is, gain information about implementation details from the outside. Disclaimer: Of course, this information is provided only for educational value. Reverse engineering can violate the laws of your country and/or the EULA of the product. Don't use it if it does.

In PC / Direct3D world there are two primary tools than can allow such introspection – NVidia PerfHUD and Microsoft PIX. There is also a beta of Intel GPA (which is, by the way, quite promising, if lacking polish), but it is more or less like PIX. Using PIX does not require modifications of the host program, however PIX does not work for some titles (it might crash), is slow (especially for titles with complex scenes, lots of draw calls, etc.) and is not very convenient to use as a reverse engineering tool for other reasons.

PerfHUD is more useful in some areas, but you need to create Direct3D device with a special adapter and in REF mode in order for PerfHUD to work. While some games already have this kind of support in released version (notable examples include The Elder Scrolls 4: Oblivion and S.T.A.L.K.E.R. - Shadows of Chernobyl), others are more careful (I hope if you're reading this blog you have a build configuration such as Master or Retail, which sets appropriate defines so you can compile development-only stuff, such as asset reloading, profiling or NVPerfHUD support) out of the executable). But still if you manage to intercept the call to Direct3DCreate9 (which can be done for example by creating a DLL, calling it d3d9.dll and putting it near the game executable), you can return a proxy IDirect3D9 object, that forwards all calls to the actual object, except that it modifies the adapter/device type that are passed to CreateDevice. In fact, such proxy objects are used by both PIX and GPA, though the injection technique is more complex.

There are even some programs that simplify the following for you, allowing you to run any title in PerfHUD-compatible mode.

Multithreaded rendering
In fact, this is already described in a Gamefest 2008 presentation “Practical Parallel Rendering with DirectX 9 and 10, Windows PC Command Buffer Recording” (you can get slides and example code here). Basically, since neither Direct3D9 nor Direct3D10 support proper multithreading (creating device as multithreaded means that all device calls will be synchronized with one per-device critical section), you can emulate it via a special proxy device, which records all rendering calls in a buffer, and then uses the buffer to replay the command stream via real device. This saves processing time for other rendering work you do alongside API calls by allowing it to work in multiple threads, and is a good stub for deferred context functionality that's available on other platforms (including Direct3D11 and all console platforms). I use this technique in my pet engine mainly for the purpose of portability – I can render different parts of the scene into different contexts simultaneously, and then “kick” the deferred context via the main one. On PS3 the “kick” part is very lightweight, so the savings are huge; on Windows during the “kick” part deferred context replays the command stream, so it can be quite heavy, but it's faster than doing everything in one thread, and the code works the same way. When I start supporting Direct3D11, the same code will work concurrently, provided a good driver/runtime support of course.

Note that I don't use Emergent library as is – I consider it too heavyweight and obscure for my purposes. They try to support all Direct3D calls, while I use only a handful – I don't use FFP, I don't create resources via this device, etc. My implementation is simple and straightforward, and is only 23 Kb in size (11 of which are reused in another component – see below). If anybody wants to use it I can provide the code to you to save you an hour of work – just drop a comment.

Currently my implementation has a fixed size command buffer, so if you exceed it, you're doomed. There are several more or less obvious ways to fix this, but I hope that by the time I get to it I'll already have D3D11 in place.

Asset pipeline
My asset pipeline is more or less the same for all asset types – there is a source for the asset (Maya/Max scene, texture, sound file, etc.), which is converted via some set of actions to a platform-specific binary that can be loaded by the engine. In this way the complexity of dealing with different resource formats, complex structures, data non suitable for runtime, etc. is moved from engine to tools, which is great since it reduces the amount of runtime code, making it more robust and easier to maintain. The data is saved to a custom format which is optimized for loading time (target endianness, platform-specific data layout/format for graphics resources, compression). I think I'll blog about some interesting aspects/choices in the future as time permits (for example, about my experience of using build systems, such as SCons and Jam, for data builds), but for now I'll focus on a tool that builds textures.

This tool loads the texture file, generates mipmap levels for the texture if necessary (if it was not a DDS with mip chain, and if target texture requires mipmap levels), compresses it to DXTn if necessary (again, that depends on source format and building settings), and makes some other actions, both platform-specific and platform-independent. In order for it to work, I need an image library that can load image formats I care about, including DDS with DXTn contents (so that I don't need to unpack/repack it every time, and so that artists can tweak DXT compression settings in Photoshop plugin – in my experience there is rarely a visible difference, but if they give me a texture and I compress it to DXT and there are some artifacts, I'm to blame – and if they use Photoshop, it's not my scope :)). As it turns out, D3DX is a good enough image loading library, at least it works for me (although in retrospect I probably should've used DevIL, and perhaps I will switch to it in the future).

Anyway, to load a texture via D3DX, you need a Direct3D device. As it turns out, while you can create a working REF device in under 10 lines of code (using desktop window and hardcoded settings), you can't create any device, including NULLREF, if your PC does not have a monitor attached. This problem appeared once I got my pipeline working via IncrediBuild, and sometimes on some machines texture building failed. Since I did not want to modify my code too much, I ended implementing another proxy device, which is suitable for loading a texture with D3DX functions. This time it was slightly harder, because I needed implementations for some functions of IDirect3DDevice9, IDirect3DTexture9 and IDirect3DSurface9, but again the resulting code is quite small and simple – 6 Kb (plus the 11 Kb dummy device I mentioned earlier), and I can load any 2D texture. Of course I'll need to add some code to load cubemaps and even more code to load volume textures, but for now it's fine the way it is.

So these are some examples of situations where implementing Direct3D interfaces might prove useful. The next post is going to either be about multithreading, or about some asset pipeline-related stuff, I guess I'll decide once I get to writing it.
Read more!

Tuesday, March 31, 2009


I'm sorry for the lack of real post - it was a busy week, and a somewhat busy month lies ahead - I'm attending a local game conference in May and giving a speech about the process of porting our rendering subsystem to SPU (I hope to cover this topic here some day), so some time is spent preparing slides/etc.; my pet projects demand more attention than usual; there's some weird but nevertheless interesting stuff at work... I'll try to keep up, but you should really expect some more weeks without any posts. Don't beat me.

Anyway, a bunch of slides from GDC09 Tutorial sessions are finally uploaded; there is some good stuff in "Advanced Visual Effects with Direct3D", and there's some awesome stuff in "Insomniac Games' Secrets of Console and Playstation 3 Programming". I mean, finally someone told people who compute view-space normal Z as sqrt(1 - x^2 - y^2) that they don't know what they're doing! Not to mention SPU stuff, like KISS SPU scheduler (we have a simple enough custom scheduler at work, but it's still far), SPU debugging stories and other SPU talks. By the way, if you're interested in SPU-related topics and have not read everything here, then you don't take SPU seriously.

There are also Khronos' slides here - don't read them unless you have absolutely nothing to do. Read more!

Sunday, March 22, 2009


There is a bunch of small notes I'd like to share – none of them deserves a post, but I don't want them to disappear forever.

Using hash table as a fixed-size cache

When I worked with Direct3D 10, I found state objects quite cumbersome to work with – they're very slow to create (or at least were back then) and the exact separation of states into objects was sometimes inconvenient from design point of view. Also I've already had a set of classes that divided states into groups and functions like setDepthState with redundancy checking, so I needed to write an implementation for existing interface. The solution I came up with was very simple and elegant, so I'd like to outline it once more (although I sort of mentioned it in the original post).

The natural thing to do here is to cache state object pointer inside state class, and recompute it if necessary (when binding newly created/modified state). There are two issues to solve here – 1. state object creation is expensive (even if you're creating an object with the same data several times in a row – in which case D3D10 runtime returns the same pointer – the call takes 10k cycles), 2. there is a limit on the amount of state objects (4096 for each type). Solving the first one is easy – just make a cache with key being state object description and value being the actual pointer; solving the second one is slightly harder because you'll have to evict entries from your cache based on some policy.

The way I went with was to create a fixed size array (the size should be a power of two less or equal than 4096 and depends on the state usage pattern), make a hash function for state description and use this array as a cache indexed by hash. In case of cache collision the old state object got released.

I often use simple non-resizable hash tables (recent examples include path lookup table in flat file system and vertex data hash to compute index buffer from COLLADA streams), but I always insert collision handling code – but, as it turns out, in this case you can omit it and get some benefit at the same time.

Direct3D 10 Read/Write hazard woes

While in some ways Direct3D 10 is clearly an improvement over Direct3D 9, in lots of areas they screwed it. I surely can't count all deficiencies using my both hands, but some problems annoy me more than the others. One of the things that leads to possible performance/memory compromises is resource Read/Write hazard. There are several inputs for various stages (shader resources (textures, buffers), constant buffers, vertex/index buffers) and several output ones (render targets, depth surfaces, stream out buffers), and there are resources that can be bound to both input and output stage; for example, you can bind a texture to output stage as a render target, then render something to it, and then bind the same texture as a shader resource so that shader can sample rendered data from it. However, Direct3D 10 Runtime does not allow a resource to be bound to both input and output stage at the same time.

One disadvantage is that sometimes you'd like to do an in-place update of render target – for example, to do color correction or some other transformation. In fact, this is a perfectly well-defined operation – at least on NVidia hardware – if you're always reading the same pixel you're writing to; otherwise you'll get old value for some pixels and new one for others. Here there is an actual read/write hazard, but due to the specific hardware knowledge we can exploit it to save memory.

Another disadvantage is that a resource being bound to output pipeline stage does not mean it's being written to! A common example is soft particles – Direct3D 10 introduced cross-platform unified depth textures so that you can apply postprocessing effects that require scene depth without extra pass to output depth in a texture or MRT – you can use the same depth buffer you were using for scene render as a texture input to the shader. While this works perfectly for post processing (except for the fact that you can't read depth from MSAA surfaces – ugh...), it fails miserably for soft particles. You usually disable depth writes for particles so there is no real read/write hazard, but because the runtime thinks there is one you can't bind depth buffer so that HW performs depth test – you can only do depth testing in pixel shader yourself via discard. This disables early coarse/fine Z culling, which results in abysmal performance.

Luckily MSAA depth readback is supported in D3D10.1, and in D3D11 you can bind resources to output pipeline stages as read-only. Too bad there is no D3D11 HW yet, and D3D10.1 is not supported by NVidia...

Knowing the class layout – vfptr

There are two weird points regarding class layout and vfptr (virtual function table pointer) that I'd like to note here – they are related to very simple cases, I'm not going to talk about multiple or god forbid virtual inheritance here.

Why do you need to know class layout? Well, it's useful while writing code so your classes can occupy less space, it's extremely useful while debugging obscure bugs, and you can't even start doing in-place loading/saving without such knowledge (I think I'll make a special post regarding in-place stuff soon). And don't even get me started on debuggers that can't display anything except registers and (if you're lucky) primitive locals – we used to have such debugger on PSP, and CodeWarrior for Wii is only slightly better.

Anyway, the first weird point is related to CodeWarrior – it had been like this on PS2, and it's like this on Wii – I doubt that'll ever change. You see, while on normal compilers there is no way to control vfptr placement – for simple classes without inheritance it always goes in the first word – on CodeWarrior it lies in the place of declaration – except that you can't declare vfptr in C++, so it lies in the place where the first virtual function is declared. Some examples follow:

// layout is vfptr, a, b
struct Foo { virtual void foo1(); unsigned int a; unsigned int b; };

// layout is a, vfptr, b
struct Foo { unsigned int a; virtual void foo1(); unsigned int b; };

// layout is a, vfptr, b
struct Foo { unsigned int a; virtual void foo1(); unsigned int b; virtual void foo2(); };

// layout is a, b, vfptr
struct Foo { unsigned int a; unsigned int b; virtual void foo2(); };

Marvelous, isn't it? Now there is an entry in our coding standard at work which says “first virtual function declaration has to appear before any member declarations”.

The second point was discovered only recently and appears to happen with MSVC. Let's look at the following classes:

struct Foo1 { virtual void foo1(); unsigned int a; float b; };
struct Foo2 { virtual void foo1(); unsigned int a; double b; };

Assuming sizeof(unsigned int) == 4, sizeof(float) == 4, sizeof(double) == 8, what are the layouts of the classes? A couple of days ago I'd say that:

Foo1: 4 bytes for vfptr, 4 bytes for a, 4 bytes for b; alignof(Foo1) == 4
Foo2: 4 bytes for vfptr, 4 bytes for a, 8 bytes for b; alignof(Foo2) == 8

And in fact this is exactly the way these classes are laid out in GCC (PS3/Win32), CodeWarrior (Wii) and other relatively sane compilers; MSVC however chooses the following layout for Foo2:

Foo2: 4 bytes for vfptr, 4 bytes of padding, 4 bytes for a, 4 bytes of padding, 8 bytes for b; alignof(Foo2) == 8

Of course the amount of padding increases if we replace double with i.e. __m128. I don't see any reason for such memory wastage, but that's the way things are implemented, and again I doubt this will ever change.

Optimizing build times with Direct3D 9

Yesterday after making some finishing touches to D3D9 implementation of some functions in my pet project (which is coincidentally a game engine wannabe), I hit rebuild and could not help noticing the difference in compilation speed for different files. The files that did not include any heavy platform-specific headers (such as windows.h or d3d9.h) were compiled almost immediately, files with windows.h included were slightly slower (don't forget to define WIN32_LEAN_AND_MEAN!), and files with d3d9.h were slow as hell compared to them – the compilation delay was clearly visible. Upon examination I understood that including windows.h alone gets you 651 Kb of preprocessed source (all numbers are generated via cl /EP, so the source doesn't include #line directives; also WIN32_LEAN_AND_MEAN is included in compilation flags), and including d3d9.h results in a 1.5 Mb source.

Well, I care about compilation times, so I decided to make things right – after all, d3d9.h can't require EVERYTHING in windows.h and other headers it includes. After half an hour of work, I arrived with minid3d9.h (which can be downloaded here).

Including minid3d9.h gets you 171 Kb of preprocessed source, which is much better. This file defines everything that's necessary for d3d9.h and also a couple of things my D3D9 code used (i.e. SUCCEDED/FAILED macros); you might need to add something else – it's not always a drop-in replacement. Also I've taken some measures that enable safe inclusion of this file after CRT/Platform SDK headers, but don't include it before them – generally, include it after everything else.

This decreased the full rebuild time by 30% for me (even though D3D9 code is less than 15% in terms of code size and less than 10% in terms of translation unit count) – I certainly expected less benefit! You're free to use this at your own risk; remember that I did not test in on 64-bit platform so perhaps it needs more work there. Read more!

Sunday, March 15, 2009

View frustum culling optimization – Representation matters

Before getting into professional game development I've spent a fair amount of time doing it for fun (in fact, I still do it now, although less intensively). The knowledge came from a variety of sources, but the only way that I knew and used to calculate frustum planes equations was as follows – get the equations in clip space (they're really simple – (1, 0, 0, 1), (0, -1, 0, 1), etc.) and then get world space ones by transforming the planes with inverse transpose of view projection camera matrix [correction: in fact, you need to transform with inverse transpose of inverse view projection matrix, which equals to just transpose of view projection matrix]. It's very simple and intuitive – if you know a simple way to express what you need in some space, and a simple way to transform things from that space to your target one, you're good to go.

Imagine my surprise when I started doing game development as a day job and after some time accidentally stumbled upon a piece of our codebase that calculated frustum planes in a completely different way. Given a usual perspective camera setup, it calculated frustum points via some trigonometry (utilizing knowledge about vertical/horizontal FOV angles, near/far distances and the fact that it's a perspective camera without any unusual alterations), and then used them to obtain the equations. I thought it to be very weird – after all, it's more complex and is constrained to specific camera representation, whereas clip space method works for any camera that can be set up for rendering (orthographic projection, oblique-clipping, etc.).

But as it turns out, the same thing can be said about our culling code. It's quite good at culling given box against an arbitrary set of planes (i.e. if you use it for portal/anti-portal culling with arbitrary shape of portals/occluders), but since we have a usual frustum, maybe we can improve it by going to clip space, entirely skipping world space? Let's try it.

We're going to transform AABB points to clip space, and then test them against frustum planes in clip space. Note that we can't divide by w after transforming – that will lead to culling bugs because post-projective space exhibits a discontinuity at the plane with equation z = 0 in view space; however, this is not needed – the frustum plane equations in clip space are as follows:

x >= -w, x <= w: left/right planes
y >= -w, y <= w: top/bottom planes
z >= 0, z <= w: near/far planes

Note that if you're using OpenGL clip space convention, the near plane equation is z >= -w; this is a minor change to the culling procedure.

First, to transform points to clip space, we're going to need a world view projection matrix – I hope the code does not require any additional explanations:

static inline void transform_matrix(struct matrix_t* dest, const struct matrix_t* lhs, const struct matrix_t* rhs)
#define COMP_0(c) \
    qword res_ ## c = si_fm((qword)lhs->row2, SPLAT((qword)rhs->row ## c, 2)); \
    res_ ## c = si_fma((qword)lhs->row1, SPLAT((qword)rhs->row ## c, 1), res_ ## c); \
    res_ ## c = si_fma((qword)lhs->row0, SPLAT((qword)rhs->row ## c, 0), res_ ## c); \
    dest->row ## c = (vec_float4)res_ ## c;

#define COMP_1(c) \
    qword res_ ## c = si_fma((qword)lhs->row2, SPLAT((qword)rhs->row ## c, 2), (qword)lhs->row3); \
    res_ ## c = si_fma((qword)lhs->row1, SPLAT((qword)rhs->row ## c, 1), res_ ## c); \
    res_ ## c = si_fma((qword)lhs->row0, SPLAT((qword)rhs->row ## c, 0), res_ ## c); \
    dest->row ## c = (vec_float4)res_ ## c;


#undef COMP_0
#undef COMP_1

After that we'll transform the points to clip space, yielding 2 groups with 4 vectors (x, y, z, w) in each one; the code is almost the same as in the previous post, only we now have 4 components:

static inline void transform_points_4(qword* dest, qword x, qword y, qword z, const struct matrix_t* mat)
#define COMP(c) \
    qword res_ ## c = SPLAT((qword)mat->row3, c); \
    res_ ## c = si_fma(z, SPLAT((qword)mat->row2, c), res_ ## c); \
    res_ ## c = si_fma(y, SPLAT((qword)mat->row1, c), res_ ## c); \
    res_ ## c = si_fma(x, SPLAT((qword)mat->row0, c), res_ ## c); \
    dest[c] = res_ ## c;

#undef COMP

    // transform points to clip space
    qword points_cs_0[4];
    qword points_cs_1[4];

    transform_points_4(points_cs_0, minmax_x, minmax_y, minmax_z_0, &clip);
    transform_points_4(points_cs_1, minmax_x, minmax_y, minmax_z_1, &clip);

If all 8 clip-space points are outside left plane, i.e. if for all 8 points p.x <= -p.w, then the box is completely outside. Since we're going to use SoA layout, such tests are very easy to perform. We'll need a vector which contains -w for 4 points; SPU do not have a special negation instruction, but you can easily emulate it either by subtracting from zero or by xoring with 0x80000000. Theoretically xor is better (it has 2 cycles of latency, subtract has 6), but in our case there is no difference in speed; I'll use xor nonetheless:

    // calculate -w
    qword points_cs_0_negw = si_xor(points_cs_0[3], (qword)(vec_uint4)(0x80000000));
    qword points_cs_1_negw = si_xor(points_cs_1[3], (qword)(vec_uint4)(0x80000000));

Now we'll calculate “not outside” flags for each plane; the method is exactly the same as in previous post (as is the final result computation), only now we're not doing dot products.

    // for each plane...
    #define NOUT(a, b, c, d) si_orx(si_or(si_fcgt(a, b), si_fcgt(c, d)))

    qword nout0 = NOUT(points_cs_0[0], points_cs_0_negw, points_cs_1[0], points_cs_1_negw);
    qword nout1 = NOUT(points_cs_0[3], points_cs_0[0], points_cs_1[3], points_cs_1[0]);
    qword nout2 = NOUT(points_cs_0[1], points_cs_0_negw, points_cs_1[1], points_cs_1_negw);
    qword nout3 = NOUT(points_cs_0[3], points_cs_0[1], points_cs_1[3], points_cs_1[1]);
    qword nout4 = NOUT(points_cs_0[2], (qword)(0), points_cs_1[2], (qword)(0));
    qword nout5 = NOUT(points_cs_0[3], points_cs_0[2], points_cs_1[3], points_cs_1[2]);

    #undef NOUT

This is the final version. It runs at 104 cycles per test, so it's slightly faster than the last version. But this method is better for another reason – we've calculated clip space positions of box vertices as a by-product (also we've calculated world view projection matrix, but it's likely to be of little further use, because usually shader constant setup happens at a later point in another module). Some things you can do with them:

Feed them as an input to the rasterizer to do rasterization-based occlusion culling (simple depth buffer, HOM, etc.). This is the road I have not taken yet, though I hope I will do it some day.
Use them for screen size culling – if your bounding box when projected to the screen is small enough (i.e. less than 3x3 pixels), you usually can safely throw it away. This is what I do in our production code; it involves dividing positions by w (don't forget to discard the size culling results if any point has w < epsilon!), computing min/max x/y for the results, subtracting min from max and checking if the difference along each axis is less than threshold. The actual implementation is left as an exercise to the reader.

This concludes the computation part of view-frustum culling. There is still something to do here – there is p/n-vertex approach which I did not implement (but I'm certain that it won't be a win over my current methods on SPUs); there are minor potential improvements to the current code that are not worth the trouble for me; all implemented tests return only binary result (outside / not outside), a ternary version can help in hierarchical culling (though this can be achieved with a minor modification to all presented code). There might be something else I can't think of now – post a comment if you'd like to hear about other VFC-related topics!

I'm going to write one final post regarding VFC, which deals with the code that will use is_visible to perform culling of the given batch array – the topics include DMA and double buffering; after that the whole VFC series will be over and I'm going to switch to something different.

The complete source for this post can be grabbed here. Read more!

Sunday, March 8, 2009

Fighting against CRT heap and winning

Memory management is one of (many) cornerstones of tech quality for console games. Proper memory management can decrease amount of bugs, increase product quality (for example, by eliminating desperate pre-release asset shrinking) and generally make life way easier – long term, that is. Improper memory management can wreak havoc. For example, any middleware without means to control/override memory management is, well, often not an option; any subsystem that uncontrollably allocates memory can and will lead to problems and thus needs redesigning/reimplementing. While you can tolerate more reckless memory handling on PC, it often results in negative user experience as well.

In my opinion, there are two steps to proper memory management. First one is global and affects all code – it's memory separation and budgeting. Every subsystem has to live in its own memory area of fixed size (of course, size can be fixed for the whole game or vary per level, this is not essential). This has several benefits:

  • memory fragmentation is now local – subsystems don't fragment each other's storage, thus fragmentation problems happen less frequently and can be reproduced and fixed faster

  • fixed sizes mean explicit budgets – because of them out of memory problems are again local and easily tracked to their source. For example, there is no more “game does not fit in video memory, let's resize some textures” - instead, you know that i.e. level textures fit in their budget perfectly, but the UI artists added several screen-size backgrounds, overflowing UI texture budget

  • because each subsystem lives in its own area, we have detailed memory statistics for no additional work, which again is a good thing for obvious reasons

  • if memory areas have fixed sizes, they either have fixed addresses or it's easy to trace address range for each of them – this helps somewhat in debugging complex bugs

Second one is local to each subsystem – once you know that your data lives in a fixed area, you have to come up with a way to lay your data in this area. The exact decisions are specific to the nature of data and are up to the programmer; this is out of this post's scope.

Memory is divided into regions, each region is attributed to a single subsystem/usage type – if we accept this, it becomes apparent that any unattributed allocations (i.e. any allocations into global heap) are there either because nobody knows where they should belong or because the person who coded those does not want to think about memory – which is even worse (strict separation and budgeting makes things more complicated in short term by forcing people to think about memory usage – but that's a good thing!). Because of this global heap contains junk by definition and thus should ideally be eliminated altogether, or if this is not possible for some reason, should be of limited and rather small size.

Now that we know the goal, it's necessary to implement it – i.e. we want to have a way to replace allocations in global heap with either fatal errors or allocations in our own small memory area. On different platforms there are different ways to do it – for example, on PS3 there is a documented (and easy) way to override CRT memory management functions (malloc/free/etc.); on other platforms with GNU-based toolchain there is often a --wrap linker switch – however, on some platforms, like Windows (assuming MSVC), there does not seem to be a clean way to do it. In fact, it seems that the only known solution is to modify the CRT code. I work with statically linked CRT, so this would mean less distribution problems, but more development ones – I'd have to either replace prebuilt CRT libraries (which is out of the question because it makes working with other projects impossible) or ignore them and link my own, which is better – but still, the process required building my own (hacked) version of CRT. I did not like the approach, so I came up with my own.

First, some disclaimers. This code is tested for statically linked Win32 CRT only – it requires some modifications to work on Win64 or with dynamically linked CRT – I might do the Win64 part some day, but not DLL CRT. Also I'm not too clear on EULA issues; because of this, I'll post my entire code except for one function that's essentially ripped from CRT and fixed so that it compiles – read further for more details. Finally, there may be some unresolved issues with CRT functions I don't currently use (though I think my solution covers most of them) – basically, this is a demonstration of approach with proof-of-concept code, and if you decide to use it you're expected to fix the problems if they arise :)

Our first priority is to replace CRT allocation functions without modifying libraries. There are basically two ways to do something like it – link time and run time. Link time approach involves telling the linker somehow that instead of existing functions it should use the ones supplied by us. Unfortunately, there does not seem to be a way to do this except /FORCE:MULTIPLE, which results in annoying linker warnings and disables incremental linking. Run time way involves patching code after the executable is started – hooking libraries like Detours do it, but we don't need such a heavyweight solution here. In fact, all that's needed is a simple function:

static inline void patch_with_jump(void* dest, void* address)
    // get offset for relative jmp
    unsigned int offset = (unsigned int)((char*)address - (char*)dest - 5);
    // unprotect memory
    unsigned long old_protect;
    VirtualProtect(dest, 5, PAGE_READWRITE, &old_protect);
    // write jmp
    *(unsigned char*)dest = 0xe9;
    *(unsigned int*)((char*)dest + 1) = offset;
    // protect memory
    VirtualProtect(dest, 5, old_protect, &old_protect);

This function replaces first 5 bytes of code contained in dest with jump to address (the jump is a relative one so we need to compute relative offset; also, the code area is read-only by default, so we unprotect it for the duration of patching). The primitive for stubbing CRT functions is in place – now we need to figure out where to invoke it. At first I thought that a static initializer (specially tagged so that it's guaranteed to execute before other initializers) would be sufficient, but after looking inside CRT source it became apparent that heap is initialized and (which is more critical) used before static initialization. Thus I had to define my own entry point:

    int entrypoint()
        int mainCRTStartup();
        return mainCRTStartup();

Now to patch the functions. We're interested in heap initialization, heap termination and various (de)allocation utilities. There is _heap_init, _heap_term and lots of variants of malloc/free and friends – they are all listed in source code. Note that I stubbed all _aligned_* functions with BREAK() (__asm int 3), because neither CRT code nor my code uses them – of course, you can stub them if you need.

There are several highlights here. First one I stumbled upon is that _heap_term is not getting called! At least not in static CRT. After some CRT source digging I decided to patch __crtCorExitProcess – it's useful only for managed C++, and it's the last thing that gets called before ExitProcess. The second one is in function _recalloc, that's specific to the allocator you're using to replace the default one. The purpose of _recalloc is to reallocate the memory as realloc does, but cleaning any additional memory – so if you do malloc(3) and then _recalloc(4), ((char*)ptr)[3] is guaranteed to be 0. My allocator aligns everything to 4 bytes and has a minimal allocation size limit; the original size that was passed to allocation function is not stored anywhere. It's easy to fix it for CRT because _recalloc is used in CRT only for blocks allocated with calloc, and I hope _recalloc is not used anywhere else. By the way, there is a bug in CRT related to _recalloc – malloc(0) with subsequent _recalloc(1) does not clear first byte (because for malloc(0) block with size 1 is created); moreover, more bugs of such nature are theoretically possible on Win64. Personally I find calloc weird and _recalloc disgusting; luckily it's Windows-only.

Ok, now we're done – are we? Well, everything went well until I turned leak detection on. It turns out that there are lots of allocations left unfreed by CRT – amazingly, there is a __freeCrtMemory function that frees some of those, but it's compiled in only in _DEBUG, and it's called only if CRT debugging facilities are configured to dump memory leaks on exit. Because of this I needed to copy the code, modify it slightly so that it compiles and invoke the function before heap termination. However, this function does not free everything – there were some more allocations left, that I needed to handle myself. You can see the code in cleanup_crt_leaks(). After cleaning up leaks printf(), which was used to output leaks to console, became unusable (oh, horror!), so I came up with the following function:

void debug_printf(const char* format, ...)
    char buf[4096];
    va_list arglist;
    va_start(arglist, format);
    wvsprintfA(buf, format, arglist);
    // console output
    HANDLE handle = GetStdHandle(STD_OUTPUT_HANDLE);
    WriteFile(handle, buf, (unsigned long)strlen(buf), NULL, NULL);

    // debug output

Finally, the last problem is that some CRT code checks global variable _crtheap prior to allocation, so we have to initialize it to something (that affects fopen() and other functions that use dynamically created critical sections).

Well, now it works and I'm quite happy with the results. Of course it's slightly hackish, but CRT code is such a mess that it blends in nicely. The more or less complete source code is here. Note that if you're using C++ new/delete and you have not overridden them globally for some reason, you might want to patch _nh_malloc/_heap_alloc with malloc_stub as well. Read more!

Sunday, March 1, 2009

View frustum culling optimization – Never let me branch

In previous iteration we converted the code to SoA instead of AoS, which enabled us to transform OBB points to world space relatively painlessly, and eliminated ugly and slow dot product, thus making the code faster. Still, the code is slow. Why?

Well, as it appears, the problem is branching.

I wanted to write a long post about branches and why they are often a bad idea for PPU/SPU, but it turns out that Mike Acton beat me to it – be sure to read his articles for detailed explanation: part 1 part 2 part 3 - so I'll make it short. For our case, there are two problems with branching:

First, code performance depends on input data. Visible boxes are worst case (this is the one the cycle count is for); invisible boxes are faster, with the fastest case (where the box is behind the first plane) taking 128 cycles. Because of this, it's hard to estimate the run time of culling, given the number of objects – upper bound is three times bigger than lower bound.

Second, branches divide the code in blocks, and compiler has problems performing optimizations between blocks. We have a constant-length loop, inside we compute 8 dot products for a single plane, then check if all of them are negative, and in this case we early-out. Note that there are a lot of dependencies in computation of dot products – si_fmas in dot4 depend on the result of previous si_fmas, si_fcgt depends on the result of dot4, etc. Here is an example of disassembly for performing a single dot4 operation, assuming that we already have SPLAT(v, i) in registers:

fma res, v2, z, v3
fma res, v1, y, res
fma res, v0, x, res

Pretty reasonable? Well, not exactly. While we have 3 instructions, each one depends on the result of the previous one, so we can use our result in 18 cycles instead of 3 (fma latency is 6 cycles). If we need to compute 6 dot4, and we have some sort of branching after each one, like we had in the code for previous attempt, we'll pay the cost of 18 cycles for each iteration (of course, there'll also be some cost associated with comparison and branching). On the other hand, if we computed all 6 dot4 without any branches, the code could've looked like:

fma res[0], v2[0], z[0], v3[0]
fma res[1], v2[1], z[1], v3[1]
fma res[5], v2[5], z[5], v3[5]

fma res[0], v1[0], y[0], res[0]
fma res[5], v1[5], y[5], res[5]

fma res[0], v0[0], x[0], res[0]
fma res[5], v0[5], x[5], res[5]

This code has 18 instructions, and all results are computed in 24 cycles – but we're computing 6 dot4 instead of 1! Also 24 cycles is the latency for res[5] – we can start working on res[0] immediately after last fma gets issued.

The problem is not only related to instruction latency (in our case, register dependencies), but also to pipeline stalls – SPU has two pipelines (even and odd), and can issue one instruction per pipeline per cycle for, uhm, perfect code – each type of instruction can be issued only on one of the pipes, for example arithmetic instructions belong to even pipe, load/store/shuffle instructions belong to odd one. Because of this shuffles can be free if they dual-issue with arithmetics and do not cause subsequent dependency stalls.

Compiler tries to rearrange instructions in order to minimize all stalls – register dependencies, pipeline stalls and some other types – but it is often not allowed to do it between branches. Because of this it's best to eliminate all branches – compiler will be left with a single block of instructions and will be able to do a pretty good job hiding latencies/dual-issuing instructions. This is often critical – for example, our current version wastes almost half of cycles while waiting for results because of register dependency.

Of course, eliminating branches is often a tradeoff – sometimes it makes worst-case run faster, but best-case now runs slower, as we observed last time with x86 code. The decision depends on your goals and on frequency of various cases – remember that branchless code will give you a guaranteed (and usually acceptable) lower bound on performance.

So, in order to eliminate branches, we'll restructure our code a bit – instead of checking for each plane if all points are outside, we'll check if any point is inside, i.e. if the box is not outside of the plane:

static inline qword is_not_outside(qword plane, const qword* points_ws_0, const qword* points_ws_1)
    qword dp0 = dot4(plane, points_ws_0[0], points_ws_0[1], points_ws_0[2]);
    qword dp1 = dot4(plane, points_ws_1[0], points_ws_1[1], points_ws_1[2]);

    qword dp0pos = si_fcgt(dp0, (qword)(0));
    qword dp1pos = si_fcgt(dp1, (qword)(0));

    return si_orx(si_or(dp0pos, dp1pos));

si_orx is a horizontal or (or across) instruction, which ors 4 32-bit components of source register together and returns the result in preferred slot, filling the rest of vector with zeroes. Thus is_not_outside will return 0xffffffff in preferred slot if box is not outside of plane, and 0 if it's outside.

Now all we have to do is to call this function for all planes, and combine the results – we can do it with si_and, since the box is not outside of the frustum only if it's not outside of all planes; if any is_not_outside call returns 0, we have to return 0.

    // for each plane...
    qword nout0 = is_not_outside((qword)frustum->planes[0], points_ws_0, points_ws_1);
    qword nout1 = is_not_outside((qword)frustum->planes[1], points_ws_0, points_ws_1);
    qword nout2 = is_not_outside((qword)frustum->planes[2], points_ws_0, points_ws_1);
    qword nout3 = is_not_outside((qword)frustum->planes[3], points_ws_0, points_ws_1);
    qword nout4 = is_not_outside((qword)frustum->planes[4], points_ws_0, points_ws_1);
    qword nout5 = is_not_outside((qword)frustum->planes[5], points_ws_0, points_ws_1);

    // merge "not outside" flags
    qword nout01 = si_and(nout0, nout1);
    qword nout012 = si_and(nout01, nout2);

    qword nout34 = si_and(nout3, nout4);
    qword nout345 = si_and(nout34, nout5);

    qword nout = si_and(nout012, nout345);

    return si_to_uint(nout);

I changed return type for is_visible to unsigned int, with 0 meaning false and 0xffffffff meaning true; this won't change client code, but slightly improves performance.

Now when we compute everything in a single block, compiler schedules instructions in a way that we waste close to zero cycles because of latency. The new branchless version runs at 119 cycles, which is more than 3 times faster than the previous version, and 10 times faster than initial scalar version. This results in 37 msec for million calls, which is almost 2 times faster than fastest result on x86 (finally!). Moreover, this is slightly faster than the best case of previous version – so there is no tradeoff here, new version is always faster than old one. Note that eliminating branches is not worth it for x86 code (i.e. it does not make worst case faster, which is expected, if you remember that we had to do 2 checks per plane in order to make SoA approach faster than AoS).

The current source can be grabbed here.

That's all for now – stay tuned for the next weekend's post! I plan to post something not VFC-related the next week, then another VFC post the week after that. If you're starting to hate frustums, SPU, me and my blog - sorry about that, but we'll be done with VFC some day, I swear! :) Read more!

Sunday, February 15, 2009

View frustum culling optimization – Structures and arrays

Last week I've tried my best at optimizing the underlying functions without touching the essence of algorithm (if there was a function initially that filled a 8-vector array with AABB points, optimizations from previous post could be done in math library). It seems the strategy has to be changed.

There are several reasons why the code is still slow. One is branching. We'll cover that in the next issue though. Another one has already been discussed on this blog related to shaders – we have 4-way SIMD instructions, but we are not using them properly. For example, our point transformation function wastes 1 scalar operation per each si_fma, and requires additional .w component fixup after that. Our dot product function is simply horrible. Once again we're going to switch layout for intermediate data from Array of Structures to Structure of Arrays.

We have 8 AABB points, so we'll need 6 vectors – 2 vectors per each component. Do we need all 6? Nah. Since it's an AABB, we can organize stuff so that we need only 4 like this:

x X x X
y y Y Y
z z z z

Note that vectors for x and y components are shared between two 4-point groups. Of course this sharing will go away after we transform our points to world space – but that makes it easier to generate SoA points from min/max vectors.

How do we generate them? Well, we already know the solution for Z – there is some magical si_shufb instruction, that worked for us before. It's time to know what exactly it does, as it can be used to generate x/y vectors too.

What si_shufb(a, b, c) does is it takes a/b registers, and permutes their contents using c as pattern, yielding a new value. Permutation is done at a byte level - each byte of c corresponds to a resulting byte, which is computed one of the following ways:

1.0x0v corresponds to a byte of left operand with index v
2.0x1v corresponds to a byte of right operand with index v
3.0x80 corresponds to a constant 0x00
4.0xC0 corresponds to a constant 0xFF
5.0xE0 corresponds to a constant 0x80
6.other values result in one of the above, the exact treatment is out of the scope

This is a superset of Altivec vec_perm instruction, and can be used to do very powerful things, as we'll realize soon enough. For example, you can implement usual GPU-style swizzling like so:

src_zxxx = si_shufb(src, src, ((qword)(vec_uint4){0x08090a0b, 0x00010203, 0x00010203, 0x00010203}));

First four bytes of my pattern correspond to bytes 8-11 of left argument, all other four-byte groups correspond to bytes 0-3 of left argument. This is equal to applying .zxxx swizzle. As you can probably see, the code can get very obscure if you use shuffles a lot, so I've made some helper macros:

// shuffle helpers
#define L0 0x00010203
#define L1 0x04050607
#define L2 0x08090a0b
#define L3 0x0c0d0e0f

#define R0 0x10111213
#define R1 0x14151617
#define R2 0x18191a1b
#define R3 0x1c1d1e1f

#define SHUFFLE(l, r, x, y, z, w) si_shufb(l, r, ((qword)(vec_uint4){x, y, z, w}))

// splat helper
#define SPLAT(v, idx) si_shufb(v, v, (qword)(vec_uint4)(L ## idx))

SHUFFLE is for general shuffling, SPLAT is for component replication (.yyyy-like swizzles). Note that in previous post SPLAT was used in transform_point to generate .xxxx, .yyyy and .zzzz swizzles from AABB point.

Let's generate AABB points then.

    // get aabb points (SoA)
    qword minmax_x = SHUFFLE(min, max, L0, R0, L0, R0); // x X x X
    qword minmax_y = SHUFFLE(min, max, L1, L1, R1, R1); // y y Y Y
    qword minmax_z_0 = SPLAT(min, 2); // z z z z
    qword minmax_z_1 = SPLAT(max, 2); // Z Z Z Z

That was easy. Now if we want first 4 points, we use minmax_x, minmax_y, minmax_z_0; for the second group, we use minmax_x, minmax_y, minmax_z_1.

Now, we have 2 groups of 4 points in each, SoA style – we have to transform them to world space. It's actually quite easy – remember the first scalar version? If you've glanced at the code, you've seen a macro for computing single resulting component:

#define COMP(c) p->c = op.x * mat->row0.c + op.y * mat->row1.c + op.z * mat->row2.c + mat->row3.c

As it turns out, this can be converted to SoA style multiplication almost literally – you just need to think of op.x, op.y, op.z as of vectors with 4 values of some component; mat->rowi.c has to be splatted over all components. The resulting function becomes:

static inline void transform_points_4(qword* dest, qword x, qword y, qword z, const struct matrix43_t* mat)
#define COMP(c) \
    qword res_ ## c = SPLAT((qword)mat->row3, c); \
    res_ ## c = si_fma(z, SPLAT((qword)mat->row2, c), res_ ## c); \
    res_ ## c = si_fma(y, SPLAT((qword)mat->row1, c), res_ ## c); \
    res_ ## c = si_fma(x, SPLAT((qword)mat->row0, c), res_ ## c); \
    dest[c] = res_ ## c;

#undef COMP

Note that it's not really that much different from the scalar version, only now it transforms 4 points in 9 si_fma and 12 si_shufb instructions. We're going to transform 2 groups of points, so we'll need 18 si_fma instructions, si_shufb can be shared – luckily, the compiler does it for us so we just need to call transform_points_4 twice:

    // transform points to world space
    qword points_ws_0[3];
    qword points_ws_1[3];

    transform_points_4(points_ws_0, minmax_x, minmax_y, minmax_z_0, transform);
    transform_points_4(points_ws_1, minmax_x, minmax_y, minmax_z_1, transform);

Previous vectorized version required 24 si_fma and 24 si_shufb, plus 8 correcting si_selb (to be fair, it could be actually optimized to require 6 si_shufb + 8 si_selb, but it's still not a win over SoA). Note that 18 si_fma + 12 si_shufb does not mean 30 cycles. SPUs are capable of dual-issuing some instructions – there are two groups of instructions, one group runs at even pipeline, another one – at odd. si_fma and si_shufb run on different pipelines, so the net throughput will be closer to 18 cycles (slightly larger than that if si_shufb latency can't be hidden).

Now all that's left is to calculate dot products with a plane. Of course we'll have to calculate them 4 at a time. But wait – in our case execution of inner loop terminated after the first iteration. So previously we were doing only one (albeit ugly) dot product, and now we're doing 4, or even 8! Isn't that a little bit excessive? Well, that's not – but we'll save a more detailed explanation for the later post, for now let the results speak for themselves.

In order to calculate 4 dot products, we'll make a helper function:

static inline qword dot4(qword v, qword x, qword y, qword z)
    qword result = SPLAT(v, 3);

    result = si_fma(SPLAT(v, 2), z, result);
    result = si_fma(SPLAT(v, 1), y, result);
    result = si_fma(SPLAT(v, 0), x, result);

    return result;

And call it twice. Again, we'll be doing four splats twice, but compiler is smart enough to eliminate this. After that we'll have to compare all 8 dot products with zero, and return false if all of those are negative.

    // for each plane...
    for (int i = 0; i < 6; ++i)
        qword plane = (qword)frustum->planes[i];

        // calculate 8 dot products
        qword dp0 = dot4(plane, points_ws_0[0], points_ws_0[1], points_ws_0[2]);
        qword dp1 = dot4(plane, points_ws_1[0], points_ws_1[1], points_ws_1[2]);

        // get signs
        qword dp0neg = si_fcgt((qword)(0), dp0);
        qword dp1neg = si_fcgt((qword)(0), dp1);

        if (si_to_uint(si_gb(si_and(dp0neg, dp1neg))) == 15)
            return false;

si_fcgt is just a floating-point greater comparison; I'm abusing the fact that 0.0f is represented as a vector with all bytes equal to zero here. si_fcgt operates like SSE comparisons and returns 0xffffffff for elements where the comparison result is true, and 0 for others. After that I and the results together, and then use si_gb instruction to gather bits of results. si_gb takes least significant bit from each element and inserts it into corresponding bit of the result; we get a 4-bit value in preferred slot, everything else is zeroed out. If it's equal to 15, then si_and returned a mask where all elements are 0xffffffff, which means that all dot products are less than zero, so the box is outside.

Note that si_gb is like _mm_movemask_ps, only it takes least significant bits instead of most significant – in case of SSE, we don't need to do comparisons. We can avoid comparisons here by anding dot products directly, and then moving the sign bit to least significant bit (it can be done by rotating each element 1 bit to the left, that's achieved by si_roti(v, 1)), but this is slightly slower, so we won't do it.

Now, the results. The code runs at 376 cycles, which is more than 2 times faster than the previous version, and almost 4 times faster than the original. This speedup is partially because we're doing things more efficiently, partially because we got rid of branches; we'll discuss this the next week. A million calls takes 117 msec, which is still worse than x86 results – but it's not the end of the story. Astonishingly, applying exactly the same optimizations to SSE code results in 81 msec for gcc (which is 30% faster than naively vectorized version), and in 104 msec for msvc8 (which is 40% slower!).

The fastest version is still produced by msvc8 from previous version. This should not be very surprising, as we changed inner loop from performing one dot-product to performing 8 at once, so that shows. We can optimize it in this case by adding early out – after we compute first 4 dot products, we'll check if all of them are positive; if some of them are not, we can safely skip additional 4 dot products and continue to the next iteration. It results in 87 ms for msvc8 and 65 ms for gcc, with gcc-compiled SoA finally being faster than all previous approaches. Of course, this is a worst case for SoA – in case inner loops actually did not terminate after first iteration the performance gain would be greater. Adding the same optimization to SPU code makes it slightly (by 3 cycles) slower; the penalty is tens of cycles if the early out does not happen and we have to compute all 8 dot products, so it's definitely not worth it.

The current source can be grabbed here.

That's all for now – stay tuned for the next weekend's post! Read more!

Sunday, February 8, 2009

View frustum culling optimization – Vectorize me

Last week I've posted some teaser code that will be transformed several times, each time yielding a faster one - “faster” in terms of “taking less cycles for the test case on SPU”. A lot of you probably looked at my admittedly lame excuse for, uhm, math library and want to ask – why the hell do you use scalar code? We're going to address the problem in this issue. This is probably a no-brainer for most of my readers, but this is a good opportunity to introduce some important points about SPUs and introduce some actual vector code before diving further.

But first, we need some background information on SPUs. For the todays post, there is a single important thing to know about SPUs – they are vector processors. Unlike most common architectures (PowerPC, x86, etc.), SPUs have only one register set, which consists of 128-bit vectors. The current implementation has 128 of them, and each register is treated differently in different instructions (you have different instructions for adding two registers as if they contained 4 single precision floats or 16 8-bit integers). The important point is that, while you can compile a piece of scalar code for SPU, it's going to use vector registers and vector instructions; the scalar values are assumed to reside in so called preferred slot – for our current needs, we only care about preferred slot for 32-bit scalars, which is the first one (index 0). Register components are numbered from least address in memory onwards, which is really refreshing after SSE little-endian madness.

This actually goes slightly further – not only all registers are 16-byte, but all memory accesses (I'm obviously talking about local storage access here – though the same mostly applies to DMA; I'll be probably discussing something DMA-related after VFC series ends) should be – you can only load/store a full register's worth of data from/to 16b-aligned location. Of course, you can implement a workaround for scalar values – for loading, load the 16 byte chunk the value is in, and then shift it in the register so that it resides in preferred slot; for saving, load the destination 16 byte chunk, insert desired value in it via shifting/masking, and then store the whole chunk back. In fact, this is exactly what compiler does. Moreover, for our struct vector3_t, loading three components in registers will generate such load/shift code for every component, since compiler does not know the alignment (the whole vector could be in one 16 byte chunk, or it could be split in half between any two components).

In order to leverage available power, we have to use available vector instructions. SPUs have a custom instruction set, which is well documented. For now, it's important to know that there is a fused multiply-add instruction, which computes a*b+c, and there is no dot product instruction (or floating-point horizontal sum, for that matter). In fact, on current generation of consoles, XBox360 is pretty unique in that it does have a dot product instruction.

So, our code is bad because we have lots of scalar memory accesses and lots of scalar operations, which are not using available processing power properly. Let's change this!

One option is to code in assembly; this has obvious benefits and obvious pitfalls, and we'll use intrinsics instead. For SPUs, we have three intrinsics sets to choose from – Altivec emulated (vec_*, the same as we use on PPU), generic type-aware (spu_*) and low-level (si_*). GCC compiler provides several vector types as language extensions (some examples are 'vector float' and 'vector unsigned char', which correspond to 4 32-bit floats and 16 8-bit unsigned integers, respectively); a single spu_* instruction translates to different assembly instructions depending on a type, while si_* instructions operate on abstract register (it has type 'qword', which corresponds to 'vector signed char') – i.e. to add two vectors, you can use spu_add(v1, v2) with typed registers, or one of si_a, si_ah, si_fa, si_dfa to add registers as 32-bit integer, 16-bit integer, 32-bit floating point or 64-bit floating point, respectively. We'll be using si_* family for several reasons – one, they map to assembly exactly, so getting used to si_* instructions make it much easier to read (and possibly write) actual assembly, which is very useful when debugging or optimizing code, two, spu_* family is not available in C, as it uses function overloading. I'll explain specific intrinsics as we start using them.

First thing we'll do is dispose of redundant vector3_t/plane_t structures (in a real math library, we won't do this of course, but this is a sample), and replace them with qwords. This way, everything will be properly aligned, and we won't need to write load/store instructions ourselves (as opposed to something like struct vector3_t { float v[4]; }).

Then, we have to generate an array of points. Each resulting point is a combination of aabb->min and aabb->max – for each component we select either minimum or maximum value. As it turns out, there is the instruction that does exactly that – it accepts two registers with actual values and a third one with pattern; for each bit in pattern, it takes left bit for 0 and right bit for 1 – it's equivalent to (a & ~c) | (b & c), only in one instruction.

The code becomes

    // get aabb points
    qword points[] =
        min,                                                   // x y z
        si_selb(min, max, ((qword)(vec_uint4){~0, 0, 0, 0})),  // X y z
        si_selb(min, max, ((qword)(vec_uint4){~0, ~0, 0, 0})), // X Y z
        si_selb(min, max, ((qword)(vec_uint4){0, ~0, 0, 0})),  // x Y z

        si_selb(min, max, ((qword)(vec_uint4){0, 0, ~0, 0})),  // x y Z
        si_selb(min, max, ((qword)(vec_uint4){~0, 0, ~0, 0})), // X y Z
        max,                                                   // X Y Z
        si_selb(min, max, ((qword)(vec_uint4){0, ~0, ~0, 0})), // x Y Z

Note that I'm using another gcc extension to form vector constants. This is very convenient and does not exhibit any unexpected penalties (the expected ones being additional constant storage and additional instructions to load them).

Then we have transform_point; we'll have to transform a given vector by matrix, and additionally to stuff a 1.0f in .w component of the result in order for the following dot product to work (I sort of hacked this in scalar version by using dot(vector3, vector4)). Vector-matrix SIMD multiplication is very well-known – we'll need add/multiply instructions, and ability to replicate a vector element across the whole vector. For this we'll use a si_shufb instruction – I'll leave the detailed explanation for the next issue, for now just assume that it works as desired :)

static inline qword transform_point(qword p, const struct matrix43_t* mat)
    qword px = si_shufb(p, p, (qword)(vec_uint4)(0x00010203));
    qword py = si_shufb(p, p, (qword)(vec_uint4)(0x04050607));
    qword pz = si_shufb(p, p, (qword)(vec_uint4)(0x08090a0b));

    qword result = (qword)mat->row3;

    result = si_fma(pz, (qword)mat->row2, result);
    result = si_fma(py, (qword)mat->row1, result);
    result = si_fma(px, (qword)mat->row0, result);

    result = si_selb(result, ((qword)(vec_float4){0, 0, 0, 1}), ((qword)(vec_uint4){0, 0, 0, ~0}));

    return result;

We replicate point components, yielding three vectors, and then compute transformation result using si_fma (fused multiply-add; returns a * b + c) instruction. After that we combine it via selb to get 1.0f in the last component.

Note that in this case we are fortunate to have our matrix laid out as it is – another layout would force us to transpose it prior to further computations to make vectorization possible. In scalar case, the layout does not make any difference.

Finally, we'll have to compute dot product. As there is no dedicated dot product instruction, we'll have to emulate it, which is not pretty.

static inline float dot(qword lhs, qword rhs)
    qword mul = si_fm(lhs, rhs);

    // two pairs of sums
    qword mul_zwxy = si_rotqbyi(mul, 8);
    qword sum_2 = si_fa(mul, mul_zwxy);

    // single sum
    qword sum_2y = si_rotqbyi(sum_2, 4);
    qword sum_1 = si_fa(sum_2, sum_2y);

    // return result
    return si_to_float(sum_1);

First we get a component-wise multiplication result by using fm; then we'll have to compute horizontal sum. First we sum odd/even components together separately. For that, we rotate our register to the left by 8 bytes (si_rotqbyi) and add with the original. After that, we rotate the result left by 4 bytes (to get the second sum at the preferred slot) and add with the original.

For mul = (1, 2, 3, 4), we get the following values:
mul_zwxy = 3 4 1 2
sum_2 = 4 6 4 6
sum_2y = 6 4 6 4
sum_1 = 10 10 10 10

The result is converted to float via si_to_float cast intrinsic – it just tells the compiler to reinterpret result as if it was a float (actual scalar value is assumed to be in preferred slot), this usually does not generate any additional instructions.

Note that in case of SPU, there is only one register set – thus there is no penalty for such vector/scalar conversion. This code will not perform very well for other architectures – for example, on PowerPC converting vector to float in this way causes a LHS (Load Hit Store; it occurs when you read from the same address you just wrote into) because vector should be stored to stack to load vector element into float register; LHS causes a huge stall (40-50 cycles) and thus performance can be compromised here. For this reason, if your PPU/VMX math library has an optimized dot product function that returns float, don't use it in performance critical code – find another approach. It's interesting that if you think about it, you don't need dot products that much, as I'll show in the next issue.

Anyway, the current code runs at 820 cycles, which is 50% faster than scalar code. This equals to approximately 256 msec per million calls, the corresponding numbers for x86 being 136 msec for gcc and 74 msec for msvc8. Once x86 code is changed so that dot() function returns its result in a vector register, and resulting sign is then analyzed via _mm_movemask_ps instruction, timings change to 126/68, respectively. We've made some progress there, but our SPU implementation is still far from x86 in terms of speed though we're using the same techniques. I promise that the end result will be much more pleasing though :)

The current source can be grabbed here.

That's all for now – stay tuned for the next weekend's post! Read more!

Saturday, January 31, 2009

View frustum culling optimization - Introduction

Here I come again, back from almost a year long silence – and for some weird reason a visitor counter shows that people are still reading my blog! This was an eventful year for me – I worked on lots of things at work and on some at home, got 3 more shipped titles to put in my CV, started really programming on PS3 (including many RSX-related adventures, optimizations and, recently, SPU coding, which I happen to enjoy a lot), and, as some of you will probably guess from the code below, started using Vim. Some other (good) changes gave me more free time, so this post is a first one in a new one-post-in-a-week series (which will hopefully not be the, uh, last one also).

I have a small piece of code at work, which performs simple frustum culling for a given OBB. Initially it was written in an unoptimized (and cross-platform) way, later it was rewritten for Altivec with interesting optimizations, which yielded 3x performance boost, IIRC, and recently it was rewritten again for SPU. I considered this series of code transformation an interesting one, so I thought I'd expand it slightly (adding more intermediate stages), pretend it always was on SPU, and write several posts about it.

This post is a teaser, featuring testing methodology, source data, algorithm and initial (unoptimized) version of code, with unoptimized underlaying math library. Each code snippet will be short and self-contained, since the problem at hand is simple enough. Later posts in series will each feature some performance-related transformation (which can be obviously applied to lots of algorithms), with the last post giving more or less the current version of my code at work.

So, let's get started!

It starts simple - we have a handful of meshes, with each mesh having some kind of bounding volume and a local-to-world transformation. The task at hand is to determine, for a given frustum, whether the mesh is potentially visible (inside/intersecting with the frustum). The test has to be conservative – i.e. it can answer “visible” for meshes which are actually invisible – but it does not have to be exact. In fact, for many meshes the bounding volume itself is inexact, but additionally the algorithm can sacrifice some accuracy in order to be faster.

For our case, the bounding volume is AABB (axis-aligned bounding box) - “axis-aligned” here means that box axes are aligned to mesh local space axes, so in world space this is OBB. We're testing it against an arbitrary frustum – it can be a usual perspective/orthogonal one, or something more fancy (for example, our reflection/refraction rendering passes use perspective projection with oblique clipping, so near/far planes are not perpendicular to the viewing direction, and in fact they are not parallel at all!). Frustum is defined by a 4x4 matrix, though obviously we're free to convert it to any other representation we like.

There are two common approaches to testing if the box is inside frustum or not. First, equations of all 6 frustum planes are extracted. Then, for each plane it's determined if the box is completely outside (i.e. in the negative half-space, if planes' normals are pointing inside the frustum) or not; if the box is not completely outside for all planes, it's reported visible, otherwise it's reported invisible. This can be extended to differentiate “completely inside” and “partially inside” results, though we don't need it, since we're going to render both groups of meshes anyway.

Two approaches differ in the methodology of box-plane testing – one (bruteforce) is testing all 8 vertices of the box, another one is taking a single point (the p-vertex) and testing it, which is enough to give the correct answer if the correct p-vertex is chosen. The series will concentrate on the bruteforce approach, at least for now.

The naïve version of the algorithm first extracts the plane equations, using frustum combined view projection matrix (this is done once for each frustum, so it is not performance sensitive; as such, the code for this is omitted and frustum is assumed to have 6 computed plane equations initially). Then it applies the described bruteforce algorithm as follows:

bool is_visible(struct matrix43_t* transform, struct aabb_t* aabb, struct frustum_t* frustum)
    // get aabb points
    struct vector3_t points[] =
        { aabb->min.x, aabb->min.y, aabb->min.z },
        { aabb->max.x, aabb->min.y, aabb->min.z },
        { aabb->max.x, aabb->max.y, aabb->min.z },
        { aabb->min.x, aabb->max.y, aabb->min.z },

        { aabb->min.x, aabb->min.y, aabb->max.z },
        { aabb->max.x, aabb->min.y, aabb->max.z },
        { aabb->max.x, aabb->max.y, aabb->max.z },
        { aabb->min.x, aabb->max.y, aabb->max.z }

    // transform points to world space
    for (int i = 0; i < 8; ++i)
        transform_point(points + i, transform);

    // for each plane...
    for (int i = 0; i < 6; ++i)
        bool inside = false;

        for (int j = 0; j < 8; ++j)
            if (dot(points + j, frustum->planes + i) > 0)
                inside = true;

        if (!inside)
            return false;

    return true;

It uses five predefined data structures (vector3_t, matrix43_t, aabb_t, frustum_t and plane_t which is the type of frustum->planes array elements); those, and the (again, naïve) code of two used functions is available here (along with the rest of the code).

Note that matrix43_t is laid out so that three translation components are adjacent to each other in memory (I don't use the term “whatever major” here because it's very misleading); in our real code, rows actually consist of four components, with the fourth one being undefined for matrix43_t (of course, all operations should proceed as if the column was filled with 0 0 0 1). Similarly, vector3_t has four components, with the fourth one being undefined (this affects aabb_t). This is something that is assumed to stay this way forever, so all of our discussed code will somehow work around that when needed. From the next post and onwards, data layout of the sample code will be exactly the same as in our engine, I've omitted padding fields in today's version for simplicity.

The testing methodology is simple – I compile the code for SPU using a compiler from Sony's toolchain with -O3 level of optimization (the code is C99, by the way), and then run it via SPU simulator. This is an extremely useful tool that's provided by Sony for PS3 developers, which can run a SPU program and, for example, report run statistics – cycles elapsed, various stalls, etc. As far as I know, IBM has a public simulation suite with comparable capabilities, but since it requires Linux I never bothered to test it out. The number of cycles that the tool reports is then slightly reduce to account for some startup overhead (which is 34 cycles), so the number that I present here is the number of cycles for non-inlined call, with included function call overhead. For the reference, I'll include expected run time on a million OBBs (on one SPU, obviously), and corresponding run times of more or less the same code for PC (on my Core2 Duo 2.13 GHz), compiled with gcc 4.3.0 and MSVC 8.0 (SPU intrinsics will be replaced with SSE1 code). Those are only for reference, don't quote me on them :)

The cycle count for this naïve code is 1204, which (given a 3.2 GHz SPU) translates to 376 msec per million calls. The same code gives me roughly 84 msec when compiled with gcc (switches -O3 -msse) and 117 msec when compiled with cl (switches /O2 /fp:fast /arch:SSE). The test runs on the same data each time (taking care that processing is actually being performed), which excludes cache misses from the picture; the actual data is the same for SPU and PC tests and consists of a small box being completely inside the frustum – so that's a worst case for the outer loop (we have to test the box against all planes only to report that it's actually visible), although that's a best case for the inner loop (the first vertex tested for each plane is inside, so we can bail out early). I don't have any real-world average statistics on iteration counts of those loops; anyway, eventually we're going to eliminate some, and then all of those branches, so that the code will perform at constant speed.

That's all for now – stay tuned for the next weekend's post!
Read more!