Saturday, September 11, 2010

Reset and reload

Long time no see, everyone.

First of all, the blog has been moved. The new blog address is https://zeux.io, and the new feed address is https://zeux.io/feed. Please, update your bookmarks! This is the last post at this site (zeuxcg.blogspot.com), all new posts will appear at the new address.

In addition to changing the address, I've changed the blogging platform - this blog is now powered by WordPress, which at first impression is superior to Blogger in many ways - built-in code highlighter, built-in 'read more' support, image storage, slightly better html generation (i.e. it does not screw my posts up as often as Blogger did), better themes, non-anonymous comments without Google account, etc. I bet there are some downsides, but anyway I hope it will be a better experience (and will motivate me to write more posts, of course).

All old posts are imported from Blogger along with the comments; their contents is left as is, apart from minor cleanup and link cross-reference.

Previously most of my posts were of considerable length; I've even got as far as stuffing several completely different notes in a single post. The format is going to change slightly - there are going to be small notes as well as normal sized posts. Also probably the amount of non-graphics related posts is going to increase; still, I'll try to keep the content mostly relevant to game development. Read more!

Tuesday, September 29, 2009

On joys and sorrows of library development – part 1

This may come as a surprise, but I am not dead. In fact, what you see is a new post! As usual I have a lot of interesting themes to cover, and barely enough time to spare. While I'm at it, let me tell you about NDAs. I hate NDAs with a passion – I've got some things to blog about that are partially covered by NDA (of course, the interesting parts are NOT); also I've been thinking that this is a non-issue and basically that I can blog about things that are not quite critical, but half a year ago or so I was forced to remove a blog post; the reasons are not exactly clear but it seems that it was because of a single sentence that mentioned something that's NOT secret in my point of view and was NOT relevant to post contents. For this reason I'm hesitant to write about some topics so I'll either skip them altogether (which is a shame) or find a way to omit all details that might seem sensitive to people. Also I'm not sure if blogging about post removal due to NDA is an NDA violation?..

Anyway, the topic for today is something different – I'll write a bit about library development. In past few years I've developed and maintained a C++ XML parser PugiXML. This is a tiny library which focuses on performance and ease of use. We've had tremendous speedups of export process after converting from TinyXML, and I know lots of other success stories. PugiXML is portable (lots of platforms and compilers are supported, I've gone through special efforts to support MSVC6 and old CodeWarriors), console-aware (i.e. you can toggle off STL/exception support, override memory management, etc.), small, robust, etc. It even features an XPath evaluator!

PugiXML was born as a project to clean up pugxml – initial idea was to strip pugxml header from sources (thus reducing compilation/linking times), slightly cleanup interface and use it. What followed was an almost complete rewrite of the code, bringing the parser closer to standard compliance, adding useful features for DOM inspection, and greatly improving speed. There are bits of code left from pugxml, and interface is very similar, but it's quite a different project now. As far as I know, the only parser in use that beats PugiXML at parsing speed is RapidXML, and the only major problem with PugiXML is that it's Unicode support is pretty much limited by UTF8. Though both of those may change at some point in the future :)

I'm going to write some stuff here that may be of interest to other people.

1. Interface


The initial API was taken as-is from pugxml; in the hindsight, this was both a good (since it offered a very simple transition for pugxml users) and bad thing. It's a bad thing because the interface is seriously cluttered.

For example, there are at least four methods of traversing children nodes: you can use the next_sibling() function (the DOM is structured as a graph of nodes, with nodes connected via pointers; each node contains a pointer to both right and left siblings, the function gets the right one), you can use the node iterator, you can use xml_tree_walker (which is a Visitor-like interface), and finally you can grab all child elements via an insert iterator with all_elements_by_name(). Oh, and you can use XPath, which makes five methods.

As another example, every method for string-based queries (i.e. xml_node::attribute(const char*), which means “give me the first attribute with the following name) has a corresponding method which uses wildcard matching instead of string comparison (i.e. node.attribute_w(“foo*ba?”) will match foobar or fooatbaz).

Overall, it's not that much (I have a friend who's been working with a codebase that has an interface with 760+ virtual functions, so I'm not easily scared) and it does not stand in the way while you're using the library, but it certainly does not help maintaining and developing it.

But the worst part is that I can't remove any of those functions. For example, I consider tree walker to be a bad abstraction; it's rarely usable, and if it is, it's easy to write it outside the library. If I had a full API usage statistics, I could've made a conscious decision – either nobody uses it and I remove it, or there are very few who do and I extract it into an external helper class in an external header (possibly changing the interface slightly), or it's a feature that is used in every second application that uses my library and I can't do anything. The problem is I have no statistics, so I can't do anything.

Other than that, I feel the interface to be good (I use it relatively often both in my pet projects and at work, so if there was something that annoyed me I would've fixed that); the best decision for me is pointer abstraction – in pugixml you don't work with pointers to node (as with TinyXML), you work with tiny pointer wrapper class (the size is equal to that of a pointer) that's passed by value; the point is that there is no null pointer exception, all operations on “null” nodes/attributes are perfectly defined. Of course, the same could be done with a pointer API by using a dummy object instead of null pointer, what matters is the decision to protect the user. Also I find that this makes parsing code much more concise – you don't have to do error handling for every API call!

2. Performance


The parsing performance is very good, on COLLADA files it's hundreds of megabytes per second (probably closer to gigabyte); the bottleneck is always HDD read speed unless the file is cached. Of course, it's still slightly slower than it could be; also the performance comes for a price of not being fully standard compliant – it manifests in allowing certain XML standard violations, such as disallowed Unicode symbols in attribute/node names, multiple node attributes with the same name, etc. This means that while any correct XML file will be parsed, some malformed ones will not be rejected. Up to some point there even were flags to make parser pass certain standard violations (i.e. there was a mode that could handle HTML-style unclosed tags by scanning for matching open tag and automatically closing all descendants), but I removed them to reduce clutter (that was at the point when parser was used by me and a couple of friends so no harm done).

The memory consumption is also good enough (when we switched from TinyXML at work, we got ~2x improvement in terms of memory required to parse a DOM tree), although it could be better. Surprisingly this was achieved without any tricks that I love (take the pointer, take lower N bits, stuff something useful in there, pretend that everything was that way) and almost without any bit-packing.

All good things come at a price – the parser currently requires that the whole XML file is a large contiguous chunk of memory (i.e. if you have a 200 Mb file to parse, you have to have a 200 Mb chunk of address space); also, this chunk dies with the document so in the worst case PugiXML can lose in peak memory consumption if you modify your tree too much (i.e. load a 200 Mb document from file, remove all nodes, add an equivalent amount of contents by hand – the memory overhead of PugiXML will be i.e. 400 Mb (larger than that because nodes take some space too), the memory overhead of a typical parser will be 200 Mb). Of course this is almost never a problem in practice.

Next time: performance highlights (tricks to make parsing fast, saving performance), user requests, documentation, portability concerns Read more!

Monday, June 8, 2009

Implementing Direct3D for fun and profit

I can't believe I'm writing this, it's been what, 2 months? During that time a lot of things happened – I've been to the conference and gave an hour-long talk about our SPU rendering stuff (which was more or less well received), I've almost completed an occlusion subsystem (rasterization-based), which is giving good results; and the financial crisis has finally hit the company I work at – some projects are freezed due to the lack of funding, and some people are fired. It's kind of sad walking through half-empty offices... Anyway, I know I promised to write often but as I am actively developing my pet engine at home and there is a lot of stuff to work on at my day job, so time is a scarce resource for me. My blog/todo.txt file is already 20 entries long, where some things are too small to deserve a post, and others demand a lengthy series. I'll try to select something interesting from time to time and blog about it. As for todays topic,

Every object in core Direct3D (I'll be talking about 9 today, but the same thing should apply to 10 and 11) is an interface. This means that the details of actual implementation is hidden from us, but this also means that we can implement those interfaces. Why could we want to do that?

Reverse engineering
If you work in game industry/computer graphics, or, well, any other IT-related field, I suppose, then you should be constantly gaining new knowledge; otherwise your qualification as a specialist will decrease very fast. There are lots of ways to learn, and one of the best is to learn from others experience. Unfortunately, while there is a lot of information on the technology of some titles, most are not described at all. Also sometimes the descriptions are inaccurate – after all, devil is in the details. So what you can do is take an existing title and reverse-engineer it – that is, gain information about implementation details from the outside. Disclaimer: Of course, this information is provided only for educational value. Reverse engineering can violate the laws of your country and/or the EULA of the product. Don't use it if it does.

In PC / Direct3D world there are two primary tools than can allow such introspection – NVidia PerfHUD and Microsoft PIX. There is also a beta of Intel GPA (which is, by the way, quite promising, if lacking polish), but it is more or less like PIX. Using PIX does not require modifications of the host program, however PIX does not work for some titles (it might crash), is slow (especially for titles with complex scenes, lots of draw calls, etc.) and is not very convenient to use as a reverse engineering tool for other reasons.

PerfHUD is more useful in some areas, but you need to create Direct3D device with a special adapter and in REF mode in order for PerfHUD to work. While some games already have this kind of support in released version (notable examples include The Elder Scrolls 4: Oblivion and S.T.A.L.K.E.R. - Shadows of Chernobyl), others are more careful (I hope if you're reading this blog you have a build configuration such as Master or Retail, which sets appropriate defines so you can compile development-only stuff, such as asset reloading, profiling or NVPerfHUD support) out of the executable). But still if you manage to intercept the call to Direct3DCreate9 (which can be done for example by creating a DLL, calling it d3d9.dll and putting it near the game executable), you can return a proxy IDirect3D9 object, that forwards all calls to the actual object, except that it modifies the adapter/device type that are passed to CreateDevice. In fact, such proxy objects are used by both PIX and GPA, though the injection technique is more complex.

There are even some programs that simplify the following for you, allowing you to run any title in PerfHUD-compatible mode.

Multithreaded rendering
In fact, this is already described in a Gamefest 2008 presentation “Practical Parallel Rendering with DirectX 9 and 10, Windows PC Command Buffer Recording” (you can get slides and example code here). Basically, since neither Direct3D9 nor Direct3D10 support proper multithreading (creating device as multithreaded means that all device calls will be synchronized with one per-device critical section), you can emulate it via a special proxy device, which records all rendering calls in a buffer, and then uses the buffer to replay the command stream via real device. This saves processing time for other rendering work you do alongside API calls by allowing it to work in multiple threads, and is a good stub for deferred context functionality that's available on other platforms (including Direct3D11 and all console platforms). I use this technique in my pet engine mainly for the purpose of portability – I can render different parts of the scene into different contexts simultaneously, and then “kick” the deferred context via the main one. On PS3 the “kick” part is very lightweight, so the savings are huge; on Windows during the “kick” part deferred context replays the command stream, so it can be quite heavy, but it's faster than doing everything in one thread, and the code works the same way. When I start supporting Direct3D11, the same code will work concurrently, provided a good driver/runtime support of course.

Note that I don't use Emergent library as is – I consider it too heavyweight and obscure for my purposes. They try to support all Direct3D calls, while I use only a handful – I don't use FFP, I don't create resources via this device, etc. My implementation is simple and straightforward, and is only 23 Kb in size (11 of which are reused in another component – see below). If anybody wants to use it I can provide the code to you to save you an hour of work – just drop a comment.

Currently my implementation has a fixed size command buffer, so if you exceed it, you're doomed. There are several more or less obvious ways to fix this, but I hope that by the time I get to it I'll already have D3D11 in place.

Asset pipeline
My asset pipeline is more or less the same for all asset types – there is a source for the asset (Maya/Max scene, texture, sound file, etc.), which is converted via some set of actions to a platform-specific binary that can be loaded by the engine. In this way the complexity of dealing with different resource formats, complex structures, data non suitable for runtime, etc. is moved from engine to tools, which is great since it reduces the amount of runtime code, making it more robust and easier to maintain. The data is saved to a custom format which is optimized for loading time (target endianness, platform-specific data layout/format for graphics resources, compression). I think I'll blog about some interesting aspects/choices in the future as time permits (for example, about my experience of using build systems, such as SCons and Jam, for data builds), but for now I'll focus on a tool that builds textures.

This tool loads the texture file, generates mipmap levels for the texture if necessary (if it was not a DDS with mip chain, and if target texture requires mipmap levels), compresses it to DXTn if necessary (again, that depends on source format and building settings), and makes some other actions, both platform-specific and platform-independent. In order for it to work, I need an image library that can load image formats I care about, including DDS with DXTn contents (so that I don't need to unpack/repack it every time, and so that artists can tweak DXT compression settings in Photoshop plugin – in my experience there is rarely a visible difference, but if they give me a texture and I compress it to DXT and there are some artifacts, I'm to blame – and if they use Photoshop, it's not my scope :)). As it turns out, D3DX is a good enough image loading library, at least it works for me (although in retrospect I probably should've used DevIL, and perhaps I will switch to it in the future).

Anyway, to load a texture via D3DX, you need a Direct3D device. As it turns out, while you can create a working REF device in under 10 lines of code (using desktop window and hardcoded settings), you can't create any device, including NULLREF, if your PC does not have a monitor attached. This problem appeared once I got my pipeline working via IncrediBuild, and sometimes on some machines texture building failed. Since I did not want to modify my code too much, I ended implementing another proxy device, which is suitable for loading a texture with D3DX functions. This time it was slightly harder, because I needed implementations for some functions of IDirect3DDevice9, IDirect3DTexture9 and IDirect3DSurface9, but again the resulting code is quite small and simple – 6 Kb (plus the 11 Kb dummy device I mentioned earlier), and I can load any 2D texture. Of course I'll need to add some code to load cubemaps and even more code to load volume textures, but for now it's fine the way it is.

So these are some examples of situations where implementing Direct3D interfaces might prove useful. The next post is going to either be about multithreading, or about some asset pipeline-related stuff, I guess I'll decide once I get to writing it.
Read more!

Tuesday, March 31, 2009

Placeholder

I'm sorry for the lack of real post - it was a busy week, and a somewhat busy month lies ahead - I'm attending a local game conference in May and giving a speech about the process of porting our rendering subsystem to SPU (I hope to cover this topic here some day), so some time is spent preparing slides/etc.; my pet projects demand more attention than usual; there's some weird but nevertheless interesting stuff at work... I'll try to keep up, but you should really expect some more weeks without any posts. Don't beat me.

Anyway, a bunch of slides from GDC09 Tutorial sessions are finally uploaded; there is some good stuff in "Advanced Visual Effects with Direct3D", and there's some awesome stuff in "Insomniac Games' Secrets of Console and Playstation 3 Programming". I mean, finally someone told people who compute view-space normal Z as sqrt(1 - x^2 - y^2) that they don't know what they're doing! Not to mention SPU stuff, like KISS SPU scheduler (we have a simple enough custom scheduler at work, but it's still far), SPU debugging stories and other SPU talks. By the way, if you're interested in SPU-related topics and have not read everything here, then you don't take SPU seriously.

There are also Khronos' slides here - don't read them unless you have absolutely nothing to do. Read more!

Sunday, March 22, 2009

Miscellanea

There is a bunch of small notes I'd like to share – none of them deserves a post, but I don't want them to disappear forever.


Using hash table as a fixed-size cache

When I worked with Direct3D 10, I found state objects quite cumbersome to work with – they're very slow to create (or at least were back then) and the exact separation of states into objects was sometimes inconvenient from design point of view. Also I've already had a set of classes that divided states into groups and functions like setDepthState with redundancy checking, so I needed to write an implementation for existing interface. The solution I came up with was very simple and elegant, so I'd like to outline it once more (although I sort of mentioned it in the original post).

The natural thing to do here is to cache state object pointer inside state class, and recompute it if necessary (when binding newly created/modified state). There are two issues to solve here – 1. state object creation is expensive (even if you're creating an object with the same data several times in a row – in which case D3D10 runtime returns the same pointer – the call takes 10k cycles), 2. there is a limit on the amount of state objects (4096 for each type). Solving the first one is easy – just make a cache with key being state object description and value being the actual pointer; solving the second one is slightly harder because you'll have to evict entries from your cache based on some policy.

The way I went with was to create a fixed size array (the size should be a power of two less or equal than 4096 and depends on the state usage pattern), make a hash function for state description and use this array as a cache indexed by hash. In case of cache collision the old state object got released.

I often use simple non-resizable hash tables (recent examples include path lookup table in flat file system and vertex data hash to compute index buffer from COLLADA streams), but I always insert collision handling code – but, as it turns out, in this case you can omit it and get some benefit at the same time.


Direct3D 10 Read/Write hazard woes

While in some ways Direct3D 10 is clearly an improvement over Direct3D 9, in lots of areas they screwed it. I surely can't count all deficiencies using my both hands, but some problems annoy me more than the others. One of the things that leads to possible performance/memory compromises is resource Read/Write hazard. There are several inputs for various stages (shader resources (textures, buffers), constant buffers, vertex/index buffers) and several output ones (render targets, depth surfaces, stream out buffers), and there are resources that can be bound to both input and output stage; for example, you can bind a texture to output stage as a render target, then render something to it, and then bind the same texture as a shader resource so that shader can sample rendered data from it. However, Direct3D 10 Runtime does not allow a resource to be bound to both input and output stage at the same time.

One disadvantage is that sometimes you'd like to do an in-place update of render target – for example, to do color correction or some other transformation. In fact, this is a perfectly well-defined operation – at least on NVidia hardware – if you're always reading the same pixel you're writing to; otherwise you'll get old value for some pixels and new one for others. Here there is an actual read/write hazard, but due to the specific hardware knowledge we can exploit it to save memory.

Another disadvantage is that a resource being bound to output pipeline stage does not mean it's being written to! A common example is soft particles – Direct3D 10 introduced cross-platform unified depth textures so that you can apply postprocessing effects that require scene depth without extra pass to output depth in a texture or MRT – you can use the same depth buffer you were using for scene render as a texture input to the shader. While this works perfectly for post processing (except for the fact that you can't read depth from MSAA surfaces – ugh...), it fails miserably for soft particles. You usually disable depth writes for particles so there is no real read/write hazard, but because the runtime thinks there is one you can't bind depth buffer so that HW performs depth test – you can only do depth testing in pixel shader yourself via discard. This disables early coarse/fine Z culling, which results in abysmal performance.

Luckily MSAA depth readback is supported in D3D10.1, and in D3D11 you can bind resources to output pipeline stages as read-only. Too bad there is no D3D11 HW yet, and D3D10.1 is not supported by NVidia...


Knowing the class layout – vfptr

There are two weird points regarding class layout and vfptr (virtual function table pointer) that I'd like to note here – they are related to very simple cases, I'm not going to talk about multiple or god forbid virtual inheritance here.

Why do you need to know class layout? Well, it's useful while writing code so your classes can occupy less space, it's extremely useful while debugging obscure bugs, and you can't even start doing in-place loading/saving without such knowledge (I think I'll make a special post regarding in-place stuff soon). And don't even get me started on debuggers that can't display anything except registers and (if you're lucky) primitive locals – we used to have such debugger on PSP, and CodeWarrior for Wii is only slightly better.

Anyway, the first weird point is related to CodeWarrior – it had been like this on PS2, and it's like this on Wii – I doubt that'll ever change. You see, while on normal compilers there is no way to control vfptr placement – for simple classes without inheritance it always goes in the first word – on CodeWarrior it lies in the place of declaration – except that you can't declare vfptr in C++, so it lies in the place where the first virtual function is declared. Some examples follow:

// layout is vfptr, a, b
struct Foo { virtual void foo1(); unsigned int a; unsigned int b; };

// layout is a, vfptr, b
struct Foo { unsigned int a; virtual void foo1(); unsigned int b; };

// layout is a, vfptr, b
struct Foo { unsigned int a; virtual void foo1(); unsigned int b; virtual void foo2(); };

// layout is a, b, vfptr
struct Foo { unsigned int a; unsigned int b; virtual void foo2(); };


Marvelous, isn't it? Now there is an entry in our coding standard at work which says “first virtual function declaration has to appear before any member declarations”.

The second point was discovered only recently and appears to happen with MSVC. Let's look at the following classes:

struct Foo1 { virtual void foo1(); unsigned int a; float b; };
struct Foo2 { virtual void foo1(); unsigned int a; double b; };


Assuming sizeof(unsigned int) == 4, sizeof(float) == 4, sizeof(double) == 8, what are the layouts of the classes? A couple of days ago I'd say that:

Foo1: 4 bytes for vfptr, 4 bytes for a, 4 bytes for b; alignof(Foo1) == 4
Foo2: 4 bytes for vfptr, 4 bytes for a, 8 bytes for b; alignof(Foo2) == 8

And in fact this is exactly the way these classes are laid out in GCC (PS3/Win32), CodeWarrior (Wii) and other relatively sane compilers; MSVC however chooses the following layout for Foo2:

Foo2: 4 bytes for vfptr, 4 bytes of padding, 4 bytes for a, 4 bytes of padding, 8 bytes for b; alignof(Foo2) == 8

Of course the amount of padding increases if we replace double with i.e. __m128. I don't see any reason for such memory wastage, but that's the way things are implemented, and again I doubt this will ever change.


Optimizing build times with Direct3D 9

Yesterday after making some finishing touches to D3D9 implementation of some functions in my pet project (which is coincidentally a game engine wannabe), I hit rebuild and could not help noticing the difference in compilation speed for different files. The files that did not include any heavy platform-specific headers (such as windows.h or d3d9.h) were compiled almost immediately, files with windows.h included were slightly slower (don't forget to define WIN32_LEAN_AND_MEAN!), and files with d3d9.h were slow as hell compared to them – the compilation delay was clearly visible. Upon examination I understood that including windows.h alone gets you 651 Kb of preprocessed source (all numbers are generated via cl /EP, so the source doesn't include #line directives; also WIN32_LEAN_AND_MEAN is included in compilation flags), and including d3d9.h results in a 1.5 Mb source.

Well, I care about compilation times, so I decided to make things right – after all, d3d9.h can't require EVERYTHING in windows.h and other headers it includes. After half an hour of work, I arrived with minid3d9.h (which can be downloaded here).

Including minid3d9.h gets you 171 Kb of preprocessed source, which is much better. This file defines everything that's necessary for d3d9.h and also a couple of things my D3D9 code used (i.e. SUCCEDED/FAILED macros); you might need to add something else – it's not always a drop-in replacement. Also I've taken some measures that enable safe inclusion of this file after CRT/Platform SDK headers, but don't include it before them – generally, include it after everything else.

This decreased the full rebuild time by 30% for me (even though D3D9 code is less than 15% in terms of code size and less than 10% in terms of translation unit count) – I certainly expected less benefit! You're free to use this at your own risk; remember that I did not test in on 64-bit platform so perhaps it needs more work there. Read more!

Sunday, March 15, 2009

View frustum culling optimization – Representation matters

Before getting into professional game development I've spent a fair amount of time doing it for fun (in fact, I still do it now, although less intensively). The knowledge came from a variety of sources, but the only way that I knew and used to calculate frustum planes equations was as follows – get the equations in clip space (they're really simple – (1, 0, 0, 1), (0, -1, 0, 1), etc.) and then get world space ones by transforming the planes with inverse transpose of view projection camera matrix [correction: in fact, you need to transform with inverse transpose of inverse view projection matrix, which equals to just transpose of view projection matrix]. It's very simple and intuitive – if you know a simple way to express what you need in some space, and a simple way to transform things from that space to your target one, you're good to go.

Imagine my surprise when I started doing game development as a day job and after some time accidentally stumbled upon a piece of our codebase that calculated frustum planes in a completely different way. Given a usual perspective camera setup, it calculated frustum points via some trigonometry (utilizing knowledge about vertical/horizontal FOV angles, near/far distances and the fact that it's a perspective camera without any unusual alterations), and then used them to obtain the equations. I thought it to be very weird – after all, it's more complex and is constrained to specific camera representation, whereas clip space method works for any camera that can be set up for rendering (orthographic projection, oblique-clipping, etc.).

But as it turns out, the same thing can be said about our culling code. It's quite good at culling given box against an arbitrary set of planes (i.e. if you use it for portal/anti-portal culling with arbitrary shape of portals/occluders), but since we have a usual frustum, maybe we can improve it by going to clip space, entirely skipping world space? Let's try it.

We're going to transform AABB points to clip space, and then test them against frustum planes in clip space. Note that we can't divide by w after transforming – that will lead to culling bugs because post-projective space exhibits a discontinuity at the plane with equation z = 0 in view space; however, this is not needed – the frustum plane equations in clip space are as follows:

x >= -w, x <= w: left/right planes
y >= -w, y <= w: top/bottom planes
z >= 0, z <= w: near/far planes

Note that if you're using OpenGL clip space convention, the near plane equation is z >= -w; this is a minor change to the culling procedure.

First, to transform points to clip space, we're going to need a world view projection matrix – I hope the code does not require any additional explanations:

static inline void transform_matrix(struct matrix_t* dest, const struct matrix_t* lhs, const struct matrix_t* rhs)
{
#define COMP_0(c) \
    qword res_ ## c = si_fm((qword)lhs->row2, SPLAT((qword)rhs->row ## c, 2)); \
    res_ ## c = si_fma((qword)lhs->row1, SPLAT((qword)rhs->row ## c, 1), res_ ## c); \
    res_ ## c = si_fma((qword)lhs->row0, SPLAT((qword)rhs->row ## c, 0), res_ ## c); \
    dest->row ## c = (vec_float4)res_ ## c;

#define COMP_1(c) \
    qword res_ ## c = si_fma((qword)lhs->row2, SPLAT((qword)rhs->row ## c, 2), (qword)lhs->row3); \
    res_ ## c = si_fma((qword)lhs->row1, SPLAT((qword)rhs->row ## c, 1), res_ ## c); \
    res_ ## c = si_fma((qword)lhs->row0, SPLAT((qword)rhs->row ## c, 0), res_ ## c); \
    dest->row ## c = (vec_float4)res_ ## c;

    COMP_0(0);
    COMP_0(1);
    COMP_0(2);
    COMP_1(3);

#undef COMP_0
#undef COMP_1
}


After that we'll transform the points to clip space, yielding 2 groups with 4 vectors (x, y, z, w) in each one; the code is almost the same as in the previous post, only we now have 4 components:

static inline void transform_points_4(qword* dest, qword x, qword y, qword z, const struct matrix_t* mat)
{
#define COMP(c) \
    qword res_ ## c = SPLAT((qword)mat->row3, c); \
    res_ ## c = si_fma(z, SPLAT((qword)mat->row2, c), res_ ## c); \
    res_ ## c = si_fma(y, SPLAT((qword)mat->row1, c), res_ ## c); \
    res_ ## c = si_fma(x, SPLAT((qword)mat->row0, c), res_ ## c); \
    dest[c] = res_ ## c;

    COMP(0);
    COMP(1);
    COMP(2);
    COMP(3);
    
#undef COMP
}


    // transform points to clip space
    qword points_cs_0[4];
    qword points_cs_1[4];

    transform_points_4(points_cs_0, minmax_x, minmax_y, minmax_z_0, &clip);
    transform_points_4(points_cs_1, minmax_x, minmax_y, minmax_z_1, &clip);


If all 8 clip-space points are outside left plane, i.e. if for all 8 points p.x <= -p.w, then the box is completely outside. Since we're going to use SoA layout, such tests are very easy to perform. We'll need a vector which contains -w for 4 points; SPU do not have a special negation instruction, but you can easily emulate it either by subtracting from zero or by xoring with 0x80000000. Theoretically xor is better (it has 2 cycles of latency, subtract has 6), but in our case there is no difference in speed; I'll use xor nonetheless:

    // calculate -w
    qword points_cs_0_negw = si_xor(points_cs_0[3], (qword)(vec_uint4)(0x80000000));
    qword points_cs_1_negw = si_xor(points_cs_1[3], (qword)(vec_uint4)(0x80000000));


Now we'll calculate “not outside” flags for each plane; the method is exactly the same as in previous post (as is the final result computation), only now we're not doing dot products.

    // for each plane...
    #define NOUT(a, b, c, d) si_orx(si_or(si_fcgt(a, b), si_fcgt(c, d)))

    qword nout0 = NOUT(points_cs_0[0], points_cs_0_negw, points_cs_1[0], points_cs_1_negw);
    qword nout1 = NOUT(points_cs_0[3], points_cs_0[0], points_cs_1[3], points_cs_1[0]);
    qword nout2 = NOUT(points_cs_0[1], points_cs_0_negw, points_cs_1[1], points_cs_1_negw);
    qword nout3 = NOUT(points_cs_0[3], points_cs_0[1], points_cs_1[3], points_cs_1[1]);
    qword nout4 = NOUT(points_cs_0[2], (qword)(0), points_cs_1[2], (qword)(0));
    qword nout5 = NOUT(points_cs_0[3], points_cs_0[2], points_cs_1[3], points_cs_1[2]);

    #undef NOUT


This is the final version. It runs at 104 cycles per test, so it's slightly faster than the last version. But this method is better for another reason – we've calculated clip space positions of box vertices as a by-product (also we've calculated world view projection matrix, but it's likely to be of little further use, because usually shader constant setup happens at a later point in another module). Some things you can do with them:

Feed them as an input to the rasterizer to do rasterization-based occlusion culling (simple depth buffer, HOM, etc.). This is the road I have not taken yet, though I hope I will do it some day.
Use them for screen size culling – if your bounding box when projected to the screen is small enough (i.e. less than 3x3 pixels), you usually can safely throw it away. This is what I do in our production code; it involves dividing positions by w (don't forget to discard the size culling results if any point has w < epsilon!), computing min/max x/y for the results, subtracting min from max and checking if the difference along each axis is less than threshold. The actual implementation is left as an exercise to the reader.

This concludes the computation part of view-frustum culling. There is still something to do here – there is p/n-vertex approach which I did not implement (but I'm certain that it won't be a win over my current methods on SPUs); there are minor potential improvements to the current code that are not worth the trouble for me; all implemented tests return only binary result (outside / not outside), a ternary version can help in hierarchical culling (though this can be achieved with a minor modification to all presented code). There might be something else I can't think of now – post a comment if you'd like to hear about other VFC-related topics!

I'm going to write one final post regarding VFC, which deals with the code that will use is_visible to perform culling of the given batch array – the topics include DMA and double buffering; after that the whole VFC series will be over and I'm going to switch to something different.

The complete source for this post can be grabbed here. Read more!

Sunday, March 8, 2009

Fighting against CRT heap and winning

Memory management is one of (many) cornerstones of tech quality for console games. Proper memory management can decrease amount of bugs, increase product quality (for example, by eliminating desperate pre-release asset shrinking) and generally make life way easier – long term, that is. Improper memory management can wreak havoc. For example, any middleware without means to control/override memory management is, well, often not an option; any subsystem that uncontrollably allocates memory can and will lead to problems and thus needs redesigning/reimplementing. While you can tolerate more reckless memory handling on PC, it often results in negative user experience as well.

In my opinion, there are two steps to proper memory management. First one is global and affects all code – it's memory separation and budgeting. Every subsystem has to live in its own memory area of fixed size (of course, size can be fixed for the whole game or vary per level, this is not essential). This has several benefits:

  • memory fragmentation is now local – subsystems don't fragment each other's storage, thus fragmentation problems happen less frequently and can be reproduced and fixed faster

  • fixed sizes mean explicit budgets – because of them out of memory problems are again local and easily tracked to their source. For example, there is no more “game does not fit in video memory, let's resize some textures” - instead, you know that i.e. level textures fit in their budget perfectly, but the UI artists added several screen-size backgrounds, overflowing UI texture budget

  • because each subsystem lives in its own area, we have detailed memory statistics for no additional work, which again is a good thing for obvious reasons

  • if memory areas have fixed sizes, they either have fixed addresses or it's easy to trace address range for each of them – this helps somewhat in debugging complex bugs


Second one is local to each subsystem – once you know that your data lives in a fixed area, you have to come up with a way to lay your data in this area. The exact decisions are specific to the nature of data and are up to the programmer; this is out of this post's scope.

Memory is divided into regions, each region is attributed to a single subsystem/usage type – if we accept this, it becomes apparent that any unattributed allocations (i.e. any allocations into global heap) are there either because nobody knows where they should belong or because the person who coded those does not want to think about memory – which is even worse (strict separation and budgeting makes things more complicated in short term by forcing people to think about memory usage – but that's a good thing!). Because of this global heap contains junk by definition and thus should ideally be eliminated altogether, or if this is not possible for some reason, should be of limited and rather small size.

Now that we know the goal, it's necessary to implement it – i.e. we want to have a way to replace allocations in global heap with either fatal errors or allocations in our own small memory area. On different platforms there are different ways to do it – for example, on PS3 there is a documented (and easy) way to override CRT memory management functions (malloc/free/etc.); on other platforms with GNU-based toolchain there is often a --wrap linker switch – however, on some platforms, like Windows (assuming MSVC), there does not seem to be a clean way to do it. In fact, it seems that the only known solution is to modify the CRT code. I work with statically linked CRT, so this would mean less distribution problems, but more development ones – I'd have to either replace prebuilt CRT libraries (which is out of the question because it makes working with other projects impossible) or ignore them and link my own, which is better – but still, the process required building my own (hacked) version of CRT. I did not like the approach, so I came up with my own.

First, some disclaimers. This code is tested for statically linked Win32 CRT only – it requires some modifications to work on Win64 or with dynamically linked CRT – I might do the Win64 part some day, but not DLL CRT. Also I'm not too clear on EULA issues; because of this, I'll post my entire code except for one function that's essentially ripped from CRT and fixed so that it compiles – read further for more details. Finally, there may be some unresolved issues with CRT functions I don't currently use (though I think my solution covers most of them) – basically, this is a demonstration of approach with proof-of-concept code, and if you decide to use it you're expected to fix the problems if they arise :)

Our first priority is to replace CRT allocation functions without modifying libraries. There are basically two ways to do something like it – link time and run time. Link time approach involves telling the linker somehow that instead of existing functions it should use the ones supplied by us. Unfortunately, there does not seem to be a way to do this except /FORCE:MULTIPLE, which results in annoying linker warnings and disables incremental linking. Run time way involves patching code after the executable is started – hooking libraries like Detours do it, but we don't need such a heavyweight solution here. In fact, all that's needed is a simple function:

static inline void patch_with_jump(void* dest, void* address)
{
    // get offset for relative jmp
    unsigned int offset = (unsigned int)((char*)address - (char*)dest - 5);
    
    // unprotect memory
    unsigned long old_protect;
    VirtualProtect(dest, 5, PAGE_READWRITE, &old_protect);
    
    // write jmp
    *(unsigned char*)dest = 0xe9;
    *(unsigned int*)((char*)dest + 1) = offset;
    
    // protect memory
    VirtualProtect(dest, 5, old_protect, &old_protect);
}


This function replaces first 5 bytes of code contained in dest with jump to address (the jump is a relative one so we need to compute relative offset; also, the code area is read-only by default, so we unprotect it for the duration of patching). The primitive for stubbing CRT functions is in place – now we need to figure out where to invoke it. At first I thought that a static initializer (specially tagged so that it's guaranteed to execute before other initializers) would be sufficient, but after looking inside CRT source it became apparent that heap is initialized and (which is more critical) used before static initialization. Thus I had to define my own entry point:

    int entrypoint()
    {
        int mainCRTStartup();
        
        patch_memory_management_functions();
        
        return mainCRTStartup();
    }


Now to patch the functions. We're interested in heap initialization, heap termination and various (de)allocation utilities. There is _heap_init, _heap_term and lots of variants of malloc/free and friends – they are all listed in source code. Note that I stubbed all _aligned_* functions with BREAK() (__asm int 3), because neither CRT code nor my code uses them – of course, you can stub them if you need.

There are several highlights here. First one I stumbled upon is that _heap_term is not getting called! At least not in static CRT. After some CRT source digging I decided to patch __crtCorExitProcess – it's useful only for managed C++, and it's the last thing that gets called before ExitProcess. The second one is in function _recalloc, that's specific to the allocator you're using to replace the default one. The purpose of _recalloc is to reallocate the memory as realloc does, but cleaning any additional memory – so if you do malloc(3) and then _recalloc(4), ((char*)ptr)[3] is guaranteed to be 0. My allocator aligns everything to 4 bytes and has a minimal allocation size limit; the original size that was passed to allocation function is not stored anywhere. It's easy to fix it for CRT because _recalloc is used in CRT only for blocks allocated with calloc, and I hope _recalloc is not used anywhere else. By the way, there is a bug in CRT related to _recalloc – malloc(0) with subsequent _recalloc(1) does not clear first byte (because for malloc(0) block with size 1 is created); moreover, more bugs of such nature are theoretically possible on Win64. Personally I find calloc weird and _recalloc disgusting; luckily it's Windows-only.

Ok, now we're done – are we? Well, everything went well until I turned leak detection on. It turns out that there are lots of allocations left unfreed by CRT – amazingly, there is a __freeCrtMemory function that frees some of those, but it's compiled in only in _DEBUG, and it's called only if CRT debugging facilities are configured to dump memory leaks on exit. Because of this I needed to copy the code, modify it slightly so that it compiles and invoke the function before heap termination. However, this function does not free everything – there were some more allocations left, that I needed to handle myself. You can see the code in cleanup_crt_leaks(). After cleaning up leaks printf(), which was used to output leaks to console, became unusable (oh, horror!), so I came up with the following function:

void debug_printf(const char* format, ...)
{
    char buf[4096];
    
    va_list arglist;
    va_start(arglist, format);
    wvsprintfA(buf, format, arglist);
    va_end(arglist);
    
    // console output
    HANDLE handle = GetStdHandle(STD_OUTPUT_HANDLE);
    WriteFile(handle, buf, (unsigned long)strlen(buf), NULL, NULL);

    // debug output
    OutputDebugStringA(buf);
}


Finally, the last problem is that some CRT code checks global variable _crtheap prior to allocation, so we have to initialize it to something (that affects fopen() and other functions that use dynamically created critical sections).

Well, now it works and I'm quite happy with the results. Of course it's slightly hackish, but CRT code is such a mess that it blends in nicely. The more or less complete source code is here. Note that if you're using C++ new/delete and you have not overridden them globally for some reason, you might want to patch _nh_malloc/_heap_alloc with malloc_stub as well. Read more!