Sunday, October 28, 2007

My own lighting shader with blackjack and hookers

I am sorry for the delay in posts; I’ll try to keep it more regular, one post in a week or so.

Free Image Hosting at www.ImageShack.usSo, yesterday we were discussing various lighting approaches at local IRC chat, and someone said that it was impossible to make a ps.2.0 single-pass lighting shader for 8 lights with diffuse, specular and attenuation, of course with normal map. Well, I told him he was actually wrong, and that I can prove it. Proving it turned out to be a very fun and challenging task. I am going to share the resulting shader and the lessons learned with you.

From the start I knew it was not going to be easy – ps.2.0 means that no arbitrary swizzles are available (which can in some cases restrict possible optimizations), and there is a limit of 64 arithmetic instructions. Add to that the fact that there is no instruction pairing as in ps.1.x (which is rather strange, as all hardware I know of is capable of pairing vector and scalar instructions).

I decided to compute lighting in world space, as I thought that passing all lighting data from VS to PS is going to consume too much interpolators. So I passed world-space view vector, world-space position and tangent to world space conversion matrix from VS to PS, and had PS transform normal to world space and compute everything else. This proved to be insufficient to reach target result, but I am going to show you the shader anyway, as it has some interesting ideas, some of which are still in the final shader.

Here is the shader: link. Ignore stuff like bindings.h, TRILINEAR_SAMPLER, etc. – these are to allow FX Composer bindings of parameters.

Interesting things to note:

  1. For world-space lighting, there is no need to do normal expansion (normal * 2 – 1), you can encode it into tangent space -> world space conversion matrix. Saves 1 instruction.

  2. Vectorize like crazy. You can do a lot of stuff cheaper if you vectorize your calculations. For example, here we have 8 lights, that’s 2 groups, 4 lights in each one. You can save instructions on computing attenuation for 4 lights at once (you get 1 mad instruction for 4 lights, instead of 1 mad instruction per light), you can save a lot of instructions for specular computations (read below), etc.

  3. If you want to add a lot of scalar values, dot is your friend. dot(value, 1) will add all components of value.

  4. Be smart about replacing equations with equivalent ones. For example, na├»ve Phong specular equation is dot(R, V), where R is reflected light vector. In this case, you have to do reflect() for each light. Reflect is not quite cheap, and, as we have 8 lights and only 64 instructions, every instruction that’s executed per light is very expensive. But dot(reflect(L, N), V) is equal to dot(reflect(V, N), L) (V is a view vector), which requires only a single reflect().

  5. Specular is expensive, because pow() call is expensive. Why? Because, sadly, pow() can’t be vectorized. pow() is interpreted as three instructions, log, mul, exp (pow(a, b) is equal to exp(log(a) * b)), and neither log, nor exp have vectorized form. The result is that you’ll waste at least two instructions per light only to compute specular power.

    There are several solutions. The first one is to fix specular power to some convenient value. For example, pow(a, 16) can be implemented as 4 muls (and in fact HLSL compiler does it automatically), and muls can be vectorized – so you can compute specular power for all 4 lights for 4 instructions, which is much better.

    However, there is a better solution, which is described here: A Non-Integer Power Function on the Pixel Shader. Basically, you can approximate the function pow(x, N) with pow(max(A*x + B, 0), M). A and B can be tuned such that the maximum error is low enough for the approximation to be useable in practice, and M can be made very small, for example I use M=2, and there are no artifacts for N=18 (the results ARE different, and it can be seen by switching between formulas in realtime, but the difference is small and no one will be able to tell that you use an approximation). Alternatively, you can have your artists tune A and B directly.

    The net effect is that instead of several instructions per light, we compute specular power in 2 instructions – mad_sat to compute A*x+B, and mul to compute the result (remember, M=2).

Okay, we have a bunch of cool optimizations here. Are we done? Sadly, no, we are not. The presented shader compiles into 75 instructions. The problem is that per-light cost is still too high. We have to compute a unnormalized light vector (1 sub), compute its squared length (1 dp3), normalize it (1 rsq + 1 mul), compute dot(N, L) for diffuse lighting (1 dp3), compute dot(R, L) for specular lighting (1 dp3), and combine diffuse lighting with specified colors (1 mad). This is 7*8=56 instructions, which leaves 8 instructions, and we need much more. What can we do?

Well, we can simplify our calculations and replace light colors with light intensities. This will strip 8 instructions, but add 2 instructions for multiplying NdotL vector by intensity, and 1 instruction for summing all diffuse components together (dp3), which is still not enough – we need to reserve instructions for other things (for example, normal transformation takes 3 instructions, specular computation is 4 instructions for 8 lights, attenuation is 4 instructions for 8 lights, and there are still several other things left).

Yesterday, I gave up and went to sleep. But today, after some thinking, I felt like a barrier in my head collapsed – I knew why it did not work out, and I knew the better solution.

See, it is true that we have a lot of overhead per light source. The problem is not in the fact that we need to do a lot of calculations – the problem is in that we are using computation power inefficiently.

Let’s look at the instruction counts I presented earlier. Yes, it’s true that we do 1 sub per light – but sub is capable of processing 4-float vectors, and we are using it to subtract 3-component vectors, so we lose some power here. The same stays true for everything else – all dot products and mul for example.

And the barrier that collapsed in my head had an inscription “You have the mighty dot product, use it”. As it turned out, there is not a lot of sense in treating GPU assembly differently from for example SSE instructions. We do not have dot product in SSE. Trying to compute 4 dot products at once in a straightforward way in SSE is subject to miserable failure – most likely FPU will be faster. But if you change the way your data is organized, and instead of laying it in AoS order:

v1.x v1.y v1.z (unused)
v2.x v2.y v2.z (unused)
v3.x v3.y v3.z (unused)
v4.x v4.y v4.z (unused)

lay it out in SoA order:

v1.x v2.x v3.x v4.x
v1.y v2.y v3.y v4.y
v1.z v2.z v3.z v4.z

And lo and behold, a lot of slow dot product instructions are now just 3 muls and 2 adds – simple and fast.

This was a triumph. I’m making a note here: HUGE SUCCESS.

I decided to try the same thing with my shader code. It solved all problems like magic. The resulting code (link) while doing exactly the same calculations, now compiles in 54 instructions.

The reason is simple, of course. For example, where in the previous shader we computed squared lengths in 1 instruction per light, here we do it for 3 instructions for 4 lights, effectively using 25% less ALU. The new layout also made it possible to pass lights via interpolators (light data fits into 6 interpolators), which allowed to remove 1 sub instruction per light, and also 3 instructions for transforming normal into tangent space (at the expense of adding 1 expand instruction, of course).

Apart from the SoA data layout, which effectively is the reason why the new shader is so much smaller, there is only one trick - instead of normalizing each vector, we correct dot product results. This saves a couple of instructions for the entire shader.

The old shader did not fit into the instruction limit, the new one does, and it has 10 spare instructions. There is a bunch of things you could do with it. For example, you can implement parallax mapping – 10 instructions should be enough for several parallax steps. Note that one interpolator can be freed (view vector can be stored in COLOR interpolator at the expense of 1 additional instruction (expand from [0,1] to [-1,1])), so you can implement shadow mapping (for example, make first light source a directional one – this is straightforward, you just have to modify vertex shader to supply correct tangent-space direction, and place 0 in light_radius_inv to disable attenuation – and add shadows for it).

There is also space for some small tweaks – i.e. disable specular for triangles with dot(N, L) < 0, make wrap-around lighting, add ambient lighting, have colored specular (use a per-light color instead of the global one), add specular attenuation, etc.

Note, that a couple of instructions can still be saved if you do not want light colors, only intensities (read above).

So, this was a pleasant experience, and I am glad that I decided to write this shader. Sadly, all articles I’ve read about shader optimizations (save for one, “Bump My Shiny Metal” by Andrew Aksyonoff in ShaderX 4) are about common and trivial stuff – vectorize your computations, inspect compiler output... I hope that this post was more interesting. Read more!

Saturday, October 6, 2007

Render state rant

Today we’ll talk about render state management in D3D10 a little.

While designing D3D10, a number of decisions were made to improve runtime efficiency (to reduce batch cost, basically). It’s no secret that D3D9 runtime is not exactly lean & mean – there is a lot of magic going on behind the scenes, a lot of validation, caching, patching…

For example, D3D9 has the ability to set any render or sampler state at any time. However, hardware does not really work that way. The states are separated into groups, and you can only set the whole group at a time (of course, the exact structure of groups is hardware dependent). What this means for runtime/driver is that SetRenderState/SetSamplerState often do not actually set anything. Instead, they modify a so called shadow state – they update the changed states in a local shadow copy, and mark the respective groups as dirty. Then, when you call DrawIndexedPrimitive, the changed state groups are being set, and the dirty flags – cleared.

Also there could be some cross-state checking going on, I don’t know for sure.

So, D3D10 designers decided to replace hundreds of states by 4 simple state objects. Behold, ID3D10BlendState, ID3D10DepthStencilState, ID3D10RasterizerState and ID3D10SamplerState. These are immutable driver state objects – you create them once, you can’t modify them, and driver knows about them, which means it can do smart things (for example, store a chunk of push buffer that sets the respective state inside each state object) and thus optimize state setting. Also all possible validation is made at creation time only.

Sounds cool, right? Yeah, it did first time I’ve read about state objects. Except…

Problem #1. State objects are relatively expensive to construct. Which means 10-100 thousand CPU cycles on my Core 2 Duo. You know, given that we have a user mode driver, given that the smartest thing I can think of here is to 1. hash the state to check if the same state has already been created (it is actually done, you can check it by creating the same state object several times), 2. If the hash lookup failed, validate the state (perform some sanity checks), 3. construct a chunk with push buffer commands, 4. store it in allocated memory. And that should be performed by user mode code. Something is horribly wrong here, I swear.

Note, that even creating the already existing state object (hash state, hash lookup, compare actual state values, remember?) takes 10 thousand cycles. The caching should be performed in D3D10 runtime (the pointer returned is exactly the same - i.e. for input layouts, the caching is performed by the driver, as the pointer to the InputLayout is different, but the underlying driver object is the same; for state objects, pointers to runtime objects are equal), so this means that computing a hash of a 40-byte description object, doing a table lookup, and then comparing two 40-byte objects to test for equality in case of hash collision takes 10 thousand cycles.

Problem #2. You can’t create more than 4096 state objects. Which means that you can’t just forget about it and create a state for each object, even for static ones. Well, you can try, but one day it’ll fail.

Problem #3. The separation of states into groups is outrageous. I did not tell one small thing – not all states are actually immutable. There are two things you can change without constructing a new state object (they act as parameters for state object setting functions). Those are… stencil reference value and blend factor. Everything else is immutable, for example, depth bias and slope scaled depth bias.

How many of you have used blend factor with pixel shader (let’s say ps.2.0)-capable hardware?

How many of you have used a constantly changing stencil reference value?

I’d like to do things like progressively loading textures. What do I mean? Let’s load our texture from smaller mip levels to larger ones. Let’s interpolate MinLOD parameter from N to N-1 once N-1-th mip level is completely loaded – let’s do it over some fixed time. This way there will be no mip level popping, but instead we’ll see gradually improving quality as new levels are loaded – trilinear filtering will do proper interpolation for us. That’s easy, right? No. Not in D3D10.

Yes, I know I could cache render states inside objects, and perform some lifetime management (LRU?..). Though this won’t help in case of constantly changing parameters.

Yes, I know I could separate render states into groups how I like it best, have a 4096-entry hash table, and do lookups in it. And this is actually what I am doing now.

But it does not make me happy.
Read more!

Tuesday, September 25, 2007

Robust unit cube clipping for shadow mapping

Shadow mapping is my primary area of interest in computer graphics, so expect more posts on this topic. Today I’d like to tell about robust unit cube clipping regarding different projection matrix building techniques.

The task at hand is relatively simple – given a set of points representing shadow receivers and a set of points representing shadow casters, build a matrix that, while used for shadow rendering, maximizes shadow map utilization (both in terms of shadow map area and depth precision) and at the same time eliminates all shadow clipping artifacts. I have not implemented anything except directional light support properly, so expect another post sooner or later.

Note that all points are assumed to be in world space, and for a lot of algorithms it’s preferred to take vertices of convex hull of receivers, clipped by view frustum, instead of actual receivers’ vertices – it’s not always required, but for the sake of simplicity we will assume that all receivers’ points are inside view frustum. Of course casters’ points are arbitrary.

Uniform Shadow Mapping

Uniform shadow mapping is shadow mapping with a simple orthographic projection, without any perspective reparametrization. As unit cube clipping is an operation that’s usually done *after* constructing some approximate matrix, let’s suppose we already have a view and projection matrix for our light configuration – note that in case of uniform light the only thing we care about is that view matrix has the same direction as the light, and the projection matrix represents some orthographic projection – everything else is irrelevant, so we can take arbitrary view position, more or less arbitrary up vector, construct a view matrix, and assume that projection matrix is actually identity matrix.

Now let’s construct two axis-aligned bounding boxes, one for receiver points, transformed to our light post-perspective space, another one for caster points, again in light PPS. Note that projection is orthographic, therefore there is no post-perspective division employed, and there are no singularities during AABB construction.

Now we have to build a new matrix that transforms our light viewprojection matrix to minimize shadow map wastage and Z precision loss, while preserving shadows. The actual choice of values depends on some states that are set while applying shadow map to scene.

At first, let’s deal with XY extents. Many papers propose choosing receivers’ XY extents, because we don’t care about points of casters that are outside receivers’ extents – i.e. can’t cast shadows on receivers. This provides correct results, but we can do slightly better – we can select intersection of casters’ and receivers’ XY extents:

float min_x = max(receivers.min_x, casters.min_x);
float min_y = max(receivers.min_y, casters.min_y);
float max_x = min(receivers.max_x, casters.max_x);
float max_y = min(receivers.max_y, casters.max_y);

What we get here is that if casters’ extents are “wider” than receivers’, we get the same extents as before. However, if you have a big receiver and relatively small casters on it, this will select smaller extents. Note that extents are always correct – they enclose all points that BOTH some caster and some receiver could project to – i.e. all points there you potentially could need shadow. Everywhere else there is no shadow.

Whether this is beneficial depends on scene type – you’ll usually get no benefit from this approach in real-life scenes if there is a single shadow map for the whole view – but if you have some kind of frustum splitting approach, depending on your scene, this could lead to quality improvement.

There is still a problem left – if you have CLAMP filtering set for your shadow map, this approach will cause visual bugs, because now some of receivers’ points are actually out of shadow map – so if there is a caster that fills border pixels with Z value that produces shadow, the shadow will stretch outwards. Solution? Either use BORDER filtering or when rendering to NxN shadowmap set a viewport with X = 1, Y = 1, width = N-2, height = N-2 – this will make sure that border pixels are not affected with casters. This is a small price to pay for potentially tighter frustum.

Now we have tight XY extents, and we’ll have to solve problems with Z.

Let’s suppose we’ve chosen ZN and ZF as our Z extents. This would mean that:

  1. All casters are clipped by ZN and ZF

  2. Shadow map contain depth values in range [0..1] in light PPS

  3. All receivers points that have Z < ZN will have Z < 0 in light PPS

  4. All receivers points that have Z > ZF will have Z > 1 in light PPS

At first, we can’t let our casters be clipped by near plane – this would produce artifacts – so the resulting ZN value has to be less or equal to casters’ minimal Z value. If there are no receivers with Z < casters.min_z, there is no sense to push ZN further (to decrease ZN, that is). If there ARE receivers left with Z < casters.min_z, then there should not be any shadows there. Let’s look at our shadow test.

float is_in_shadow = shadowmap_depth < pixel_depth ? 1 : 0;

For receivers’ points with Z < ZN, pixel_depth is below 0, and the test always returns false, so is_in_shadow = 0 – this is actually what we want. So there is no need to make ZN less than casters.min_z, thus ZN = casters.min_z.

For receivers’ points with Z > ZF, the situation is opposite – pixel_depth is above 1, and the test always returns true, thus making all receivers points with Z > ZF shadowed. This means that there should be no receivers’ points with Z > ZF, so ZF is greater or equal than receivers.max_z.

Is there any reason to push ZF beyond that? No. No, because we don’t care about casters points that have Z > receivers.max_z – they are not going to cast shadows on any receiver anyway. Thus ZF = receivers.max_z.

Now that we have our XY and Z extents, we construct a scaling/biasing matrix that maps [min_x..max_x] x [min_y..max_y] x [ZN .. ZF] to [-1..1] x [-1..1] x [0..1] (or to [-1..1] x [-1..1] x [-1..1] in case you’re using OpenGL), and multiply your light viewprojection matrix by this matrix. By the way, the scaling/biasing matrix is of course equal to the corresponding orthographic projection matrix, so you can use existing functions like D3DXMatrixOrthoOffCenterLH to compute it.

Now that we’ve solved the problem for uniform shadow mapping, let’s move on to various perspective reparametrization algorithms.

Trapezoidal Shadow Mapping

The brief outline of TSM algorithm is as follows – construct light viewprojection matrix, transform receivers’ points to light PPS, approximate them by a trapezoid, construct a matrix that maps trapezoid to unit cube, multiply light viewprojection matrix by the trapezoid mapping matrix.

Thus applying unit cube clipping to TSM is very simple – you first construct a tight frustum for uniform shadow mapping (see above), and then use it for TSM, with no further corrections. This produces correct extents in the resulting matrix.

As we selected XY extents as intersection of casters’ and receivers’ extents, the slightly more correct approach would be to use not the receivers’ points for trapezoidal approximation, but rather points of the receivers’ volume, clipped by planes that correspond to the chosen XY extents. However, my experiments resulted in no significant quality improvement – texel distribution was good enough without this step.

Also note, that as TSM produces a frustum with relatively high FOV, the distortion of post-perspective W coordinate can affect Z coordinate badly. Some solutions for these problems are already presented in the original TSM paper (though expect another post about various methods for fixing Z errors).

Light-space Perspective Shadow Mapping

The brief outline of LiSPSM algorithm is as follows – construct light space with certain restrictions, transform all “interesting” points in the light space, build a perspective frustum that encloses all points, transform it back from light space to world space.

The problem is that if you treat only receivers’ points as “interesting”, you get shadow clipping due to Z extents; if you treat both receivers’ and casters’ points as “interesting”, the shadows are correct, but you get worse texel distribution. Also you can’t really fix Z extents AFTER you’ve computed the frustum – because perspective projection has singularities for points on plane with Z == 0, and occasionally some caster point gets near this plane and you get very high post-perspective Z extents, which screw the Z precision.

In short, I don’t know a good solution. If the light faces the viewer, we can use the approach for normal positional light for PSM (read below). Otherwise, it does not seem we can do anything. Currently I focus my frustum on both casters’ and receivers’ points. If you know a solution, I’d be more than happy to hear it.

Perspective Shadow Mapping

Note that I am not going to describe unit cube clipping for the PSM from the original paper by Stamminger and Drettakis, because there is a much better PSM algorithm, described in GPU Gems 1 by Simon Kozlov (Chapter 14, “Perspective Shadow Maps: Care and Feeding”).

The brief outline of the algorithm is as follows – construct virtual camera which is essentially a real camera slid back a bit to improve quality (to improve zf/zn ratio), transform light to virtual camera PPS. If it’s a directional light in PPS (which means that light’s direction was orthogonal to view direction), construct uniform shadow mapping matrix for the light direction in PPS (of course you should transform all casters and receivers to PPS). The resulting matrix should first transform incoming points to virtual camera PPS, and then transform them by uniform shadow mapping matrix.

If light is a positional one in PPS (which, of course, happens most often), then at first compute a bounding cone with center in light PPS around all receivers’ points (transformed to PPS again, of course). Then we have two cases – either light in PPS becomes an inverted one – that happens if light is shining from behind the viewer, i.e. Z coordinate of light direction in view space is positive – or light is a normal one. For further reference I suggest you read Stamminger and Drettakis paper (STAMMINGER M., DRETTAKIS G.: Perspective shadow maps. In Proceedings of SIGGRAPH(2002), pp. 557.562.).

If light is a normal one, then all casters that can possibly cast shadows on visible receivers’ regions are in front of virtual camera’s near plane – so we can construct a normal shadow mapping matrix from bounding cone parameters. If light is an inverted one, then usual shadow mapping matrix will not contain all casters; therefore Kozlov proposes to use an “inverse projection matrix” which is constructed by specifying –z_near as near plane value and z_near as far plane value. Again, for further reference I suggest you read his paper.

Now, we want to perform unit cube clipping, and there are actually three cases we have to resolve (directional light in PPS, positional light in PPS, inverted positional light in PPS). Why do we have problems? Well, at first, while all receivers points are inside original view frustum (and thus they are inside virtual view frustum), because we clipped receivers’ volumes, caster points are arbitrary, and so there are singularities for caster points with view space Z close to 0.

In case of normal positional light in PPS (directional light that shines in our face), we don’t care about casters that are beyond near plane; so we can clip our casters by near plane, and all caster points will be well-defined in post-projective space, which means that after we’ve computed the projection in PPS from bounding cone, we can find receivers’ and casters’ extents and use the same algorithm we had for uniform shadow mapping – just multiply the light matrix in PPS by unit cube clipping matrix (this will extend light frustum in PPS to hold all needed caster points).

Theoretically, you’ll need precise geometrical clipping of caster volumes by near plane. However, in practice, for my test scenes the simple approach of computing extents only for points in front of near plane worked well. Your mileage may vary.

In case of inverted positional light in PPS, we can no longer clip by near plane, because we’ll clip away potential casters. What we’d like to do instead is to take Z extents as in normal unit cube clipping for all caster points, and modify the shadow mapping matrix as usual. Why can’t we do that (and, as a matter of fact, why could not we do that for normal positional light in PPS, why do we need camera clipping)?

Well, the problem is, that if there are caster points with view-space Z near 0, the PPS Z extents of casters become very huge. As we use casters’ minimal Z value as our Z near value, this can lead to huge extents of unit cube clipping matrix, which makes Z precision very bad. For positional light in PPS, we avoid this by clipping casters by near plane, so that we no longer have casters with very big coordinates in PPS.

Luckily, with inverse projection matrix we can easily solve that – just clip casters’ minimal Z value with some small negative value, i.e. -10. (Note: here and below, “casters’ minimal Z value” refers to minimal Z value of casters transformed first to virtual camera PPS, and then by shadow mapping matrix).

Why does it work? Well, with normal perspective matrix, if we decrease Z near value’s magnitude (i.e. decrease the actual Z near value, as it’s positive for normal matrix), we enlarge our viewing frustum. It turns out that for inverse projection matrix, decreasing Z near value’s magnitude again enlarges viewing frustum – and clipping Z near by some negative value decreases its magnitude, while increasing the actual value.

Finally, if the light is a directional one in PPS, we can clip casters by camera’s near plane too, as we did in case of normal positional light.


  1. If light is a directional one in PPS, construct uniform shadow mapping matrix as in first part of this post, only using casters clipped by camera near plane (either by geometrical clipping or just by throwing away caster points which are beyond near plane).

  2. If light is a positional one in PPS, construct bounding cone with center in light’s position that holds all receiver points. Then, for normal (non-inverted) light, construct shadow mapping matrix (view matrix: simple view matrix with eye position = light’s position, view direction = bounding cone direction, and arbitrary up vector; projection matrix: matrix with FOV wide enough to hold the whole bounding cone, and Z extents that encompass all receiver points), and perform unit cube clipping, only using casters clipped by camera near plane.
    For inverted light, construct an inverted projection matrix instead of a normal one, and instead of clipping casters clip casters’ minimal Z value.

Thanks to Simon Kozlov for suggesting solution for shadow clipping problems, basically all the text in this section consists of his ideas.

eXtended Perspective Shadow Mapping

I am not going to describe algorithm and clipping problems in details, as they are well summarized in the original paper. The only mention I will make is that at the step 11 of the algorithm, all points with w < epsilonW are clipped. This sometimes causes clipping of shadows. The solution is to modify the extents building procedure as follows: instead of throwing away points with w < epsilonW completely, just don’t compute Z bounds for them. This is a hack, and it works only because we’re doing 2D rectangle intersection while doing unit cube clipping.

The author of the algorithm knows about the problem. We discussed several approaches that lead to fixing it, and this was the best one we could come up with for now.
Read more!

Saturday, September 22, 2007

Particle rendering revisited

Recently I was doing particle rendering for different platforms (D3D9, PS3, XBox360), and I wanted to share my experience. The method I came with (which is more or less the same for all 3 platforms) is nothing new or complex - in fact, I know people were and are doing it already - but nevertheless I've never seen it described anywhere, so it might help somebody.

The prerequisites are that we want a pretty flexible simulation (such as, all parameters are controlled by splines, and splines come from some particle editor – perhaps also there is some collision detection, or instead particles are driven by some complex physics only) – which means that (a) we don’t have a simple function position(time) (and likewise for other particle parameters), and (b) we don’t want to implement a fully GPU-based solution, with rendering to vertex buffer/streamout. After all, next-gen GPUs are not *that* powerful, we don’t have that many particles, and we often do not use all available cores (in case of PS3/360 at least) efficiently.

Also let’s suppose for the sake of simplicity that our particles are actually billboards that can only rotate around view Z axis – i.e. they are always camera-facing. This does not really matter so much, but it will make things easier.

What we’d like to do ideally is to upload particles to a buffer, and have GPU render from it. To keep the amount of data low, we’d like to copy exactly one instance of each particle, without duplication. The classical (trivial) approach is to fill VB with particle data, 4 vertices per each particle, while doing all computations on CPU – that is, vertex shader only transforms particle in clip space. This is of course not very wise (after all, we’re trying to save some CPU clocks here), so another classical (slightly less trivial) approach is to fill VB with particle data, 4 vertices per each particle, where those 4 vertices differ only in their UV coordinates. UVs act like corner identifications – you know UV coordinate in vertex shader, and you know the corner of the particle you’re processing ((0, 0) = upper left corner, etc.). Thus you can easily perform actual coordinate position calculation in vertex shader like this:

float3 corner_position = particle_position + camera_axis_x * (uv.x – 0.5) * size + camera_axis_y * (uv.y – 0.5) * size;

Also we have point sprites that *seem* to achieve the same thing we need – you upload exactly 1 vertex per each particle. However, they have lots of disadvantages – point size is limited, you can’t rotate them, etc.

The method I am talking about goes slightly further. Let’s divide our particle data in two parts, actual particle data (position, color, angle of rotation, etc.) and UV coordinates. Now we notice, that what we really want is to have two streams of data, one stream contains particle data without duplication, the other stream contains ONLY UV coordinates – moreover, this buffer consists of the same data, repeated many times – you have 4 vertices (0, 0), (1, 0), (1, 1), (0, 1) for the first particle, 4 vertices (0, 0), (1, 0), (1, 1), (0, 1) for the second one, etc. – so we’d like to be able to specify it once and have GPU “loop” over them.

In effect, we want something like this:

Fortunately, there is a solution that can solve it – it’s hardware instancing. Unfortunately, it’s not available everywhere – you (usually) need SM3.0 support for it. We’re going to accept this disadvantage however.

Thus we have a static stream with 4 “vertices” representing corner data (each “vertex” consists of a single float2), and a dynamic stream with N “instances” representing particles (each “instance” consists of, in our example, a position, color and angle). We render N quads, so the vertex shader gets executed 4*N times – every time we have 1 piece of instance data, and 1 piece of corner data. We compute actual particle corner position as shown above, and output it.

Note that it looks like point sprites. It has a disadvantage in that we have 4 vertex shader runs per each particle, instead of 1 with point sprites – but I have yet to see vertex processing becoming a limiting factor for particles. Also it has a more limited hardware scope. What we get in return is much more flexibility (you are not even limited to screen-facing particles; you can pass orientation (i.e. a quaternion) instead of a single angle). The amount of data that has to be uploaded by the application per frame is the same.

Now let’s go over platform-specific implementation details.

Direct3D 9

Unfortunately, D3D9 does not have “quad” primitive type, so we’ll have to use a dummy index buffer. The setup is as follows:

  1. For all particle systems, create a single index buffer with 6 indices describing a quad (0, 1, 2, 2, 1, 3), and a single corner stream that will contain corner values. I chose to store UV coordinates in D3DCOLOR, though FLOAT2 is ok too.

  2. Create a proper vertex declaration, that says that corner (UV) data goes in stream 0, and particle data goes in stream 1.

  3. For each particle system, create a dynamic vertex buffer (note: it’s usually better to create 2 buffers and to use buffer 0 on first frame, buffer 1 on second frame, buffer 0 on third frame, etc. – thus making synchronization costs lower and lowering chance for buffer renaming), which will hold particle data.

  4. Every frame, lock your buffer, upload particle data in it as is (i.e. 1 copy of data per particle).

  5. Draw as follows:

    device->SetStreamSource(0, shared_buffer_with_corner_data, 0, sizeof(CornerVertex));
    device->SetStreamSourceFreq(0, D3DSTREAMSOURCE_INDEXEDDATA | particle_count);

    device->SetStreamSource(1, buffer_with_particle_data, 0, sizeof(ParticleData));
    device->SetStreamSourceFreq(1, D3DSTREAMSOURCE_INSTANCEDATA | 1);

    device->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0, 4, 0, 2);
Note, that:
  1. You have to set corner data as stream 0 due to D3D9 restrictions
  2. You pass parameters to DIP as if you want to render a single quad
In theory, this method uses hardware instancing, and hardware instancing is supported only for SM3.0-compliant cards. However, in practice, all SM2.0-capable ATi cards support hardware instancing – it’s just that Direct3D9 does not let you use it. ATi engineers made a hack that lets you enable instancing for their cards – just do this once at application startup:

// Enable instancing
device->SetRenderState(D3DRS_POINTSIZE, MAKEFOURCC('I','N','S','T'));

I love ATi D3D9 hacks, they are ingenious.

XBox 360

You can perform rendering with the proposed method as is – the only difference is that you’ll have to fetch vertex data manually in vertex shader via vfetch command because there is no explicit instancing support. For further reference, look at CustomVFetch sample.


You can perform rendering with the proposed method as is – you’ll have to set frequency divider operation to MODULO with frequency = 4 for corner stream, and to DIVIDE with frequency = 4 for particle data stream.

Direct3D 10

I have not actually implemented this for Direct3D 10, but it should be pretty straightforward – you’ll have to create proper input layout (with D3D10_INPUT_PER_INSTANCE_DATA set for all elements except corner data), create index buffer with 6 indices as for D3D9, and then render via DrawIndexedInstanced(6, particle_count, 0, 0, 0);

Note that for Direct3D 10 you can also render from your particle data stream with D3D10_PRIMITIVE_TOPOLOGY_POINTLIST, and perform quad expansion with geometry shader. This in theory should somewhat speed up vertex processing part, but in practice I have very bad experience with geometry shaders on NVidia cards performance-wise. If you have an ATi R600 or (perhaps) next-generation NVidia card, I’d be happy to hear that things are okay there.
Read more!