Tuesday, September 25, 2007

Robust unit cube clipping for shadow mapping

Shadow mapping is my primary area of interest in computer graphics, so expect more posts on this topic. Today I’d like to tell about robust unit cube clipping regarding different projection matrix building techniques.

The task at hand is relatively simple – given a set of points representing shadow receivers and a set of points representing shadow casters, build a matrix that, while used for shadow rendering, maximizes shadow map utilization (both in terms of shadow map area and depth precision) and at the same time eliminates all shadow clipping artifacts. I have not implemented anything except directional light support properly, so expect another post sooner or later.

Note that all points are assumed to be in world space, and for a lot of algorithms it’s preferred to take vertices of convex hull of receivers, clipped by view frustum, instead of actual receivers’ vertices – it’s not always required, but for the sake of simplicity we will assume that all receivers’ points are inside view frustum. Of course casters’ points are arbitrary.

Uniform Shadow Mapping

Uniform shadow mapping is shadow mapping with a simple orthographic projection, without any perspective reparametrization. As unit cube clipping is an operation that’s usually done *after* constructing some approximate matrix, let’s suppose we already have a view and projection matrix for our light configuration – note that in case of uniform light the only thing we care about is that view matrix has the same direction as the light, and the projection matrix represents some orthographic projection – everything else is irrelevant, so we can take arbitrary view position, more or less arbitrary up vector, construct a view matrix, and assume that projection matrix is actually identity matrix.

Now let’s construct two axis-aligned bounding boxes, one for receiver points, transformed to our light post-perspective space, another one for caster points, again in light PPS. Note that projection is orthographic, therefore there is no post-perspective division employed, and there are no singularities during AABB construction.

Now we have to build a new matrix that transforms our light viewprojection matrix to minimize shadow map wastage and Z precision loss, while preserving shadows. The actual choice of values depends on some states that are set while applying shadow map to scene.

At first, let’s deal with XY extents. Many papers propose choosing receivers’ XY extents, because we don’t care about points of casters that are outside receivers’ extents – i.e. can’t cast shadows on receivers. This provides correct results, but we can do slightly better – we can select intersection of casters’ and receivers’ XY extents:

float min_x = max(receivers.min_x, casters.min_x);
float min_y = max(receivers.min_y, casters.min_y);
float max_x = min(receivers.max_x, casters.max_x);
float max_y = min(receivers.max_y, casters.max_y);

What we get here is that if casters’ extents are “wider” than receivers’, we get the same extents as before. However, if you have a big receiver and relatively small casters on it, this will select smaller extents. Note that extents are always correct – they enclose all points that BOTH some caster and some receiver could project to – i.e. all points there you potentially could need shadow. Everywhere else there is no shadow.

Whether this is beneficial depends on scene type – you’ll usually get no benefit from this approach in real-life scenes if there is a single shadow map for the whole view – but if you have some kind of frustum splitting approach, depending on your scene, this could lead to quality improvement.

There is still a problem left – if you have CLAMP filtering set for your shadow map, this approach will cause visual bugs, because now some of receivers’ points are actually out of shadow map – so if there is a caster that fills border pixels with Z value that produces shadow, the shadow will stretch outwards. Solution? Either use BORDER filtering or when rendering to NxN shadowmap set a viewport with X = 1, Y = 1, width = N-2, height = N-2 – this will make sure that border pixels are not affected with casters. This is a small price to pay for potentially tighter frustum.

Now we have tight XY extents, and we’ll have to solve problems with Z.

Let’s suppose we’ve chosen ZN and ZF as our Z extents. This would mean that:

  1. All casters are clipped by ZN and ZF

  2. Shadow map contain depth values in range [0..1] in light PPS

  3. All receivers points that have Z < ZN will have Z < 0 in light PPS

  4. All receivers points that have Z > ZF will have Z > 1 in light PPS

At first, we can’t let our casters be clipped by near plane – this would produce artifacts – so the resulting ZN value has to be less or equal to casters’ minimal Z value. If there are no receivers with Z < casters.min_z, there is no sense to push ZN further (to decrease ZN, that is). If there ARE receivers left with Z < casters.min_z, then there should not be any shadows there. Let’s look at our shadow test.

float is_in_shadow = shadowmap_depth < pixel_depth ? 1 : 0;

For receivers’ points with Z < ZN, pixel_depth is below 0, and the test always returns false, so is_in_shadow = 0 – this is actually what we want. So there is no need to make ZN less than casters.min_z, thus ZN = casters.min_z.

For receivers’ points with Z > ZF, the situation is opposite – pixel_depth is above 1, and the test always returns true, thus making all receivers points with Z > ZF shadowed. This means that there should be no receivers’ points with Z > ZF, so ZF is greater or equal than receivers.max_z.

Is there any reason to push ZF beyond that? No. No, because we don’t care about casters points that have Z > receivers.max_z – they are not going to cast shadows on any receiver anyway. Thus ZF = receivers.max_z.

Now that we have our XY and Z extents, we construct a scaling/biasing matrix that maps [min_x..max_x] x [min_y..max_y] x [ZN .. ZF] to [-1..1] x [-1..1] x [0..1] (or to [-1..1] x [-1..1] x [-1..1] in case you’re using OpenGL), and multiply your light viewprojection matrix by this matrix. By the way, the scaling/biasing matrix is of course equal to the corresponding orthographic projection matrix, so you can use existing functions like D3DXMatrixOrthoOffCenterLH to compute it.

Now that we’ve solved the problem for uniform shadow mapping, let’s move on to various perspective reparametrization algorithms.

Trapezoidal Shadow Mapping

The brief outline of TSM algorithm is as follows – construct light viewprojection matrix, transform receivers’ points to light PPS, approximate them by a trapezoid, construct a matrix that maps trapezoid to unit cube, multiply light viewprojection matrix by the trapezoid mapping matrix.

Thus applying unit cube clipping to TSM is very simple – you first construct a tight frustum for uniform shadow mapping (see above), and then use it for TSM, with no further corrections. This produces correct extents in the resulting matrix.

As we selected XY extents as intersection of casters’ and receivers’ extents, the slightly more correct approach would be to use not the receivers’ points for trapezoidal approximation, but rather points of the receivers’ volume, clipped by planes that correspond to the chosen XY extents. However, my experiments resulted in no significant quality improvement – texel distribution was good enough without this step.

Also note, that as TSM produces a frustum with relatively high FOV, the distortion of post-perspective W coordinate can affect Z coordinate badly. Some solutions for these problems are already presented in the original TSM paper (though expect another post about various methods for fixing Z errors).

Light-space Perspective Shadow Mapping

The brief outline of LiSPSM algorithm is as follows – construct light space with certain restrictions, transform all “interesting” points in the light space, build a perspective frustum that encloses all points, transform it back from light space to world space.

The problem is that if you treat only receivers’ points as “interesting”, you get shadow clipping due to Z extents; if you treat both receivers’ and casters’ points as “interesting”, the shadows are correct, but you get worse texel distribution. Also you can’t really fix Z extents AFTER you’ve computed the frustum – because perspective projection has singularities for points on plane with Z == 0, and occasionally some caster point gets near this plane and you get very high post-perspective Z extents, which screw the Z precision.

In short, I don’t know a good solution. If the light faces the viewer, we can use the approach for normal positional light for PSM (read below). Otherwise, it does not seem we can do anything. Currently I focus my frustum on both casters’ and receivers’ points. If you know a solution, I’d be more than happy to hear it.

Perspective Shadow Mapping

Note that I am not going to describe unit cube clipping for the PSM from the original paper by Stamminger and Drettakis, because there is a much better PSM algorithm, described in GPU Gems 1 by Simon Kozlov (Chapter 14, “Perspective Shadow Maps: Care and Feeding”).

The brief outline of the algorithm is as follows – construct virtual camera which is essentially a real camera slid back a bit to improve quality (to improve zf/zn ratio), transform light to virtual camera PPS. If it’s a directional light in PPS (which means that light’s direction was orthogonal to view direction), construct uniform shadow mapping matrix for the light direction in PPS (of course you should transform all casters and receivers to PPS). The resulting matrix should first transform incoming points to virtual camera PPS, and then transform them by uniform shadow mapping matrix.

If light is a positional one in PPS (which, of course, happens most often), then at first compute a bounding cone with center in light PPS around all receivers’ points (transformed to PPS again, of course). Then we have two cases – either light in PPS becomes an inverted one – that happens if light is shining from behind the viewer, i.e. Z coordinate of light direction in view space is positive – or light is a normal one. For further reference I suggest you read Stamminger and Drettakis paper (STAMMINGER M., DRETTAKIS G.: Perspective shadow maps. In Proceedings of SIGGRAPH(2002), pp. 557.562.).

If light is a normal one, then all casters that can possibly cast shadows on visible receivers’ regions are in front of virtual camera’s near plane – so we can construct a normal shadow mapping matrix from bounding cone parameters. If light is an inverted one, then usual shadow mapping matrix will not contain all casters; therefore Kozlov proposes to use an “inverse projection matrix” which is constructed by specifying –z_near as near plane value and z_near as far plane value. Again, for further reference I suggest you read his paper.

Now, we want to perform unit cube clipping, and there are actually three cases we have to resolve (directional light in PPS, positional light in PPS, inverted positional light in PPS). Why do we have problems? Well, at first, while all receivers points are inside original view frustum (and thus they are inside virtual view frustum), because we clipped receivers’ volumes, caster points are arbitrary, and so there are singularities for caster points with view space Z close to 0.

In case of normal positional light in PPS (directional light that shines in our face), we don’t care about casters that are beyond near plane; so we can clip our casters by near plane, and all caster points will be well-defined in post-projective space, which means that after we’ve computed the projection in PPS from bounding cone, we can find receivers’ and casters’ extents and use the same algorithm we had for uniform shadow mapping – just multiply the light matrix in PPS by unit cube clipping matrix (this will extend light frustum in PPS to hold all needed caster points).

Theoretically, you’ll need precise geometrical clipping of caster volumes by near plane. However, in practice, for my test scenes the simple approach of computing extents only for points in front of near plane worked well. Your mileage may vary.

In case of inverted positional light in PPS, we can no longer clip by near plane, because we’ll clip away potential casters. What we’d like to do instead is to take Z extents as in normal unit cube clipping for all caster points, and modify the shadow mapping matrix as usual. Why can’t we do that (and, as a matter of fact, why could not we do that for normal positional light in PPS, why do we need camera clipping)?

Well, the problem is, that if there are caster points with view-space Z near 0, the PPS Z extents of casters become very huge. As we use casters’ minimal Z value as our Z near value, this can lead to huge extents of unit cube clipping matrix, which makes Z precision very bad. For positional light in PPS, we avoid this by clipping casters by near plane, so that we no longer have casters with very big coordinates in PPS.

Luckily, with inverse projection matrix we can easily solve that – just clip casters’ minimal Z value with some small negative value, i.e. -10. (Note: here and below, “casters’ minimal Z value” refers to minimal Z value of casters transformed first to virtual camera PPS, and then by shadow mapping matrix).

Why does it work? Well, with normal perspective matrix, if we decrease Z near value’s magnitude (i.e. decrease the actual Z near value, as it’s positive for normal matrix), we enlarge our viewing frustum. It turns out that for inverse projection matrix, decreasing Z near value’s magnitude again enlarges viewing frustum – and clipping Z near by some negative value decreases its magnitude, while increasing the actual value.

Finally, if the light is a directional one in PPS, we can clip casters by camera’s near plane too, as we did in case of normal positional light.


  1. If light is a directional one in PPS, construct uniform shadow mapping matrix as in first part of this post, only using casters clipped by camera near plane (either by geometrical clipping or just by throwing away caster points which are beyond near plane).

  2. If light is a positional one in PPS, construct bounding cone with center in light’s position that holds all receiver points. Then, for normal (non-inverted) light, construct shadow mapping matrix (view matrix: simple view matrix with eye position = light’s position, view direction = bounding cone direction, and arbitrary up vector; projection matrix: matrix with FOV wide enough to hold the whole bounding cone, and Z extents that encompass all receiver points), and perform unit cube clipping, only using casters clipped by camera near plane.
    For inverted light, construct an inverted projection matrix instead of a normal one, and instead of clipping casters clip casters’ minimal Z value.

Thanks to Simon Kozlov for suggesting solution for shadow clipping problems, basically all the text in this section consists of his ideas.

eXtended Perspective Shadow Mapping

I am not going to describe algorithm and clipping problems in details, as they are well summarized in the original paper. The only mention I will make is that at the step 11 of the algorithm, all points with w < epsilonW are clipped. This sometimes causes clipping of shadows. The solution is to modify the extents building procedure as follows: instead of throwing away points with w < epsilonW completely, just don’t compute Z bounds for them. This is a hack, and it works only because we’re doing 2D rectangle intersection while doing unit cube clipping.

The author of the algorithm knows about the problem. We discussed several approaches that lead to fixing it, and this was the best one we could come up with for now.
Read more!

Saturday, September 22, 2007

Particle rendering revisited

Recently I was doing particle rendering for different platforms (D3D9, PS3, XBox360), and I wanted to share my experience. The method I came with (which is more or less the same for all 3 platforms) is nothing new or complex - in fact, I know people were and are doing it already - but nevertheless I've never seen it described anywhere, so it might help somebody.

The prerequisites are that we want a pretty flexible simulation (such as, all parameters are controlled by splines, and splines come from some particle editor – perhaps also there is some collision detection, or instead particles are driven by some complex physics only) – which means that (a) we don’t have a simple function position(time) (and likewise for other particle parameters), and (b) we don’t want to implement a fully GPU-based solution, with rendering to vertex buffer/streamout. After all, next-gen GPUs are not *that* powerful, we don’t have that many particles, and we often do not use all available cores (in case of PS3/360 at least) efficiently.

Also let’s suppose for the sake of simplicity that our particles are actually billboards that can only rotate around view Z axis – i.e. they are always camera-facing. This does not really matter so much, but it will make things easier.

What we’d like to do ideally is to upload particles to a buffer, and have GPU render from it. To keep the amount of data low, we’d like to copy exactly one instance of each particle, without duplication. The classical (trivial) approach is to fill VB with particle data, 4 vertices per each particle, while doing all computations on CPU – that is, vertex shader only transforms particle in clip space. This is of course not very wise (after all, we’re trying to save some CPU clocks here), so another classical (slightly less trivial) approach is to fill VB with particle data, 4 vertices per each particle, where those 4 vertices differ only in their UV coordinates. UVs act like corner identifications – you know UV coordinate in vertex shader, and you know the corner of the particle you’re processing ((0, 0) = upper left corner, etc.). Thus you can easily perform actual coordinate position calculation in vertex shader like this:

float3 corner_position = particle_position + camera_axis_x * (uv.x – 0.5) * size + camera_axis_y * (uv.y – 0.5) * size;

Also we have point sprites that *seem* to achieve the same thing we need – you upload exactly 1 vertex per each particle. However, they have lots of disadvantages – point size is limited, you can’t rotate them, etc.

The method I am talking about goes slightly further. Let’s divide our particle data in two parts, actual particle data (position, color, angle of rotation, etc.) and UV coordinates. Now we notice, that what we really want is to have two streams of data, one stream contains particle data without duplication, the other stream contains ONLY UV coordinates – moreover, this buffer consists of the same data, repeated many times – you have 4 vertices (0, 0), (1, 0), (1, 1), (0, 1) for the first particle, 4 vertices (0, 0), (1, 0), (1, 1), (0, 1) for the second one, etc. – so we’d like to be able to specify it once and have GPU “loop” over them.

In effect, we want something like this:

Fortunately, there is a solution that can solve it – it’s hardware instancing. Unfortunately, it’s not available everywhere – you (usually) need SM3.0 support for it. We’re going to accept this disadvantage however.

Thus we have a static stream with 4 “vertices” representing corner data (each “vertex” consists of a single float2), and a dynamic stream with N “instances” representing particles (each “instance” consists of, in our example, a position, color and angle). We render N quads, so the vertex shader gets executed 4*N times – every time we have 1 piece of instance data, and 1 piece of corner data. We compute actual particle corner position as shown above, and output it.

Note that it looks like point sprites. It has a disadvantage in that we have 4 vertex shader runs per each particle, instead of 1 with point sprites – but I have yet to see vertex processing becoming a limiting factor for particles. Also it has a more limited hardware scope. What we get in return is much more flexibility (you are not even limited to screen-facing particles; you can pass orientation (i.e. a quaternion) instead of a single angle). The amount of data that has to be uploaded by the application per frame is the same.

Now let’s go over platform-specific implementation details.

Direct3D 9

Unfortunately, D3D9 does not have “quad” primitive type, so we’ll have to use a dummy index buffer. The setup is as follows:

  1. For all particle systems, create a single index buffer with 6 indices describing a quad (0, 1, 2, 2, 1, 3), and a single corner stream that will contain corner values. I chose to store UV coordinates in D3DCOLOR, though FLOAT2 is ok too.

  2. Create a proper vertex declaration, that says that corner (UV) data goes in stream 0, and particle data goes in stream 1.

  3. For each particle system, create a dynamic vertex buffer (note: it’s usually better to create 2 buffers and to use buffer 0 on first frame, buffer 1 on second frame, buffer 0 on third frame, etc. – thus making synchronization costs lower and lowering chance for buffer renaming), which will hold particle data.

  4. Every frame, lock your buffer, upload particle data in it as is (i.e. 1 copy of data per particle).

  5. Draw as follows:

    device->SetStreamSource(0, shared_buffer_with_corner_data, 0, sizeof(CornerVertex));
    device->SetStreamSourceFreq(0, D3DSTREAMSOURCE_INDEXEDDATA | particle_count);

    device->SetStreamSource(1, buffer_with_particle_data, 0, sizeof(ParticleData));
    device->SetStreamSourceFreq(1, D3DSTREAMSOURCE_INSTANCEDATA | 1);

    device->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0, 4, 0, 2);
Note, that:
  1. You have to set corner data as stream 0 due to D3D9 restrictions
  2. You pass parameters to DIP as if you want to render a single quad
In theory, this method uses hardware instancing, and hardware instancing is supported only for SM3.0-compliant cards. However, in practice, all SM2.0-capable ATi cards support hardware instancing – it’s just that Direct3D9 does not let you use it. ATi engineers made a hack that lets you enable instancing for their cards – just do this once at application startup:

// Enable instancing
device->SetRenderState(D3DRS_POINTSIZE, MAKEFOURCC('I','N','S','T'));

I love ATi D3D9 hacks, they are ingenious.

XBox 360

You can perform rendering with the proposed method as is – the only difference is that you’ll have to fetch vertex data manually in vertex shader via vfetch command because there is no explicit instancing support. For further reference, look at CustomVFetch sample.


You can perform rendering with the proposed method as is – you’ll have to set frequency divider operation to MODULO with frequency = 4 for corner stream, and to DIVIDE with frequency = 4 for particle data stream.

Direct3D 10

I have not actually implemented this for Direct3D 10, but it should be pretty straightforward – you’ll have to create proper input layout (with D3D10_INPUT_PER_INSTANCE_DATA set for all elements except corner data), create index buffer with 6 indices as for D3D9, and then render via DrawIndexedInstanced(6, particle_count, 0, 0, 0);

Note that for Direct3D 10 you can also render from your particle data stream with D3D10_PRIMITIVE_TOPOLOGY_POINTLIST, and perform quad expansion with geometry shader. This in theory should somewhat speed up vertex processing part, but in practice I have very bad experience with geometry shaders on NVidia cards performance-wise. If you have an ATi R600 or (perhaps) next-generation NVidia card, I’d be happy to hear that things are okay there.
Read more!