## Sunday, March 15, 2009

### View frustum culling optimization – Representation matters

Before getting into professional game development I've spent a fair amount of time doing it for fun (in fact, I still do it now, although less intensively). The knowledge came from a variety of sources, but the only way that I knew and used to calculate frustum planes equations was as follows – get the equations in clip space (they're really simple – (1, 0, 0, 1), (0, -1, 0, 1), etc.) and then get world space ones by transforming the planes with inverse transpose of view projection camera matrix [correction: in fact, you need to transform with inverse transpose of inverse view projection matrix, which equals to just transpose of view projection matrix]. It's very simple and intuitive – if you know a simple way to express what you need in some space, and a simple way to transform things from that space to your target one, you're good to go.

Imagine my surprise when I started doing game development as a day job and after some time accidentally stumbled upon a piece of our codebase that calculated frustum planes in a completely different way. Given a usual perspective camera setup, it calculated frustum points via some trigonometry (utilizing knowledge about vertical/horizontal FOV angles, near/far distances and the fact that it's a perspective camera without any unusual alterations), and then used them to obtain the equations. I thought it to be very weird – after all, it's more complex and is constrained to specific camera representation, whereas clip space method works for any camera that can be set up for rendering (orthographic projection, oblique-clipping, etc.).

But as it turns out, the same thing can be said about our culling code. It's quite good at culling given box against an arbitrary set of planes (i.e. if you use it for portal/anti-portal culling with arbitrary shape of portals/occluders), but since we have a usual frustum, maybe we can improve it by going to clip space, entirely skipping world space? Let's try it.

We're going to transform AABB points to clip space, and then test them against frustum planes in clip space. Note that we can't divide by w after transforming – that will lead to culling bugs because post-projective space exhibits a discontinuity at the plane with equation z = 0 in view space; however, this is not needed – the frustum plane equations in clip space are as follows:

x >= -w, x <= w: left/right planes
y >= -w, y <= w: top/bottom planes
z >= 0, z <= w: near/far planes

Note that if you're using OpenGL clip space convention, the near plane equation is z >= -w; this is a minor change to the culling procedure.

First, to transform points to clip space, we're going to need a world view projection matrix – I hope the code does not require any additional explanations:

static inline void transform_matrix(struct matrix_t* dest, const struct matrix_t* lhs, const struct matrix_t* rhs)
{
#define COMP_0(c) \
qword res_ ## c = si_fm((qword)lhs->row2, SPLAT((qword)rhs->row ## c, 2)); \
res_ ## c = si_fma((qword)lhs->row1, SPLAT((qword)rhs->row ## c, 1), res_ ## c); \
res_ ## c = si_fma((qword)lhs->row0, SPLAT((qword)rhs->row ## c, 0), res_ ## c); \
dest->row ## c = (vec_float4)res_ ## c;

#define COMP_1(c) \
qword res_ ## c = si_fma((qword)lhs->row2, SPLAT((qword)rhs->row ## c, 2), (qword)lhs->row3); \
res_ ## c = si_fma((qword)lhs->row1, SPLAT((qword)rhs->row ## c, 1), res_ ## c); \
res_ ## c = si_fma((qword)lhs->row0, SPLAT((qword)rhs->row ## c, 0), res_ ## c); \
dest->row ## c = (vec_float4)res_ ## c;

COMP_0(0);
COMP_0(1);
COMP_0(2);
COMP_1(3);

#undef COMP_0
#undef COMP_1
}

After that we'll transform the points to clip space, yielding 2 groups with 4 vectors (x, y, z, w) in each one; the code is almost the same as in the previous post, only we now have 4 components:

static inline void transform_points_4(qword* dest, qword x, qword y, qword z, const struct matrix_t* mat)
{
#define COMP(c) \
qword res_ ## c = SPLAT((qword)mat->row3, c); \
res_ ## c = si_fma(z, SPLAT((qword)mat->row2, c), res_ ## c); \
res_ ## c = si_fma(y, SPLAT((qword)mat->row1, c), res_ ## c); \
res_ ## c = si_fma(x, SPLAT((qword)mat->row0, c), res_ ## c); \
dest[c] = res_ ## c;

COMP(0);
COMP(1);
COMP(2);
COMP(3);

#undef COMP
}

// transform points to clip space
qword points_cs_0;
qword points_cs_1;

transform_points_4(points_cs_0, minmax_x, minmax_y, minmax_z_0, &clip);
transform_points_4(points_cs_1, minmax_x, minmax_y, minmax_z_1, &clip);

If all 8 clip-space points are outside left plane, i.e. if for all 8 points p.x <= -p.w, then the box is completely outside. Since we're going to use SoA layout, such tests are very easy to perform. We'll need a vector which contains -w for 4 points; SPU do not have a special negation instruction, but you can easily emulate it either by subtracting from zero or by xoring with 0x80000000. Theoretically xor is better (it has 2 cycles of latency, subtract has 6), but in our case there is no difference in speed; I'll use xor nonetheless:

// calculate -w
qword points_cs_0_negw = si_xor(points_cs_0, (qword)(vec_uint4)(0x80000000));
qword points_cs_1_negw = si_xor(points_cs_1, (qword)(vec_uint4)(0x80000000));

Now we'll calculate “not outside” flags for each plane; the method is exactly the same as in previous post (as is the final result computation), only now we're not doing dot products.

// for each plane...
#define NOUT(a, b, c, d) si_orx(si_or(si_fcgt(a, b), si_fcgt(c, d)))

qword nout0 = NOUT(points_cs_0, points_cs_0_negw, points_cs_1, points_cs_1_negw);
qword nout1 = NOUT(points_cs_0, points_cs_0, points_cs_1, points_cs_1);
qword nout2 = NOUT(points_cs_0, points_cs_0_negw, points_cs_1, points_cs_1_negw);
qword nout3 = NOUT(points_cs_0, points_cs_0, points_cs_1, points_cs_1);
qword nout4 = NOUT(points_cs_0, (qword)(0), points_cs_1, (qword)(0));
qword nout5 = NOUT(points_cs_0, points_cs_0, points_cs_1, points_cs_1);

#undef NOUT

This is the final version. It runs at 104 cycles per test, so it's slightly faster than the last version. But this method is better for another reason – we've calculated clip space positions of box vertices as a by-product (also we've calculated world view projection matrix, but it's likely to be of little further use, because usually shader constant setup happens at a later point in another module). Some things you can do with them:

Feed them as an input to the rasterizer to do rasterization-based occlusion culling (simple depth buffer, HOM, etc.). This is the road I have not taken yet, though I hope I will do it some day.
Use them for screen size culling – if your bounding box when projected to the screen is small enough (i.e. less than 3x3 pixels), you usually can safely throw it away. This is what I do in our production code; it involves dividing positions by w (don't forget to discard the size culling results if any point has w < epsilon!), computing min/max x/y for the results, subtracting min from max and checking if the difference along each axis is less than threshold. The actual implementation is left as an exercise to the reader.

This concludes the computation part of view-frustum culling. There is still something to do here – there is p/n-vertex approach which I did not implement (but I'm certain that it won't be a win over my current methods on SPUs); there are minor potential improvements to the current code that are not worth the trouble for me; all implemented tests return only binary result (outside / not outside), a ternary version can help in hierarchical culling (though this can be achieved with a minor modification to all presented code). There might be something else I can't think of now – post a comment if you'd like to hear about other VFC-related topics!

I'm going to write one final post regarding VFC, which deals with the code that will use is_visible to perform culling of the given batch array – the topics include DMA and double buffering; after that the whole VFC series will be over and I'm going to switch to something different.

The complete source for this post can be grabbed here.

1. Great series of posts - please keep them coming!

I am quite enjoying them.

2. You can extract the planes directly from the projection matrix, there is no need for trig and there is no need to transform the clip space planes by the inverse projection.

3. As for trig, I wrote that in the post.
As for "no need to transform", no, you are wrong. Adding matrix columns is just another way to do transformation of (1, 0, 0, 1)-like vectors.
Although there is a typo in the post - you need inversed transposed inverse view projection matrix (which is of course equal to just transposed). Thanks for leading me to it.

4. I use the p/n-vertex approach on the wii and perform the test in ~145 cycles (with paired-single instructions)

This is great if you use occluding planes as well since you can use the almost exact algorithm.

For an AABB, getting the p/n vertex is 2 fsel instructions (not counting the load instructions) on the wii, maybe 1 on spu.

5. Since you're likely to have many many objects with bounding boxes in world space, and only 6 frustum planes in projection space, wouldn't it be faster in the long run to just bring the planes to world space and to the test there?

6. Well since most of the objects are going to need their world view projection matrices calculated anyway, then there's little loss, otherwise you have to solve all the planes for the frustum in world space for each object. I think a matrix calculation is probably cheaper than a world space frustum cull test.