Part 4: Using generic matrix inverse when you don’t need to

This is a small tangent I’m going to go on: It’s about the amount of times I have seen a game engine using a generic matrix inverse function where it can be inlined to be way WAY faster!

Here is a generic matrix inverse using Cramer’s Rule:

_UNKNOWN **__fastcall sub_65C3E20(__m128 *a1) { v2 = *a1; v3 = a1[1]; v4 = _mm_shuffle_ps(v2, v2, 0xFF).m128_f32[0]; v5 = _mm_shuffle_ps(v3, v3, 0xFF).m128_f32[0]; v40 = *a1; v42 = v3.m128_f32[0]; v6 = a1[2]; v7 = a1[3]; v8 = _mm_shuffle_ps(v7, v7, 0xFF).m128_f32[0]; v9 = _mm_shuffle_ps(v7, v7, 0xAA).m128_f32[0]; v10 = _mm_shuffle_ps(v6, v6, 0xFF).m128_f32[0]; v11 = _mm_shuffle_ps(v2, v2, 0xAA).m128_f32[0]; v38 = v7; v7.m128_f32[0] = _mm_shuffle_ps(v3, v3, 0xAA).m128_f32[0]; v39 = _mm_shuffle_ps(v6, v6, 0xAA).m128_f32[0]; v12 = v7.m128_f32[0] * v8; v13 = v39 * v8; v14 = v5 * v9; v37 = v6.m128_f32[0]; v41 = _mm_shuffle_ps(v3, v3, 0x55).m128_f32[0]; v46 = v8; v15 = v11 * v8; v44 = v11; v16 = v4 * v9; v48 = v9; v17 = v10 * v9; v18 = v11 * v5; v49 = v7.m128_f32[0]; v7.m128_f32[0] = v7.m128_f32[0] * v10; v43 = v10; v19 = v11 * v10; v20 = v4 * v39; v47 = v4; v21 = v4 * v49; v31 = v12; v45 = _mm_shuffle_ps(v6, v6, 0x55).m128_f32[0]; v50 = _mm_shuffle_ps(v38, v38, 0x55).m128_f32[0]; v30 = (float)((float)((float)(v45 * v14) + (float)(v41 * v13)) + (float)(v50 * v7.m128_f32[0])) - (float)((float)((float)(v45 * v12) + (float)(v41 * v17)) + (float)(v50 * (float)(v5 * v39))); v32 = (float)((float)((float)(v45 * v15) + (float)(v40.m128_f32[1] * v17)) + (float)(v50 * v20)) - (float)((float)((float)(v45 * v16) + (float)(v40.m128_f32[1] * v13)) + (float)(v50 * v19)); v33 = (float)((float)((float)(v41 * v16) + (float)(v40.m128_f32[1] * v12)) + (float)(v50 * v18)) - (float)((float)((float)(v41 * v15) + (float)(v40.m128_f32[1] * v14)) + (float)(v50 * v21)); v3.m128_f32[0] = (float)((float)((float)(v41 * v19) + (float)(v40.m128_f32[1] * (float)(v5 * v39))) + (float)(v45 * v21)) - (float)((float)((float)(v41 * v20) + (float)(v40.m128_f32[1] * v7.m128_f32[0])) + (float)(v45 * v18)); v22 = (float)((float)((float)(v6.m128_f32[0] * v12) + (float)(v42 * v17)) + (float)(v38.m128_f32[0] * (float)(v5 * v39))) - (float)((float)((float)(v6.m128_f32[0] * v14) + (float)(v42 * v13)) + (float)(v38.m128_f32[0] * v7.m128_f32[0])); v23 = (float)((float)((float)(v6.m128_f32[0] * v16) + (float)(v40.m128_f32[0] * v13)) + (float)(v38.m128_f32[0] * v19)) - (float)((float)((float)(v6.m128_f32[0] * v15) + (float)(v40.m128_f32[0] * v17)) + (float)(v38.m128_f32[0] * v20)); v24 = (float)((float)((float)(v42 * v15) + (float)(v40.m128_f32[0] * v14)) + (float)(v38.m128_f32[0] * v21)) - (float)((float)((float)(v42 * v16) + (float)(v40.m128_f32[0] * v31)) + (float)(v38.m128_f32[0] * v18)); v6.m128_f32[0] = v42 * v19; v25 = v38.m128_f32[0] * COERCE_FLOAT(HIDWORD(a1->m128_u64[0])); v26 = (float)((float)((float)(v42 * v20) + (float)(v40.m128_f32[0] * v7.m128_f32[0])) + (float)(v37 * v18)) - (float)((float)(v6.m128_f32[0] + (float)(v40.m128_f32[0] * (float)(v5 * v39))) + (float)(v37 * v21)); v27 = COERCE_FLOAT(*a1) * v50; v29 = COERCE_FLOAT(*a1) * v45; v34 = v37 * COERCE_FLOAT(HIDWORD(a1->m128_u64[0])); v35 = COERCE_FLOAT(*a1) * v41; v36 = v42 * COERCE_FLOAT(HIDWORD(a1->m128_u64[0])); v28 = 1.0 / (float)((float)((float)((float)(v30 * COERCE_FLOAT(*a1)) + (float)(v32 * v42)) + (float)(v33 * v37)) + (float)(v3.m128_f32[0] * v38.m128_f32[0])); a1->m128_f32[0] = v30 * v28; a1[1].m128_f32[1] = v28 * v23; a1[1].m128_f32[0] = v28 * v22; a1[1].m128_f32[3] = v28 * v26; a1[1].m128_f32[2] = v28 * v24; a1->m128_f32[3] = v3.m128_f32[0] * v28; a1->m128_f32[1] = v32 * v28; a1->m128_f32[2] = v33 * v28; a1[2].m128_f32[0] = (float)((float)((float)((float)((float)(v38.m128_f32[0] * v41) * v43) + (float)(v5 * (float)(v37 * v50))) + (float)((float)(v42 * v45) * v46)) - (float)((float)((float)(v5 * (float)(v38.m128_f32[0] * v45)) + (float)((float)(v42 * v50) * v43)) + (float)((float)(v37 * v41) * v46))) * v28; a1[2].m128_f32[1] = (float)((float)((float)((float)(v47 * (float)(v38.m128_f32[0] * v45)) + (float)(v27 * v43)) + (float)(v34 * v46)) - (float)((float)((float)(v25 * v43) + (float)(v47 * (float)(v37 * v50))) + (float)(v29 * v46))) * v28; a1[2].m128_f32[2] = (float)((float)((float)((float)(v25 * v5) + (float)(v47 * (float)(v42 * v50))) + (float)(v35 * v46)) - (float)((float)((float)(v47 * (float)(v38.m128_f32[0] * v41)) + (float)(v27 * v5)) + (float)(v36 * v46))) * v28; a1[2].m128_f32[3] = (float)((float)((float)((float)(v47 * (float)(v37 * v41)) + (float)(v29 * v5)) + (float)(v36 * v43)) - (float)((float)((float)(v34 * v5) + (float)(v47 * (float)(v42 * v45))) + (float)(v35 * v43))) * v28; a1[3].m128_f32[0] = (float)((float)((float)((float)((float)(v37 * v41) * v48) + (float)((float)(v42 * v50) * v39)) + (float)(v49 * (float)(v38.m128_f32[0] * v45))) - (float)((float)((float)((float)(v42 * v45) * v48) + (float)(v49 * (float)(v37 * v50))) + (float)((float)(v38.m128_f32[0] * v41) * v39))) * v28; a1[3].m128_f32[1] = (float)((float)((float)((float)(v29 * v48) + (float)(v44 * (float)(v37 * v50))) + (float)(v25 * v39)) - (float)((float)((float)(v34 * v48) + (float)(v27 * v39)) + (float)(v44 * (float)(v38.m128_f32[0] * v45)))) * v28; a1[3].m128_f32[2] = (float)((float)((float)((float)(v36 * v48) + (float)(v27 * v49)) + (float)(v44 * (float)(v38.m128_f32[0] * v41))) - (float)((float)((float)(v35 * v48) + (float)(v44 * (float)(v42 * v50))) + (float)(v25 * v49))) * v28; a1[3].m128_f32[3] = (float)((float)((float)((float)(v44 * (float)(v42 * v45)) + (float)(v35 * v39)) + (float)(v34 * v49)) - (float)((float)((float)(v29 * v49) + (float)(v36 * v39)) + (float)(v44 * (float)(v37 * v41)))) * v28; return &retaddr; }

For matrices like the Projection Matrix, a large number of values are simply 0. If it’s a well-known matrix, it can be inlined without using generic inversions. It would be a huge drop in CPU cycles!

In the dunia engine they have inlined the inverse Projection Matrix calculation by saving important variables while constructing the Projection Matrix and rearranging it.

The instruction count for this generic Cramer’s Rule inverse is roughly 470, not to mention instruction count alone isn’t enough to showcase the inefficiency

1. SIMD to Scalar

The matrix rows are loaded into 128-bit wide registers, but get immediately unpacked to do scalar calculations 32 bits at a time.

2. Register Spilling

float v29; // [rsp+4h] [rbp-1C4h] float v30; // [rsp+8h] [rbp-1C0h] float v31; // [rsp+Ch] [rbp-1BCh] float v32; // [rsp+10h] [rbp-1B8h] float v33; // [rsp+14h] [rbp-1B4h] float v34; // [rsp+1Ch] [rbp-1ACh] float v35; // [rsp+20h] [rbp-1A8h] float v36; // [rsp+24h] [rbp-1A4h] etc...

16 XMM registers are available for floating-point math. This function defines over 50 individual float variables.

The variables written with annotations like [rsp+Ch] indicates that the CPU ran out of hardware registers and was forced to “spill” intermediate calculations to the stack.

3. Dependency Chains

look at:

v28 = 1.0 / (float)((float)((float)((float)(v30 * COERCE_FLOAT(*a1)) + (float)(v32 * v42)) + (float)(v33 * v37)) + (float)(v3.m128_f32[0] * v38.m128_f32[0]));

which represents 1.0/determinant. Its calculation requires v30, v32, v33, and others to be completely finished.

Subsequently, every single element written back to the final matrix depends on v28.

a1[2].m128_f32[1] = (float)((float)((float)((float)(v47 * (float)(v38.m128_f32[0] * v45)) + (float)(v27 * v43)) + (float)(v34 * v46)) - (float)((float)((float)(v25 * v43) + (float)(v47 * (float)(v37 * v50))) + (float)(v29 * v46))) * v28; a1[2].m128_f32[2] = (float)((float)((float)((float)(v25 * v5) + (float)(v47 * (float)(v42 * v50))) + (float)(v35 * v46)) - (float)((float)((float)(v47 * (float)(v38.m128_f32[0] * v41)) + (float)(v27 * v5)) + (float)(v36 * v46))) * v28; etc...

I would be very surprised if it isn’t stalled at least somewhere in the pipeline

The dunia engine inlined this to about 15-20 instructions by saving required values into registers or the stack and constructing it at the end of projection matrix construction. More info on my previous write-up! Part 6: Reversing Construction of the Inverse Projection Matrix

This also ties into the Camera Matrix and honestly various other matrices!

Reminder!
Camera Matrix^-1 = View Matrix

Let’s take the view matrix for example and how to inline it from my previous blog Reversing The ViewProjection Matrix (Part 4.2: Reversing SIMD Instructions for Matrix Math - Fast inverse for orthonormal Matrices)

Fast inverse for orthonormal Matrices

If R is a pure rotation matrix meaning:

No scaling,
No shear,
It’s orthonormal (columns are perpendicular and unit-length)

then \(R^{-1} = R^T\)

Suppose a 4x4 matrix with homogenous coordinates:

\[C_{world} = \begin{bmatrix} R_{00} & R_{01} & R_{02} & 0 \\ R_{10} & R_{11} & R_{12} & 0 \\ R_{20} & R_{21} & R_{22} & 0 \\ T_x & T_y & T_z & 1.0 \end{bmatrix}\]

Here:

R (upper 3×3) is the orientation of the camera in world space.
T (bottom row, first 3 values) is the position of the camera in world space.

To get \(C_{world}^{-1}\) we can separate the matrix like so:

\[C_{world} = \begin{bmatrix} R & 0 \\ T & 1 \end{bmatrix}\]

and we want its inverse.

The block matrix inverse formula for this special form is:

\[\begin{bmatrix} A & 0 \\ B & 1 \end{bmatrix}^{-1} = \begin{bmatrix} A^{-1} & 0 \\ -BA^{-1} & 1 \end{bmatrix}\]

(See Wikipedia: Blockwise inversion for the general derivation)

Applying The Formula we get:

A = R
B = T

So:

\[C_{world}^{-1} = \begin{bmatrix} R^{-1} & 0 \\ -TR^{-1} & 1 \end{bmatrix}\]

Since R is orthonormal (\(R^{-1} = R^T\)):

\[C_{world}^{-1} = \begin{bmatrix} R^T & 0 \\ -TR^T & 1 \end{bmatrix}\]

Exponent “T” represents the Transpose and Regular “T” represents the Translation

Now Expand \(−TR^T\) into its dot products:

if:

\[R = \begin{bmatrix} R_{0x} & R_{0y} & R_{0z} \\ R_{1x} & R_{1y} & R_{1z} \\ R_{2x} & R_{2y} & R_{2z} \\ \end{bmatrix}\]

and \(T = [T_x, T_y, T_z],\)

So:

\[-TR^T= \begin{bmatrix} -T_x & -T_y & -T_z \\ \end{bmatrix} \times \begin{bmatrix} R_{0x} & R_{1x} & R_{2x} \\ R_{0y} & R_{1y} & R_{2y} \\ R_{0z} & R_{1z} & R_{2z} \\ \end{bmatrix}\]

then:

\[-TR^T = [-dot(T,R_0), -dot(T,R_1), -dot(T,R_2)]\]

So the last row becomes:

\[[-dot(T,R_0), -dot(T,R_1), -dot(T,R_2)]\]

And Expanding \(R^T\) is just the Transpose of the Rotation, thus completing the inverse:

\[C_{world}^{-1} = \begin{bmatrix} R^T & 0 \\ -TR^T & 1 \end{bmatrix}\]

This is also seen in the dunia engine and honestly in most game engines.

Dunia Engine Example for View Matrix Fast Inverse:

// Dot Product of: Right • CamPos rightTrans = (float)((float)(rightY * CamPos_XYZ[1]) + (float)(rightX * *CamPos_XYZ)) + (float)(rightZ * CamPos_XYZ[2]); forwardZ = *(float *)(a1 + 0x88); upX = *(float *)(a1 + 0x90); upY = *(float *)(a1 + 0x94); // Dot Product of: Forward • CamPos forwardTrans = (float)((float)(forwardY * CamPos_XYZ[1]) + (float)(forwardX * *CamPos_XYZ)) + (float)(forwardZ * CamPos_XYZ[2]); upZ = *(float *)(a1 + 0x98); // Dot product of: up • CamPos (upTrans + v18) *(float *)&v18 = upZ * CamPos_XYZ[2]; upTrans = (float)(upY * CamPos_XYZ[1]) + (float)(upX * *CamPos_XYZ);

Standard Camera Matrix layout being (Memory Layout):

\[C_{world} = \begin{bmatrix} r_x & r_y & r_z & 0 \\ u_x & u_y & u_z & 0 \\ f_x & f_y & f_z & 0 \\ p_x & p_y & p_z & 1.0 \end{bmatrix}\]

here the right vector would be stored like so:

*(float *)(a1 + 0x30) = rightX; *(float *)(a1 + 0x34) = rightY; *(float *)(a1 + 0x38) = rightZ;

The dunia engine transposes it like so and adds the dot products.

// Fast inverse for orthonormal matrices (View Matrix Construction) *(float *)(a1 + 0x30) = rightX; *(float *)(a1 + 0x40) = rightY; *(float *)(a1 + 0x50) = rightZ; *(float *)(a1 + 0x34) = forwardX; *(float *)(a1 + 0x44) = forwardY; *(float *)(a1 + 0x54) = forwardZ; *(float *)(a1 + 0x58) = upZ; *(float *)(a1 + 0x38) = upX; *(float *)(a1 + 0x48) = upY; *(float *)(a1 + 0x60) = -rightTrans; *(float *)(a1 + 0x64) = -forwardTrans; *(float *)(a1 + 0x68) = -(float)(upTrans + *(float *)&v18);

Closing thoughts:

I’m all out of rants and tangents to go on about, here are some main takeaways:

1. Clean C++ Code ≠ Clean Compiled Code

Abstraction is a luxury where the cost is performance. Generalized math wrappers and trusting the compiler to “figure it out” is how you end up with 133 instructions instead of 3.

2. Profilers won’t save you

If the entire foundation is bloated, the baseline execution cost of every function is artificially raised. Profiler-Invisible Waste / Death by a Thousand Cuts

3. Following the textbook perfectly

Those matrix inversions, matrix multiplications, identity matrices all look great on the whiteboard but in a low-level CPU pipeline it is going about it in a really roundabout way.

4. The “Main Path” Contagion

This is not even a niche, unimportant function. This is the main rendering function preparing many different matrices bound for the GPU for calculations.
You can guarantee this exact same philosophy infests every other system in the engine.

5. Who was the common antagonist anyway?

You might have already guessed! It’s the MatrixMultiply4x4() but really that’s just the narrative for this write-up.

The true antagonist is the codebase culture itself. It is the “Clean C++” philosophy that prioritizes developer convenience and generic abstractions over CPU execution realities. This exact same over-engineering bleeds into entirely different mathematical primitives across the entire engine.

And it probably doesn’t even stop at just math libraries, probably every other library is also abused like this.

In my next write-up, we are going to look at the exact opposite problem. We are going to explore the Compatibility Tax. The ghost of a 12-year-old CPU that keeps modern games from utilizing instructions that could theoretically yield 5x speedups.

Until then, worship the IDA goddess!