Redundancy and Bloat seen in AAA Game Engines

Part 1: Dunia Engine (Redundancy And Over-Engineering)

2026-05-25T18:30:00+00:00

Intro:

Before I delve into this, I need to get a few things out of the way ⚠️:

To be completely clear: I did not go looking for this. I was simply reversing engines to study how they construct their fundamental transformation matrices and handle temporal jitter logic. But when you read a core rendering function, this kind of avoidable overhead practically hits you in the face. You are not just looking at unoptimized code; you are looking at codebase culture.

Nowww, the functions we are going to dissect only run a few times per frame, so it really doesn’t matter whether we optimize them or not. But it raises an uncomfortable question: if such optimizations are not even considered on these important rendering pathways, then what other functions are overlooked? Does this culture infest every other system in the engine?

This is a symptom of “Profiler-Invisible-Waste”; it is a classic case of missing the forest for the trees. You will never find this issue through a profiler, because the waste is not concentrated in one place: the culture is ingrained in every function. This leads to “Death By a Thousand Cuts”.

Look at this VTune capture. The top hotspots barely break 5% of the CPU time each. But look at the red box: 80.9% of the execution time is buried in [Others]. Profilers are designed to find massive, isolated bottlenecks.

But if your entire codebase is built on over-engineered abstractions and bloated generic math wrappers, the baseline execution cost of every function is raised. You don’t get a few obvious performance spikes; you get a uniformly elevated floor.

That 80% block is exactly where those “thousand cuts” are hiding.

We will go over 3 separate game engines:

Dunia Engine (Far Cry Series)
Sucker Punch Studios Proprietary Engine (Ghost of Tsushima)
Avalanche Engine (Just Cause Series)

And all 3 have one common “Antagonist”.

Basic Outline:

Redundancy and Over-Engineering:

First we will go over the redundant and over-engineered code that exists in each engine, then explain why it is redundant or over-engineered and how it could have been written. I will show the difference in instruction count side by side (original vs. hand-written assembly).

We will also come up with theories on why this happens, specifically around the concept “Clean C++ Code ≠ Clean Compiled Code”.

Compatibility, Legacy and Readable Code

The Compatibility Tax: Not Wrong, Just Capped. In my next write-up we will look at the realities of AAA game development. Not all of this is the result of developers blindly trusting compilers. I will show sections of the engine where the code could theoretically run way faster (with some functions reaching 5× speedups) thanks to newer processor features, but where those features can’t be fully used because the engine must maintain compatibility with a wide range of hardware.

This whole write-up is just food for thought, a question really: “What is the performance tax for such abstractions?”

Case 1: Y-up To Z-up Over-Engineering (Dunia Engine)

This is doing a Coordinate Space Conversion in a very “Textbooky” way. They first construct a Matrix:

\[M = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & -4.37 \times 10^{-8} & 1 & 0 \\ 0 & -1 & -4.37 \times 10^{-8} & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}\]

on the stack, probably using some function like Matrix::CreateRotationX(-PI / 2)

So it actually constructed:

\[\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos(-90^\circ) & -\sin(-90^\circ) & 0 \\ 0 & \sin(-90^\circ) & \cos(-90^\circ) & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}\]
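As a rough illustration of where those -4.37 × 10^-8 entries come from, here is a minimal C++ sketch. The names Mat4, makeRotationX and makeYUpToZUp are my own stand-ins (I don’t know what the engine’s helper is actually called), and the layout simply mirrors the matrix above. Because -π/2 cannot be represented exactly as a float, the cosine comes out as a tiny non-zero value instead of 0:

```cpp
#include <cassert>
#include <cmath>

// Plain row-major 4x4 matrix (assumed layout, for illustration only).
struct Mat4 {
    float m[4][4];
};

// Hypothetical stand-in for a helper such as Matrix::CreateRotationX.
// Mirrors the matrix in the text: cos on the diagonal, -sin at [1][2], sin at [2][1].
inline Mat4 makeRotationX(float angle) {
    const float c = std::cos(angle);
    const float s = std::sin(angle);
    Mat4 out = {{{1.0f, 0.0f, 0.0f, 0.0f},
                 {0.0f, c, -s, 0.0f},
                 {0.0f, s, c, 0.0f},
                 {0.0f, 0.0f, 0.0f, 1.0f}}};
    return out;
}

// -PI/2 rounded to float; that rounding is why cos() does not land exactly on zero.
inline Mat4 makeYUpToZUp() {
    const float halfPi = 1.57079632679489661923f;
    return makeRotationX(-halfPi);
}
```

With a typical float math library, makeYUpToZUp() should give back a matrix matching M from the top of this section: exact ±1 entries plus two tiny leftovers on the diagonal.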

Then it passes both matrices to MatrixMultiply4x4(&CameraMatrix, (__int64)&matrixPointer);

The function computes a1 = a2 × a1, so the result is written back into the first argument.

Let’s put a breakpoint in the multiply call to see exactly what happens after this multiplication:

I am looking exactly “north” in-game when breakpointing

Before:

After:

So, take the up vector, negate it and put it on row 2, then take the forward vector and put it on row 1.

Let me put into perspective how many assembly instructions were executed just to do this:

Disassembly of the MatrixMultiply4x4 call, in execution order:

First load Camera Matrix:

movaps  xmm0, xmmword ptr [rdi]
movaps  xmm1, xmmword ptr [rbx+0C0h]
movaps  [rsp+1A0h+CameraMatrix], xmm0
movaps  xmm0, xmmword ptr [rbx+0D0h]
movaps  [rsp+1A0h+CamMatRow1], xmm1
movaps  xmm1, xmmword ptr [rbx+0E0h]
movaps  [rsp+1A0h+CamMatRow2], xmm0
movaps  [rsp+1A0h+CamMatRow3], xmm1

Construct the Swizzle Matrix:

mov     qword ptr [rsp+1A0h+matrixPointer], 3F800000h
mov     dword ptr [rbp+0A0h+var_120], 0
mov     dword ptr [rbp+0A0h+var_110], 0
mov     qword ptr [rbp+0A0h+var_100], 0
mov     dword ptr [rbp+0A0h+var_120+4], 0B33BBD2Eh
mov     dword ptr [rbp+0A0h+var_110+4], 0BF800000h
mov     qword ptr [rsp+1A0h+matrixPointer+8], 0
mov     qword ptr [rbp+0A0h+var_120+8], 3F800000h
mov     dword ptr [rbp+0A0h+var_110+8], 0B33BBD2Eh
mov     dword ptr [rbp+0A0h+var_100+8], 0
mov     dword ptr [rbp+0A0h+var_110+0Ch], 0
mov     dword ptr [rbp+0A0h+var_100+0Ch], 3F800000h

Then load arguments in registers:

lea     rdx, [rsp+1A0h+matrixPointer]
lea     rcx, [rsp+1A0h+CameraMatrix]

Call the function:

call MatrixMultiply4x4

Inside function entry (stack allocation, set up security cookies, load registers etc):

sub     rsp, 78h
movaps  [rsp+78h+var_18], xmm6
movaps  [rsp+78h+var_28], xmm7
mov     rax, cs:__security_cookie
xor     rax, rsp
mov     [rsp+78h+var_38], rax
movaps  xmm4, xmmword ptr [rcx]
lea     r8, [rsp+78h+var_78]
movaps  xmm5, xmmword ptr [rcx+10h]
lea     rax, [rsp+78h+var_78]
movaps  xmm6, xmmword ptr [rcx+20h]
sub     rdx, r8
movaps  xmm7, xmmword ptr [rcx+30h]
mov     r8d, 4
nop     dword ptr [rax]

Multiply all rows to columns (repeated 4 times):

; --- iteration 1 ---
movaps  xmm2, xmmword ptr [rdx+rax]
movaps  xmm3, xmm2
movaps  xmm0, xmm2
shufps  xmm3, xmm2, 55h
movaps  xmm1, xmm2
shufps  xmm0, xmm2, 0
mulps   xmm3, xmm5
shufps  xmm1, xmm2, 0AAh
mulps   xmm0, xmm4
mulps   xmm1, xmm6
shufps  xmm2, xmm2, 0FFh
addps   xmm3, xmm0
mulps   xmm2, xmm7
addps   xmm3, xmm1
addps   xmm3, xmm2
movaps  xmmword ptr [rax], xmm3
add     rax, 10h
sub     r8, 1
jnz     short loc_712D0F0
; --- iteration 2 ---
movaps  xmm2, xmmword ptr [rdx+rax]
movaps  xmm3, xmm2
movaps  xmm0, xmm2
shufps  xmm3, xmm2, 55h
movaps  xmm1, xmm2
shufps  xmm0, xmm2, 0
mulps   xmm3, xmm5
shufps  xmm1, xmm2, 0AAh
mulps   xmm0, xmm4
mulps   xmm1, xmm6
shufps  xmm2, xmm2, 0FFh
addps   xmm3, xmm0
mulps   xmm2, xmm7
addps   xmm3, xmm1
addps   xmm3, xmm2
movaps  xmmword ptr [rax], xmm3
add     rax, 10h
sub     r8, 1
jnz     short loc_712D0F0
; --- iteration 3 ---
movaps  xmm2, xmmword ptr [rdx+rax]
movaps  xmm3, xmm2
movaps  xmm0, xmm2
shufps  xmm3, xmm2, 55h
movaps  xmm1, xmm2
shufps  xmm0, xmm2, 0
mulps   xmm3, xmm5
shufps  xmm1, xmm2, 0AAh
mulps   xmm0, xmm4
mulps   xmm1, xmm6
shufps  xmm2, xmm2, 0FFh
addps   xmm3, xmm0
mulps   xmm2, xmm7
addps   xmm3, xmm1
addps   xmm3, xmm2
movaps  xmmword ptr [rax], xmm3
add     rax, 10h
sub     r8, 1
jnz     short loc_712D0F0
; --- iteration 4 ---
movaps  xmm2, xmmword ptr [rdx+rax]
movaps  xmm3, xmm2
movaps  xmm0, xmm2
shufps  xmm3, xmm2, 55h
movaps  xmm1, xmm2
shufps  xmm0, xmm2, 0
mulps   xmm3, xmm5
shufps  xmm1, xmm2, 0AAh
mulps   xmm0, xmm4
mulps   xmm1, xmm6
shufps  xmm2, xmm2, 0FFh
addps   xmm3, xmm0
mulps   xmm2, xmm7
addps   xmm3, xmm1
addps   xmm3, xmm2
movaps  xmmword ptr [rax], xmm3
add     rax, 10h
sub     r8, 1
jnz     short loc_712D0F0

Function end (dealloc stack, verify the security cookie, load result into memory and registers):

movaps  xmm0, [rsp+78h+var_78]
mov     rax, rcx
movaps  xmm1, [rsp+78h+var_68]
movaps  xmmword ptr [rcx], xmm0
movaps  xmm0, [rsp+78h+var_58]
movaps  xmmword ptr [rcx+10h], xmm1
movaps  xmm1, [rsp+78h+var_48]
movaps  xmmword ptr [rcx+20h], xmm0
movaps  xmmword ptr [rcx+30h], xmm1
mov     rcx, [rsp+78h+var_38]
xor     rcx, rsp                    ; StackCookie
call    j___security_check_cookie
movaps  xmm6, [rsp+78h+var_18]
movaps  xmm7, [rsp+78h+var_28]
add     rsp, 78h
retn

All this for a very simple coordinate space conversion?

We could simply swizzle the rows ourselves using movaps, then use xorps with a 0x80000000 sign mask for the negation! Let’s try:

The function seems to store the result of the multiply into the camera structure, so we can simply change how it stores the rows there. The multiplied result is not used again by the current function; it lives on the stack and is discarded on return, so we only care about what ends up in the camera structure.

So now take the up vector, negate it and put it on row 2, then take the forward vector and put it on row 1.

Counting rows from 0!

For this to work we simply change v16 = CamMatRow1; to v16 = CamMatRow2;

and

v17 = CamMatRow2; to v17 = CamMatRow1;

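This simple change does not add any assembly instructions; we are only modifying the operands of existing ones.

Then simply xorps v17 with a mask that holds 0x80000000 in every lane to flip the sign bits. (xorps has no immediate form, so the mask has to sit in a register or in memory.)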

And we are done!

So the total difference in instruction count is roughly 130 to 1, where we only added one new instruction:

xorps v17, signMask ; signMask = 0x80000000 in every lane

while completely nuking all the other instructions I have shown above.

To be clear: the instructions moving the vectors into the camera structure aren’t redundant. They are a sunk cost that has to execute regardless, so I did not include them in the assembly showcase. The only net-new instruction required to achieve the coordinate conversion is a single xorps, which is why I am calling it a roughly 130 to 1 instruction count decrease. The count does not include whatever runs inside __security_check_cookie.

Instruction Count Visualized:

Before: ~130 Instructions

; 1. Load Camera Matrix
movaps  xmm0, xmmword ptr [rdi]
movaps  xmm1, xmmword ptr [rbx+0C0h]
movaps  [rsp+1A0h+CameraMatrix], xmm0
movaps  xmm0, xmmword ptr [rbx+0D0h]
movaps  [rsp+1A0h+CamMatRow1], xmm1
movaps  xmm1, xmmword ptr [rbx+0E0h]
movaps  [rsp+1A0h+CamMatRow2], xmm0
movaps  [rsp+1A0h+CamMatRow3], xmm1

; 2. Construct Identity/Swizzle Matrix
mov     qword ptr [rsp+1A0h+matrixPointer], 3F800000h
mov     dword ptr [rbp+0A0h+var_120], 0
mov     dword ptr [rbp+0A0h+var_110], 0
mov     qword ptr [rbp+0A0h+var_100], 0
mov     dword ptr [rbp+0A0h+var_120+4], 0B33BBD2Eh
mov     dword ptr [rbp+0A0h+var_110+4], 0BF800000h
mov     qword ptr [rsp+1A0h+matrixPointer+8], 0
mov     qword ptr [rbp+0A0h+var_120+8], 3F800000h
mov     dword ptr [rbp+0A0h+var_110+8], 0B33BBD2Eh
mov     dword ptr [rbp+0A0h+var_100+8], 0
mov     dword ptr [rbp+0A0h+var_110+0Ch], 0
mov     dword ptr [rbp+0A0h+var_100+0Ch], 3F800000h

; 3. Setup Arguments & Call MatrixMultiply4x4
lea     rdx, [rsp+1A0h+matrixPointer]
lea     rcx, [rsp+1A0h+CameraMatrix]
call    MatrixMultiply4x4

; 4. Inside Function: ABI Overhead
sub     rsp, 78h
movaps  [rsp+78h+var_18], xmm6
movaps  [rsp+78h+var_28], xmm7
mov     rax, cs:__security_cookie
xor     rax, rsp
mov     [rsp+78h+var_38], rax
movaps  xmm4, xmmword ptr [rcx]
lea     r8, [rsp+78h+var_78]
movaps  xmm5, xmmword ptr [rcx+10h]
lea     rax, [rsp+78h+var_78]
movaps  xmm6, xmmword ptr [rcx+20h]
sub     rdx, r8
movaps  xmm7, xmmword ptr [rcx+30h]
mov     r8d, 4

; 5. SIMD Loop Body (executed 4 times)
; --- iteration 1 ---
movaps  xmm2, xmmword ptr [rdx+rax]
movaps  xmm3, xmm2
movaps  xmm0, xmm2
shufps  xmm3, xmm2, 55h
movaps  xmm1, xmm2
shufps  xmm0, xmm2, 0
mulps   xmm3, xmm5
shufps  xmm1, xmm2, 0AAh
mulps   xmm0, xmm4
mulps   xmm1, xmm6
shufps  xmm2, xmm2, 0FFh
addps   xmm3, xmm0
mulps   xmm2, xmm7
addps   xmm3, xmm1
addps   xmm3, xmm2
movaps  xmmword ptr [rax], xmm3
add     rax, 10h
sub     r8, 1
jnz     short loc_712D0F0
; --- iteration 2 ---
movaps  xmm2, xmmword ptr [rdx+rax]
movaps  xmm3, xmm2
movaps  xmm0, xmm2
shufps  xmm3, xmm2, 55h
movaps  xmm1, xmm2
shufps  xmm0, xmm2, 0
mulps   xmm3, xmm5
shufps  xmm1, xmm2, 0AAh
mulps   xmm0, xmm4
mulps   xmm1, xmm6
shufps  xmm2, xmm2, 0FFh
addps   xmm3, xmm0
mulps   xmm2, xmm7
addps   xmm3, xmm1
addps   xmm3, xmm2
movaps  xmmword ptr [rax], xmm3
add     rax, 10h
sub     r8, 1
jnz     short loc_712D0F0
; --- iteration 3 ---
movaps  xmm2, xmmword ptr [rdx+rax]
movaps  xmm3, xmm2
movaps  xmm0, xmm2
shufps  xmm3, xmm2, 55h
movaps  xmm1, xmm2
shufps  xmm0, xmm2, 0
mulps   xmm3, xmm5
shufps  xmm1, xmm2, 0AAh
mulps   xmm0, xmm4
mulps   xmm1, xmm6
shufps  xmm2, xmm2, 0FFh
addps   xmm3, xmm0
mulps   xmm2, xmm7
addps   xmm3, xmm1
addps   xmm3, xmm2
movaps  xmmword ptr [rax], xmm3
add     rax, 10h
sub     r8, 1
jnz     short loc_712D0F0
; --- iteration 4 ---
movaps  xmm2, xmmword ptr [rdx+rax]
movaps  xmm3, xmm2
movaps  xmm0, xmm2
shufps  xmm3, xmm2, 55h
movaps  xmm1, xmm2
shufps  xmm0, xmm2, 0
mulps   xmm3, xmm5
shufps  xmm1, xmm2, 0AAh
mulps   xmm0, xmm4
mulps   xmm1, xmm6
shufps  xmm2, xmm2, 0FFh
addps   xmm3, xmm0
mulps   xmm2, xmm7
addps   xmm3, xmm1
addps   xmm3, xmm2
movaps  xmmword ptr [rax], xmm3
add     rax, 10h
sub     r8, 1
jnz     short loc_712D0F0

; 6. Deallocate & Return
movaps  xmm0, [rsp+78h+var_78]
mov     rax, rcx
movaps  xmm1, [rsp+78h+var_68]
movaps  xmmword ptr [rcx], xmm0
movaps  xmm0, [rsp+78h+var_58]
movaps  xmmword ptr [rcx+10h], xmm1
movaps  xmm1, [rsp+78h+var_48]
movaps  xmmword ptr [rcx+20h], xmm0
movaps  xmmword ptr [rcx+30h], xmm1
mov     rcx, [rsp+78h+var_38]
xor     rcx, rsp
call    j___security_check_cookie
movaps  xmm6, [rsp+78h+var_18]
movaps  xmm7, [rsp+78h+var_28]
add     rsp, 78h
retn

After 3 Instructions

; Just swap the rows and flip the sign bits
movaps  v16, CamMatRow2
movaps  v17, CamMatRow1
xorps   v17, signMask   ; signMask = 0x80000000 in every lane

You might say it’s for readability, or that it makes it easier to modify the coordinate conversion later. But for a programmer who grasps the underlying math, the intent behind swapping rows and flipping a sign is perfectly clear, especially since we can express the exact same logic cleanly using SSE intrinsics in C++. Memorizing textbook formulas is fine, but if you don’t understand the actual spatial intent behind them you’re just pattern-matching.
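Here is roughly what I mean, as a minimal sketch with SSE intrinsics. Mat4 and convertYUpToZUp are hypothetical names, and I am assuming the camera matrix is stored as four __m128 rows. This treats the swizzle matrix as exact 0 and ±1 values, so the output is not guaranteed to be bit-identical to the original multiply, which also mixes in the -4.37 × 10^-8 rounding leftovers:

```cpp
#include <cassert>
#include <xmmintrin.h>

// Four rows of a row-major 4x4 matrix, one __m128 per row (assumed layout).
struct Mat4 {
    __m128 row[4];
};

// Hypothetical replacement for "RotationX(-PI/2) * camera":
// row 1 receives the old row 2, row 2 receives the negated old row 1.
inline Mat4 convertYUpToZUp(const Mat4& cam) {
    // -0.0f has only the sign bit set, i.e. 0x80000000 in every lane.
    const __m128 signMask = _mm_set1_ps(-0.0f);
    Mat4 out;
    out.row[0] = cam.row[0];
    out.row[1] = cam.row[2];
    out.row[2] = _mm_xor_ps(cam.row[1], signMask);  // flips the sign of x, y, z and w
    out.row[3] = cam.row[3];
    return out;
}
```

The XOR also flips the sign of the w lane, which only turns 0 into -0 when w is zero, as it is for a direction row.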

Case 2: Atanf just to get back FovX:

I’ve already talked about this in Reversing The Prespective Projection Matrix (Part 5.1), but I want to shed light on it in more detail.

The decompiler does not show the arguments for tanf and atanf, so I read the assembly (the argument is loaded into xmm0 just before each call) and wrote the arguments as comments on the right.

Let’s start with this block

fovX_calc = (__m128)*(unsigned int *)(a1 + 0x234);
fovX_calc.m128_f32[0] = fovX_calc.m128_f32[0] * 0.5;
ucrtBase_Tanf(); // Arg: fovX_calc

Using CE for dynamic analysis we can see that (a1 + 0x234) is FovX in radians

So it basically loads FovX in radians into “fovX_calc”, then immediately multiplies it by 0.5, so “fovX_calc” now holds the value fovX/2.

Next, Tanf() is called with arg = fovX_calc, so fovX_calc now holds the value tan(fovX/2).

Next the engine does a lot of calculations using tan(fovX/2), then the engine decides it actually needs the value of fovX back so it does it in an ingenious way!

ucrtBase_aTanf(); // Arg: fovX_calc
fovX_val.m128_f32[0] = fovX_calc.m128_f32[0] * 2.0;

The fovX_calc variable is never used again in the function.

The identity arctan(tan(x)) = x holds if and only if x lies strictly inside (−90°, 90°). FovX can only have a value from 60 to 120 degrees in-game. Since we divided it by 2, the angle is between 30 and 60 degrees, well within the (−π/2, π/2) principal range of arctan. So the round trip hands back the value we started with, up to floating-point rounding.

The codebase culture accepts:
\(x = 2 \cdot \arctan\!\left(\tan\!\left(\frac{x}{2}\right)\right)\)
as a valid way to move data from point A to point B.

So just to get back the raw fovX, they call atanf and then multiply the result by 2. This could easily have been avoided by reloading fovX from the camera structure (it is almost certainly still in the L1 cache), or alternatively by keeping it in another unused xmm register. Either option costs a tiny fraction of the cycles of a full-precision atanf call.
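To make the comparison concrete, here is a small sketch of the two options. FovResult, viaRoundTrip and viaKeptValue are names I made up for illustration; the first function mirrors the decompiled pattern, the second just keeps the input around:

```cpp
#include <cassert>
#include <cmath>

struct FovResult {
    float tanHalfFovX;  // needed by the later projection math
    float fovX;         // the original horizontal FOV, in radians
};

// Round trip as described above: recover fovX with 2 * atan(tan(fovX / 2)).
inline FovResult viaRoundTrip(float fovX) {
    const float t = std::tan(fovX * 0.5f);
    return {t, std::atan(t) * 2.0f};
}

// Alternative: keep the input alive in a local (or reload it) instead of inverting tan.
inline FovResult viaKeptValue(float fovX) {
    return {std::tan(fovX * 0.5f), fovX};
}
```

Both return the same tan(fovX/2), and the recovered angle only differs by rounding error, which is all the extra atanf call contributes.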

Keep in mind this is not using the formula to find FovY which is:
\(FOV_Y = 2 \cdot \tan^{-1}\!\left(\frac{\tan\!\left(\frac{FOV_X}{2}\right)}{A}\right)\)
There is no Aspect Ratio used here.

We can’t tell exactly how many cycles atanf() used. It can range from roughly 30 to 150 depending on your CPU architecture.

One meaningless atanf() on a function that only runs a few times per frame? That’s negligible.

Multiple redundant atanf() calls scattered across the codebase? That’s adding up.

Multiple redundant atanf, tanf, sinf, cosf, matrix multiply, matrix inversions, dot products, cross products etc etc..? That’s huge.

The Point:

The point here isn’t, “OmG tHeY uSeD a FuLl iEeE 754 bIt PeRfEct aTanf() CaLL jUsT tO gEt BaCk fovX!”

Let’s be real: doing that in a function that only runs a few times per frame doesn’t actually cost much, whether it’s an atanf call or a MatrixMultiply4x4.

The real point is: “Does it stop here?”, “Does this practice not carry over to all other systems?”

It does. This same over-engineering and “clean C++” bleeds into entirely different generic functions across the codebase. We’re talking atanf, tanf, sinf, cosf, 1/sqrtf, normalization, matrix multiply, matrix inversions, dot products, cross products, and honestly a whole, whole lot more.

And it’s probably not just math libraries being used like this.

This is the exact definition of “Death by a Thousand Cuts”. The Codebase ensures you’re bleeding from everywhere.

Now that you know the point of this blog, I will continue with Ghost of Tsushima and the Avalanche Engine.

Part 2: Ghost Of Tsushima (Vector Extraction Through Multiplication)

2026-05-25T18:30:00+00:00

Case 1: Vector Extraction Through Multiplication?

In Ghost of Tsushima, while I was looking at how the View-Projection Matrix was being constructed, I came across a common pattern where they do a full row-by-column multiplication that could be replaced by a simple movaps instruction.

Take this Example (IDA pseudo code):

v16[3] = _mm_add_ps(
    _mm_add_ps(
        _mm_add_ps(
            _mm_mul_ps(_mm_shuffle_ps((__m128)xmmword_1138D10, (__m128)xmmword_1138D10, 0), ProjMat_0),
            _mm_mul_ps(_mm_shuffle_ps((__m128)xmmword_1138D10, (__m128)xmmword_1138D10, 0x55), ProjMat_1)),
        _mm_mul_ps(_mm_shuffle_ps((__m128)xmmword_1138D10, (__m128)xmmword_1138D10, 0xAA), ProjMat_2)),
    _mm_mul_ps(_mm_shuffle_ps((__m128)xmmword_1138D10, (__m128)xmmword_1138D10, 0xFF), ProjMat_3));

This is part of the construction of the View-Projection Matrix where the translation row of the matrix needed to be zeroed out (likely for skybox rendering). In other words: multiply View with Projection using only the directional vectors, with the translation zeroed out.

The logic here is simply:

Step 1: Shuffle

_mm_shuffle_ps(someRow1, someRow1, imm) selects one component of someRow1 and replicates it across all 4 slots of a new __m128.
The different imm values pick different elements:
0x00 -> picks element 0 (X)
0x55 -> picks element 1 (Y)
0xAA -> picks element 2 (Z)
0xFF -> picks element 3 (W)

After shuffling, each __m128 looks like [X,X,X,X], [Y,Y,Y,Y], etc.

Step 2: Multiply with Projection Matrix

Each shuffled vector is multiplied component-wise with a row of the projection matrix:

_mm_mul_ps(shuffledRow, ProjMat_n)

This performs 4 parallel multiplications of the same broadcast component with each element of that projection matrix row.

Step 3: Sum the results

The _mm_add_ps calls sum all four products together:

(X * ProjMat_0) + (Y * ProjMat_1) + (Z * ProjMat_2) + (W * ProjMat_3)

The result is a single row of the final View-Projection matrix.
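The three steps above can be sketched as one small helper. mulRowByMatrix is a hypothetical name, and passing the matrix as four separate __m128 rows is just my assumption to keep the example short:

```cpp
#include <cassert>
#include <xmmintrin.h>

// Computes v * M for a row vector v and a matrix given as its four rows m0..m3.
inline __m128 mulRowByMatrix(__m128 v, __m128 m0, __m128 m1, __m128 m2, __m128 m3) {
    // Step 1: broadcast each component of v across all four lanes.
    const __m128 x = _mm_shuffle_ps(v, v, 0x00);  // [X, X, X, X]
    const __m128 y = _mm_shuffle_ps(v, v, 0x55);  // [Y, Y, Y, Y]
    const __m128 z = _mm_shuffle_ps(v, v, 0xAA);  // [Z, Z, Z, Z]
    const __m128 w = _mm_shuffle_ps(v, v, 0xFF);  // [W, W, W, W]
    // Step 2 and 3: scale each matrix row by its component, then sum the four products.
    return _mm_add_ps(
        _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, m0), _mm_mul_ps(y, m1)), _mm_mul_ps(z, m2)),
        _mm_mul_ps(w, m3));
}
```

Calling it with v = (0, 0, 0, 1) hands back m3 unchanged (for finite inputs), which is the whole point of the next section.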

The Problem:

The problem here is that xmmword_1138D10 has a value of: (0.0, 0.0, 0.0, 1.0).

Since the first three components are zero, those multiplications with ProjMat_0, ProjMat_1, and ProjMat_2 drop out. The only one left is the last one, where w = 1.0. Which means you’re just selecting the last row of the projection matrix (ProjMat_3).

So this whole instruction chain simplifies to basically 1 movaps instruction:

v16[3] = ProjMat_3

Case 2: Even More Vector Extraction Through Multiplication??

This is the later stage where translation is added back into the View-Projection Matrix where it was previously zeroed out, and it is done in a very confusing way.

v17 = _mm_add_ps(
    _mm_add_ps(
        _mm_add_ps(
            _mm_mul_ps(_mm_shuffle_ps(CamPos_Negated_w1_dup, CamPos_Negated_w1_dup, 0x55), VP_NoTrans_Row1),
            _mm_mul_ps(_mm_shuffle_ps(CamPos_Negated_w1_dup, CamPos_Negated_w1_dup, 0), *VP_NoTrans)),
        _mm_mul_ps(_mm_shuffle_ps(CamPos_Negated_w1_dup, CamPos_Negated_w1_dup, 0xAA), VP_NoTrans_Row2)),
    _mm_mul_ps(_mm_shuffle_ps(CamPos_Negated_w1_dup, CamPos_Negated_w1_dup, 0xFF), VP_NoTrans_Row3));

This is the only Vector Multiplication that matters, the one where it’s adding back the translation into the VP matrix.

Here is the biggest reduction:

v18 = _mm_add_ps(
    _mm_mul_ps(_mm_shuffle_ps(Mask_0100, Mask_0100, 0x55), VP_NoTrans_Row1),
    _mm_mul_ps(_mm_shuffle_ps(Mask_0100, Mask_0100, 0), *VP_NoTrans));
*(__m128 *)(a1 + 0x260) = _mm_add_ps(
    _mm_add_ps(
        _mm_add_ps(
            _mm_mul_ps(_mm_shuffle_ps(Mask_0010, Mask_0010, 0x55), VP_NoTrans_Row1),
            _mm_mul_ps(_mm_shuffle_ps(Mask_0010, Mask_0010, 0), *VP_NoTrans)),
        _mm_mul_ps(_mm_shuffle_ps(Mask_0010, Mask_0010, 0xAA), VP_NoTrans_Row2)),
    _mm_mul_ps(_mm_shuffle_ps(Mask_0010, Mask_0010, 0xFF), VP_NoTrans_Row3));
*(__m128 *)(a1 + 0x270) = v17;
*(__m128 *)(a1 + 0x250) = _mm_add_ps(
    _mm_add_ps(v18, _mm_mul_ps(_mm_shuffle_ps(Mask_0100, Mask_0100, 0xAA), VP_NoTrans_Row2)),
    _mm_mul_ps(_mm_shuffle_ps(Mask_0100, Mask_0100, 0xFF), VP_NoTrans_Row3));
*(__m128 *)(a1 + 0x240) = _mm_add_ps(
    _mm_add_ps(
        _mm_add_ps(
            _mm_mul_ps(_mm_shuffle_ps(Mask_1000, Mask_1000, 0x55), VP_NoTrans_Row1),
            _mm_mul_ps(_mm_shuffle_ps(Mask_1000, Mask_1000, 0), VP_NoTrans_Row0)),
        _mm_mul_ps(_mm_shuffle_ps(Mask_1000, Mask_1000, 0xAA), VP_NoTrans_Row2)),
    _mm_mul_ps(_mm_shuffle_ps(Mask_1000, Mask_1000, 0xFF), VP_NoTrans_Row3));

This is just extracting the values stored in the VP rows using Masks.

Mask_1000 is (1, 0, 0, 0)
Mask_0100 is (0, 1, 0, 0)
Mask_0010 is (0, 0, 1, 0)

Multiplying a unit vector by a matrix simply extracts the corresponding row. The original code was laboriously performing this extraction manually for each axis:

The calculation for 0x240 used Mask_1000 to extract Row0.
The calculation for 0x250 used Mask_0100 to extract Row1.
The calculation for 0x260 used Mask_0010 to extract Row2.

So it just collapses into 3 movaps stores plus the 1 vector-matrix multiplication that computes v17.

*(__m128 *)(a1 + 0x240) = VP_NoTrans_Row0;
*(__m128 *)(a1 + 0x250) = VP_NoTrans_Row1;
*(__m128 *)(a1 + 0x260) = VP_NoTrans_Row2;
*(__m128 *)(a1 + 0x270) = v17;
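As a sketch of the collapsed version in intrinsics: Rows4, mulRowByMatrix and addTranslationBack are hypothetical names of mine, and the struct layout is an assumption, not the game’s actual data structure. Only the translation row goes through real math; the other three rows are plain copies:

```cpp
#include <cassert>
#include <xmmintrin.h>

// Four matrix rows, one __m128 each (assumed layout).
struct Rows4 {
    __m128 r[4];
};

// Row vector times matrix: broadcast each component, scale the rows, sum.
inline __m128 mulRowByMatrix(__m128 v, const Rows4& m) {
    const __m128 x = _mm_mul_ps(_mm_shuffle_ps(v, v, 0x00), m.r[0]);
    const __m128 y = _mm_mul_ps(_mm_shuffle_ps(v, v, 0x55), m.r[1]);
    const __m128 z = _mm_mul_ps(_mm_shuffle_ps(v, v, 0xAA), m.r[2]);
    const __m128 w = _mm_mul_ps(_mm_shuffle_ps(v, v, 0xFF), m.r[3]);
    return _mm_add_ps(_mm_add_ps(_mm_add_ps(x, y), z), w);
}

// Collapsed version: rows 0..2 are copied, only the translation row is computed.
inline Rows4 addTranslationBack(const Rows4& vpNoTrans, __m128 camPosNegatedW1) {
    Rows4 out;
    out.r[0] = vpNoTrans.r[0];
    out.r[1] = vpNoTrans.r[1];
    out.r[2] = vpNoTrans.r[2];
    out.r[3] = mulRowByMatrix(camPosNegatedW1, vpNoTrans);
    return out;
}
```

One caveat I should be honest about: multiplying by a unit mask and copying the row only agree when every value involved is finite, because 0 × Inf or 0 × NaN gives NaN, and a -0.0 entry can come out as +0.0. For a sane View-Projection matrix that should not matter.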

Again: I was not looking specifically for over-engineered code, this just stood out a lot.

I have also optimized this in assembly (for fun) in my previous blog post:
Reversing The ViewProjection Matrix - Part 4.5: Detour Hooking to Optimize SIMD Operations

Part 3: Avalanche Engine (Matrix Multiplication Replaces Vector Addition)

2026-05-25T18:30:00+00:00

Temporal Anti-Aliasing (TAA): Matrix Multiplication Replaces Vector Addition

Here we see a very “Textbook” way of adding jitters to the projection matrix for SMAA_T2X in the Avalanche Engine.

The classic textbook way being:

\[\begin{bmatrix} x_{scale} & 0 & 0 & 0 \\ 0 & y_{scale} & 0 & 0 \\ 0 & 0 & \dfrac{z_{far}}{z_{far}-z_{near}} & 1 \\ 0 & 0 & -\dfrac{z_{near}z_{far}}{z_{far}-z_{near}} & 0 \end{bmatrix} \times \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ j_x & j_y & 0 & 1 \end{bmatrix} = \begin{bmatrix} x_{scale} & 0 & 0 & 0 \\ 0 & y_{scale} & 0 & 0 \\ j_x & j_y & \dfrac{z_{far}}{z_{far}-z_{near}} & 1 \\ 0 & 0 & -\dfrac{z_{near}z_{far}}{z_{far}-z_{near}} & 0 \end{bmatrix}\]

This looks clean when looking at the source code, but in a low-level CPU render loop, it is inefficient.

It would probably look something like this in C++:

Matrix4x4 jitterMatrix = Matrix4x4::Identity();
jitterMatrix.m[3][0] = jX;
jitterMatrix.m[3][1] = jY;
projMatrix = projMatrix * jitterMatrix;

Again: Clean C++ Code ≠ Clean Compiled Code

First we need to construct an entire 4x4 Identity Matrix on the stack just to hold two float values.
Then load the matrices as arguments into the function.
Inside the function do stack allocation, set up security cookies, load registers etc.
Multiply all rows to columns (repeated 4 times)
Finally end the function by deallocating stack, verify the security cookie, loading result into memory and registers.

Note: Calculating curFrame & 1, selecting jitters, scaling them down to sub-pixel space are all mathematically necessary steps and are not over-engineered.

The easier way to do it would be:

Take the scaled down jitters.
Take the 2nd Row (counting from 0) of the Projection Matrix and do a very simple addps.

Example (where the jitters were already scaled down):

movaps xmm0, projMat_2  ; Load 2nd Row [0, 0, Z_scale, 1]
movaps xmm1, jitter_Row ; [jitX, jitY, 0, 0]

addps xmm0, xmm1 ; result: [jX, jY, Z_scale, 1]

That’s a drop from roughly 120 instructions to 3.
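In C++ with intrinsics, the same idea could be sketched like this. Mat4 and applyJitter are hypothetical names and the row layout is my assumption. The shortcut works because multiplying by the jitter matrix adds w × (jX, jY, 0, 0) to each row, and in this projection matrix only row 2 has a non-zero w:

```cpp
#include <cassert>
#include <xmmintrin.h>

// Four rows of the projection matrix, one __m128 per row (assumed layout).
struct Mat4 {
    __m128 row[4];
};

// Adds the already scaled sub-pixel jitter to row 2 (counting from 0).
// Matches proj * jitterMatrix only if rows 0, 1 and 3 have w == 0 and row 2 has w == 1.
inline void applyJitter(Mat4& proj, float jitterX, float jitterY) {
    const __m128 jitterRow = _mm_setr_ps(jitterX, jitterY, 0.0f, 0.0f);
    proj.row[2] = _mm_add_ps(proj.row[2], jitterRow);
}
```

For a projection matrix with a different w column the full multiply would give a different answer, so this is a special case, not a drop-in replacement for a generic multiply.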

Part 4: Using generic matrix inverse when you don’t need to

2026-05-25T18:30:00+00:00

This is a small tangent I’m going to go on: it’s about the number of times I have seen a game engine use a generic matrix inverse function where a specialized, inlined version would be way WAY faster!

Here is a generic matrix inverse using Cramer’s Rule:

_UNKNOWN **__fastcall sub_65C3E20(__m128 *a1)
{
  v2 = *a1;
  v3 = a1[1];
  v4 = _mm_shuffle_ps(v2, v2, 0xFF).m128_f32[0];
  v5 = _mm_shuffle_ps(v3, v3, 0xFF).m128_f32[0];
  v40 = *a1;
  v42 = v3.m128_f32[0];
  v6 = a1[2];
  v7 = a1[3];
  v8 = _mm_shuffle_ps(v7, v7, 0xFF).m128_f32[0];
  v9 = _mm_shuffle_ps(v7, v7, 0xAA).m128_f32[0];
  v10 = _mm_shuffle_ps(v6, v6, 0xFF).m128_f32[0];
  v11 = _mm_shuffle_ps(v2, v2, 0xAA).m128_f32[0];
  v38 = v7;
  v7.m128_f32[0] = _mm_shuffle_ps(v3, v3, 0xAA).m128_f32[0];
  v39 = _mm_shuffle_ps(v6, v6, 0xAA).m128_f32[0];
  v12 = v7.m128_f32[0] * v8;
  v13 = v39 * v8;
  v14 = v5 * v9;
  v37 = v6.m128_f32[0];
  v41 = _mm_shuffle_ps(v3, v3, 0x55).m128_f32[0];
  v46 = v8;
  v15 = v11 * v8;
  v44 = v11;
  v16 = v4 * v9;
  v48 = v9;
  v17 = v10 * v9;
  v18 = v11 * v5;
  v49 = v7.m128_f32[0];
  v7.m128_f32[0] = v7.m128_f32[0] * v10;
  v43 = v10;
  v19 = v11 * v10;
  v20 = v4 * v39;
  v47 = v4;
  v21 = v4 * v49;
  v31 = v12;
  v45 = _mm_shuffle_ps(v6, v6, 0x55).m128_f32[0];
  v50 = _mm_shuffle_ps(v38, v38, 0x55).m128_f32[0];
  v30 = (float)((float)((float)(v45 * v14) + (float)(v41 * v13)) + (float)(v50 * v7.m128_f32[0])) - (float)((float)((float)(v45 * v12) + (float)(v41 * v17)) + (float)(v50 * (float)(v5 * v39)));
  v32 = (float)((float)((float)(v45 * v15) + (float)(v40.m128_f32[1] * v17)) + (float)(v50 * v20)) - (float)((float)((float)(v45 * v16) + (float)(v40.m128_f32[1] * v13)) + (float)(v50 * v19));
  v33 = (float)((float)((float)(v41 * v16) + (float)(v40.m128_f32[1] * v12)) + (float)(v50 * v18)) - (float)((float)((float)(v41 * v15) + (float)(v40.m128_f32[1] * v14)) + (float)(v50 * v21));
  v3.m128_f32[0] = (float)((float)((float)(v41 * v19) + (float)(v40.m128_f32[1] * (float)(v5 * v39))) + (float)(v45 * v21)) - (float)((float)((float)(v41 * v20) + (float)(v40.m128_f32[1] * v7.m128_f32[0])) + (float)(v45 * v18));
  v22 = (float)((float)((float)(v6.m128_f32[0] * v12) + (float)(v42 * v17)) + (float)(v38.m128_f32[0] * (float)(v5 * v39))) - (float)((float)((float)(v6.m128_f32[0] * v14) + (float)(v42 * v13)) + (float)(v38.m128_f32[0] * v7.m128_f32[0]));
  v23 = (float)((float)((float)(v6.m128_f32[0] * v16) + (float)(v40.m128_f32[0] * v13)) + (float)(v38.m128_f32[0] * v19)) - (float)((float)((float)(v6.m128_f32[0] * v15) + (float)(v40.m128_f32[0] * v17)) + (float)(v38.m128_f32[0] * v20));
  v24 = (float)((float)((float)(v42 * v15) + (float)(v40.m128_f32[0] * v14)) + (float)(v38.m128_f32[0] * v21)) - (float)((float)((float)(v42 * v16) + (float)(v40.m128_f32[0] * v31)) + (float)(v38.m128_f32[0] * v18));
  v6.m128_f32[0] = v42 * v19;
  v25 = v38.m128_f32[0] * COERCE_FLOAT(HIDWORD(a1->m128_u64[0]));
  v26 = (float)((float)((float)(v42 * v20) + (float)(v40.m128_f32[0] * v7.m128_f32[0])) + (float)(v37 * v18)) - (float)((float)(v6.m128_f32[0] + (float)(v40.m128_f32[0] * (float)(v5 * v39))) + (float)(v37 * v21));
  v27 = COERCE_FLOAT(*a1) * v50;
  v29 = COERCE_FLOAT(*a1) * v45;
  v34 = v37 * COERCE_FLOAT(HIDWORD(a1->m128_u64[0]));
  v35 = COERCE_FLOAT(*a1) * v41;
  v36 = v42 * COERCE_FLOAT(HIDWORD(a1->m128_u64[0]));
  v28 = 1.0 / (float)((float)((float)((float)(v30 * COERCE_FLOAT(*a1)) + (float)(v32 * v42)) + (float)(v33 * v37)) + (float)(v3.m128_f32[0] * v38.m128_f32[0]));
  a1->m128_f32[0] = v30 * v28;
  a1[1].m128_f32[1] = v28 * v23;
  a1[1].m128_f32[0] = v28 * v22;
  a1[1].m128_f32[3] = v28 * v26;
  a1[1].m128_f32[2] = v28 * v24;
  a1->m128_f32[3] = v3.m128_f32[0] * v28;
  a1->m128_f32[1] = v32 * v28;
  a1->m128_f32[2] = v33 * v28;
  a1[2].m128_f32[0] = (float)((float)((float)((float)((float)(v38.m128_f32[0] * v41) * v43) + (float)(v5 * (float)(v37 * v50))) + (float)((float)(v42 * v45) * v46)) - (float)((float)((float)(v5 * (float)(v38.m128_f32[0] * v45)) + (float)((float)(v42 * v50) * v43)) + (float)((float)(v37 * v41) * v46))) * v28;
  a1[2].m128_f32[1] = (float)((float)((float)((float)(v47 * (float)(v38.m128_f32[0] * v45)) + (float)(v27 * v43)) + (float)(v34 * v46)) - (float)((float)((float)(v25 * v43) + (float)(v47 * (float)(v37 * v50))) + (float)(v29 * v46))) * v28;
  a1[2].m128_f32[2] = (float)((float)((float)((float)(v25 * v5) + (float)(v47 * (float)(v42 * v50))) + (float)(v35 * v46)) - (float)((float)((float)(v47 * (float)(v38.m128_f32[0] * v41)) + (float)(v27 * v5)) + (float)(v36 * v46))) * v28;
  a1[2].m128_f32[3] = (float)((float)((float)((float)(v47 * (float)(v37 * v41)) + (float)(v29 * v5)) + (float)(v36 * v43)) - (float)((float)((float)(v34 * v5) + (float)(v47 * (float)(v42 * v45))) + (float)(v35 * v43))) * v28;
  a1[3].m128_f32[0] = (float)((float)((float)((float)((float)(v37 * v41) * v48) + (float)((float)(v42 * v50) * v39)) + (float)(v49 * (float)(v38.m128_f32[0] * v45))) - (float)((float)((float)((float)(v42 * v45) * v48) + (float)(v49 * (float)(v37 * v50))) + (float)((float)(v38.m128_f32[0] * v41) * v39))) * v28;
  a1[3].m128_f32[1] = (float)((float)((float)((float)(v29 * v48) + (float)(v44 * (float)(v37 * v50))) + (float)(v25 * v39)) - (float)((float)((float)(v34 * v48) + (float)(v27 * v39)) + (float)(v44 * (float)(v38.m128_f32[0] * v45)))) * v28;
  a1[3].m128_f32[2] = (float)((float)((float)((float)(v36 * v48) + (float)(v27 * v49)) + (float)(v44 * (float)(v38.m128_f32[0] * v41))) - (float)((float)((float)(v35 * v48) + (float)(v44 * (float)(v42 * v50))) + (float)(v25 * v49))) * v28;
  a1[3].m128_f32[3] = (float)((float)((float)((float)(v44 * (float)(v42 * v45)) + (float)(v35 * v39)) + (float)(v34 * v49)) - (float)((float)((float)(v29 * v49) + (float)(v36 * v39)) + (float)(v44 * (float)(v37 * v41)))) * v28;
  return &retaddr;
}

For matrices like the Projection Matrix, a large number of values are simply 0. If it’s a well-known matrix, it can be inlined without using generic inversions. It would be a huge drop in CPU cycles!

In the Dunia engine, they have inlined the inverse Projection Matrix calculation by saving important variables while constructing the Projection Matrix and rearranging them.

The instruction count for this generic Cramer’s Rule inverse is roughly 470, and instruction count alone isn’t enough to showcase the inefficiency:
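To show what such a specialized inverse can look like, here is a minimal sketch for the projection layout used in the jitter section (x scale, y scale, and the two depth terms). This is my own closed form derived from that layout, not a reconstruction of Dunia’s code; Mat4, makeProjection, makeInverseProjection and multiply are hypothetical names:

```cpp
#include <cassert>
#include <cmath>

// Plain row-major 4x4 matrix (assumed layout, for illustration only).
struct Mat4 {
    float m[4][4];
};

// Projection matrix with rows [xs,0,0,0], [0,ys,0,0], [0,0,a,1], [0,0,b,0].
inline Mat4 makeProjection(float xScale, float yScale, float zNear, float zFar) {
    const float a = zFar / (zFar - zNear);
    const float b = -zNear * zFar / (zFar - zNear);
    Mat4 out = {{{xScale, 0.0f, 0.0f, 0.0f},
                 {0.0f, yScale, 0.0f, 0.0f},
                 {0.0f, 0.0f, a, 1.0f},
                 {0.0f, 0.0f, b, 0.0f}}};
    return out;
}

// Closed-form inverse of the matrix above: a handful of divisions, no cofactors.
inline Mat4 makeInverseProjection(float xScale, float yScale, float zNear, float zFar) {
    const float a = zFar / (zFar - zNear);
    const float b = -zNear * zFar / (zFar - zNear);
    Mat4 out = {{{1.0f / xScale, 0.0f, 0.0f, 0.0f},
                 {0.0f, 1.0f / yScale, 0.0f, 0.0f},
                 {0.0f, 0.0f, 0.0f, 1.0f / b},
                 {0.0f, 0.0f, 1.0f, -a / b}}};
    return out;
}

// Reference multiply, only used to check that P * P^-1 is the identity.
inline Mat4 multiply(const Mat4& lhs, const Mat4& rhs) {
    Mat4 out = {};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                out.m[i][j] += lhs.m[i][k] * rhs.m[k][j];
    return out;
}
```

Since the scale and depth terms are already sitting in registers while the projection matrix is being built, the inverse falls out almost for free compared to a generic cofactor expansion.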

1. SIMD to Scalar

The matrix rows are loaded into 128-bit wide registers, but get immediately unpacked to do scalar calculations 32 bits at a time.

2. Register Spilling

float v29; // [rsp+4h] [rbp-1C4h]
float v30; // [rsp+8h] [rbp-1C0h]
float v31; // [rsp+Ch] [rbp-1BCh]
float v32; // [rsp+10h] [rbp-1B8h]
float v33; // [rsp+14h] [rbp-1B4h]
float v34; // [rsp+1Ch] [rbp-1ACh]
float v35; // [rsp+20h] [rbp-1A8h]
float v36; // [rsp+24h] [rbp-1A4h]
etc...

16 XMM registers are available for floating-point math. This function defines over 50 individual float variables.

The variables annotated with stack offsets like [rsp+Ch] indicate that the compiler ran out of registers and was forced to “spill” intermediate calculations to the stack.

3. Dependency Chains

Look at:

v28 = 1.0 / (float)((float)((float)((float)(v30 * COERCE_FLOAT(*a1)) + (float)(v32 * v42)) + (float)(v33 * v37)) + (float)(v3.m128_f32[0] * v38.m128_f32[0]));

which represents 1.0/determinant. Its calculation requires v30, v32, v33, and others to be completely finished.

Subsequently, every single element written back to the final matrix depends on v28.

a1[2].m128_f32[1] = (float)((float)((float)((float)(v47 * (float)(v38.m128_f32[0] * v45)) + (float)(v27 * v43)) + (float)(v34 * v46)) - (float)((float)((float)(v25 * v43) + (float)(v47 * (float)(v37 * v50))) + (float)(v29 * v46))) * v28;
a1[2].m128_f32[2] = (float)((float)((float)((float)(v25 * v5) + (float)(v47 * (float)(v42 * v50))) + (float)(v35 * v46)) - (float)((float)((float)(v47 * (float)(v38.m128_f32[0] * v41)) + (float)(v27 * v5)) + (float)(v36 * v46))) * v28;
etc...

I would be very surprised if it isn’t stalled at least somewhere in the pipeline.

The Dunia engine inlined this to about 15-20 instructions by saving the required values into registers or the stack and assembling the inverse at the end of projection matrix construction. More info in my previous write-up: Part 6: Reversing Construction of the Inverse Projection Matrix
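The general idea can be sketched in a few lines. This is a minimal sketch, not Dunia's actual code: I'm assuming a D3D-style left-handed perspective matrix in the row-vector convention (depth range 0..1, no jitter terms), and `Mat4`, `make_projection` and `make_inverse_projection` are hypothetical names.

```cpp
#include <cmath>

struct Mat4 { float m[4][4]; };

// Perspective projection (row-vector convention, assumed layout).
// The only non-zero entries are m00, m11, m22, m23 and m32.
Mat4 make_projection(float fovY, float aspect, float zn, float zf) {
    Mat4 p = {};
    float h = 1.0f / std::tan(fovY * 0.5f);
    float q = zf / (zf - zn);
    p.m[0][0] = h / aspect;
    p.m[1][1] = h;
    p.m[2][2] = q;
    p.m[2][3] = 1.0f;
    p.m[3][2] = -q * zn;
    return p;
}

// Closed-form inverse built from the same five values.
// The x and y scales invert independently; the lower-right 2x2 block
// [[c, e], [d, 0]] inverts to [[0, 1/d], [1/e, -c/(e*d)]].
Mat4 make_inverse_projection(const Mat4& p) {
    Mat4 inv = {};
    inv.m[0][0] = 1.0f / p.m[0][0];
    inv.m[1][1] = 1.0f / p.m[1][1];
    inv.m[2][3] = 1.0f / p.m[3][2];
    inv.m[3][2] = 1.0f / p.m[2][3];
    inv.m[3][3] = -p.m[2][2] / (p.m[2][3] * p.m[3][2]);
    return inv;
}
```

In a real engine the five values would already be sitting in registers at the end of projection construction, so the inverse would be written out right there instead of being recovered from the finished matrix.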

This also ties into the Camera Matrix and honestly various other matrices!

Reminder!
Camera Matrix^-1 = View Matrix

Let’s take the view matrix as an example of how to inline an inverse, from my previous blog Reversing The ViewProjection Matrix (Part 4.2: Reversing SIMD Instructions for Matrix Math - Fast inverse for orthonormal Matrices)

Fast inverse for orthonormal Matrices

If R is a pure rotation matrix meaning:

No scaling,
No shear,
It’s orthonormal (columns are perpendicular and unit-length)

then \(R^{-1} = R^T\)

Suppose a 4x4 matrix with homogeneous coordinates:

\[C_{world} = \begin{bmatrix} R_{00} & R_{01} & R_{02} & 0 \\ R_{10} & R_{11} & R_{12} & 0 \\ R_{20} & R_{21} & R_{22} & 0 \\ T_x & T_y & T_z & 1.0 \end{bmatrix}\]

Here:

R (upper 3×3) is the orientation of the camera in world space.
T (bottom row, first 3 values) is the position of the camera in world space.

To get \(C_{world}^{-1}\) we can separate the matrix like so:

\[C_{world} = \begin{bmatrix} R & 0 \\ T & 1 \end{bmatrix}\]

and we want its inverse.

The block matrix inverse formula for this special form is:

\[\begin{bmatrix} A & 0 \\ B & 1 \end{bmatrix}^{-1} = \begin{bmatrix} A^{-1} & 0 \\ -BA^{-1} & 1 \end{bmatrix}\]

(See Wikipedia: Blockwise inversion for the general derivation)

Applying The Formula we get:

A = R
B = T

So:

\[C_{world}^{-1} = \begin{bmatrix} R^{-1} & 0 \\ -TR^{-1} & 1 \end{bmatrix}\]

Since R is orthonormal (\(R^{-1} = R^T\)):

\[C_{world}^{-1} = \begin{bmatrix} R^T & 0 \\ -TR^T & 1 \end{bmatrix}\]

The exponent “T” represents the transpose and the regular “T” represents the translation.
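As a quick sanity check (a worked step using only the block form above), multiplying the matrix by the claimed inverse gives the identity, because \(RR^T = I\) for an orthonormal R:

\[C_{world}C_{world}^{-1} = \begin{bmatrix} R & 0 \\ T & 1 \end{bmatrix} \begin{bmatrix} R^T & 0 \\ -TR^T & 1 \end{bmatrix} = \begin{bmatrix} RR^T & 0 \\ TR^T - TR^T & 1 \end{bmatrix} = \begin{bmatrix} I & 0 \\ 0 & 1 \end{bmatrix}\]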

Now expand \(-TR^T\) into its dot products:

If:

\[R = \begin{bmatrix} R_{0x} & R_{0y} & R_{0z} \\ R_{1x} & R_{1y} & R_{1z} \\ R_{2x} & R_{2y} & R_{2z} \\ \end{bmatrix}\]

and \(T = [T_x, T_y, T_z],\)

So:

\[-TR^T= \begin{bmatrix} -T_x & -T_y & -T_z \\ \end{bmatrix} \times \begin{bmatrix} R_{0x} & R_{1x} & R_{2x} \\ R_{0y} & R_{1y} & R_{2y} \\ R_{0z} & R_{1z} & R_{2z} \\ \end{bmatrix}\]

then:

\[-TR^T = [-dot(T,R_0), -dot(T,R_1), -dot(T,R_2)]\]

So the last row, including the homogeneous 1, becomes:

\[[-dot(T,R_0), -dot(T,R_1), -dot(T,R_2), 1]\]

The upper-left block \(R^T\) is just the transpose of the rotation, which completes the inverse:

\[C_{world}^{-1} = \begin{bmatrix} R^T & 0 \\ -TR^T & 1 \end{bmatrix}\]
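Written out element by element with the rows \(R_0, R_1, R_2\) defined earlier (just the two blocks expanded, nothing new), that is:

\[C_{world}^{-1} = \begin{bmatrix} R_{0x} & R_{1x} & R_{2x} & 0 \\ R_{0y} & R_{1y} & R_{2y} & 0 \\ R_{0z} & R_{1z} & R_{2z} & 0 \\ -dot(T,R_0) & -dot(T,R_1) & -dot(T,R_2) & 1 \end{bmatrix}\]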

This is also seen in the Dunia engine and honestly in most game engines.

Dunia Engine Example for View Matrix Fast Inverse:

// Dot Product of: Right • CamPos
rightTrans = (float)((float)(rightY * CamPos_XYZ[1]) + (float)(rightX * *CamPos_XYZ)) + (float)(rightZ * CamPos_XYZ[2]);
forwardZ = *(float *)(a1 + 0x88);
upX = *(float *)(a1 + 0x90);
upY = *(float *)(a1 + 0x94);
// Dot Product of: Forward • CamPos
forwardTrans = (float)((float)(forwardY * CamPos_XYZ[1]) + (float)(forwardX * *CamPos_XYZ)) + (float)(forwardZ * CamPos_XYZ[2]);
upZ = *(float *)(a1 + 0x98);
// Dot product of: up • CamPos (upTrans + v18)
*(float *)&v18 = upZ * CamPos_XYZ[2];
upTrans = (float)(upY * CamPos_XYZ[1]) + (float)(upX * *CamPos_XYZ);

The standard Camera Matrix layout (as laid out in memory) is:

\[C_{world} = \begin{bmatrix} r_x & r_y & r_z & 0 \\ u_x & u_y & u_z & 0 \\ f_x & f_y & f_z & 0 \\ p_x & p_y & p_z & 1.0 \end{bmatrix}\]

Here the right vector would be stored like so:

*(float *)(a1 + 0x30) = rightX;
*(float *)(a1 + 0x34) = rightY;
*(float *)(a1 + 0x38) = rightZ;

The Dunia engine instead writes it out transposed and adds the negated dot products:

// Fast inverse for orthonormal matrices (View Matrix Construction)
*(float *)(a1 + 0x30) = rightX;
*(float *)(a1 + 0x40) = rightY;
*(float *)(a1 + 0x50) = rightZ;
*(float *)(a1 + 0x34) = forwardX;
*(float *)(a1 + 0x44) = forwardY;
*(float *)(a1 + 0x54) = forwardZ;
*(float *)(a1 + 0x58) = upZ;
*(float *)(a1 + 0x38) = upX;
*(float *)(a1 + 0x48) = upY;
*(float *)(a1 + 0x60) = -rightTrans;
*(float *)(a1 + 0x64) = -forwardTrans;
*(float *)(a1 + 0x68) = -(float)(upTrans + *(float *)&v18);

Closing thoughts:

I’m all out of rants and tangents, so here are the main takeaways:

1. Clean C++ Code ≠ Clean Compiled Code

Abstraction is a luxury paid for in performance. Leaning on generalized math wrappers and trusting the compiler to “figure it out” is how you end up with 133 instructions instead of 3.

2. Profilers won’t save you

If the entire foundation is bloated, the baseline execution cost of every function is artificially raised. That is Profiler-Invisible Waste, or Death by a Thousand Cuts.

3. Following the textbook perfectly

Matrix inversions, matrix multiplications, and identity matrices all look great on the whiteboard, but on a real CPU pipeline they are a really roundabout way of getting to the result.

4. The “Main Path” Contagion

This is not even a niche, unimportant function. This is the main rendering function preparing many different matrices bound for the GPU for calculations.
You can guarantee this exact same philosophy infests every other system in the engine.

5. Who was the common antagonist anyway?

You might have already guessed! It’s MatrixMultiply4x4(), but really that’s just the narrative for this write-up.

The true antagonist is the codebase culture itself. It is the “Clean C++” philosophy that prioritizes developer convenience and generic abstractions over CPU execution realities. This exact same over-engineering bleeds into entirely different mathematical primitives across the entire engine.

And it probably doesn’t even stop at math libraries; most likely every other library is abused like this too.

In my next write-up, we are going to look at the exact opposite problem. We are going to explore the Compatibility Tax. The ghost of a 12-year-old CPU that keeps modern games from utilizing instructions that could theoretically yield 5x speedups.

Until then, worship the IDA goddess!
