Intro:

Before I delve into this, I need to get a few things out of the way ⚠️:

To be completely clear: I did not go looking for this. I was simply reversing engines to study how they construct their fundamental transformation matrices and handle temporal jitter logic. But reading a core rendering function and seeing such avoidable overhead practically hits you in the face. What you are looking at is not just unoptimized code but codebase culture.

Nowww, the functions we are going to dissect only run a few times per frame, so it really doesn’t matter whether we optimize them or not. But it raises an uncomfortable question: if such optimizations are not even considered on these important rendering pathways, then what other functions are overlooked? Does this culture infest every other system in the engine?

This is a symptom of “Profiler-Invisible-Waste”; it is a classic case of missing the forest for the trees. You will never find this issue through a profiler, because the culture is ingrained in every function. This leads to “Death By a Thousand Cuts”.

ESP-Image1

Look at this VTune capture. The top hotspots barely break 5% of the CPU time each. But look at the red box: 80.9% of the execution time is buried in [Others]. Profilers are designed to find massive, isolated bottlenecks.

But if your entire codebase is built on over-engineered abstractions and bloated generic math wrappers, the baseline execution cost of every function is raised. You don’t get a few obvious performance spikes; you get a uniformly elevated floor.

That 80% block is exactly where those “thousand cuts” are hiding.

We will go over 3 separate game engines:

  1. Dunia Engine (Far Cry Series)
  2. Sucker Punch Studios Proprietary Engine (Ghost of Tsushima)
  3. Avalanche Engine (Just Cause Series)

And all 3 have one common “Antagonist”.

Basic Outline:

Redundancy and Over-Engineering:

First we will go over the redundant and over-engineered code that exists in each engine, then explain why it is redundant or over-engineered and how it could have been written instead. I will show the difference in instruction count side by side (original vs. hand-written assembly).

We will also come up with theories on why this happens, specifically around the idea that “Clean C++ Code ≠ Clean Compiled Code”.

Compatibility, Legacy and Readable Code

The Compatibility Tax: Not Wrong, Just Capped. In my next write-up we will look at the realities of AAA game development. Not all of this is the result of developers blindly trusting compilers. I will show sections of the engine where the code could theoretically run much faster (with some functions reaching 5× speedups) thanks to newer processor features, but where those features couldn’t be fully used because the engine must maintain compatibility with a wide range of hardware.

This whole write-up is just food for thought, a question really: “What is the performance tax for such abstractions?”

Case 1: Y-up To Z-up Over-Engineering (Dunia Engine)

ESP-Image1

This is doing a Coordinate Space Conversion in a very “Textbooky” way. They first construct a Matrix:

\[M = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & -4.37 \times 10^{-8} & 1 & 0 \\ 0 & -1 & -4.37 \times 10^{-8} & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}\]

on the stack, probably using some function like Matrix::CreateRotationX(-PI / 2)

So it actually constructed:

\[\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos(-90^\circ) & -\sin(-90^\circ) & 0 \\ 0 & \sin(-90^\circ) & \cos(-90^\circ) & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}\]

It then passes the arguments to MatrixMultiply4x4(&CameraMatrix, (__int64)&matrixPointer);

where the function computes a1 = a2 × a1, i.e. it overwrites its first argument (the camera matrix) with the product.
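The odd −4.37 × 10⁻⁸ entries in the matrix are what you would expect from evaluating the cosine of −π/2 in single precision, since the nearest float to π/2 is slightly larger than the real value. The sketch below decodes the hex constants that show up in the disassembly further down and compares them with a float cosine. It assumes IEEE 754 floats; the helper names are mine and are not taken from the engine.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: reinterpret a 32-bit pattern as an IEEE 754 float.
inline float float_from_bits(std::uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// The three distinct non-zero constants the engine writes into the matrix.
inline float swizzle_one()     { return float_from_bits(0x3F800000u); } //  1.0f
inline float swizzle_neg_one() { return float_from_bits(0xBF800000u); } // -1.0f
inline float swizzle_cos()     { return float_from_bits(0xB33BBD2Eu); } // about -4.37e-8

// What a generic rotation helper would compute for the diagonal slots.
inline float cos_neg_half_pi() {
    const float half_pi = 1.57079632679489661923f; // rounded to the nearest float
    return std::cos(-half_pi);                     // tiny negative residue, not exactly 0
}
```

So the matrix is a pure 0 / 1 / −1 permutation with a rounding residue baked in, which is a hint that it came from a general rotation constructor rather than a hand-written swizzle.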

Let’s put a breakpoint in the multiply call to see exactly what happens after this multiplication:

I am looking exactly “north” in-game when breakpointing

Before:

ESP-Image1

After:

ESP-Image1

So, take the up vector, negate it and put it on row 2, then take the forward vector and put it on row 1.
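To double-check that reading, here is a small scalar sketch comparing the full matrix product with a plain row swap plus negation. It assumes row-major storage with rows counted from 0, and it rounds the −4.37 × 10⁻⁸ residue to exactly 0 (the real matrix adds an error of that order to two rows). All names are hypothetical.

```cpp
#include <array>

using Vec4 = std::array<float, 4>;
using Mat4 = std::array<Vec4, 4>; // row-major, rows counted from 0

// Plain reference product r = a x b.
inline Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 r{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                r[i][j] += a[i][k] * b[k][j];
    return r;
}

// The constant matrix from above, with the cosine residue rounded to 0.
inline Mat4 swizzle_matrix() {
    return {{{1.0f, 0.0f, 0.0f, 0.0f},
             {0.0f, 0.0f, 1.0f, 0.0f},
             {0.0f, -1.0f, 0.0f, 0.0f},
             {0.0f, 0.0f, 0.0f, 1.0f}}};
}

// Same effect without multiplying: row 1 <- old row 2, row 2 <- -(old row 1).
inline Mat4 swap_and_negate(const Mat4& cam) {
    Mat4 r = cam;
    r[1] = cam[2];
    for (int j = 0; j < 4; ++j) r[2][j] = -cam[1][j];
    return r;
}
```

For any camera matrix, multiply(swizzle_matrix(), cam) and swap_and_negate(cam) give the same rows, which is all the multiplication is being used for here.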

Let me put into perspective how many assembly instructions were executed just to do this:

Pseudo code of MatrixMultiply4x4: ESP-Image1

First load Camera Matrix:
movaps  xmm0, xmmword ptr [rdi]
movaps  xmm1, xmmword ptr [rbx+0C0h]
movaps  [rsp+1A0h+CameraMatrix], xmm0
movaps  xmm0, xmmword ptr [rbx+0D0h]
movaps  [rsp+1A0h+CamMatRow1], xmm1
movaps  xmm1, xmmword ptr [rbx+0E0h]
movaps  [rsp+1A0h+CamMatRow2], xmm0
movaps  [rsp+1A0h+CamMatRow3], xmm1
Construct the Swizzle Matrix:
mov     qword ptr [rsp+1A0h+matrixPointer], 3F800000h
mov     dword ptr [rbp+0A0h+var_120], 0
mov     dword ptr [rbp+0A0h+var_110], 0
mov     qword ptr [rbp+0A0h+var_100], 0
mov     dword ptr [rbp+0A0h+var_120+4], 0B33BBD2Eh
mov     dword ptr [rbp+0A0h+var_110+4], 0BF800000h
mov     qword ptr [rsp+1A0h+matrixPointer+8], 0
mov     qword ptr [rbp+0A0h+var_120+8], 3F800000h
mov     dword ptr [rbp+0A0h+var_110+8], 0B33BBD2Eh
mov     dword ptr [rbp+0A0h+var_100+8], 0
mov     dword ptr [rbp+0A0h+var_110+0Ch], 0
mov     dword ptr [rbp+0A0h+var_100+0Ch], 3F800000h
Then load arguments in registers:
lea     rdx, [rsp+1A0h+matrixPointer]
lea     rcx, [rsp+1A0h+CameraMatrix]
Call the function:
call MatrixMultiply4x4
Inside function entry (stack allocation, set up security cookies, load registers etc):
sub     rsp, 78h
movaps  [rsp+78h+var_18], xmm6
movaps  [rsp+78h+var_28], xmm7
mov     rax, cs:__security_cookie
xor     rax, rsp
mov     [rsp+78h+var_38], rax
movaps  xmm4, xmmword ptr [rcx]
lea     r8, [rsp+78h+var_78]
movaps  xmm5, xmmword ptr [rcx+10h]
lea     rax, [rsp+78h+var_78]
movaps  xmm6, xmmword ptr [rcx+20h]
sub     rdx, r8
movaps  xmm7, xmmword ptr [rcx+30h]
mov     r8d, 4
nop     dword ptr [rax]
Multiply all rows to columns (repeated 4 times):
; --- iteration 1 ---
movaps  xmm2, xmmword ptr [rdx+rax]
movaps  xmm3, xmm2
movaps  xmm0, xmm2
shufps  xmm3, xmm2, 55h
movaps  xmm1, xmm2
shufps  xmm0, xmm2, 0
mulps   xmm3, xmm5
shufps  xmm1, xmm2, 0AAh
mulps   xmm0, xmm4
mulps   xmm1, xmm6
shufps  xmm2, xmm2, 0FFh
addps   xmm3, xmm0
mulps   xmm2, xmm7
addps   xmm3, xmm1
addps   xmm3, xmm2
movaps  xmmword ptr [rax], xmm3
add     rax, 10h
sub     r8, 1
jnz     short loc_712D0F0
; --- iteration 2 ---
movaps  xmm2, xmmword ptr [rdx+rax]
movaps  xmm3, xmm2
movaps  xmm0, xmm2
shufps  xmm3, xmm2, 55h
movaps  xmm1, xmm2
shufps  xmm0, xmm2, 0
mulps   xmm3, xmm5
shufps  xmm1, xmm2, 0AAh
mulps   xmm0, xmm4
mulps   xmm1, xmm6
shufps  xmm2, xmm2, 0FFh
addps   xmm3, xmm0
mulps   xmm2, xmm7
addps   xmm3, xmm1
addps   xmm3, xmm2
movaps  xmmword ptr [rax], xmm3
add     rax, 10h
sub     r8, 1
jnz     short loc_712D0F0
; --- iteration 3 ---
movaps  xmm2, xmmword ptr [rdx+rax]
movaps  xmm3, xmm2
movaps  xmm0, xmm2
shufps  xmm3, xmm2, 55h
movaps  xmm1, xmm2
shufps  xmm0, xmm2, 0
mulps   xmm3, xmm5
shufps  xmm1, xmm2, 0AAh
mulps   xmm0, xmm4
mulps   xmm1, xmm6
shufps  xmm2, xmm2, 0FFh
addps   xmm3, xmm0
mulps   xmm2, xmm7
addps   xmm3, xmm1
addps   xmm3, xmm2
movaps  xmmword ptr [rax], xmm3
add     rax, 10h
sub     r8, 1
jnz     short loc_712D0F0
; --- iteration 4 ---
movaps  xmm2, xmmword ptr [rdx+rax]
movaps  xmm3, xmm2
movaps  xmm0, xmm2
shufps  xmm3, xmm2, 55h
movaps  xmm1, xmm2
shufps  xmm0, xmm2, 0
mulps   xmm3, xmm5
shufps  xmm1, xmm2, 0AAh
mulps   xmm0, xmm4
mulps   xmm1, xmm6
shufps  xmm2, xmm2, 0FFh
addps   xmm3, xmm0
mulps   xmm2, xmm7
addps   xmm3, xmm1
addps   xmm3, xmm2
movaps  xmmword ptr [rax], xmm3
add     rax, 10h
sub     r8, 1
jnz     short loc_712D0F0
Function end (dealloc stack, verify the security cookie, load result into memory and registers):
movaps  xmm0, [rsp+78h+var_78]
mov     rax, rcx
movaps  xmm1, [rsp+78h+var_68]
movaps  xmmword ptr [rcx], xmm0
movaps  xmm0, [rsp+78h+var_58]
movaps  xmmword ptr [rcx+10h], xmm1
movaps  xmm1, [rsp+78h+var_48]
movaps  xmmword ptr [rcx+20h], xmm0
movaps  xmmword ptr [rcx+30h], xmm1
mov     rcx, [rsp+78h+var_38]
xor     rcx, rsp        ; StackCookie
call    j___security_check_cookie
movaps  xmm6, [rsp+78h+var_18]
movaps  xmm7, [rsp+78h+var_28]
add     rsp, 78h
retn
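
Read back into C++, the function body above looks roughly like the following SSE intrinsics routine. This is my reconstruction from the listing, not the engine's source: the Mat4 type and the function signature are assumptions, and it only targets x86/x86-64.

```cpp
#include <xmmintrin.h> // SSE intrinsics (x86/x86-64 only)

// Hypothetical type: a row-major 4x4 float matrix, 16-byte aligned for movaps.
struct alignas(16) Mat4 {
    float m[4][4];
};

// Sketch of the observed behavior: a1 = a2 x a1, built in a stack temporary.
inline void matrix_multiply_4x4(Mat4* a1, const Mat4* a2) {
    // xmm4..xmm7 in the listing: the four rows of a1, loaded once up front.
    const __m128 r0 = _mm_load_ps(a1->m[0]);
    const __m128 r1 = _mm_load_ps(a1->m[1]);
    const __m128 r2 = _mm_load_ps(a1->m[2]);
    const __m128 r3 = _mm_load_ps(a1->m[3]);

    Mat4 tmp;
    for (int i = 0; i < 4; ++i) {
        const __m128 row = _mm_load_ps(a2->m[i]);
        // shufps with 0x00 / 0x55 / 0xAA / 0xFF broadcasts element 0 / 1 / 2 / 3.
        const __m128 x = _mm_shuffle_ps(row, row, 0x00);
        const __m128 y = _mm_shuffle_ps(row, row, 0x55);
        const __m128 z = _mm_shuffle_ps(row, row, 0xAA);
        const __m128 w = _mm_shuffle_ps(row, row, 0xFF);
        // Same accumulation order as the addps chain in the listing.
        __m128 acc = _mm_add_ps(_mm_mul_ps(y, r1), _mm_mul_ps(x, r0));
        acc = _mm_add_ps(acc, _mm_mul_ps(z, r2));
        acc = _mm_add_ps(acc, _mm_mul_ps(w, r3));
        _mm_store_ps(tmp.m[i], acc);
    }
    *a1 = tmp; // copy the temporary back over the first argument
}
```

As general-purpose 4x4 multiplies go, this is a perfectly reasonable routine. The waste is in calling it with a constant permutation matrix as a2.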

All this for a very simple coordinate space conversion?

We could simply shuffle it ourselves, using movaps to swizzle the vectors and then xorps with a sign-bit mask (0x80000000 in each lane) for the negation! Let’s try:

ESP-Image1

It seems to be storing the result of the multiply into the camera structure, so we could simply change how the rows are stored there. The multiplied result is not used again by the current function and is simply discarded, since it lives on the stack, so we only care about what ends up in the camera structure.

So now take the up vector, negate it and put it on row 2, then take the forward vector and put it on row 1.

Counting rows from 0!

For this to work we simply change v16 = CamMatRow1; to v16 = CamMatRow2;

and

v17 = CamMatRow2; to v17 = CamMatRow1;

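This simple change would not add any assembly instructions, since we are only modifying existing ones.

Then simply xorps v17, 0x80000000 to flip the sign bits. (This is shorthand: xorps takes no immediate operand, so in real code the mask is a 128-bit constant with 0x80000000 in every lane.)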

And we are done!

So the total difference in instruction count is 133 to 1, where we only added one new instruction:

xorps v17, 0x80000000

while completely nuking all the other instructions I have shown above.

To be clear: the instructions moving the vectors into the camera structure aren’t redundant. They are a sunk cost that has to execute regardless, so I did not include them in the assembly showcase. The only net-new instruction required to achieve the coordinate conversion is a single xorps, which is why I am calling it a 133 to 1 instruction count decrease.

Instruction Count Visualized:

Before 133 Instructions
; 1. Load Camera Matrix
movaps  xmm0, xmmword ptr [rdi]
movaps  xmm1, xmmword ptr [rbx+0C0h]
movaps  [rsp+1A0h+CameraMatrix], xmm0
movaps  xmm0, xmmword ptr [rbx+0D0h]
movaps  [rsp+1A0h+CamMatRow1], xmm1
movaps  xmm1, xmmword ptr [rbx+0E0h]
movaps  [rsp+1A0h+CamMatRow2], xmm0
movaps  [rsp+1A0h+CamMatRow3], xmm1

; 2. Construct Identity/Swizzle Matrix
mov     qword ptr [rsp+1A0h+matrixPointer], 3F800000h
mov     dword ptr [rbp+0A0h+var_120], 0
mov     dword ptr [rbp+0A0h+var_110], 0
mov     qword ptr [rbp+0A0h+var_100], 0
mov     dword ptr [rbp+0A0h+var_120+4], 0B33BBD2Eh
mov     dword ptr [rbp+0A0h+var_110+4], 0BF800000h
mov     qword ptr [rsp+1A0h+matrixPointer+8], 0
mov     qword ptr [rbp+0A0h+var_120+8], 3F800000h
mov     dword ptr [rbp+0A0h+var_110+8], 0B33BBD2Eh
mov     dword ptr [rbp+0A0h+var_100+8], 0
mov     dword ptr [rbp+0A0h+var_110+0Ch], 0
mov     dword ptr [rbp+0A0h+var_100+0Ch], 3F800000h

; 3. Setup Arguments & Call MatrixMultiply4x4
lea     rdx, [rsp+1A0h+matrixPointer]
lea     rcx, [rsp+1A0h+CameraMatrix]
call    MatrixMultiply4x4

; 4. Inside Function: ABI Overhead
sub     rsp, 78h
movaps  [rsp+78h+var_18], xmm6
movaps  [rsp+78h+var_28], xmm7
mov     rax, cs:__security_cookie
xor     rax, rsp
mov     [rsp+78h+var_38], rax
movaps  xmm4, xmmword ptr [rcx]
lea     r8, [rsp+78h+var_78]
movaps  xmm5, xmmword ptr [rcx+10h]
lea     rax, [rsp+78h+var_78]
movaps  xmm6, xmmword ptr [rcx+20h]
sub     rdx, r8
movaps  xmm7, xmmword ptr [rcx+30h]
mov     r8d, 4

; 5. SIMD Loop Body (runs 4 times; every iteration shown)
; --- iteration 1 ---
movaps  xmm2, xmmword ptr [rdx+rax]
movaps  xmm3, xmm2
movaps  xmm0, xmm2
shufps  xmm3, xmm2, 55h
movaps  xmm1, xmm2
shufps  xmm0, xmm2, 0
mulps   xmm3, xmm5
shufps  xmm1, xmm2, 0AAh
mulps   xmm0, xmm4
mulps   xmm1, xmm6
shufps  xmm2, xmm2, 0FFh
addps   xmm3, xmm0
mulps   xmm2, xmm7
addps   xmm3, xmm1
addps   xmm3, xmm2
movaps  xmmword ptr [rax], xmm3
add     rax, 10h
sub     r8, 1
jnz     short loc_712D0F0
; --- iteration 2 ---
movaps  xmm2, xmmword ptr [rdx+rax]
movaps  xmm3, xmm2
movaps  xmm0, xmm2
shufps  xmm3, xmm2, 55h
movaps  xmm1, xmm2
shufps  xmm0, xmm2, 0
mulps   xmm3, xmm5
shufps  xmm1, xmm2, 0AAh
mulps   xmm0, xmm4
mulps   xmm1, xmm6
shufps  xmm2, xmm2, 0FFh
addps   xmm3, xmm0
mulps   xmm2, xmm7
addps   xmm3, xmm1
addps   xmm3, xmm2
movaps  xmmword ptr [rax], xmm3
add     rax, 10h
sub     r8, 1
jnz     short loc_712D0F0
; --- iteration 3 ---
movaps  xmm2, xmmword ptr [rdx+rax]
movaps  xmm3, xmm2
movaps  xmm0, xmm2
shufps  xmm3, xmm2, 55h
movaps  xmm1, xmm2
shufps  xmm0, xmm2, 0
mulps   xmm3, xmm5
shufps  xmm1, xmm2, 0AAh
mulps   xmm0, xmm4
mulps   xmm1, xmm6
shufps  xmm2, xmm2, 0FFh
addps   xmm3, xmm0
mulps   xmm2, xmm7
addps   xmm3, xmm1
addps   xmm3, xmm2
movaps  xmmword ptr [rax], xmm3
add     rax, 10h
sub     r8, 1
jnz     short loc_712D0F0
; --- iteration 4 ---
movaps  xmm2, xmmword ptr [rdx+rax]
movaps  xmm3, xmm2
movaps  xmm0, xmm2
shufps  xmm3, xmm2, 55h
movaps  xmm1, xmm2
shufps  xmm0, xmm2, 0
mulps   xmm3, xmm5
shufps  xmm1, xmm2, 0AAh
mulps   xmm0, xmm4
mulps   xmm1, xmm6
shufps  xmm2, xmm2, 0FFh
addps   xmm3, xmm0
mulps   xmm2, xmm7
addps   xmm3, xmm1
addps   xmm3, xmm2
movaps  xmmword ptr [rax], xmm3
add     rax, 10h
sub     r8, 1
jnz     short loc_712D0F0

; 6. Deallocate & Return
movaps  xmm0, [rsp+78h+var_78]
mov     rax, rcx
movaps  xmm1, [rsp+78h+var_68]
movaps  xmmword ptr [rcx], xmm0
movaps  xmm0, [rsp+78h+var_58]
movaps  xmmword ptr [rcx+10h], xmm1
movaps  xmm1, [rsp+78h+var_48]
movaps  xmmword ptr [rcx+20h], xmm0
movaps  xmmword ptr [rcx+30h], xmm1
mov     rcx, [rsp+78h+var_38]
xor     rcx, rsp
call    j___security_check_cookie
movaps  xmm6, [rsp+78h+var_18]
movaps  xmm7, [rsp+78h+var_28]
add     rsp, 78h
retn
After 3 Instructions
; Just swap the pointers and flip the sign bit
movaps  v16, CamMatRow2
movaps  v17, CamMatRow1
xorps   v17, 0x80000000

You might say it’s for readability, or that it makes it easier to modify the coordinate conversion later. But for a programmer who grasps the underlying math, the intent behind swapping rows and flipping a sign is perfectly clear, especially since we can express the exact same logic cleanly using SSE intrinsics in C++. Memorizing textbook formulas is fine, but if you don’t understand the actual spatial intent behind them, you’re just pattern-matching.
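As a rough idea of what that could look like, here is a minimal intrinsics sketch of the swap-and-negate. The CameraRows layout and the function name are made up for illustration; the only things taken from the analysis above are the row order (rows counted from 0, up on row 1, forward on row 2) and the sign flip.

```cpp
#include <xmmintrin.h> // SSE intrinsics (x86/x86-64 only)

// Hypothetical camera layout: four 16-byte-aligned rows, counted from 0.
struct alignas(16) CameraRows {
    float row[4][4];
};

// Y-up to Z-up without a matrix multiply: forward goes to row 1,
// negated up goes to row 2, rows 0 and 3 are copied unchanged.
inline void y_up_to_z_up(const CameraRows& src, CameraRows& dst) {
    const __m128 sign_mask = _mm_set1_ps(-0.0f); // 0x80000000 in every lane
    const __m128 right   = _mm_load_ps(src.row[0]);
    const __m128 up      = _mm_load_ps(src.row[1]);
    const __m128 forward = _mm_load_ps(src.row[2]);
    const __m128 origin  = _mm_load_ps(src.row[3]);

    _mm_store_ps(dst.row[0], right);
    _mm_store_ps(dst.row[1], forward);
    _mm_store_ps(dst.row[2], _mm_xor_ps(up, sign_mask)); // the single xorps
    _mm_store_ps(dst.row[3], origin);
}
```

The comments carry the intent (which axis goes where, and why one is negated), so nothing about readability is lost compared to a CreateRotationX call followed by a multiply.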

Case 2: Atanf just to get back FovX:

I’ve already talked about this in Reversing The Prespective Projection Matrix (Part 5.1), but I want to shine a light on it in more detail.

ESP-Image1

The decompiler does not show the arguments to tanf and atanf, but I have read the assembly (the argument is loaded into xmm0 just before each call) and written the arguments on the right.

Let’s start with this block

fovX_calc = (__m128)*(unsigned int *)(a1 + 0x234);
fovX_calc.m128_f32[0] = fovX_calc.m128_f32[0] * 0.5;
ucrtBase_Tanf(); // Arg: fovX_calc

Using CE for dynamic analysis we can see that (a1 + 0x234) is FovX in radians

So it loads FovX in radians into “fovX_calc”, then immediately multiplies it by 0.5, so “fovX_calc” now holds fovX/2.

Next, Tanf() is called with fovX_calc as its argument, so fovX_calc now holds tan(fovX/2).

Next the engine does a lot of calculations using tan(fovX/2), then the engine decides it actually needs the value of fovX back so it does it in an ingenious way!

ucrtBase_aTanf(); // Arg: fovX_calc
fovX_val.m128_f32[0] = fovX_calc.m128_f32[0] * 2.0;

The fovX_calc variable is never used again in the function

We all know the identity arctan(tan(x)) = x holds if and only if x lies strictly inside (−90°, 90°). FovX can only take values from 60 to 120 degrees in-game, so after halving it the angle is between 30 and 60 degrees, well within the (−π/2, π/2) principal range of arctan, and the round trip simply returns the value we started with.

The codebase culture accepts:
\(x = 2 \cdot \arctan\!\left(\tan\!\left(\frac{x}{2}\right)\right)\)
as a valid way to move data from point A to point B.

So just to get back the raw fovX, they call atanf and then multiply the result by 2. This could have been avoided by simply loading fovX from the camera structure again, which costs nowhere near as many cycles as a full IEEE 754 atanf calculation since the value is almost certainly still in the L1 cache, or alternatively by keeping a copy in another unused xmm register.

Keep in mind this is not using the formula to find FovY which is:
\(FOV_Y = 2 \cdot \tan^{-1}\!\left(\frac{\tan\!\left(\frac{FOV_X}{2}\right)}{A}\right)\)
There is no Aspect Ratio used here.
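
To make the difference concrete, here is a small sketch of the three cases: the round trip as decompiled, a version that just keeps the original angle around, and the FovY formula where atanf actually earns its keep. The function and struct names are hypothetical; only the arithmetic comes from the decompiled block and the formula above.

```cpp
#include <cmath>

// What the decompiled block effectively computes: halve, tanf, atanf, double.
inline float fov_x_round_trip(float fov_x) {
    const float tan_half = std::tan(fov_x * 0.5f);
    return std::atan(tan_half) * 2.0f; // mathematically just fov_x again
}

// Hypothetical alternative: keep the original angle next to its tangent.
struct FovTerms {
    float fov_x;    // untouched copy, no atanf needed later
    float tan_half; // tan(fovX / 2) for the projection math
};

inline FovTerms make_fov_terms(float fov_x) {
    return FovTerms{fov_x, std::tan(fov_x * 0.5f)};
}

// For contrast: the FovY formula, where the division by the aspect ratio
// means the atanf cannot be cancelled against the tanf.
inline float fov_y_from_fov_x(float fov_x, float aspect) {
    return 2.0f * std::atan(std::tan(fov_x * 0.5f) / aspect);
}
```

With an aspect ratio of 1 the last function collapses into the round trip, which is another way of seeing that the engine's version is the FovY formula with the only useful part removed.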

We can’t tell exactly how many cycles atanf() used here. It can range from roughly 30 to 150 depending on your CPU architecture.

One meaningless atanf() on a function that only runs a few times per frame? That’s negligible.

Multiple redundant atanf() calls scattered across the codebase? That’s adding up.

Multiple redundant atanf, tanf, sinf, cosf, matrix multiply, matrix inversions, dot products, cross products etc etc..? That’s huge.

The Point:

The point here isn’t, “OmG tHeY uSeD a FuLl iEeE 754 bIt PeRfEct aTanf() CaLL jUsT tO gEt BaCk fovX!”

Let’s be real: doing that in a function that only runs a few times per frame doesn’t actually cost much, whether it’s an atanf call or a MatrixMultiply4x4.

The real point is: “Does it stop here?”, “Does this practice not carry over to all other systems?”

It does. This same over-engineering and “clean C++” bleeds into entirely different generic functions across the codebase. We’re talking atanf, tanf, sinf, cosf, 1/sqrtf, normalization, matrix multiply, matrix inversions, dot products, cross products, and honestly a whole, whole lot more.

And it’s probably not just math libraries being used like this.

This is the exact definition of “Death by a Thousand Cuts”. The Codebase ensures you’re bleeding from everywhere.

Now that you know the point of this blog, I will continue with Ghost of Tsushima and the Avalanche Engine.

Part 2: Ghost Of Tsushima »