Makes me miss debugging systems software...
In my day we didn't have floating point co-processors, so all 3D games used fixed-point values. sqrt() was - maybe still is - unforgivably slow: we used a (rather short, actually) lookup table instead. Same with sine, cosine, etc. Hundreds of times faster than doing the math, and because of pixel density the effect was indistinguishable. Just sayin'... maybe at today's resolutions and with numerical co-processors that's all pointless bit-twiddling...
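For anyone who's never seen the trick: here's a minimal sketch of 16.16 fixed point plus a sine lookup table, the way old 3D code did it. All the names (fx16, fx_sin, the 256-entry table) are illustrative, not from any particular engine:

```cpp
#include <cmath>
#include <cstdint>

typedef int32_t fx16;                  // 16.16 fixed point
const fx16 FX_ONE = 1 << 16;           // 1.0 in fixed point
const double PI = 3.14159265358979323846;

static fx16 sin_table[256];            // one full period in 256 steps

// Build the table once (back then it might even have been baked into the binary).
void init_sin_table() {
    for (int i = 0; i < 256; ++i)
        sin_table[i] = (fx16)(std::sin(i * 2.0 * PI / 256.0) * FX_ONE);
}

// Angle measured in 1/256ths of a full turn: a single table read
// replaces the entire trig computation.
fx16 fx_sin(uint8_t angle) { return sin_table[angle]; }
```

A byte-sized angle also means wraparound comes for free - overflow past 255 is exactly one full turn.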
Good lord it's been a while since I've even heard fixed-point mentioned. Things have indeed changed quite a lot since then.
Out-of-order architectures (like what runs in most PCs) are hugely complex and amazing at reordering. The bulk of floating point operations can be done in a few cycles, and SIMD units are ubiquitous, leaving you able to do floating point operations 4 at a time with 0 penalty. PC instruction pipelines let you execute something like 3 x86 instructions or 2 FP (x87 or SSE) instructions per cycle.
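The "4 at a time" part looks like this with SSE intrinsics - a hedged sketch, assuming an x86 target and n a multiple of 4 (unaligned loads so the pointers don't need 16-byte alignment):

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Add two float arrays four lanes per instruction.
void add4(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);        // load 4 floats
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));  // 4 adds in one op
    }
}
```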
It's not hard to roll your own math functions when needed. The stdlib functions aren't bad, but they're easy to replace for the extra edge if you absolutely need it. Lookup tables are probably not worth the effort; approximations are probably going to be (way way) faster. PC hardware comes with a nice fast inverse square root instruction (takes about as much time as an add, 12(?) bits of precision) that you can combine with a Newton-Raphson iteration to produce a usable inverse square root.
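Concretely, that's rsqrtss plus one refinement step. Sketch below; the Newton-Raphson update for 1/sqrt(x) is y' = y * (1.5 - 0.5 * x * y * y), which roughly doubles the bits of precision of the hardware estimate:

```cpp
#include <xmmintrin.h>

// Hardware estimate (~12 bits) refined by one Newton-Raphson iteration.
float fast_rsqrt(float x) {
    float y = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));  // rsqrtss
    return y * (1.5f - 0.5f * x * y * y);                  // one NR step
}
```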
Memory access penalties are huge: big sprawling object-oriented code with hundreds of virtual calls (alternatively named the late 90s in C++) blows the instruction and data caches, leaving the processor to spin for millions of cycles while it waits for data from main memory. If you're writing asm (and you should probably only be doing that in a compute-heavy function), you seriously want to minimize dependency stalls by executing 2 threads of computation at once. All this combined with multicore computers leaves computation cheap and memory access slow.
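The "2 threads of computation" idea doesn't need asm to demonstrate - it's just splitting one serial dependency chain into two independent ones so the out-of-order core can overlap them. A toy sketch (assumes n is even):

```cpp
// One chain: each add waits on the previous add's result.
// Two chains: s0 and s1 are independent, so their adds can issue in parallel.
float sum_interleaved(const float* v, int n) {
    float s0 = 0.0f, s1 = 0.0f;
    for (int i = 0; i < n; i += 2) {
        s0 += v[i];      // chain 0
        s1 += v[i + 1];  // chain 1 - no dependency on chain 0
    }
    return s0 + s1;
}
```

Compilers will sometimes do this for you, but floating-point reassociation changes rounding, so without relaxed FP flags they often can't.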
For efficiency it's best to concentrate on batching: operate on data with linear (or near-linear) layouts in memory, and execute those batches concurrently. Task-based concurrency is a huge help here. Break computation up into little bits that can be executed on different pools of data with minimal locking. You can think of your code as a directed acyclic graph of code blocks, which clearly shows you where things can run in parallel.
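A minimal sketch of the DAG-of-tasks shape (everything here is illustrative; a real scheduler would hand ready tasks to a thread pool, whereas this one just runs them serially in dependency order to show the structure):

```cpp
#include <functional>
#include <vector>

struct Task {
    std::function<void()> run;
    std::vector<int> deps;   // indices of prerequisite tasks
};

// Run every task whose prerequisites have completed, until all are done.
// Assumes the dependency graph is acyclic.
void execute(std::vector<Task>& tasks) {
    std::vector<bool> done(tasks.size(), false);
    bool progress = true;
    while (progress) {
        progress = false;
        for (size_t i = 0; i < tasks.size(); ++i) {
            if (done[i]) continue;
            bool ready = true;
            for (int d : tasks[i].deps)
                if (!done[d]) ready = false;
            if (ready) { tasks[i].run(); done[i] = true; progress = true; }
        }
    }
}
```

In a real engine the inner "is it ready" scan becomes a ready-queue fed by atomic dependency counters, but the graph is the same.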
Preallocate object pools and then allocate objects out of those at runtime; use handles as multi-frame references and pointers as single-frame references. Perhaps defragment your object pools (this is why you need handles!) at the end of the frame for a set amount of time, then continue defragmenting next frame. Hand off physics calculations to your physics engine, which will implement some insane optimizations and batching internally (seriously, those guys are magicians).
Of course our tech does none of this.