also curious fact:
fastest scenario for me runs in 19 secs
and that's with half the memory ops
so i'm probably bumping my head against simd instruction latencies, not memory
i already have a suspect in my code
the fukken additional blocks
but i think if i remove them all i could end up using more than 16 vector registers at certain points, and that means spilling shit to the stack
which is gonna tank performance i think
it's 60 cycles. a spill is 4 cycles for the write and 4 cycles for the read back, so 8 out of 60, and that's 13% of the runtime in added latency, which may or may not be optimized away by the cpu
but then the compiler might/should do a better job at linking everything together
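rough sketch of that spill arithmetic in C. the function name and the "nothing overlaps" assumption are mine, just for illustration: it treats the store and the reload as pure added latency, which is the worst case. the real cost is probably lower if the cpu overlaps it with other work.

```c
#include <assert.h>

/* Percent of the total cycle budget taken by one spill (store to the stack
 * plus the reload), assuming the spill overlaps with nothing else.
 * This is a worst-case estimate, not a measurement. */
double spill_overhead_pct(double total_cycles, double store_cycles, double load_cycles) {
    assert(total_cycles > 0.0);
    return 100.0 * (store_cycles + load_cycles) / total_cycles;
}
```

with the numbers above, `spill_overhead_pct(60.0, 4.0, 4.0)` is 8/60, so roughly 13%.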

i'm not gonna code it today, but i think my intern-tier use of blocks makes my stuff way slower than it should be
possibly fences, which prevent out-of-order execution, which in turn hurts latencies
certainly the fact that i pass everything through eax, which happens because i trusted a very bad source: everything is passed by value
although that may be because i force the behaviour with my blocks
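to make the suspicion concrete, here's a minimal sketch in C. big assumption: "blocks" means GCC/Clang-style inline asm blocks on x86-64, one per operation. all the function names are made up, this is not the actual code. the `volatile` plus `"memory"` clobber is a compiler-level barrier (not a cpu fence), so the compiler can't move loads and stores across the block. the intrinsic version does the same math but leaves scheduling and register allocation to the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h> /* SSE: __m128, _mm_add_ps, _mm_loadu_ps, _mm_storeu_ps */

/* Hypothetical "block" style: one inline-asm block per operation.
 * volatile + "memory" clobber stops the compiler from moving memory
 * accesses across the block or merging it with neighbours. */
static inline __m128 add_block(__m128 a, __m128 b) {
    __asm__ volatile("addps %1, %0" : "+x"(a) : "x"(b) : "memory");
    return a;
}

/* Same operation as an intrinsic: the compiler is free to schedule it
 * and keep values in whichever vector registers it likes. */
static inline __m128 add_plain(__m128 a, __m128 b) {
    return _mm_add_ps(a, b);
}

/* dst[i] = x[i] + y[i] using the asm block; n must be a multiple of 4. */
void add_arrays_block(float *dst, const float *x, const float *y, size_t n) {
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4)
        _mm_storeu_ps(dst + i, add_block(_mm_loadu_ps(x + i), _mm_loadu_ps(y + i)));
}

/* dst[i] = x[i] + y[i] using the intrinsic; n must be a multiple of 4. */
void add_arrays_plain(float *dst, const float *x, const float *y, size_t n) {
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4)
        _mm_storeu_ps(dst + i, add_plain(_mm_loadu_ps(x + i), _mm_loadu_ps(y + i)));
}
```

both give identical results, so comparing the generated asm and the timings of the two should show whether the blocks themselves are what's costing cycles.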