Are CUDA malloc and CUDA stream worth using? I can't even tell if they're helping. Left is default, right is with both enabled. 3080 12GB, simple test: 4 gens at 832x1216.