6/25/2025, 2:53:21 AM
>>105695182
I tried, didn't work. I tried balancing the blocks fairly across the devices, like
"\.blk\.[0-4]\.(ffn_down_exps|ffn_gate_exps|ffn_up_exps|attn_output)\..*=CUDA0"
"\.blk\.[5-9]\.(ffn_down_exps|ffn_gate_exps|ffn_up_exps|attn_output)\..*=CUDA1"
"\.blk\.1[0-4]\.(ffn_down_exps|ffn_gate_exps|ffn_up_exps|attn_output)\..*=CUDA2"
"\.blk\.1[5-9]\.(ffn_down_exps|ffn_gate_exps|ffn_up_exps|attn_output)\..*=CUDA3"
"\.blk\.2[0-4]\.(ffn_down_exps|ffn_gate_exps|ffn_up_exps|attn_output)\..*=RPC[10.0.0.28:50052]"
"\.blk\.2[5-9]\.(ffn_down_exps|ffn_gate_exps|ffn_up_exps|attn_output)\..*=RPC[10.0.0.28:50053]"
"\.blk\.3[0-4]\.(ffn_down_exps|ffn_gate_exps|ffn_up_exps|attn_output)\..*=RPC[10.0.0.28:50054]"
"\.blk\.3[5-9]\.(ffn_down_exps|ffn_gate_exps|ffn_up_exps|attn_output)\..*=RPC[10.0.0.28:50055]"
"\.blk\.4[0-4]\.(ffn_down_exps|ffn_gate_exps|ffn_up_exps|attn_output)\..*=RPC[10.0.0.40:50052]"
"\.blk\.4[5-9]\.(ffn_down_exps|ffn_gate_exps|ffn_up_exps|attn_output)\..*=RPC[10.0.0.40:50053]"
"\.blk\.5[0-4]\.(ffn_down_exps|ffn_gate_exps|ffn_up_exps|attn_output)\..*=RPC[10.0.0.40:50054]"
"(^output\.|^token_embd\.|\.blk\.(5[5-9]|60)\.(ffn_down_exps|ffn_gate_exps|ffn_up_exps|attn_output)\.).*=RPC[10.0.0.40:50055]"
"(\.blk\..*\.(ffn_.*shexp|attn_k_b|attn_kv_a|attn_q_|attn_v_b|.*norm)\.|.*norm\.).*=CPU"
this. Then I went even simpler, with exactly 3 blocks per device and the rest on CPU:
-ot ".*=CPU"
which then didn't use CUDA at all????
I mean, looking at this, I could fit at least 30 GB more in the VRAM.
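One thing that might be worth checking is whether the patterns above match any tensor name at all. The sketch below is not the loader's real code. It assumes tensor names look like "blk.3.ffn_down_exps.weight" (GGUF style, nothing in front of "blk"), and that overrides are applied as a regex search where the first matching rule wins. Python's re module stands in for whatever regex engine the loader uses, and first_match is a made-up helper for illustration only.

```python
import re

def first_match(name, rules):
    """Return the device of the first rule whose regex is found in name, else None."""
    for pattern, device in rules:
        if re.search(pattern, name):
            return device
    return None

EXPERTS = r"(ffn_down_exps|ffn_gate_exps|ffn_up_exps|attn_output)"

# Rule as written in the post: requires a literal "." right before "blk".
as_posted = [(r"\.blk\.[0-4]\." + EXPERTS + r"\..*", "CUDA0")]
# Variant without the leading "\." (an assumption about what was intended).
no_leading_dot = [(r"blk\.[0-4]\." + EXPERTS + r"\..*", "CUDA0")]
# The catch-all from the simpler attempt.
catch_all = [(r".*", "CPU")]

name = "blk.3.ffn_down_exps.weight"  # assumed GGUF-style tensor name

posted_result = first_match(name, as_posted)
fixed_result = first_match(name, no_leading_dot)
catch_all_result = first_match(name, catch_all)
print(posted_result, fixed_result, catch_all_result)
```

If those assumptions hold, the rule as posted never fires because there is no "." before "blk" in the name, while ".*" matches every tensor, so whatever is not claimed by an earlier rule lands on CPU.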