
llama.cpp CUDA dev !!yhbFjk57TDr/g/105750356#105753011
6/30/2025, 1:30:51 PM
>>105750356
Follow-up to >>105736983:

|Model                                       |File size [GiB]|Correct answers|Accuracy|
|--------------------------------------------|---------------|---------------|--------|
|mistral_small_3.1_instruct_2503-24b-f16.gguf|          43.92|      1051/4962|  21.18%|
|phi_4-15b-f16.gguf                          |          27.31|      1105/4664|  23.69%|
|gemma_3_it-27b-q8_0.gguf                    |          26.74|      1149/4856|  23.66%|
|mistral_nemo_instruct_2407-12b-f16.gguf     |          22.82|      1053/4860|  21.67%|
|gemma_3_it-12b-f16.gguf                     |          21.92|      1147/4926|  23.28%|
|glm_4_chat-9b-f16.gguf                      |          17.52|      1083/4990|  21.70%|
|gemma_2_it-9b-f16.gguf                      |          17.22|      1151/5000|  23.02%|
|llama_3.1_instruct-8b-f16.gguf              |          14.97|      1015/5000|  20.30%|
|ministral_instruct_2410-8b-f16.gguf         |          14.95|      1044/4958|  21.06%|
|qwen_2.5_instruct_1m-7b-f16.gguf            |          14.19|      1052/5000|  21.04%|
|gemma_3_it-4b-f16.gguf                      |           7.23|      1064/5000|  21.28%|
|phi_4_mini_instruct-4b-f16.gguf             |           7.15|      1082/4982|  21.72%|
|llama_3.2_instruct-3b-f16.gguf              |           5.99|       900/5000|  18.00%|
|stablelm_2_chat-2b-f16.gguf                 |           3.07|       996/4976|  20.02%|
|llama_3.2_instruct-1b-f16.gguf              |           2.31|      1000/4998|  20.01%|
|gemma_3_it-1b-f16.gguf                      |           1.87|       955/4938|  19.34%|


It seems my initial impression was too pessimistic: with a sufficiently large sample size, even the weaker models seem to be able to do better than RNG.
With a sample size of 5000, RNG would result in 20.0 +- 0.5%, so even 4b models can be statistically significantly better.
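For reference, a quick sketch of where that +-0.5% figure comes from, assuming the baseline is a 1-in-5 chance of guessing correctly (p = 0.2) and treating the uncertainty as roughly one binomial standard error. The z-score check for the 4b result is my own addition, not from the original post:

```python
import math

def binomial_se(p: float, n: int) -> float:
    # Standard error of a binomial proportion: sqrt(p * (1 - p) / n)
    return math.sqrt(p * (1.0 - p) / n)

# Assumed chance baseline: p = 0.2, sample size n = 5000
se = binomial_se(0.2, 5000)
print(f"RNG accuracy: 20.00% +- {100 * se:.2f}% (1 sigma)")

# gemma_3_it-4b scored 1064/5000 = 21.28%; how far above chance is that?
z = (1064 / 5000 - 0.2) / se
print(f"{z:.1f} sigma above chance")
```

With these numbers the standard error comes out to about 0.57 percentage points, and the 4b score sits a bit over two standard errors above the 20% baseline, consistent with the claim that it is statistically significantly better than guessing.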