I'm lost. I don't know if it's llama.cpp or something else, but I really need some help.
I'm trying to use GLM 4.5 Air with a llama.cpp backend and a SillyTavern (ST) frontend, but I've run into multiple issues:
Settings are 32k context (on both the backend and the frontend) and 250 tokens per message. llama.cpp is running in server mode (the OpenAI-compatible API, I guess).
1 - The responses are cut off while the model is still thinking; I think they get cut off at the 250-token limit.
2 - Is there a way to set a separate token limit for thinking (through ST or llama.cpp)? When I built an agentic assistant for an enterprise, I remember Bedrock exposing a thinking-token budget in addition to the max tokens per message for Sonnet 3.7+, but I couldn't find a similar setting for llama.cpp or ST (rough request sketch after this list).
3 - The <think> tag is not appearing in the output in ST. I've set ST to parse the <think></think> tags using the DeepSeek config, but they're not showing up in the responses (at least as far as I can see in ST). Could this be because llama.cpp is running in server mode? (Quick check for this at the bottom of the post.)
4 - I actually forgot the fourth issue, but while I'm at it: what are the suggested sampler params (top_p / temperature / repetition penalty)?
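For context, here's roughly what I think the request from ST to llama-server's OpenAI-compatible endpoint boils down to. The address, the model name, the sampler values, and the non-OpenAI fields are all my assumptions, not something I've verified:

```python
# Sketch of the request I think ST ends up sending to llama-server's
# OpenAI-compatible endpoint. URL/port and values below are assumptions.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",  # assumed llama-server address
    json={
        "model": "glm-4.5-air",  # placeholder; I believe llama-server mostly ignores this
        "messages": [{"role": "user", "content": "Hello"}],
        # Issues 1/2: as far as I can tell this single budget covers BOTH the
        # <think> block and the visible reply, so 250 dies mid-thinking.
        "max_tokens": 250,
        # Issue 4: the sampler knobs I'm asking about (values are just guesses).
        "temperature": 0.7,
        "top_p": 0.95,
        # llama.cpp seems to accept its own sampler fields (e.g. repeat_penalty)
        # in this JSON too, but whether they all pass through the
        # OpenAI-compatible route is something I'm not sure about.
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```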
Help a bro out!!!!
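In case it helps with issue 3, this is the kind of sanity check I was planning to run straight against the server, bypassing ST, to see whether the <think> tags are even in the raw output (same caveat as above: the address is an assumption):

```python
# Sanity check for issue 3: hit llama-server directly and see whether
# <think>...</think> is actually present in the raw content.
import re
import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",  # assumed llama-server address
    json={
        "messages": [{"role": "user", "content": "Think step by step: what is 17 * 23?"}],
        "max_tokens": 1024,
    },
    timeout=300,
)
msg = resp.json()["choices"][0]["message"]
content = msg.get("content") or ""

# Some builds apparently move the reasoning into a separate field instead of
# leaving the tags inline; listing any non-standard keys to spot that.
print("extra message keys:", [k for k in msg if k not in ("role", "content")])

m = re.search(r"<think>(.*?)</think>", content, re.DOTALL)
if m:
    print("inline <think> block found, length:", len(m.group(1)))
else:
    # No tag here would mean the server side (chat template / reasoning
    # handling) strips or relocates it before ST ever sees it, so the
    # ST-side DeepSeek regex has nothing to match.
    print("no <think> tag in raw output; first 500 chars:")
    print(content[:500])
```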