Anonymous
8/5/2025, 7:11:38 PM No.106151934
Hello r/e/g/edit
I have been working on and off with large action models. I call them LLAMs, because so far the way I do it is with two separate processes, like a left and right brain. One model is a traditional LLM, and the other is a YOLO-based unit. YOLO describes what it sees, and the LLM then responds to that. Right now Darknet or another trainer actually performs the actions; the LLM simply narrates what's happening and isn't driving except in specific situations. If I feed it the visual data and ask it to identify the coordinates of the x y z button, it can spit those back, which are then extracted and used to move the mouse to that location. Cool, but it would be horribly slow.

I would love to have a LAM that actually controls the computer to do tasks, but everything I see is pretty bad and geared toward storefronts. I see a bit about xLAM, but the amount of discussion is so low that I wonder if it's worth my time to get into it too deeply.
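For reference, a rough sketch of that slow loop: YOLO finds UI elements on a screenshot, the LLM is asked which one matches the task, the coordinates are parsed out of its reply, and the mouse is moved there. The weights file and the localhost endpoint here are placeholders, not anything I have actually set up; I'm assuming an ultralytics YOLO model trained on UI elements and a local OpenAI-compatible chat server.

[code]
import re
import requests
import pyautogui
from ultralytics import YOLO

detector = YOLO("ui_buttons.pt")   # hypothetical UI-element detector weights

def detect_elements():
    """Screenshot the desktop and return (label, center_x, center_y) tuples."""
    shot = pyautogui.screenshot()
    result = detector(shot)[0]
    elements = []
    for box in result.boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        label = result.names[int(box.cls[0])]
        elements.append((label, (x1 + x2) / 2, (y1 + y2) / 2))
    return elements

def ask_llm_for_target(elements, request):
    """Describe the detections to the LLM and ask which element to act on."""
    listing = "\n".join(f"{lbl}: ({cx:.0f}, {cy:.0f})" for lbl, cx, cy in elements)
    prompt = (
        f"Visible UI elements with screen coordinates:\n{listing}\n\n"
        f"Task: {request}\nReply with only the coordinates, e.g. (123, 456)."
    )
    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",   # assumed local server
        json={"model": "local", "messages": [{"role": "user", "content": prompt}]},
        timeout=60,
    )
    text = resp.json()["choices"][0]["message"]["content"]
    m = re.search(r"\((\d+)\s*,\s*(\d+)\)", text)       # extract "(x, y)" from the reply
    return (int(m.group(1)), int(m.group(2))) if m else None

if __name__ == "__main__":
    coords = ask_llm_for_target(detect_elements(), "identify the location of the Submit button")
    if coords:
        pyautogui.moveTo(*coords)
        pyautogui.click()
[/code]

The full round trip through the LLM for every single click is exactly the bottleneck I'm describing, which is why I want something that actually drives instead of just narrating.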