>>717799303
I thought about this when all this LLM stuff was new.
The answer is "never" with this type of technology. It has to read thousands of frames of footage, compile them with the context of having done so thousands of times before, and generate an answer.
It has to do this in realtime.
It also has to generate a voice in realtime.
Sorry, but it's not happening.