>>528024652
>My thought so far is to tokenize each word
Have you tried this? Did it also give you timestamps? It seems like a good idea.

https://github.com/linto-ai/whisper-timestamped

You'd only need to run it once, then save the timestamps to a file.
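
Rough sketch of the run-once-and-cache part, in case it helps. The whisper-timestamped calls (load_audio, load_model, transcribe, and the per-word text/start/end entries under each segment) are how I remember the README, so check them against the repo before relying on this. The function names, the "tiny" model, and the file names are placeholders I made up.

```python
import json
from pathlib import Path


def load_or_compute_timestamps(audio_path, cache_path, transcribe_fn):
    """Return word timestamps, running the transcriber only if no cache file exists."""
    cache = Path(cache_path)
    if cache.exists():
        # Already transcribed once: reuse the saved timestamps.
        return json.loads(cache.read_text(encoding="utf-8"))

    result = transcribe_fn(audio_path)
    # Assumed result shape: {"segments": [{"words": [{"text", "start", "end", ...}]}]}
    words = [
        {"text": w["text"], "start": w["start"], "end": w["end"]}
        for seg in result["segments"]
        for w in seg.get("words", [])
    ]
    cache.write_text(json.dumps(words, indent=2), encoding="utf-8")
    return words


def whisper_transcribe(audio_path):
    """Hypothetical wrapper; assumes whisper-timestamped is installed."""
    import whisper_timestamped as whisper  # imported lazily so the cache path works without it

    audio = whisper.load_audio(audio_path)
    model = whisper.load_model("tiny", device="cpu")
    return whisper.transcribe(model, audio)


# Usage (file names are placeholders):
# words = load_or_compute_timestamps("speech.wav", "speech.words.json", whisper_transcribe)
```

Second and later calls just read the JSON, so the slow model pass only happens the first time. Delete the JSON file if you change the audio or the model.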