Anonymous
10/22/2025, 11:01:01 PM
No.106976603
I missed a couple of days, but I was looking through and someone mentioned movie scripts as a source of training data, and now I'm wondering if there is any other source of "organic", human-only, good narrative writing like that which we are overlooking in favor of synthetically generated data.
Visual novel scripts were another, although they are probably best trained on in the original Japanese rather than a translation, and I don't know how many of them exist that would be any good, given that you would probably have to find the fan-translated ones to get any tools for training an LLM. Is there anything else that was in a different medium but that we have transcripts for? Another one I was thinking of was radio dramas, but I don't think they are in any way popular anymore on the English side of things, right? Only Japan still does them, for anime stuff? And I don't think podcasts are great, because a lot of it is just conversational rather than narrative, and the storytelling is overdramatic in the ones that try to make it something worth listening to, like true crime podcasts. It feels like it would be slop, and some of the podcasting stuff may already be tainted by LLM output, so you would probably have to go back to pre-2022.