>>42651316
I wrote a script that combines a lot of resources to do it, but the actual reading part is this:
The story text is pulled from a database (Elasticsearch) and split into sentences based on punctuation.
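The splitting step is roughly this (a sketch; the actual script may use a smarter sentence tokenizer than a bare regex):

```python
import re

def split_sentences(text):
    # Naive punctuation-based splitter: break after ., !, or ?
    # when followed by whitespace.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]
```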
Each sentence is embedded to capture its semantic meaning. Almost all of the process revolves around this embedding step, which uses an LLM. The model isn't a chat model, though: each embedding is a long string of numbers, and the whole point of them is to be compared against each other to see if the meanings of two pieces of text are similar. This is the first time the GPU reads the story, and the step that takes the most time.
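"Compared against each other" here almost always means cosine similarity between the two number strings, which you can do with nothing but stdlib:

```python
import math

def cosine_similarity(a, b):
    # Embeddings are just vectors; similar meaning => vectors point in
    # a similar direction, so we measure the angle between them.
    # Returns ~1.0 for near-identical meaning, ~0.0 for unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```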
After each sentence is embedded, it's compared against its neighboring sentences so that runs of sentences on a similar topic can be combined. In stories, dialogue in particular can form a semantic boundary, so this is done on a sliding window to "skip across" the boundary, if it is small enough to imply that the topic still hasn't changed.
https://docs.chonkie.ai/oss/chunkers/semantic-chunker
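The sliding-window grouping works roughly like this (a sketch of the SemanticChunker idea, not chonkie's actual code; the threshold and window size are made-up defaults):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def semantic_groups(embeddings, threshold=0.5, window=2):
    # Group consecutive sentence indices into topic runs. A sentence
    # joins the current group if it's similar enough to ANY of the last
    # `window` sentences -- that's what lets a short line of dialogue be
    # "skipped across" instead of forcing a new chunk.
    if not embeddings:
        return []
    groups = [[0]]
    for i in range(1, len(embeddings)):
        recent = range(max(0, i - window), i)
        if any(cosine(embeddings[i], embeddings[j]) >= threshold for j in recent):
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups
```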
After that similarity comparison is done, the split sentences are put back together into chunks of text that are each still small enough for my GPU to read, and each chunk is embedded, with the embeddings sent to Elasticsearch for storage. This is the second time the GPU reads the story, and it takes about 1/4 the time of the first reading.
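Reassembling the grouped sentences into size-capped chunks is the simple part; something like this, where `max_chars` stands in for whatever token limit the embedding model actually has:

```python
def assemble_chunks(sentences, groups, max_chars=2000):
    # Rejoin each topic group into one chunk, splitting any group that
    # would exceed what the GPU can embed in a single pass.
    chunks = []
    for group in groups:
        current = ""
        for i in group:
            s = sentences[i]
            if current and len(current) + 1 + len(s) > max_chars:
                chunks.append(current)
                current = s
            else:
                current = s if not current else current + " " + s
        if current:
            chunks.append(current)
    return chunks
```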
Next, and this is the most impressive part to me, a question about the story's contents is embedded:
Instruct: Given a web search query, retrieve relevant passages that answer the query.
Query: Which scenes depict Luna and another character having a romantic or intimate moment?
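That two-line format isn't arbitrary: instruction-tuned embedding models (the E5/Qwen-embedding family, for instance) expect the task description prepended to the query exactly like that, while the passages get embedded with no prefix at all. Building it is trivial:

```python
def build_query_text(task, query):
    # Prefix format expected by instruct-style embedding models;
    # only queries get the prefix, passages are embedded bare.
    return f"Instruct: {task}\nQuery: {query}"
```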
This embedding is sent to Elasticsearch to find the chunks that answer the question. It's frankly unbelievable to me how well this works. If it is a Tunafic, it finds the Tuna scenes and puts them almost at the top of the search results. If it's not a Tunafic, but does have Luna shipping, it will find Luna and that other character's scenes - anything from a heartfelt conversation to banging. In a fic with no Luna shipping at all, https://www.fimfiction.net/story/562291/paladin-shimmer-goes-on-a-bender (it's Twilight/Sunset), it even captured a scene where Luna warmly takes over the night from Celestia.
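The Elasticsearch side of that is a kNN search over the stored chunk vectors. A sketch of the request body (Elasticsearch 8.x `knn` search option; the field names and `_source` fields are my assumptions about the index mapping, not the script's actual one):

```python
def knn_search_body(query_vector, k=10, field="embedding"):
    # Nearest-neighbor search request: find the k chunks whose stored
    # vectors are closest to the embedded question.
    return {
        "knn": {
            "field": field,
            "query_vector": query_vector,
            "k": k,
            "num_candidates": max(100, k * 10),
        },
        "_source": ["story_id", "text"],
    }
```

That dict would then be passed to something like `es.search(index=..., body=...)` via the official client.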
With that, I have the answer for whether or not the story is a Tunafic, but reading the out-of-context chunks myself is unnecessarily annoying. So, finally, those retrieved chunks are combined into a prompt asking a model to examine the retrieved story text and answer whether it is a Tunafic. This prompt goes to a typical chat model and produces a chat response. It's right most of the time, but it's pretty clearly constrained by my VRAM and the general stupidity of chat models. You could say it's the third time my GPU reads the fic, though.
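The prompt assembly is just string stuffing; the wording below is illustrative, not the script's real prompt:

```python
def build_judge_prompt(story_title, chunks):
    # Join the retrieved chunks with separators and ask the chat model
    # for a verdict grounded only in those excerpts.
    joined = "\n\n---\n\n".join(chunks)
    return (
        f"Here are excerpts retrieved from the story '{story_title}':\n\n"
        f"{joined}\n\n"
        "Based only on these excerpts, does this story depict Luna having "
        "a romantic or intimate moment with another character? "
        "Answer yes or no, then briefly explain."
    )
```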
It's still going; 83 fics done so far.