>>107147210
I decided to finally take the plunge and just start making my own AI.
Gonna try and start at a surface level and work down. For now I'm just tinkering with nanoGPT and seeing what I can do.
Right now I'm working on a hybrid word/char-level tokenizer. Not sure where to get training data yet. Goal is English-only for now, maybe moving to Japanese or Mandarin Chinese later once I'm more familiar with how this all works.
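Rough sketch of what I mean by hybrid, not a finished design: frequent words get their own token, everything else falls back to one token per character. All the names here (build_vocab, max_words, min_count, the UNK placeholder) are made up for illustration and are not part of nanoGPT.

```python
import re
from collections import Counter

# Assumption: one placeholder token for characters never seen while building the vocab.
UNK = "\ufffd"

def split_pieces(text):
    # A run of word characters is one piece; any other character is its own piece.
    # Joining the pieces back together reproduces the input exactly.
    return re.findall(r"\w+|\W", text)

def build_vocab(text, max_words=5000, min_count=2):
    # Character tokens: every distinct character in the training text.
    chars = sorted(set(text) - {UNK})
    # Word tokens: multi-character words that show up at least min_count times.
    counts = Counter(p for p in split_pieces(text) if len(p) > 1)
    words = [w for w, c in counts.most_common(max_words) if c >= min_count]
    itos = [UNK] + chars + words
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

def encode(text, stoi):
    ids = []
    for piece in split_pieces(text):
        if piece in stoi:
            # Known word (or single character): one token.
            ids.append(stoi[piece])
        else:
            # Unknown word: fall back to character-level tokens.
            ids.extend(stoi.get(ch, stoi[UNK]) for ch in piece)
    return ids

def decode(ids, itos):
    return "".join(itos[i] for i in ids)

if __name__ == "__main__":
    sample = "the cat sat on the mat. the cat ran."
    stoi, itos = build_vocab(sample)
    ids = encode(sample, stoi)
    print(len(sample), "chars ->", len(ids), "tokens")
    print(decode(ids, itos) == sample)
```

One catch for later: Python's `\w+` treats an unbroken run of kanji/hanzi as a single "word", so Japanese or Chinese text would mostly end up on the character fallback unless the splitting step gets swapped for something smarter.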
Are there any good text datasets on Hugging Face you guys recommend?