>>106136206
I guess it could count as distilling.
The trained model is just a stack of linear layers and activations: T5-small is 512-dim and T5-XXL is 4096-dim, so the first layer is 512->4096 and the rest are 4096->4096.
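Roughly this shape in PyTorch; the depth and the GELU here are placeholders, only the layer dims come from what's described above:

import torch.nn as nn

class T5Adapter(nn.Module):
    def __init__(self, d_in=512, d_out=4096, n_layers=4):
        super().__init__()
        layers = [nn.Linear(d_in, d_out), nn.GELU()]   # first layer: 512 -> 4096
        for _ in range(n_layers - 1):                  # rest: 4096 -> 4096
            layers += [nn.Linear(d_out, d_out), nn.GELU()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):  # x: (batch, seq, 512) T5-small hidden states
        return self.net(x)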
The dataset is paired embeds from T5-small and T5-XXL; training is T5-small embed -> adapter -> output, with the matching T5-XXL embed as the target.
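The training step is basically a regression against the T5-XXL embed (MSE is an assumption here, use whatever loss you like; reuses the T5Adapter sketch above):

import torch
import torch.nn.functional as F

adapter = T5Adapter().cuda()
opt = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

def train_step(small_embed, xxl_embed):
    # both embeds come precomputed from the dataset; T5 sizes share a
    # tokenizer, so the sequences line up token-for-token
    pred = adapter(small_embed.cuda())
    loss = F.mse_loss(pred, xxl_embed.cuda())
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()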
The dataset is precomputed and saved as webdataset shards, with some custom tensor serialization plus compression because T5-XXL embeds are huge
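Shard writing could look something like this; zstd plus an fp16 cast is just one plausible combo for the "custom serialization with compression" part, and `pairs` is a placeholder for however you iterate the precomputed embeds:

import io
import torch
import zstandard
import webdataset as wds

def pack(t):
    # torch.save into a buffer, then zstd-compress; the fp16 cast is optional
    buf = io.BytesIO()
    torch.save(t.to(torch.float16).cpu(), buf)
    return zstandard.ZstdCompressor().compress(buf.getvalue())

with wds.TarWriter("shard-000000.tar") as sink:
    for i, (small, xxl) in enumerate(pairs):  # pairs: placeholder iterable
        sink.write({
            "__key__": f"{i:08d}",
            "small.zst": pack(small),  # bytes values are written as-is
            "xxl.zst": pack(xxl),
        })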
I was using an A40 on RunPod but it was too slow; now I'm using a gpu_1x_gh200 instance on Lambda, that's ARM64 + H100, 64 vCPUs, 432 GiB RAM, 4 TiB SSD for only $1.49/hr
If it ends up working well enough then yeah, I'll release it