>>106136206
I guess it could count as distilling.
The trained model is just a stack of linear layers and activations: T5-small is 512-dim and T5-XXL is 4096-dim, so the first layer is 512->4096 and the rest are 4096->4096.
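Roughly this shape in PyTorch; the depth and the GELU here are placeholders, only the layer dims come from what's described above:

import torch.nn as nn

class T5Adapter(nn.Module):
    def __init__(self, d_in=512, d_out=4096, n_layers=4):
        super().__init__()
        layers = [nn.Linear(d_in, d_out), nn.GELU()]   # first layer: 512 -> 4096
        for _ in range(n_layers - 1):                  # rest: 4096 -> 4096
            layers += [nn.Linear(d_out, d_out), nn.GELU()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):  # x: (batch, seq, 512) T5-small hidden states
        return self.net(x)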
The dataset is paired embeds from T5-small and T5-XXL; training is T5-small embed -> adapter -> output, with the matching T5-XXL embed as the target.
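The training step is basically a regression against the T5-XXL embed (MSE is an assumption here, use whatever loss you like; reuses the T5Adapter sketch above):

import torch
import torch.nn.functional as F

adapter = T5Adapter().cuda()
opt = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

def train_step(small_embed, xxl_embed):
    # both embeds come precomputed from the dataset; T5 sizes share a
    # tokenizer, so the sequences line up token-for-token
    pred = adapter(small_embed.cuda())
    loss = F.mse_loss(pred, xxl_embed.cuda())
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()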
The dataset is precomputed and saved as webdataset shards, with some custom tensor serialization plus compression because T5-XXL embeds are huge
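Shard writing could look something like this; zstd plus an fp16 cast is just one plausible combo for the "custom serialization with compression" part, and `pairs` is a placeholder for however you iterate the precomputed embeds:

import io
import torch
import zstandard
import webdataset as wds

def pack(t):
    # torch.save into a buffer, then zstd-compress; the fp16 cast is optional
    buf = io.BytesIO()
    torch.save(t.to(torch.float16).cpu(), buf)
    return zstandard.ZstdCompressor().compress(buf.getvalue())

with wds.TarWriter("shard-000000.tar") as sink:
    for i, (small, xxl) in enumerate(pairs):  # pairs: placeholder iterable
        sink.write({
            "__key__": f"{i:08d}",
            "small.zst": pack(small),  # bytes values are written as-is
            "xxl.zst": pack(xxl),
        })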
I was using an A40 on RunPod but it was too slow; now I'm using a gpu_1x_gh200 instance on Lambda, that's ARM64 + H100, 64 vCPUs, 432 GiB RAM, 4 TiB SSD for only $1.49/hr
If it ends up working well enough then yeah, I'll release it