> On the infra side, training a 1.5B model in ~4 hours on 8×H100 is impressive.
It's hard to compare without more details about the training process and the dataset, but, is it? Genuine question, because I had the opposite impression. Like, for example, recently I did a full finetuning run on a 3B model chewing through a 146k entry dataset (with 116k entries having reasoning traces, so they're not short) in 7 hours on a single RTX 6000.
Honestly I think we can improve our training throughput drastically via a few more optimizations but we've been spending most of our time on model quality improvements instead.
It's hard to compare without more details about the training process and the dataset, but, is it? Genuine question, because I had the opposite impression. Like, for example, recently I did a full finetuning run on a 3B model chewing through a 146k entry dataset (with 116k entries having reasoning traces, so they're not short) in 7 hours on a single RTX 6000.