
> On the infra side, training a 1.5B model in ~4 hours on 8×H100 is impressive.

It's hard to compare without more details about the training process and the dataset, but is it? Genuine question, because I had the opposite impression. For example, I recently did a full finetuning run on a 3B model, chewing through a 146k-entry dataset (116k of the entries have reasoning traces, so they're not short) in 7 hours on a single RTX 6000.
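
To make the comparison concrete, here's the back-of-envelope arithmetic I'd use. Note that the tokens-per-example figure for my run and the total token budget for the 1.5B run are assumptions (neither is stated here), so the printed numbers are illustrative only:

    # Rough throughput comparison: tokens/sec/GPU rather than wall-clock time.
    # The per-example token count and the 2B-token budget below are assumptions,
    # not figures from either run; plug in real numbers for a fair comparison.

    def tokens_per_sec_per_gpu(total_tokens, wall_clock_hours, num_gpus):
        """Effective training throughput, ignoring padding, warmup, and eval."""
        return total_tokens / (wall_clock_hours * 3600 * num_gpus)

    # Finetuning run from my comment: 146k examples, 7 h, 1x RTX 6000.
    # Assuming ~1,500 tokens per example on average (reasoning traces are long)
    # and a single epoch.
    finetune_tokens = 146_000 * 1_500
    print(f"finetune: ~{tokens_per_sec_per_gpu(finetune_tokens, 7, 1):,.0f} tok/s/GPU")

    # 1.5B-model run from the article: ~4 h on 8x H100. Its token budget isn't
    # stated, so 2B tokens here is purely a placeholder.
    print(f"pretrain: ~{tokens_per_sec_per_gpu(2_000_000_000, 4, 8):,.0f} tok/s/GPU")

Without the actual token counts (plus sequence length, epochs, and whether it's training from scratch or finetuning), the wall-clock figures on their own don't say much either way.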

Honestly, I think we can improve our training throughput drastically with a few more optimizations, but we've been spending most of our time on model quality improvements instead.