What would you define as 'distillation' versus 'learning'? How do you know that what a LLM is doing is 'distillation' vs a process closer to a human reading a book?
From my perspective, pretraining is pretty clearly not 'distilling', as the goal is not to replicate the pretraining data but to generalize. But what these companies are doing is clearly 'distilling' in that they want their models to exactly emulate Claude's behavior.
That's a soft distinction (distilling vs learning). If I read a chapter in a textbook, I am distilling the knowledge from that chapter into my own latent space - one would hope I learn something. Flipping it the other way, you could say the model from Lab Y is ALSO learning from the model from Lab X, not just "distilling" it. Hence my original comment - how deep does this go?
> And yet nearly every machine learning engineer would disagree with you, which is a giveaway that your argument is rooted in ideology.
That's a bold statement! Of course I know the difference: in one case you are learning from correct/wrong answers, and in the other from a probability distribution. But in both cases you are using some signal X to move the weights. We can get into the nitty-gritty of KL divergence vs cross-entropy, but the whole topic is about "theft", which is perhaps in the eye of the beholder.
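To make the KL-vs-cross-entropy point concrete, here's a toy sketch (purely illustrative, not anyone's actual training code): with hard labels, the loss only sees the probability the model assigns to the one "correct" token; with a teacher's soft distribution, the loss compares probabilities across every token, including the "wrong" ones. The example distributions are made up.

```python
import math

def cross_entropy(p_model, true_idx):
    # Hard-label loss: only the probability assigned to the correct
    # token matters; all other probabilities are ignored.
    return -math.log(p_model[true_idx])

def kl_divergence(p_teacher, p_model):
    # Distillation-style loss: match the teacher's full distribution,
    # including the mass it puts on "wrong" tokens.
    return sum(t * math.log(t / s)
               for t, s in zip(p_teacher, p_model) if t > 0)

teacher = [0.7, 0.2, 0.1]   # soft distribution from a teacher model
student = [0.6, 0.3, 0.1]   # student's current distribution

hard_loss = cross_entropy(student, 0)        # sees only index 0
soft_loss = kl_divergence(teacher, student)  # sees all three entries
```

Either way, the result is a scalar loss used to move the weights - which is exactly why the line between "learning" and "distilling" is blurrier than it first looks.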
I see it in the "2. Model Summary" section (for [2]). In the next section, I see links to Hugging Face to download the DeepSeek-R1 Distill Models (for [3]).
Hi, the Zephyr link may be what I'm looking for. Yeah, I'm quite familiar with RL already, so it was specifically RLHF that I was asking about. I'll check out that resource, thanks!
You could definitely use this for upsampling negative prompts, though I haven't tested that much. In theory, future T2I models shouldn't need to be negatively prompted as much; I find it's better to focus on really high quality positive prompts, as that is closer to the captions the model was trained on.
Yup, the model will still forget details sometimes. This is a common issue with prompt upsampling methods, but I'm hoping to improve this with the next version.
Thanks for the kind words! I started with the 780M param flan-t5-large model, and kept trying smaller and smaller base models - I was shocked at how good the output was at 77M. As you go smaller, though, it's much easier to accidentally overfit or collapse the model and produce gibberish. Had to be very careful with hyperparams and sanitizing / filtering the dataset.
I haven't tested extensively with non-SDXL-based checkpoints, but there's nothing really SDXL-specific about the model. If you're using a fine-tune that's trained on booru-style tags, it probably won't work as well - but otherwise it should work just fine. And in that case, just fork the project and tune it on whatever prompt style your fine-tune responds to best :)