> Today, we are excited to release Mistral NeMo, a 12B model built in collaboration with NVIDIA. Mistral NeMo offers a large context window of up to 128k tokens. Its reasoning, world knowledge, and coding accuracy are state-of-the-art in its size category. As it relies on standard architecture, Mistral NeMo is easy to use and a drop-in replacement in any system using Mistral 7B.
> We have released pre-trained base and instruction-tuned checkpoints under the Apache 2.0 license to promote adoption for researchers and enterprises. Mistral NeMo was trained with quantisation awareness, enabling FP8 inference without any performance loss.
So that's... uniformly an improvement at just about everything, right? Large context, permissive license, should have good perf. The one thing I can't tell is how big 12B is going to be (read: how much VRAM/RAM is this thing going to need). Annoyingly and rather confusingly for a model under Apache 2.0, https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407 refuses to show me files unless I login and "You need to agree to share your contact information to access this model"... though if it's actually as good as it looks, I give it hours before it's reposted without that restriction, which Apache 2.0 allows.
You could consider the improvement in model performance a bit of a cheat - they beat other models "in the same size category" that have 30% fewer parameters.
I still welcome this approach. 7B seems like a dead end in terms of reasoning and generalization. Those models are annoyingly close to statistical parrots, a world away from the moderate reasoning you get in 70B models. Any use case where that's useful can increasingly be filled by even smaller models, so chasing slightly larger models to get a bit more "intelligence" might be the right move.
Aren't small models useful for providing a language-based interface - spoken or in writing - to any app? Tuned specifically for that app or more likely enriched via RAG and possibly also by using function calling?
It doesn't have to be intelligent in the way we expect from the top-tier, huge models; it just has to be capable of understanding some words in sentences, mostly commands, and know how to react to them.
I wonder if a "mixture of models" is going to become more common for real-world use cases (i.e. where latency & dollar budgets are real constraints). Chain together a huge model for reasoning, a small model for function calling/RAG, and a medium model for language generation. I'm definitely not dismissing 7B models as irrelevant just yet.
Except Llama 3 8B is a significant improvement over Llama 2, which was so weak that a whole community sprang up building fine-tunes better than what the multi-billion-dollar company could produce, on a much smaller budget. With Llama 3 8B, far fewer community fine-tunes actually beat the base model. The fact that Mistral AI can still build models that beat it means the company isn't falling too far behind a significantly better-equipped competitor.
What's more irritating is that they decided to do quantization aware training for fp8. int8 quantization results in an imperceptible loss of quality that is difficult to pick up in benchmarks. They should have gone for something more aggressive like 4-bit, where quantization leads to a significant loss in quality.
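To see where the quality gap between int8 and 4-bit comes from, here's a minimal symmetric round-trip quantizer: values are scaled into the signed integer range, rounded, and scaled back. This is a simplified sketch of the general idea, not Mistral's actual quantization scheme; 4-bit has only 16 levels, hence the much larger rounding error.

```python
# Symmetric quantize -> dequantize round-trip; the rounding step is
# where quantization error is introduced.
def quantize_roundtrip(xs, bits=8):
    qmax = 2 ** (bits - 1) - 1              # 127 for int8, 7 for int4
    scale = max(abs(x) for x in xs) / qmax  # map the max value to qmax
    return [round(x / scale) * scale for x in xs]

xs = [0.02, -1.3, 0.77, 2.5]  # made-up sample weights
for bits in (8, 4):
    err = max(abs(a - b) for a, b in zip(xs, quantize_roundtrip(xs, bits)))
    print(f"int{bits}: max abs error {err:.4f}")
```

Running this shows the int8 error is tiny while the int4 error is an order of magnitude larger, which is the commenter's point: quantization-aware training matters most where the naive error is big.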
Not that you aren't correct overall in terms of difficulty, but llama3 definitely still has a handful of fine tunes that I'd say outperform the base model by quite a bit, like the hermes model from Nous research, and we're only going to see more as time goes on.
I usually tell the model that I will be testing its reasoning capabilities by describing a scenario and then asking questions about the evolving scenario.
I typically give it a description of a limited environment with objects in it, and say that “we” are in this environment. I then describe actions that I take within the environment and ask questions about the updated world-state that must be inferred from the actions. This tests a lot of “common sense” reasoning skills, which I find to be more important for real world tasks than logic puzzle type reasoning.
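The test described above can be sketched as a scripted conversation: set up a small environment, apply actions, and ask questions whose answers must be inferred from the updated world state. The scenario and questions below are made up for illustration.

```python
# Hypothetical scenario-tracking test, per the method described above.
system = ("I will test your reasoning by describing a scenario and then "
          "asking questions about the evolving scenario.")

turns = [
    "We are in a kitchen. On the counter there is a red mug, a kettle, "
    "and a closed jar of coffee.",
    "I fill the kettle, turn it on, and put two spoons of coffee in the mug.",
    "Is the jar of coffee still closed?",   # must infer the jar was opened
    "I pour the boiling water into the mug and carry it to the garden.",
    "Where is the mug now?",                # must track the object's location
]

for turn in turns:
    print(turn)  # each turn would be sent to the model in sequence
```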
Easy head math: parameter count times parameter size plus 20-40% for inference slop space. Anywhere from 8-40GB of vram required depending on quantization levels being used.
They did quantization-aware training for fp8, so you won't get any benefit from using more than 12GB of RAM for the parameters. Where you might use more RAM is the much bigger context window.
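The context-window cost is the KV cache, which grows linearly with sequence length: 2 (keys and values) × layers × KV heads × head dimension × bytes per element × tokens. The layer and head counts below are taken from the published Mistral NeMo config (40 layers, 8 KV heads via grouped-query attention, head dim 128); treat them as assumptions if you're checking against a different source.

```python
def kv_cache_gb(tokens: int, layers: int = 40, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB for a given number of cached tokens."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * tokens / 1e9

print(f"8k context:   ~{kv_cache_gb(8_192):.1f} GB")
print(f"128k context: ~{kv_cache_gb(131_072):.1f} GB")
```

Under these assumptions, a full 128k-token context adds on the order of 20 GB on top of the weights, which is why the 12GB-for-parameters figure alone understates real usage at long context.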
If you want to be lazy: 7B = 7GB of VRAM, 12B = 12GB of VRAM, but with quantization you might be able to do it with ~6-8GB. So any 16GB MacBook could run it (but not much else).
Welp, my data point of one shows you need more than 8 GB of VRAM.
When I run mistral-chat with Nemo-Instruct it crashes in 5 seconds with the error: "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU"
This is on Ubuntu 22.04.4 with an NVIDIA GeForce RTX 3060 Ti with 8192MiB. I ran "nvidia-smi -lms 10" to see what it maxed out with, and it last recorded max usage of 7966MiB before the crash.
When I run mistral-chat on Ubuntu 22.04 after cleaning up some smaller processes from the GPU (like gnome-remote-desktop-daemon) I am able to start Mistral-Nemo 2407 and get a Prompt on RTX 4090, but after entering the prompt it still fails with OOM, so, as someone noted, it narrowly fits 4090.
Agreed, it narrowly fits on RTX 4090. Yesterday I rented an RTX 4090 on vast.ai and setup Mistral-Nemo-2407. I got it to work, but just barely. I can run mistral-chat, get the prompt, and it will start generating a response to the prompt after 10 to 15 seconds. The second prompt always causes it to crash immediately from OOM error. At first I almost bought an RTX 4090 from Best Buy, but it was going to cost $2,000 after tax, so I'm glad that instead I only spent 40 cents.
What about for fine-tuning? Are the memory requirements comparable to inference? If not, is there a rule of thumb for the difference? Would it be realistic to do it on a macbook with 96G of unified memory?
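As a rough answer: full fine-tuning needs far more memory than inference, because on top of the weights you hold gradients and optimizer state. A common rule of thumb for mixed-precision Adam is fp16 weights (2 B/param) + fp16 gradients (2 B/param) + fp32 master weights and two Adam moments (12 B/param), so ~16 bytes per parameter before activations; LoRA freezes the base model and only keeps optimizer state for a small adapter. These are illustrative estimates, not measured numbers.

```python
def full_finetune_gb(n_params_billion: float, bytes_per_param: int = 16) -> float:
    """Full fine-tune with Adam in mixed precision, excluding activations."""
    return n_params_billion * bytes_per_param

def lora_finetune_gb(n_params_billion: float, adapter_frac: float = 0.01) -> float:
    """Frozen fp16 base weights plus optimizer state for the adapter only.
    adapter_frac (1% here) is an assumed adapter size, not a fixed constant."""
    return n_params_billion * 2 + n_params_billion * adapter_frac * 16

print(f"12B full fine-tune: ~{full_finetune_gb(12):.0f} GB")
print(f"12B LoRA:           ~{lora_finetune_gb(12):.0f} GB")
```

By this estimate, full fine-tuning a 12B model (~192 GB) is well beyond a 96GB MacBook, but a LoRA-style fine-tune (~25-30 GB plus activations) should be realistic there.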
Yes, but it's not common for the original model to be int8. The community can quantize any model down to int8, but that always comes with some quality loss.