> Today, we are excited to release Mistral NeMo, a 12B model built in collaboration with NVIDIA. Mistral NeMo offers a large context window of up to 128k tokens. Its reasoning, world knowledge, and coding accuracy are state-of-the-art in its size category. As it relies on standard architecture, Mistral NeMo is easy to use and a drop-in replacement in any system using Mistral 7B.
> We have released pre-trained base and instruction-tuned checkpoints under the Apache 2.0 license to promote adoption for researchers and enterprises. Mistral NeMo was trained with quantisation awareness, enabling FP8 inference without any performance loss.
So that's... uniformly an improvement at just about everything, right? Large context, permissive license, should have good perf. The one thing I can't tell is how big 12B is going to be (read: how much VRAM/RAM is this thing going to need). Annoyingly and rather confusingly for a model under Apache 2.0, https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407 refuses to show me files unless I login and "You need to agree to share your contact information to access this model"... though if it's actually as good as it looks, I give it hours before it's reposted without that restriction, which Apache 2.0 allows.
You could consider the improvement in model performance a bit of a cheat - they beat other models "in the same size category" that have 30% fewer parameters.
I still welcome this approach. 7B seems like a dead end in terms of reasoning and generalization. Those models are annoyingly close to statistical parrots, a world away from the moderate reasoning you get in 70B models. Any use case where that's useful can increasingly be filled by even smaller models, so chasing slightly larger models to get a bit more "intelligence" might be the right move.
Aren't small models useful for providing a language-based interface - spoken or in writing - to any app? Tuned specifically for that app or more likely enriched via RAG and possibly also by using function calling?
It doesn't have to be intelligent in the way we expect from the top-tier, huge models; it just has to be capable of understanding some words in sentences, mostly commands, and know how to react to them.
I wonder if a "mixture of models" is going to become more common for real-world use cases (i.e. where latency & dollar budgets are real constraints). Chain together a huge model for reasoning, a small model for function calling/RAG, and a medium model for language generation. I'm definitely not dismissing 7B models as irrelevant just yet.
Except Llama 3 8B is a significant improvement over Llama 2, which was so weak that a whole community sprang up building fine-tunes better than what the multi-billion-dollar company could produce, on a much smaller budget. With Llama 3 8B, far fewer community fine-tunes actually beat the base model. The fact that Mistral AI can still build models that beat it means the company isn't falling too far behind a significantly better-equipped competitor.
What's more irritating is that they decided to do quantization aware training for fp8. int8 quantization results in an imperceptible loss of quality that is difficult to pick up in benchmarks. They should have gone for something more aggressive like 4-bit, where quantization leads to a significant loss in quality.
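To see where the quality gap between int8 and 4-bit comes from, here's a minimal symmetric round-trip quantizer: values are scaled into the signed integer range, rounded, and scaled back. This is a simplified sketch of the general idea, not Mistral's actual quantization scheme; 4-bit has only 16 levels, hence the much larger rounding error.

```python
# Symmetric quantize -> dequantize round-trip; the rounding step is
# where quantization error is introduced.
def quantize_roundtrip(xs, bits=8):
    qmax = 2 ** (bits - 1) - 1              # 127 for int8, 7 for int4
    scale = max(abs(x) for x in xs) / qmax  # map the max value to qmax
    return [round(x / scale) * scale for x in xs]

xs = [0.02, -1.3, 0.77, 2.5]  # made-up sample weights
for bits in (8, 4):
    err = max(abs(a - b) for a, b in zip(xs, quantize_roundtrip(xs, bits)))
    print(f"int{bits}: max abs error {err:.4f}")
```

Running this shows the int8 error is tiny while the int4 error is an order of magnitude larger, which is the commenter's point: quantization-aware training matters most where the naive error is big.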
Not that you aren't correct overall in terms of difficulty, but llama3 definitely still has a handful of fine tunes that I'd say outperform the base model by quite a bit, like the hermes model from Nous research, and we're only going to see more as time goes on.
I usually tell the model that I will be testing its reasoning capabilities by describing a scenario and then asking questions about the evolving scenario.
I typically give it a description of a limited environment with objects in it, and say that “we” are in this environment. I then describe actions that I take within the environment and ask questions about the updated world-state that must be inferred from the actions. This tests a lot of “common sense” reasoning skills, which I find to be more important for real world tasks than logic puzzle type reasoning.
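The test described above can be sketched as a scripted conversation: set up a small environment, apply actions, and ask questions whose answers must be inferred from the updated world state. The scenario and questions below are made up for illustration.

```python
# Hypothetical scenario-tracking test, per the method described above.
system = ("I will test your reasoning by describing a scenario and then "
          "asking questions about the evolving scenario.")

turns = [
    "We are in a kitchen. On the counter there is a red mug, a kettle, "
    "and a closed jar of coffee.",
    "I fill the kettle, turn it on, and put two spoons of coffee in the mug.",
    "Is the jar of coffee still closed?",   # must infer the jar was opened
    "I pour the boiling water into the mug and carry it to the garden.",
    "Where is the mug now?",                # must track the object's location
]

for turn in turns:
    print(turn)  # each turn would be sent to the model in sequence
```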
Easy head math: parameter count times parameter size plus 20-40% for inference slop space. Anywhere from 8-40GB of vram required depending on quantization levels being used.
They did quantization-aware training for fp8, so you won't get any benefit from using more than 12GB of RAM for the parameters. Where you might use more RAM is the much bigger context window.
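The context-window cost is the KV cache, which grows linearly with sequence length: 2 (keys and values) × layers × KV heads × head dimension × bytes per element × tokens. The layer and head counts below are taken from the published Mistral NeMo config (40 layers, 8 KV heads via grouped-query attention, head dim 128); treat them as assumptions if you're checking against a different source.

```python
def kv_cache_gb(tokens: int, layers: int = 40, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB for a given number of cached tokens."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * tokens / 1e9

print(f"8k context:   ~{kv_cache_gb(8_192):.1f} GB")
print(f"128k context: ~{kv_cache_gb(131_072):.1f} GB")
```

Under these assumptions, a full 128k-token context adds on the order of 20 GB on top of the weights, which is why the 12GB-for-parameters figure alone understates real usage at long context.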
If you want to be lazy: 7B = 7GB of VRAM, 12B = 12GB of VRAM, but with quantization you might be able to do it with ~6-8GB. So any 16GB MacBook could run it (but not much else).
Welp, my data point of one shows you need more than 8 GB of VRAM.
When I run mistral-chat with Nemo-Instruct it crashes in 5 seconds with the error: "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU"
This is on Ubuntu 22.04.4 with an NVIDIA GeForce RTX 3060 Ti with 8192MiB. I ran "nvidia-smi -lms 10" to see what it maxed out with, and it last recorded max usage of 7966MiB before the crash.
When I run mistral-chat on Ubuntu 22.04 after cleaning up some smaller processes from the GPU (like gnome-remote-desktop-daemon) I am able to start Mistral-Nemo 2407 and get a Prompt on RTX 4090, but after entering the prompt it still fails with OOM, so, as someone noted, it narrowly fits 4090.
Agreed, it narrowly fits on RTX 4090. Yesterday I rented an RTX 4090 on vast.ai and setup Mistral-Nemo-2407. I got it to work, but just barely. I can run mistral-chat, get the prompt, and it will start generating a response to the prompt after 10 to 15 seconds. The second prompt always causes it to crash immediately from OOM error. At first I almost bought an RTX 4090 from Best Buy, but it was going to cost $2,000 after tax, so I'm glad that instead I only spent 40 cents.
What about for fine-tuning? Are the memory requirements comparable to inference? If not, is there a rule of thumb for the difference? Would it be realistic to do it on a macbook with 96G of unified memory?
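As a rough answer: full fine-tuning needs far more memory than inference, because on top of the weights you hold gradients and optimizer state. A common rule of thumb for mixed-precision Adam is fp16 weights (2 B/param) + fp16 gradients (2 B/param) + fp32 master weights and two Adam moments (12 B/param), so ~16 bytes per parameter before activations; LoRA freezes the base model and only keeps optimizer state for a small adapter. These are illustrative estimates, not measured numbers.

```python
def full_finetune_gb(n_params_billion: float, bytes_per_param: int = 16) -> float:
    """Full fine-tune with Adam in mixed precision, excluding activations."""
    return n_params_billion * bytes_per_param

def lora_finetune_gb(n_params_billion: float, adapter_frac: float = 0.01) -> float:
    """Frozen fp16 base weights plus optimizer state for the adapter only.
    adapter_frac (1% here) is an assumed adapter size, not a fixed constant."""
    return n_params_billion * 2 + n_params_billion * adapter_frac * 16

print(f"12B full fine-tune: ~{full_finetune_gb(12):.0f} GB")
print(f"12B LoRA:           ~{lora_finetune_gb(12):.0f} GB")
```

By this estimate, full fine-tuning a 12B model (~192 GB) is well beyond a 96GB MacBook, but a LoRA-style fine-tune (~25-30 GB plus activations) should be realistic there.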
Yes, but it's not common for the original model to be int8. The community can quantize any model down to int8, but that always comes with some quality loss.