Keep in mind that GGUF/llama.cpp, while highly performant and portable, is not the fastest way to run certain models if you have a GPU (even though llama.cpp does support GPU acceleration).
ExLlamaV2 with exl2 quantization, and maybe TensorRT-LLM, are the contenders for top performance.
Most of you probably already have Python. Download exllamav2 and exui from GitHub and run a few terminal commands. This lets me run 120B-parameter models, which won't fit in VRAM if I use llama.cpp.
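To see why exl2's flexible bitrates matter, here's a rough weights-only back-of-envelope (the function name and the ~3 bpw figure are my own illustration; KV cache and activation memory add more on top of this):

```python
def weight_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Weights-only footprint in GB: params * bits / 8 bytes."""
    return n_params_billion * bits_per_weight / 8

# 120B at ~3 bpw (a typical low exl2 target) vs plain fp16:
print(f"3.0 bpw: {weight_gb(120, 3.0):.0f} GB")  # 45 GB -> squeezes onto 2x24 GB cards
print(f"fp16:    {weight_gb(120, 16):.0f} GB")   # 240 GB -> nowhere near consumer VRAM
```

Because exl2 lets you pick an arbitrary bits-per-weight, you can dial the quant to exactly what your cards hold, instead of being limited to the fixed GGUF quant sizes.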
Panchovix/goliath-120b-exl2 (there's a different branch for each size)
Some of them I've had to make myself, e.g. I wanted a Q2 GGUF of Falcon 180B.
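If you want to roll your own, the llama.cpp repo ships the tooling; a sketch of the two steps (script and binary names assume a current llama.cpp checkout and a local HF model directory, and older trees used `convert.py` and `./quantize` instead):

```shell
# 1. Convert the Hugging Face checkpoint to an fp16 GGUF
python convert_hf_to_gguf.py /models/falcon-180b --outfile falcon-180b-f16.gguf

# 2. Requantize that down to Q2_K
./llama-quantize falcon-180b-f16.gguf falcon-180b-Q2_K.gguf Q2_K
```

For a 180B model expect the fp16 intermediate alone to be several hundred GB on disk, so check free space before starting.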
There's a guy on Hugging Face called "TheBloke" who does GGUF, AWQ, and GPTQ quants for most models. For exl2, you can usually just search for "exl2" and find them.