Hacker News | tMcGrath's comments

Yes - we'd never normally turn features up this much as it breaks the model quite badly, but we put this in the post to show what that looked like in practice.


Thank you! I think some of the features we have, like conditional steering, make SAEs a lot more convenient to use. It also makes working with models feel a lot more like conventional programming: when the model is 'thinking' x, or the text is about y, invoke steering. We have an example of this for jailbreak detection: https://x.com/GoodfireAI/status/1871241905712828711
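To sketch what conditional steering looks like as "conventional programming" (hypothetical function and feature names, not the actual Goodfire SDK API):

```python
# Hypothetical sketch of conditional steering. `feature_activation`,
# `conditional_steer`, and the feature names are illustrative only,
# not the real Goodfire SDK API.

def feature_activation(features: dict, name: str) -> float:
    """Look up a named feature's activation at the current token (stub)."""
    return features.get(name, 0.0)

def conditional_steer(features, trigger, threshold, target, strength):
    """If `trigger` fires above `threshold`, return a steering edit
    that boosts `target`; otherwise leave the model untouched."""
    if feature_activation(features, trigger) > threshold:
        return {target: strength}
    return {}

# e.g. when a jailbreak-attempt feature fires, boost a refusal feature
edits = conditional_steer({"jailbreak attempt": 0.9},
                          "jailbreak attempt", 0.5,
                          "polite refusal", 4.0)
```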

We also have an 'autosteer' feature that makes coming up with new variants easy: https://x.com/GoodfireAI/status/1871241902684831977 (this feels kind of like no-code finetuning).

Being able to read features out and train classifiers on them seems pretty useful - for instance we can read out features like 'the user is unhappy with the conversation', which you could then use for A/B testing your model rollouts (kind of like Google Analytics for your LLM). The big improvements here are (a) cost - the marginal cost of an SAE is low compared to frontier model annotations, (b) a consistent ontology across conversations, and (c) not having to specify that ontology in advance, but rather discover it from data.
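As a toy illustration of the classifier idea (synthetic data and a plain-numpy logistic regression; in practice each row would be pooled SAE feature activations for one conversation, and the "unhappiness" feature index is made up):

```python
import numpy as np

# Sketch: train a linear probe on SAE feature activations to flag
# "user is unhappy" conversations. The feature matrix is synthetic;
# we pretend column 3 is an "unhappiness" feature that drives the label.
rng = np.random.default_rng(0)
n, d = 200, 16
X = rng.normal(size=(n, d))
y = (X[:, 3] > 0).astype(float)

w = np.zeros(d)
b = 0.0
for _ in range(500):                       # plain gradient-descent logistic regression
    p = 1 / (1 + np.exp(-(X @ w + b)))
    g = p - y
    w -= 0.1 * (X.T @ g) / n
    b -= 0.1 * g.mean()

acc = ((1 / (1 + np.exp(-(X @ w + b))) > 0.5) == y).mean()
```

The marginal cost here is one matrix multiply per conversation, which is the cost argument in (a) above.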

These are just my guesses though - a large part of why we're excited about putting this out is that we don't have all the answers for how it can be most useful, but we're excited to support people finding out.


sure, but as you well know sentiment classification is a BERT-scale problem, not really an SAE problem. burden of proof is on you that "read features out and train classifiers on them" is superior to "GOFAI".

anyway i dont need you to have the answers right now. congrats on launching!


We'll be open-sourcing these SAEs so you're not required to do this if you'd rather self-host.


I'm one of the authors of this paper - happy to answer any questions you might have.


Why not actually release the weights on huggingface? The popular SAE_lens repo has a direct way to upload the weights and there are already hundreds publicly available. The lack of training details/dataset used makes me hesitant to run any study on this API.

Are images included in the training?

What kind of SAE is being used? There have been some nice improvements in SAE architecture this last year, and it would be nice to know which one (if any) is provided.


We're planning to release the weights once we do a moderation pass. Our SAE was trained on LMSys (you can see this in our accompanying post: https://www.goodfire.ai/papers/mapping-latent-spaces-llama/).

No images in training - 3.3 70B is a text-only model so it wouldn't have made sense. We're exploring other modalities currently though.

The SAE is a basic ReLU one. This might seem a little backwards, but I've been concerned by some of the high-frequency features in TopK and JumpReLU SAEs (https://arxiv.org/abs/2407.14435, Figure 14), and the recent SAEBench results (https://www.neuronpedia.org/sae-bench/info) show quite a lot of feature absorption in more recent variants (though this could be confounded by a number of things). This isn't to say they're definitely bad - I think it's quite likely that TopK/JumpReLU are an improvement - but rather that we need to evaluate them in more detail before pushing them live. Overall I'm very optimistic about the potential for improvements in SAE variants, which we talk a bit about at the bottom of the post. We're going to be pushing SAE quality a ton now that we have a stable platform to deploy them to.
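For reference, the basic ReLU SAE forward pass is just the following (dimensions and initialization are illustrative, not the trained 70B SAE):

```python
import numpy as np

# Minimal ReLU sparse autoencoder forward pass: encode with a ReLU,
# decode linearly, subtracting/adding a decoder bias. Sizes are toy.
rng = np.random.default_rng(0)
d_model, d_sae = 64, 512                   # SAE dictionary is overcomplete
W_enc = rng.normal(scale=0.02, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.02, size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """x: (batch, d_model) residual-stream activations."""
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)   # sparse feature activations
    x_hat = f @ W_dec + b_dec                          # reconstruction
    return x_hat, f

x = rng.normal(size=(8, d_model))
x_hat, f = sae_forward(x)
```

TopK and JumpReLU variants differ only in the activation function applied to `f`; the training loss (reconstruction error plus a sparsity penalty) is where most of the remaining detail lives.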


Noob question - how do we know that these autoencoders aren't hallucinating and really are mapping/clustering what they should be?


Hmm, the hallucination risk would be in the auto-labelling step, but we review and test our labels and they seem correct!


The paper is measuring I/O behaviour, rather than the complexity of the mechanisms generating that behaviour. Transistors might have quite complex physics, but are designed to have relatively simple I/O behaviour.


I'm not sure that's all there is to it - at least some scientists are in it for the explanation: they want to know more about something. In this case the prediction is just a useful check that their proposed explanation is compatible with reality.

Of course, this isn't typically why science gets funded - we want the engineering applications that are enabled by our ability to calculate - but a version of science that's all prediction, no explanation seems very unappealing (not to mention sterile for further investigation).


> at least some scientists are in it for the explanation

What the scientists are in it for is one thing. But what confidence non-scientist members of the public should have in claims made by scientists is another. The latter is what prediction is for: the better the predictive track record of the scientific claims, the higher the confidence they deserve.


Modern Classical Physics is a great book - I've recently started working through it (just about to move on to chapter 2). I'd be interested in chatting about it with others/cross-checking solutions. Anyone who's interested, drop me an email (address in my profile).

If you're thinking of getting it but want to check it out, there's a 2012 draft version that the authors have previously taught from here: http://www.pmaweb.caltech.edu/Courses/ph136/yr2012/. It's not the same as the book, of course, but from a skim it seems quite similar.


Thanks! This looks like what I will start with. Can't find your email in your profile, though.


As far as I know, a good differential-geometric understanding of nonequilibrium thermodynamics still hasn't been achieved.

The central issue is understanding how changes in control parameters (for instance concentrations of catalysts in a chemical system, or local fields in a spin system) affect the evolution of the probability distribution over states. Some work has been done close to steady state (for instance [1,2,3]), but it's far from resolved.
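To make the setting concrete, one standard formulation (a sketch, assuming an overdamped Langevin system in one dimension with unit mobility and inverse temperature β) has the control parameter λ(t) enter through the potential, with the distribution evolving under a Fokker-Planck equation:

```latex
\partial_t\, p(x, t) \;=\; \partial_x \!\left[\, p(x, t)\, \partial_x V\!\left(x, \lambda(t)\right) \right] \;+\; \beta^{-1}\, \partial_x^2\, p(x, t)
```

The design question is then how to choose the protocol λ(t), e.g. to minimize dissipated work, given how it couples to p.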

This has some nice applications - designing efficient protocols for microscale devices, for instance.

[1] https://arxiv.org/abs/1603.07758
[2] https://arxiv.org/abs/1507.06269v1
[3] https://arxiv.org/abs/1201.4166


As I understand it, her point is that what we take to be fundamental particles have certain properties - charge, spin, etc. - and no others. If they had other properties, this would lead to measurably different outcomes in collision experiments for quantum-mechanical reasons (not because of any 'free will' on the part of the particles, as I believe some other commenters have interpreted it).

If these few numbers really are all the information that an electron (for instance) contains, then where is the informational content of consciousness located, assuming that panpsychism claims that electrons possess a certain small amount of consciousness? This is how I interpreted her sentence on consciousness implying the ability to change; not as meaning the ability to decide, but as meaning the ability to carry extra information by being in different states. I think this seems like a reasonable objection and I'm interested in how panpsychists might respond.


I think the standard reference is probably Spivak's 'Calculus on Manifolds' but this never really did it for me.

If you have a background in physics then some combination of Nakahara's 'Geometry, Topology and Physics' and Baez and Muniain's 'Gauge Fields, Knots and Gravity' might be good (I haven't included relativity textbooks, as I assume that if you have a background in GR you already have enough differential geometry).

An unusual recommendation that I think is really nice is 'Stochastic Models, Information Theory and Lie Groups' by Chirikjian. It covers a few other topics mentioned in this thread, is _extremely_ concrete, and spells out a lot of calculations in great detail. Plus, the connection to engineering applications is much more obvious.


Chirikjian's book looks really cool! Its website says that in volume 1 "The author reviews stochastic processes and basic differential geometry in an accessible way for applied mathematicians, scientists, and engineers." And I can't tell if that means 'brief review because this is a prereq to the book' or if this is a good first take on it. Do you know which it is?

