Yep they (would) basically have 8-16 "experts" that are each about the size of GPT-3. Since they each see different batches of the dataset, they learn to model those distributions independently rather than the distribution of the whole dataset. Some of the attention is shared between them however.
Then another "routing model" decides which model is most suitable for the given user prompt.
Given they use relatively few experts, each one is likely similarly capable to the others on many tasks. I assume this makes deployment easier and is a "more conservative," less risky approach. Even if the router picks the wrong model, answers should still tend to be somewhat acceptable, for instance.
This is not how mixture of experts works at all. The experts are chosen on each layer, not for the whole network, and attention is shared between all of them.
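To make the correction concrete: in a transformer MoE, routing happens inside each MoE layer, per token, not once per prompt. Here's a minimal, purely illustrative sketch in plain numpy; the shapes, names, and toy matrix-multiply "experts" are my own assumptions, not any specific model's implementation:

```python
import numpy as np

def moe_layer(x, gate_w, expert_ws, k=2):
    """Toy per-token top-k mixture-of-experts feed-forward layer.

    x:         (tokens, d) activations entering this one layer
    gate_w:    (d, n_experts) router weights for this layer
    expert_ws: list of (d, d) weight matrices, one toy "expert" each
    """
    # Router: softmax over expert logits, computed per token.
    logits = x @ gate_w                                    # (tokens, n_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)

    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        # Each token is routed independently to its own top-k experts.
        topk = np.argsort(probs[t])[-k:]
        weights = probs[t, topk] / probs[t, topk].sum()    # renormalize over top-k
        for w, e in zip(weights, topk):
            out[t] += w * (x[t] @ expert_ws[e])
    return out
```

A full model stacks many such layers, each with its own router and experts, while the attention sublayers are shared dense computation. So "pick one expert model per prompt" is not what happens; two tokens in the same prompt can take entirely different expert paths at every layer.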
This is common behavior for inference-based learners who don’t hail from strong academic backgrounds. Many self-taught developers use a similar method of learning, essentially using pattern recognition to make “educated guesses” that are then internalized as potential facts and tested at the earliest opportunity. In this instance the test was to project the incorrect information onto a public forum containing experts, using the absence of contradiction as weak evidence for the validity of their newfound knowledge.
Yes, this is done in lieu of actually looking up extended details on what something means. It has its advantages, though.
Sorry but that is just ludicrous. You do not answer a factual question with a mostly made-up answer without saying clearly "this is pure speculation" at some point, preferably early on.
When I was younger (hah, I'm only 26 now) I sure did do exactly what you say. If your statement is intended to say that "People should not" then you're absolutely correct, however it is a learned behavior that some people must adopt after being corrected by their peers.
Seems the advantage is somewhat localized to the individual inference-based learner; it doesn't seem like a pro-social strategy that optimizes benefit for the group. Overall, this seems like it would generalize to widespread misinformation if the majority of users adopted this behavior.
I'm guessing it's in the best interests of the wider group to try to minimize the occurrence of this type of participation.
The advantages in a social setting lie in the introduction of entropy, that is _creativity_, to a community. In a rigorous academic setting and with proper training, these individuals are more likely to identify links between ideas or pieces of information that may not seem obvious at first, and they tend to be your more 'eccentric' academics.
For the interests of the wider group, the best outcome is to help these individuals refine their communication to make it clear when information they present is unsubstantiated inference as opposed to verified knowledge.
Once the proper 'rules of engagement' are outlined, the contributions of these individuals are oftentimes a useful 'ingredient' in the success of many enterprises.
Given that this is an online forum, another advantage is that a conversational trail is left for others to discover. The inferences these types of individuals make are often based on a structure of knowledge and reality that others share, so the most common preconceived and incorrect notions tend to have the most documentation on how to ameliorate the incorrectness (given that these individuals are allowed to state their inferences out loud).
It's a good thread, particularly Exuma's comment, but a memento mori from the root node:
My one word "Source?" meant: "You are adding a new episode of fictional info to a discussion about fictional info."
Even in that fictional world the post was wrong. Even if it was right in some allegorical sense, the simplistic allegory adds nothing. "Mixture of experts is like a group of experts where you pick the right expert to ask a question" isn't some hard-won cross-domain self-taught knowledge. It's something a bright 6th grader would pull off.
The self-taught feeling-stuff-out stuff matters when you're making useful connections that get practical results.
When you're just wiring stuff together online, and the wiring is meaningless, you're doing nothing and taking the consequences of the negative signals.
You have clarified something I have always thought about very intensely and deeply but haven’t really ever read anyone else who understands that so well or rather put it into words so clearly.
I’m an inference-based learner to an extreme, and it definitely has many upsides and also downsides. The upsides are being able to learn extremely rapidly by making connections between pieces of information where there are gaps, then using a sort of heuristic detection, like a compass, to feel out which gaps most need filling. Then I follow that trail down, regardless of how hard or complex it is, just to the point where it accomplishes what I need (whether it's statistics, machine learning, or transaction isolation that I've learned for the 50th time...). Another upside is significant abstract thinking ability; sometimes it feels like looking at a maze from overhead.
I’ve built 100+ projects over close to 30,000 hours of programming across something like 15 years.
The downside is that when I’m around strong people of the other type, I sometimes get the sense they don’t respect this style of learning. It comes through in their words, tone, and subtle body language cues.
My friend, who is very very much the other type, with a PhD in something very hard I don’t remember… algorithms and data structures or something, said it’s because I don’t value domain knowledge. He said if you spent an entire life building, say, a database, you would not consider that a life worth living. I laughed because that makes me sound like an asshole, but the more I thought about it, the clearer it became that I actually agree somehow. It's as if information on its own, as a means to an end, is not fulfilling to me. To me it feels like efficiency, creativity, and moving from A to B very rapidly while hierarchically organizing a massive amount of chaotic information is ingrained in my DNA, but simply getting the correct, deepest domain knowledge possible is not appealing to me at all. I'll go to the depths that are needed, then go elsewhere.
I’m VERY thankful for those people though, as that’s where a lot, if not most, of progress is made.
It has been an internal paradox for most of my life: I can’t figure out if I’m smart or stupid. I have built companies completely on my own on the tech side, where one made over 10m and another made over 200m in revenue. I’ve been told I built things entire teams were not capable of doing on another project... this gives me signals that I’m smart. Then there are things like getting an F on a hard programming interview from my first employer, a genius-level, Harvard-graduate, academic domain-knowledge style of person. It made me feel completely idiotic. There are many other times and situations where I think I’m wired weird, where it “feels” like I’m stupid.
Over the years I’ve accepted this paradox, but it wasn’t until this domain-knowledge piece, and the creativity aspect, that I finally just accepted it as OK and not something wrong with me.
The hardest part is VERY often being misunderstood. So much so that I often have to expend an exhausting amount of time when working with new teams explaining “how I think,” because like a fortune teller I can always predict what will be misperceived, and even when I say it up front it usually happens anyway. This is why trust is paramount with my business partners. They know I’m extremely eccentric, so to speak, but they “trust the process” when I lock myself in a room for 30 days and come out with an amazing piece of tech built purely on raw intuition.
The other part that often made me feel stupid is that, despite its upsides, this way of thinking is often exhausting, because I don’t usually rely on past experiences to make decisions. Each situation is different. So even if I’ve done something 30 times, I will still feel this “stepping into the unknown” feeling, which takes great willpower and courage to face repeatedly, especially when other people are relying on it.
Using this method, though, I’ve also built cool things; one recent example is a platform I built that is the best-converting one in the entire industry. On that project there is a massive team on the other side, but the platform itself was built by me alone, without much starting info to go off of other than a few multi-hour brain-dump calls.
One thing my PhD friend pointed out is that Feynman was a creative learner, I think. It helped me feel better that it’s not a “wrong” or stupid way of thinking if other people that high up might share similar, more creative ways of thinking. Of course it’s not exactly everything I describe to a T, I’m not saying that, but threads of it.
Hopefully none of what I wrote sounds insulting or arrogant to anyone. I fully acknowledge that domain knowledge is what moves the world forward in many ways.
I have established multiple companies, some of which have grown significantly with over 600 employees. For quite some time, I've transitioned from development and mainly held executive roles such as CEO, Chairman, etc. Simultaneously, it's intriguing to note that I've mostly been unsuccessful in securing 'normal' jobs through interviews in the past (Google, McKinsey, Bain, Accenture etc).
I believe this poses a fascinating topic on the way people assess creativity and intelligence in general.
From my perspective, the crux of the issue lies in the inherent difficulty of accurately measuring creativity in comparison to quick problem-solving skills during job interviews. Consequently, it seems that corporations tend to favor the latter.
@Exuma, this comment is ridiculously resonant with me, the part about 'learning transaction isolation for the 50th time' is very on point too.
Everything you said I pretty much feel the same way. I've accepted it as part of how I work, and the advantages are many (and valued by many) - but yes, interacting with deep experts usually ends with feeling a bit like a fraud. I feel like I maybe was an expert at whatever the thing is at some point in time, momentarily, but then I just shed the information as soon as the next thing needs to be done, and it just ends up as part of the background inference pattern matcher.
Certain things where I'm really forced to learn something deeply do stick, but I find my ways of thinking about that domain to be very different to most 'true' experts, and rely heavily on visual models and analogies with other concepts.
haha yes! What you mentioned about analogies... I must use like 50 analogies a day. I also noticed I can use phrases like "always" and "never" without a second of hesitation, because they are merely indications of magnitude in a predictive sense, not a literal interpretation. But someone who must understand information deeply never uses phrases like that, because they operate on observed knowledge and a sort of "hypothesis testing," like a scientist.
It's fun to realize other people are out there who can relate. Thanks for your comment
This was very well said and I don't know if you could have said it any better. FWIW, I'm a person with multiple degrees in CS, but the best programmers I've worked with, and who get stuff done, have zero degrees. I have eight years of hardcore programming experience, including professional and side-project stuff - I've learned more actually doing than in any classroom. Yeah, it's cool to know what a bubble sort is and how it compares to a merge sort, but knowing all the fine details isn't really needed for actually building things, especially now that we're at the point where an AI can give you the code along with complete instruction.
It sounds like you've done completely fine for yourself and built things that people want, so I would try not to be too hard on yourself.
We are generalists, as opposed to specialists. Generalists use information from experts across multiple domains. Specialists are the experts who build a particular domain.
As a culture, we look down on generalists - "A jack of all trades is master of none." However, a world full of specialists creates information silos, where experts solve the same problems over and over in isolation. This is where society is at the moment. We need more generalists to navigate these silos.
I really appreciate your response, thank you for sharing your perspective and experiences.
When I read your message I can't help but picture you as the storybook 'inventor' who is locked away inside his house with strange colored smoke coming out of the chimney and weird noises heard from the street, yet when the doors open the whole town gathers to see what you made.
Haha, that would be me! In some ways you are right about the strangeness. I actually work lying down... in my bed. It allows my mind to completely dissolve into the code or problem, as if I'm weightless. My business partners all joke that "uhhh yes, the next stop on our tour, well.. this is our CTO's office, but it's actually just a bed in there so we won't go in that room...." This gives me a good laugh every time.
I don’t appreciate the many bad-faith assumptions made, in particular the assertion that I’m not from a “strong academic background.” For what it’s worth, I made a mistake in a public forum and admitted to it twice. I’ve sought accurate versions of my response, which no one provided. Nevertheless I have continued to educate myself on the subject and only feel more confident that I wasn’t misinformed to the degree you all indicated.
You may not realize it, but not everyone on the internet is nefarious, and if you were to speak in this analytical way about, say, a classmate while they were within earshot, that person would likely be quite upset.
I'm not the biggest fan of LessWrong; I do, however, refer new developers who seem interested to the "Rationality: From AI to Zombies" sequences to help refine conscious development of rational thinking.
Also, I absolutely loved the fanfic "Harry Potter and the Methods of Rationality" written by the forum's creator.
** Edit **
And for the record, my comment wasn't intended to be an accurate depiction of you specifically, which honestly wasn't very effectively conveyed.
It was to highlight a common 'type of person' who makes authoritative statements in areas where they're not 'experts', through no malice on their part, but rather as a function of their default mode of behavior.
As others who replied to me highlighted, there are at least two of these people in the world and I wanted others to at least be aware of their existence and point of view; the end goal of this being they might offer others the benefit of the doubt and perhaps some constructive feedback instead of unproductive criticism in similar exchanges.
In essence, I thought other comments were being too hard on you and wanted to point out a potential scenario in which their critiques were at best unproductive.
> Yes, this is done in lieu of actually looking up extended details on what something means.
If they were capable of understanding the extended details, they would already have an academic background in the subject. Laymen aren't going to have a clue what MoE means even if they went to the trouble of digging up the paper.
> Many developers who are self taught utilize a similar method of learning, essentially using pattern recognition to make “educated guesses” that are then internalized as potential facts and tested at the earliest opportunity.
Using pattern recognition skills to make an educated guess that is internalized as a potential fact sounds an awful lot like what LLMs do. At least when humans do it, we bother with the verification step instead of just acting like we know what we're talking about.
> If they were capable of understanding the extended details, they would already have an academic background in the subject. Laymen aren't going to have a clue what MoE means even if they went to the trouble of digging up the paper.
This is, hopefully, an accidental thought experiment gone awry. "IF THEY WERE CAPABLE of understanding the extended details, they would already have an academic background in the subject" can and should == "I spent a ton of time in the library", plus a follow-up apology for putting "capable" and "academic background" in the same sentence.
The whole friggin' point of this glorified LAN is that we can break down those dumb walled gardens and let kids learn from random BBS textfiles, MIT YT videos and the gathered wisdom of HN.
If you are going to just dismiss autodidacts, you're going to have to rewrite the complete history of higher education in Western society. I won't even begin to try to convey how wrong this is for Eastern history as well.