More

dnhkng · 2026-03-24T17:21:25 1774372885

Thanks!

I have pushed basic code to GitHub (https://github.com/dnhkng/RYS)

Some interesting areas to explore might be a combination of deleting some layers and duplicating others. i.e. reduce VRAM by dropping some layer (this works, well documented), and recovering performance by duplicating others (saves VRAM). I am not pursuing this, but it seems interesting!

vessenes · 2026-03-24T18:37:40 1774377460

Thanks -- interesting. I like the idea of ablating layers. I guess you could get a differentiable stack that has a layer skip and layer copy/loop and a total memory use loss function; that would let someone ship either a big (usually ablate) or little (usually copy) model. The expert routing for longer sequences interests me a lot because the edge inference issue is always memory bandwidth.

dnhkng · 2026-03-24T14:56:24 1774364184

Author here: The code is up on GitHub.

The probes I used seem to help identify good configurations, but are quite noisey. A small probe set was initially used to make the scan tractable, and then the higher ranked models were retested on a set ~10x larger.

dnhkng · 2026-03-24T14:54:23 1774364063

Author here: That was done in this blog post, in the beam search. I started with the best re-layer configs, and iteratively added more blocks, including the same multiple times, during a long beam search.

It turns out this does not help (somewhat surprisingly).

skyde · 2026-03-24T15:37:18 1774366638

Actually not surprised. I guess this is for the same reason “say it twice” [1] is working. Because LLm are trained as causal language model, past token cannot attend to future token. One copy of the layer set solve this. [1]https://arxiv.org/html/2512.14982v1

coppsilgold · 2026-03-24T19:49:07 1774381747

It's possible that the gains are despite the noise the coarse process introduces. After two repetitions the noise may overwhelm the advantage.

The residual connections resemble the Euler method (this observation led to Neural ODE's IIRC) which isn't known to be exactly clean. If the model has been trained to be a particular number of layers, adding more layers will also add a lot of noise.

Ultimately, the LLM will need to be fine tuned with the loops or a looped architecture trained from scratch, such as: <https://ouro-llm.github.io> unfortunately they made the mistake of looping the entire LLM rather than just the center portion.

dnhkng · 2026-03-24T13:59:04 1774360744

There was some work done on this a while back, during the FrankenMerge craze of 23'

I am working with TurboDerp to integrate this into the Exllama v3 format.

sigbottle · 2026-03-24T17:05:57 1774371957

Wow, super interesting keywords. Are you a ML researcher? What kind of experiments do you do?

dnhkng · 2026-03-24T13:56:19 1774360579

Author here. Another thing I want to highlight: the language-agnostic "thinking space" finding came from Evan Maunder, who read Part 1 and ran an elegant experiment — same sentence in English, Mandarin, and Base64, cosine similarity at every layer. The representations converge by the early layers, stay nearly identical through the mid-stack, then diverge again at the end as the model commits to an output format.

I extended this to a 2×2 design (two languages × two content types) and the result is even starker: by layer 10, cross-language same-content pairs are more similar than same-language different-content pairs. The model cares about what you're saying, not what language you're saying it in.

This is also what makes layer duplication work — those mid-stack layers operate in a space where input and output distributions match, so you can loop through them without breaking anything. The encoding and decoding boundaries are where the blue walls show up in the heatmaps.

thesz · 2026-03-24T18:48:35 1774378115

  > The model cares about what you're saying, not what language you're saying it in.

What is the number of languages model is trained upon? And what is the number of training set sentences? I believe that these numbers are vastly different and cosine similarity is overwhelmingly biased by number of sentences.

What if we equalize number of languages and number of sentences in the training set? A galaxy-wise LLM, so to say.

Also, model can't help but care about language because your work shows divergence of cosine similarity at the decoding (output) stage(s).

hmokiguess · 2026-03-24T18:37:40 1774377460

I'm trying to understand what you said, can you please correct me if I'm wrong here.

Would this be sort of like saying the way embeddings of different primitives across languages end up distributed in a vector space all follow the same principles and "laws"?

For example, if I train a large corpus of english and, separately, a large corpus of spanish, in both cases the way language constructs that are equivalent across both will end up represented using the same vector space patterns?

canjobear · 2026-03-24T21:10:29 1774386629

This does seem to happen, at least close enough that it's possible to align embedding spaces across languages and do some translation without training on parallel texts.

1bpp · 2026-03-24T17:22:30 1774372950

A fun thing to do is convince a model to fluidly switch between character sets to express ideas as 'efficiently' as possible. It likes to use Chinese hanzi a lot for abstract concepts. I've also seen Gemini use them unprompted in the middle of an English sentence.

mikkupikku · 2026-03-24T18:09:27 1774375767

AIs code switching between human languages is cyberpunk AF.

theredsix · 2026-03-24T16:50:30 1774371030

Extrapolating the benchmarks, this would imply the best RYS 27B is capable of out performing the 397B MoE?

dnhkng · 2026-03-24T13:30:27 1774359027

Author here. The result that surprised me most: after evaluating 3,024 beam search candidates, training a surrogate model on ~4,600 measurements, and scoring 2 million configurations — the Pareto-optimal configs were all simple contiguous blocks. No exotic multi-block compositions, no sparse repeats. Just "repeat layers 31–33" and you're on the efficiency frontier.

I think this says something interesting about how transformers organise computation internally. The mid-stack reasoning circuits are coherent enough that you can loop through them twice without distribution mismatch. The encoding/decoding boundaries are not.

dnhkng · 2026-03-13T14:59:00 1773413940

Glad to see someone replicate the results already :)

hashmap · 2026-03-13T15:43:37 1773416617

im kind of wondering like what the ceiling would be on reasoning for something like the 1.5T models with the repeating technique, but they would take a long time to download. i think if you have them already it would take maybe an hour or so to check against a swath of prompts. whats the reasoningest open model at the moment?

my guess is that large models trained on large corpuses there is just some ceiling of "reasoning you can do" given the internal geometry implied by the training data, cause text is lossy and low-bandwidth anyway, and theres only really so much of it. past some point you just have to have models learning from real-world interactions and my guess is we're already kind of there.

dnhkng · 2026-03-13T16:41:12 1773420072

I stick with models I can run on VRAM, but DeepSeek Speciale have the best reasoning capabilities of the models I can actually run (https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Speciale). What hardware can you access?

I have Deepseek etc, but inferencing on DDR5 would take about 2-3 weeks for a simple scan. I think this works best with dense models, but it also seems ok with MoE.

@everyone: Can someone hook me up with Nvidia sponsorship?

hashmap · 2026-03-14T00:40:46 1773448846

oh neat ill check that one out. i dont get that much speedup from ssd/128gb unified vs vram if im doing like a predefined set of prompts, since i have it load it from disk anyway and im just doing one forward pass per prompt, and just like load part of it at a time. its a bit slower if im doing cpu inferencing but i only had to do that with one model so far.

but yeah on demand would be a lot of ssd churn so id just do it for testing or getting some hidden state vectors.

dnhkng · 2026-03-11T06:44:10 1773211450

But blogging is fun!

I do wish one of the big labs would sponsor with a rack of HGX Rubin NVL8's. I have lots of ideas to test, and I have probably hit the spending limit with the boss on hardware (she hasn't seen the new power bill yet...)

dnhkng · 2026-03-11T06:34:51 1773210891

Hi, thanks for the praise!

On the other papers, models like SOLAR or training a model that uses a single layers are probably going to hit a wall, based on the heatmaps I found. The transformer stack starts with randomised weights, (analogous to undifferentiated stem cells), and it seems they later form 'organs' during the trillions of pre-training tokens they undergo. My hypothesis is that you probably only want one copy of the 'token-to-thought', and 'thought-to-token' organs. It seems that you can make one layer do all three things (transforms in and out, and do the 'thinking'), but I think specialisation will always win.

dnhkng · 2026-03-11T06:26:23 1773210383

Cheers. I will go back though my other old projects (optogenetics, hacking Crispr/CAS9 etc), and put them on my blog.

On your questions: 1) A few other papers have been mentioned in the thread, like Solar10.7B. They duplicated the whole transformer stack, and it kinda helped. But as I found experimentally, that probably not a great idea. You are duplicating 'organs' (i.e. input processing stuff), that should only have one copy. Also, that paper didn't see immediate improvements; they had to do continued pre-training to see benefits. At that point, I'm guessing the big labs stopped bothering. Limited by hardware, I had to find unusual angles to approach this topic.

2) Nah, no more wetware for me. I did a half decade of research at a big neurobiology institute, and while it was very enjoyable, I can truly say that grant writing and paper review are 'not my thing'. This reason this info was delayed so long is that I wanted a paper in the AI field to go along with my papers in other fields. But as a Hobbyist with no official affiliation, and the attention span of a gnat, I gave up and started a blog instead. Maybe someone will cite it?