My experience is that all LLMs that I have tested so far did a very good job producing D code.
I actually think that the average D code produced has been superior to the code produced for the C++ problems I tested. This may be an outlier (the problems are quite different), but the quality issues I saw on the C++ side came partially from the ease with which the language enables incompatible use of different features to achieve similar goals (e.g. smart_ptr vs. new/delete).
The result is hardly a clean room implementation. It was rather a brute-force attempt to decompress fuzzily stored knowledge contained within the network, and it required close steering (using a big suite of tests) to get a reasonable approximation of the desired output. The compression and storage happened during the LLM training.
Nobody disputes that the LLM was drawing on knowledge in its training data. Obviously it was! But you'll need to be a bit more specific with your critique, because there is a whole spectrum of interpretations, from "it just decompressed fuzzily-stored code verbatim from the internet" (obviously wrong, since the Rust-based C compiler it wrote doesn't exist on the internet) all the way to "it used general knowledge from its training about compiler architecture and x86 and the C language."
Your post is phrased like it's a two-sentence slam-dunk refutation of Anthropic's claims. I don't think it is, and I'm not even clear on what you're claiming precisely, except that LLMs use knowledge acquired during training, which we all agree on here.
"clean room" usually means "without looking at the source code" of other similar projects. But presumably the AIs training data would have included GCC, Clang, and probably a dozen other C compilers.
Suppose you, the human, are working on a clean room implementation of a C compiler. How do you go about doing it? Will you need to know about a) the C language, and b) the inner workings of a compiler? How did you acquire that knowledge?
It doesn't matter how you gained general knowledge of compiler techniques, as long as you don't have specific knowledge of the implementation of the compiler you are reverse engineering.
If you have ever read the source code of the compiler you are reverse engineering, you are by definition not doing a clean room implementation.
Claude was not reverse engineering here. By your definition no one can do a clean room implementation if they've taken a recent compilers course at university.
Claude was reverse engineering gcc. It was using it as an oracle and attempting to exactly match its output. That is the definition of reverse engineering. Since Claude was trained on the gcc source code, that's not a clean room implementation.
> By your definition no one can do a clean room implementation if they've taken a recent compilers course at university.
Clean room implementation has a very specific definition. It’s not my definition. If your compiler course walked through the source code of a specific compiler then no you couldn’t build a clean room implementation of that specific compiler.
There is no specific definition of clean room implementation. Please provide a source for your claim otherwise.
There are many well known examples of clean room implementation. One example that survived lawsuits is Sony v. Connectix:
During production, Connectix unsuccessfully attempted a Chinese wall approach to reverse engineer the BIOS, so its engineers disassembled the object code directly. Connectix's successful appeal maintained that the direct disassembly and observation of proprietary code was necessary because there was no other way to determine its behavior - [0]
That practice is similar to GCC being used here to verify the output of the generated compiler, arguably even more intrusive.
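To make the "oracle" part concrete: differential testing against a reference compiler usually looks something like the sketch below. This is only an illustration; the candidate compiler name (mycc) and the tests/ corpus are hypothetical.

    import subprocess
    import sys
    from pathlib import Path

    ORACLE = "gcc"        # the known-good reference compiler
    CANDIDATE = "./mycc"  # hypothetical name for the compiler under test

    def compile_and_run(compiler: str, source: Path, exe: str) -> str:
        """Compile `source` with `compiler`, execute the binary, return stdout."""
        subprocess.run([compiler, str(source), "-o", exe], check=True)
        return subprocess.run([exe], capture_output=True, text=True).stdout

    for source in sorted(Path("tests").glob("*.c")):  # hypothetical test corpus
        expected = compile_and_run(ORACLE, source, "/tmp/oracle_bin")
        actual = compile_and_run(CANDIDATE, source, "/tmp/candidate_bin")
        if expected != actual:
            print(f"mismatch on {source}: {expected!r} vs {actual!r}")
            sys.exit(1)
    print("candidate agrees with the oracle on all tests")

Nothing in that loop requires (or reveals) anything about gcc's internals; it only compares observable behavior, which is why oracle-style testing by itself is compatible with a clean room process.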
“clean room implementation” is a term of art with a specific meaning. It has no statutory definition though so you’re technically right. But it is a defense against copyright infringement because you can’t infringe on copyright without knowledge of the material.
>During production, Connectix unsuccessfully attempted a Chinese wall approach to reverse engineer the BIOS, so its engineers disassembled the object code directly.
This doesn’t mean what you think it means. They unsuccessfully attempted a clean room implementation. What they did do was later ruled to be fair use, but it wasn’t a clean room implementation.
Using gcc as an oracle isn’t what makes it not a clean room implementation. Prior knowledge of the source code is what makes it not a clean room implementation. Using gcc as an oracle makes it an attempt to reverse engineer gcc, it says nothing about whether it is a clean room implementation or not.
There is no definition of “clean room implementation” that allows knowledge of source code. Otherwise it’s not a clean room implementation. It’s just reverse engineering/copying.
Again, reverse engineering is a valid use case of clean room implementation as I posted above, so you don't have a point there.
> “clean room implementation” is a term of art with a specific meaning.
What is the specific meaning you are talking about? If I set out to do a clean room implementation of some software, what do I need to do, specifically, so that I will prevail against any copyright infringement claims? The answer is that there is no such surefire guarantee.
Re: Sony v. Connectix, clean room is meant to protect against copyright infringement, and since Connectix was ruled not to be infringing on Sony's copyrights, their implementation is practically clean room under the law, despite all the pushback. Since Connectix prevailed, I'm sure the C compiler in question would prevail as well if it got sued.
Finally, take Phoenix vs. IBM, re: the former's implementation of the latter's PC BIOS:
Whenever Phoenix found parts of this new BIOS that didn't work like IBM's, the isolated programmer would be given written descriptions of the problems, but not any coded solutions that might have hinted at IBM's original version of the software - [0]
That very much sounds like using GCC as an online known-good compiler oracle to compare against in this case.
You’re getting confused because you are substituting the goal of a clean room implementation for its definition. And you are not understanding that “clean room implementation” is one specific type of reverse engineering.
The goal is to avoid copyright infringement claims. A specific clean room implementation may or may not be successful at that.
This does not mean that any reverse engineering attempt that successfully avoids copyright infringement was a clean room implementation.
A clean room implementation is a specific method of reverse engineering where one team writes a spec by reviewing the original software and the other team attempts to implement that spec. The entire point is so that the 2nd team has no knowledge of proprietary implementation details.
If the 2nd team has previously read the entire source code that defeats the entire purpose.
> That very much sounds like using GCC as an online known-good compiler oracle to compare against in this case.
Yes and that is absolutely fine to do in a clean room implementation. That’s not the part that makes this not a clean room implementation. That’s the part that makes it an attempt at reverse engineering.
> you are by definition not doing a clean room implementation.
This makes no sense. Reverse engineering IS an application of clean room implementation. Citing Wikipedia:
“Clean-room design (also known as the Chinese wall technique) is the method of copying a design by reverse engineering and then recreating it without infringing any of the copyrights associated with the original design”
The result is a fuzzy reproduction of the training input, specifically of the compilers contained within it. The reproduction in a different, yet still similar enough, programming language does not refute that. The implementation was strongly guided by a compiler and a suite of tests acting as an explicit filter on those outputs, limiting the acceptable solution space and excluding unwanted interpolations of the training set that also result from the lossy input compression.
The fact that the implementation language for the compiler is Rust doesn't factor into this. ML-based natural language translation has proven that model training produces an abstract space of concepts internally that maps from and to different languages on the input and output side. All this points to is that there are different implicitly formed decoders for the same compressed data embedded in the LLM, and the keyword "Rust" in the input activates one specific to that programming language.
Checking for similarity with compilers that consist of orders of magnitude more code probably doesn't reveal much. There are many smaller compilers for C-adjacent languages out there, plus code fragments from textbooks.
Thanks for elaborating. So what is the empirically-testable assertion behind this… that an LLM cannot create a (sufficiently complex) system without examples of the source code of similar systems in its training set? That seems empirically testable, although not for compilers without training a whole new model that excludes compiler source code from training. But what other kind of system would count for you?
I personally work on simulation software and create novel simulation methods as part of the job. I find that LLMs can only help if I reduce the task to a translation of detailed algorithm descriptions from English to code. And even then, the output is often riddled with errors.
If all it takes is "trained on the Internet" and "decompress stored knowledge", then surely GPT-3, 3.5, 4, 4.1, 4o, o1, o3, o4, 5, 5.1, 5.x should have been able to do it, right? Claude 2, 3, 4, 4.1, 4.5? Surely.
Well, "Reimplement the c4 compiler - C in four functions" is absolutely something older models can do. Because most are trained, on that quite small product - its 20kb.
But reimplementing that isn't impressive, because its not a clean room implementation if you trained on that data, to make the model that regurgitates the effort.
This comparison is only meaningful with comparable numbers of parameters and context window tokens. And then it would mainly test the efficiency and accuracy of the information encoding. I would argue that this is the main improvement over all model generations.
Perhaps 4.5 could also do it? We don't really know until we try. I don't trust the marketing material that much. The fact that previous (smaller) versions could or couldn't do it does not really prove or disprove that claim.
Even with 1 TB of weights (probable size of the largest state of the art models), the network is far too small to contain any significant part of the internet as compressed data, unless you really stretch the definition of data compression.
Take the C4 training dataset for example. The uncompressed, uncleaned, size of the dataset is ~6TB, and contains an exhaustive English language scrape of the public internet from 2019. The cleaned (still uncompressed) dataset is significantly less than 1TB.
I could go on, but I think it's already pretty obvious that 1TB is more than enough storage to represent a significant portion of the internet.
A lot of the internet is duplicate data, low quality content, SEO spam etc. I wouldn't be surprised if 1 TB is a significant portion of the high-quality, information-dense part of the internet.
I was curious about the scale of 1TiB of text. According to WolframAlpha, it's roughly 1.1 trillion characters, which breaks down to 180.2 billion words, 360.5 million pages, or 16.2 billion lines. In terms of professional typing speed, that's about 3800 years of continuous work.
So post-deduplication, I think it's a fair assessment that a significant portion of high-quality text could fit within 1TiB. Though 'high-quality' is a pretty squishy and subjective term.
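Out of curiosity, here is the rough arithmetic behind those figures; the characters-per-word/page/line constants are my own assumptions, but they reproduce the numbers above:

    # Rough reconstruction of the 1 TiB estimates; the chars-per-word/page/line
    # constants below are assumptions, chosen to match typical English text.
    chars = 2**40                 # 1 TiB at ~1 byte per character -> ~1.10e12
    words = chars / 6.1           # ~6.1 chars per word incl. space -> ~1.80e11
    pages = chars / 3050          # ~3050 chars per page            -> ~3.61e8
    lines = chars / 68            # ~68 chars per line              -> ~1.62e10
    wpm = 90                      # professional typist, words per minute
    years = words / wpm / 60 / 24 / 365.25   # ~3.8e3 years of nonstop typing
    print(f"{chars:.3g} chars, {words:.3g} words, {pages:.3g} pages, {lines:.3g} lines")
    print(f"about {years:,.0f} years of continuous typing at {wpm} wpm")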
This is obviously wrong. There is a bunch of knowledge embedded in those weights, and some of it can be recalled verbatim. So, by virtue of this recall alone, training is a form of lossy data compression.
I challenge anyone to try building a C compiler without a big suite of tests. Zig is the most recent attempt and they had an extensive test suite. I don't see how that is disqualifying.
If you're testing a model I think it's reasonable that "clean room" have an exception for the model itself. They kept it offline and gave it a sandbox to avoid letting it find the answers for itself.
Yes, the compression and storage happened during training. Before, it still didn't work; now it does much better.
The point is - for a NEW project, no one has an extensive test suite. And if an extensive test suite exists, it's probably because the product that uses it also exists, already.
What if it could translate the C++ standard INTO an extensive test suite that actually captures most corner cases and doesn't generate false positives - again, without internet access and without using gcc as an oracle, etc.?
How? The printer only ever retrieves G code for individual parts without any knowledge of what they are going to be assembled into. There is no viable way to solve this classification problem on this kind of incomplete data, is there?
That's broadly how it works today, yes: The printer itself has no concept of what it is printing. It's just running some heaters and spinning some motors in response to gcode.
Since such a printer is incapable of determining whether or not this gcode represents a legislatively-restricted item and then blocking its production, then that machine becomes illegal to sell in New York. Easy-peasy. It just takes a quick vote or two and the stroke of a pen, and it is done.
You're probably thinking something like "But that doesn't work at all," and I agree. But sometimes legislators just don't care that they've thrown out the baby along with the bathwater.
It depends on how you define the problem. Certainly a human can look at a part and say "that's a lower receiver", but you probably can make something that functions as a firearm exclusively from inconspicuous parts. For the more limited case, an AI can definitely be trained; the broader case is likely unsolvable.
It's not nearly that hard of a problem. There are n gun files on the internet, so validate the hash of those n files (G-code, whatever). These people aren't CADing their own designs.
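For what it's worth, the naive version of that check is just a digest blocklist. A minimal sketch (the digest below is a placeholder, just the SHA-256 of an empty file, not real data):

    import hashlib
    from pathlib import Path

    # Hypothetical blocklist: SHA-256 digests of the n known files.
    BLOCKED = {
        "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
    }

    def is_blocked(path: Path) -> bool:
        """Refuse to print any file whose digest is on the blocklist."""
        return hashlib.sha256(path.read_bytes()).hexdigest() in BLOCKED

The replies below explain why this falls apart in practice: the printer never sees the model file itself, and every re-slice produces different bytes.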
One big part of this is that gcode isn't really a 3D model; it's a set of instructions on how to move the printhead around.
You don't download the gcode directly, because that varies by printer. You download a model, and then a slicing program turns that into a set of printer-specific gcode. Any subtle settings changes would change the hash of this gcode.
And the printer doesn't really know what the model is. It would have to reverse the gcode instructions back into a model somehow. The printer isn't really the place to detect and prevent this sort of thing imo. Especially with how cheap some 3d printers are getting, they often don't really have much compute power in them. They just move things around as instructed by the g-code. If the g-code is malformed it can even break the printer in some instances, or at least really screw up your print.
There are even scripts that modify the gcode to do weird things the printer really isn't designed for, like print something and then have the printer move in such a way to crash into and push the printed object off the plate, and then start over and print another print. The printer will just follow these instructions blindly.
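To make that concrete, here is a toy sketch of what firmware actually does with gcode (heavily simplified assumptions: absolute positioning, only G0/G1 linear moves, no feed rates or heaters). The point is that it only ever sees bare axis targets, never anything model-shaped:

    # Toy interpreter for the motion subset of G-code; real firmware handles
    # feed rates, heaters, relative moves, and much more.
    def execute(gcode: str) -> None:
        position = {"X": 0.0, "Y": 0.0, "Z": 0.0, "E": 0.0}
        for line in gcode.splitlines():
            line = line.split(";")[0].strip()      # drop comments and whitespace
            if not line:
                continue
            words = line.split()
            if words[0] in ("G0", "G1"):           # linear move
                for word in words[1:]:
                    axis, value = word[0], float(word[1:])
                    if axis in position:
                        position[axis] = value
                print(f"move to {position}")       # real firmware drives steppers here

    execute("G1 X10 Y20 E0.4 ; no hint of what part this belongs to")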
Given that quite simple G-code, say a pair of nested circles with code for tool changes/accessory activation, can make two wildly different parts depending on which machine it is run on:
- a washer if run on a small machine in metric w/ flood coolant
- a lamp base if run on a larger router in Imperial w/ a tool changer
and that deriving what will be made by a given G-code file in 3D is a problem which the industry hasn't solved in decades, the solution of which would be worthy of a Turing Award _and_ a Fields Medal, I don't see this happening.
A further question, just attempting it will require collecting a set of 3D models for making firearms --- who will persuade every firearms manufacturer to submit said parts, where/how will they be stored, and how will they be secured so that they are not used/available as a resource for making firearms?
A more reasonable bit of legislation would be that persons legally barred from owning firearms are barred from owning 3D printers and CNC equipment unless there is a mechanism to submit parts to their parole officer for approval before manufacturing, since that's the only class of folks to which the 2nd Amendment doesn't apply, and a reasonable argument is:
1st Amendment + 2nd Amendment == The Right to 3D Print and Bear Arms
Guns can be made out of simple geometric shapes like tubes, blocks, and simple machines like levers and springs. There is mathematically no way to distinguish a gun part from a part used in home plumbing - in fact you can go to the plumbing section of your local hardware store and buy everything you need to build a fully functional shotgun.
In 3D modeling, there are parametric files where the end user is expected to modify the input parameters to fit their needs. So for example, if you have multiple parts that need to fit together, you may need to adjust the tolerances for that fit, because the physical shape will vary depending on your printer settings and material.
Making tiny modifications isn't just a method of circumvention; it's part of the main workflow of using a 3D model.
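As a toy illustration (the function and the clearance values are invented for this example):

    # Hypothetical parametric part: the user tunes `clearance_mm` to their
    # printer and material before slicing; the values here are made up.
    def press_fit(pin_diameter_mm: float, clearance_mm: float = 0.2):
        """Return (pin, hole) diameters for a press-fit joint."""
        return pin_diameter_mm, pin_diameter_mm + clearance_mm

    print(press_fit(5.0, clearance_mm=0.15))  # well-tuned printer, rigid PLA
    print(press_fit(5.0, clearance_mm=0.30))  # looser fit for flexible filament

Two people printing the "same" design end up with different geometry, different sliced gcode, and therefore different file hashes.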
This is an extremely bold claim and I think that it completely overlooks how Photoshop is used by professionals in practice. Professional users want extremely fine grained and precise control over their tools to achieve the specific results that they want. AI "image editing" is incapable of providing anything remotely similar.
There is one big argument against these "good enough" solutions: commercial business software providers need to put a lot of R&D into finding generalized workflows that apply to as many clients as possible. Effectively, they find and encode current standard practices into their products. This is valuable from a business operations perspective in two ways: it's a good bet that transitioning the customer's operations to match the software is cleaning up internal processes, and it makes onboarding new employees easier because the tools and workflows should be much more familiar right from the start.
I don't find this surprising. Code and data models encode the results of accumulated business decisions, but nothing about the decision-making process or rationale. Most of the time, this information is stored only in people's heads, so any automated tool is necessarily blind.
This succinctly captures one of the key issues with (current) AI actually solving real problems outside of small "sandboxes" where it has all the information.
When an AI can email/message all the key people who have the institutional knowledge, ask them the right discovery questions (probably over a few rounds, working out which bits are human "hallucinations" that don't make sense), collect that information, and use it to create a solution, then human jobs will be in real trouble.
Until then, the AI is just a productivity boost for us.
The AI will also have to be trained to be diplomatic and maybe even cunning, because, as I can personally attest, answering questions from an AI is an extremely grating and disillusioning experience.
There are plenty of workers who refuse to answer questions from a human until it's escalated far enough up the chain to affect their paycheck/reputation. I'm sure the fact that the intelligence is artificial will only multiply the disdain/noncompliance.
But then maybe there will be strategies for masking from where requests are coming, like a system that anonymizes all requests for information. Even so, I feel like there would still be a way that people would ping / walk up to their colleague in meatspace and say “hey that request came from me, thanks!”
I still expect this feature to roll out worldwide with some legalese fine print that the customer is responsible for configuring and operating the product "in accordance with local laws". I'd be really surprised if MS handles this differently.