Hardly seems worth the effort, perhaps things have improved since 2019. It would be interesting to see an updated benchmark, but if your going to end up with code that looks like C++ to get proper performance, you might as well write it in C++. My biggest problem with Julia is that they decided to use column-major indexing for multi-dimensional arrays (i.e. FORTRAN/MATLAB style). This makes interoperability with C/C++ and python numpy a real pain, since you can't do zero-copy array sharing between the two without one side being forced into strided-access. For that reason alone I haven't adopted it in any of my work-flows.
Actually the column-major order of Fortran is more efficient for some linear algebra operations than the order of C, which has been inherited by many modern languages that do not care about high performance in scientific computations.
So I would say that the culprit for interoperability is C and its descendants, not Fortran or Julia. The designers of C and of the languages that have imitated C have not given any thought about which order for multi-dimensional arrays is better, so the users of such languages do not have any right to blame for interoperability other languages that have done the right thing. Even if the Fortran order had not been better, it had already been used for 20 years before C, so there was no reason to choose a different order.
C has chosen to store arrays in the order in which they are typically read by humans when written on paper, but this is a choice like the choice between big-endian and little-endian, where big-endian was how Europeans wrote numbers, but little-endian is more efficient on computers.
An example of why column-major order is preferable, is the matrix-vector product, i.e. the evaluation of a function that maps linear spaces.
The matrix-vector product should not be done as it is typically taught in schools, by scalar products of rows of the matrix with the vector, because this is less efficient, by making more memory accesses. The right way to compute a matrix-vector product is by doing AXPY operations between columns of the matrix and the vector operand (segments of the output of the AXPY operations are held in registers until all partial AXPY operations are accumulated, avoiding memory accesses). In this case, you need to read columns of the input matrix for each AXPY operation, which is much more efficient when the elements of a column are stored compactly in memory, avoiding the need of strided accesses.
The same thing happens for matrix-matrix products, which must not be done in the naive way taught in schools, by scalar products of rows of the first matrix with columns of the second matrix, but it must be done by tensor products of columns of the first matrix with rows of the second matrix.
> Actually the column-major order of Fortran is more efficient for some linear algebra operations than the order of C, which has been inherited by many modern languages that do not care about high performance in scientific computations.
This is a plausible assumption to make but unfortunately it is not true at large. Especially when the traditional sizes are exceeded say n >= 2000 certain operations such as LU can be improved in terms of performance with C-major arrays. However the correct statement is you lose at some place you win at other. There are certainly linalg operations that F-major can give you more performance. However it is also true for C-major layout.
In your example matrix vector product or any BLAS2 or BLAS3 level operations you can also swap out the for loop order to convert things around (row*col buffer multiplication vs sum of weighted column sum interpretation). In particular matrix norm operations are the only exceptions (abs column sum, row abs sum etc.) that certain norms prefer certain orders. In fact if you go into the Goto method deep enough you'll see that internal order is a bit like Morton ordering to fit things into L1 Cache.
The reason why column-major is preferred is historical and requires more surgery to get it running with C-major ordering. Trust me I tried but it's too much work to gain not so much. Maybe someday when I retire I can attempt it. Hence I kept it column major in my retranslation of LAPACK https://github.com/ilayn/semicolon-lapack
Instead I implemented a "high"-performance AVX2 matrix transpose operation so that swapping the memory layout is trivial compared to the linalg cost.
Just reverse the axis on one side, typically the Julia side. This is the convention used in Lux.jl/Flux.jl. I share memory between the two with zero additional copying for my workflows on a daily basis. If you are really allergic to doing this, I’m sure it’s possible to use metaprogramming / the type system to write it the same way in both places with zero performance overhead.
I have a old supermicro X10DRU-i server with two Tesla V100's (48 GB VRAM) and 128GB RAM and have been running qwen3.6-27B with a lot of success. I would say it's performance on my use case (modifying and extending a ~70kloc C++ code base) has been excellent. I have no benchmarks, but it seems comparable to claude sonnet 4.6 in capabilities. I run it with llama.cpp:
I regularly get ~22tok/s when context utilization is below <65k, but it does slow done to ~13tok/s when the context is nearly full (lots of swapping to RAM). I have been using the qwen-code harness though, since it is far more token efficient than claude-code which injects massive prompts that chew up the context window. I plan on trying it with pi next.
I'm keeping my ~$20/mo claude subscripts for the planning prompts, and then hand it off to qwen for implementation. It's been working well so far.
I know. I'm struggling to understand how this is a github repo/HN article. I've been using claude-code with a llama.cpp server and a dummy API key, and all that is required is to define 2 environmental variables to point claude at the local endpoint. Am I missing something?
We might already be there. I've been running Qwen-3.6-27B with 8-bit quantization locally with llama.cpp (~100k context window), and to be honest for my use case, 40-50% of the time it is more usable than claude-code. I only have the $20/mo plan, so I often hit rate limits after 2-3 prompts. And while the local model is slower, it just keeps chugging, is practically free, and more often than not produces code similar to claude. I wouldn't be surprised if in 6-12 months we have local models which are comparable to opus 4.6...which I personally consider as a tipping point where agentic coding became practical.
> At the same time, new versions of FFmpeg brought support for new codecs and file formats, and reliability improvements, all of which allowed us to ingest more diverse video content from users without disruptions.
While it is good they worked to get their internal improvements into upstream, and this is certainly better behavior than some other unmentioned tech giants. It makes one wonder (since they are presumably running it tens of billions of times per day), if they were involved in supporting these improvements all along. If not, why not?
Neat, but also hilarious! Searching for "mug" gives results where the first item listed (ABC-00008297) is a mug model with a hole not only in the top to pour in your drink, but also in the side and bottom (just in case you wanted more access to your liquid).
Not the first one... but the "pair of interconnected mugs" that is described as "emphasizes connectivity and collaboration, suitable for serving beverages in a shared or communal setting" is pretty amazing too. I never knew I was missing out on communal mug holding.
The first item returned by a search on "couch" is ABC-00927226, which "features a rectangular form resembling a couch, characterized by a solid, box-like structure." In other words: it's a box.
HOWEVER: We KNOW it's a COUCH, because: "One side prominently displays the word "COUCH" vertically, rendered in a bold, modern font."
Nice write up! For this sort of thing, I have leaned towards AMD Epyc, Intel e810, and DPDK for the software stack. Unfortunately, lately the supermicro H13SSL line of mobo's appear to have become near-unobtainable with ridiculous 6+ month lead times.
No idea, you can still get one-off boards here and there, but buying anything in quantity has been tricky. I can only surmise supermicro's resources are largely tied up with AI data center build out, with everything else relegated to short runs.
My comment was directed at your statement "that at no point in history was the US dollar 100% backed by gold (or silver, originally). Never." which is entirely false. The US dollar (the currency unit), was at one point quite literally an exact weight of silver. This is no longer the case, but it was true in the past. This has nothing to do with the nation debt, or what is backing it. Obviously, the national debt isn't collateralized by precious metals or anything else except the military and power to raise taxes.