
There were plenty of models the size of GPT-3 in industry.

The core insight behind ChatGPT was not scaling (that was already widely accepted): it was that instead of finetuning for each individual task, you can finetune once for the meta-task of instruction following, which puts the problem specification directly into the data stream.


Assuming this is real: why do you think Anthropic was put on what is essentially an "enemy of the state" list and OpenAI wasn't?

The two things Anthropic refused to do are mass surveillance and autonomous weapons. So why do _you_ think OpenAI also refused and still did not get placed on the exact same list?

It's fine to say "I'm not going to resign, I didn't even sign that letter", but thinking that OpenAI can get away with not developing autonomous weapons or mass surveillance is naive at best.


It might be that they pay less for Anthropic depending on how many tokens each model generates: total cost is price per token times number of tokens. I haven't checked GPT-5, but it is not impossible that, once you account for the reasoning tokens used, the two are price-wise very comparable.


Is it possible that regardless of what they pay they think Anthropic is negative margin on it?


This is essentially the principle behind algebraic effects (which, in practice, do get implemented as delimited continuations):

When you have an impure effect (e.g. checking a database, generating a random number, writing to a file, making a nondeterministic choice, ...), instead of directly implementing the impure action, you use a symbol, e.g. "read", "generate number", ...

When executing the function, you also provide a context of "interpreters" that map each symbol to whatever action you want. This is very useful, since the actual business logic can be analyzed in isolation. For instance, if you want to test your application, you can use a dummy interpreter for "check database" that returns whatever values you need for testing, without ever touching an actual SQL database. It also lets you switch backends rather easily: if your database layer uses the symbols "read", "write", and "delete", then you just need to implement those calls in your new backend. If you want to formally prove properties of your code, you can do that too, by stating the properties of your symbols, e.g. `∀ key. read (delete key) = None`.
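
As a rough illustration of the testing point (a toy sketch in Python; `lookup_user` and the handler names are invented, not any particular effects library):

```
# Business logic reaches the outside world only through symbolic
# effects; the `handlers` dict plays the role of the interpreter.
def lookup_user(handlers, user_id):
    row = handlers["read"]("users", user_id)  # symbolic "read", no SQL here
    return row["name"] if row else None

# A dummy interpreter for tests: no database required.
fake_db = {("users", 1): {"name": "alice"}}
test_handlers = {"read": lambda table, key: fake_db.get((table, key))}

assert lookup_user(test_handlers, 1) == "alice"
assert lookup_user(test_handlers, 2) is None
```

Swapping in a real backend just means binding "read" (and "write", "delete", ...) to actual database calls.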

Since you always capture the symbol through an interpreter, you can also do fancy things like dynamically overriding the interpreter: to implement a seeded random number generator, you can have an interpreter that always overrides itself with the new seed. The interpreter would look something like this:

```
Pseudorandom_interpreter(seed)(argument, continuation):
  rnd, new_seed <- generate_pseudorandom(seed, argument)
  with Pseudorandom_interpreter(new_seed):
    continuation(rnd)
```

You can clearly see the continuation-passing style and the power of an interpreter that overrides itself. In fact, this is a nice way of handling state in a pure way: just put something other than new_seed into the new interpreter.
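
Here is a small runnable sketch of that idea in Python (entirely a toy: generators stand in for the delimited continuation, and `run`, `prng_handler`, etc. are names I made up):

```
# A handler returns both the result of the effect and the handler to
# use from then on -- exactly the self-overriding trick above.
def run(gen, handler):
    # Drive an effectful generator, dispatching each yielded request.
    try:
        request = next(gen)
        while True:
            effect, args = request
            result, handler = handler(effect, args)  # handler may replace itself
            request = gen.send(result)               # resume the continuation
    except StopIteration as stop:
        return stop.value

def prng_handler(seed):
    def handle(effect, args):
        assert effect == "rand"
        (bound,) = args
        new_seed = (1103515245 * seed + 12345) % 2**31  # toy LCG step
        # Resume with the value, and re-install ourselves with the new seed.
        return new_seed % bound, prng_handler(new_seed)
    return handle

def roll_two_dice():
    a = yield ("rand", (6,))  # perform the symbolic "rand" effect
    b = yield ("rand", (6,))
    return a + b

print(run(roll_two_dice(), prng_handler(seed=42)))  # deterministic replay
```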

If you want to debug a state machine, you can use an interpreter like this

```
replace_state_interpreter(state)(new_state, continuation):
  with replace_state_interpreter(new_state ++ state):
    continuation(head state)
```

to trace the state. This way the "state" always holds the entire history of state changes, which can be very nice for debugging. During deployment, you can then use a different interpreter

```
replace_state_interpreter(state)(new_state, continuation):
  with replace_state_interpreter(new_state):
    continuation(state)
```

which just holds the current state.
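
Reusing the toy `run` machinery from the Python sketch above (so this runs only together with it, and the names are again invented), the tracing variant might look like:

```
def tracing_state_handler(history):
    # `history` is the full trail of states, newest first.
    def handle(effect, args):
        if effect == "put":
            (new_state,) = args
            # Prepend instead of replace, keeping every transition.
            return history[0], tracing_state_handler([new_state] + history)
        assert effect == "get"
        return history[0], tracing_state_handler(history)
    return handle

def machine():
    old = yield ("put", ("running",))  # resumes with the previous state
    yield ("put", ("halted",))
    return old

print(run(machine(), tracing_state_handler(["init"])))  # -> init
```

Dropping the prepend gives the deployment version that only keeps the current state.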


That's really interesting. This seems like a really good approach for combining an otherwise pure finite state machine with state transitions that rely on communicating with external systems.

Normally I emit tokens to a stack which are consumed by an interpreter, but then it's a bit awkward to feed the results back into the FSM; it feels like decoupling just for the sake of decoupling, even though the two systems need to be maintained in parallel.

I'll have to explore this approach, thank you!


Once you have strong normalization you can just check local confluence and use Newman's lemma to get confluence. That should be pretty easy: build the critical pairs from all n^2 rule combinations and run them to termination (which you have already proven). If those pairs are joinable, so is the full rewriting scheme.
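
For reference, the statement being used (my paraphrase of Newman's lemma, in standard notation):

```
% Newman's lemma: a terminating (strongly normalizing) relation that is
% locally confluent is confluent.
\mathrm{SN}(\to) \;\wedge\; \mathrm{WCR}(\to) \;\Longrightarrow\; \mathrm{CR}(\to)

% Local confluence (WCR): one-step divergences rejoin.
a \to b \;\wedge\; a \to c \;\Longrightarrow\; \exists d.\; b \to^{*} d \;\wedge\; c \to^{*} d

% Confluence (CR): arbitrary divergences rejoin.
a \to^{*} b \;\wedge\; a \to^{*} c \;\Longrightarrow\; \exists d.\; b \to^{*} d \;\wedge\; c \to^{*} d
```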


That is a new one to me. Tracked the reference back to https://www.jstor.org/stable/1968867 which looks excellent. Thank you!


That entirely depends on which AMD device you look at: gaming GPUs are not well supported, but their Instinct line of accelerators works just as well as CUDA. Keep in mind that, in contrast to Nvidia, AMD uses different architectures for compute and gaming (though they are changing that in the next generation).


To expand on that: there's also the issue that these games have to be (somewhat) competitive multiplayer games: multiplayer because otherwise there's no way to create enough content, and competitive because otherwise there's less reason to play the game for long periods of time.

If you've ever played a dead or dying competitive game as a newcomer, you know the problem this creates: since the people who stay around are either new or very dedicated, the skill gap becomes gigantic, which turns off most new players.

If your game wins the live-service race, you draw other players in. If your game dies, the very same structure that keeps players around will prevent new players from joining.


There are alternatives to iron that have higher efficiency and lower prices. For instance, https://hydrogenious.net/ does exactly that but with benzene-like structures. The advantage is that you can reuse existing transport infrastructure and you get higher transport efficiency: while the square-cube law exists, the same scaling applies to the forces on the chamber walls, which have to grow in thickness. Hydrogen tanks are also very expensive, as they have to be manufactured to tight tolerances (and they need to be replaced rather often due to hydrogen embrittlement weakening the chamber walls).


2008 is ancient for optimization!

People have tested old year-2000 LP and MILP solvers against recent ones while correcting for hardware. Hardware improvements accounted for a ~20x speedup, while LP solvers overall sped up 180x and MILP solvers a full 1000x ("Progress in mathematical programming solvers from 2001 to 2020").

Solvers from 2008 are at an entirely different level of performance: there are many problems unsolvable by them that modern solvers solve to zero duality gap in less than a second.

For MINLPs the difference is even more striking. This doesn't mean those books are useless (they are quite good), but do not expect a solver based on those techniques to play in the same league as modern solvers.


Can you send me some of these results? I am pretty skeptical of such dramatic algorithmic improvements.

I don't think the point of an encyclopedia is to cover every single topic, as nice as that would be. If you're in the market for an encyclopedia, you are probably looking for a starting point, survey, or summary of stuff that's good to know. The algorithms you're thinking of are probably in very dry papers and monographs, accessible only to experts. If you were writing a commercial-grade generic MINLP solver, you would surely be looking at the latest papers for ideas, or you simply wouldn't be competitive with existing solvers.


The paper I mentioned can be found here: https://arxiv.org/pdf/2206.09787

There are so many techniques that were only invented in the last two decades, like RINS, MCF cuts, conflict analysis, symmetry detection, dynamic search, ... (see e.g. Tobias Achterberg's line of work).

On the other hand, hardware improvements were not as relevant for LP and MILP solvers as one would expect: for instance, as of now there is still no solver that really uses GPU compute (though people are working on that). The reason is that parallelizing simplex solvers is quite tough, since the algorithm is inherently sequential (it is a descent over the vertices of the feasible polytope) and the actual linear algebra is very sparse (if not entirely matrix-free). You can do some things like lookahead for better pricing, or row/column generation approaches, but you have to be very careful (interior point methods are arguably nicer to parallelize, but in many cases carry a performance penalty compared to simplex).

MILP/MINLP solvers look much nicer to parallelize at first glance, since you can parallelize across branches of the branch-and-bound tree, but in practice that is also pretty hard: modern solvers are so efficient that you can easily spend a lot of compute exploring a branch that a different branch quickly proves unnecessary to explore (e.g. SCIP, the fastest open-source MINLP solver, is completely single-threaded and still _somewhat_ competitive). This means that a lot of the algorithmic improvements are hidden inside the parallelization improvements, i.e. a lot of time has been spent on the question "what do we have to do to parallelize the solver without just wasting the additional threads?"


Thanks!


You can solve L1 regression using linear programming at fantastically large scales. In fact, in many applications you do the opposite: go from squared to absolute error precisely because the latter fits into an LP.
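
A minimal runnable sketch of that reduction (toy data, names invented; scipy's HiGHS backend does the solving): introduce one slack t_i per residual with t_i >= |y_i - x_i' beta| and minimize the sum of the slacks.

```
# L1 (least absolute deviations) regression as an LP:
#   minimize sum(t)  subject to  -t <= y - X @ beta <= t
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.laplace(size=200)

n, p = X.shape
c = np.concatenate([np.zeros(p), np.ones(n)])  # objective: sum of slacks
A_ub = np.block([[X, -np.eye(n)],              #  X beta - t <= y
                 [-X, -np.eye(n)]])            # -X beta - t <= -y
b_ub = np.concatenate([y, -y])
bounds = [(None, None)] * p + [(0, None)] * n  # beta free, t >= 0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print(res.x[:p])  # coefficients, close to (1, -2, 0.5)
```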

