nirinor's comments | Hacker News

It's a nitpick, but backpropagation is getting a bad rep here. These examples are about gradients+gradient descent variants being a leaky abstraction for optimization [1].

Backpropagation is a specific algorithm for computing gradients of composite functions, but even the failures that do come from composition (multiple sequential sigmoids cause exponential gradient decay) are not backpropagation specific: that's just how the gradients behave for that function, whatever algorithm you use. The remedy, of having people calculate their own backwards pass, is useful because people are _calculating their own derivatives_ for the functions, and get a chance to notice the exponents creeping in. Ask me how I know ;)
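
To make that exponential decay concrete, here is a minimal sketch (plain Python, a toy scalar chain, not from the original discussion): each sigmoid's derivative is at most 0.25, so the chain-rule product shrinks geometrically with depth, regardless of which algorithm computes the gradient.

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Compose n sigmoids on a scalar input and accumulate the derivative of the
    # composition by the chain rule: each factor is sigma'(z) = s*(1-s) <= 0.25,
    # so the product decays (at least) geometrically with depth.
    z = 0.5
    grad = 1.0
    for depth in range(1, 11):
        s = sigmoid(z)
        grad *= s * (1.0 - s)   # derivative of this sigmoid w.r.t. its input
        z = s
        print(f"depth {depth:2d}: d(output)/d(input) ~ {grad:.3e}")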

[1] Gradients being zero would not be a problem with a global optimization algorithm (which we don't use because they are impractical in high dimensions). Gradients getting very small might be dealt with with tools like line search (if they are small in all directions) or approximate Newton methods (if small in some directions but not others). Not saying those are better solutions in this context, just that optimization(+modeling) are the actually hard parts, not the way gradients are calculated.


Yes. No need to be apologetic or timid about it — it’s not a nit to push back against a flawed conceptual framing.

I respect Karpathy’s contributions to the field, but often I find his writing and speaking to be more than imprecise — it is sloppy in the sense that it overreaches and butchers key distinctions. This may sound harsh, but at his level, one is held to a higher standard.


> often I find his writing and speaking to be more than imprecise

I think that's more because he's trying to write to an audience who isn't hardcore deep into ML already, so he simplifies a lot, sometimes to the detriment of accuracy.

At this point I see him more as a "ML educator" than "ML practitioner" or "ML researcher", and as far as I know, he's moving in that direction on purpose, and I have no qualms with it overall, he seems good at educating.

But I think shifting your mindset about what the purpose of his writing is may help explain why it sometimes feels imprecise.


Whoever chose this topic title perhaps did him a disservice in suggesting he said the problem was backprop itself, since in his blog post he immediately clarifies what he meant by it. It's a nice pithy way of stating the issue though.


Nah, Karpathy's title is "Yes you should understand backprop", and his first highlight is "The problem with Backpropagation is that it is a leaky abstraction." This is his choice as a communicator, not the poster to HN.

And his _examples_ are about gradients, but nowhere does he distinguish between backpropagation, (part of) an algorithm for automatic differentiation, and the gradients themselves. None of the issues are due to BP returning incorrect gradients (it totally could, for example, lose too much precision, but it doesn't).


Yeah - he chose it as a pithy/catchy description of the issue, then immediately clarified what he meant by it.

> In other words, it is easy to fall into the trap of abstracting away the learning process — believing that you can simply stack arbitrary layers together and backprop will “magically make them work” on your data.

Then follows this with multiple clear examples of exactly what he is talking about.

The target audience was people building and training neural networks (such as his CS231n students), so I think it's safe to assume they knew what backprop and gradients are, especially since he made them code gradients by hand, which is what they were complaining about!


But Karpathy is completely right; students who understand and internalize how backprop works, having implemented it rather than treating it as a magic spell cast by TF/PyTorch, will also be able to intuitively understand these problems of vanishing gradients and so on.

Sure, instead of "the problem with backpropagation is that it's a leaky abstraction" he could have written "the problem with not learning how back propagation works and just learning how to call a framework is that backpropagation is a leaky abstraction". But that would be a terrible sub-heading for an introductory-level article for an undergraduate audience, and also unnecessary because he already said that in the introduction.


I never disagreed with the utility and importance of understanding backprop. I'm glad the article exists. And it could be easily improved -- and all of us can gain [1] by acknowledging this rather than circling the wagons [2], so to speak, or excusing unforced errors.

> ... he could have written "the problem with not learning how back propagation works and just learning how to call a framework is that backpropagation is a leaky abstraction". But that would be a terrible sub-heading ...

My concern isn't about the heading he chooses. My concern is deeper; he commits a category error [3]. The following things are true, but Karpathy's article gets them wrong: (1) Leaky abstractions only occur with interfaces; (2) Backpropagation is an algorithm; (3) Algorithms can never be leaky abstractions.

Karpathy could have communicated his point clearly and correctly by saying e.g.: "treating backprop learning as a magical optimization oracle is risky". There is zero need for introducing the concept of leaky abstractions at all.

---

Ok, with the above out of the way, we can get to some interesting technical questions that are indeed about leaky abstractions which can inform the community about pros/cons of the design space: To what degree is the interface provided by [Library] a leaky abstraction? (where [Library] might be PyTorch or TensorFlow) Getting into these details is interesting. (See [4] for example.) There is room for more writing on this.

[1]: We can all gain because accepting criticism is hard. Once we see that even Karpathy messes up, we probably shouldn't be defensive when we mess up.

[2]: No one is being robbed here. Criticism is a gift; offering constructive criticism is a sign of respect. It also respects the community by saying i.e. "I want to make it easier for people to get the useful, clear ideas into their heads rather than muddled ones."

[3]: https://en.wikipedia.org/wiki/Category_mistake

[4]: https://elanapearl.github.io/blog/2025/the-bug-that-taught-m...


Hear hear, one of my favorite comments recently.

Can’t agree more about the technical points (category error etc), and then the unexpected switch to the value of receiving constructive criticism as a gift not an attack.

Myself, I’m definitely conditioned to receive it as an attack. I’m trying to break this habit. This morning I gave some extensive feedback to some friends who have a startup. The whole time I was writing it, I was stressing out that they’d feel attacked, because that’s how I might take similar criticism.

How was it actually received? A mix I think. Some people explicitly received it as a gift, and others I’m not so sure.


I get your point, but I don't think your nit-pick is useful in this case.

The point is that you can't abstract away the details of back propagation (which involve computing gradients) under some circumstances. For example, when we are using gradient descent. Maybe in other circumstances (a global optimization algorithm) it wouldn't be an issue, but the leaky abstraction idea isn't that the abstraction is always an issue.

(Right now, back propagation is virtually the only way to calculate gradients in deep learning)


So, is computing gradients a detail of backpropagation that it fails to abstract over, or are gradients the goal that backpropagation achieves? It isn't both; it's just the latter.

This is like complaining about long division not behaving nicely when dividing by 0. The algorithm isn't the problem, and blaming the wrong part does not help understanding.

It distracts from what is actually helping which is using different functions with nicer behaving gradients, e.g., the Huber loss instead of quadratic.
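
As a rough illustration of that last point (a toy NumPy sketch, not anything from the thread): the quadratic loss has a gradient that grows linearly with the residual, while the Huber gradient is clipped at delta, so outliers cannot dominate the update.

    import numpy as np

    def quadratic_grad(r):
        # gradient of 0.5 * r**2 with respect to the residual r
        return r

    def huber_grad(r, delta=1.0):
        # gradient of the Huber loss: equals r near zero, clipped to +/- delta beyond
        return np.where(np.abs(r) <= delta, r, delta * np.sign(r))

    residuals = np.array([0.1, 1.0, 10.0, 100.0])
    print("quadratic:", quadratic_grad(residuals))  # grows without bound with the residual
    print("huber:    ", huber_grad(residuals))      # bounded, so outliers cannot blow up the step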


> It distracts from what is actually helping which is using different functions with nicer behaving gradients, e.g., the Huber loss instead of quadratic.

Fully agree. It's not the "fault" of Backprop. It does what you tell it to do: find the direction in which your loss is reduced the most. If the first layers get no signal because the gradient vanishes, then the reason is your network layout: very small modifications in the initial layers would lead to very large modifications in the final layers (essentially an unstable computation), so gradient descent simply cannot move that fast.

Instead, it's a vital signal for debugging your network. Inspecting things like gradient magnitudes per layer shows whether you have vanishing or exploding gradients. And that has led to great inventions for dealing with them, such as residual networks and a whole class of normalization methods (such as batch normalization).
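
For example, a minimal PyTorch sketch of that kind of per-layer inspection (a toy sigmoid MLP; the sizes and depth are made up purely for illustration):

    import torch
    import torch.nn as nn

    # Toy deep MLP with sigmoid activations, the classic vanishing-gradient setup.
    layers = []
    for _ in range(8):
        layers += [nn.Linear(32, 32), nn.Sigmoid()]
    model = nn.Sequential(*layers, nn.Linear(32, 1))

    x = torch.randn(64, 32)
    loss = model(x).pow(2).mean()
    loss.backward()

    # If these norms shrink by orders of magnitude toward the early layers, you are
    # looking at vanishing gradients; if they grow instead, exploding ones.
    for name, p in model.named_parameters():
        if name.endswith("weight"):
            print(f"{name:15s} grad norm = {p.grad.norm().item():.3e}")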


It’s just an observation. It’s an abstraction in the classical computer science sense in that you stack some modules and the backprop is generated. It’s leaky in the sense that you can’t fully abstract away the details, because of the vanishing/exploding gradient issues you must be mindful of.

It is definitely a useful thing for people who are learning this topic to understand from day 1.


I am not hiring, but might be able to help with some other parts. DM me if you want to talk.


I use python extensively. I've used bash (+awk+xargs+sed...) extensively.

Nushell is already a great improvement over bash _as a shell_. It is even better when using it to compose _preexisting text based programs_. I would say it is better in every way I can think of, except for:

- not (yet) coming pre-installed, and

- stability of interfaces and language.

It's already better enough to be my default shell on my daily driver, though I keep bash around because some things really assume it. I very much look forward to one day having a userspace with no traditional shells at all.

nushell is not yet a strict improvement over python, but it might one day be, and it is already better at:

- munging text, json, dates, and tables

- quickly creating nice CLIs callable from the shell (even if that shell isn't nu!)

- being fun to program in

> Nushell is trying to blend two domains, shells and programming languages, which I see distinct advantages in keeping separate.

Interesting, though, how many PL features the most popular shells tend to have...

> I do not want the world to be built on the back of shell scripts, regardless of how good you make the type system.

If I had read that before knowing nushell, I would strongly agree. Yet, it turns out you can make a shell so good I wouldn't mind if... not the world, but _a lot more_ was built on it.


Very nice formalization.

One area for refinement: it considers two stacks either identical or unrelated. Consider that stack A;B is actually very close to A;B;C; the difference might be due to a sample occurring just before or just after the call to C. OP considers them just as different as A;B and Z;W, therefore amplifying measurement noise.

This suggests using a refined metric between stacks (e.g., an edit distance counting pushes and pops), and then we can use it in defining the metric between flamegraphs (e.g., an optimal transport metric [1], instead of the proposed L1).
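
A minimal sketch of such a push/pop edit distance (assuming stacks are given as lists of frame names; the function name is mine, not OP's). Since a call stack only changes at the top, the distance is simply: pop everything above the longest common prefix, then push the rest of the target stack.

    def stack_edit_distance(a, b):
        """Number of pushes/pops needed to turn stack a into stack b."""
        common = 0
        while common < min(len(a), len(b)) and a[common] == b[common]:
            common += 1
        return (len(a) - common) + (len(b) - common)

    print(stack_edit_distance(["A", "B"], ["A", "B", "C"]))  # 1: one push apart
    print(stack_edit_distance(["A", "B"], ["Z", "W"]))       # 4: completely unrelated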

Avoiding that noise amplification reduces the background noise level, and therefore the cost of effective measurements. From another perspective, the current OP scheme creates an avoidable curse of dimensionality in the form of the Hotelling test's requirement that each measurement have more samples than distinct stack frames. So the same code split into more functions is harder to measure, and too-small samples are useless. I think neither of those is necessary if we take stack similarity into account.

[1] https://en.wikipedia.org/wiki/Wasserstein_metric


I've needed these capabilities often while using awk for converting messy logs/error outputs into tables/commands.

Nowadays I like the nushell approach to this kind of composition:

    echo 'quux=123 foo=123 bar=123' | str replace '.*quux=([0-9]+).*foo=([0-9]+).*' $"$2,$1" | from csv -n | each {|r| $r.column1 + $r.column2}
which of course relies on the same regex library (hattip).


Mislabeled? Sounds like you're SEEKING WORK; SEEKING FREELANCER is for the other side.


SEEKING WORK | NYC area | Remote

Hi, I am Daniel Vainsencher, ML PhD and practitioner. You want to use AI to solve a business or technical problem, have data and a team, but no senior research staff.

I can, on a consulting/contracting basis:

1. Translate your problem and context into well defined prediction and decision problems, connecting performance to the bottom line.

2. Plan out, chunk, and help implement solutions to prediction and decision problems via existing (and if needed, new) ML and optimization software.

3. Support your engineer/research staff in applying ML and connecting it to business goals, via ideas, coaching and troubleshooting.

I have years of experience building software systems, leading dev teams, developing algorithms, publishing state-of-the-art ML/optimization research, and applying them to challenging problems. I've worked and done applied research in domains including entertainment, algorithmic trading, video processing, face recognition, signal processing, demand forecasting, and more.

Tech: Python, Julia, Rust, git and others.

https://www.linkedin.com/in/danielvainsencher/

https://www.semanticscholar.org/author/D.-Vainsencher/273458...

Email: danielv at nirinor.ai


I joined the Tribe community a few months ago and have done one project with them; so far I'm very happy with my experience.

- Pleasant, helpful, knowledgeable people on the slack.

- Brief and effective process matching me with a client.

- They negotiated a rate I was happy with, taking a cut I found reasonable.

- Good working relationships with others in the project, both from Tribe and the client, everyone happy.

- Payment was simple and timely; when I encountered a technical issue with their provider during setup, Tribe was very helpful in resolving it.

Feel free to ask questions here or DM me


Some applications depend on approximately solving optimization problems that are hard even at small sizes. The poster child here is combinatorial optimization (more or less equivalently, NP-complete problems); concrete examples are SMT solvers and their applications to software verification [1]. Non-convex problems are sometimes similarly bad.

Non-smooth and badly conditioned optimization problems scale much better with size, but getting high-precision solutions is hard. These are important for the simulations mentioned elsewhere, not just for architecture and games but also for automating design, inspections, etc. [2]

[1] https://ocamlpro.github.io/verification_for_dummies/

[2] https://www.youtube.com/watch?v=1ALvgx-smFI&t=14s


We see so many examples of regulations having directionally wrong effects...

Regulation is adversarial (the largest regulated entities are savvy and often propose the regulations themselves) and has higher-order and long-term effects. All of this makes it a hard domain, but it is also well known in advance.

What's the current best practice for validating, in advance, that regulations will have their intended effects?

