
I made these a couple of years ago as a teaching exercise for https://minitorch.github.io/. At the time the resources for doing anything on GPUs were pretty sparse and the NVIDIA docs were quite challenging.

These days there are great resources for going deep on this topic. The CUDA-mode org is particularly great, both their video series and PMPP reading groups.


Slightly off-topic, but any chance you could update or re-upload the code for your https://github.com/harvardnlp/DeepLatentNLP tutorial? I found the NLP latent variable models discussed there really interesting, and the notebooks were excellent. However, they seem to be gone and the only things left are the slides?

Alternatively, any other places that discuss the same topics, including some code? I could only find equivalent discussions with code in Pyro docs and Kevin Murphy's book, volume 2. But these are more sparse as they also cover many other topics.


I'll take a look. Yeah, Pyro is the best thing to do here. But it would be nice to revisit some of these implementations.


Thank you so much!


Thanks a lot, Sasha, for creating these. I found your LLM training puzzles to be excellent as well.



Thanks Sasha - this looks like a great resource. Just to be clear, would you recommend going through other, newer resources instead of this one?

Not sure if your comment is to discourage someone from going through this.


These still hold up, and I think they're a great first step. But they no longer get you to the goal line. Think about it more as conceptual practice, before you enter the jungle.


Got it, thank you.


Do you have links to the other great resources you are referring to?


The tweet says the opposite?


PyTorch is a generationally important project. I've never seen a tool that is so in line with how researchers learn and internalize a subject. Teaching machine learning before and after its adoption has been a completely different experience. It can never be said enough how cool it is that Meta fosters and supports it.

Viva PyTorch! (Jax rocks too)


This is exactly why I gravitated to it so quickly. The first time I looked at pytorch code it was immediately obvious what the abstractions meant and how to use them to write a model architecture.
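For example, a hypothetical two-layer MLP (just a sketch with arbitrary sizes, not any real model) reads almost exactly like the math:

    import torch
    import torch.nn as nn

    # A toy two-layer MLP: parameters are declared in __init__,
    # and forward is just plain Python.
    class MLP(nn.Module):
        def __init__(self, d_in, d_hidden, d_out):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(d_in, d_hidden),
                nn.ReLU(),
                nn.Linear(d_hidden, d_out),
            )

        def forward(self, x):
            return self.net(x)

    model = MLP(784, 256, 10)
    logits = model(torch.randn(32, 784))  # (batch, classes) -> (32, 10)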

Jax looks like something completely different to me. Maybe I’m dumb and probably not the target audience, but it occurs to me that very few people are. When I read about using Jax, I find recommendations for a handful of other libraries that make it more useable. Which of those I choose to learn is not entirely obvious because they all seem to create a very fragmented ecosystem with code that isn’t portable.

I’m still not sure why I’d spend my time learning Jax, especially when it seems like most of the complaints from the author don’t really separate out training and inference, which don’t necessarily need to occur from the same framework.


Honestly, when I turn to JAX, I generally do it without a framework. It’s like asking for a framework to wrap numpy to me. Just JAX plus optax is sufficient for me in the cases I turn to it.
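Concretely, the pattern I mean looks roughly like this; a minimal sketch with nothing but jax and optax (the toy loss and names are mine, purely illustrative):

    import jax
    import jax.numpy as jnp
    import optax

    # Toy linear-regression params as a plain pytree; no framework needed.
    params = {"w": jnp.zeros(3), "b": jnp.zeros(())}

    def loss_fn(params, x, y):
        pred = x @ params["w"] + params["b"]
        return jnp.mean((pred - y) ** 2)

    opt = optax.adam(1e-3)
    opt_state = opt.init(params)

    @jax.jit
    def step(params, opt_state, x, y):
        loss, grads = jax.value_and_grad(loss_fn)(params, x, y)
        updates, opt_state = opt.update(grads, opt_state)
        return optax.apply_updates(params, updates), opt_state, loss

    # usage: params, opt_state, loss = step(params, opt_state, x_batch, y_batch)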


Torch was originally a Lua project, which is why PyTorch is called PyTorch and not just Torch.

In another timeline AI would have made Lua popular.

The best part is that it trampled TensorFlow, which I personally find obtuse.


> In another timeline AI would have made Lua popular.

I wonder if it'd have been hated more than Python is - especially with the 1-based indexing...


Scientific computing tends to be 1-based. Thus R, Julia, Fortran, Matlab.


Python isn't hated AFAICT. People will profess to hating building large projects in it (myself included), but many of those same people also love it for shorter programs and scripts.


Everything is hated.

Python has always gotten hate for being super slow and having an ugly syntax (subjective ofc, but I happen to agree).


Additionally, it nowadays has Java and C++ bindings to the same native libraries, so others can enjoy the performance without having to rewrite their research afterwards.


These slides from Lucas Beyer are pretty nice. https://docs.google.com/presentation/d/1ZXFIhYczos679r70Yu8v...


Yup. I often find people learning ML engineering struggle a lot with shapes and broadcasting. The goal of these puzzles is to force you to really learn the semantics of broadcasting and to internalize that data shapes in ML correspond to the loops most people naturally think in.
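A minimal sketch of the correspondence (a hypothetical example, not one of the actual puzzles): the loop version and the broadcast version compute the same thing, with shapes standing in for loop indices.

    import torch

    a = torch.randn(4)   # think: for i in range(4)
    b = torch.randn(5)   # think: for j in range(5)

    # Loop view: two nested loops over i and j.
    out_loop = torch.zeros(4, 5)
    for i in range(4):
        for j in range(5):
            out_loop[i, j] = a[i] - b[j]

    # Broadcast view: (4, 1) - (1, 5) -> (4, 5), same computation.
    out_bcast = a[:, None] - b[None, :]
    assert torch.allclose(out_loop, out_bcast)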


Hey, I made these. They're pretty fun. Sometimes people tell me they use them for ML interviews, but they're kind of hard.

The motivation was primarily teaching point-free, array programming. I don't think it is a great style, but it is fun as a brain teaser.
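For a taste of the style, here is a toy in the same spirit (my own illustration, not an actual puzzle solution): build an identity matrix with no loops or conditionals, just arange and a broadcasted comparison.

    import torch

    def eye(n):
        # Point-free: the (i, j) entry is 1 exactly where i == j.
        r = torch.arange(n)
        return (r[:, None] == r[None, :]).float()

    print(eye(3))
    # tensor([[1., 0., 0.],
    #         [0., 1., 0.],
    #         [0., 0., 1.]])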

If you enjoy this type of thing, I made a bunch more. They're all kind of ML + PL in style.

- https://github.com/srush/gpu-puzzles

- https://github.com/srush/tensor-puzzles

- https://github.com/srush/autodiff-puzzles

- https://github.com/srush/transformer-puzzles

- https://github.com/srush/LLM-Training-Puzzles

- https://github.com/srush/triton-puzzles

All the graphics for these are made in Chalk, a Python port of Haskell's Diagrams library: https://github.com/chalk-diagrams/chalk. Honestly, I mostly make the puzzles as an excuse to hack on the graphics library, which I find pretty interesting.


I really like the concept, but both Colab and a locally running Jupyter notebook seem to have issues. I'm getting an error related to "env.height" (I can send you the full stack trace if interested) in the very first puzzle.


Oh no, yes, please send a stack trace (although if it is in Colab I should be able to repro).


Nevermind, I think it was just me being silly and not running the bit with wget at the top!


keep 'em coming!


This book is great. Really mind-warping on a first read. Fernando Pereira has had an incredible influence across NLP throughout his whole career. Here is an offhand list of papers to check out.

* Conditional random fields: Probabilistic models for segmenting and labeling sequence data (2001) - Central paper of structured supervised learning in the 2000s era

* Weighted finite-state transducers in speech recognition (2002) - This work and OpenFST are so clean

* Non-projective dependency parsing using spanning tree algorithms (2005) - Influential work connecting graph algorithms to syntax. Less relevant now, but still such a nice paper.

* Distributional clustering of English words (1994) - Proto word embeddings.

* The Unreasonable Effectiveness of Data (2009) - More high-level, but certainly explains the last 15 years


Hi! Blog author here. This was an attempt, a couple of years ago, to understand and write about this paper in a detailed way. Here is a video going through the topic as well: https://youtu.be/dKJEpOtVgXc?si=PDNO0B0qi6ARHaeb

Section 2 of the blog post is no longer very relevant. A lot of advances (DSS, S4D) simplified that part of the process. Arguably also this all should be updated for Mamba (same authors).


Thanks for your spectacular resources! I see that you began an Annotated Mamba repository -- any chance you could share when that blog page might go live?


This was an excellent write-up, thanks. It'll help me understand the Mamba work a lot more.

I still find it really confusing how a linear model can perform so well.
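My rough mental model, as a toy sketch (real S4/Mamba models use structured, learned parameters per channel and stack many such layers with nonlinear mixing in between), is that only the recurrence over time is linear:

    import torch

    # x_k = A x_{k-1} + B u_k,  y_k = C x_k  (linear in the input u).
    d_state = 4
    A = 0.1 * torch.randn(d_state, d_state)
    B = torch.randn(d_state)
    C = torch.randn(d_state)

    def ssm_scan(u):  # u: (seq_len,)
        x = torch.zeros(d_state)
        ys = []
        for u_k in u:
            x = A @ x + B * u_k   # linear state update
            ys.append(C @ x)      # linear readout
        return torch.stack(ys)

    y = ssm_scan(torch.randn(16))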


Want to give proper credit to my former student for starting this: Yuntian Deng et al., 2016 (https://arxiv.org/abs/1609.04938). I believe this repo uses the dataset from that paper.

Some recent cool work he's been doing: https://www.youtube.com/watch?v=lx1XcTdhalU.


Yup, should work nicely together.

