Hacker News: varunshenoy's comments

I've been playing with AI agents for months, and most of them are pretty bad. They often get stuck in loops, which is frustrating. This happens in MultiOn, AutoGPT, and others.

I've used Devin a few times (see: https://x.com/varunshenoy_/status/1767591341289250961?s=20), and while it's far from perfect, it's by far the best I've seen. It doesn't get stuck in loops, and it keeps trying new things until it succeeds. Devin feels like a fairly competent high school intern.

Interestingly, Devin seems better suited as an entry-level analyst than a software engineer. We've been using it internally to scrape and structure real estate listings. Their stack for web RPA and browser automation works _really_ well. And it makes sense why this matters: if you want a successful agent, you need to give it good tools. Again, it's not flawless, but it gives me hope for the future of AI agents.


Hey guys! Just wanted to share a fun side project.

Code is here: https://github.com/varunshenoy/latentverse


Slightly different set of trade-offs, but a similar mental model. You always use large batch sizes (compute bound), and the bottleneck usually ends up being communication between GPUs/nodes.


Good question. Yes, the 10GB available for batching is in the HBM. In a single forward pass, you move the entire model from HBM -> SRAM exactly once. In a batched forward pass, this is still the case, so you end up doing more compute for the same amount of memory movement.
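A quick back-of-the-envelope way to see this (the 7B fp16 model here is just an assumed example, not from the post):

```python
# Weight traffic per forward pass is constant, while compute scales with
# batch size, so batching raises arithmetic intensity. Numbers illustrative.
params = 7e9            # assumed 7B-parameter model
bytes_per_param = 2     # fp16
weight_bytes = params * bytes_per_param

for batch in (1, 8, 64):
    flops = 2 * params * batch          # ~2 FLOPs per parameter per token
    intensity = flops / weight_bytes    # FLOPs per byte of weights moved
    print(batch, intensity)             # intensity grows linearly with batch
```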

You can calculate the SRAM as follows: an A100 has 108 SMs, and each SM has 192 KB in SRAM (shared memory, aka its L1 cache) [1]. Multiplied out, this is ~20 MB of total SRAM. This happens to match up with the diagram in the Flash Attention paper [2].

[1] https://developer.nvidia.com/blog/cuda-refresher-cuda-progra...

[2] https://arxiv.org/pdf/2205.14135.pdf
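The SRAM arithmetic above, written out:

```python
# A100 on-chip SRAM total, using the figures from [1].
num_sms = 108          # streaming multiprocessors on an A100
sram_per_sm_kb = 192   # shared memory / L1 cache per SM, in KB

total_sram_mb = num_sms * sram_per_sm_kb / 1024
print(f"Total SRAM: ~{total_sram_mb:.1f} MB")  # ~20.2 MB
```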


Thanks!

vLLM for quick setup, TRT-LLM for best performance. Both are available on https://baseten.co/.


Absolutely. Looks like the M1 Ultra has 800GB/s of memory bandwidth and ~20 TFLOPS of compute.

The same calculations from the post should hold, except with these new values.
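Plugging those numbers in (the 7B fp16 model is an assumed example to make the ratio concrete):

```python
# Rough arithmetic-intensity check for an M1 Ultra, using the figures above.
mem_bandwidth = 800e9   # bytes/s
compute = 20e12         # FLOP/s

ops_per_byte = compute / mem_bandwidth  # FLOPs available per byte moved

# Memory-bound token latency for a 7B model in fp16 (~14 GB of weights):
model_bytes = 14e9
latency = model_bytes / mem_bandwidth
print(f"{ops_per_byte:.0f} FLOPs/byte, ~{1/latency:.0f} tokens/s upper bound")
```

So generation is still firmly bandwidth-bound, same as on the data-center GPUs in the post.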


:)


Awesome job, and thank you for creating it. Curious if you have any insights on long-term memory, and whether there are better ways to do retrieval apart from top-k.

Seems weird that every RAG app uses top-k, especially since you might pull in information irrelevant to the context (e.g., if you were asking for the names of the authors of a paper, you probably only want the top-1 embedding).


Definitely, top-k is a very naive way to do RAG. I think people have experimented with cross-encoder-like approaches, or even letting the LLM choose the sources. We will experiment with more approaches like this :)
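For what it's worth, the two-stage idea looks roughly like this. The scoring function below is a toy stand-in for a real cross-encoder, and the data is made up:

```python
# Top-k retrieval over embeddings, then a rerank pass where the scorer sees
# the query and document together (the role a cross-encoder would play).
import numpy as np

def top_k(query_emb, doc_embs, k):
    # cosine similarity against all documents, keep the k best
    sims = doc_embs @ query_emb / (
        np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb)
    )
    return np.argsort(sims)[::-1][:k]

def rerank(query, docs, candidates, score_fn):
    # score_fn stands in for a cross-encoder
    return sorted(candidates, key=lambda i: score_fn(query, docs[i]), reverse=True)

# toy data: 4 documents in a 3-d embedding space
docs = ["a", "b", "c", "d"]
doc_embs = np.array([[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0, 1]], float)
query_emb = np.array([1.0, 0.0, 0.0])

cands = top_k(query_emb, doc_embs, k=2)                      # indices [0, 1]
best = rerank("q", docs, list(cands), lambda q, d: len(d))   # toy scorer
```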


You can write an extension to support LoRA (~10 lines of Python HF Diffusers code).

If you get to this before me, please create a PR!
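A sketch of roughly what those ~10 lines might look like, using Diffusers' `load_lora_weights` API (the model and LoRA paths are placeholders, and the exact wiring into the extension is left out):

```python
# Hypothetical LoRA extension sketch built on HF Diffusers.
def load_with_lora(base_model: str, lora_path: str, scale: float = 0.8):
    # import inside the function so this sketch stays importable
    # without diffusers installed
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(base_model)
    pipe.load_lora_weights(lora_path)   # attach LoRA weights to the pipeline
    pipe.fuse_lora(lora_scale=scale)    # optionally bake them in at a strength
    return pipe
```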


LoRAs can be handled as a straightforward Python extension!

