Hacker News | boywitharupee's comments

they have a watchdog loop that runs periodically


shouldn't the title be "CUDA Tile IR Open Sourced"?


It's more or less the same thing. CUDA Tile is the name of the IR; cuTile is the name of the high-level DSL.


is there a document or reference implementation that describes the full algorithm? tiling, sorting, merging, and strip conversion.


> In C++, it's an rvalue reference, which can be effectively thought of as an lvalue

hmm... this doesn't sound quite right? the comma operator's result in C++ is not an rvalue reference; it takes on exactly the value category of its right operand (which in this case is an lvalue)


so, these are hand-optimized primitives for specific models of nvidia gpus? do you still have to make launch/scheduling decisions to maximize occupancy? how does this approach scale to other target devices with specialized instruction sets and different architectures?


can someone explain how profiling tools like this are written for GPU applications? wouldn't you need access to an internal runtime api?

for example, Apple wraps Metal buffers as "Debug" buffers to record allocations/deallocations.


Some graphics APIs support commands that tell the GPU to record a timestamp when it gets to processing the command. This is oversimplified, but it's essentially what you ask the GPU to do. There are lots of hardware gotchas that make this more difficult in practice, since a GPU won't always execute and complete work exactly in the order you specify at the API level when it's safe to reorder. And the timestamp domain isn't always the same as the CPU's.

But in principle it's not that different from how you just grab timestamps on the CPU. On Vulkan the API used is called "timestamp queries".

It's quite tricky on tiled renderers like Arm/Qualcomm/Apple, as they can't provide meaningful timestamps at much tighter granularity than a whole render pass. I believe Metal only lets you query timestamps at the encoder level, which roughly maps to a render pass in Vulkan (at the hardware level, anyway).


I don't know about Tracy, but I've seen a couple of WebGPU JS debugging tools that simply intercept calls to the various WebGPU functions like writeBuffer, draw, etc., by modifying the prototypes of Device, Queue and so on[0].

- 0: https://github.com/brendan-duncan/webgpu_inspector/blob/main...


what kind of model architecture was used for this? is it safe to assume they used a transformer model or a variant of it?


what's the purpose of this? is it one of those 'fun' problems to solve?


This quote might help - https://en.wikipedia.org/wiki/Von_Neumann%27s_elephant#Histo...

yes, a fun problem, but also a criticism of using too many parameters.


how different is this compared to Facebook's open-source tool Faiss[1]?

[1] https://github.com/facebookresearch/faiss/


Faiss is for similarity search over vectors via k-NN. GraphRAG is, well, a graph. More precisely, GraphRAG has more in common with old school knowledge graph techniques involving named entity extraction and the various forms of black magic used to identify relationships between entities. If you remember RDF and the semantic web it's sort of along those lines. One of the uses of Faiss is in a k-NN graph but the edges between nodes in that graph are (similarity) distance based.

Looking at an example prompt from GraphRAG will make things clearer: https://github.com/microsoft/graphrag/blob/main/graphrag/pro...

especially these lines:

Return output in English as a single list of all the entities and relationships identified in steps 1 and 2.

Format each relationship as a JSON entry with the following format:

{{"source": <source_entity>, "target": <target_entity>, "relationship": <relationship_description>, "relationship_strength": <relationship_strength>}}


Excuse me, how is it not?


In a similar fashion, you'll see that JAX's frontend code is open-sourced, while device-related code is distributed as binaries. For example, if you're on Google's TPU you'll see libtpu.so, and on macOS you'll see pjrt_plugin_metal_1.x.dylib.

The main optimizations (scheduler, vectorizer, etc.) are hidden behind these shared libraries. If open-sourced, they might reveal hints about proprietary algorithms and provide clues to various hardware components, which could potentially be exploited.

