Hacker News | boywitharupee's comments

they have a watchdog loop that runs periodically


shouldn't the title be "CUDA Tile IR Open Sourced"?


It's more or less the same thing. CUDA Tile is the name of the IR; cuTile is the name of the high-level DSL.


is there a document or reference implementation that describes the full algorithm? tiling, sorting, merging, and strip conversion.


> In C++, it's an rvalue reference, which can be effectively thought of as an lvalue

hmm... this doesn't sound quite right? the comma operator's result in C++ is not an rvalue reference; it takes on exactly the value category of its right operand (which in this case is an lvalue)


so, these are hand-optimized primitives for specific models of nvidia gpus? do you still have to make launch/scheduling decisions to maximize occupancy? how does this approach scale to other target devices with specialized instruction sets and different architectures?


can someone explain how profiling tools like this are written for GPU applications? wouldn't you need access to an internal runtime api?

for example, Apple wraps Metal buffers as "Debug" buffers to record allocations/deallocations.


Some graphics APIs support commands that tell the GPU to record a timestamp when it gets to processing the command. This is oversimplified, but it's essentially what you ask the GPU to do. There are lots of hardware gotchas that make this more difficult in practice, since a GPU won't always execute and complete work exactly in the order you specify at the API level when it's safe to reorder. And the timestamp domain isn't always the same as the CPU's.

But in principle it's not that different from how you just grab timestamps on the CPU. On Vulkan the API used is called "timestamp queries".

It's quite tricky on tiled renderers like Arm/Qualcomm/Apple, as they can't provide meaningful timestamps at much tighter granularity than a whole render pass. I believe Metal only lets you query timestamps at the encoder level, which roughly maps to a render pass in Vulkan (at the hardware level, anyway).


I don't know about Tracy, but I've seen a couple of WebGPU JS debugging tools that simply intercept calls to the various WebGPU functions like writeBuffer, draw, etc., by modifying the prototypes of Device, Queue and so on[0].

- 0: https://github.com/brendan-duncan/webgpu_inspector/blob/main...


what kind of model architecture was used for this? is it safe to assume they used a transformer model or a variant of it?


what's the purpose of this? is it one of those 'fun' problems to solve?


This quote might help - https://en.wikipedia.org/wiki/Von_Neumann%27s_elephant#Histo...

yes, a fun problem, but also a criticism of using too many parameters.


how different is this compared to Facebook's open-source tool Faiss[1]?

[1] https://github.com/facebookresearch/faiss/


Faiss is for similarity search over vectors via k-NN. GraphRAG is, well, a graph. More precisely, GraphRAG has more in common with old school knowledge graph techniques involving named entity extraction and the various forms of black magic used to identify relationships between entities. If you remember RDF and the semantic web it's sort of along those lines. One of the uses of Faiss is in a k-NN graph but the edges between nodes in that graph are (similarity) distance based.

Looking at an example prompt from GraphRAG will make things clearer: https://github.com/microsoft/graphrag/blob/main/graphrag/pro...

especially these lines:

Return output in English as a single list of all the entities and relationships identified in steps 1 and 2.

Format each relationship as a JSON entry with the following format:

{{"source": <source_entity>, "target": <target_entity>, "relationship": <relationship_description>, "relationship_strength": <relationship_strength>}}


Excuse me, how is it not?


In a similar fashion, you'll see that JAX's frontend code is open-sourced, while device-related code is distributed as binaries. For example, if you're on Google's TPU you'll see libtpu.so, and on macOS you'll see pjrt_plugin_metal_1.x.dylib.

The main optimizations (scheduler, vectorizer, etc.) are hidden behind these shared libraries. If open-sourced, they might reveal hints about proprietary algorithms and provide clues to various hardware components, which could potentially be exploited.

