Hacker News | dkhudia's comments

> It's quite common in machine learning operations to multiply a matrix of unsigned byte by a matrix of signed byte. Don't ask me why, but that's the case.

Overflow is the reason. Intel's vpmaddubsw takes an int8_t and a uint8_t operand and produces int16_t results. If both operands were unsigned, 255 * 255 = 65025 would be out of range for int16_t (−32,768 to +32,767), so the instruction was likely designed to take one int8_t and one uint8_t. If one operand is signed and the other unsigned, however, even the extremes −128 * 255 = −32,640 and 127 * 255 = 32,385 stay in int16_t range. Overflow (or rather saturation, with this instruction) can still occur, because it sums adjacent pairs of products. See my comment in PyTorch. https://github.com/pytorch/pytorch/blob/a37db5ae3978010e1bb7...
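The saturating behaviour is easy to see in a scalar model. Below is a hedged Python sketch of one 16-bit lane of vpmaddubsw (the function name and structure are mine, not Intel's): two adjacent uint8 × int8 products are summed, and the sum is saturated to the int16 range.

```python
INT16_MIN, INT16_MAX = -32768, 32767

def maddubs_lane(u0, s0, u1, s1):
    """Scalar model of one output lane of vpmaddubsw:
    u0, u1 are unsigned bytes (0..255); s0, s1 are signed bytes (-128..127).
    Each product fits in int16, but the pairwise sum is saturated."""
    total = u0 * s0 + u1 * s1
    return max(INT16_MIN, min(INT16_MAX, total))
```

For example, 255 * 127 = 32,385 fits in int16_t on its own, but `maddubs_lane(255, 127, 255, 127)` would sum to 64,770 and therefore saturates to 32,767, which is exactly the hazard the PyTorch comment above refers to.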


This doesn't feel like a convincing argument. If you wanted to multiply uint8 * uint8, you'd naturally use an unsigned multiply with a uint16 result, which doesn't overflow either.

I believe a better argument is to appeal to the structure of neural networks. Activation inputs into a matrix multiply come out of a non-linear function, and ReLU is a popular function which causes activation inputs to be unsigned. Weights then need to be signed so that the matrix multiplication can have negative outputs -- without negative outputs, you would lose the non-linearity of ReLU.
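To make the asymmetry concrete, here is a hedged NumPy sketch (the scales, shapes, and seed are made up for illustration): post-ReLU activations are non-negative, so they quantize naturally to uint8; weights can be negative, so they quantize to int8; and the matmul accumulates in int32 so the signed outputs survive.

```python
import numpy as np

rng = np.random.default_rng(0)
acts = np.maximum(rng.standard_normal((4, 8)), 0.0)  # ReLU output: >= 0
weights = rng.standard_normal((8, 3))                # signed

# Naive affine quantization (illustrative scale choice, not a real scheme)
q_acts = np.clip(np.round(acts * 255 / acts.max()), 0, 255).astype(np.uint8)
q_weights = np.clip(np.round(weights * 127 / np.abs(weights).max()),
                    -128, 127).astype(np.int8)

# Accumulate in int32: partial sums of u8*s8 products stay in range,
# and the output can be negative, preserving the non-linearity.
acc = q_acts.astype(np.int32) @ q_weights.astype(np.int32)
```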


This is true, but the instruction already existed, and it doesn't support uint16_t accumulation. For the reason you mention, activations are uint8_t and weights are int8_t, so it worked out well for neural networks.


@tome for the deterministic system, what if the timing for one chip/part is off due to manufacturing/environmental factors (e.g., temperature)? How does the system handle this?


We know the maximum possible clock drift and so we know when we need to do a resynchronisation to keep all the chips in sync. You can read about it in section 3.3 of our recent whitepaper: https://wow.groq.com/wp-content/uploads/2023/05/GroqISCAPape...
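As a back-of-the-envelope illustration (every number below is an assumption for the sketch, not Groq's actual figure): given a bound on the relative drift between two clocks and a tolerable skew budget, the resynchronisation interval falls out directly.

```python
max_drift_ppm = 50       # assumed worst-case relative drift between chips, ppm
max_skew_cycles = 100    # assumed tolerable skew between chips, in clock cycles
clock_hz = 900e6         # assumed clock frequency

# Worst case, skew grows at this many cycles per second of wall time
drift_cycles_per_sec = clock_hz * max_drift_ppm / 1e6

# Resynchronise before the skew budget is exhausted
resync_interval_s = max_skew_cycles / drift_cycles_per_sec  # ~2.2 ms here
```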


Those sorts of issues are part of timing analysis for a chip, but once a chip's clock rate is set, they don't really factor in unless there is some kind of dynamic voltage/frequency scaling scheme going on. This chip probably does not do any of that and just uses a fixed frequency, so timing is perfectly predictable.


MosaicML/Databricks | San Francisco Bay Area or New York | Machine Learning Engineer - Performance Optimization | Full-time

Founded in late 2020 by a small group of machine learning engineers and researchers, MosaicML enables companies to securely fine-tune, train and deploy custom AI models on their own data, for maximum security and control. Compatible with all major cloud providers, the MosaicML platform provides maximum flexibility for AI development. Introduced in 2023, MosaicML’s pretrained transformer models have established a new standard for open source, commercially usable LLMs and have been downloaded over 3 million times. MosaicML is committed to the belief that a company’s AI models are just as valuable as any other core IP, and that high-quality AI models should be available to all.

Now part of Databricks since July 2023, we are passionate about enabling our customers to solve the world's toughest problems — from making the next mode of transportation a reality to accelerating the development of medical breakthroughs. We do this by building and running the world's best data and AI platform so our customers can use deep data insights to improve their business. We leap at every opportunity to solve technical challenges, striving to empower our customers with the best data and AI capabilities.

Apply here or reach out at daya@[company name here].com: https://www.databricks.com/company/careers/engineering/machi... https://www.databricks.com/company/careers/engineering/senio...


Disclaimer: I work for MosaicML (MosaicML is the creator of the training platform used by Replit).

Training these models from scratch on your domain specific data is not as expensive as one might think. We have provided some cost estimates in our blogs.

https://www.mosaicml.com/blog/mosaicbert

https://www.mosaicml.com/blog/training-stable-diffusion-from...

https://www.mosaicml.com/blog/gpt-3-quality-for-500k


Do you have any examples of how to train a model that can write code in a specific domain? E.g., I only want to train it on a specific set of code, say functional React components in TypeScript.


We recently released a 1B-parameter model trained on a mix of data.[1] If you have your domain-specific data, our platform can cover the rest.

[1]: https://twitter.com/jefrankle/status/1649060478910357504?s=4...
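Not a full training recipe, but the data-prep side of domain-specific training can be sketched in a few lines (the separator string, file suffix, and chunk size below are placeholders you'd match to your tokenizer and context length):

```python
import pathlib

SEP = "\n<|file_separator|>\n"  # placeholder; use a separator your tokenizer knows

def build_corpus(root, suffix=".tsx"):
    """Concatenate every matching source file under `root` into one corpus."""
    files = sorted(pathlib.Path(root).rglob(f"*{suffix}"))
    return SEP.join(p.read_text(encoding="utf-8") for p in files)

def chunk(text, size=2048):
    """Split the corpus into fixed-size training examples.
    (Character-level here for brevity; a real pipeline chunks on token ids.)"""
    return [text[i:i + size] for i in range(0, len(text), size)]
```

From there the chunks become the training examples for a causal-LM fine-tuning or continued-pretraining run on whatever stack you use.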


But do you have any examples of how to do this? I am a pretty seasoned dev, but I've never trained a model before :)


Thank you, this is very interesting!

