
CFD was merely used as an example of something that does scale well. I'm not sure it was the best example, since CFD isn't very common. But basically you have a volume mesh and each cell iterates on the Navier-Stokes equations. So if you have N processor cores, you break the mesh into N pieces, each of which gets processed in parallel. Doubling the number of cores allows you to process double the amount in the same time, minus communication losses (each section of the mesh needs to communicate the results on its boundary to its neighbors).
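To make that concrete, here is a minimal sketch of the idea, assuming a 1-D mesh and a toy diffusion-style update standing in for the real Navier-Stokes step (the names split_mesh, step_chunk and step_all are mine, not from any actual CFD code):

    # Toy domain decomposition: partition the mesh, update each chunk,
    # and exchange only the boundary (halo) values between neighbors.
    import numpy as np

    def split_mesh(cells, n_cores):
        # one contiguous chunk per core
        return np.array_split(cells, n_cores)

    def step_chunk(chunk, left_halo, right_halo):
        # toy stencil: each cell relaxes toward the average of its neighbors
        padded = np.concatenate(([left_halo], chunk, [right_halo]))
        return 0.5 * chunk + 0.25 * (padded[:-2] + padded[2:])

    def step_all(chunks):
        # One global time step. In a real solver each chunk lives on its own
        # core and only the halo values cross the network, so per-core work
        # shrinks like 1/N while communication stays roughly constant.
        new_chunks = []
        for i, chunk in enumerate(chunks):
            left = chunks[i - 1][-1] if i > 0 else chunk[0]
            right = chunks[i + 1][0] if i < len(chunks) - 1 else chunk[-1]
            new_chunks.append(step_chunk(chunk, left, right))
        return new_chunks

    cells = np.random.rand(1_000_000)
    chunks = split_mesh(cells, n_cores=8)  # double n_cores -> half the cells per core
    chunks = step_all(chunks)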

I don't fully understand the graph, but it looks like his point is that AlphaGo Zero uses 1e5 times as many resources as AlexNet, but does not produce anywhere near 1e5 times better results. We saw that with CFD, 1e5 more cores resulted in 1e5 times better results (= scales). The assertion is that DL's results are much less than 1e5 times better, hence it does not scale.

Basically the argument is:

1. CFD produces N times better results given N times more resources [this is implied, requires a knowledge of CFD]. That is, f(a * x) = a * f(x), or equivalently f(a * x) = 1 * a * f(x).

2. Empirically, we see that DL has used 1e5 times more resources, but is not producing 1e5 times better results. [No quantitative analysis of how much better the results are is given]

3. Since DL has f(a * x) = b * a * f(x), where b < 1, DL does not scale. [Presumably b << 1 but the article did not give any specific results]

This isn't a very rigorous argument and the article left out half the argument, but it is suggestive.
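As a back-of-envelope way to read point 3, here is a tiny sketch of the efficiency factor b, with placeholder numbers since the article gives no actual measurements:

    # b in f(a * x) = b * a * f(x); b = 1.0 means perfect scaling.
    def scaling_efficiency(resource_ratio, quality_ratio):
        return quality_ratio / resource_ratio

    # CFD-style scaling: 1e5x the cores, ~1e5x the work done per unit time
    print(scaling_efficiency(resource_ratio=1e5, quality_ratio=1e5))  # 1.0

    # The article's claim about DL, if the quality gain were, say, 10x
    # (a hypothetical number purely for illustration)
    print(scaling_efficiency(resource_ratio=1e5, quality_ratio=10))   # 1e-4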



Thanks for that, that is essentially my point. Agreed it is not very rigorous, but it gets the idea across. By scalable we'd typically mean "you throw more GPUs at it and it works better by some measure". Deep learning does that only in extremely specific domains, e.g. games and self-play as in AlphaGo. For the majority of other applications it is architecture-bound or data-bound. You can't throw in more layers and more basic DL primitives and expect better results. You need more data, and more PhD students to tweak the architecture. That is not scalable.


More compute -> more precision is just one field's definition of scalable... Saying that DNNs can't get better just by adding GPUs is like complaining that an apple isn't very orange.

To generalize notions of scaling, you need to look at the economics of consumed resources and generated utility, and you haven't begun to make the argument that data acquisition and PhD student time hasn't created ROI, or that ROI on those activities hasn't grown over time.

Data acquisition and labeling is getting cheaper all the time for many applications. Plus, new architectures give ways to do transfer learning or encode domain bias that let you specialize a model with less new data. There is substantial progress and already good returns on these types of scalability which (unlike returns on more GPUs) influence ML economics.
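For instance, here is a minimal transfer-learning sketch of what "specialize a model with less new data" can look like; torchvision's resnet18 and the 10-class head are just my illustration, not anything specific claimed above:

    # Reuse an expensively pretrained backbone and train only a small new head
    # on the domain-specific data.
    import torch
    import torch.nn as nn
    from torchvision import models

    backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    for p in backbone.parameters():
        p.requires_grad = False                 # keep the pretrained features

    backbone.fc = nn.Linear(backbone.fc.in_features, 10)   # new 10-class head

    optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
    # ...then fine-tune only the head on the (much smaller) new dataset.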


OK, the definition of scalable is crucial here and it causes lots of trouble (this is also a response to several other posts, so forgive me if I don't address your points exactly).

Let me try once again: an algorithm is scalable if it can process bigger instances by adding more compute power.

E.g. I take a small perceptron and train it on a Pentium 100, and then take a perceptron with 10x the parameters on a Core i7 and get better output by some monotonic function of the increase in instance size (it is typically a sub-linear function, but that is OK as long as it is not logarithmic).
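Roughly what I mean, as a sketch only (scikit-learn's MLPClassifier stands in for the perceptron, and the sizes are arbitrary): the only thing that changes between the small and the 10x instance is one size parameter.

    from sklearn.datasets import make_classification
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=5000, n_features=50, random_state=0)

    small = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=0)
    big = MLPClassifier(hidden_layer_sizes=(320,), max_iter=300, random_state=0)  # 10x units, same code

    small.fit(X, y)
    big.fit(X, y)
    # The property I mean by scalable: some monotonic (if sub-linear)
    # improvement just from the bigger instance, with no change to the code.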

DL does not have that property. It requires modifying the algorithm, modifying the task at hand, and so on. And it is not that it requires some tiny tweaking; it requires quite a bit of tweaking. I mean, if you need a scientific paper to make a bigger instance of your algorithm, that algorithm is not scalable.

What many people here are talking about is whether an instance of the algorithm can be created (by a great human effort) in a very specific domain to saturate a given large compute resource. And yes, in that sense deep learning can show some success in very limited domains. Domains where there happens to be a boatload of data, particularly labeled data.

But you see there is a subtle difference here, similar in some sense to difference between Amdahl's law and Gustafson's law (though not literal).

The way many people (including investors) understand deep learning is this: you build a model A, show it a bunch of pictures, and it understands something from them. Then you buy 10x more GPUs, build a model B that is 10x bigger, show it those same pictures, and it understands 10x more from them. Look, I and many people here understand this is totally naive. But believe me, I have talked to many people with big $ who have exactly that level of understanding.


I appreciate the engagement in making this argument more concrete. I understand that you are talking about returns on compute power.

However, your last paragraph about how investors view deep learning does not describe anyone in the community of academics, practitioners and investors that I know. People understand that the limiting inputs to improved performance are data, followed closely by PhD labor. Compute power is relevant mainly because it shortens the feedback loop on that PhD labor, making it more efficient.

Folks investing in AI believe the returns are worth it due to the potential to scale deployment, not (primarily) training. They may be wrong, but this is a straw man definition of scalability that doesn't contribute to that thesis.


You’re arguing around the point here.

Almost all research domains live on a log curve; a little bit of work gets you a lot to start with, but eventually you exhaust the easy solutions and a lot of work gets you very little improvement.

You’re arguing we haven’t reached the plateau at the top yet, but you’ve offered no meaningful evidence that this is the case.

There are real world indicators that we are reaching diminishing returns for investment in compute and research now.

The ‘winter’ becomes a thing when those bets don’t work out and it becomes apparent to investors that they were based on nothing more concrete than opinions like yours.

Are we there yet? Not sure myself; I think we can get some more wins from machine-generated architectures... but I can’t see any indication that the ‘winter’ isn’t coming sooner or later.

Investment is massively outstripping returns right now... we’ll just have to see if that calms down gradually, or pops suddenly.

History does not have a good story to tell about responsible investors behaving in a reasonable manner and avoiding crashes.


Thanks for taking the time to render the more specific argument! I still don't think this is suggestive in a way that should influence readers. Here are some ways in which a naive "10x resources != 10x improvement" argument can err:

- Improvement is hard to define consistently. Sometimes improving classification accuracy by 0.5% means reducing error by 20% (e.g. going from 97.5% to 98% accuracy cuts the error rate from 2.5% to 2%), and enables economic applications that have 100x the value or frequency of use.

- Resources used in training can be amortized over the billions of times the same model is reused (much more cheaply). So even achieving an epsilon improvement in the expected utility of each inference can justify a massive increase in training cost (a back-of-envelope sketch follows this list).

- Some other notions of "better results" or "less expensive" include amount of training data required, social fairness of results, memory required or power used during inference, and so on. And there are major advances in current research on each of these better formalized axes!
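On the amortization point, a back-of-envelope sketch with made-up numbers, purely to illustrate the shape of the argument:

    training_cost = 10_000_000.0     # dollars spent on the bigger training run (made up)
    inferences = 5_000_000_000       # times the trained model gets reused (made up)
    value_gain_per_inference = 0.01  # the "epsilon" extra utility per call, in dollars

    extra_value = inferences * value_gain_per_inference
    print(extra_value)                # 50,000,000: dwarfs the training cost
    print(training_cost / inferences) # 0.002: amortized training cost per inference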

That last bit is what is so frustrating in reading an article like this. The author is sweeping aside, with vague arguments, a great deal of work that has been written and justified to a much, much higher standard of rigor (not just the VCs we all like to snark about). Readers should beware of trusting a summary like this without engaging directly with the source material.


I certainly encourage everybody to consult the source material! Man, this is a blog; an opinion is by default not perfect.

But when I hear the keyword "major advances" I'm highly suspicious. I have already seen so many such "major advances" that never went beyond a self-citing clique.


As a very concrete "major advance", consider Google Translate's tiny language models [1] that can be beamed to your phone, live in a few megabytes, and translate photographed text for you with low power usage. This was done with incredibly expensive centralized training, but it checks every meaningful box for "scalable" AI.

[1] https://ai.googleblog.com/2015/07/how-google-translate-squee...


This is a seriously flawed depiction of CFD.



