> "You lost me there- if we’re talking about a fixed number of multipliers reduced from double to single precision, and the double multipliers are 32% of the area, then because double is 4x the area, as you pointed out, the savings would be 24% not 16%, right?"
No.
The design with only DP multipliers uses them either as N DP multipliers or as 2N SP multipliers. If DP support is removed completely, an otherwise unchanged GPU is left with 2N SP multipliers, which have half of the original area, not a quarter.
Therefore, if the DP multipliers occupy P% of the area, removing DP support completely saves (P/2)% of the area, while reducing the DP throughput to 1/4 of the SP throughput saves (P/4)% of the area, because half of the DP multipliers are replaced by twice as many SP multipliers to keep the SP throughput unchanged.
Reducing the DP throughput to less than 1/4 of the SP throughput produces savings intermediate between (P/4)% and (P/2)%.
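The accounting above can be sketched as a small helper (a sketch under the stated assumptions: one DP multiplier occupies the area of 4 SP multipliers and can be replaced by 2 standalone SP multipliers with the same SP throughput; the function name and numbers are mine, for illustration):

```python
# Area saved by swapping dual-function DP/SP multipliers for pairs of
# plain SP multipliers, keeping SP throughput constant. Assumption:
# each replaced DP multiplier frees half of its own area.
def area_saved(p_dp, frac_replaced):
    """p_dp: fraction of the die occupied by DP multipliers (e.g. 0.32).
    frac_replaced: fraction of DP multipliers replaced by 2 SP each."""
    return p_dp * frac_replaced / 2

P = 0.32
print(area_saved(P, 1.0))   # remove DP entirely: saves P/2 = 0.16 of the die
print(area_saved(P, 0.5))   # DP at 1/4 of SP rate: saves P/4 = 0.08
```

Replacing all DP multipliers gives the (P/2)% figure; replacing half of them halves the DP throughput again (to 1/4 of SP) and gives (P/4)%.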
Also, a 64-bit multiplier (actually the DP multiplier is only a 53-bit multiplier, for the significand) is significantly less than 64 times larger than an adder, because the adders that compose the multiplier are much simpler than a complete adder: the chain of operations is organized so that there are far fewer modulo-2 sums and carry propagations than when naively adding the 64 partial products with complete adders.
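The simpler-than-a-complete-adder structure can be illustrated with a carry-save step (a 3:2 compressor), the building block that multipliers chain over their partial products; this is an illustrative sketch in integer arithmetic, not a description of any particular GPU's circuit:

```python
# A carry-save adder compresses three operands into two (sum, carry)
# using only per-bit logic, with no carry propagation. A multiplier
# chains such compressors over all partial products and performs only
# one full carry-propagating addition at the very end.
def carry_save_add(a, b, c):
    s = a ^ b ^ c                         # per-bit sum, carries ignored
    carry = (a & b) | (a & c) | (b & c)   # per-bit majority -> carry bits
    return s, carry << 1

a, b, c = 13, 7, 9
s, carry = carry_save_add(a, b, c)
assert s + carry == a + b + c   # one final real addition finishes the job
```

Because each compressor stage has constant depth regardless of operand width, the total area grows far more slowly than "one complete adder per partial product" would suggest.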
I have already said that there are ways to use the single-precision consumer GPUs: either by rewriting the algorithms to use a carefully chosen mix of single-precision and double-precision operations, or by representing numbers by multiple single-precision values (which already reduces the speed at least 10 times, making only the most expensive consumer GPUs faster than typical CPUs, but which is still faster than the native 1/32 speed).
However, using such methods may require 10 or even 100 times more effort for writing a program than simply writing it in double precision for CPUs, so this is seldom worthwhile.
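The "multiple single-precision values" representation is typically built from error-free transformations such as Knuth's TwoSum, which recovers the rounding error of an SP addition exactly; a minimal sketch with NumPy float32 (the setup is mine, for illustration):

```python
import numpy as np

# Knuth's TwoSum: returns the rounded float32 sum of a and b together
# with the exact float32 error term, so a pair (s, e) carries roughly
# double-single precision. Chains of such pairs give the multi-word
# SP arithmetic described above, at the cost of many extra operations.
def two_sum(a, b):
    s = np.float32(a + b)
    bb = np.float32(s - a)
    e = np.float32((a - (s - bb)) + (b - bb))
    return s, e

a = np.float32(1.0)
b = np.float32(1e-8)        # vanishes in a plain float32 addition
s, e = two_sum(a, b)
print(s, e)                 # s == 1.0, e recovers the lost ~1e-8
```

Every logical addition costs six real ones here, which is part of why the slowdown (and the programming effort) is so large.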
For almost any problem in engineering design or physical-systems modeling and simulation, double-precision is mandatory.
Single-precision numbers are perfectly adequate for representing all input and output values, because their precision and range match those available in digital-to-analog and analog-to-digital converters.
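A quick check of that boundary claim: float32 has a 24-bit significand, so every code of even a 24-bit converter is exactly representable, while the first integer beyond 2^24 is not (the round-trip helper below is mine, for illustration):

```python
import struct

# Round-trip a Python float through IEEE 754 binary32 to see what a
# float32 actually stores.
def to_f32(x):
    return struct.unpack('f', struct.pack('f', x))[0]

print(to_f32(2.0**24))        # 16777216.0: the largest 24-bit-exact range
print(to_f32(2.0**24 + 1))    # rounds back to 16777216.0: precision ends here
```

So for raw converter samples SP loses nothing; the trouble described next arises only in the intermediate arithmetic.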
On the other hand, most intermediate values in the computations must be in double precision. Not only is the loss of precision a problem; the range of representable values is also a problem. With single precision, there are many problems where overflows or underflows are guaranteed to happen, while no such thing happens in double precision.
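A minimal demonstration of a guaranteed SP overflow, using the naive 2-norm (the magnitudes are my illustration; any quantity whose square exceeds float32's ~3.4e38 maximum behaves the same):

```python
import numpy as np

# The naive 2-norm squares its inputs; (1e20)**2 = 1e40 overflows
# float32 even though both the inputs and the final result fit easily.
x64 = np.array([1e20, 1e20], dtype=np.float64)
x32 = x64.astype(np.float32)

norm64 = np.sqrt(np.sum(x64 * x64))       # fine in DP: ~1.414e20
with np.errstate(over='ignore'):
    norm32 = np.sqrt(np.sum(x32 * x32))   # inf in SP: the squares overflow
print(norm64, norm32)
```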
In theory, it is possible to avoid overflows and underflows by using various scale factors, adjusted to prevent the appearance of out-of-range results.
However, this is an idiotic method, because floating-point numbers were invented precisely to avoid the tedious operations with scale factors that are needed when using fixed-point numbers. If you have to manage scale factors in software, then you might as well use only integer operations, as floating-point numbers bring no simplification in that case.
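For contrast, here is what that scale-factor bookkeeping looks like for the overflowing norm case (a sketch; the function is hypothetical). It works, but every operation now drags its scale factor along, which is exactly the chore floating point was meant to eliminate:

```python
import numpy as np

# Manually scaled 2-norm: divide by the largest magnitude first so the
# squares cannot overflow, then multiply the scale back at the end.
def scaled_norm(x):
    m = np.max(np.abs(x))            # choose a scale factor
    if m == 0.0:
        return np.float32(0.0)
    y = x / m                        # now |y| <= 1 everywhere
    return m * np.sqrt(np.sum(y * y))

x = np.array([1e20, 1e20], dtype=np.float32)
print(scaled_norm(x))                # ~1.414e20, no overflow in SP
```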
There are many other such pieces of advice on how to use SP instead of DP, and they are frequently inapplicable.
For example, there is the theory that one should first solve a system of equations approximately in SP, and then refine the approximate solution iteratively in DP to get the correct solution.
There are some very simple, mostly linear problems where this method works. However, many interesting engineering problems, e.g. all simulations of electronic circuits, have systems of equations obtained by discretizing stiff non-linear differential equations. Trying to solve such systems approximately in SP usually either fails to converge or produces solutions which, when refined in DP, converge towards different solutions than those that would have been obtained if the system had been solved in DP from the beginning.
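The refinement scheme in question can be sketched for a deliberately well-conditioned linear system, the benign case where it does work (the matrix, sizes and tolerances here are my illustration, not from the text; stiff non-linear systems are where it breaks down):

```python
import numpy as np

# Mixed-precision iterative refinement: solve in float32, then repeatedly
# correct using the residual computed in float64. A real implementation
# would factor A32 once and reuse the factors; solve() is used here for
# brevity.
rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n)) + 50 * np.eye(n)   # well conditioned
x_true = rng.standard_normal(n)
b = A @ x_true

A32 = A.astype(np.float32)
x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
for _ in range(5):
    r = b - A @ x                                   # residual in DP
    x += np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)

print(np.max(np.abs(x - x_true)))                   # near DP accuracy
```

Each sweep multiplies the error by roughly the condition number times SP rounding error, which is why the method only converges when the system is mild to begin with.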
In conclusion, even when single precision can be used successfully, very rarely can that be done by just changing the variable types in a program. In most cases, a lot of work is necessary to ensure an acceptable precision of the results.
In most cases, I do not see any advantage in doing extra work and paying for GPUs, just because the GPU maker is not willing to sell me a better GPU at a price difference proportional to the difference in manufacturing cost.
Instead, I prefer to pay more for a faster CPU and skip the unnecessary work required for using GPUs.
I still have a few GPUs from the old days, when DP computation on GPUs was cheap (around $500 per double-precision teraflop/s), but they have become increasingly obsolete in comparison with modern CPUs and GPUs; no replacement for them has appeared in recent years, and no similar GPU models are expected in the future.
One DP multiplier has approximately the area of 4 SP multipliers, therefore twice the area of 2 SP multipliers.
One DP multiplier can, by reconfiguring its internal and external connections, function as either 1 DP multiplier or 2 SP multipliers. Therefore a GPU using only DP multipliers that does N DP multiplications per clock cycle will also do 2N SP multiplications per clock cycle, like all modern CPUs.
For example, a Ryzen 9 5900X CPU does either 192 SP multiplications per cycle or 96 DP multiplications per cycle, and an old AMD Hawaii GPU does either 2560 SP multiplications per clock cycle or 1280 DP multiplications per clock cycle.
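Those figures can be sanity-checked from the assumed per-part specs (my assumptions: the 5900X has 12 cores with 2 256-bit FMA pipes each; Hawaii has 2560 SP lanes running DP at a 1/2 rate):

```python
# Ryzen 9 5900X: each 256-bit FMA pipe holds 8 32-bit or 4 64-bit lanes.
cores, pipes, vec_bits = 12, 2, 256
sp = cores * pipes * vec_bits // 32   # SP multiplications per cycle
dp = cores * pipes * vec_bits // 64   # DP multiplications per cycle
print(sp, dp)                          # 192 96: the 2:1 ratio in the text

# AMD Hawaii: 2560 SP lanes, DP at half the SP rate.
hawaii_sp = 2560
print(hawaii_sp, hawaii_sp // 2)       # 2560 1280
```

In both cases the DP:SP ratio is 1:2, matching the dual-function-multiplier design described above.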
When you do not want DP multiplications, the dual-function DP/SP multiplier must be replaced by two SP multipliers, to keep the same SP throughput, so that the only difference between the two designs is the possibility or impossibility of doing DP operations. In that case the two SP multipliers together have half of the area needed by a DP multiplier with the same SP throughput.
If you compared two designs having different SP throughputs, there would be differences between them other than the support for DP operations, so the comparison would be meaningless.
When all DP multipliers are replaced by 2 SP multipliers each, you save half of the area previously occupied by multipliers, and the DP throughput becomes 0.
When only some of the DP multipliers are replaced by 2 SP multipliers each, the SP throughput remains the same, but the DP throughput is reduced. In that case the area saved is less than half of the original multiplier area and proportional to the number of DP multipliers replaced.
I understand your assumptions now. You’re saying SP mult is by convention twice the flops for half the area, and I was talking about same flops for one fourth the area. It’s a choice. There might be a current convention, but regardless, the sum total is a factor of 4 cost for each double precision mult op compared to single. Frame it how you like, divvy the cost up different ways, the DP mult cost is still 4x SP. Aaaanyway... that does answer my question & confirm what I thought, thank you for explaining and clarifying the convention.