Actually, 'find' will also stat() each entry no matter what.
Many of the standard tools that most people would intuitively expect to be well optimized (find, rsync, gzip) are embarrassingly inefficient under the hood and turn belly up when confronted with data of any significant size.
That probably stems from the fact that most of the development on these tools took place at a time when 1 GB hard drives were "huge" and SMP was "high end".
The only issue I'm aware of with gzip is actually in zlib, which stored its byte counters in 32 bits; but those counters were strictly optional, and it works fine with data that overflows them. The zlib window size may be only 32K, but bzip2 doesn't do that much better with 900K blocks and a better algorithm, so I wouldn't call it embarrassingly inefficient.
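If it helps, here's roughly where that 32K figure comes from. A minimal stdin-to-stdout compressor sketch, modeled on zlib's own zpipe.c example (assumes zlib is installed): deflateInit2() caps windowBits at 15, i.e. a 2^15-byte history window; adding 16 merely requests a gzip wrapper.

    /* Sketch: compress stdin to stdout with zlib's gzip framing.
       windowBits tops out at 15 (a 32K LZ77 window); +16 = gzip header. */
    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    int main(void)
    {
        z_stream strm;
        memset(&strm, 0, sizeof(strm));

        /* 15 is the largest window zlib accepts; compare bzip2's
           900K blocks. Error handling kept minimal for brevity. */
        if (deflateInit2(&strm, Z_BEST_COMPRESSION, Z_DEFLATED,
                         15 + 16, 8, Z_DEFAULT_STRATEGY) != Z_OK)
            return 1;

        unsigned char in[16384], out[16384];
        int flush;
        do {
            strm.avail_in = fread(in, 1, sizeof(in), stdin);
            flush = feof(stdin) ? Z_FINISH : Z_NO_FLUSH;
            strm.next_in = in;
            do {
                strm.avail_out = sizeof(out);
                strm.next_out = out;
                deflate(&strm, flush);
                fwrite(out, 1, sizeof(out) - strm.avail_out, stdout);
            } while (strm.avail_out == 0);
        } while (flush != Z_FINISH);

        deflateEnd(&strm);
        return 0;
    }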
How do you tell a file from a directory without stat()ing it? The d_type field is not portable. Since find and other tools like it need to recursively descend a directory tree, a stat() for each file to determine its type is unavoidable.
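For what it's worth, here's a sketch of the compromise portable tools end up with: trust d_type on filesystems that fill it in, and fall back to one lstat() per entry otherwise. The DT_* constants are a BSD/glibc extension rather than POSIX, hence the #ifdef; the fixed-size path buffer is just to keep the sketch short.

    /* Sketch: recursive descent using d_type when available,
       falling back to lstat() when the filesystem reports DT_UNKNOWN
       (or on systems without d_type at all). */
    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>

    static void walk(const char *path)
    {
        DIR *dir = opendir(path);
        if (!dir)
            return;

        struct dirent *ent;
        while ((ent = readdir(dir)) != NULL) {
            if (!strcmp(ent->d_name, ".") || !strcmp(ent->d_name, ".."))
                continue;

            char child[4096];
            snprintf(child, sizeof(child), "%s/%s", path, ent->d_name);

            int is_dir;
    #ifdef DT_DIR
            if (ent->d_type != DT_UNKNOWN) {
                /* Fast path: the filesystem told us the type for free. */
                is_dir = (ent->d_type == DT_DIR);
            } else
    #endif
            {
                /* Slow path: one lstat() per entry, as noted above. */
                struct stat st;
                if (lstat(child, &st) != 0)
                    continue;
                is_dir = S_ISDIR(st.st_mode);
            }

            puts(child);
            if (is_dir)
                walk(child);
        }
        closedir(dir);
    }

    int main(int argc, char **argv)
    {
        walk(argc > 1 ? argv[1] : ".");
        return 0;
    }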
But times have changed, and development isn't dead. Why haven't they been updated? The optimizations you're implying are often straightforward and well understood, not major undertakings to implement.
"But times have changed, and development isn't dead. Why haven't they been updated?"
Maybe because listing 8M files is not a common use case, and there just isn't the motivation to rework otherwise perfectly working code. It's not an itch anyone needs to scratch.