
One of the authors here.

I agree that most of the value in a clinical application won't come from the (often, though not always, relatively small) performance gains from tweaking your neural network architecture or fiddling with the loss function. Collecting a high-quality and diverse dataset is important for training, and arguably even more important for validation, because you want to show that the deployed model is reliable.

But before deploying a model, I'd argue it is worth testing a few architectures to determine whether one is substantially better than the rest. It can be a pain to try out a bunch of architectures, but the ones we mention in the article have many implementations freely available (and we provide our own!). So you can drop in one of these architectures and test it pretty easily (especially if you skip hyperparameter tuning initially).
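A minimal sketch of what that workflow could look like (this is my illustration, not code from the article; the model names, the `evaluate` function, and its scores are hypothetical placeholders standing in for off-the-shelf implementations run on your own validation set):

```python
# Hedged sketch: screen several candidate architectures with default
# hyperparameters on a fixed validation set, then tune only the winner.
# `evaluate` is a hypothetical placeholder; in practice it would train
# or load each model and return a validation metric.

def evaluate(model_name, val_data):
    # Placeholder returning made-up scores purely to illustrate the loop;
    # replace with real training/evaluation of a drop-in implementation.
    fake_scores = {"unet": 0.81, "deeplabv3": 0.84, "fcn": 0.71}
    return fake_scores[model_name]

def pick_architecture(candidates, val_data):
    """Evaluate each candidate architecture once and return the best,
    plus all scores so large gaps between architectures are visible."""
    scores = {name: evaluate(name, val_data) for name in candidates}
    best = max(scores, key=scores.get)
    return best, scores

best, scores = pick_architecture(["unet", "deeplabv3", "fcn"], val_data=None)
print(best, scores)
```

The point of the loop is cheap breadth first: if one architecture is far ahead of the others even before tuning, that gap usually matters more than a tuned 2-3%.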

Spending too much time fussing over a 2-3% performance gain is silly, but sometimes, surprisingly, the difference in performance from choosing another architecture can be much greater. I wish I had more intuition as to why some architectures perform well and others don't. It would certainly make R&D easier if you could ignore the architecture choice entirely.


