From what I can see in the scientific community it seems like TBB does not get the attention it deserves.
In my opinion this comes from two simple facts: 1/ it's easier to throw an OpenMP #pragma on your loop after detecting the bottleneck with a profiler and 2/ you need to understand some TBB concepts like partitioning/splitting to implement your own algorithms. That is, TBB integrates much more deeply into the language (modern C++ with lambdas etc.).
I won't tell you much about TBB's concurrent containers (maps, queues, vectors, ...), parallel algorithms (sort, scan, reduce, ...), or memory allocators. Those are explained in detail in the documentation.
What I want to tell you about are TBB's Pipeline and FlowGraph features, because of how powerful they are and how often they are simply ignored.
TBB's pipeline lets you build a pipeline of parallel or serial stages.
At a high level it really is function composition, with the benefit that TBB decides on the level of parallelism.
Here is an example of using TBB's pipeline to receive from the network, deserialize the blob and merge the results concurrently and potentially in parallel:
All you have to do is create your stages as functions: take input from stage n-1, process it and move ownership of the item over to stage n.
Note: with C++11's move semantics you do not have to pass raw pointers around or do the memory management yourself, as is done in TBB's documentation!
As you can see, you only need the parallel_pipeline function (and make_filter<In, Out> for when the stages are not passed directly as lambdas).
TBB's pipeline is really powerful for scenarios where a linear processing chain is needed, e.g. face detection with OpenCV where you have to 1/ grab a frame 2/ do some histogram corrections 3/ apply a Gaussian blur 4/ apply a face detection algorithm 5/ merge the face's position into a circular buffer 6/ average over this buffer to estimate the face position.
TBB's FlowGraph is for when your dependencies are more difficult to express as a simple pipeline.
In the OpenCV example, maybe you need to buffer 5 frames for the face detection, or maybe you need to join inputs from a camera and a video and react when a face is found in a camera frame.
An example for expressing arbitrary dependencies is here:
There is also a book about TBB out there and it contains some additional examples.
But as it is with most of the documentation, those examples are not written in modern C++ (C++11/C++14) and the book is a bit dated.
That's a good answer. Here's one more example from the visual effects industry. This is for parallel evaluation of potentially complex dependency graphs in an interactive character engine.
Indeed, it always surprised me that TBB didn't have more traction in the scientific community. The Intel compilers and MKL are pretty popular in the HPC community as is, so it seems natural that TBB would get more attention.
IMO, the reason is probably exactly what you said: the programming model is a bit different from the standard "slap an OMP pragma on it" and requires deliberate design to target TBB. From my small amount of experience with it though, it's actually pretty easy to use and can often be much more approachable than the more down-and-dirty OpenMP shared memory approach.