Intel Threading Building Blocks

krapht · on June 16, 2015

I'd be interested in reading about real world experience with Intel TBB, and comparisons to Fastflow, which seems to be the nearest competitor.

mbanck · on June 16, 2015

What a coincidence, I just came across this today: http://arrayfire.com/benchmarking-parallel-vector-libraries/ (posted by Kyle Lutz on G+, he is the author of boost.compute)

danieljh · on June 16, 2015

From what I can see in the scientific community it seems like TBB does not get the attention it deserves.

In my opinion this comes from two simple facts: 1/ it's easier to throw an OpenMP #pragma on your loop after detecting the bottleneck with a profiler and 2/ you need to understand some TBB concepts like partitioning/splitting to implement your own algorithms. That is, TBB integrates much more deeply into the language (modern C++ with lambdas etc.).

I won't tell you much about TBB's concurrent containers (maps, queues, vectors, ...), parallel algorithms (sort, scan, reduce, ...), or memory allocators. Those are explained in detail in the documentation. What I want to tell you about are TBB's Pipeline and FlowGraph features, because of how powerful they are and how often they are simply ignored.

TBB's pipeline lets you build a pipeline of parallel or serial stages. On a high level it really is function composition, with the benefit of TBB deciding on the level of parallelism.

Here is an example of using TBB's pipeline to receive from the network, deserialize the blob and merge the results concurrently and potentially in parallel:

https://github.com/daniel-j-h/DistributedSearch/blob/97224b1...

You can find a quick explanation in this blog post:

https://daniel-j-h.github.io/post/intuitive-monadic-bind-kle...

All you have to do is create your stages as functions: take input from stage n-1, process it and move ownership of the item over to stage n. Note: with C++11's move semantics you do not have to pass raw pointers around or do the memory management yourself, as it is done in TBB's documentation!

As you can see, you only need the parallel_pipeline function (and make_filter<In, Out> for when the stages are not passed directly as lambdas):

https://www.threadingbuildingblocks.org/docs/help/reference/...

TBB's pipeline is really powerful for scenarios where a linear processing chain is needed, e.g. face detection with OpenCV where you have to 1/ grab a frame 2/ do some histogram corrections 3/ apply a Gaussian blur 4/ apply a face detection algorithm 5/ merge the face's position into a circular buffer 6/ average over this buffer to estimate the face position.

TBB's FlowGraph is for when your dependencies are too complex to express as a simple pipeline. In the OpenCV example, maybe you need to buffer 5 frames for the face detection, or maybe you need to join inputs from a camera and a video and react when a face is found in a camera frame.

An example for expressing arbitrary dependencies is here:

https://www.threadingbuildingblocks.org/docs/help/reference/...

And an example for joining two nodes:

https://www.threadingbuildingblocks.org/docs/help/reference/...

This is the documentation's starting point for FlowGraph:

https://www.threadingbuildingblocks.org/docs/help/index.htm#...

There is also a book about TBB out there and it contains some additional examples. But as with most of the documentation, those examples are not written in modern C++ (C++11/C++14) and the book is a bit dated.

gonewest · on June 16, 2015

That's a good answer. Here's one more example from the visual effects industry. This is for parallel evaluation of potentially complex dependency graphs in an interactive character engine.

http://www.multithreadingandvfx.org/course_notes/Paralleleva...

vvanders · on June 16, 2015

We'd looked at TBB at a previous gig for something similar. However, we never ended up pursuing it for unrelated reasons.

gh02t · on June 16, 2015

Indeed, it always surprised me that TBB didn't have more traction in the scientific community. The Intel compilers and MKL are pretty popular in the HPC community as is, so it seems natural that TBB would get more attention.

IMO, the reason is probably exactly what you said: the programming model is a bit different from the standard "slap an OMP pragma on it" and requires deliberate design to target TBB. From my small amount of experience with it though, it's actually pretty easy to use and can often be much more approachable than the more down-and-dirty OpenMP shared memory approach.

a8da6b0c91d · on June 16, 2015

Now that Cilk Plus is in GCC I think there's even less reason to use TBB directly. That cilk_spawn stuff is so nice.