The article starts by saying that low latency is the main requirement for financial trading, but then quotes all performance numbers in throughput ("6 million orders per second"). I couldn't find any mention of average or worst-case latency, and throughput numbers alone tell you nothing about latency.
In particular, since each input event has to be journaled and replicated (which both involve I/O) before it can be processed, there is a potentially large and unpredictable delay for each event.
Another issue is that this architecture assumes that 1 CPU core is enough to run all business logic processors, and that 1 machine's memory is enough to hold your state. If your processing is CPU-intensive or your state is large, you'll hit a scaling wall that requires you to manually shard your input across multiple machines.
You are right that predictable latency is key, and that is one of the big wins we have found with the Disruptor and our surrounding infrastructure. The Disruptor is not the only thing we have worked on to achieve this. For example, we have spent time experimenting with journalling and clustering techniques to give us high-throughput, predictable mechanisms for protecting our state.
As Martin said in his reply, our system is implemented as a number of services that communicate via pub-sub messaging, so it is not quite as naive as you assume: we still have a distributed system. The high-throughput, low-latency functions of the system (e.g. order matching in our exchange, or risk evaluation in our broker) are each serviced on a single business logic thread, on separate servers.
Our latency measurements include these multiple hops within our network: they cover the time from an inbound instruction arriving at our network to a fully processed outbound response being back at the edge of our network. As Martin pointed out, modern networking can be quite efficient when used well.
We have designed the system to allow us to shard when the need arises, but we are actually a long way from that need. Even though these high-performance services keep their entire working set in memory, that working set is still relatively small compared to the amount of memory available in modern commodity servers. We currently have lots of head-room!
We think this is a very scalable and under-used approach. Keeping the working set in memory is pretty straightforward for many business applications, and has the huge benefit of being very simple to code: much more straightforward than the shuffling of data around, and translation from one form to another, that is such a common theme in more conventional systems.
The sharding decision is simple: for our problem domain we have two obvious dimensions for sharding, accounts and order-books. Each instance (shard) would continue to have its business logic processed on a single thread. I think this is normal for most business problems; it is a matter of designing the solution to avoid shared state between shards.
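To make the idea concrete, here is a minimal sketch of sharding by account, with all names and the balance-tracking logic my own invention rather than anything from LMAX: each shard owns its state on a single thread, so shards share nothing and the state itself needs no locks.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: shard by account id, one owning thread per shard.
public class AccountShards {
    static final int SHARD_COUNT = 4;
    private final ExecutorService[] shardThreads = new ExecutorService[SHARD_COUNT];
    @SuppressWarnings("unchecked")
    private final Map<Long, Long>[] balances = new Map[SHARD_COUNT];

    public AccountShards() {
        for (int i = 0; i < SHARD_COUNT; i++) {
            shardThreads[i] = Executors.newSingleThreadExecutor();
            balances[i] = new HashMap<>(); // safe: only shard i's thread touches it
        }
    }

    private static int shardOf(long accountId) {
        return (int) (accountId % SHARD_COUNT); // routing is a trivial function of the key
    }

    public void credit(long accountId, long amount) {
        int s = shardOf(accountId);
        run(s, () -> balances[s].merge(accountId, amount, Long::sum));
    }

    public long balance(long accountId) {
        int s = shardOf(accountId);
        long[] out = new long[1];
        run(s, () -> out[0] = balances[s].getOrDefault(accountId, 0L));
        return out[0];
    }

    public void shutdown() {
        for (ExecutorService e : shardThreads) e.shutdown();
    }

    // Submit work to the shard's single thread and wait for completion.
    private void run(int shard, Runnable task) {
        try {
            shardThreads[shard].submit(task).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Because an account only ever lives in one shard, no operation needs to coordinate across shards, which is exactly the "no shared state between shards" property described above.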
We are not advocating "no parallelism"; rather, we advocate that any piece of state should be modified by a single thread, to avoid contention. Our measurements have shown that the cost of contention almost always outweighs the benefit. So avoid contention, not parallelism.
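The single-writer idea can be sketched in a few lines; this is my own illustration, not LMAX code. Any number of threads may submit events, but exactly one thread mutates the state, so the state carries no lock and never suffers write contention.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Minimal single-writer sketch: producers enqueue, one thread owns the state.
public class SingleWriterTotal {
    private final BlockingQueue<Long> inbox = new LinkedBlockingQueue<>();
    private volatile boolean running = true;
    private long total = 0; // mutated ONLY by the writer thread, so unsynchronized
    private final Thread writer;

    public SingleWriterTotal() {
        writer = new Thread(() -> {
            try {
                // Drain until asked to stop and the queue is empty.
                while (running || !inbox.isEmpty()) {
                    Long v = inbox.poll(1, TimeUnit.MILLISECONDS);
                    if (v != null) total += v;
                }
            } catch (InterruptedException ignored) {
            }
        });
        writer.start();
    }

    public void submit(long value) { // callable from any thread
        inbox.add(value);
    }

    // Stop the writer and read the final total (join() makes this safe).
    public long close() {
        running = false;
        try {
            writer.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return total;
    }
}
```

Note the parallelism has not gone away: many producers run concurrently; only the mutation of `total` is confined to one thread, which is the distinction being drawn above.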
Dave Farley
(Head of Software development at LMAX)
They don't call the system low latency --- it is the processing that occurs in the single-threaded main loop that has to be low latency. They of course want the end-to-end system to be low latency, but the term here describes the inner work loop, because their approach doesn't work if you have higher latency work items in the main loop.
If you apply "low latency" to the inner loop only, then you can resolve your second critique too: they won't be supporting anything that is CPU-intensive (since that isn't low latency). Also, all state has to fit in a single node, to keep things low latency.
You're saying this approach doesn't fit problem domains that are CPU-intensive. That's a valid point.
The article doesn't directly address this but does compare LMAX to a typical database backed business application. The computation here is not usually CPU-bound.
The LMAX system is more about latency than throughput. When designing for very low latency, the result can be a system that achieves very high throughput if the appropriate techniques are employed. Single-threaded applications are very suitable for low latency because of the predictability they bring and their avoidance of lock contention.
Some means of reliably delivering messages to and from this single thread is necessary to make a useful application, and these messages must be delivered even in the event of system failures. To address this need, the Disruptor is employed to pipeline the replication, journalling and business logic for these messages and run them in parallel. The whole system is asynchronous and non-blocking.
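The shape of that pipeline can be sketched without the real `com.lmax.disruptor` API; the following is a hand-rolled stand-in of my own. The journaller and replicator consume the same event slots in parallel, each publishing its own progress sequence, and the business logic only processes slot `i` once both upstream sequences have passed `i`.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hand-rolled sketch of a Disruptor-style dependency graph (not the real API).
public class PipelineSketch {
    static final int N = 8;
    final long[] events = new long[N];                // pre-published event slots
    final AtomicLong journalled = new AtomicLong(-1); // last slot journalled
    final AtomicLong replicated = new AtomicLong(-1); // last slot replicated

    public long run() {
        for (int i = 0; i < N; i++) events[i] = i + 1;

        Thread journaller = new Thread(() -> {
            for (int i = 0; i < N; i++) journalled.set(i); // stand-in for a disk write
        });
        Thread replicator = new Thread(() -> {
            for (int i = 0; i < N; i++) replicated.set(i); // stand-in for a network send
        });
        journaller.start();
        replicator.start();

        long total = 0;
        for (int i = 0; i < N; i++) {
            // Business logic gates on the slower of the two upstream consumers.
            while (journalled.get() < i || replicated.get() < i) {
                Thread.onSpinWait();
            }
            total += events[i]; // the "business logic" here is just a running sum
        }
        try {
            journaller.join();
            replicator.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return total;
    }
}
```

The point of the structure is that the two I/O-bound stages overlap with each other and with the business logic on earlier slots, rather than each event paying journal-then-replicate-then-process latency serially.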
In our architecture we have multiple gateway nodes that handle border security and protocol translation to and from our internal IP multi-cast binary protocol for delivery to highly redundant nodes. Lots of external connections can be multiplexed down from the outside world this way.
The 6 million TPS refers to our matching engine business logic event processor. We have other business logic processors for things like risk management and account functions. These can all communicate via our guaranteed message delivery system that can survive node failures and restarts, even across data centres.
Modern financial exchanges can process over 100K TPS and have to respond with latency in the 100s of microseconds firewall to firewall, thus including the entire internal infrastructure. Those tracking the latest developments will know it is possible to have single-digit-microsecond network hops with IP multicast using user-space network stacks and RDMA over 10GigE; even a well-tuned 1GigE stack can achieve sub-40us for a network hop. For reference, single-digit microseconds is in the same space as a context switch on a lock with the kernel arbitrating. Most financial exchanges rely on having data on multiple nodes before the transaction is secure, and a number of these nodes can be asynchronously journalling the data down to disk. At LMAX we tend to have data in 3 or more nodes at any given time.
In my experience of profiling many business applications, when the application domain is well modelled the vast majority of the time is spent either in protocol translation, such as XML or JSON to business objects, or within the JDBC driver doing buffer copying and waiting on the database to respond.
Often applications are not well modelled for their domain. This can result in algorithms that, rather than being O(1) for most transactions, have horrible scale-up characteristics because of inappropriate collections representing relationships. If you have the luxury of developing an in-memory application requiring high performance, it quickly becomes apparent that the cost of a CPU cache miss is the biggest limitation on latency and throughput. For this one needs to employ data structures that exhibit good mechanical sympathy for the CPU and memory subsystem. At LMAX we have replaced most of the JDK collections with our own, which are cache-friendly and garbage-free.
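As an illustration of the style being described (not LMAX's actual collections), here is a minimal open-addressing `long -> long` map: probes walk two flat arrays, so lookups stay in contiguous memory, and no boxed `Long` objects are allocated per entry, unlike `HashMap<Long, Long>`.

```java
// Sketch of a cache-friendly, garbage-free primitive map. Fixed power-of-two
// capacity, no resizing or removal: kept deliberately small for illustration.
public class LongLongMap {
    private final long[] keys;
    private final long[] values;
    private final boolean[] used;
    private final int mask;

    public LongLongMap(int capacityPow2) { // capacity must be a power of two
        keys = new long[capacityPow2];
        values = new long[capacityPow2];
        used = new boolean[capacityPow2];
        mask = capacityPow2 - 1;
    }

    public void put(long key, long value) {
        int i = (int) (mix(key) & mask);
        while (used[i] && keys[i] != key) {
            i = (i + 1) & mask; // linear probe: the next slot is the next cache line (or the same one)
        }
        used[i] = true;
        keys[i] = key;
        values[i] = value;
    }

    public long get(long key, long missing) {
        int i = (int) (mix(key) & mask);
        while (used[i]) {
            if (keys[i] == key) return values[i];
            i = (i + 1) & mask;
        }
        return missing;
    }

    private static long mix(long k) { // cheap bit-spreader so keys don't cluster
        k ^= k >>> 33;
        k *= 0xff51afd7ed558ccdL;
        k ^= k >>> 33;
        return k;
    }
}
```

`put` and `get` allocate nothing, so a hot loop over this map produces no garbage and no GC pauses, which is the property the comment above is after.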
So far we have had no issue processing all transactions for a given purpose on a single thread, or holding all the live state in memory on a single node. If we ever cannot process all the necessary transactions on a single thread, then we simply shard the model across threads/execution contexts. We hold only the live mutating data in memory and archive completed transactions out to a database, as they are then read-only.