It all started from a linkedin post. A post about the performance of a particular FAST decoder turned into a series of comments with claims and questions.
One particular comment caught my interest, from Mr. Jeffrey M. Birnbaum. He posted the benchmark of his decoder with some exceptional numbers. Although it raised a lot of doubts I know it is not impossible. As a matter of fact I wrote in a previous comment with some thoughts about how I can do to improve my decoder. It is just some thoughts and I never bothered to implement it. Jeffery’s work inspired me to start implementing my ideas.
So I spent a whole weekend on the improvement. The result is quite significant. I see a three times performance improvement with my new implementation. So I emailed Jeffrey with my result and thanked him for his inspiration. Jeff kindly returned my email. And I learned two things from his email. First of all, it appears that his method is different than mine. Secondly, he is extremely confident about the superiority of his decoder.
So I decided to take his challenge and enter into a friendly match against him. Just like Jeff said, “I like a little competition and in the end we will make each other a bit better”. I have to say further improvement is not easy. I know what to do but it’s all tedious work and it takes efforts to get there. To make long story short, I managed to make some time for the work and right now, after a week, I have some result.
Before we go any further, let me first explain the CME market data I used. CME disseminates both quote change and trade information in FAST format. The FAST message (or segment) is wrapped in UDP multicast packets. Currently one UDP packet only contains one FAST message. The FAST encoding method can compress information with very high compression ratio.
The structure of FAST message is very flexible and quite complicated. It is basically a tree structure with predefined schema (FAST template in XML format). The most commonly used message is called “IncrementalRefresh”. The ”IncrementalRefresh” may contain several entries with quote and trade changes.
For example, if the current market best bid and offer is:
Offer: 129000 X 200
Bid: 128975 X 50
And there are two orders sitting at best bid with 25 contracts each. If someone sent a market order to take 50 contracts from bid side. It may result in an ”IncrementalRefresh” with 3 “MDEntries”:
Trade 25@128975; Trade 25@128975; Quote Remove Bid Level 1
There may be some confusions about FAST benchmark result because some refer “message” as “UDP packet” and some refer ”message” as each “Fast Segment”. In the example above, if ”message” is referred to as each “Fast Segment”, then the total messages is 3; otherwise it is 1.
Jeff suggests that “MB/sec is the only fair comparison” and I agree with him.
Now here comes my result, tested on my 3.33GHz Xeon server with GCC 4.4.0.
Total time = 14.743588 seconds;
Packets/Entries/MaxEntriesPerPacket = 34217764/165204011/116;
0.430875 microsecond per packet; 0.089245 microsecond per entry.
Benchmarked using a file with size of 2,838,336,532 bytes. In the benchmark result listed above, the “packet” refers to UDP packets from CME. Entry only refers to MDEntry contained in the packet (not including the MDIncRefresh or MDSnapshotFullRefresh body).
We can tell that for that day there are an average of (0.43/0.089) = 4.83 MDEntries per UDP packet. However the biggest message contains 116 entries in a single packet.
Can I further improve it? Of course I can! Now I have optimized the integer, PMAP, and stream handling. String handling is only partially optimized. And dictionary is not optimized at all (it is still using several std::vector arrays). I can easily squeeze dozens of extra nanoseconds from each entry’s processing.
The beauty about FAST protocol is that once the implementation is done, one only needs to get a template file from exchanges that supports FAST based market data (Arca book, Eures, EBS etc), in order to decode the market data from the exchange. So yes, my implementation is universal.
So what does it mean:
- For some strategies, if one always needs to compete with others on certain amount of shares, it could mean responding many microseconds faster than competitors under certain market condition.
- We’ve heard of performance claim of FPGAs. But both Jeff and I agree it is a bit overrated. It will be difficult for FPGA to match the performance of efficiently implemented pure software decoder, if not impossible. Now the biggest bottleneck is TCP/UDP network processor. It might be something FPGA can do to make significant improvement. But unfortunately I haven’t seen any FPGA product going to that direction.
May the best impl win! And happy trading.
P.S. In case you are interested in the “friendly match”between Jeff and me, all I can tell is that Jeff’s original number is even slightly better than mine (he is using a 2.99 GHz Core i7 and a different data set). While I’m still waiting for his updated result, I would say it will be difficult for any of us to blow the doors off the other’s.
Update: As of Feb. 8th 2011, At midnight, Jeff sent me his up-to-date benchmark result with the same data on a 2.93 GHz Core I7. It is 12.5 seconds to process the whole file! In this case I will be obliged to further improve my implementation. Hopefully I can get it done by tomorrow. Let the game continue! And stay tuned!