Forum OpenACS Q&A: Building a marketplace: NYSE, NASDAQ, CBOT

I am looking for papers, books, designs, etc. that describe how large scale marketplaces are (or have been) built. Exchanges similar to the NYSE, the NASDAQ, and the Chicago Board of Trade.

A client would like to build such an exchange that can handle offers to sell, offers to purchase as well as programmed trades (buy X if the price becomes less than Y, sell X if the price becomes greater than Z). This exchange is expected to meet a peak demand of approximately 500 transactions per second.

I would love to find out if this level of transaction can be (has been) handled with Oracle or PG, or if folks have approached this with specialized databases.

If this has been implemented with relational models, what do those models look like? How has the business logic been implemented? And how does one efficiently handle programmed trades? (I am blinded by a past experience which would suggest avoiding the relational world if possible, and tossing the whole thing into one giant Rete net.)

Any references? And if you have extensive experience building such markets, send me an email, I may have a job or contract for you in a few weeks.


Posted by Don Baccus on
I know very little, but do know that there is at least one specialized db server product that is set up specifically to mine such quickly-changing data very efficiently (as opposed to MySQL which was originally designed to mine infrequently changing data quickly, or a generalized RDBMS which is general purpose and designed to address a wide range of situations more or less inefficiently or efficiently depending on your POV).

And I could swear there was a reference here. If not here, then probably the PG hackers list.

Essentially the technique seemed to be to organize data not by tables but by denormalization into in-RAM structures based on knowledge of the types of queries that would be requested.

Not at all general purpose in the way an RDBMS is meant to be, and certainly not a replacement paradigm, but rather targetted for explicit needs, in this case fast queries against huge amounts of financial/market data.

This doesn't help much but at least perhaps will whet your appetite for searching further, both here and elsewhere.

Nice to hear from you, Jerry ... sounds like you're involved with some fairly heavy shit!

Posted by Marc Spitzer on
I have heard that kdb was good for this kind of stuff,  There are some white papaers there that might be of use.
I believe that there are several in-memory database vendors out there - there is Cache (with an accented e) and one called TenTimes I think as well.

Most likely, you want to read up on TP (transaction processing) monitors which monitor the state of transactions across more than one database.  They are usually pretty pricy.  Your programmed trades sound  like triggers implemented on every update to the price of the item.

Alternatively, put a big mainframe on the task with lots of fast disk.

Posted by David Kuczek on
Hello Jerry,

there has been two discussions on high scalability maybe you find something valueable there...

Posted by Samir Joshi on
I don't know much implementation details, but many stock exchanges use TIB (The Information Bus) messaging software from for real-time trading...TIB handles network transport very reliably with optional support for certified delivery, and its publish/subscribe approach reduces network traffic. It has event based hooks to interface with data source and consumers, but the software itself does not manage persistent storage - just enqueues messages in RAM (or disk for certified delivery messages).
Jerry, what's a "Rete net"?
Posted by Jerry Asher on
Google for rete algorithm forgy ops5.

simple descriptions that aren't terribly helpful

When I was working with production systems, and hooking them into legacy systems and RDBM systems in realtime what struck me was the following.

Consider that you have an application with tons and tons of data. It may have a relatively simple data model. What is not so simple is the business logic, which if you could, could be modeled as many many many triggers. If A and B then update C. If C and not D then delete E.

Well that's a production system, but implemented in an RDBMS, and it's a damn shame the two communities could never speak to one another. Basically these productions systems have all the power of SQL and the SQL is embedded into rules. Another word for a rule would be a trigger. So a rete net has three components: a huge collection of data, a huge collection of IF->THEN rules, and an agenda. The agenda keeps a list of rules that have completely satisfied their IF clause and are ready to fire their THEN clause.

What a rete net does is take those hundreds and thousands of triggers and it precompiles, optimizes, and joins all the queries. So you have this huge network of condition nodes, A, ~A, B, C<2, D>Z, that appear in various rules. Whenever a new piece of data (object, record) enters the system, it trickles through this network. Sometimes that piece of data will find that it completes the match of one of these precomputed queries. When that happens, the "then" portion of that rule is added to the agenda. (There may be various ways to prioritize that agenda, depending perhaps on how complex the rule was, or how salient the system or developer thought the rule was.)

I believe a terrific application for it, might be in building a trading system with many, many different sorts of programmed trades. Trades that could apply depending on various arbitrage, option, or other sorts of conditions.

The problem with these systems is that they have always positioned themselves as better/different from an RDBMS, which is silly since folks that can buy Oracle can buy anything, and since all of these systems have to get their data from somewhere, and store their data back.

I have always wanted to see a data blade/plug in version of one of these things into a popular db: Oracle, DB/2, or PG. Anyway, it's cool stuff, and has usually turned into a way to poor your VC money down the hole slowly.

Posted by Jerry Asher on
Something seemed to get cut out of that answer....

You have this huge network of condition fragments, and when a new piece of data (a record) comes in, it trickles through that network.  Sometimes that piece of data finds that it and some data that has already entered the system completely satisifies the IF clause of a rule.  When that happens, the THEN clause is added to the agenda.  The THEN clause can access the specific records that matched the IF clause.  Sometimes a piece of data results in many many rule matches being added to the agenda.  Sometimes a piece of data doesn't satisfy a complete rule, and when that happens, it remains in the network in the queries that it helps partially match.

I'm reading "Analysis Patterns" by Martin Fowler, and Chapters 9, 10 and 11 are, respectively:
  • Trading
  • Derivative Contracts
  • Trading Packages
I haven't yet read those chapters, but they may give you some ideas. The book in general seems to be pretty good.
Hm, "hard active" datbases, the Rete, TREAT, and LEAPS algorithms, the old HiPAC, VenusDB, and Sentinel projects. Interesting stuff, but yes, seemingly very little that's usefull to RDBMS users in the real world. Most of the research papers also seem to be old, not much in the last five years or so. It's not clear whether anyone's working on this stuff at all right now.

It would be nice to have this kind of Rete-algorithm capability in an RDBMS like Oracle or PostgreSQL. I don't know how triggers are implemented in Oracle exactly, but they definitely are not as sophisticated as rule-based systsms which use the Rete algorithm. I ran into this back when I was implementing an online bond trading site (which worked, but never even remotely approached the size of the NYSE!).

I forget the exact error, but Oracle does not allow recursive trigger execution. So, if you have programmed trades in the system, and you execute one order which changes system information (like "volume of this security traded today"), which could then cause another trade to execute - you have a problem. One can imagine ways to program around this limitation in Oracle, but a more general solution for such scenarios may call for something like the "coupling modes" introduced by HiPAC. At the time, our approach was simply to Not Do That; we avoided adding any features that would cause recursive trigger execution.

Posted by Ken Mayer on
I worked for the Philadelphia Stock Exchange (PHLX) years and years ago, but here're some ancient memories: There are some basic divisions in the work: Quotation, Executions, Orders and Clearing. For all practical purposes, these were implemented as independent systems that talked to each other via standardized (SIAC -- NYSE's IT organization) message formats.
  • Quotes: The market maker would enter new ask and offer prices, the quotes would then hit the ticker.
  • Orders: Were "standing" orders for executions. If a quote triggered an order, it would send it to the execution processor.
  • Executions: Could be received from the floor, from broker offices, or from our automated system. The trade was recorded to a "log" type file, and then it was broadcast to the ticker wire.
  • Clearing: Handled offline at end off day. We literally made a tape to walk over to an IBM mainframe on the other side of the data center.
Quotes and Executions were running on an ancient Honeywell running GCOS with a terribly hacked OS. We wrote directly to the disk driver, bypassing the OS entirely because it was too slow. There were extreme requirements on reporting. I remember one market maker bitching because some arbitrager got a quote 3 seconds before he did, and lost money. Business rules were embedded in the code so deeply that noone new what they all were. I spent two weeks documenting just equity quotations, and when I was done, it was the first time that it had been documented clearly, anywhere, in years. (Recall that PHLX is reportedly the oldest active stock exchange in the country. 😉

The quotation and trading system didn't have to hold a lot of information in memory: just the current states for all stocks. When a new quote or trade hit, it was logged and posted to a ticker. Oddly enough, we posted to SIAC (NY) and our own wire, then had code to filter out the trades when they came back over the NY wire. There were business rules about how far a quote could move (e.g. no more than a 1/4 point at a time on a downtick after an uptick, but maybe more after another downtick), and what constituted a valid trade.

The ordering system was handled by another group, so I have less experience with it. As far as my system was concerned, I fed it a ticker, and it sent back trades just like a broker terminal. But you can see how you eliminate some "networks" of triggers. The ordering system just responded to one message at a time per process thread as fast as messages were broadcast. There were some rules about how you could trade against a quote for a certain length of time (seconds), so you attached the quote to the trade order (and then the execution). That gave the system some flexibility in executing trades in a volatile market. Also, other entities could run their own automated ordering system if they liked, and they just posted trades to the execution system.

Clearing was scary, with big heavy IBM iron doing dark and mysterious things by people who wrote code in Cobol or JCL. They had impact on margin calls, and if things did not clear promptly or properly all sorts of expensive things could happen. Apparently PHLX nearly went bankrupt overnight (before my time there) because of a fouled tape. The exchange was financially responsible for losses it caused, and that night, if things hadn't recovered, it would have been millions.

13: TclCLIPS (response to 1)
Posted by Andrew Piskorski on
FYI, I just noticed that TclCLIPS provides a Tcl interface to the public domain CLIPS expert system software, originally developed at NASA back in 1985. Among other things, CLIPS implements a "forward chaining rule language based on the Rete algorithm", and is supposed to be easily embeddable into other applications. It is also available as a Debian package. Various commercial systems like CLIPS/R2 seem to have been forked from this codebase at some point.

CLIPS was intended was as a toolkit for building expert systems, and unfortunately, nothing I saw in the docs or FAQ mentioned integration with RDBMSs like Postgres. (There has definitely been research work into that sort of thing elsewhere, though.) It is also unclear how applicable CLIPS would be to non expert system production systems, like Jerry's large scale automated online exchange that started off this thread. Question 24 in the CLIPS FAQ, "Can CLIPS be used for real time applications?", implies that it may not scale to such demanding service loads, but mentions only the underlying Rete algorithm itself as a possibly bottleneck, so perhaps updating CLIPS to use newer algorithms like LEAPS would address any such problems.

Posted by Andrew Piskorski on
So Jerry, it's been about nine months now since you started this thread, did anything come of the project which originally spurred your questions into how to build large scale exchanges? Learn anything interesting?
Posted by Jerry Asher on
Well, it was a great project, fun, challenging, and interesting. I was in slightly over my head and learned a great deal. I watched it grow and its requirements grow and actually its budget grow and the marketing folks told us that on day one there would be so many people showing up the continent was likely to become unbalanced and tip over. That was sort of a bad forshadowing.... And so like so many other projects (software has a something like 50% failure rate) it imploded and stopped on account of war, or so they said. We withdrew with honor.

Some details that probably won't help you on your current project but seemed interesting at the time. We were going to build two completely redundant sites hooked up with many and redundant T1s and VPNs -- the T1s were expected to have the least latency, but the VPNs were expected to be the most reliable as they are routable and supposed self healing.

IBM has a product like this for their Z servers, but it only works over about a 40 mile distance -- uh, that's good, they literally sell that as letting you have your main system in Manhattan and a backup in New Jersey. But it was too close for our needs (earthquake, hurricane, wars can still get to both sites.) So we had a client requirement to separate the sites by at least a thousand miles and an international border.

Over the course of about six months, while splashing their names all over the papers, Worldcom and Qwest's T1 prices dropped at least 70%. Soon they were so low I just wanted to buy them a case at a time and hand them out as office prices. T1s stretching across an ocean at a couple hundred a month.

We had tons of servers to offload and partition the problem (for instance we had two redundant machines whose only job was to be the syslog servers for everyone -- we felt that would make it harder for any potential hacker to eliminate his tracks and the cost for that is basically zero) but the system was actually fairly simple and not all that expensive until we started having SANs and the likes tossed at us to get the disk performance we thought we would need in order to satisfy the marketing projections.

Ignoring all the other servers and HSRP routers in the picture, with two completely redundant sites each with two database servers, and with one database server designated primary and the others receiving constant updates and keeping everything in sync, and with our entire database able to be kept in memory at one time, we eventually decided to throw away the disk drives. Our model could fit into memory, we didn't feel we would have a problem with locks, and we replicated the requests to the hot stand by machines. And it turned out that more than one productized real time database engine does the same thing. Run it without ACID on disk and it ticks over at 50K transactions a second on good but not great hardware and then run it with ACID and disks and it comes down to a few hundred transactions a second.

It was a real choice project, and when it was over, I'd never want another (well not for a couple of weeks.)

So we tossed out the SAN and then went the RAIDs and basically we were planning on building a specialized application engine that kept sync locally with his buddy and kept sync remotely over the VPN and T1s. And by paying enough for high availability hardware we felt our replication streams would do the job for us and give us the five nines we sough. And sticking on our business hats, we realized that well, in our domain, if worse came to worse during a failover and someone's transaction was lost (and we didn't see how that could be), that that was actually a cost that occurred infrequently enough to be covered in customer support's budget. I don't think that's a total copout, I actually think that given the right domain, it's a reasonable line item just as shrinkage is.

Now our lead enginer, he was brilliant and outstanding in every way and he was a good man too. Humanitarian man, man of wit, of humor. We worshiped the man, like a god, and followed every order however ridiculous.

October 9th, 0430 hours, sector 13420, head 15: I watched our server crawl along the edge of a straight razor. That's my dream. That's my nightmare. Crawling, slithering, along the edge of a straight razor, and surviving.
So I'd like to say this worked and the continent did tip over, but what I think happened is that someone realized that our methods, marketing and engineering had become unsound and very obviously insane.

They, we, were out there operating without any decent restraint. Totally beyond the pale of any acceptable engineering and business conduct. And he was still in the field commanding his troops.

And so the project was terminated.

"PBR Street Gang, this is Almighty, over...
This is Almighty, standing by, over.
This is Almighty, how do you copy, over...
"The horror. The horror..."
Posted by Andrew Piskorski on

What about using solid state disks? You know, those devises that look like disks to the OS but are actually battery backed RAM. Wouldn't that have let you keep the Durability part of ACID, and yet maintain the much-faster-than-disks transaction rates?

So did the design end up using Rete nets in any way?

What real time database were you using, and how did you like it?

Posted by Jerry Asher on
Basically, we started off with such generic application requirements it looked like we needed an RDBMS system.  But as we dug into it, there were only a half dozen or so transactions in our system, and they were relatively simple (think about two to five select statements, two or three types of update/inserts, and no deletes (no real time deletes)) that we went more for an object model and not at RDBM model.  Queries were things like: Find stock named foo.  Find matching buy/sell order.  Inserts were: insert new stock.  Insert new buy/sell order.  Updates were: modify this order (add seller and quantity to list, add buyer and quantity to list, modify total quantity).  With such a few queries, we hardcoded the queries as C, and built the data structures around them.

We designed this with fixed capacities in place (large capacity, but unchanging without restarting the application.)  Like a reservoir, it has a fixed capacity or else it overflows.  It was required to have a huge capacity, but it was fixed, and knowing that number let us determine how large to create and pre-allocate various tables that an RDBMS would not demand you create or pre-allocate.

We had A, C, I in memory and hard-coded in the code, and the D was to bubble up from the rest of the system we would put in place to meet the required system uptime.  I believe this is analogous to how many OODBM systems can work for you.

In a non real-time basis, as system demand permitted, sweeps through our object model were to stream into a traditional database to permit analysis and reporting and things like that and to clean up memory.

18: Re: TclCLIPS (response to 13)
Posted by Andrew Piskorski on
On his website, Frank Lopez says he is releasing the source to his OPS-2000 forward and backward chaining rule-based language, but so far only some docs are available, plus a note saying that the source code "will soon be here".