I’d planned this blog entry to be about the changes I’m making to the myc4d customer portal but over the past few days it has become clear that there is quite a lot of interest in our new kbmMW transports that are in development. To save making the same comments over and over on the newsgroups I’ve decided to blog about what we’re up to and take the wraps off them instead.
There will be a lot of detail covered, but some questions will remain unanswered for the simple reason that I do not know the answers yet :)
I’m going to do this in a semi-Q&A style to get my headings out. Have fun!
What is the new kbmMW transport?
Let’s kill this one off straight away. The new transport is not a single new transport. It is a set of transports built on top of our own socket layer, removing the need to use third-party communication stacks such as Indy or Synapse.
What is the motivation to build our own communications stack?
The motivation behind the project was driven by a variety of sources.
- There is growing discontent in our newsgroups about using the Indy components as the underlying communications mechanism. Reasons given vary but within C4D we get frustrated with breaking changes made to the source code over which we have no control. This is an issue when integrating with any third party but it seems particularly true of Indy between simple point releases. kbmMW is only using a relatively small set of Indy APIs but we get repeatedly asked how to get the framework running with Indy version X, or Y, or the latest development snapshot etc. We have transports based on Synapse and have never needed to update these.
- There were requests being made for us to supply messaging transports based on Synapse, motivated partly by the above issues with Indy.
- The current transports have scalability limits.
Are you saying kbmMW is not scalable?
I should expand on the last point. kbmMW is a hugely scalable framework. I am not implying anything to the contrary with the last statement. A single server will scale up to many hundreds of users without problem. This is great for most applications that use the traditional request response operational model – your typical CMS application etc.
When we introduced messaging the framework took on a paradigm shift. To people building standard applications this will go largely unnoticed but internally the framework behaviour has changed to work in terms of discrete messages. When applications are built using the messaging transports developers can still call services just like they did before. But the way these calls are expressed is different. A request is simply a message sent from a client with the first part of its subject being REQ. REQ stands for request funnily enough. The framework takes this message and decodes it into a call to a certain service/method pair and processes the request. Results from the request are sent back to the caller using another message with the subject RES. RES stands for response. I’m digressing a little but the point is people can use the messaging transports without losing any existing functionality. In fact it opens up huge possibilities when we add intelligent routing to these REQ messages. Messaging is incredibly powerful!
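To make the REQ/RES idea a little more concrete, here is a minimal sketch in Python. It is not kbmMW code. The dotted subject layout (REQ.service.method), the Message class, the SERVICES registry and the handle function are all assumptions made up for this illustration; the text above only says that the subject starts with REQ and that the reply uses RES.

```python
from dataclasses import dataclass


@dataclass
class Message:
    subject: str
    body: object


# Hypothetical service registry: (service, method) -> callable.
SERVICES = {
    ("MATH", "ADD"): lambda args: sum(args),
}


def handle(msg: Message) -> Message:
    """Decode a REQ message into a service/method call and wrap the result in a RES message."""
    # Assumed subject layout: "REQ.<service>.<method>"
    kind, service, method = msg.subject.split(".", 2)
    if kind != "REQ":
        raise ValueError(f"not a request: {msg.subject}")
    result = SERVICES[(service, method)](msg.body)
    return Message(subject=f"RES.{service}.{method}", body=result)


reply = handle(Message("REQ.MATH.ADD", [1, 2, 3]))
```

Because the request is just a message with a subject, anything that can read the subject could in principle route it, which is where the intelligent routing mentioned above would come in.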
What else can people gain from using messaging then?
To use a cliché, access to a whole new world of applications where discrete message passing is the key. I have been involved in a project requiring the distribution of real-time information to huge numbers of clients or potential clients. The majority of the time the clients may be doing nothing other than monitoring some statistical feeds. The amount of data they receive is small but there are a lot of them, potentially millions. The question is how to support millions of client connections to the kbmMW messaging based WIB?
It is time for another digression – why can’t we use the Indy messaging transport? To answer this question I need to explain how the Indy, Synapse and DXSocks TCPIP based transports work. The focus of this is the server or hub in messaging terms.
Each client connects to the hub using a TCPIP socket. Data is sent from the client and transferred by the TCPIP stack over to the server where it is presented as a memory buffer to be read. But how do we know there is data ready for us to read? Simple – we call various APIs to tell us if there is a buffer waiting for us. That is the problem. Read it again, “We call various APIs to tell us if there is a buffer waiting for us”. We have to poll the status of the socket. To do this the usual arrangement is to have a thread executing in a loop polling the socket.
Here is the code pattern for the server transports.
    While the client is connected
        Test socket for data
        If data is present
            Process data
        Else
            Sleep for a moment
In the Indy transport, Indy's own source implements the loop. For Synapse, which is API based, we implement it ourselves.
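The polling pattern above can be sketched as runnable code. This is a Python illustration only, not the Indy or Synapse implementation. The serve_client function, the 10 ms sleep and the use of socket.socketpair to stand in for a real client connection are assumptions for the sake of a self-contained example.

```python
import select
import socket
import threading
import time


def serve_client(conn: socket.socket, received: list) -> None:
    """One thread per client: poll the socket, process data, otherwise sleep."""
    with conn:
        while True:                                            # while the client is connected
            readable, _, _ = select.select([conn], [], [], 0)  # test socket for data (no wait)
            if readable:
                data = conn.recv(4096)
                if not data:                                   # empty read: client disconnected
                    break
                received.append(data)                          # "process data"
            else:
                time.sleep(0.01)                               # sleep for a moment


# A connected socket pair stands in for a client and its server-side socket.
client, server_side = socket.socketpair()
received = []
worker = threading.Thread(target=serve_client, args=(server_side, received))
worker.start()
client.sendall(b"hello")
client.close()
worker.join()
```

Note that the thread exists, and keeps waking up to poll, for as long as the client stays connected, whether or not the client ever sends anything.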
A quick quiz!
Q. How many threads does it take to service 100 clients?
A. 101 (There is also a separate thread responsible for listening for new connections and creating the new socket/thread pair).
Q. How many threads does it take to service 500 clients?
A. 501
Q. How many threads does it take to service 10000 clients?
A. Too many!
As more and more threads are created there is more and more contention for CPU resource. This takes the form of expensive thread context switches and there comes a point where the majority of CPU time is spent simply switching from thread to thread. The official description is thread thrashing. Throughput of the server decays rapidly.
How do we avoid thread thrashing?
The answer is relatively obvious: use fewer threads! One approach that can be used is to have each thread handling more than one socket, but this is still essentially a polling solution. We need something else.
Welcome to IO Completion Ports
IO completion ports were introduced in Windows NT 3.5 and are a complete paradigm shift. Instead of polling sockets for data, we get the TCPIP stack to notify us when new data has arrived. It sounds obvious and simple. Of course it doesn’t work out to be quite that simple.
The goal in any server design should be to incur as few context switches as possible by avoiding threads being unnecessarily blocked, whilst at the same time maximising parallelism using multiple threads. The ideal situation is to have a thread actively servicing a client request on every processor and for those threads not to block in the event that additional requests are waiting when they complete their current request. For this to work there must be a way for the application to activate another thread when others are busy processing another I/O operation.
Windows NT 3.5 introduced a set of APIs based on something called the completion port. Applications can associate a completion port with a TCPIP socket. Data coming in off the socket results in a completion packet being queued to the port for processing. A set of threads associated with the port read off the data packet and process it.
When a completion port is created we specify a concurrency value. This is the maximum number of threads that can be actively processing data packets queued to the port at any one time. The application creates the worker threads, and the Windows kernel decides which of them are released to run. The aim is to have one thread active at any given time per processor. A typical rule of thumb for the concurrency value is 2 times the number of CPU cores. At this time we have an open mind on this until we perform some more scalability tests. The Windows scheduler attempts to reduce context switches by selecting the same thread to process request N + 1 after processing request N. This allows CPUs to be utilised to near their full capacity.
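The shape of this model can be sketched without touching the Windows APIs at all. The Python below is only an analogy: a queue.Queue stands in for the completion port and a fixed set of threads stands in for the IO workers. A real implementation would use CreateIoCompletionPort and GetQueuedCompletionStatus. The names CONCURRENCY, completion_queue and io_worker are invented for this sketch.

```python
import queue
import threading

CONCURRENCY = 8                      # rule of thumb from the text: 2 x 4 cores
completion_queue = queue.Queue()     # stands in for the completion port
handled = []                         # (thread name, client id) pairs
handled_lock = threading.Lock()


def io_worker() -> None:
    """Block until a completion packet arrives, then process it."""
    while True:
        packet = completion_queue.get()      # blocks; no polling, no sleeping
        if packet is None:                   # shutdown sentinel
            break
        client_id, data = packet
        with handled_lock:
            handled.append((threading.current_thread().name, client_id))


workers = [threading.Thread(target=io_worker, name=f"io-{i}") for i in range(CONCURRENCY)]
for w in workers:
    w.start()

# 10000 clients each deliver one small packet; no thread is created per client.
for client_id in range(10_000):
    completion_queue.put((client_id, b"tick"))
for _ in workers:
    completion_queue.put(None)               # one sentinel per worker
for w in workers:
    w.join()
```

The point of the sketch is that the number of threads is fixed by the server, not by the number of connected clients. What it cannot show is the kernel-side scheduling that a real completion port provides.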
Running the quick quiz again!
For a 4 CPU server
Q. How many IO worker threads does it take to service 100 clients?
A. 8 using rule of thumb
Q. How many threads does it take to service 500 clients?
A. 8 using rule of thumb
Q. How many threads does it take to service 10000 clients?
A. 8 using rule of thumb
What a result!
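For anyone who wants to play with the numbers, the two quizzes boil down to the following arithmetic. The function names are made up for this sketch, and the factor of 2 is just the rule of thumb quoted above, which may change once scalability testing is done.

```python
def polling_threads(clients: int) -> int:
    """Polling model: one thread per client plus one listener thread."""
    return clients + 1


def iocp_worker_threads(cpu_cores: int, factor: int = 2) -> int:
    """Completion port model: rule of thumb of factor x CPU cores, independent of client count."""
    return factor * cpu_cores


for clients in (100, 500, 10_000):
    print(clients, polling_threads(clients), iocp_worker_threads(4))
```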
It’s not just about threads.
We have a mechanism to reduce the thread count and avoid thread thrashing by using our completion port. But there is another effect we can overcome by having our own socket layer.
As competent and complete as they are, Indy and Synapse are complete communication stacks. By this I mean they present a nice uniform view of the data coming off the socket. When we detect in our listener thread that there is data on the socket, we are presented with the entire stream of data sent over from the client. The buffer size can range from a few bytes to multi-megabytes. kbmMW must then take a copy of that buffer in order to safely work with it. That is a copy operation.
IOCP works slightly differently. It is not just the IO worker threads that are associated with the completion port; the socket is also using a special data structure of type WSAOverlapped. We can see from the diagram above that somehow data passes through the completion port from the TCPIP stack and gets processed by the IO worker thread. The data passes through by being copied to memory buffers that we supply. We supply the completion port with our memory buffers by posting them to the completion port using the WSAOverlapped structure associated with the socket. What this means is that the IO worker threads can copy data into our buffers straight off the TCPIP stack without needing a context switch. If we supply enough buffers to the completion port before the client request comes in, then the whole request can be copied to buffers without a context switch. In addition if we’re processing a request in the framework and another request (read message actually) comes in from the client on that socket another IO thread can start processing it in parallel.
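The idea of handing our own buffers over before the data arrives can be illustrated in Python with recv_into, which writes straight into a buffer we supply. This is only an analogy for posting buffers with overlapped Winsock calls such as WSARecv, and it is a blocking call rather than a true overlapped one. The pool size, buffer size and receive_into_pool function are assumptions for this sketch.

```python
import socket

BUFFER_SIZE = 4096
# Buffers are allocated up front, before any request arrives.
buffer_pool = [bytearray(BUFFER_SIZE) for _ in range(4)]


def receive_into_pool(conn: socket.socket, pool: list) -> list:
    """Fill pre-allocated buffers directly from the socket until the peer closes."""
    filled = []
    for buf in pool:
        count = conn.recv_into(buf)                # data lands in our buffer
        if count == 0:                             # peer closed the connection
            break
        filled.append(memoryview(buf)[:count])     # a view onto the buffer, not a copy
    return filled


# A connected socket pair stands in for a client and its server-side socket.
client, server_side = socket.socketpair()
client.sendall(b"REQ payload")
client.close()
chunks = receive_into_pool(server_side, buffer_pool)
server_side.close()
```

If the pool runs out before the client has finished sending, the remaining data simply waits on the socket, which is the Python-sized version of why managing the number of posted buffers matters.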
What does this mean? It means that we are removing the large buffer copy of the complete stream that we have to perform for Indy/Synapse. This will boost performance. In addition, if we manage the number of buffers we post to the completion port well, we can allow the IO worker thread to pull data off the TCPIP stack unhindered. It is worth remembering that the IO worker threads are now running in kernel mode, not user mode. In essence, kbmMW is moving closer to the TCPIP stack with the removal of Indy/Synapse.
I’ve been writing for a while now and need to get back to some coding. There will be some further blog entries on IOCP because there is a lot more to talk about.
Keep well
Richard