Archive for the ‘Concurrent programming’ Category

To SLEE or not to SLEE?

July 5, 2012 Leave a comment

I was working for a company which is making a product based on JAIN SLEE specification. I’ve developed several protocol stacks, several protocol frameworks and applications. Sometimes I’ve even studied code of SLEE implementation. However, I should admit that I had never read a specification myself. Most of my former collegues too. It is big and boring. I always had an impression that it is based on some very simple techinal ideas, and whole complexity is artificial. I felt that this specification was making life of developers harder, while it should have made it simpler. However, I understood that producing a JAIN SLEE was more a political and market-oriented decision.

A widespread approach to systems design is to compose solutions from components with well-defined interfaces. This means that each component should not know any details about other components it interacts with. Unfortunatelly a component design also often means that implementation difficulties of each component are considered to be specific to this component, and should be solved on component level. Architects are tasked in defining functional requirements to components and often don’t care about implementation difficulties. Sometimes implementations could be very much simplified if someone knows internals of communicating components. This problem is called “a cost of abstraction”. Since I have an advantage of understanding both whole picture and lots of individual parts, I’ve making this article. I claim that JAIN SLEE is a too complex solution to simple technical problems, and complexity makes it hard for people to see these simple technical problems. I claim that simple technical problems should have simple technical solutions. I also claim that SLEE is simply not enough.

The idea of this article is to show that JAIN SLEE should not be a framework or application server. Instead it should be an implementation principle of application server, covering all protocol stacks and protocol frameworks. Instead of being a layer of a solution stack it should be an approach applied to all layers of a solution stack, thus fixing problems for a whole solution.

The article is big and is split into several parts. First part is an introduction in multithreaded design. It explains how threads should be used. Second part compares synchronous and asynchronous protocol APIs. Third part introduces SLEE by combining ideas from first and second parts. Fourth part explains layered design to networking software and shows that protocol layers suffer from same problems as applications. Fifth part explains a new solution which is better than JAIN SLEE. Sixth part explains why JAIN SLEE became such a monster.

JAIN SLEE is implemented in Java, but most of the things I’ll say are not specific to Java, and applicable to any networking systems. All examples are written in Java-like pseudocode.


Networking in Java: non-blocking NIO, blocking NIO and IO

January 29, 2009 4 comments

Standard run-time library in Java provides two interfaces for networking. One, which exists in Java since the beginning, is called “basic IO”, because it is based on generic framework of  “input streams” and “output streams” defined in package “”. Sun did a good thing by providing uniform way for accessing files and sockets, following a Unix philosophy. However, there are some drawbacks in stream-based access, so Sun created another set of interfaces located in “java.nio” package. This package also provides uniform access to files and sockets, and is much more flexible than basic IO.

Main problem with basic IO was scalability to number of connections. Operation read() will block until some data will become available. It is not a problem if your program accesses files, because file operations never block for a long time. You are just reading the data until you’ll reach the end of file. Reading after the end of file will immediatelly return with “-1” bytes read. Another good thing is most programs usually access quite small amount of files. In other words, when working with files it is data who is waiting for program to process it, while program can decide what size of internal buffer to use for processing.

But with networking a model of basic IO is not so convenient. First, read() operation may block an execution thread for a long time. This means that to handle several connections simultaneously you’ll need as many threads as the amount of incoming connections you have. There is a small thing which can help you not to block forewer: you can specify a timeout for socket operations. But it will not solve a scalability problem.

Another problem is related to “message”-based structure of most protocols. Often you don’t know how much data you’ll receive. So, you have to organize your code in a special way:

  • Always read data by one byte, then assemble data array from those bytes. Code is simple, but slow.
  • Read one byte first, then use available() method to determine if there are more data to read. If there are, then read remaining data using bulk operation. Code is more complex, but faster then previous way.

NIO helps you to deal with both these problems. I’ll explain them in a way which seems to me most logical.

First, NIO introduces “Buffers” which are used to combine data and information used to process it. There are also “Channels” which can read into buffers and write from buffers.

To simplify your “basic IO” code you can just call Channels.newChannel() method for your input stream. The resulting channel will implement read() operation which will either block fill provided ByteBuffer with data and moving position to a place right after last byte. This makes code much more simple.

You can avoid wrapping by creating SocketChannel directly. This will get you almost the same result. It is called “blocking NIO”, and I strongly advise using it in simple cases, when thread blocking is not a problem for you.

The only difference between “blocking NIO” and “NIO wrapped around IO” is that you can’t use socket timeout with SocketChannels. Why ? Read a javadoc for setSocketTimeout(). It says that this timeout is used only by streams. However, you can use a trick to make it working:

SocketChannel socketChannel;


InputStream inStream = socketChannel.socket().getInputStream();

ReadableByteChannel wrappedChannel = Channels.newChannel(inStream);

In this example, reading from socketChannel directly will not be interrupted by timeout, but reading from wrappedChannel will be. To find out why it is so, you can take a look inside Java RT library. Socket timeout is used by OS-specific implementation of SocketInputStream, but is is not used by OS-specific implementation of SocketChannel.

However, NIO has much better things to solve a scalability problem. First, you can put a channel into non-blocking mode. This means that read() operation will return immediatelly if there are no data to read. Thus, you can create a single thread which will check all SocketChannels in cycle and read a data if it is available.

Having a single thread is nice, but if it will spin around read() operation it will waste lots of CPU cycles. To help with the performance NIO has a class called “Selector” which WILL block on non-blocking channels. The difference is that it can monitor any amount of channels, resuming execution when at least one of those channels has some readable data. This idea was copied from Unix, but with one big flaw: Selector can use only non-blocking channels.

I don’t know why Sun has introduced this limitation. This article focuses on reading, but both basic IO and NIO also support writing. Since a blocing/non-blocking mode applies both to read and write directions simultaneously, then usage of Selector makes connect() and write() operations more complex. Anyway, it is the only way to have only one thread reading from several network connections.

Let’s finish for today. It’s quite easy to understand what to use. If scalability is an issue, then use “non-blocking NIO”. Otherwise, use “blocking NIO” with thread per connection. You can make those threads as daemons so they will not prevent application from termination when all other threads will stop. Another way to stop those threads is to close channels they are reading from. This will cause a read operation to interrupt with exception.

I hope I’ve shown that NIO is simple. So, don’t use NIO frameworks. They are bad.

Explanation of concurrent programming. Part 3: implementation

April 11, 2008 Leave a comment

At this moment we have covered some of the theory of concurrent programming, and even given some examples of it. Now let’s discuss how does this work inside.

We’ll start from scheduler. There are two common approaches: cooperative tasks, which give away control themselves, and pre-emptive task switching, where scheduler has a power to stop execution of any task at any moment. For interpreted language a scheduler is easier to implement as a part of interpreter: it can just count instructions and switch to another task then number of processed instructions reached some threshold. Making scheduler a part of interpteter means that scheduler works on lower abstraction level (and on higher capability level) then logic of tasks. For tasks which are executed as machine code a scheduler is also implemented in machine code, because there are no lower abstraction level available. When scheduler and tasks work at same abstraction layer, then switching happens in following way: at some point an execution should be interrupted and transferred to scheduler itself, and scheduler will run another task.

Usually this is achieved using build-in facility of CPU for called “interrupt”. Originally this facility was designed for on-demand processing for events on external devices. Interruption takes place when device sends electrical signal to CPU. Since one of those devices could be a timer, interrupt facility could be used for interrupting a task. So, even a scheduler is programmed in machine code, an interruption facility is a part of CPU, so it is on lower abstraction level.

Low-level scheduler for CPU-native tasks is usually a part of OS. Main reason is that OSes already use interruption facility of CPU for I/O operations, so giving to a user program an ability to use (and misuse) interruption facility whould be dangerous. Another reason is that OSes follow a design philosophy that all “system-level” code should be a part of an OS. As a side effect of including a scheduler, OS itself takes advantage of multi-tasking by splitting itself into several tasks.

Now let’s talk about task interaction facilities. Message-passing facilities are usually a part of OS, because it requires memory manipulations, which is usually done  by OS. However, it is not nesessary.

We alrady explained that If tasks use unprotected shared memory, then they should use some facilities to ensure that shared data are consistent.

Some shared objects of small size could be realized without any facility at all, by using “atomic” operations of CPU. To understand the importance of atomic operations, let’s use classical “array traversal” code as an example:

int[] a;  // shared
int i = 0; // shared
for (i = 0; i < size; i++) { /* do something with a[i] */ }

This code works well in single-tasking environment, but if we want to run several instances of this loop simultaneously, we’ll get in trouble. This code doesn’t protect several tasks from processing same array element several times. Let’s try to modify this code:

int[] a; // shared
int i = 0; // shared
for (i = 0; i++ < size; ) { /* do something with a[i-1]*/ }

Here we have tried to increase an index just after check so other instance of task will process next element. However, it will not always work as expected, because check and increase are two separate operations. The following sequence is possible: taskA checks i, task B checks i, taskA increases i, taskB increases i. To solve this problem, we need atomic “check and increase” operation. Modern CPUs provide such operations. Using them is a recommended way for small objects, because such operations are fast and don’t depend on OS.

Achieving atomic behaviour for larger objects is easy once we have atomic operations for small objects. We just need a small “flag” for each large object. Task which modifies an object should do those four steps:

  1. Check if flag is in “ready” state. Repeat until this condition is satisfied.
  2. If yes, then switch flag into “object is being modified” state. Note that checking and switching should be performed as single atomic operation.
  3. Read/modify change large object as needed.
  4. Finally put flag back into “ready” state.

This flag is usually called “the lock”, because it controls access to some other object. It’s states are called “locked” and “unlocked”, and operations are called “try to lock” and “unlock”.

In lots of multi-tasking implementations locks are used for ad-hoc critical sections. If task is unable to lock an object because it was locked by another task, then it voluntarily “spins around” the lock repeatedly trying to lock it. So, the task which has locked an object is virtually uninterruptible: scheduler will switch between tasks as usual, but other tasks virtually do nothing, they just spin around the lock.

To summarize abovementioned: usage of unprotected shared memory is possible without ineraction with scheduler, so without any OS-related facility. But sometimes using an OS could be good.

SInce spinning around a lock is just a waste of time, it is a good idea to put all waiting tasks to sleeping, and wake them up then object will be unlocked. A usual approach is to have a list of pending tasks for each lock. If object is locked, then current task is added to the list, and scheduler is asked to put this task asleep. Then lock is released, then the scheduler is asked to wake up all pending tasks. Since interaction with scheduler is nesessary, such “blocking locks” should be implemented as part of an OS.

An ability to wake up a particular sleeping task is useful regardless locks, so it is also available through OS.

Let’s summarize it all. A cornerstone of multi-tasking implementation is a scheduler, which is usually a part of an OS. Simplest form of inter-task communication is through unprotected shared memory. Data consistence could be achieved by using atomic operations or spinlocks without dealing with scheduler. If objects are locked for a long time, it is a good idea to use blocking locks, which are available from OS.

Further reading:

Explanation of concurrent programming. Part 2: examples

April 4, 2008 Leave a comment

In the previous part we have covered a lot of theory, so in this part we will take a look at real-world examples of concurrent environments and will try to see how well they will fit with our theory.

Classical Unix environment could be considered as “message-passing”-class system. Each task is represented by “process”. All processes have access only to their local data. OS even protects memory of the process from being physically accessed from other processes. Processes can communicate through files: one process can write into a file, and another process can read from it. There are even special kinds of files, called “pipes”, which are designed specially for inter-process communication: if there are no data in pipe then reader will block until data will arrive or pipe will be closed. Unix guarantees that writing less then certain number of bytes into a file will happen atomically. There is an easy solution for this problem: you can have two files, one for data, and another for signalling. Subsequent versions of Unix-like OSes have also another message-passing facility called “message queue”.

A “semaphore” is a good example of guarded shared memory. However, it’s capacity is limited to single integer. More than that, it interacts with scheduler, so some operations like “get” can block the process.

Another well-known concurrent framework is called “POSIX threads”. It specifies that tasks have all their data memory shared, but each task has an isolated stack. For languages like C and C++ this means that “automatic” variables, which are stored on stack, will always be thread-local, but “global” and “heap” variables will be shared. This framework could be classified as “shared memory with critical code sections”. Critical section is started then thread “grabs” an object called “mutex”, and ends then thread “releases” this mutex. While grabbed by one thread, mutex cannot be grabbed by any other threads. There are two kinds of “grab” operation: “blocking” which puts thread asleep if mutex could not be grabbed, and “non-blocking” which returns error if mutex could not be grabbed. There is also a facility called condition variable. It is useful then thread doesn’t have any work to do, and wants to go asleep, but it doesn’t know when to wake up, so it will be woken up by another thread.

Threads in Java are very much like POSIX threads. The only difference is that it is implemented not as library, but as a part of language itself.

Explanation of concurrent programming. Part 1: theory

April 3, 2008 Leave a comment

I’ll start an introduction to concurrent programming with example. Imagine a real-time multiplayer strategy game. Lots of things are going on simultaneously: your units move to destinated point, AI-controlled units also move, animated scenery is drawn, number of harvested resources increases, player’s mouse pointer moves across the screen. How is this program written ? Some time ago there was an approach to program’s structure called “main loop”: program first checks if mouse was moved and updates the pointer, then AI analyzes the situation and takes decisions, then all units move according to their orders, then screen update takes place, then again from the beginning. Several tasks are executed sequentally in loop. Loop itself is quite simple, main difficulty is to implement tasks themselves. They must be short, because otherwise it will be noticable that those actions happen sequentally. All tasks are coded along the same pattern: restore a context (like last mouse position, last drawn frame of unit, last amount of resource), perform action, which usually changes the context, then store a context back. Temporary data which were generated during execution of task are just thrown away. Writing of such games is not an easy task, because developer should remember what should be in context, and how to store/restore it. Making task too short is also bad: context will be bigger to hold data which otherwise would be temporary. For example, it is a good idea to have whole unit sprite drawn in one cycle and just increase unit’s frame number, otherwise it will be nesessary to remember screen position of last drawn pixel.

A concurrent programming is an attempt to perform storing/restoring of context automatic and transparent. Instead of having specific context for each task, some low-level “execution context” is introduced. Main loop is replaced with “scheduler”, which works on time basis and switches tasks. So, each task has a “slice” of CPU’s time for execution. Main benefit of concurrent programming is a transparency of context switching: code becomes much simpler.

Another benefit of concurrent programming is that it works transparently on several CPUs. Tasks are really executed in parallel, and automatic context switching allows to continue task execution on another CPU.

Introduction of concurrent programming rised several interesting issues. One of them is a consistence of common data. In sequental programming, task is continuous and could not be interrupted, so it can do any actions as long as it will leave a shared data in context in consistent state upon exit. In concurrent programming, tasks could be interrupted at any time, possibly leaving data in broken state.

Let’s stop at this point, and summarize that was already said. Concurrent programming is a solution for certain kind of programs. It works by moving a context outside of program’s scope. If program is compiled into machine language, then context is located at hardware level. Concurrent program works like it has several sequental tasks which are execuded in parallel by time-slicing single CPUs. There is an entity called “thread”, which is this sequential executor of some task. At this point a “thread” should look like as completelly artifitial entity, but there are reasons for it’s existence, and they will be revealed soon. Some tasks are completelly independent, some tasks work together, so there should be means for task isolation and for task interaction. And there is an entity called “scheduler” which chooses what task (or tasks, in multi-CPU machines) should be executed at each moment. 

If tasks are completelly isolated, then we just have a single machine which works as several independent machines. For example, if we will launch several inet server daemons (for FTP, HTTP, e-mail protocols), then we just have a single machine which is totally equivalent to several machines, each running a single daemon. There are absolutelly nothing which should be done to convert simple sequental task into “isolated task”. However, some actions are possible for isolated tasks to take benefit of concurrent environment.

Tasks are isolated from each other, but they can interact with scheduler. For example, if task is sure that there will be no work for it for some time, it can ask scheduler to “put her asleep” and use CPU for other tasks. Another possibility is to modify task’s “priority”, which defines how much of CPU should be given to this task compared to other tasks. All these actions use “thread” entity to indentify a particular task. So, “thread” is an indentification entity used then task should interact with scheduler, so scheduler can work with tasks as abstractions.

Now let’s discuss interaction between tasks. There are two known approaches. First is to have all task’s data isolated from other tasks, and have special facility for ineraction between tasks. For example, in a computer game a task which tracks mouse state can detect that mouse was moved, then it can calculate a rectangle between old and new position, then it can signal to task which paints on the screen to re-draw this rectangle because mouse pointer should be drawn in different place. Such approach is called “message passing”, because tasks pass messages between each other. Then some task sends a message, then delivery facility copies all the supplied data from local memory of sending task into internal buffer. Receiving task can copy data from facility’s buffer into it’s local memory.

This approach has shown us that task interaction adds a new entity into whole picture. Except tasks and scheduler we need a message delivery facility. But, should tasks and scheduler change in order to make interaction possible? No, it is not nesessary, but it could be very useful. For example, instead of having receving task to poll a facility in a loop waiting for a message, it is possible to make scheduler “to put asleep” a task by not scheduling it for execution, so CPU time could be used by other tasks. Programmers say that “non-blocking” receive operation is turned into “blocking” operation, which allows to avoid “spinning” around it.

This approach solves data integrity problem by copying data two times. When data has been copied into internal buffer, nobody can change them. Access to buffer is available only then “send” operation has been completed.

Message passing is somehow similar to manual multi-tasking described at the beginning. Instead of context store/restore there are send/receive operations. However, this approach doesn’t restrict “task granularity”: tasks can send messages when they want, which is very convenient.

Now let’s duscuss a second approach to task interaction, called “shared memory”. Tasks still have some local data which are isolated to access from other tasks, and there are some common data, which could be accessed by all tasks simultaneously. For example, to signal a need of screen refresh, one task can set a common flag, and screen painter task will detect this and re-draw a screen.

Let’s investigate if this approach requires some special facility or not. First, a special facility may (or may not) be needed to define which memory is shared and which is not. Second, an access to shared memory may happen through some facility. Both of these statements use word “may”, meaning that “it is an implementation detail, not required by theory”.

The reason of possible data inconsistence is the uncertainty of order of read and write operations. Writer task could be interrupted in the middle of “write” operation, and reader task will try to read shared data. So, the problem should be solved by making access to shared data non-interruptible. Two solutions exist.

First solution is to make all shared memory access through special facility. This facility interacts with scheduler so all access operations are non-interruptible. It is obvious that this approach is very similar to message passing. Instead of “receive” operation which either blocks or returns nothing if there are no messages there is a “read” operation which returns old value of shared data.

Second solution is to have a facility which will just make some task non-interruptible, regardless if it accesses shared data or not. Part of the task during execution of which it will not be interrupted is called “critical section”. Simplest approach is to have this facility just to ask scheduler, but for multi-CPU machines it means that some of CPUs will waste their time. This problem is solved by having several shared areas, and specifying a specific shared area then entering non-interruptible section, so other CPUs can execute tasks which don’t access this shared area. Since dividing a task into interruptible and non-interruptible parts is a responsibility of a developer, this solution lacks unified mechanism for enforcing data integrity.

That’s all for today. Let’s summarize it all. Concurrent programming is an approach for solving following tasks: to emulate several sequental machines on one through “time sharing” of CPU, to simplify internal structure of certain kind of programs, and to take benefit of multi-CPU machines. Concurrent programming introduces following entities: tasks, scheduler and interaction facilities. Task interaction could be done using one of three approaches: message passing, guarded shared memory and shared memory with non-interruptible code sections. Facilities for task interaction are usually (but not always) internally interact with scheduler.

Condition variables

March 20, 2008 Leave a comment

While reading excellent book about IPC by Richard Stevens, I’ve learned about “condition variables” in POSIX threads, which provide a way to send a signal from one thread to another. A similar feature present in Java. Every object can become condition variable because it has methods wait() and notify().

However, reading the book made me curious. Why condition variables are needed ? They have simple semantics: wait() blocks the thread, notify() wakes it up. But common implementation of “locks” (called “mutexes” in POSIX) have similar semantics: if lock is held, then get() will block the thread, and releasing the lock will wake up waiting thread. So, maybe all condition variables could be replaced by locks ?

To answer this question, I’ve used producer-consumer task as an example and tried to replace condition variables with locks. Here is what I got (in pseudo-code):

var data;
lock signal;
lock protector;

proc init() {get(signal);}

proc produce(arg newdata) {
  data = newdata;

proc consume() {
  var localdata = data;
  return localdata;

This code uses lock named “signal” to block consumer thread until producer will put some data. However, this code has two problems.

First problem is that the time between releasing and acqusition of “signal” lock in “produce” is very small. The threading system must start a blocked thread immediatelly then the lock is released. Otherwise a producer thread will get the “signal” lock again, and there are chances that a consumer thread will block forever. With conditional variables it is also possible that a notification will not be delivered because there are no threads waiting, but in that case locking is usually designed in a way that a consumer thread will be scheduled with very high probability.

Second problem is that this code will not work if there are several producer threads. All invocations of produce() by threads other than thread which invoked init() will block. To solve this problem, we need our lock to be special. Same as with common lock, an attempt to acquire a lock which is already held by another thread should lead to blocking. But non-owner thread should be able to release the lock.

Even if we will have this “special lock”, there is still probability that produce() procedure will block, if another producing thread will grab the lock just then it was released. But, let us think that this is a minor problem.

Let’s analyze this new type of lock. Let’s call it “s_lock” (“special lock”). Its behaviour is very similar to semaphore: release/get sequence is like “post”, and get/release sequence is like “wait”.

Another interesting idea is to try and implement s_lock using usual locks (which must be released by thread which owns them). Let’s try (in pseudo-code):

struct s_lock {bool held, lock atomic, lock blocker};

proc s_get(s_lock arg) {
  while (arg.held) {
  arg.held = true;

proc s_release(s_lock arg) {
  arg.held = false;

To implement semantics of s_lock, we use two locks: one for performing atomic operations on flag, and another for blocking the thread which will try to get a busy lock. It is clear that this code will not work: s_release() should work correctly if invoked by any thread, but it will unable to release “blocker” lock if it is held by another thread. This means that we need s_locks to implement s_locks! So, the only options are either to have s_locks as primitives, or to implement them using condition variables (which we are trying to replace).

So, our attempt has failed. We definetelly need two separate things: one for atomicity (lock), and another for blocking and waking up. In fact, main source of confusion is a semantics of lock. Blocking on busy is not nesessary. If get() will just immediatelly return with unsuccesfull result if lock is already held, then all concurrent programs are still possible. It will just require usage of conditional variables in some cases there they are currently don’t needed.

To summarize things up. There are two tasks in multi-threaded programming: to perform some tasks in “atomic” way, and to avoid busy waiting (spinning). Those tasks could be acomplished by separate primitives (non-blocking locks and condition variables), or by single primitive (semaphore).