As discussed in Where a thread is a good idea, threads are also useful where a number of independent processing algorithms operate on shared data structures. Strictly speaking, you could have a number of processes (each with one thread) explicitly sharing data, but in some cases it's far more convenient to have a number of threads in one process instead. Let's see why and where you'd use threads in this case.
For our examples, we'll evolve a standard input/process/output model. In the most generic sense, one part of the model is responsible for getting input from somewhere, another part is responsible for processing the input to produce some form of output (or control), and the third part is responsible for feeding the output somewhere.
Let's first look at the situation from a multiple-process, one-thread-per-process perspective. In this case, we have three processes: an input process, a processing process, and an output process:
This is the most highly abstracted form, and also the most loosely coupled. The input process has no real binding with either the processing or the output process — it's responsible for gathering input and somehow handing it to the next stage (the processing stage). We can say the same thing of the processing and output processes — they too have no real binding with each other. We're also assuming in this example that the communication path (that is, the input-to-processing and the processing-to-output data flow) is accomplished over some connection-oriented protocol (for example, pipes, POSIX message queues, native BlackBerry 10 OS message passing — whatever).
Depending on the volume of data flow, we may want to optimize the communication path. The easiest way to do this is to make the coupling between the three processes tighter. Instead of using a general-purpose connection-oriented protocol, we now choose a shared memory scheme (in the diagram, the thick lines indicate data flow; the thin lines, control flow):
In this scheme, we've tightened up the coupling, resulting in faster and more efficient data flow. We may still use a general-purpose connection-oriented protocol to transfer control information around — we're not expecting the control information to consume much bandwidth.
The most tightly-coupled system is represented by the following scheme:
Here we see one process with three threads. The three threads share the data areas implicitly. Also, the control information may be implemented as it was in the previous examples, or it may also be implemented via some of the thread synchronization primitives such as mutexes, barriers, and semaphores.
Now, let's compare the three methods using various categories, and we'll also describe some of the trade-offs. With system 1, we see the loosest coupling. This has the advantage that each of the three processes can be easily (that is, via the command line, as opposed to recompile/redesign) replaced with a different module. This follows naturally, because the unit of modularity is the entire module itself. System 1 is also the only one that can be distributed among multiple nodes in a BlackBerry 10 OS network. Since the communications pathway is abstracted over some connection-oriented protocol, it's easy to see that the three processes can be executing on any machine in the network. This may be a very powerful scalability factor for your design — you may need your system to scale up to having hundreds of machines distributed geographically (or in other ways, for example, for peripheral hardware capability) and communicating with each other.
Once we commit to a shared memory region, however, we lose the ability to distribute over a network. BlackBerry 10 OS doesn't support network-distributed shared memory objects. So in system 2, we've effectively limited ourselves to running all three processes on the same box. We haven't lost the ability to easily remove or change a component, because we still have separate processes that can be controlled from the command line. But we have added the constraint that all the removable components need to conform to the shared-memory model.
In system 3, we've lost all of the above abilities. We definitely can't run different threads from one process on multiple nodes (we can run them on different processors in an SMP system, though). And we've lost our configurability: we now need an explicit mechanism to define which input, processing, or output algorithm we want to use (a problem we can solve with shared objects, also known as DLLs).
So why would you design your system to have multiple threads like system 3? Why not go for the maximally flexible system 1?
Well, even though system 3 is the most inflexible, it's most likely going to be the fastest. Context switches happen only between threads in the same process, which is cheaper than switching between processes; you don't have to set up memory sharing explicitly; and you don't have to use abstracted synchronization methods like pipes, POSIX message queues, or message passing to deliver the data or control information — you can use basic kernel-level thread-synchronization primitives. Another advantage is that when the one process (with its three threads) starts, you know that everything it needs has been loaded off the storage medium (that is, you won't find out later that "Oops, the processing driver is missing from the disk!"). Finally, system 3 is also most likely going to be the smallest, because we won't have three individual copies of process information (for example, file descriptors).
To sum up: know what the trade-offs are, and use what works for your design.