Suppose that we have a graphics program that performs ray tracing. Each raster line on the screen depends on the main database (which describes the actual picture being generated). The key here is this: each raster line is independent of the others. This independence immediately marks the program as a good candidate for threading.
Here's the single-threaded version:
int
main (int argc, char **argv)
{
    int x1;

    ... // perform initializations

    for (x1 = 0; x1 < num_x_lines; x1++) {
        do_one_line (x1);
    }

    ... // display results
}
Here we see that the program iterates x1 over all the raster lines that are to be calculated.
On an SMP system, this program uses only one of the CPUs. Why? Because we haven't told the operating system to do anything in parallel. The operating system isn't smart enough to look at the program and say, "Hey, hold on a second! We have 4 CPUs, and it looks like there are independent execution flows here. Let's run it on all 4 CPUs!"
So, it's up to the system designer (you) to tell BlackBerry 10 OS which parts can be run in parallel. The easiest way to do that would be:
int
main (int argc, char **argv)
{
    int x1;

    ... // perform initializations

    for (x1 = 0; x1 < num_x_lines; x1++) {
        pthread_create (NULL, NULL, do_one_line, (void *) x1);
    }

    ... // display results
}
There are a number of problems with this simplistic approach. First of all (and this is the most minor issue), the do_one_line() function has to be modified to take a void * instead of an int as its argument. This is easily remedied with a typecast.
The second problem is a little bit trickier. Let's say that the screen resolution that you are computing the picture for was 1280 by 1024. You would be creating 1280 threads! This is not a problem for BlackBerry 10 OS — the OS limits you to 32767 threads per process! However, each thread must have a unique stack. If your stack is a reasonable size (say 8 KB), you have used 1280 × 8 KB (10 megabytes!) of stack. And for what? There are only 4 processors in your SMP system. This means that only 4 of the 1280 threads run at a time — the other 1276 threads are waiting for a CPU. (In reality, the stack faults in, meaning that the space for it is allocated only as required. Nonetheless, it's a waste — there are still other overheads.)
A much better solution to this is to break the problem up into 4 pieces (one for each CPU), and start a thread for each piece:
int num_lines_per_cpu;
int num_cpus;

int
main (int argc, char **argv)
{
    int cpu;

    ... // perform initializations

    // get the number of CPUs
    num_cpus = _syspage_ptr -> num_cpu;
    num_lines_per_cpu = num_x_lines / num_cpus;
    for (cpu = 0; cpu < num_cpus; cpu++) {
        pthread_create (NULL, NULL,
                        do_one_batch, (void *) cpu);
    }

    ... // display results
}

void *
do_one_batch (void *c)
{
    int cpu = (int) c;
    int x1;

    for (x1 = 0; x1 < num_lines_per_cpu; x1++) {
        do_one_line (x1 + cpu * num_lines_per_cpu);
    }
    return (NULL);
}
Here we start only num_cpus threads. Each thread runs on one CPU. And since we have only a small number of threads, we're not wasting memory with unnecessary stacks. Notice how we got the number of CPUs by dereferencing the System Page global variable _syspage_ptr.
The best part about this code is that it functions just fine on a single-processor system — you create only one thread, and have it do all the work. The additional overhead (one stack) is well worth the flexibility of having the software just work faster on an SMP box.
We mentioned that there were a number of problems with the simplistic code sample initially shown. Another problem with it is that main() starts up a bunch of threads and then displays the results. How does main() know when it's safe to display the results? Having main() poll for completion would defeat the purpose of a realtime operating system:
int
main (int argc, char **argv)
{
    ...

    // start threads as before

    while (num_lines_completed < num_x_lines) {
        sleep (1);
    }
}
Don't even consider writing code like this!
There are two elegant solutions to this problem: pthread_join() and pthread_barrier_wait().
The simplest method of synchronization is to join the threads as they terminate. Joining really means waiting for termination: one thread waits for the termination of another by calling pthread_join():
#include <pthread.h>

int
pthread_join (pthread_t thread, void **value_ptr);
To use pthread_join(), you pass it the thread ID of the thread that you want to join, and an optional value_ptr, which can be used to store the termination return value from the joined thread. (You can pass in a NULL if you aren't interested in this value — we're not, in this case.)
Where did the thread ID come from? We ignored it in the call to pthread_create() — we passed in a NULL for the first parameter. Let's now correct our code:
int num_lines_per_cpu, num_cpus;

int
main (int argc, char **argv)
{
    int cpu;
    pthread_t *thread_ids;

    ... // perform initializations

    thread_ids = malloc (sizeof (pthread_t) * num_cpus);
    num_lines_per_cpu = num_x_lines / num_cpus;
    for (cpu = 0; cpu < num_cpus; cpu++) {
        pthread_create (&thread_ids [cpu], NULL,
                        do_one_batch, (void *) cpu);
    }

    // synchronize to termination of all threads
    for (cpu = 0; cpu < num_cpus; cpu++) {
        pthread_join (thread_ids [cpu], NULL);
    }

    ... // display results
}
This time we passed the first argument to pthread_create() as a pointer to a pthread_t. This is where the thread ID of the newly created thread gets stored. After the first for loop finishes, we have num_cpus threads running, plus the thread that's running main(). We're not too concerned about the main() thread consuming all our CPU; it's going to spend its time waiting.
The waiting is accomplished by doing a pthread_join() to each of our threads in turn. First, we wait for thread_ids [0] to finish. When it completes, the pthread_join() unblocks. The next iteration of the for loop causes us to wait for thread_ids [1] to finish, and so on, for all num_cpus threads.
A common question that arises at this point is, what if the threads finish in the reverse order? In other words, what if there are 4 CPUs, and, for whatever reason, the thread running on the last CPU (CPU 3) finishes first, and then the thread running on CPU 2 finishes next, and so on? Well, the beauty of this scheme is that nothing bad happens.
The first thing that's going to happen is that the pthread_join() blocks on thread_ids [0]. Meanwhile, thread_ids [3] finishes. This has absolutely no impact on the main() thread, which is still waiting for the first thread to finish. Then thread_ids [2] finishes. Still no impact. And so on, until finally thread_ids [0] finishes, at which point, the pthread_join() unblocks, and we immediately proceed to the next iteration of the for loop. The second iteration of the for loop executes a pthread_join() on thread_ids [1], which does not block — it returns immediately. Why? Because the thread identified by thread_ids [1] is already finished. Therefore, our for loop whips through the other threads, and then exits. At that point, we know that we've synched up with all the computational threads, so we can now display the results.
When we talked about the synchronization of the main() function to the completion of the worker threads (in Synchronizing to the termination of a thread), we mentioned two methods: pthread_join(), which we've looked at, and a barrier.
Returning to our house analogy, suppose that the family wanted to take a trip somewhere. The driver gets in the minivan and starts the engine. And waits. The driver waits until all the family members have boarded, and only then does the van leave to go on the trip — we can't leave anyone behind!
This is exactly what happened with the graphics example. The main thread needs to wait until all the worker threads have completed, and only then can the next part of the program begin.
Note an important distinction, however. With pthread_join(), we wait for the termination of the threads. This means that the threads are no longer with us; they've exited.
With the barrier, we wait for a certain number of threads to rendezvous at the barrier. Then, when the requisite number are present, we unblock all of them. (Note that the threads continue to run.)
You first create a barrier with pthread_barrier_init():
#include <pthread.h>

int
pthread_barrier_init (pthread_barrier_t *barrier,
                      const pthread_barrierattr_t *attr,
                      unsigned int count);
This creates a barrier object at the address passed in barrier, with the attributes specified by attr (we'll just use NULL to get the defaults). The number of threads that must call pthread_barrier_wait() before they all unblock is passed in count.
Once the barrier is created, we then want each of the threads to call pthread_barrier_wait() to indicate that it has completed:
#include <pthread.h>

int
pthread_barrier_wait (pthread_barrier_t *barrier);
When a thread calls pthread_barrier_wait(), it blocks until the number of threads specified initially in the pthread_barrier_init() have called pthread_barrier_wait() (and blocked too). When the correct number of threads have called pthread_barrier_wait(), all those threads simultaneously unblock.
Here's an example:
/*
 *  barrier1.c
 */

#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/neutrino.h>

pthread_barrier_t barrier; // the barrier synchronization object

void *
thread1 (void *not_used)
{
    time_t now;
    char buf [27];

    time (&now);
    printf ("thread1 starting at %s", ctime_r (&now, buf));

    // do the computation
    // let's just do a sleep here...
    sleep (20);
    pthread_barrier_wait (&barrier);

    // after this point, all three threads have completed.
    time (&now);
    printf ("barrier in thread1() done at %s", ctime_r (&now, buf));
    return (NULL);
}

void *
thread2 (void *not_used)
{
    time_t now;
    char buf [27];

    time (&now);
    printf ("thread2 starting at %s", ctime_r (&now, buf));

    // do the computation
    // let's just do a sleep here...
    sleep (40);
    pthread_barrier_wait (&barrier);

    // after this point, all three threads have completed.
    time (&now);
    printf ("barrier in thread2() done at %s", ctime_r (&now, buf));
    return (NULL);
}

int
main () // ignore arguments
{
    time_t now;
    char buf [27];

    // create a barrier object with a count of 3
    pthread_barrier_init (&barrier, NULL, 3);

    // start up two threads, thread1 and thread2
    pthread_create (NULL, NULL, thread1, NULL);
    pthread_create (NULL, NULL, thread2, NULL);

    // at this point, thread1 and thread2 are running

    // now wait for completion
    time (&now);
    printf ("main () waiting for barrier at %s", ctime_r (&now, buf));
    pthread_barrier_wait (&barrier);

    // after this point, all three threads have completed.
    time (&now);
    printf ("barrier in main () done at %s", ctime_r (&now, buf));
    return (0);
}
The main thread creates the barrier object and initializes it with a count of how many threads (including itself!) should be synchronized to the barrier before it breaks through. In our sample, this is a count of 3 — one for the main() thread, one for thread1(), and one for thread2(). Then the graphics computational threads (thread1() and thread2() in our case here) start, as before. For illustration, instead of showing source for graphics computations, we stick in a sleep (20); and sleep (40); to cause a delay, as if computations are occurring. To synchronize, the main thread blocks itself on the barrier, knowing that the barrier unblocks only after the worker threads have joined it as well.
As mentioned earlier, with pthread_join(), the worker threads are done and dead by the time the main thread synchronizes with them. But with the barrier, the threads are alive and well — in fact, they've just unblocked from the pthread_barrier_wait() when all have completed. The wrinkle introduced here is that you should be prepared to do something with these threads! In our graphics example, there's nothing for them to do (as we've written it). In real life, you may want to start the next frame calculations.
Suppose that we modify our example slightly so that we can illustrate why it's also sometimes a good idea to have multiple threads even on a single-CPU system.
In this modified example, one node on a network is responsible for calculating the raster lines (same as the graphics example, above). However, when a line is computed, its data should be sent over the network to another node, which performs the display functions. Here's our modified main() (from the original example, without threads):
int
main (int argc, char **argv)
{
    int x1;

    ... // perform initializations

    for (x1 = 0; x1 < num_x_lines; x1++) {
        do_one_line (x1);            // "C" in our diagram, below
        tx_one_line_wait_ack (x1);   // "X" and "W" in diagram below
    }
}
Notice that we've eliminated the display portion and instead added a tx_one_line_wait_ack() function. Let's further suppose that we're dealing with a reasonably slow network, but that the CPU doesn't really get involved in the transmission aspects — it fires the data off to some hardware that then worries about transmitting it. The tx_one_line_wait_ack() uses a bit of CPU to get the data to the hardware, but then uses no CPU while it's waiting for the acknowledgment from the far end.
Here's a diagram showing the CPU usage (we've used C for the graphics compute part, X for the transmit part, and W for waiting for the acknowledgment from the far end):
Wait a minute! We're wasting precious seconds waiting for the hardware to do its thing!
If we made this multithreaded, we should be able to get much better use of our CPU, right?
This is much better, because now, even though the second thread spends a bit of its time waiting, we've reduced the total overall time required to compute.
If our times were Tcompute to compute, Ttx to transmit, and Twait to let the hardware do its thing, in the first case our total running time would be:
(Tcompute + Ttx + Twait) × num_x_lines
whereas with the two threads it would be
(Tcompute + Ttx) × num_x_lines + Twait
which is shorter by
Twait × (num_x_lines - 1)
assuming, of course, that Twait ≤ Tcompute.
Note, however, that we will ultimately be limited by:

Tcompute + Ttx × num_x_lines

because we have to incur at least one full computation, and we have to transmit the data out the hardware — while we can use multithreading to overlay the computation cycles, we have only one hardware resource for the transmit.
Now, if we create a four-thread version and run it on an SMP system with 4 CPUs, we end up with something that looks like this:
Notice how each of the four CPUs is underutilized (as indicated by the empty rectangles in the utilization graph). There are two interesting areas in the figure above. When the four threads start, they each compute. Unfortunately, when each thread finishes its computation, it contends for the transmit hardware (the X parts in the figure are offset — only one transmission may be in progress at a time). This gives us a small anomaly during startup. Once the threads are past this stage, they're naturally synchronized to the transmit hardware, since the time to transmit is much smaller than ¼ of a compute cycle. Ignoring the small anomaly at the beginning, this system is characterized by the formula:
(Tcompute + Ttx + Twait) × num_x_lines / num_cpus
This formula states that using four threads on four CPUs is approximately 4 times faster than the single-threaded model we started out with.
Combining this with what we learned from the multithreaded single-processor version, we would ideally like to have more threads than CPUs, so that the extra threads can soak up the idle CPU time from the transmit-acknowledge waits (and the transmit-slot contention waits) that naturally occur. In that case, we'd have something like this:
This figure assumes a few things:
Notice from the diagram that even though we now have twice as many threads as CPUs, we still run into places where the CPUs are underutilized. In the diagram, there are three such places where the CPU is stalled; these are indicated by numbers in the individual CPU utilization bar graphs:
This example also serves as an important lesson — you can't just keep adding CPUs in the hopes that things keep getting faster. There are limiting factors. In some cases, these limiting factors are governed by the design of the multi-CPU motherboard — how much memory and device contention occurs when many CPUs try to access the same area of memory. In our case, notice that the TX Slot Utilization bar graph was starting to become full. If we added enough CPUs, they would eventually run into problems because their threads would be stalled, waiting to transmit.
In any event, by using soaker threads to soak up spare CPU, we now have much better CPU utilization. This utilization approaches:
(Tcompute + Ttx) × num_x_lines / num_cpus
In the computation, we're limited only by the amount of CPU we have; we're not idling any processor waiting for acknowledgment. (Obviously, that's the ideal case. As you saw in the diagram, there are still a few times when one CPU is periodically idle. Also, as noted above,
Tcompute + Ttx × num_x_lines
is our limit on how fast we can go.)
While in general you can ignore whether you're running on an SMP architecture or a single processor, there are certain things that can bite you. Unfortunately, they may be such low-probability events that they won't show up during development, but rather during testing, demos, or — worst of all — out in the field. Taking a few moments now to program defensively will save you problems down the road.
Here are the kinds of things that you can run up against on an SMP system: