As mentioned earlier, the likelihood computations for different sites are nearly independent, so the work lends itself well to parallel computation. If your system has multiple processors, each processor can work on a different site. In principle, then, setting the number of threads equal to the number of processors should minimise computation time.
Unfortunately, reality is more complicated than this. The computation is very memory intensive. Computing a sitewise Hessian for a tree with n branches, for a model with p parameters, requires total memory usage of about (3*61/2)n(p*p+n) variables of type long double, each of which typically takes somewhere around 8-16 bytes. That is, each thread could need about n(p*p+n) kilobytes of memory. For large trees with lots of parameters, this can quickly use up most of your available memory, which actually causes the program to run more slowly. Therefore, if your memory is limited, or if other programs are also using it, it may be more efficient to use fewer threads. [On Unix-based systems, you can check memory usage with the top command. There are probably other ways too.]
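The estimate above can be turned into a rough calculation. The following Python sketch is only an illustration and is not part of the program: the function names, the default of 16 bytes per long double, and the reading that the estimate applies to each thread separately are all assumptions. It computes the approximate memory for one Hessian and suggests a thread count that stays within both the number of processors and the available memory.

```python
import os

# Constant factor from the estimate (3*61/2) * n * (p*p + n).
VARIABLES_FACTOR = 3 * 61 / 2


def hessian_variables(n_branches, n_params):
    """Approximate number of long double variables for one sitewise Hessian."""
    return VARIABLES_FACTOR * n_branches * (n_params * n_params + n_branches)


def hessian_memory_bytes(n_branches, n_params, bytes_per_variable=16):
    """Approximate memory in bytes.

    bytes_per_variable is an assumption: a long double typically takes
    somewhere around 8-16 bytes, depending on the platform.
    """
    return hessian_variables(n_branches, n_params) * bytes_per_variable


def suggest_threads(n_branches, n_params, available_bytes,
                    n_processors=None, bytes_per_variable=16):
    """Hypothetical helper: the largest thread count that fits in memory,
    capped at the number of processors and never less than 1."""
    if n_processors is None:
        n_processors = os.cpu_count() or 1
    per_thread = hessian_memory_bytes(n_branches, n_params, bytes_per_variable)
    fits_in_memory = int(available_bytes // per_thread)
    return max(1, min(n_processors, fits_in_memory))


# Example with made-up inputs: 100 branches, 32 parameters, 4 GiB free.
print(suggest_threads(100, 32, 4 * 1024**3))
```

This only accounts for the Hessian storage itself; any fixed overhead of the program comes on top of it, so the result is best treated as an upper bound on a sensible thread count.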
As an example of typical memory usage: on my computer, using a model with 32 parameters and 4 threads, memory usage was about 750 megabytes for a tree with 6 species, 875 megabytes for a tree with 12 species, about 800 megabytes for a tree with 25 species, and about 5 gigabytes for a tree with 349 species.
For the tree with 349 species (using sequences of 987 nucleotides), the times for a single Hessian calculation with various numbers of threads are as follows:
Number of Threads | Time for one Hessian Calculation
1 | 24m22.164s
2 | 18m40.653s
3 | 24m11.497s
4 | 22m35.210s
5 | 14m??
6 | 20m??
These tests were run on a computer with two 2.66 GHz Intel i5 processors (which support hyperthreading), a 4 MB shared L3 cache, and 6 GB of RAM in total.
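To compare the rows more easily, the timings can be converted to seconds and expressed as a speedup relative to the single-threaded run. The Python sketch below is only an illustration: the helper names are made up, and it uses just the four rows with complete times, since the seconds for 5 and 6 threads were not recorded.

```python
import re

# Complete timings from the table above (threads -> wall-clock time).
# The 5- and 6-thread rows are omitted because their seconds are unknown.
TIMINGS = {1: "24m22.164s", 2: "18m40.653s", 3: "24m11.497s", 4: "22m35.210s"}


def parse_time(text):
    """Convert a string such as '24m22.164s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognised time format: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def speedups(timings):
    """Return the speedup of each run relative to the 1-thread run."""
    seconds = {threads: parse_time(t) for threads, t in timings.items()}
    baseline = seconds[1]
    return {threads: baseline / s for threads, s in seconds.items()}


for threads, factor in sorted(speedups(TIMINGS).items()):
    print(f"{threads} thread(s): {factor:.2f}x")
```

On these numbers, 2 threads come out roughly 1.3 times faster than 1 thread, while 3 and 4 threads are less than 1.1 times faster, so on this machine adding threads did not give a steady improvement.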
[Note that the threaded version is not yet optimally implemented. Many values are needlessly calculated once for each thread. Improvements to later versions of the program should increase the benefits from using additional threads.]