cold calculates and maximises log-likelihoods for user-defined codon models in phylogeny.
This manual is for version 1.0.0 of cold.
cold has a number of command line options as detailed below. A number of the options read input from files. The required formats of these files are described in files.html. The options have been divided into the following categories:
loads the arguments from the chosen model, which must be described in the model file. The model file models is provided with the package, and contains details for various standard models. If these are not sufficient, the user can supply their own model file for describing their own models.
instructs the program to use the specified model file instead of the default one.
tells the program to only read the first n parameter matrices (thereby selecting a submodel of the model described in the parameter file).
tells the program to ignore the mask matrix (which determines which codon changes are possible in a single step - e.g. only single nucleotide change, ...)
Parameter file - the program loads the parameters from the given text file. The files parametermatrices and standardmodelmatrices are provided with the package. Alternatively, the user can write their own parameter files.
allows a selected subset of the parameters to be used. The option accepts a comma-separated list of parameter numbers and ranges, for example 1,3-7,9-11,36. If the list contains spaces, it will probably need to be enclosed in quotes.
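The list format above can be illustrated with a short Python sketch. This is not cold's own parser: the function name parse_parameter_list is hypothetical, and treating ranges as inclusive, ignoring duplicates and returning the numbers in sorted order are all assumptions.

```python
def parse_parameter_list(spec):
    """Expand a list such as "1,3-7,9-11,36" into sorted parameter numbers."""
    selected = set()
    for item in spec.split(","):
        item = item.strip()              # tolerate spaces around entries
        if not item:
            continue                     # ignore empty entries, e.g. a trailing comma
        if "-" in item:
            low, high = (int(part) for part in item.split("-", 1))
            if low > high:
                raise ValueError(f"descending range: {item!r}")
            selected.update(range(low, high + 1))   # assumption: ranges are inclusive
        else:
            selected.add(int(item))
    return sorted(selected)


print(parse_parameter_list("1,3-7,9-11,36"))
```

Under these assumptions the example list selects ten parameters: 1, 3 to 7, 9 to 11, and 36.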
specifies that parameters don't vary, and that the specified transition matrix should be used.
instructs the program to fix the parameter values, and only optimise the branch lengths.
selects a mask to use from the maskfile. (This allows the user to specify no multiple-nucleotide changes for example). The input is the number of the chosen mask within the file. The default mask file masks supplied with the package only contains one mask at present, so this value should always be 1 unless using a user-supplied mask file.
instructs the program to use the file specified when reading masks, instead of the default mask file.
Tells the program to use a mixed model, and indicates the number of separate distributions to be mixed. Parameters can be further controlled by the option --mixfile or --mixstring.
Defines the parameters in a mixed model (some parameters may be estimated for each component of the mixture; others may be the same for all components; while others may be fixed in certain components). See files.html for details of the format of this file.
Like mixfile, except that the information is input on the command line as a string instead of being read from a file. The format is the same as for the mixture file.
Instructs the program to estimate the Pi parameters empirically in the manner described. There are currently four options:
F61 | Empirically estimates all codon frequencies. |
F3x4 | Empirically estimates nucleotide frequencies in each position within the codon. |
F1x4 | Empirically estimates nucleotide frequencies. |
Fequal | Sets all codon frequencies to 1/61. |
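The four empirical options can be sketched in Python as follows. This is an illustration under stated assumptions rather than cold's implementation: the function codon_frequencies is hypothetical, the standard genetic code (three stop codons) is assumed, and for F1x4 and F3x4 the codon frequencies are assumed to be products of nucleotide frequencies renormalised over the 61 sense codons, which is a common convention but is not confirmed by this manual.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
STOP_CODONS = {"TAA", "TAG", "TGA"}      # assumption: standard genetic code
SENSE_CODONS = ["".join(c) for c in product(BASES, repeat=3)
                if "".join(c) not in STOP_CODONS]


def codon_frequencies(sequences, method):
    """Return {codon: frequency} over the 61 sense codons."""
    codons = [seq[i:i + 3] for seq in sequences
              for i in range(0, len(seq) - 2, 3)]
    codons = [c for c in codons if c in SENSE_CODONS]   # skip gaps, stops, ambiguity codes
    if method == "Fequal":
        weights = {c: 1.0 for c in SENSE_CODONS}
    elif method == "F61":
        counts = Counter(codons)
        weights = {c: counts[c] for c in SENSE_CODONS}
    elif method in ("F1x4", "F3x4"):
        # One nucleotide count table per codon position.
        by_position = [Counter(c[p] for c in codons) for p in range(3)]
        if method == "F1x4":
            pooled = sum(by_position, Counter())        # pool the three positions
            by_position = [pooled] * 3
        weights = {c: by_position[0][c[0]] * by_position[1][c[1]] * by_position[2][c[2]]
                   for c in SENSE_CODONS}
    else:
        raise ValueError(f"unknown method: {method}")
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


frequencies = codon_frequencies(["ATGATGAAA"], "F61")
print(round(frequencies["ATG"], 3))
```

With very little data, F61 assigns zero frequency to every codon that is never observed, which is one reason the lower-dimensional F3x4 and F1x4 estimates can be preferable for short alignments.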
Instructs the program to use the given values for mixing probabilities, rather than estimating them. This can be useful for discrete approximations to fixed distributions.
Sets the values of fixed parameters.
Tree file - the program loads the tree from the given text file.
Selects a tree from a file containing multiple trees.
Data file - the program loads the sequence data from the given text file.
family - both the tree and sequence file have the same name, but a different file extension. The program attempts to locate both of them by testing the known extensions. [The possibilities are set at compile time from fileExtensions.h. Hopefully later versions of cold will allow the chosen file extensions to be set at runtime. Also, hopefully, later versions will have improved detection routines to help find the right files. For full details on the file searching algorithms, see this note.]
gives a file from which the initial set of parameters is read. If this option is not given, the initial parameters are estimated from an initial matrix.
Matrix file - the program will base its initial estimate of the parameters on the Q matrix read from this file. It chooses the parameters that best approximate the Q matrix loaded from this file, as its starting value.
reads additional command-line arguments from the specified file. By default the file .variables is read for extra command-line arguments. Editing this file can be used to change the default values of variables, or, in environments without a command line, to supply the command-line arguments.
By default the program uses a parsimony method to estimate the initial branch lengths. This command tells the program to use the branch lengths indicated on the tree.
Saves the final output (hopefully the parameters that give maximum likelihood) to the file named filename. Without this option, it outputs this information to standard output.
causes the program to print the likelihood for the data at each site once the Newton-Raphson method has converged, i.e. once the maximum likelihood has been found. If the optimisation does not converge (so the maximum likelihood is not found), this option does nothing.
causes the program to print t-statistics for all variables at the MLE (assuming it converges). These can be used to decide which variables are more likely to be important.
Causes the observed information matrix to be printed out in the final output. This matrix can be used to estimate asymptotic covariance.
Outputs the estimated standard deviations of the branch lengths. These can be used to obtain confidence intervals for the branch length estimates.
Number of iterations - this is the maximum number of iterations of Newton-Raphson to try while seeking to maximise likelihood. If the algorithm hasn't converged by the end of n iterations, it reports the current position. Unless the quiet flag is set, the output should be enough to indicate whether continuing would be likely to lead to convergence (i.e. if the value of n had been set too small).
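The effect of the iteration limit can be sketched with a one-parameter Newton-Raphson loop in Python. This is only an illustration: the function newton_maximise, the convergence test on the step size and the toy exponential log-likelihood are assumptions, not a description of cold's optimiser.

```python
def newton_maximise(gradient, hessian, start, max_iterations, tolerance=1e-10):
    """One-parameter Newton-Raphson; returns (estimate, converged, iterations_used)."""
    x = start
    for iteration in range(1, max_iterations + 1):
        step = -gradient(x) / hessian(x)     # Newton-Raphson update
        x += step
        if abs(step) < tolerance:            # assumption: converge when the step is tiny
            return x, True, iteration
    return x, False, max_iterations          # limit reached: report the current position


# Toy log-likelihood: n exponential observations with sum S, maximised at rate = n / S.
n, S = 10, 5.0
gradient = lambda rate: n / rate - S
hessian = lambda rate: -n / rate ** 2

print(newton_maximise(gradient, hessian, 1.0, max_iterations=3))
print(newton_maximise(gradient, hessian, 1.0, max_iterations=50))
```

In the first call the limit is too small and the loop stops short of the maximum at 2.0; in the second the loop converges well before the limit is reached.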
sets the searchpath for all other files. This is where cold searches for files. This allows the user to store necessary data files in non-standard locations. The default searchpath is stored in the .variables file, and so can be changed. [Later versions may allow this option to add to the search path, rather than replacing it.]
Simulates data: starting from the root of the tree, it simulates random evolution according to the model specified (using the other options, same as for maximising likelihood). The first number is the number of files to produce, and the second is the length of the sequences in each file (in nucleotides). This length should therefore be a multiple of 3 [there is currently no error checking for this]. If given, the seed is used to seed the random number generator. This makes it possible to reproduce the results. [The simulation uses the default random number generator provided by the compiler. This may not be good enough for some purposes.]
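The idea of simulating evolution from the root down the tree, with a seed for reproducibility, can be sketched in Python. Everything here is hypothetical and much simpler than a real codon model: the three-codon rate table RATES, the nested-list tree TREE, the function names, and the uniform choice of root state are assumptions made for illustration. Unlike cold, this sketch does check that the length is a multiple of 3.

```python
import random

# Toy rate table (hypothetical): RATES[i][j] is the instantaneous rate from i to j.
RATES = {
    "AAA": {"AAG": 1.0},
    "AAG": {"AAA": 1.0, "AGG": 0.5},
    "AGG": {"AAG": 0.5},
}
# Toy tree (hypothetical): an internal node is a list of (child, branch_length) pairs.
TREE = [("human", 0.1), ([("mouse", 0.2), ("rat", 0.15)], 0.3)]


def evolve(state, branch_length, rates, rng):
    """Simulate a continuous-time Markov chain along one branch."""
    remaining = branch_length
    while True:
        exits = rates[state]
        remaining -= rng.expovariate(sum(exits.values()))   # time to the next change
        if remaining <= 0:
            return state
        targets = list(exits)
        state = rng.choices(targets, weights=[exits[t] for t in targets])[0]


def leaf_names(node):
    if isinstance(node, str):
        return [node]
    return [name for child, _ in node for name in leaf_names(child)]


def simulate_site(node, state, rates, rng, out):
    if isinstance(node, str):
        out[node].append(state)              # leaf: record the simulated codon
        return
    for child, length in node:
        simulate_site(child, evolve(state, length, rates, rng), rates, rng, out)


def simulate_alignment(tree, rates, n_nucleotides, seed=None):
    if n_nucleotides % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    rng = random.Random(seed)                # same seed gives the same data
    out = {name: [] for name in leaf_names(tree)}
    for _ in range(n_nucleotides // 3):
        root = rng.choice(sorted(rates))     # simplification: uniform root state
        simulate_site(tree, root, rates, rng, out)
    return {name: "".join(codons) for name, codons in out.items()}


print(simulate_alignment(TREE, RATES, 30, seed=1))
```

Passing the same seed twice gives identical alignments, which is the reproducibility property described above.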
Automatically runs a crude variable selection algorithm. It checks the t-statistics after estimating the parameters. Any that are not significant are removed, and a new model is fitted on the remaining variables. If this model is significantly worse, the previous model is output. Otherwise, it looks at the t-statistics in the new model, and tries removing ones that are not significant.
This variable selection is rather crude, and it would probably be better to do the variable selection manually for individual data sets. This option is useful when selecting variables for many simulated data sets.
gives the name of the file to save the current state to if the program is interrupted, or has certain kinds of error. The file output is plain text, so it can be read by the user in the case of error, to help with debugging. The default filename is statefile.
tells the program to attempt to recover its last state from a file, which should have been set using the --state option on a previous run, saving the need to repeat a lot of calculation.
By default, the program automatically backs up its current position after every Newton-Raphson iteration, allowing some recovery in the event of an unstoppable interruption. This option disables automatic backups.
Interactivity - sets the level of interactivity for the program. Options are as follows:
-1 | LOGFILE | send all messages to a log file. |
0 | AWAY | display messages, make default choices, don't ask for user input. |
1 | VITAL | only request user input when absolutely necessary. |
2 | PRESENT | request user input whenever it may be helpful to do so. |
3 | ALL | request user input on all irregularities, even when no input would usually be expected. |
START | [Not yet implemented] Perform all file loading and parsing operations at the start of execution, and prompt for user input at this time. Do not prompt for user input once optimisation has started. |
Debug level - controls the amount of debugging information output. The following options are available:
-1 | SILENCE | No Debugging or even status updates displayed. |
0 | NODEBUG | Appropriate for most users. This provides basic status updates indicating the progress of the program. This should be sufficient to indicate whether the program is progressing normally. If a problem occurs, the program can be rerun with a higher debugging level to identify the problem. |
1 | BASICDEBUG | Gives basic debugging output. May be helpful if something goes wrong. Might identify file format problems or similar things that may be fixed easily by the user. |
2 | FULLDEBUG | Mostly only useful for developing the package, or for users who want to extend the package with their own code. If you want to send a bug report, please include the output from the command run with -D2. |
Numbers greater than 2 default to 2, so running with -D3 is also OK, and will allow for any later extensions to the debugging options. See the table at the bottom of the page for details of what output is available.
[Not yet implemented] quiet - does not print any information about the status of the program - just outputs the desired information at the end. This can be achieved with -D -1.
[Not yet implemented] verbose - provides a lot of information about the current state of the program. Useful for debugging. This can be achieved with -D 2.
controls the status display. The option should be a number. When calculating a Hessian, the program will print information about which site it is currently working on. This option controls how often the program prints this message.
[Not yet implemented] precision - controls how close to zero numbers have to be before being treated as zero. Too small a precision can lead to unstable results.
[Not yet implemented] buffer size - controls sizes of internal buffers for loading data from files. There should be little reason to change this, but could be useful if there seem to be problems with saved data being corrupted.
Number of threads. In order to make more efficient use of multiple processors, computations for different sites are performed by different threads, which can be run simultaneously on different processors. The number of threads should usually be equal to the number of 'cores' on the machine (or twice the number if hyperthreading is possible). This should be set as the default when compiling. However, if other processes are also running on the machine, it may be better to use fewer threads. Single threaded mode is less likely to have bugs (though the program has been well tested for threading bugs and seems OK). [Note that for large models or trees, each thread can take up a lot of memory, so it may be faster to run fewer threads than there are processors available in this case.] For more information on choosing the number of threads, see this note.
Operation - the default operation, 1, is to maximise the likelihood. Setting this option to 0 makes the program attempt to minimise the likelihood instead.
For debugging. Tests whether the Hessian and derivatives calculated by the program are correct (by comparing to values approximated by comparing two nearby points).
For use with the --testderivs option. This sets the distance between points for testing derivatives. The default is 1e-6. If the derivatives are small relative to the values, this can cause rounding errors to make the test inaccurate. In such a case, setting a larger value can improve the testing.
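The kind of test described above can be sketched in Python. The manual does not say whether cold uses a forward or a central difference, so the central difference below is an assumption, as are the function names and the toy objective; the default step of 1e-6 matches the default quoted above.

```python
def numerical_gradient(f, x, h=1e-6):
    """Approximate each partial derivative from two nearby points, x_i - h and x_i + h."""
    gradient = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        gradient.append((f(up) - f(down)) / (2 * h))
    return gradient


def check_gradient(f, analytic_gradient, x, h=1e-6, tolerance=1e-4):
    """Return True if the analytic gradient agrees with the numerical one."""
    approx = numerical_gradient(f, x, h)
    exact = analytic_gradient(x)
    return all(abs(a - e) <= tolerance * max(1.0, abs(e))
               for a, e in zip(approx, exact))


# Toy objective with a known gradient (hypothetical, for illustration only).
f = lambda p: -(p[0] - 1) ** 2 - 2 * (p[1] + 0.5) ** 2
correct = lambda p: [-2 * (p[0] - 1), -4 * (p[1] + 0.5)]
wrong = lambda p: [-2 * (p[0] - 1), -2 * (p[1] + 0.5)]   # deliberately incorrect

print(check_gradient(f, correct, [0.3, 0.2]))
print(check_gradient(f, wrong, [0.3, 0.2]))
```

The difference f(up) - f(down) loses precision when the function values are large relative to the change between them, which is why a larger step can make the test more reliable in that case.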
The program output can be divided into two sorts - status and debugging information during execution; and final output.
cold model (...), and the exponential of the coefficient (which is the parameter in some models).
Minimum Debug Level | Statement | Meaning |
0 | Started step number n | Currently working on nth iteration. |
0 | Value | Log Likelihood for current parameters. |
0 | Expected Improvement | The amount by which the Newton-Raphson method predicts that the next point will be better than the current one (assuming that the step size is not changed by other factors). If the value is positive, the current point seems to be near a local maximum. If the value is negative, it is near a local minimum. |
0 | Step size | The modulus of the difference between the current parameters and the next parameters. |
2 | Producing n threads for computation | The site likelihoods are computed separately. Computing them in different threads can improve speed. However, it increases memory usage, and there is more danger of bugs. |
2 | Angle between steepest and actual ascent | This is a normalised dot product rather than an angle. It gives the dot product between the direction suggested by Newton-Raphson and the direction of steepest ascent. Generally, values close to 1 or -1 indicate circular contours, which should lead to good convergence. Values close to 0 indicate elongated contours which can indicate ill-conditioned parameters. Positive values indicate the program is converging towards a maximum; negative values indicate it is moving away from a minimum. |
2 | Modulus | The sum of squares of all current parameter values. This gives an idea of the large-scale movement of the algorithm. |
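The status values in the table above can be illustrated for a two-parameter problem with a short Python sketch. The formulas are assumptions, since the manual does not give them: the expected improvement is taken to be the change predicted by the local quadratic approximation, the step size is taken to be the Euclidean length of the Newton-Raphson step, and the function name newton_diagnostics is hypothetical.

```python
import math


def newton_diagnostics(x, gradient, hessian):
    """One Newton-Raphson step for two parameters, with the reported diagnostics."""
    (a, b), (c, d) = hessian
    g0, g1 = gradient
    det = a * d - b * c
    # Newton-Raphson step: solve H * step = -g with the explicit 2x2 inverse.
    step = (-(d * g0 - b * g1) / det, -(-c * g0 + a * g1) / det)
    slope = g0 * step[0] + g1 * step[1]                     # g . step
    curvature = (step[0] * (a * step[0] + b * step[1])
                 + step[1] * (c * step[0] + d * step[1]))   # step . H . step
    step_size = math.hypot(step[0], step[1])
    return {
        "expected_improvement": slope + 0.5 * curvature,    # quadratic prediction
        "step_size": step_size,
        "normalised_dot_product": slope / (math.hypot(g0, g1) * step_size),
        "modulus": x[0] ** 2 + x[1] ** 2,                   # sum of squares, as in the table
    }


# Toy objective -(p0 - 1)^2 - 2 * (p1 + 0.5)^2 evaluated at the origin.
report = newton_diagnostics(x=(0.0, 0.0), gradient=(2.0, -2.0),
                            hessian=((-2.0, 0.0), (0.0, -4.0)))
for name, value in report.items():
    print(name, round(value, 4))
```

Here the Hessian is negative definite, so the expected improvement and the normalised dot product are both positive, matching the interpretation of positive values given in the table.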