Cold requires a number of input files. The formats and uses of these files are described below, along with the example files included in the package. The example files can be found in either the "Data" or the "Models" directory, depending on the use of the file.
Generally Newick formatted trees should work. The semicolon at the end is optional. Multiple trees can be given in a single file, each on a separate line; no more than one tree should be listed on a single line. It is not necessary to include blank lines between different trees, but it will improve readability.
Clades should be placed in brackets, with the length of the branch joining the clade to the rest of the tree listed after the clade, separated by a colon. The species names can be anything not containing the characters '(', ',', ')', ':', or newline. However, the space character ' ' should be avoided, because the sequence data format does not allow spaces in species names. An example of a valid tree is:
(ABC100:5,((DEF_37:0.004,GHI:1):0.3,JKL32:1.003,(MNO23:0.6,(MNO32:0.12,PQR:0.33):0.7):0.56):0.4)
See the file testtrees for more examples.
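As a rough illustration of the tree format described above, the following Python sketch pulls the species names out of one tree line. It is not part of cold; the function name leaf_names is hypothetical, and it assumes that a species name always directly follows a '(' or ',' character.

```python
import re

def leaf_names(tree: str) -> list:
    """Return the species names in a Newick-style tree string, in order."""
    tree = tree.strip().rstrip(";")  # the trailing semicolon is optional
    if tree.count("(") != tree.count(")"):
        raise ValueError("unbalanced brackets in tree")
    # A leaf name directly follows '(' or ',' and runs up to the next
    # character that is not allowed in a name (normally the ':' before
    # the branch length).
    return re.findall(r"(?<=[(,])[^(),:\n]+", tree)
```

Applied to the example tree above, this should list the seven species from ABC100 to PQR. A check like this can be useful for confirming that every species in a tree has a matching line in the data file.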
The lengths of the sequences can be given in one of two ways. They can be listed on the first line of the file, as the second number on that line. (The first number is usually the number of species; cold does not require this number, but assumes that it is there for compatibility with other packages which do require it.) Alternatively, the length of each sequence can be listed on the line with that sequence. It is possible to input sequences of different lengths, aligned at the start of the first sequence. However, cold assumes that all sequences are at least as long as the first sequence, so in this case the first species in the tree must be the one with the shortest sequence.
The data should be listed as nucleotide sequences. Each line should consist of the name of the species, followed by a space (or multiple spaces), then the length of the sequence (if necessary), then the sequence. [Note that this format does not allow spaces in species names. It is possible that such a feature may be incorporated into later versions.] Any character other than ACGT or a space is treated as an unknown nucleotide. (This means there is limited error checking: for instance, no error would be produced by the sequence ACTGTGTTTCSC - the program would simply interpret the S as an unknown nucleotide, instead of a typo, which is a more likely explanation.)
[Currently there is no way to specify partial information about a nucleotide. This feature may be added in later versions.]
Each sequence should be on a separate line. The program can search the tree for species names, so, as long as the sequences for all species in the tree are contained in the file, it will work - it doesn't matter if the file contains extra data. The program will give an error if it can't find the data for any of the species in the tree.
See the file testdata for examples.
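The line format described above can be sketched in Python as follows. This is only an illustration, not cold's own reader: the function name parse_sequence_line is hypothetical, unknown nucleotides are shown as '?' purely for display, and the sketch assumes that spaces inside the sequence are ignored and that only upper-case ACGT count as known.

```python
def parse_sequence_line(line, default_length=None):
    """Split one data line into (name, length, sequence).

    The length field is optional; when it is absent, default_length
    (taken from the first line of the file, if present) is used instead.
    """
    fields = line.split()
    name = fields[0]
    if len(fields) > 1 and fields[1].isdigit():
        length, parts = int(fields[1]), fields[2:]
    else:
        length, parts = default_length, fields[1:]
    # Anything other than A, C, G or T becomes an unknown nucleotide.
    sequence = "".join(c if c in "ACGT" else "?" for c in "".join(parts))
    return name, length, sequence
```

With the sequence from the text, a line such as "Human 12 ACTGTGTTTCSC" (the species name is made up) gives the sequence ACTGTGTTTC?C, which shows how a typo silently turns into an unknown nucleotide.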
Matrices should be listed in the usual format. Entries should be separated by the space character. Multiple spaces are OK. Lines are separated by newline characters. The size of the matrix is determined by the length of the first line, so it is important that the first line should be the right length. The lengths of the other lines don't matter (though it will look better if all lines are the right length).
Separate matrices in the file do not [at least, should not - I haven't done a lot of testing of obscure input formats] need to be separated by newlines, but doing so will improve readability of the file.
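A minimal Python sketch of this reading rule is given below. It is an illustration under stated assumptions rather than cold's actual code: the name read_matrices is hypothetical, the matrices are assumed to be square, and the optional first line of the form "number [mask] number" described below is not handled.

```python
def read_matrices(text):
    """Read square matrices from whitespace-separated numbers.

    The first line fixes the size n; after that, line breaks are ignored
    and every n*n numbers form one matrix.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = len(lines[0].split())  # the size comes from the first line only
    values = [float(tok) for ln in lines for tok in ln.split()]
    if len(values) % (n * n) != 0:
        raise ValueError("number of entries is not a multiple of n*n")
    matrices = []
    for start in range(0, len(values), n * n):
        block = values[start:start + n * n]
        matrices.append([block[r * n:(r + 1) * n] for r in range(n)])
    return matrices
```

Because only the first line is used to find the size, a second matrix written on a single line is read the same way as one written over several lines.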
The following files contain matrices:
ECMq.txt
masks
parametermatrices
standardmodelmatrices
For the first two, the first line can be used to indicate the number of matrices in the file (and also whether any matrix should be used as a mask). The format of the first line is
number [mask] number
where the numbers are the number of matrices before and after the mask matrix.
The initial parameters file contains a list of the Pi values, followed by initial values for all the parameters that are to be estimated.
See the file M0_example_initpars for an example.
The variable file consists of lines, each of which specifies a command line option. The lines have the following format:
name commandlineoption type value
The name doesn't really matter for most options. The commandlineoption is what you would type at the commandline to invoke this option. The type is the type of the argument (see the following list for the required values). The value is the argument for the option.
The following options can be invoked from a file. The type value should be as given in this list.
--model | string |
--modelfile | string |
--numpars | string |
--nomask | |
--parameterselection | string |
--usematrix | string |
--justbl | string |
--mask | string |
--maskfile | string |
--mixture | string |
--mixfile | string |
--mixstring | string |
--empirical | string |
--fixedprobs | string |
--setfixedpars | string |
--treenumber | string |
--initpars | string |
--variables | string |
--noparsimony | |
--path | path |
-q | |
-i | string |
-D | string |
--showeverysite | int |
--printsitelikes | |
--hessian | |
--state | string |
--recover | string |
--noautobackup | |
-P | string |
-b | string |
-T | string |
-v | |
--testderivs | |
For an example, see the file .variables. This file is automatically read before any other arguments. This is important because the path variable needs to be set before other options are processed.
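The four-field line format described above can be sketched in Python like this. The function names are hypothetical, and the sketch assumes that options without an argument simply leave out the type and value fields, which the text does not state.

```python
def parse_variable_line(line):
    """Split 'name commandlineoption type value' into its four fields."""
    # maxsplit=3 keeps any spaces inside the value intact.
    fields = line.split(None, 3)
    if len(fields) < 2:
        raise ValueError("expected at least a name and an option")
    name, option = fields[0], fields[1]
    arg_type = fields[2] if len(fields) > 2 else ""
    value = fields[3].strip() if len(fields) > 3 else ""
    return {"name": name, "option": option, "type": arg_type, "value": value}

def to_argv(entries):
    """Turn parsed entries into the equivalent command-line arguments."""
    argv = []
    for entry in entries:
        argv.append(entry["option"])
        if entry["value"]:
            argv.append(entry["value"])
    return argv
```

For instance, a made-up line such as "searchpath --path path Models:Data" would correspond to typing --path Models:Data on the command line.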
Each row represents a parameter for the components. The row starts with the number of the parameter (starting with 0, which is the rate). Then it has a list of classes, separated by colons if the parameters must have equal values, and by commas if they can have separate values. A dot between two values indicates a comma-separated list of all values between them that have not already been listed. A dash indicates a colon-separated list. If there is no first value, zero is assumed. If there is no second value, the number of components minus one is assumed (so that all later components are included). If the parameter number in a row is a dot, this means that all rows between the one above and below that do not already have patterns should follow this pattern (or the pattern of the row above if there is no pattern). If the parameter number is a |, it indicates that the groups in the rows above and below should be merged. Rows can be separated with a semicolon instead of an end of line. This is useful for inputting the mixture as a string, rather than writing a separate file for very simple mixtures.
[Some aspects of this don't work quite as described yet. This shouldn't be a problem for usual mixture files, only if the mixture file is specified in a strange way. If you list the rows and entries in increasing order, there should be no problems. Things like 5 7-5,3.6,0:2;|;.;2, however, are not guaranteed to work correctly.]
For examples, see the file mixture. See also the mixture strings in the file models.
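To make the simplest part of the mixture notation concrete, here is a Python sketch that handles only plain rows made of integers, colons and commas. It deliberately ignores the dot, dash and | shorthands, the function name parse_simple_mixture is hypothetical, and it assumes the parameter number is separated from the class list by a space.

```python
import re

def parse_simple_mixture(spec):
    """Map parameter number -> list of groups of component classes.

    Classes joined by ':' share one value (same group); ',' starts a new
    group. Only plain integer lists are handled here.
    """
    result = {}
    for row in re.split(r"[;\n]", spec):  # rows end at ';' or end of line
        row = row.strip()
        if not row:
            continue
        number, pattern = row.split(None, 1)
        groups = [[int(c) for c in group.split(":")]
                  for group in pattern.split(",")]
        result[int(number)] = groups
    return result
```

Under these assumptions, the string "0 0,1,2;1 0:1,2" says that parameter 0 (the rate) is free in all three components, while parameter 1 is shared by components 0 and 1 and separate for component 2.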
The modelfile consists of a number of models. Each model consists of the model name, then any parameters as a comma-separated list between parentheses, e.g. (a,b,c), then the model specification enclosed between braces { and }. [Currently, if there are no parameters, there needs to be a space between the model name and the opening brace. Hopefully this bug will be sorted out soon.]
The model specification consists of a collection of assignments of the form
FIELDNAME=VALUE
then a closing brace. The parameters are substituted directly into the VALUE part when they occur (separated by spaces). For example, if there is a parameter called a, whose value is testparameters, then a line of the form
PARAMETERS=myparamfile. a
would set the parameter file to
myparamfile. testparameters
Note the space produced in the middle of the filename. This is probably not what you want. The easiest way to avoid it is by separating the text from the parameter with a pair of quotation marks. For example, the line above would be rewritten:
PARAMETERS=myparamfile.""a
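The substitution behaviour described above can be imitated with the following Python sketch. It is a guess at the rules from the two examples, not cold's implementation: the function name substitute is hypothetical, and it assumes that a parameter is only replaced when it forms a whole unquoted word, that quotation marks are dropped from the result, and that a backslash makes the next character literal.

```python
def substitute(value, params):
    """Substitute whole-word parameter names in a VALUE string."""
    out = []    # finished pieces of the result
    word = []   # current unquoted word being collected
    in_quotes = False

    def flush():
        # An unquoted word that names a parameter is replaced by its value.
        text = "".join(word)
        out.append(params.get(text, text))
        word.clear()

    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value):
            flush()
            out.append(value[i + 1])   # escaped character, taken literally
            i += 1
        elif ch == '"':
            flush()
            in_quotes = not in_quotes  # the quotes themselves are dropped
        elif in_quotes:
            out.append(ch)             # quoted text is never substituted
        elif ch == " ":
            flush()
            out.append(ch)
        else:
            word.append(ch)
        i += 1
    flush()
    return "".join(out)
```

With the parameter a set to testparameters, the first form keeps the unwanted space, while the form with the empty pair of quotation marks joins the two parts into a single file name.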
Currently, arithmetic expressions are not available in the modelfile. I plan to add support for them in a later version. They will start with the # character. If you want to avoid a character having its normal meaning, you can either precede it with a backslash, or place the text in quotation marks.
Note that the \ character has a special meaning - it removes any special significance from the following character. For example, if your file name has quotation marks in it, you can use \" to get them. This means that if your filename has a \ character in it (which is only likely to be a problem for Windows users) you need to use \\.
The modelfile simply substitutes the values of the fieldname into commandline arguments. It can therefore do anything that the corresponding commandline options can do. (And conversely, it can only do what the commandline options can do.)
The following aspects of the model can be set in the model file (given with the equivalent commandline options, which can be looked up in cold.info if more information is required):
MIXTURE | --mixture | the number of sets of parameters in the mixture model. |
MIXFILE | --mixfile | a file that describes how the parameters vary (see above). |
MIXSTRING | --mixstring | like mixfile, but gives the mixture as a string, rather than loading it from a separate file. |
PARAMETERS | -p | the file from which parameters are read |
PARAMETERSELECTION | --parameterselection | the parameters from this file to be used in the model. |
INITIALMATRIX | -m | the matrix used to choose the initial parameters. |
NUMPARS | --numpars | the number of parameters to be read from the parameter file. |
MASKFILE | --maskfile | the file containing the mask to be used. |
MASK | --mask | the mask to be used (if more than one is available) |
USEMATRIX | --usematrix | use a constant matrix for the Q matrix. |
INITIALPARS | --initpars | the initial values for parameters. |
JUSTBRANCHLENGTHS | --justbl | only optimise the branch lengths. [Doesn't yet work with mixed models] |
EMPIRICAL | --empirical | sets empirical pi values. |
FIXEDPROBS | --fixedprobs | fixes mixing probabilities as the given values. |
FIXEDPARS | --setfixedpars | sets the values of fixed parameters |
See the file models for examples.
This file should not be edited by the user. The reason for understanding the file layout is for debugging in the event of an error. The statefile is generated either in the event of a fatal error, or as a regular backup for long runs. Its purpose is to allow the program to resume running from the point at which it was interrupted. This is done using the --recover option.
The statefile is however in a human-readable format, so it can be
read in an attempt to determine the cause of the fatal error. Its
format is as follows:
newx
The program has just calculated the next set of parameters and branch lengths.
hessian
The program was in the middle of calculating the hessian.
endhessian
The program had just finished calculating the derivatives, but had not yet sorted the values out into a matrix and applied the Newton-Raphson method.
LiLiLiLiL. This line is to allow the same code to be useable for other programs which might save different types of variables.
COLD searches in a number of places for files. There are three things that affect where it searches:
The current directory.
The main search directory. [This is determined by the ROOTSEARCHDIR variable, which is set at compile time (installation) by setting either the ROOTSEARCHDIR or mainsearchdir variable when running make and make install.] This is the directory into which the default files are installed, so this is where COLD will search for model files, parameter files, etc.
The search path. This is set in the .variables file (which can be configured at installation, or manually after installation). It can be changed by the commandline options, or another variable file. The default path includes directories Models and Data. These are necessary for finding several default files, so if you want to modify the path, you should probably include these directories.
The file searching algorithm searches in all of the following places, in this order:
cold of your home directory. It is the directory in which COLD installed its default files.
Actually, the last two are interleaved, so it goes through each directory in the search path, and searches in that subdirectory of the current directory, then that subdirectory of the main search directory.
COLD opens the first match for the file in question, so for example, by putting a file in the current directory, you can effectively override the file in a later search directory.
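The search order can be sketched in Python as below. The function names and example directories are hypothetical, and the sketch assumes that the current directory and the main search directory are themselves tried first; only the interleaving of the search-path subdirectories and the "first match wins" rule are taken directly from the text.

```python
from pathlib import Path

def candidate_paths(filename, current_dir, main_dir, search_path):
    """List the places to look for filename, in search order."""
    current_dir, main_dir = Path(current_dir), Path(main_dir)
    # Assumption: the two base directories are tried before any subdirectory.
    candidates = [current_dir / filename, main_dir / filename]
    for sub in search_path:
        # Interleaved: current directory first, then main search directory.
        candidates.append(current_dir / sub / filename)
        candidates.append(main_dir / sub / filename)
    return candidates

def find_first(filename, current_dir, main_dir, search_path):
    """Return the first candidate that exists, or None."""
    for path in candidate_paths(filename, current_dir, main_dir, search_path):
        if path.is_file():
            return path
    return None
```

Since the first existing candidate is used, a file placed in the current directory hides a file of the same name in any later search directory.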
It might be that you have all your data stored in a particular directory. In such a case, it would probably be a good idea to add that directory (and any subdirectories which you need to search) to the search path. Similarly, if you are studying your own models, you could put them all in one directory, and add that directory to the search path. [You should probably make these directories absolute, so that COLD can find the files wherever you run it from.]
There are three ways:
Edit the default .variables file.
Create a personal .variables file, and add the new search path to it.
Set the search path on the command line.
The first option will change the search path every time you run the program. The second will also change the search path every time you run the program, but the options in your personal .variables file can easily be all added, or all removed together, so if you have a set of options that you use for some program runs, but not others, then you could put these in a variable file and use the --variables commandline option whenever you want to use those options. The third option will set the search path just for the current run.
The search path just consists of a list of directories, separated by the colon character :. If the directory name contains a colon, you can precede the colon with a backslash \. This means that in order to have a backslash in your directory name, you need to precede it by a backslash as well (so it would be \\). [Note that if specifying the path variable on the command line, your shell may give special significance to certain characters, which may therefore need to be escaped (often by preceding them with a backslash, or by putting them in quotation marks). For example, on linux, if you want to specify a search path on the command line, and one of the directory names contains a colon, you would have to precede that colon with a backslash, but the shell would remove the backslash, so you would need to precede it by another backslash.] Directory names containing new line characters are not supported, and probably never will be.
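The escaping rules for the search path can be illustrated with a short Python sketch. The function name split_search_path is hypothetical; the sketch assumes that a backslash makes whatever character follows it literal, and that a lone backslash at the very end is kept as it is.

```python
def split_search_path(path):
    """Split a search path on unescaped colons, resolving backslash escapes."""
    dirs, current = [], []
    i = 0
    while i < len(path):
        ch = path[i]
        if ch == "\\" and i + 1 < len(path):
            current.append(path[i + 1])  # escaped character kept literally
            i += 2
        elif ch == ":":
            dirs.append("".join(current))  # an unescaped colon ends a directory
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    dirs.append("".join(current))
    return dirs
```

So a path written as Models:Data splits into two directories, while a colon or backslash preceded by a backslash stays inside the directory name.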
This is probably OK. There are a few issues.
When using the -f option, the search algorithm works slightly differently. It still searches directories in the order above. For each directory, it first searches for the file as typed. Then it searches for the file with each of the allowed file extensions. [The defaults are .tree, .tree.1.txt, .tree.txt, and .txt for tree files, and .seq, .nuc.txt, .dat, and .txt for sequence files. Currently, these can only be changed by modifying the fileExtensions.h file before compiling the program. I expect that later versions of COLD will allow it to be set more easily.]
It then compiles a list of all the possibilities, in the order listed above. Once it has the lists for data and tree files, it considers using the first possible tree file as a tree file, and checks whether this would allow it to find a data file. If it would, then it opens this tree file and opens the first remaining data file. Otherwise, it tries the next tree file, which should allow it to find a data file. If not, it will keep trying (but in this case, it should not manage to find a tree and data file).
[Note that the program does not make any checks on the contents of the files. It simply chooses the file names. If it tries to open a data file as a tree file or vice-versa, it will almost certainly abort with an error message. Hopefully, later versions of COLD will have more sophisticated methods for checking file type, and will be able to resolve more cases like this.]
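The way a tree file and a data file are paired up under the -f option can be sketched in Python as follows. This is one reading of the description above, not COLD's code: the function names are hypothetical, the exists argument stands in for a real file test such as os.path.isfile, and the example file names are made up.

```python
TREE_EXTENSIONS = [".tree", ".tree.1.txt", ".tree.txt", ".txt"]
DATA_EXTENSIONS = [".seq", ".nuc.txt", ".dat", ".txt"]

def existing_candidates(stem, directories, extensions, exists):
    """Names to try for stem, in search order, keeping those that exist."""
    found = []
    for directory in directories:
        # The name as typed comes first, then each allowed extension.
        for suffix in [""] + extensions:
            name = f"{directory}/{stem}{suffix}"
            if exists(name) and name not in found:
                found.append(name)
    return found

def choose_files(stem, directories, exists):
    """Pick a (tree file, data file) pair, or None if no pair is possible."""
    trees = existing_candidates(stem, directories, TREE_EXTENSIONS, exists)
    data = existing_candidates(stem, directories, DATA_EXTENSIONS, exists)
    for tree in trees:
        # A tree file is usable only if some other file is left for the data.
        remaining = [d for d in data if d != tree]
        if remaining:
            return tree, remaining[0]
    return None
```

Note that, as in the text, the choice is made from file names alone: if only primates.txt and primates.seq exist, the .txt file is taken as the tree file whatever it actually contains.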