Department of Genome Sciences and Department of Biology University of Washington
Contents of This Document
A Brief Description of the Programs
Copyright Notice for PHYLIP
The Documentation Files and How to Read Them
What The Programs Do
Running the Programs
A word about input files
Installing a recent version of Oracle Java
Running the programs on a Windows machine
Running the programs on Apple MacOS
Running the programs on a Unix or Linux system
Running the programs in MSDOS
Running the Drawgram and Drawtree Java interfaces
Running the Drawgram and Drawtree Java GUI interfaces in Windows
Running the programs in background or under control of a command file
An example (Unix, Linux or MacOS)
Subtleties (in Unix, Linux, or MacOS)
An example (Windows)
Testing for existence of files
Prototyping keyboard response files
Preparing Input Files
Input and output files
Where the files are
Data file format
The Menu
The Output File
The Tree File
The Options and How To Invoke Them
Common options in the menu
The U (User tree) option
The G (Global) option
The J (Jumble) option
The O (Outgroup) option
The T (Threshold) option
The M (Multiple data sets) option
The W (Weights) option
The option to write out the trees into a tree file
The (0) terminal type option
The Algorithm for Constructing Trees
Local rearrangements
Global rearrangements
Multiple jumbles
Saving multiple tied trees
Strategy for finding the best tree
A Warning on Interpreting Results
Relative Speed of Different Programs and Machines
Relative speed of the different programs
Speed with different numbers of species
Relative speed of different machines
General Comments on Adapting the Package to Different Computer Systems
Compiling the programs
Unix and Linux
On Windows systems
Compiling with CygWin and Gnu C++
The Mingw compiler
Compiling with Mingw under CygWin
Compiling with Microsoft Visual C++
Macintosh
Compiling with GCC on MacOS with our Makefile
Compiling with GCC on MacOS with X Windows
Parallel computers
Other computer systems
Compiling the Java interfaces
Frequently Asked Questions
How to make it do various things
Background information needed:
Questions about distribution and citation:
Questions about documentation
Additional Frequently Asked Questions, or: "Why didn't it occur to you to ...
(Fortunately) obsolete questions
New Features in This Version
Coming Attractions, Future Plans
Endorsements
From the pages of Cladistics
... in the pages of other journals:
... and in the comments made by users when they register:
References for the Documentation Files
Credits
Other Phylogeny Programs Available Elsewhere
PAUP*
MrBayes
MEGA
PAML
Phyml
RAxML
TNT
DAMBE
How You Can Help Me
In Case of Trouble
PHYLIP, the Phylogeny Inference Package, is a package of programs for inferring phylogenies (evolutionary trees). It has been distributed since 1980, and has over 30,000 registered users, making it the most widely distributed package of phylogeny programs. It is available free, from its web site:
PHYLIP is available as source code in C, and also as executables for some common computer systems. It can infer phylogenies by parsimony, compatibility, distance matrix methods, and likelihood. It can also compute consensus trees, compute distances between trees, draw trees, resample data sets by bootstrapping or jackknifing, edit trees, and compute distance matrices. It can handle data that are nucleotide sequences, protein sequences, gene frequencies, restriction sites, restriction fragments, distances, discrete characters, and continuous characters.
Copyright Notice for PHYLIP
The copyright notice given below is intended to cover all source code, all documentation, and all executable programs of the PHYLIP package. This is a "BSD 2-Clause License", which is open source. It is not a GNU license and does not insist that other materials distributed with PHYLIP be under a similar license.
© Copyright 1980-2019, Joseph Felsenstein
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
PHYLIP comes with an extensive set of documentation files. These include the main documentation file (this one), which you should read fairly completely. In addition there are files for groups of programs, including ones for the molecular sequence programs, the distance matrix programs, the gene frequency and continuous characters programs, the discrete characters programs, and the tree drawing programs. Finally, each program has its own documentation file. References for the documentation files are all gathered together in this main documentation file. A good strategy is to:
There is an excellent guide to using PHYLIP 3.6 also available. It was written by Jarno Tuimala of the Center for Scientific Computing in Espoo, Finland and was formerly available as a PDF at a site there. It is still available at the main PHYLIP web site.
Here is a short description of each of the programs. For more detailed discussion you should definitely read the documentation file for the individual program and the documentation file for the group of programs it is in. In this list the name of each program is a link which will take you to the documentation file for that program. Note that there is no program in the PHYLIP package called PHYLIP.
This section assumes that you have obtained PHYLIP as compiled executables (for Windows, MacOS, or Linux), or else you have obtained the source code and compiled it yourself (for Linux, Unix, MacOS, or Windows). For the programs Drawtree and Drawgram you will also need a recent version of Java installed on your computer to run them interactively. Note that for machines for which compiled executables are available, there will usually be no need for you to have a compiler or compile the programs yourself. This section describes how to run the programs. Later in this document we will discuss how to download and install PHYLIP (in case you are reading this without yet having done that). Normally you will only read your copy of the documentation files after downloading and installing PHYLIP.
After describing the input files, we will describe how to run most of the programs on Windows, MacOS, Linux, and Unix systems. After that, we will give special descriptions of the interactive Java interface for the tree-drawing programs Drawgram and Drawtree, including how to run these interfaces on Windows, MacOS, and Linux systems. These may require you to download and install on your computer the most recent version of Oracle Java, which is available from Oracle at no cost. We describe this below after discussing input files.
For all of these types of machines, it is important to have the input files for the programs (typically data files) prepared in advance. They can be prepared in any editor, but it is important that they be saved in Text Only ("flat ASCII") format, not in the format that word processors such as Microsoft Word want to write (in Microsoft Word, make sure that the data encoding used is "US ASCII", as using any of the Unicode codings can cause trouble). It is up to you to read the PHYLIP documentation files which describe the file formats that are needed. There is a partial description in the next section of this document. The input files can also be obtained by running a program that produces output files in PHYLIP format (some of the PHYLIP programs do, and so do programs by others, including sequence alignment programs such as ClustalW and sequence format conversion programs such as Readseq). There is no input file editor in any program in PHYLIP (you should not simply start running one of the programs and then expect to click a mouse somewhere to start creating a data file).
When they start running, the programs look first for input files with particular names (such as infile, treefile, intree, or fontfile). Exactly which file names they look for varies a bit from program to program, and you should read the documentation file for the particular program to find out. If you have files with those names the programs will use them and not ask you for the file name. If they do not find files of those names, the programs will say that they cannot find a file of that name, and ask you to type in the file name. For example, if Dnaml looks for the file infile and does not find one of that name, it prints the message:
dnaml: can't find input file "infile"
Please enter a new file name>
This does not mean that an error has occurred. All you need to do is to type in the name of the file.
The program looks for the input files in the same folder that the program is in (a folder is the same thing as a "directory"). In Windows, MacOS, Linux, or Unix, if you are asked for the file name you can type in the path to the file, as part of the name (thus, if the file is in the folder containing the current folder, you can type in a file name such as ../myfile.dna).
We have provided a MacOS version of the executables, in the form of "universal binaries" that should run either on PowerMac or Intel iMac systems (to ensure that they will run on both 32-bit and 64-bit MacOS systems, we have made sure that we compiled the executables as 32-bit executables). The programs can be run by clicking on their icons. They open a Terminal window, and the menu appears in it. Note that after the program is finished, the Terminal window remains open, and operations can be done in it. You will have to close the window yourself if you don't want it. The programs can be terminated by typing control-C (press down the "control" key in the lower-left corner of the keyboard and type "c").
It is also possible to run the executables from within a Terminal window by typing the program name, but this is a little harder. You will find the Terminal utility available in the Utilities folder in the Applications folder. You do need to have links made in the exe folder to the programs. This can be done the first time you need them, by entering the exe folder and opening a Terminal window, and then typing source linkmac. This creates the proper links, and thereafter you do not need to do this again. The programs can be run by typing their names in a Terminal window whose current working directory is exe. The programs work well this way, though the programs Drawgram and Drawtree may be slow to open and close plotting windows. The programs can be terminated by typing control-C or by closing the Terminal window by using the red button in the upper-left corner of the window.
One problem we have often encountered using MacOS is that it is possible for data files to have the wrong kind of characters at the ends of their lines. They may have carriage-return (ASCII/ISO 13 or control-M) characters at the ends of their lines when they should instead have the Unix newline character (ASCII/ISO 10 or control-J) there. This can happen with files transferred from other operating systems or files produced in some word processors. It results in segmentation-fault or memory errors. If you encounter these, check this possibility carefully.
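One way to repair such a file from a Terminal window is sketched below. It assumes a POSIX shell with the standard tr and wc utilities; the file names crfile.txt and fixed.txt and the tiny data set are made up for illustration.

```shell
# Make a small made-up file whose lines end in carriage returns (CR only),
# standing in for a data file saved with the wrong line endings.
printf '2 4\rAlpha     ACGT\rBeta      ACGA\r' > crfile.txt

# Translate every carriage return (octal 015) into a newline (octal 012).
tr '\015' '\012' < crfile.txt > fixed.txt

# Count the lines of the repaired file.
wc -l < fixed.txt
```

If a file instead has both characters at the end of each line (as text files written on Windows usually do), deleting the carriage returns with tr -d '\015' is the usual alternative, since translating them would leave an empty line after every data line.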
If you normally run MacOS applications using open -a, you may need to use the command lsregister -f -r /your/path/to/apps. You can find it with the command locate lsregister.
Type the name of the program in lower-case letters (such as dnaml). To terminate the program while it is running, type Control-C (which means to press down on the Ctrl key while typing the letter C).
On some systems you may need to type ./ before the program name, so that in the above case it would be ./dnaml. This is mostly needed if the user's PATH does not include their current directory, something which is often done as a security precaution.
With version 3.695 we have released an interactive Java interface for the tree-drawing programs, Drawgram and Drawtree. The reason is that the graphic interface language for MacOS has changed from the Carbon GUI to the Cocoa GUI, which would require a lot of rewriting of code. The alternative X11 (X Windows) GUI machinery on MacOS has been deprecated by Apple, and is showing its age on Linux systems.
Looking at available options, it seemed best to use Java to construct GUI interfaces, as this could be done in a reasonably compatible way across all three major platforms. There are disadvantages too -- to get full compatibility we need to ask users to download the most recent available Java from its maker, Oracle. That is not difficult but is a tiresome extra step. Oracle owns Java, and Java is not public-source, but there seems to be no sign that Oracle is going to make Java runtime machinery unavailable or charge for it.
Not all Java implementations will run PHYLIP's Drawgram and Drawtree GUIs. A reasonably compatible Java is distributed with MacOS, but no Java is distributed along with Windows. So you will need to download Java for Windows from the Java site at Oracle. The Java distributed with Linux distributions is now largely adequate. If it is a version that is not compatible enough with our Java GUI, you will need to download Oracle Java for Linux. We will give you instructions for that below.
The new GUI for Drawgram and Drawtree is a testbed for a general set of GUI interfaces for all our programs, which will be present in version 4.0 when that is distributed, which will be soon. The work you do to put a recent version of Oracle Java on your system will make using version 4.0 easier.
For people who use Drawgram or Drawtree in a "pipeline" run by shell scripts, there should be no interruption in your ability to do that. The current C code for those programs can either be called by the Java GUI or be run from a command line or a shellscript (for which see below). Almost all of the features of Drawgram and Drawtree are available from their character-mode menu when run that way, except for the interactive previewing of plots. We hope that the shell scripts will still work and will not need modification for this version of PHYLIP.
To run the Drawgram or Drawtree programs, you find the Drawgram.jar or Drawtree.jar files, which are Java Archive files in our folder of executable programs. You can run them by clicking on their icons. Detailed instructions for using the interfaces are given in the general documentation file for tree-drawing programs draw.html (which you should read), and the documentation files for the two programs drawgram.html and drawtree.html.
To run the interactive interfaces of the tree-drawing programs Drawgram and Drawtree, you need to have an appropriate version of Java installed on your computer. If you have Java installed, you should test whether it is an appropriate version by trying to run Drawgram or Drawtree (for this you will need an input tree file present as well). Is it likely that you have a compatible Java on your system?
Double-click on the icon for the program. A window should open with a menu in it. Further dialog with the program occurs by typing on the keyboard in response to what you see in the window. The programs can be terminated either by typing Control-C (which means to press down on the Ctrl key while typing the letter C), or by using the mouse to open the File menu in the upper-left corner of the program's window area and then select Quit. Other than this, most PHYLIP programs make no use of the mouse. The tree-drawing programs Drawtree and Drawgram do allow use of the mouse to select some options.
The programs open a window for their menus. This window may be too small for your tastes. They can be resized by tugging on the lower-right corner of the window. In addition, the font may be too small. On most versions of Windows, you can click on the small C:\ icon symbol at the upper-left corner of the window, and choose the Properties menu choice there. One of its tab options allows you to change the font and size of the print. I prefer large font sizes such as 16x12.
The programs can also be run in a Command Prompt window under Windows, in much
the same way as they were under the MSDOS operating system, which is what the
Command Prompt window emulates. A Command Prompt window can be opened by
choosing that option in the Windows System menu, which is in the All Programs menu.
Once in the Command Prompt window, make sure that you are in the
correct folder, using the cd and dir commands as needed to find the folder
where the executable PHYLIP programs are. Then type the name of the program
that you want to use (such as
dnaml.exe).
In running the programs, you may sometimes want to put them in background
so you can proceed with other work. On systems with a windowing environment
they can be put in their own window, and commands like the Unix and Linux
nice command used to make
them have lower priority so that they do not interfere with interactive
applications in other windows. This part of the discussion will
assume either a Windows system or a Unix or Linux system. I will
note when the commands work on one of these systems but not the other.
MacOS is actually Unix (surprise! surprise!) and you can
run PHYLIP programs in background on any MacOS system by simply following
the instructions for Unix, using a terminal window to do so if necessary.
(The Terminal utility can be found in the Utilities folder which is
inside the Applications folder).
If there is no windowing environment, or if you want to make PHYLIP programs
part of a larger workflow of some sort, on a Unix or Linux system you will
want to use an
ampersand (&) after the command file name when invoking it to put the
job in the background. You will have to put all the responses to the
interactive menu of the program into a file and tell the background job
to take its input from that file (we cover this below).
On Windows systems there is no & or nice command,
but input and output redirection and command files work fine in a Command Prompt
window. A command file can either be invoked by clicking on its icon or
by typing its name from a Command Prompt window. The file of commands must
have a name ending in .bat or .cmd, such as
foofile.bat. You can
run the batch file from a Command Prompt window by typing its name (such as
foofile) without the .bat.
Here are examples, for the different operating systems:
Here is an example for Unix, Linux, or using a Terminal window of
MacOS. Below you will find a separate example for Windows. If you
are using Windows you should read that section instead.
Suppose you want to run Dnaml in the background, taking its
input data from a file called sequences.dat, putting its interactive
output to a file called screenout, and using a file called input as
the place to store the interactive input. The file input need only
contain two lines:
sequences.dat
Y
which is what you would have typed to run the program interactively, in
response to the program's request for an input file name if it did not
find a file named infile, and in response to the menu.
To run the program in background, in Unix or Linux you would simply give the command:
dnaml < input > screenout &
This runs the program with input responses coming from input and
interactive output being put into file screenout. The usual output
file and tree file will also be created by this run (keep that in mind,
as if you run any other PHYLIP program from the same directory while
this one is running in background you may overwrite the output file from
one program with that from the other!).
If you wanted to give the program lower priority, so that it would
not interfere with other work, and you have Berkeley Unix type job control
facilities in your Unix or Linux (and you usually do), you can use the
nice command:
nice +10 dnaml < input > screenout &
which lowers the priority of the run. (The +10 form is the syntax of the C shell;
in sh or bash the equivalent is nice -n 10 dnaml < input > screenout &.)
To also time the run and put the
timing at the end of screenout, you can do this:
nice +10 ( time dnapars < input ) >& screenout &
which I will not attempt to explain.
On Unix or Linux systems
you may also want to explore putting the interactive output into the
null file /dev/null so as to not be bothered with it (but then you
cannot look at it to see why something went wrong). If you have problems
with creating output files that are too large, you may want to
explore carefully the turning off of options in the programs you run.
If you are doing several runs in one, as for example when you do a
bootstrap analysis using Seqboot, Dnapars (say), and Consense, you
can use an editor to create a "command file" with these commands:
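A minimal sketch of such a command file follows; it is an illustration under stated assumptions, not necessarily the exact file the author has in mind. It assumes that Seqboot writes its resampled data sets to outfile, that Dnapars reads infile and writes its trees to outtree, and that Consense reads intree (the default file names described in this document). Renaming one program's output to the next program's default input with mv is one way to chain the runs. The sketch only writes the command file and makes it executable; it does not run the programs.

```shell
# Write the command file (named foofile, as in the text) with a here-document.
cat > foofile <<'EOF'
#!/bin/sh
# Resample the data; Seqboot writes the replicates to outfile.
seqboot < input1 > screenout
# Make the replicates the input of the next program.
mv outfile infile
# Analyze each replicate; Dnapars writes its trees to outtree.
dnapars < input2 >> screenout
# Make the trees the input of Consense.
mv outtree intree
# Build the consensus tree. An outfile written by Dnapars already exists
# at this point, so input3 must include the response to the question
# about overwriting it.
consense < input3 >> screenout
EOF

# Give the command file execute permission.
chmod +x foofile
```

Note that mv outfile infile replaces any existing file named infile, so in this sketch the original data should be kept under another name (with input1 supplying that name to Seqboot).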
The command file might be named something like
foofile
It must be given
execute permission by using the command chmod +x foofile.
The job that foofile describes
can be run in background on Unix or Linux by giving the command
foofile &
Note that you must also have the interactive input
commands for Seqboot (including the random number seed), Dnapars, and
Consense in the separate files input1, input2, and input3.
If you have a Windows system, suppose you want to run Dnaml in the background, taking its
input data from a file called sequences.dat, putting its interactive
output to a file called screenout, and using a file called input
as
the place to store the interactive input. The file input need only
contain two lines:
sequences.dat
Y
which is what you would have typed to run the program interactively, in
response to the program's request for an input file name if it did not
find a file named infile, and in response to the menu.
To run the program in background, you can place the command
dnaml < input > screenout
in a file called something like foofile.bat. This "batch file" that
has commands and has its name end in .bat or .cmd
can be run simply by double-clicking on the file icon, which will usually
have a picture of a gear. A Command Prompt window (an MSDOS window) will then
open and the commands in the batch file will be run in it. Alternatively,
you can open a Command Prompt window yourself. It will be found in the
All Programs menu, as one of the options under Accessories. Make sure that
after it opens, you tell it to change its working directory to the one that
has the batch file in it.
The batch file with this command runs the program with input responses coming
from input and
interactive output being put into file screenout. The usual output
file and tree file will also be created by this run (keep that in mind
as, if you run any other PHYLIP program from the same directory while
this one is running in background, you may overwrite the output file from
one program with that from the other!).
Note also that when PHYLIP programs attempt to open a new output file (such as
outfile, outtree, or plotfile), if they see
a file of that name already in existence they will ask you if you want to
overwrite it, and offer alternatives including writing to another file,
appending information to that file, or quitting the program without writing to
the file. This means that in writing batch files it is important to know
whether there will be a prompt of this sort. You must know in advance
whether the file will exist. You may want to put in your batch file a
command that tests for the existence of a pre-existing output file and
if so, removes it, such as these commands in Unix, Linux, or MacOS:
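A minimal sketch of such a test, assuming a POSIX shell and the default output file name outfile:

```shell
# Remove a leftover output file so that the program will not stop to ask
# whether it should be overwritten.
if [ -e outfile ]
then
    rm outfile
fi
```

After these commands run, no file named outfile exists in the current folder, so the file of keyboard responses should not contain an answer to the overwrite question for that file.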
You might even want to put in a command that creates a
file of that name, so that you can be sure it is there! Either way,
you will then know whether to put into your file of keyboard responses the
proper response to the inquiry about overwriting that output file.
Offhand, I do not know how to test for the existence of files in Windows, but
I suspect that there is a way.
Making the proper files of keyboard responses for use with command
files is most easily done if you prototype the process by simply
running the program and keeping a careful record of the keyboard
responses that you need to give to get the program to run properly.
Then create a file in an editor and type those keyboard responses into
it. Thus if the program requires that you answer a question about
what to do with the output file with a keyboard response of R,
then wants you to type a menu selection of U (to have it use a User tree),
then wants you to answer Y to end the menu, and another R to tell it to
replace the output file, you would have the file of keyboard responses
be
R
U
Y
R
Since when you run the program interactively, each keyboard
response is ended by pressing the Enter key on your keyboard,
in the file of keyboard responses you must end each line after
typing the appropriate character.
Testing the keyboard responses with an interactive run will
be essential to having batch runs succeed.
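Once the responses are known, the file can also be written from the shell rather than in an editor. The sketch below assumes a POSIX shell and uses the four responses from the example above (R, U, Y, R); the file name input is the one used earlier in this document.

```shell
# Write the four keyboard responses, one per line. printf repeats the
# format for each argument, so every response ends with a newline,
# just as each interactive response ends with the Enter key.
printf '%s\n' R U Y R > input

# Show the file, then count its lines.
cat input
wc -l < input
```

The responses in such a file are only correct for the menu and file prompts seen in the interactive run they were recorded from.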
The input files for PHYLIP programs must be prepared separately - there is
no data editor within PHYLIP. You can use a word processor (or text
editor) to prepare them yourself, or you can use a program that produces
a PHYLIP-format output.
With the 3.695 release of PHYLIP we have included a directory called TestData which
contains the data used to generate the examples shown in the individual program HTML pages
and the output files they produce. Within this TestData directory there is a subdirectory
that has the name of the program (for example contrast) and within that there are the files
contrastinfile.txt, contrastintree.txt and contrastoutfile.txt.
If you look at the Contrast documentation you can see infile,
intree, and outfile mentioned in the example. The testdata/contrast/*.txt
files exactly match those in the example, so if you wish to experiment with Contrast
you have both a good infile and a good intree and the outfile expected
from the example, if you set your conditions to match the example.
Sequence alignment programs such as ClustalW
commonly have an option to produce PHYLIP files as output, and some
other phylogeny programs, such as MacClade and TreeView, are capable of
producing a PHYLIP-format file.
It is very important that the input files be in "Text Only" or "ASCII" format. This means that they contain only printable ASCII/ISO
characters, and not any unprintable characters. Many word processors such
as Microsoft Word save their files in a format that contains unprintable
characters, unless you tell them not to. In the Microsoft Word family of
word processors, the first time you edit a file, when you go to Save
in the File menu,
the program will instead do a Save As function, and ask you
in what format you want the file to be written.
For these word processors, the next time you edit the same file, using
Save, the program should use those settings without asking you. If
you have some trouble getting an input file that the programs can read, look
into whether you properly set these options. This can usually be done by
using the Save As choice in the File menu and making the
right settings.
Text editors such as the vi and emacs editors on
Unix and Linux (and available on MacOS too), or the pico
editor that comes with the pine
mailer program, produce their files in Text Only format and should not
cause any trouble.
The format of the input files is discussed below, and you should also
read the other PHYLIP documentation relevant to the particular type of
data that you are using, and the particular programs you want to run, as
there will be more details there.
For most of the PHYLIP programs, information comes from a series of
input files, and ends up in a series of output files:
The programs interact with the user by presenting a menu. Aside from the
user's choices from the menu, they read
all other input from files. These files have default names. The program
will try to find a file of that name - if it does not, it will ask the
user to supply the name of that file.
Input data such as DNA sequences
comes from a file whose default name is infile. If the user
supplies a tree, this is in a file whose default name is intree.
Values of weights for the characters are in weights, and the
tree plotting programs need some digitized fonts which are supplied in
fontfile (all these are default names).
For example, if Dnaml looks
for the file infile and does not find one of that name,
it prints the message:
dnaml: can't find input file "infile"
Please enter a new file name>
This simply means that it wants you to type in the name of the
input file.
When you run a program, you are in a current folder. If you run it by clicking
on an icon, the folder is the one that has the icon. If you run it by
typing the name of the program, the folder is the current folder when
you do that. The program will look for default files (such as infile
and intree) in that folder. When it writes files, their
default locations are also in your current folder.
The program need not actually be in the current folder. An icon can
sometimes be a link to a program located elsewhere. A program name
typed by you can contain a “path”, so that if you type
/usr/local/phylip/dnaml the program run will be located in
folder /usr/local/phylip. The operating system maintains a default
path for your account, which is a series of names of folders. When you
type the name of a program, the
operating system will look in that series of folders until it finds the
program, and then run it. But in all of these cases, the input and output
files will, by default, be in the current folder, even if the program
is located in some other folder.
Users can change where the input files are, or where the output files
go. If no file called infile is found in the current folder,
you will be asked to type the name of the file. In that case you
can type a filename with a path, such as foobar/mydata, and
in that case the program will look for file mydata in folder
foobar within the current folder. A similar process occurs
when the program cannot find file intree.
When the program starts to write an output file, such as outfile,
a similar series of events happens, with one important difference.
It is when a file outfile already exists in the current folder
that the user will be asked what to do. (In the case of input files,
it is when they do not exist that the user is asked what to do).
You will be given the opportunity to Replace the file, Append to the
file, write to a different File, or Quit. If you choose the response F
you will be asked for the name of the different file, and that is when
you can give a filename with a path, such as foobar/myoutput.out,
and the file will be written in that folder instead of the current folder.
Understanding which folder is the current folder, and whether there are
files named infile, intree, outfile, or
outtree there, is crucial to successfully running PHYLIP
programs, and making sure that they analyze the correct data set and
write their files in the right place.
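Before starting a program from a terminal, a quick check of the current folder can prevent surprises. The sketch below assumes a POSIX shell; check_defaults is a hypothetical helper name, not part of PHYLIP.

```shell
# Report which of the default file names exist in the current folder.
check_defaults() {
    for f in infile intree outfile outtree
    do
        if [ -e "$f" ]
        then
            echo "$f: present"
        else
            echo "$f: absent"
        fi
    done
}

# Show the current folder, then the status of each default file.
pwd
check_defaults
```

A "present" for outfile or outtree means the program will ask what to do with that file, and an "absent" for infile or intree means it will ask for a file name if it needs that input.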
I have tried to adhere to a rather stereotyped input and output
format. For the parsimony, compatibility and maximum likelihood programs,
excluding the distance matrix methods, the simplest version of the input
data file looks something like this:
The first line of the input file contains the number of species and the
number of characters (in this case sites). These are in free format, separated
by blanks. The information for each species follows, starting with a
ten-character species name (which can include blanks and some punctuation
marks), and continuing with the characters for that species. The name should
be on the same line as the first character of the data for that species.
(I will use the term "species" for the tips of the trees, recognizing
that in some cases these will actually be populations or individual gene
sequences).
The name should be ten characters in length, and either terminated
by a Tab character or filled out to the full
ten characters by blanks if it is shorter than 10 characters. Any printable ASCII/ISO character is
allowed in the name, except for parentheses ("(" and ")"), square
brackets ("[" and "]"), colon (":"), semicolon (";") and comma (",").
If you forget to extend the names to ten characters in length by blanks,
and also do not terminate them with a Tab character,
the program will get out of synchronization with the contents of the data
file, and an error message will result. A Tab character that terminates
a name will not be taken as part of the name that is read; the name will then
automatically be filled with blanks to a total length of 10 characters.
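If you generate input files with a script, the padding rule can be enforced mechanically. Here is a minimal Python sketch, assuming names are padded with blanks rather than Tab-terminated; pad_name is a hypothetical helper, not something supplied with the package.

```python
NAME_WIDTH = 10
# Characters that the text above says may not appear in a species name.
FORBIDDEN = set("()[]:;,")


def pad_name(name):
    """Return the name filled out with blanks to exactly ten characters."""
    if len(name) > NAME_WIDTH:
        raise ValueError(f"name {name!r} is longer than {NAME_WIDTH} characters")
    bad = FORBIDDEN.intersection(name)
    if bad:
        raise ValueError(f"name {name!r} contains forbidden characters {sorted(bad)}")
    return name.ljust(NAME_WIDTH)
```

For example, pad_name("Mouse") gives "Mouse" followed by five blanks, so the data can start immediately after it on the same line.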
In the
discrete-character programs, DNA sequence programs and protein sequence
programs the characters are each a
single letter or digit, sometimes separated by blanks. In
the continuous-characters programs they are real numbers with decimal points,
separated by blanks:
Latimeria 2.03 3.457 100.2 0.0 -3.7
The conventions about continuing the data beyond one line per species are
different between the molecular sequence programs and the others. The
molecular sequence programs can take the data in "aligned" or "interleaved"
format, in which we first have some lines giving the first part of each of the
sequences, then some
lines giving the next part of each, and so on. Thus the sequences might
look like this:
Note that in these sequences we have a blank every
ten sites to make them easier to read: any such blanks are allowed. The blank
line which separates the two groups of lines (the ones
containing sites 1-20 and ones containing sites 21-39) may or may not
be present. It is important that the number of sites in each
group be the same for all species (i.e., it will not be possible to run
the programs successfully if the first species line contains 20 bases, but
the first line for the second species contains 21 bases).
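The layout can be produced by a short script. The Python sketch below is only an illustration: format_interleaved and its parameters are hypothetical, it writes the species names in the first group of lines only, and it puts an (optional) blank line between groups.

```python
def format_interleaved(sequences, sites_per_line=20, group=10):
    """Format a {name: sequence} dict as interleaved text.

    Every species gets the same number of sites in each group of lines,
    with a blank after every `group` sites for readability.
    """
    names = list(sequences)
    lengths = {len(seq) for seq in sequences.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same number of sites")
    if any(len(name) > 10 for name in names):
        raise ValueError("species names may not exceed ten characters")
    nsites = lengths.pop()
    lines = [f"{len(names)} {nsites}"]          # species count, then site count
    for start in range(0, nsites, sites_per_line):
        if start > 0:
            lines.append("")                     # optional separator line
        for name in names:
            chunk = sequences[name][start:start + sites_per_line]
            spaced = " ".join(chunk[i:i + group]
                              for i in range(0, len(chunk), group))
            prefix = name.ljust(10) if start == 0 else ""
            lines.append(prefix + spaced)
    return "\n".join(lines) + "\n"
```

Because the same slice boundaries are used for every species, the requirement that each group hold the same number of sites for all species is met automatically.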
Alternatively, an option can be selected in the menu to take the data in
"sequential" format, with all of the data for the first species,
then all of the characters for the next species, and so on. This is also
the way that the discrete characters programs and the gene frequencies
and quantitative characters programs want to read the data. They do not
allow the interleaved format.
In the sequential format, the character data can run on to a new line at any
time (except in the middle of a species name or, in the case of the continuous
character and distance matrix programs, in the middle of a real number).
Thus it is legal to have:
Archaeopt 001100
or even:
Archaeopt
though note that the full ten characters of the species name must
then be present: in the above case there must be a blank after the "t". In all
cases it is possible to put internal blanks between any of the character
values, so that
Archaeopt 0011001101 0111011100
is allowed.
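To make these rules concrete, here is a minimal Python sketch of a reader for sequential-format data with single-character states. It is not the code the programs use. It assumes that names are blank-padded to ten characters (Tab-terminated names are not handled) and that no name begins with a blank; read_sequential is a hypothetical name.

```python
def read_sequential(text):
    """Parse sequential-format text into a list of (name, characters) pairs."""
    header, _, body = text.partition("\n")
    nspecies, nchars = (int(tok) for tok in header.split()[:2])
    records = []
    pos = 0
    for _ in range(nspecies):
        # Skip the line break (and any stray blanks) left after the
        # previous species; assumes names do not start with a blank.
        while pos < len(body) and body[pos].isspace():
            pos += 1
        name = body[pos:pos + 10]                # fixed ten-character field
        pos += 10
        chars = []
        while len(chars) < nchars:
            if pos >= len(body):
                raise ValueError(f"ran out of data while reading {name.strip()!r}")
            c = body[pos]
            pos += 1
            if not c.isspace():                  # blanks and new lines are ignored
                chars.append(c)
        records.append((name.rstrip(), "".join(chars)))
    return records
```

The reader counts characters rather than lines, which is why the data may be broken across lines or spaced out with internal blanks without changing the result.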
Note that you can convert molecular sequence data between the interleaved
and the sequential data formats by using the Rewrite option of the J
menu item in Seqboot.
If you make an error in the format of the input file, the programs can
sometimes detect that
they have been fed an illegal character or illegal numerical value and issue
an error message such as BAD CHARACTER STATE:, often printing out the
bad value, and sometimes the number of the species and character in which it
occurred. The program will then stop shortly after. One of the things which
can lead to a bad value is the omission of something earlier in the file, or
the insertion of something superfluous, which cause the reading of the file to
get out of synchronization. The program then starts reading things it
didn't expect, and concludes that they are in error. So if you see this error
message, you may also want
to look for the earlier problem that may have led to the program becoming
confused about what it is reading.
Some options are described below, but you should also read the documentation
for the groups of the programs and for the individual programs.
The menu is straightforward. It typically looks like this (this one is for
Dnapars):
If you want to accept the default settings (they are shown in the above case)
you can simply type Y followed by pressing on the Enter key.
If you want to change any of the options, you should type the letter
shown to the left of its entry in the menu. For example, to set a threshold
type T. Lower-case letters will also work. For many of the options
the program will ask for supplementary information, such as the value of
the threshold.
Note the Terminal type entry, which you will find on all menus. It
allows you to specify which type of terminal your screen is. The options
are an IBM PC screen, an ANSI standard terminal, or none.
Choosing zero (0) toggles
among these three options in cyclical order, changing each time the 0
option is chosen. If one of them is right for your terminal the screen will be
cleared before the menu is displayed. If none works, the none option
should probably be chosen. The programs should start with a terminal option
appropriate for your computer, but if they do not, you can change the
terminal type manually. This is particularly important in program Retree
where a tree is displayed on the screen - if the terminal type is set to the
wrong value, the tree can look very strange.
The other numbered options control which information the program will
display on your screen or on the output files. The option to Print
indications of progress of run will show information such as the names of
the species as they are successively added to the tree, and the
progress of rearrangements. You will usually want to see these as
reassurance that the program is running and to help you estimate how long
it will take. But if you are running the program "in background" as can be
done on multitasking and multiuser systems, and do not have the
program running in its own window, you may want to turn this option off so
that it does not disturb your use of the computer while the program is
running. Note also menu option 3, "Print out tree". This can be useful
when you are running many data sets, and will be using the resulting trees
from the output tree file. It may be helpful to turn off the printing out
of the trees in that case, particularly if those files would be too big.
Most of the programs write their output onto a file called (usually) outfile, and a representation of the trees found onto a file called
outtree.
The exact contents of the output file vary from program to program and also
depend on which menu options you have selected. For many programs, if you
select all possible output information, the output will consist of
(1) the name of the program and its
version number, (2) some of the input information printed out, and (3) a series of
phylogenies, some with associated information indicating how much change
there was in each character or on each part of the tree. A typical rooted tree
looks like this:
The interpretation of the tree is fairly straightforward: it "grows"
from left to right. The numbers at the forks are arbitrary and are used (if
present) merely to identify the forks. For many of the programs the tree
produced is unrooted. Rooted and unrooted trees are printed in nearly the
same form, but the unrooted ones are accompanied by the
warning message:
remember: this is an unrooted tree!
to indicate that this is an unrooted tree and to warn against
taking the position of its root too seriously. (Mathematicians still call
an unrooted tree a tree, though some systematists unfortunately use the term
"network" for an unrooted tree. This conflicts with standard mathematical
usage, which reserves the name "network" for a completely different kind of
graph). The root of this tree could be anywhere, say on the line leading
immediately to Mouse. As an exercise,
see if you can tell whether the following tree is or is not a different
one from the above:
(it is not different). It is important also to realize that the
lengths of the segments of the printed tree may not be significant: some
may actually represent branches of zero length, in the sense that there is no
evidence that
those branches are nonzero in length. Some of the diagrams of trees attempt
to print branches approximately proportional to estimated
branch lengths, while in others the lengths are purely conventional and
are presented just to make the topology visible. You will have to look closely
at the documentation that accompanies each program to see what it presents
and what is known about the lengths of the branches on the tree. The above
tree attempts to represent branch lengths approximately in the diagram. But
even in those cases, some of the smaller branches are likely to be
artificially lengthened to make the tree topology clearer. Here is what
a tree from Dnapars looks like, when no attempt is made to make the
lengths of branches in the diagram proportional to estimated branch
lengths:
When a tree has branch lengths, it will be accompanied by a table showing
for each branch the numbers (or names) of the nodes at each end of the
branch, and the length of that branch. For the first tree shown above,
the corresponding table is:
Ignoring the asterisks and the approximate confidence limits, which will be
described in the documentation file for Dnaml, we can see that the table
gives a more precise idea of what the lengths of all the branches are.
Similar tables exist in distance matrix and likelihood programs, as well
as in the parsimony programs Dnapars and Pars.
Some of the parsimony programs in the package can print out a table
of the number of steps that different characters (or sites) require on
the tree. This table may not be obvious at first. A typical example looks like
this:
The numbers across the top and down the side indicate which site
is being referred to. Thus site 23 is column "3" of row "20"
and has 1 step in this case.
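The lookup rule is simple arithmetic, sketched here in Python. The function names and the dictionary layout of the table are hypothetical, chosen only to mirror the row and column labels described above.

```python
def table_position(site):
    """Return the (row label, column label) at which a site appears."""
    if site < 1:
        raise ValueError("sites are numbered from 1")
    tens, units = divmod(site, 10)
    return tens * 10, units


def steps_for_site(table, site):
    """Look up a site in a table stored as {row label: [entries for columns 0-9]}."""
    row, column = table_position(site)
    return table[row][column]
```

For site 23, table_position returns row 20 and column 3, matching the example in the text.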
There are other kinds of information that can appear in the
output file. They vary from program to program, and we leave their
description to the documentation files for the groups of
programs and for the individual programs.
In output from most programs,
a representation of the tree is also written into the tree file
outtree. The tree is specified by nested pairs
of parentheses, enclosing
names and separated by commas. We will describe how this works
below. If there are any blanks in the names,
these must be replaced by the underscore character "_". Trailing blanks
in the name may be omitted. The pattern of the parentheses indicates
the pattern of the tree by having each pair of parentheses enclose all
the members of a monophyletic group. The tree file could look like this:
In this tree the first fork separates the lineage leading to
Mouse and Bovine from the lineage leading to the rest. Within the
latter group there is a fork separating Gibbon from the rest, and so on.
The entire tree is enclosed in an outermost pair of parentheses. The tree ends
with a semicolon. In some programs such as Dnaml, Fitch, and Contml,
the tree will be unrooted. An unrooted tree should have its
bottommost fork have a
three-way split, with three groups separated by two commas:
Here the three groups at the bottom node are A, (B,C,D), and
(E,F). The single three-way split corresponds to one of the interior
nodes of the unrooted tree (it can be any interior node of the tree). The
remaining forks are encountered as you move out from that first node.
Some of the newer programs are able to tolerate these other forks being
multifurcations (multi-way splits).
You should check the documentation files
for the particular programs you are using to see which of these forms
the user tree is expected to be in. Note that many of the programs
that actually estimate an unrooted tree (such as Dnapars) produce trees in the
treefile in rooted form! This is done for reasons of arbitrary internal bookkeeping. The placement of the root is arbitrary. We are working toward
having all programs be able to read all trees, whether rooted or unrooted,
multifurcating or bifurcating, and having them do the right thing with
them. But this is a long-term goal and it is not yet achieved.
For programs that infer branch lengths, these are given in the trees in the
tree file as real numbers following a colon, and placed immediately
after the group descended from that branch. Here is a typical tree
with branch lengths:
Note that the tree may continue to a new line at any time except in the
middle of a name or the middle of a branch length, although in trees
written to the tree file this will only be done after a comma.
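The nested-parenthesis notation is easy to read with a small recursive parser. The Python sketch below handles only the subset described here (names, commas, parentheses, optional branch lengths, and the final semicolon); it is an illustration rather than the parser used by the programs, and parse_newick is a hypothetical name. Each node is returned as a (name, branch length, children) tuple.

```python
def parse_newick(text):
    """Parse a tree string into nested (name, length, children) tuples."""
    s = "".join(text.split())        # drop line breaks and blanks between tokens
    pos = 0

    def peek():
        return s[pos] if pos < len(s) else ""

    def token():
        """Read characters up to the next punctuation mark."""
        nonlocal pos
        start = pos
        while peek() and peek() not in "(),:;":
            pos += 1
        return s[start:pos]

    def node():
        nonlocal pos
        children = []
        if peek() == "(":
            pos += 1                             # consume "("
            children.append(node())
            while peek() == ",":
                pos += 1                         # consume ","
                children.append(node())
            if peek() != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1                             # consume ")"
        name = token().replace("_", " ")         # underscores stand for blanks
        length = None
        if peek() == ":":
            pos += 1                             # consume ":"
            length = float(token())
        return (name, length, children)

    tree = node()
    if peek() != ";":
        raise ValueError("tree must end with a semicolon")
    return tree
```

For instance, a string such as (A,(B,C,D),(E,F)); (written here to match the three groups described above) parses to a root with three children, which is how a program can recognize the three-way split of an unrooted tree.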
These representations of trees are a subset of the standard adopted
on 24 June 1986 at the annual meetings of the Society for the Study of
Evolution by an informal committee (its final session in Newick's
lobster restaurant in Dover, New Hampshire - hence its name, the Newick standard).
It consisted of Wayne Maddison (author of MacClade), David Swofford (PAUP),
F. James Rohlf (NTSYS-PC), Chris Meacham (COMPROB and the original
PHYLIP tree drawing programs), James Archie,
William H.E. Day, and me. This standard is a generalization of
PHYLIP's format, invented by Chris Meacham when he visited
my lab in 1985-6. That itself was based on a well-known representation of trees in
terms of parenthesis patterns which is due to the famous mathematician
Arthur Cayley, and which has been around for over a century. The
standard is now employed by most phylogeny computer programs but unfortunately
has yet to be described in a formal published description. Other
descriptions by me and by Gary Olsen can be accessed using the Web at:
Most of the programs allow various options that alter the amount of
information the program is provided or what is done with the
information. Options are selected in the menu.
A number of the options from the menu, the U (User tree), G (Global),
J (Jumble), O (Outgroup), W (Weights),
T (Threshold), M (Multiple data sets), and the tree output options, are used
so widely that it is best to discuss them in this document.
The U (User tree) option. This option toggles between the default
setting, which allows the program to search for the best tree, and the
User tree setting, which reads a tree or trees ("user trees") from the input
tree file and evaluates them. The input tree file's
default name is intree.
An initial line with the number of trees was formerly
required, but this will now not work.
Some programs require rooted
trees, some unrooted
trees, and some can handle multifurcating trees. You should read
the documentation for the particular program to find out which it
requires. Program Retree can be used to convert trees among
these forms (on saving a tree from Retree, you are asked whether
you want it to be rooted or unrooted).
In using the user tree option, check the pattern of parentheses
carefully. The programs do not always detect
whether the tree makes sense, and if it does not there will probably be
a crash (hopefully, but not inevitably, with an error message indicating
the nature of the problem). Trees written out by programs are
typically in the proper form.
The G (Global) option. In the programs which construct trees (except for
Neighbor, the "...penny" programs and Clique, and of course
the "...move" programs where you construct the trees yourself),
after all species have been added to the tree a rearrangements phase
ensues. In most of these programs the rearrangements are automatically
global, which in this case means that subtrees will be removed from the tree
and put back on in all possible ways so as to have a better chance of
finding a better tree. Since this can be time consuming (it roughly
triples the time taken for a run) it is left as an option in some of the
programs, specifically Contml, Fitch, Dnaml and Proml. In these programs
the G menu option toggles between the default of local rearrangement and
global rearrangement. The rearrangements are explained more below.
The J (Jumble) option. In most of the tree construction programs
(except for the "...penny" programs and Clique), the exact
details of the search of different trees depend on the order of input of
species. In these programs the J option enables you to tell the program to use
a random number
generator to choose the input order of species. This option is toggled on
and off by
selecting option J in the menu. The program will then prompt you for
a "seed" for the random number generator. The seed should be an integer
between 1 and 2^32-3 (which is 4,294,967,293), and it should be
of form 4n+1,
which means that it must give a remainder of 1 when divided by 4. This can be
judged by looking at the last two digits of the number (for example, in the
upper limit given above, the last two digits are 93, which is of form 4n+1). Each different seed
leads to a different sequence of addition of species. By simply changing the
random number seed and re-running the programs one can look for other, and
better trees. If the seed entered is not odd, the program will not proceed,
but will prompt for another seed.
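The recommendation about seeds can be checked before a run with a few lines of Python. This is a sketch of the rule as stated above, not of the check the programs themselves perform (which, as just noted, only rejects seeds that are not odd); the function names are hypothetical.

```python
MAX_SEED = 2**32 - 3     # 4,294,967,293, the upper limit given in the text


def is_valid_seed(seed):
    """True if seed is an integer in range that leaves remainder 1 modulo 4."""
    return isinstance(seed, int) and 1 <= seed <= MAX_SEED and seed % 4 == 1


def next_valid_seed(seed):
    """Return the smallest seed of form 4n+1 that is not less than seed."""
    candidate = max(seed, 1)
    candidate += (1 - candidate) % 4     # step up to the next value of form 4n+1
    if candidate > MAX_SEED:
        raise ValueError("no valid seed at or above this value")
    return candidate
```

The shortcut of looking at the last two digits works because 100 is divisible by 4, so seed % 4 equals (seed % 100) % 4.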
The Jumble option also causes the program to ask you how many times you
want to restart the process. If you answer 10, the program will
try ten different orders of species in constructing the trees, and the
results printed out will reflect this entire search process (that is,
the best trees found among all 10 runs will be printed out, not the
best trees from each individual run).
Some people have asked what are good values of the random number seed.
The random number seed is used to start a process of choosing "random"
(actually pseudorandom) numbers, which behave as if they were
unpredictably randomly chosen between 0 and 2^32-1 (which is
4,294,967,295). You could put in the number 133 and find that the
next random number was 221,381,825. As they are effectively
unpredictable, there is no such thing as a choice that is better than
any other, provided that the numbers are of the form 4n+1. However
if you re-use a random number seed, the sequence of random numbers
that result will be the same as before, resulting in exactly the same
series of choices, which may not be what you want.
The O (Outgroup) option. This specifies which species is to
have the root of the tree be on the line leading to it. For example, if the
outgroup is a species "Mouse" then the root of the tree will be placed in the
middle of the branch which is connected to this species, with Mouse branching
off on one side of the root and the lineage leading to the rest of the tree
on the other. This option is toggled on and off by choosing O in the
menu (the alphabetic character O, not the digit 0). When it
is on, the program will then prompt for the
number of the outgroup (the species being taken in the numerical order that
they occur in the input file). Responding by typing 6 and then an
Enter character indicates that the sixth species in the data
(the 6th in the first set of data if there are multiple data sets)
is taken as the outgroup. Outgroup-rooting will not be attempted if the
data have already established a root for the tree from some other
consideration, and may not be if it is a user-defined tree,
despite your invoking the option. Thus programs such as Dollop that
produce only rooted trees do not allow the Outgroup option. It is also
not available in Kitsch, Dnamlk, Promlk or Clique. When it is used, the tree as
printed out is still listed as being an
unrooted tree, though the outgroup is connected to the bottommost node
so that it is easy to visually convert the tree into rooted form.
The T (Threshold) option. This sets a threshold for the
parsimony programs such that if the
number of steps counted in a character is higher than the threshold, it
will be taken to be the threshold value rather than the actual number of
steps. The default is a threshold so high that it will never be
surpassed (in which case the steps will simply be counted). The T
menu option toggles on and off asking the user to
supply a threshold. The use of thresholds to obtain methods intermediate
between parsimony and compatibility methods is described in my 1981b paper.
When the T option is in force, the program
will prompt for the numerical threshold value. This will be a positive
real number greater than 1. In programs Mix, Move, Penny, Protpars,
Dnapars, Dnamove, and Dnapenny, do not use threshold values less
than or equal to 1.0, as they have no meaning and lead to a tree which
depends only on considerations such as the input order of species and not at
all on the character state data! In programs Dollop, Dolmove, and Dolpenny
the threshold should never be 0.0 or less, for the same
reason. The T option is an
important and underutilized one: it is, for example, the only way in this
package (except for program Dnacomp) to do a compatibility analysis when there
are missing data. It is a method of de-weighting characters that evolve
rapidly. I wish more people were aware of its properties.
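The effect of the threshold on the quantity being minimized can be sketched in a few lines of Python. This shows only the capping rule described above, applied to a list of per-character step counts that is assumed to have been computed already; thresholded_steps is a hypothetical name.

```python
def thresholded_steps(steps_per_character, threshold=None):
    """Sum the steps over characters, counting each as at most the threshold.

    With threshold=None the steps are simply added up, which corresponds
    to the default of a threshold too high ever to be surpassed.
    """
    if threshold is None:
        return sum(steps_per_character)
    return sum(min(steps, threshold) for steps in steps_per_character)
```

With a threshold of 1.0, every character that needs at least one step contributes exactly 1 whatever the tree, which is one way to see why such values carry no information about the tree in the parsimony programs listed above.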
The M (Multiple data sets) option. In menu programs there is an
M menu
option which allows one to toggle on the multiple data sets option. The
program will ask you how many data sets it should expect. The data sets
have the same format as the first data set. Here is a (very small) input file
with two five-species data sets:
The main use of this option will be to allow all of the methods in these
programs to be bootstrapped. Using the program Seqboot one can take any
DNA, protein, restriction sites, gene frequency or binary character data set and
make multiple data sets by bootstrapping. Trees can be produced for all of
these using the M option. They will be written on the tree output file if
that option is left in force. Then the program Consense can be used with
that tree file as its input file. The result is a majority rule consensus
tree which can be used to make confidence intervals. The present version
of the package allows, with the use of Seqboot and Consense and the M option,
bootstrapping of many of the methods in the package.
Programs Dnaml, Dnapars and Pars can also take multiple weights
instead of multiple data sets. They can then do bootstrapping by
reading in one data set, together with a file of weights that show how
the characters (or sites) are reweighted in each bootstrap sample. Thus a
site that is omitted in a bootstrap sample has effectively been given
weight 0, while a site that has been duplicated has effectively been
given weight 2. Seqboot has a menu selection to produce the file of
weights information automatically, instead of producing a file of
multiple data sets. It can be renamed and used as the input weights file.
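The equivalence between a bootstrap sample and a set of weights can be illustrated with a short Python sketch. It shows the idea only; it is not Seqboot's algorithm or its file layout, and bootstrap_weights is a hypothetical name.

```python
import random
from collections import Counter


def bootstrap_weights(n_sites, rng):
    """Draw n_sites sites with replacement; return how often each was drawn.

    A site never drawn gets weight 0, a site drawn twice gets weight 2,
    and the weights always add up to n_sites.
    """
    counts = Counter(rng.randrange(n_sites) for _ in range(n_sites))
    return [counts[i] for i in range(n_sites)]
```

A call such as bootstrap_weights(50, random.Random(1)) returns one list of 50 weights; repeating the call with the same generator object yields further, different bootstrap replicates.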
The W (Weights) option. This signals the program that, in
addition to the data set, you want to read in a series of weights that
tell how many times each character is to be counted. If the weight
for a character is zero (0) then that character is in effect to
be omitted when the tree is evaluated. If it is (1) the
character is to be counted once. Some programs allow weights greater than
1 as well. These have the effect that the character is counted as
if it were present that many times, so that a weight of 4 means that the
character is counted 4 times.
The values 0-9 give weights 0 through 9, and the
values A-Z give weights 10 through 35. By use of the weights we can
give overwhelming weight to some characters, and drop others from the
analysis. In the molecular sequence programs only two values of the
weights, 0 or 1 are allowed.
The weights are used to analyze subsets of the characters, and also can be
used for resampling of the data as in bootstrap and jackknife resampling.
For those programs that allow weights to be greater than 1, they can also
be used to emphasize information from some characters more strongly than
others. Of course, you must have some rationale for doing this.
The weights are provided as a sequence of digits. Thus they might be
10011111100010100011110001100
The weights are to be provided in an input file
whose default name is weights. The weights in it are
a simple string of digits. Blanks in the weightfile are skipped over and
ignored, and the weights can continue to a new line. In programs such as
Seqboot
that can also output a file of weights, the input weights have a default
file name of inweights, and the output file name has a default
file name of outweights.
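The coding of weights as single characters can be decoded as in the following Python sketch, which skips blanks and line breaks as described above. The function name parse_weights is hypothetical, and rejecting lower-case letters is an assumption of the sketch, since the text mentions only 0-9 and A-Z.

```python
import string

# "0"-"9" stand for weights 0-9, and "A"-"Z" for weights 10-35.
WEIGHT_SYMBOLS = string.digits + string.ascii_uppercase


def parse_weights(text):
    """Convert the contents of a weights file into a list of integers."""
    weights = []
    for ch in text:
        if ch.isspace():             # blanks and new lines are skipped
            continue
        value = WEIGHT_SYMBOLS.find(ch)
        if value < 0:
            raise ValueError(f"illegal weight character {ch!r}")
        weights.append(value)
    return weights
```

Applied to the string of digits shown above, this gives one 0 or 1 per character, 29 values in all.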
Weights can be used to analyze different subsets of characters (by weighting
the rest as zero). Alternatively, in the discrete characters programs
they can be used to force a certain
group to appear on the phylogeny (in effect confining consideration to only
phylogenies containing that group). This is done by adding an imaginary
character that has 1's for the members of the group, and 0's
for all the
other species. That imaginary character is then given the highest weight
possible: the result will be that any phylogeny that does not contain that
group will be penalized by such a heavy amount that it will not (except in
the most unusual circumstances) be considered. Of course, the new character
brings extra steps to the tree, but the number of these can be calculated
in advance and subtracted out of the total when reporting the results. This
use of weights is an important one, and one sadly ignored
by many users who could profit from it. In the case of molecular sequences
we cannot use weights this way, so that to force a given group to appear we
have to add a large extra segment of sites to the molecule, with (say) A's
for that group and C's for every other species.
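For discrete characters, the imaginary-character trick can be automated. The Python sketch below assumes the data are held as a dictionary from species name to a string of states, and that the program you use accepts the weight symbol Z (35, the highest value mentioned above); the function name and its arguments are hypothetical.

```python
def add_group_constraint(characters, group, weights, max_weight="Z"):
    """Append a 0/1 character marking the group, with the heaviest weight.

    characters: {species name: string of character states}
    group:      set of species names that should form a group
    weights:    string with one weight symbol per existing character
    """
    unknown = set(group) - set(characters)
    if unknown:
        raise ValueError(f"species not in the data set: {sorted(unknown)}")
    augmented = {name: states + ("1" if name in group else "0")
                 for name, states in characters.items()}
    return augmented, weights + max_weight
```

Remember that the added character contributes steps of its own, weighted by the same large value, which need to be subtracted when reporting the result.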
The option to write out the trees into a tree file. This specifies that you
want the program to write
out the tree not only on its usual output, but also onto a file in
nested-parenthesis notation (as described above). This option is sufficiently
useful that it is turned on by default in all programs that allow it. You
can optionally turn it off if you wish, by typing the appropriate number
from the menu (it varies from program to program). This option is useful for
creating tree files that can be directly read into the programs, including
the consensus tree and tree distance programs, and the tree plotting programs.
The output tree file has a default name of outtree.
The (0) terminal type option. (This is the digit 0, not
the alphabetic character O). The program will default to
one particular assumption about your terminal (ANSI in the case of Linux,
Unix, or MacOS, and IBM PC in the case of
Windows). You can
alternatively select it to be either an IBM PC, or nothing.
This affects the ability of the programs to clear the screen when they
display their menus, and the graphics characters used to display trees
in the programs Dnamove, Move, Dolmove, and Retree. In the case of Windows,
the screen will clear properly with either the IBM PC or the ANSI settings,
but the graphics characters needed by Move, Dnamove, Dolmove, or Retree
will display correctly only with the IBM PC setting.
All of the programs except Factor, Dnadist, Gendist, Dnainvar, Seqboot,
Contrast, Retree, and the plotting and
consensus tree programs act to construct an estimate of a phylogeny. Move,
Dolmove, and Dnamove let you construct it yourself by hand. All of
the rest but Neighbor, the "...penny" programs and Clique make use of
a common approach involving additions and rearrangements. They are
trying to minimize or maximize some quantity over the space of all
possible evolutionary trees. Each program contains a part that, given
the topology of the tree, evaluates the quantity that is being minimized
or maximized. The straightforward approach would be to evaluate all
possible tree topologies one after another and pick the one which,
according to the criterion being used, is best. This would not be
possible for more than a small number of species, since the number of
possible tree topologies is enormous. A review of the literature on the
counting of evolutionary trees will be found in one of my papers
(Felsenstein, 1978a) and in my book (Felsenstein, 2004, chapter 3).
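To give a feeling for how quickly the numbers grow, the following Python sketch evaluates the standard count of bifurcating trees with labeled tips: 1 x 3 x 5 x ... x (2n-3) rooted trees for n species, and the same product taken up to (2n-5) for unrooted trees. The formula is not derived in this document, and the function names are hypothetical.

```python
def rooted_bifurcating_trees(n):
    """Number of rooted bifurcating trees with n labeled tips."""
    if n < 2:
        raise ValueError("need at least two species")
    count = 1
    for k in range(3, 2 * n - 2, 2):     # odd numbers 3, 5, ..., 2n-3
        count *= k
    return count


def unrooted_bifurcating_trees(n):
    """Number of unrooted bifurcating trees with n labeled tips."""
    if n < 3:
        raise ValueError("need at least three species")
    return rooted_bifurcating_trees(n - 1)
```

Even ten species give more than 34 million rooted topologies, which is why exhaustive evaluation is practical only for small data sets.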
Since we cannot search all topologies, these programs are not
guaranteed to always find the best tree, although they seem to do quite
well in practice. The strategy they employ is as follows: the species
are taken in the order in which they appear in the input file. The
first two (in some programs the first three) are taken and a tree
constructed containing only those. There is only one possible topology for
this tree. Then the next species is taken, and we consider where it
might be added to the tree. If the initial tree is (say) a rooted tree
with two species and we want the resulting three-species tree to be a
bifurcating tree, there are only three places where we could add the
third species. Each of these is tried, and each time the resulting tree is
evaluated according to the criterion. The best one is chosen to be the
basis for further operations. Now we consider adding the fourth
species, again at each of the five possible places that would result in
a bifurcating tree. Again, the best of these is accepted. This is
usually known as the Sequential Addition strategy.
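The addition step described so far (without any rearrangements) can be sketched in Python for rooted bifurcating trees. Trees are represented as nested pairs of species names, and score is a user-supplied function in which, by assumption of this sketch, lower values are better. None of these names comes from the PHYLIP source code.

```python
def insertions(tree, tip):
    """Yield every tree obtained by attaching tip to one branch of tree.

    A rooted tree with k tips has 2k-1 places: three for a two-species
    tree, five for a three-species tree, and so on.
    """
    yield (tree, tip)                    # attach just below this subtree
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insertions(left, tip):
            yield (new_left, right)
        for new_right in insertions(right, tip):
            yield (left, new_right)


def sequential_addition(species, score):
    """Add species in input order, keeping the best placement each time."""
    tree = (species[0], species[1])      # the only two-species topology
    for tip in species[2:]:
        # min() keeps the first of any tied placements.
        tree = min(insertions(tree, tip), key=score)
    return tree
```

Because only the best placement is kept at each step, the result depends on the order of the species in the list, just as the programs depend on the input order.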
The process continues in this manner, with one important exception. After
each species is added, and before the next
is added, a number of rearrangements of the tree are tried, in an effort
to improve it. The algorithms move through the tree, making all
possible local rearrangements of the tree. A local rearrangement involves an
internal segment of the tree in the following manner. Each internal
segment of the tree is of this form (where T1, T2, and T3 are subtrees
- parts of the tree that can contain further forks and tips):
the segment we are discussing being indicated by the asterisks. A local
rearrangement consists of switching the subtrees T1 and T3 or T2 and T3,
so as to obtain one of the following:
Each time a local rearrangement is successful in finding a better tree,
the new arrangement is accepted. The phase of local rearrangements does
not end until the program can traverse the entire tree, attempting local
rearrangements, without finding any that improve the tree.
This strategy of adding species and making local rearrangements will look
at about (n-1)x(2n-3) different topologies, though if
rearrangements are frequently successful the number may be larger. I
have been describing the strategy when rooted trees are being
considered. For unrooted trees there is a precisely similar strategy,
though the first tree constructed may be a three-species tree and the
rearrangements may not start until after the addition of the fifth
species.
These local rearrangements have come to be called Nearest Neighbor
Interchanges (NNIs) in the phylogeny literature.
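A minimal Python sketch of these interchanges for rooted trees follows. Trees are nested pairs of species names; for every internal segment joining a pair of subtrees (T1, T2) to a third subtree T3, the two rearrangements are generated by exchanging T3 with T1 or with T2. The function name is hypothetical, and the code is an illustration, not the routine used in the programs.

```python
def nni_neighbors(tree):
    """Yield every tree one local rearrangement away from a rooted tree."""
    if not isinstance(tree, tuple):
        return                           # a single tip has no internal segment
    left, right = tree
    if isinstance(left, tuple):          # segment between this fork and left
        t1, t2 = left
        yield ((right, t2), t1)          # T1 and T3 switched
        yield ((t1, right), t2)          # T2 and T3 switched
    if isinstance(right, tuple):         # segment between this fork and right
        t1, t2 = right
        yield (t1, (left, t2))           # T1 and T3 switched
        yield (t2, (t1, left))           # T2 and T3 switched
    for new_left in nni_neighbors(left):     # segments deeper in the tree
        yield (new_left, right)
    for new_right in nni_neighbors(right):
        yield (left, new_right)
```

A rooted tree with n species has n-2 internal segments, so this yields 2(n-2) neighboring trees; a search would evaluate each one and accept the first that improves the criterion.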
Though we are not guaranteed to have found the best tree topology,
we are guaranteed that no nearby topology (i.e., none accessible by a
single local rearrangement) is better. In this sense we have reached a
local optimum of our criterion. Note that the whole process is
dependent on the order in which the species are present in the input
file. We can try to find a different and better solution by reordering
the species in the input file and running the program again (or, more
easily, by using the J option). If none of
these attempts finds a better solution, then we have some indication
that we may have found the best topology, though we can never be certain
of this.
Note also that a new topology is never accepted unless it is better
than the previous one, so that the rearrangement process can never fall
into an endless loop. This is also the way ties in our criterion are
resolved, namely by sticking with the tree found first. However, the tree
construction programs other than Clique, Contml, Fitch,
and Dnaml do keep a record of all trees found that are tied with the best one
found. This gives you some immediate idea of which parts of the tree can be
altered without affecting the quality of the result.
A feature of most of the programs, such as Protpars, Dnapars,
Dnacomp, Dnaml, Dnamlk, Restml, Kitsch, Fitch, Contml, Mix, and Dollop,
is "global" optimization of the tree. In four of these (Contml,
Fitch, Dnaml and Dnamlk) this is an option, G. In the others it
automatically applies. When
it is present there is an additional stage to the search for the best tree.
Each possible subtree is removed from the tree and added back in
all possible places. This process continues until all subtrees can be removed
and added again without any improvement in the tree. The purpose of this
extra rearrangement is to make it less likely that one or more species get
"stuck" in a suboptimal region of the space of all possible trees. The use of
global optimization results in approximately a tripling (3x) of the run-time,
which is why I have left it as an option in some of the slower programs.
What PHYLIP calls "global" rearrangements are more properly called
SPR (subtree pruning and regrafting) by Swofford et al. (1996) as distinct
from the NNI (nearest neighbor interchange) rearrangements that PHYLIP
also uses, and the TBR (tree bisection and reconnection) rearrangements
that it does not use. My book (Felsenstein, 2004, chapter 4) contains
a review of work on these and other rearrangements and search methods.
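A minimal sketch of one round of such rearrangements follows. It assumes rooted bifurcating trees written as nested pairs with unique species names as tips, and it only enumerates the candidate trees; evaluating them and accepting improvements is left out. The function names are hypothetical, and this is not how the PHYLIP programs store their trees.

```python
def proper_subtrees(tree):
    """Yield every subtree below the root (tips included)."""
    if isinstance(tree, str):
        return
    for child in tree:
        yield child
        yield from proper_subtrees(child)


def prune(tree, target):
    """Remove subtree `target`; its sibling takes the place of their parent."""
    if isinstance(tree, str):
        return tree
    left, right = tree
    if left == target:
        return right
    if right == target:
        return left
    return (prune(left, target), prune(right, target))


def regraft(tree, sub):
    """Yield every tree made by attaching `sub` to one branch of `tree`."""
    yield (tree, sub)  # attach on the branch leading to this node
    if not isinstance(tree, str):
        left, right = tree
        for new_left in regraft(left, sub):
            yield (new_left, right)
        for new_right in regraft(right, sub):
            yield (left, new_right)


def global_rearrangements(tree):
    """Remove each subtree in turn and add it back in all possible places."""
    for sub in proper_subtrees(tree):
        remainder = prune(tree, sub)
        yield from regraft(remainder, sub)


candidates = list(global_rearrangements((("A", "B"), ("C", "D"))))
print(len(candidates))
```

The list includes the original placement of each subtree, and the same topology can appear more than once with its branches written in a different order; a real implementation would avoid re-evaluating those.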
The programs doing global optimization print out a dot "." after each group is
removed and re-added to the tree, to give the user some sign that the
rearrangements are proceeding. A new line of dots is started whenever a new
round of global rearrangements is started following an improvement in the
tree. On the line before the dots are printed, a bar of
the form "!---------------!" is printed to show how many dots
to expect. The dots will
not be printed out at a uniform rate: the later dots, which represent
removal of larger groups from the tree that are consequently tried in fewer
places, will print out more quickly. With some compilers each row of dots may
not be printed out until it is complete.
It should be noted that Penny, Dolpenny, Dnapenny and Clique use a more
sophisticated strategy of "depth-first search" with a "branch and bound"
search method that guarantees that all
of the best trees will be found. In the case
of Penny, Dolpenny and Dnapenny there can be a considerable sacrifice of
computer time if the number of species is greater than about ten: it is a
matter for you to consider whether it is worth it for you to guarantee finding
all the most parsimonious trees, and that depends on how much free computer
time you have! Clique finds all largest cliques, and does so without undue
burning of computer time. Although all of these problems that have been
investigated fall into the
category of "NP-hard" problems that in effect do not have a rapid solution,
the cases that cause this trouble for the largest-cliques algorithm in
Clique apparently are not biologically realistic and do not occur in actual
data.
As just mentioned, for most of these programs the search depends on the order
in which the species are entered into the tree. Using the J (Jumble)
option you can supply a random number seed which will allow the program to put
the species in a random order. Jumbling can be
done multiple times. For example, if you tell the program to do it
10 times, it will go through the tree-building process 10 times, each with a
different random order of adding species. It will keep a record of the trees
tied for best over the whole process. In other words, it does not just
record the best trees from each of the 10 runs, but records the best ones
overall. Of course this is slow, taking 10 times longer than a single run.
But it does give us a much greater chance of finding all of the most
parsimonious trees. In the terminology of Maddison (1991) it
can find different "islands" of trees. The present algorithms do not
guarantee us to find all trees in a given "island" from a single run, so
multiple runs also help explore those "islands" that are found.
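The bookkeeping for multiple jumbles can be sketched as follows. This is only an illustration under stated assumptions: build_tree and score are hypothetical stand-ins for a program's tree-building process and criterion, lower scores are taken to be better (as with parsimony), and the toy example at the end has no biological meaning.

```python
import random


def jumble_search(species, build_tree, score, n_jumbles, seed):
    """Repeat tree building with random addition orders; keep ties for best overall."""
    rng = random.Random(seed)
    best_score = None
    best_trees = []
    for _ in range(n_jumbles):
        order = list(species)
        rng.shuffle(order)  # a different random order of adding species
        tree = build_tree(order)
        s = score(tree)
        if best_score is None or s < best_score:
            best_score, best_trees = s, [tree]  # strictly better: start a new list
        elif s == best_score and tree not in best_trees:
            best_trees.append(tree)  # tied with the best found so far
    return best_score, best_trees


# Toy stand-ins: the "tree" is just the addition order, and the "score"
# counts adjacent pairs that are out of alphabetical order.
def toy_build(order):
    return tuple(order)


def toy_score(tree):
    return sum(1 for a, b in zip(tree, tree[1:]) if a > b)


best, trees = jumble_search("ABCDE", toy_build, toy_score, n_jumbles=10, seed=4321)
print(best, len(trees))
```

Note that the list of tied trees is kept across all of the runs, not separately for each one.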
For the parsimony and compatibility programs, one can have a perfect tie
between two or more trees. In these programs these trees are all
saved. For the newer parsimony programs such as Dnapars and Pars,
global rearrangement is carried out on all of these tied trees. This can
be turned off in the menu.
For trees with criteria which are real numbers, such as the distance
matrix programs Fitch and Kitsch, and the likelihood programs Dnaml,
Dnamlk, Contml, and Restml, it is difficult to get an exact tie between
trees. Consequently these programs save only the single best tree
(even though the others may be only a tiny bit worse).
In practice, it is advisable to use the Jumble option and to specify that it
be done many times, so that many different orderings of the input species are
evaluated. (This is usually not necessary when bootstrapping,
though the programs will then default to doing it once to avoid artifacts
caused by the order in which species are added to the tree.)
People who want a magic "black box" program whose results they do
not have to question (or think about) often are upset that these
programs give results that are dependent on the order in which the species
are entered in the data. To me this property is an advantage, for it
permits you to try different searches for better trees, simply by
varying the input order of species. If you do not use the multiple Jumble
option, but do multiple individual runs instead, you
can easily decide which to pay most attention to - the one or ones that
are best according to the criterion employed (for example, with parsimony,
the one out of the runs that results in the tree with the fewest changes).
In practice, in a single run, it usually seems best to put species that are
likely to be sources of confusion in the topology last, as by the time they are
added the arrangement of the earlier species will have stabilized into a
good configuration, and then the last few species will be fitted into
that topology. There will be less chance this way of a poor initial
topology that would affect all subsequent parts of the search. However,
a variety of arrangements of the input order of species should be tried,
as can be done if the J option is used,
and no species should be kept in a fixed place in the order of input.
Note that the results of the "...penny" programs and Clique
are not sensitive to the input order of species, and Neighbor is only
slightly sensitive to it, so that multiple Jumbling is not possible
with those programs. Note also that with global search, which
is standard in many programs and in others is an
option, each group (including
each individual species) will be removed and re-added in all possible
positions, so that a species causing confusion will have more chance of moving
to a new location than it would without global rearrangement.
An innovative search strategy was developed by Kevin Nixon (1999). If you
use a manual rearrangement program such as Dnamove, Move, or Dolmove, and
look at the distribution of characters on the trees, you will see some
characters whose distributions appear to recommend alternative groupings.
One would want a program that automatically found such alternative
suggestions and used them to rearrange the tree so as to explore trees that
had those groups. Nixon had the idea of using resampling methods to
do this. Using either bootstrap or jackknife sampling, one can make data
sets that emphasize randomly sampled subsets of characters. We
then search for trees that fit those data sets. After finding them, we
revert to the initial data set and then search using those trees as
starting points. This sampling allows us to explore parts of tree space
recommended by particular subsets of characters. (This is not exactly
Nixon's original strategy, which started the searches for each resampled
data set from the best tree found so far. For each resampled data set we
instead start from scratch, doing sequential addition of taxa.)
Nixon's method has proven to be very effective in searching for most
parsimonious trees -- it is currently the state of the art for that.
Nixon called his method the "parsimony ratchet", but actually it can be
applied straightforwardly to any method of phylogeny inference that has an
optimality criterion, including likelihood and least squares distance methods.
Starting with version 4.0, PHYLIP programs will have the ability to search by
rearranging a tree supplied to them by the user. This makes it possible to
implement our variant of Nixon's strategy. You need to do so in multiple steps:
There is some more information on how this may be done in the documentation
files for Seqboot and for the individual tree inference programs.
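The overall shape of this variant of the strategy can be sketched in a few lines. This is only a sketch under stated assumptions, not the procedure the programs follow: search_from_scratch, search_from_tree and score are hypothetical stand-ins for runs of a tree inference program, the data are a dictionary of character states per species, and lower scores are taken to be better.

```python
import random


def bootstrap_characters(data, rng):
    """Resample character columns with replacement (same number of characters)."""
    n_chars = len(next(iter(data.values())))
    columns = [rng.randrange(n_chars) for _ in range(n_chars)]
    return {name: [states[c] for c in columns] for name, states in data.items()}


def resampling_search(data, search_from_scratch, search_from_tree, score,
                      n_replicates, seed):
    """Use trees found on resampled data as starting points on the original data."""
    rng = random.Random(seed)
    best_tree = search_from_scratch(data)
    best_score = score(best_tree, data)
    for _ in range(n_replicates):
        resampled = bootstrap_characters(data, rng)
        start = search_from_scratch(resampled)  # sequential addition from scratch
        tree = search_from_tree(start, data)    # rearrange using the original data
        s = score(tree, data)
        if s < best_score:
            best_tree, best_score = tree, s
    return best_tree, best_score
```

Each resampled data set emphasizes a random subset of the characters, which is what sends the search into different parts of tree space.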
Probably the most important thing to keep in mind while running any of the
parsimony or compatibility programs is not
to overinterpret the result. Some users treat the set of most parsimonious
trees as if it were a confidence interval. If a group appears in all of the
most parsimonious trees then they treat it as well established. Unfortunately
the confidence interval on phylogenies appears to be much
larger than the set of all most parsimonious trees (Felsenstein, 1985b).
Likewise, variation of result among different methods will not be a good
indicator of the size of the confidence interval. Consider a simple data set
in which, out of 100 binary characters, 51 recommend the unrooted tree
((A,B),(C,D)) and 49 the tree ((A,D),(B,C)). Many different
methods will all give the same result on
such a data set: they will estimate the tree as ((A,B),(C,D)).
Nevertheless it is
clear that the 51:49 margin by which this tree is favored is not statistically
significantly different from 50:50. So consistency among different methods
is a poor guide to statistical significance.
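The 51:49 example can be checked with an exact two-sided binomial test. The sketch below assumes that the 100 characters are independent and that, under the null hypothesis, each is equally likely to favor either tree; the function name is mine.

```python
from math import comb


def two_sided_binomial_p(k, n):
    """Exact two-sided p-value for k successes in n fair (p = 0.5) trials."""
    deviation = abs(k - n / 2)
    # Sum the probabilities of all outcomes at least as far from n/2 as k is.
    tail = sum(comb(n, i) for i in range(n + 1) if abs(i - n / 2) >= deviation)
    return tail / 2 ** n


p_value = two_sided_binomial_p(51, 100)
print(f"{p_value:.3f}")
```

The p-value is above 0.9, so a 51:49 split gives essentially no evidence against 50:50, no matter how many methods agree on the estimate.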
C compilers differ in efficiency of the code they generate,
and some deal with some features of the language better than with
others. Thus a program which is unusually fast on one computer may be
unusually slow on another. Nevertheless, as a rough guide to relative
execution speeds, I have tested the programs on three data sets, each of
which has 10 species and 40 characters. The first is an imaginary one
in which all characters are compatible ("The Willi Hennig Memorial
Data Set", as J. S. Farris once called ones like it). The second is the binary
recoded form of the fossil horses data set of Camin and Sokal (1965).
The third data set has data that is completely random: 10 species and 20
characters that have a 50% chance that each character state is 0 or
1 (or A or G). The data sets thus range from a completely
compatible one in which there is no homoplasy (parallelism or convergence),
through the horses data set, which requires 29 steps where the possible
minimum number would be 20, to the random data set, which requires 49 steps.
We can thus see how this increasing messiness of the data affects running
times. The three data sets have all had 20 sites of A's added to the
end of each sequence, so as to prevent likelihood or distance matrix programs
from having infinite branch lengths (the test data sets used for timing
previous versions of PHYLIP were the same except that they lacked these
20 extra sites).
Here are the nucleotide sequence versions of the three data sets:
Here are the timings of many of the version 3.6 programs on these three data
sets as run after being compiled by Gnu C (version 3.2) and run on an
AMD Athlon XP 2200+ computer under Linux.
In all cases the programs were run under the default options with optimized
compiler switches (-O3 -fomit-frame-pointer), except as
specified here.
The data sets used for the discrete characters programs have 0's and 1's
instead of A's and C's. For Contml the A's and C's
were made into 0.0's and 1.0's and considered as 40 2-allele loci.
For the distance programs 10 x 10 distance matrices were
computed from the three data sets.
For the restriction sites programs A and C were changed into
+ and -. It does not
make much sense to benchmark Move, Dolmove, or Dnamove, although when there
are many characters and many species the response time after each
alteration of the tree should be proportional to the product of the number of
species and the number of characters.
For Dnaml, Dnamlk, and Dnadist the frequencies of the four bases were
set to be equal rather than determined empirically as is the default. For
Restml the number of enzymes was set to 1.
In most cases, the benchmark was made more accurate by analyzing 100 data
sets using the M (Multiple data sets) option and dividing the resulting
time by 100. Times were determined as user times using the Linux time
command. Several patterns will be apparent from this. The algorithms (Mix,
Dollop, Contml, Fitch, Kitsch, Protpars, Dnapars, Dnacomp,
Dnaml, Dnamlk, and Restml) that use the above-described addition strategy have
run times that do not depend strongly on the messiness of the data. The only
exception to this is that if a data set such as the Random data requires
extra rounds of global rearrangements it takes longer. The
programs differ greatly in run time: the protein likelihood programs
Proml and Promlk were very slow, and the other likelihood programs
Restml, Dnaml and
Contml are slower than the rest of the programs. The protein sequence parsimony
program, which has to do a considerable amount of bookkeeping to keep track of
which amino acids can mutate to each other, is also relatively slow.
Another class of algorithms includes Penny, Dolpenny, Dnapenny and Clique.
These are branch-and-bound methods: in principle they should have execution
times that rise exponentially with the number of species and/or
characters, and they might be much more sensitive to messy data. This is
apparent with Penny, Dolpenny, and Dnapenny, which go from being reasonably
fast with clean data to very slow with messy data. Dolpenny is particularly
slow on messy data - this is because this algorithm cannot make use of some of
the lower-bound calculations that are possible with Dnapenny and Penny. Clique
is very fast on all
data sets. Although in theory it should bog down if the number of cliques in
the data is very large, that does not happen with random data, which in
fact has few cliques and those small ones. Apparently the "worst-case"
data sets that cause exponential run time are much rarer for Clique than for
the other branch-and-bound methods.
Neighbor is quite fast compared to Fitch and Kitsch, and should make it
possible to run much larger cases, although the results are expected to be
a bit rougher than with those programs.
How will the speed depend on the number of species and the number
of characters? For the sequential-addition algorithms, the speed should
be proportional to somewhere between the cube of the number of species and
the square of the number of species, and to the number
of characters. Thus a case that has, instead of 10 species and 20
characters, 20 species and 50 characters would take (in the cubic case)
2 x 2 x 2 x 2.5 = 20
times as long. This implies that cases with more than 20 species will
be slow, and cases with more than 40 species very slow. This places a
premium on working on small subproblems rather than just dumping a whole
large data set into the programs.
An exception to these rules will be some of the DNA programs that use an
aliasing device to save execution time. In these programs execution time
will not necessarily increase proportional to the number of sites,
as sites that show the same pattern of nucleotides will be detected
as identical and the calculations for them will be done only once, which does
not lead to more execution time. This is particularly
likely to happen with few species and many sites, or with data sets that have
small amounts of evolutionary divergence.
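The idea behind the aliasing device can be illustrated by counting distinct site patterns. This is a sketch of the idea only, with a made-up three-species alignment; it is not the code used in the DNA programs.

```python
from collections import Counter


def site_patterns(sequences):
    """Count how often each site pattern (column) occurs in aligned sequences."""
    return Counter(zip(*sequences))


# Hypothetical alignment: three species, six sites.
alignment = ["ACGTAA", "ACGTAA", "ACCTAA"]
patterns = site_patterns(alignment)
print(len(alignment[0]), "sites,", len(patterns), "distinct patterns")
```

The calculation for a pattern needs to be done only once and can then be weighted by the number of sites that share it, so the work grows with the number of distinct patterns rather than with the number of sites.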
For programs Fitch and Kitsch, the distance matrix is square, so
that when we double the number of species we also double the number of
"characters", so that running times will go up as the fourth power of
the number of species rather than the third power. Thus a 20-species
case with Fitch is expected to run sixteen times more slowly than a 10-species
case.
For programs like Penny and Clique the run times will rise faster
than the cube of the number of species (in fact, they can rise faster
than any power since these algorithms are not guaranteed to work in
polynomial time). In practice, Penny will frequently bog down above 11
species, while Clique easily deals with larger numbers.
For Neighbor the speed should vary only as the cube of the number of
species, so a case twice as large will take only eight times as long. This
will make it an attractive alternative to Fitch and Kitsch for large data
sets.
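The scaling rules in this section can be put into a small calculation. It is only as good as the power-law assumptions stated above; the function name and its parameters are mine.

```python
def relative_run_time(species_ratio, character_ratio=1.0, species_exponent=3):
    """Predicted run-time ratio between two problem sizes under a power-law rule."""
    return (species_ratio ** species_exponent) * character_ratio


# Sequential addition: 10 species, 20 characters -> 20 species, 50 characters.
sequential_cubic = relative_run_time(20 / 10, 50 / 20, species_exponent=3)
sequential_square = relative_run_time(20 / 10, 50 / 20, species_exponent=2)

# Fitch and Kitsch: fourth power of the number of species.
fitch_doubling = relative_run_time(2, species_exponent=4)

# Neighbor: cube of the number of species.
neighbor_doubling = relative_run_time(2, species_exponent=3)

print(sequential_cubic, sequential_square, fitch_doubling, neighbor_doubling)
```

For the sequential-addition programs the true figure should lie somewhere between the square and cubic cases.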
Suggestion: If you are unsure of how long a program will take, try it first on
a few species, then work your way up until you get a feel for the speed
and for what size programs you can afford to run.
Execution time is not the most important criterion for a program,
particularly as computer time gets much cheaper than your time or a
programmer's time. With workstations on which background jobs can be run
all night, execution speed is not overwhelmingly relevant. Some of us have been
conditioned by an earlier era of computing to consider execution speed
paramount. But ease of use, ease of adaptation to your computer system,
and ease of modification are much more important in practice, and in
these respects I think these programs are adequate. Only if you are
engaged in 1960's style mainframe computing, or if you have very large
amounts of data is minimization of execution
time paramount. If you spent six months getting your data, it may not be
overwhelmingly important whether your run takes 10 seconds or 10 hours.
Nevertheless it would have been nice to have made the programs
faster. The present speeds are a compromise between speed and
effectiveness: by making them slower and trying more rearrangements in the
trees, or by enumerating all possible trees, I could have made the programs
more likely to find the best tree. By trying fewer rearrangements I
could have speeded them up, but at the cost of finding worse trees. I
could also have speeded them up by writing critical sections in assembly
language, but this would have sacrificed ease of distribution to new
computer systems. There are also some options included in these programs that
make it
harder to adopt some of the economies of bookkeeping that make other programs
faster. However to some extent I have simply made the decision not to spend
time trying to speed up program bookkeeping when there were new likelihood and
statistical methods to be developed.
It is interesting to compare different machines using Dnapars as the
standard task. One can rate a machine on the Dnapars benchmark by summing the
times for all three of the data sets. Here are relative total timings over
all three data sets (done with various versions of Dnapars) for some machines,
taking an AMD Athlon 1.2 GHz computer running Linux with gcc as the
standard. Benchmarks from versions 3.4 and 3.5 of the program are
also included (respectively the Pascal and C versions whose timings are in
parentheses). They are compared only with each other and are scaled to the
rest of the timings using the joint runs on the 386SX and the Pentium MMX 266.
This use of separate standards is necessary not
because of different languages but because different versions of the package
are being compared. Thus, the "Time" is the ratio of the Total to that for
the Pentium, adjusted by the scalings of machines using 3.4 and 3.5 when
appropriate. The Relative Speed is the reciprocal of the Time. For the
moment these benchmarks are for version 3.6; they will be updated when 3.7
is fully released.
This list of machines may evoke nostalgia. The timings were compiled some
years ago. When we release PHYLIP 4.0 we hope to add to them some benchmarks
on more recent machines.
This benchmark not only reflects integer performance of these machines
(as Dnapars has few floating-point operations) but also the efficiency
of the compilers. Some of the machines (the DEC 3000/400 AXP
and the IBM RS/6000, in particular) are much faster than this benchmark
would indicate. The numerical programs benchmark below gives them a
fairer test. The Compaq/Digital Alpha 500au times are exaggerated because,
although their compiles are optimized for that processor, some of the Pentium
compiles are not similarly optimized.
Note that parallel machines like the Sequent and the SGI PowerChallenge are not
really as slow as indicated by the data here, as these runs did nothing to take
advantage of their parallelism.
These benchmarks have now extended over 22 years (1986-2008), and in the Dnapars
benchmark they extend over a range of over 54,000-fold in speed!
The experience of our laboratory, which seems typical, is that over that
period computer power grew by a factor of about 1.85 per year. This is
roughly consistent with these benchmarks.
For a picture of speeds for a more numerically intensive program,
here are benchmarks using Dnaml, with an AMD Athlon 1.2 GHz Linux system
as the standard. Some of the timings, the ones in parentheses, are
using PHYLIP version 3.5, and those are compared to that version run on
the Pentium 266. Runs using the PHYLIP 3.4 Pascal version are adjusted
using the 386SX timings where both were run. Numbers are
total run times (total user time in the case of Unix) over all three data sets.
As before, the parallel machines such as the Convex and the SGI PowerChallenge
were only run using one processor, which does not take into account the
gain that could be obtained by parallelizing the programs. The speed of the
Compaq/Digital Alpha 500au is exaggerated because it was compiled in a way
optimized for its processor, while some of the Pentium compiles were not.
You are invited to send me figures for your machine for
inclusion in future tables. Use the data sets above and compute the total
times for Dnapars and for Dnaml for the three data sets (setting the
frequencies of the four bases to 0.25 each for the Dnaml runs). Be sure to
tell me the name and version of your compiler, and the version of PHYLIP you
tested.
If the times are too small to be measured accurately, obtain the times
for 10 or 100 data sets (the Multiple data sets option) and divide by 10 or
100.
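One way to obtain such a per-data-set figure on Unix, Linux or MacOS is sketched below. It assumes Python's resource module is available (it is not on Windows) and that the program's menu responses are supplied from a file on standard input; the command and the file name you would pass in are your own, and the stand-in command used here only makes the sketch runnable.

```python
import resource
import subprocess
import sys


def user_time_per_data_set(command, n_data_sets, stdin_path=None):
    """Run `command` once; return its user CPU seconds divided by n_data_sets."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime
    stdin = open(stdin_path) if stdin_path else subprocess.DEVNULL
    try:
        subprocess.run(command, stdin=stdin, stdout=subprocess.DEVNULL, check=True)
    finally:
        if stdin_path:
            stdin.close()
    after = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime
    return (after - before) / n_data_sets


# Stand-in command: replace it with the program being timed and its
# keyboard response file.
per_set = user_time_per_data_set([sys.executable, "-c", "pass"], n_data_sets=100)
print(per_set)
```

This measures user time, as the Linux time command does, so that time spent waiting for other jobs on the machine is not counted.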
In the sections following you will find instructions on how to adapt the
programs to different computers and compilers. The programs should compile
without alteration on most versions of C. They use the "malloc"
or "calloc" library function to allocate memory, so that the upper limits on how many
species or how many sites or characters they can run are set by the system memory
available to that memory-allocation function.
In the document file for each program, I have supplied a small
input example, and the output it produces, to help you check whether the
programs are running properly.
If you have not been able to get executables for PHYLIP, you should be
able to make your own. This can be easy under Linux and Unix, but more
difficult if you have a Macintosh or a Windows system. If you have the
latter, we strongly recommend you download and use the Macintosh and
Windows executables that we distribute. If you do that, you will not need
to have any compiler or to do any compiling. I get a certain number of
inquiries each year from confused users who are not sure what a compiler
is but think they need one. After downloading the executables they
contact me and complain that they did not find a compiler included in the
package, and would I please e-mail them the compiler. What they really
need to do is use the executables and forget about compiling them.
Some users may also need to compile the programs in order to modify them.
The instructions below will help with this.
I will discuss how to compile PHYLIP using one of a number of widely-used
compilers. After these I will comment on compiling PHYLIP on other, less
widely-used systems.
For Unix and Linux (which is Unix in all important functional respects, if
not in all legal respects) you must compile PHYLIP yourself.
This is usually easy to do.
Unix (and Linux) systems generally have a C compiler and have the make utility.
We distribute with the PHYLIP source code a Unix-compatible Makefile.
We use GNU's make utility, which might be installed on your system as "make"
or as "gmake".
However, note that some popular Linux distributions do not include a C
compiler in their default configuration. For example, in RedHat Linux version
8, the "Personal Workstation" installation that is the default does not
include the C compiler or the X Windows libraries needed to compile
PHYLIP. These are available, and can be loaded from the CDROMs in the
distribution. The following instructions assume that you have the
C compiler and X libraries. If you cannot easily configure your system to
include them, you should look into using the RedHat RPM binary
distribution, mentioned on the PHYLIP 3.6 web page.
As is mentioned below (under Macintoshes) the MacOS operating system is
a Unix, and if the X windows windowing system is installed, these Unix
instructions will work for it.
After you have finished unpacking the Documentation and Source Code
archive, you will find that you have created a folder phylip-3.6
in which there are three
folders, called exe, src, and doc.
There is also an HTML web page, phylip.html. The exe
folder
will be empty, src contains the source code files, including the
Makefile. Directory doc contains the documentation files.
Enter the src folder. Before you compile, you will want to
look at the Makefile and see whether you want to alter the compilation
command. By default no C compiler flags are set. If you
have modified the programs, you might want to use the debugging flags
"-g". On the other hand, if you are trying to make a fast executable using
the GCC compiler, you may want to use the one which is "An optimized one
for gcc". In either case, remove the "#" before that CFLAGS command, and
place it before the CFLAGS command that was previously in use.
There are careful instructions on this in the Makefile.
Once you have set up the CFLAGS and DFLAGS statements to be the way you
want, to compile all the programs just type:
make install
You will then see the compiling commands as they happen, with
occasional warning messages. If these are warnings, rather than errors,
they are not too serious. A typical warning would be like this:
dnaml.c:1204: warning: static declaration for re_move follows non-static
After a time the compiler will finish compiling. If you have done a
make install the system will then move the executables into the
exe folder and also save space by erasing all the relocatable
object files that were produced in the process. You should be left with
useable executables in the exe folder, and the src
folder should be as before. To run the executables, go into the
exe folder and type the program name (say dnaml, which
you may or may not have to precede by a dot and a slash, ./).
The names of the
executables will be the same as the names of the C programs, but without the
.c suffix. Thus dnaml.c compiles to make an executable called dnaml.
Our two tree-drawing programs, Drawgram and Drawtree, require an
X Windows installation including the Athena Widgets. These are provided
with most X Windows installations.
If you see messages that the compilation could not find "Xlib.h" and other,
similar header files, this means that some parts of the X Windows development
environment are not installed on your system, or are not installed in the
default location.
Similarly, if you get error messages saying that some files with "Xaw"
in the name cannot be found, this means that the Athena Widgets
are not installed on your system, or are not installed in the
default location.
In either case, you will need to make sure that they are installed properly.
If they are there but not found during the compile, change the
DFLAGS and DLIBS variables in the Makefile to
point to the locations of the header files and libraries, respectively.
Another possible problem is that the usual Linux C compiler is the Gnu GCC compiler.
In some Linux systems it is not invoked by the command cc but
by gcc. You would then need to edit the Makefile to reflect this
(see below for comments on that process).
A typical Unix or Linux installation would put the directory phylip-3.6
in /usr/local. The name of the executables directory EXEDIR
could be changed to be /usr/local/bin, so that the make install
command puts the executables there. If the users have /usr/local/bin
in their paths, the programs would be found when their names are typed.
The font files font1 through font6 could also be
placed there. A batch script containing the lines
could be used to establish links in the user's working directory so that
Drawtree and Drawgram would find these font files when users
type a name such as font1 when the program asks
them for a font file name. The
documentation web pages are in subdirectory doc of the
main PHYLIP directory, except for one, phylip.html which is
in the main PHYLIP directory. It has a table of all of the documentation
pages, including this one. If users create a bookmark to that page
it can be used to access all of the other documentation pages.
To compile just one program, such as Dnaml, type:
make dnaml
After this compilation, dnaml will be in the src
subdirectory. So will some relocatable object code files that
were used to create the executable. These have names ending in
.o - they can safely be deleted.
If you have problems with the compilation command, you can edit the
Makefile. It has careful explanations at its front of how you
might want to do so. For example, you might want to change the C
compiler name cc to the name of the Gnu C compiler, gcc.
This can be done by removing the comment character # from the
front of one line, and placing it at the front of a nearby line.
How to do so should be clear from the material at the beginning of the
Makefile. We have included sample lines for using the gcc
compiler and for using the Cygwin Gnu C++ environment on Windows, as
well as the default of cc.
We have encountered some problems, in our code for generating random numbers,
with the Gnu C Compiler (gcc) on 64-bit Itanium processors when compiling
at the -O3 optimization level.
Some older C compilers (notably the Berkeley C compiler which is
included free with some Sun systems) do not adhere to the ANSI C
standard (because they were written before it was set down).
They have trouble with the function prototypes which are in
our programs. We have included an #ifndef preprocessor
command to eliminate the problem, if you use the switch -DOLDC
when compiling. Thus with these compilers you need only use this in
your C flags (in the Makefile) and compilers such as Berkeley C
will cause no trouble.
We distribute Windows executables, and most likely you can use these and
do not need to recompile them. The following instructions will only be
necessary if you want to modify the programs and need to recompile them.
They are given for several different compilers available on Windows systems.
Another major compiler is the Intel compiler -- we do not have information yet
on how to use it, but expect that PHYLIP will compile on it.
Compiling with CygWin and Gnu C++
The CygWin project has adapted the Gnu C compiler
to Windows systems and
provided an environment, CygWin, which mimics Unix for compiling.
Currently, this is the compiler that we use to prepare the Windows
executables.
Cygwin is available for purchase, and they also make it
available to be downloaded for free. The download is large. To get it, go
to the Cygwin web site at
When installing Cygwin it is important to install gcc and make. While it
runs, the Setup program will ask you to select packages.
Expand the Devel category by clicking on it. Scroll down to gcc and
check whether the "New" column says "Skip". If it does,
click on "Skip". "Skip" will change to the current version of gcc.
Scroll down to the make package, and if it says "Skip",
click on "Skip" there as well. These two
programs are necessary to install PHYLIP.
The Mingw gcc compiler
If you are not offered the opportunity to install the gcc compiler
when installing CygWin, you can install it separately.
The 64-bit version of Mingw is called Mingw-w64. It is appropriate for most
Windows systems except for older 32-bit ones. It
can be downloaded from https://mingw-w64.org.
Make sure that when installing it, you set a PATH variable
which includes the folder bin in the Mingw-w64 folder.
For example, if you chose to install Mingw-w64 in a folder called
C:\Mingw then the PATH in CygWin should have C:\Mingw\bin
in it, preferably at the front. That can be done by executing the
command export PATH=/cygdrive/c/Mingw/bin:$PATH
Compiling with Mingw under CygWin
Once you have
installed the free CygWin environment and the MinGW C compiler
on your Windows system, compiling PHYLIP is very similar to
what one does for Unix or Linux:
dnaml ICON "dna.ico"
We have provided a folder icons in the src
folder, containing a full set of icons and a full set of resource
files (*.rc) so you will not have to do this yourself.
Compiling with Microsoft Visual C++
We have had success in the past compiling PHYLIP with Microsoft Visual
C++ (the compiler in the Microsoft .NET package), although the
Windows executables that we distribute are built
using the Cygwin GCC compiler. The following instructions are the ones we
have used for Visual C++ for the .NET 2008 version with Visual C++ version 9.0.
Microsoft also makes a free download version of their C++ compiler
from 2005 available as Visual C++ Express Edition. That version has a somewhat
different content, and these instructions will not work with it. If you
figure out how to get the compiler and Makefiles to work together, please
let us know -- we don't have the energy to figure this out for all possible
configurations of the Microsoft C++ compiler.
The instructions use the nmake command that uses a Makefile which is
called Makefile.msvc in our distribution.
At the end of this section we have some comments on how to
compile the programs with Visual C++ version 7.0, which also has a somewhat
different file folder structure.
With Microsoft Visual C++, you can compile using a Makefile. We have supplied
this in the source code distribution as Makefile.msvc.
You may wish to preserve the Unix Makefile by renaming Makefile to,
say, Makefile.unix, then make a copy of Makefile.msvc
and call it Makefile. (You may have to change your Windows desktop
settings to make the three-letter extensions visible, or you could use
the RENAME command in the Command tool).
Once you have set MSVC, type
The Makefile has some paths in it, which are, I hope, in
the correct form for Visual C++ 9.0 on your system.
If not, the statement
If instead you have an earlier version of Visual Studio .NET which
has the Visual C++ 7.0 compiler, you should proceed as above, but
instead, set MSVC to
C:\Program Files\Microsoft Visual Studio .NET,
and then type
You will also need to edit the line in the Makefile that
defines the variable MSVCPATH. You should change this to
Compiling with Borland C++
Borland C++ can be downloaded for free. It is a compiler released
in 2000, which is now owned by Embarcadero Technologies, Inc.
(see their site
http://www.codegear.com/downloads/free/cppbuilder). To download
it you need to register with them.
It has a somewhat restrictive license, so we cannot use it for the
widely-distributed executables.
You should download the compiler as it includes all the utilities needed
to compile phylip. It can compile using a Makefile. We have supplied
this in the source code distribution as Makefile.bcc. You will need to
preserve the Unix Makefile by renaming it to, say, Makefile.unix, then
make a copy of Makefile.bcc and call it Makefile. The
Makefile is invoked using the make command.
You will first need to create an ilink32.cfg and a bcc32.cfg
file and put the files into the src folder. These files are text files
and their contents are described in the readme.txt that comes with the
Borland tools. If the Borland tools are in the default location the
contents of ilink32.cfg would be.
and the contents of bcc32.cfg
These files can be created in a text editor such as Notepad or Wordpad.
To invoke the make command you will first need to open a command
prompt window. Then set the path appropriately. To set the path, type
Where "Path" is where Borland is installed, such as C:\Borland\BCC55.
Then type
If you simply type make you will get a list of possible make commands.
For example, to compile a single program such as Dnaml but not install
it, type make dnaml. To compile and install all programs type
make install. We have supplied all the support files and icons
needed for
the compilations. They are in folder bcc of the main PHYLIP
source code folder.
We have had to supply a complete second set of the resource files with
names *.brc because Borland resource files have a minor
incompatibility with Microsoft Visual C++ resource files.
Macintosh
Compiling with GCC on MacOS with our Makefile
The executables distributed by us for MacOS are currently compiled
using the GCC compiler that is distributed with MacOS. You may not need
to recompile them, unless you want to make changes in the programs.
We are distributing 32-bit "universal binaries" that work on both PowerMac and
Intel iMac.
Reasons to recompile include making a version of the
executables more closely adapted to your system, or
modifying the programs. Another reason might be that you want
64-bit executables, which you might need to address large amounts of
memory.
If you do want to recompile, consider the following:
make install -f Makefile.osx
make dnaml.install
Compiling with GCC on MacOS with X Windows
On MacOS systems you can also use the GCC compiler and X Windows to
compile a version of the executables that runs from the command line
in native mode.
To do that, you must have the GCC compiler and the X11 windows
development kit materials installed.
In recent MacOS versions X11 is not made available, but you
can install it by going to www.Xquartz.org
and downloading and installing the latest X11 server from there. It is
an executable that does not need to be recompiled.
It is easy to download and install on a MacOS system.
If you have the GCC compiler and the X11 libraries installed, you can use a
Terminal window (which you
will find available in the Utilities folder in the Applications folder) and
compile PHYLIP by treating it as a Unix or Linux application and following the
instructions given above under "Unix and Linux". Basically you just get
into the folder that contains the PHYLIP source code and type
make install
This uses the ordinary Unix/Linux Makefile, which works
in creating programs using X11 for
MacOS with the gcc compiler. Note that to run the
programs drawgram and drawtree that actually use the X Windows, you
will need to
Parallel computers
As parallel computers become more common, the issue of how to compile
PHYLIP for them has become more pressing. People have been compiling
PHYLIP for vector machines and parallel machines for many years. We
have not made a version for parallel machines because there is still
no standard parallel programming environment on such machines (or rather,
there are many standards, so that one cannot find one that makes
a parallel execution version of PHYLIP widely distributable). However,
symmetric multiprocessing using the
Message Passing Interface (MPI) is widespread, and we will
probably support it in future versions of PHYLIP.
Although the underlying algorithms of most programs,
which treat sites independently, should be amenable to vector and
parallel processors,
there are details of the code which might best be changed.
In certain of the programs (Dnaml, Dnamlk,
Proml, Promlk) I have put a special
comment statement next to the loops in the program where
the program will spend most of its time, and which are the places
most likely to benefit from parallelization. This comment statement is:
If you succeed in making a parallel version of PHYLIP we would like to
know how you did it. In particular, if you can prepare a web page which
describes how to do it for your computer system, we would like to use
material from it
in our PHYLIP web pages. Please e-mail it to me. We hope to
have a set of pages that give detailed instructions on how to make parallel
versions of PHYLIP on various kinds of machines. Alternatively, if we
were given your modified version of the program we might be able to
figure out how to make modifications to our source code to allow
users to compile the program in a way which makes those modifications.
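To illustrate why treating sites independently makes these programs amenable to parallel processing, here is a minimal sketch of a loop that sums per-site log-likelihoods. It is not PHYLIP code: the function name, its arguments, and the choice of OpenMP (rather than the MPI mentioned above) are assumptions made only for this example. The pragma is guarded so that the code also compiles as ordinary serial C.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, not PHYLIP code: sums per-site log-likelihoods,
   each weighted by the number of times that site pattern occurs. */
double total_log_likelihood(const double *site_lnl, const int *weight,
                            int nsites)
{
  double total = 0.0;
  int i;

  assert(site_lnl != NULL && weight != NULL && nsites >= 0);

  /* Each iteration reads only its own site, so iterations can run in
     any order; the partial sums are combined by the reduction clause. */
#ifdef _OPENMP
#pragma omp parallel for reduction(+:total)
#endif
  for (i = 0; i < nsites; i++)
    total += weight[i] * site_lnl[i];
  return total;
}
```

In the real programs the work done per site is far heavier than a single multiplication, which is what would make such a loop worth parallelizing.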
As you can see from the variety of different systems on which these
programs have been successfully run, there are no serious
incompatibility problems with most computer systems. PHYLIP in various
past Pascal versions has also been compiled on 8080 and Z80 CP/M Systems, Apple
II systems running UCSD Pascal, a variety of minicomputer systems such as
DEC PDP-11's and HP 1000's, on 1970's era mainframes such as CDC
Cyber systems, and so on. In a later era
it was also compiled on IBM 370 mainframes, and of course on DOS and
Windows systems, on MacOS 8 and 9 systems, and on Mac OS X (now
renamed MacOS).
We have gradually
accumulated experience on a wider variety of C compilers. If you succeed in
compiling the C version of PHYLIP on a different machine or a different
compiler, I would like to
hear the details so that I can consider including the instructions in a future version
of this manual.
The ONLY reason you should recompile the Java interfaces is if you want to add or modify
functionality in them. In all other cases, the .jar files that
already exist in the javajars folder will run on your Mac, Windows,
Linux, or Unix system, along with the executables in the exe folder.
If those are present,
you do not need to read this section, and you should not.
Here are detailed instructions, provided by our lab's Java expert Jim McGill.
Welcome to a fairly complex process. Unless you are an experienced object
oriented programmer, you will find Java has a steep learning curve and will
cause you headaches.
The general overview is that there is a Java interface that gathers and
validates input from the user, there is a call from the Java code to a dynamic
C library that contains the Phylip functionality, and there is feedback to the
user from the Java interface as to the status of the underlying C code.
Because one has two very different kinds of software running, the feedback is
not as elegant as one would expect from a single integrated environment.
Now for the specifics. We have developed these Java interfaces using the Eclipse environment (available from
www.eclipse.org). Go there and download the version of the Java
development environment appropriate to your operating system.
In the distribution there is a javasrc folder which contains folders
that match the programs. These folders contain the program's Java interfaces.
For example folder drawgram contains DrawgramInterface.java
and DrawgramUserInterface.java. The former does the interaction with
the compiled C library, the latter contains the user interface.
If you want to modify the Java interface for Drawgram, open the Eclipse
Java development environment, create a project called Phylip3.698, create a
folder under it called src. Under that create a project called
drawgram. Now import the two Drawgram associated java files
(DrawgramUserInterface.java and DrawgramInterface.java) into
that project. You will also need to create a project called util and
import all the items in the javasrc/util directory. Open
DrawgramUserInterface.java with the Eclipse WindowBuilderEditor and
you can edit it however you want. Remember that you'll need to add
ActionListeners (described in Java manuals) to anything that changes things on
the screen. There are plenty of examples of them in
DrawgramUserInterface.java, for example, TreeGrowToggle
which handles the toggling "Tree grows:" between "Horizontal" and "Vertical"
using Radio buttons. Most of the pieces you'll need are in the existing code.
You can clone them and edit to fit. Beyond that, "Google is your friend".
Once you have added new functionality or changed existing functionality in the
user interface, you will need to pass the information it collects from the
user to the underlying C code. This is a bit tricky because C and Java are
very different kinds of languages. Luckily the Java Native Interface (JNI),
which is part of Java, and the open-source Java Native Access (JNA) library,
which is built on top of it, take care of this. We
used JNA (which calls JNI) because it is simpler to use and our needs were
basic enough we could live within its confines. In order to use it you will
also need to get two public jars off the web (do a Google search for these as
they keep moving around):
JNA passes everything via an enormous list of variables. This is simple to
program but very hard to keep track of, as you have to keep things exactly
parallel in the Java and C code and there is no debugger that will help you.
We have found it best to build a public class in Java that contains everything
that is going to the C code and create an instance of it when the user is
finished with data entry and decides to execute the process (in the
Drawgram case, selects Preview). We then copy all the data
from the screen into the members of the class, and pass these directly into
the JNA call to the underlying C code (look in DrawgramInterface.java
for an example).
In the underlying C code (which must be compiled as a library so that Java can
access it), there is an entry point that is the name of the program (for
example the function drawgram in drawgram.c) containing as
arguments every one of the variables that were passed by the Java interface,
in the same order. If you have weird bugs, most likely you messed this up.
Make a copy of the Java class definition, paste it into the C code and check
everything. Another wrinkle that can bite you is that booleans come through as
integers and Java and C do not agree as to what that means. False is 0 in both
languages. True is "not 0" in Java and is often set to all bits on (which is
-1 when read as a signed integer in C). C often has problems with this. Each compiler
is different and there are environment variables that affect this also. It is
safest to explicitly fix things before you execute any C code. There are a lot
of other odd quirks, but you have two working examples (Drawgram.c
and Drawtree.c), so you can probably figure them out.
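One way to "explicitly fix things" is to normalize every boolean argument as the first step of the C entry point. The sketch below assumes a hypothetical entry point, example_entry, with two made-up flags; the real entry points in the PHYLIP sources take many more arguments and are not reproduced here.

```c
#include <assert.h>

/* Map any nonzero value received from the Java side to exactly 1. */
int normalize_flag(int value)
{
  return value != 0;
}

/* Hypothetical entry point; its name and arguments are illustrative only. */
int example_entry(int grow_vertical, int use_lengths)
{
  /* Normalize first, before any other C code looks at the flags. */
  grow_vertical = normalize_flag(grow_vertical);
  use_lengths = normalize_flag(use_lengths);
  assert(grow_vertical == 0 || grow_vertical == 1);
  assert(use_lengths == 0 || use_lengths == 1);

  /* The flags can now safely be compared with 1 or packed into bits. */
  return grow_vertical + 2 * use_lengths;
}
```

After this step a value that arrived as "all bits on" behaves the same as one that arrived as 1, whichever compiler is in use.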
Feedback from C to Java can be difficult. In Drawgram and Drawtree it is
fairly easy, as the plotting is done (to the file JavaPreview.ps in
case you need to know) and the program returns. The Java interface waits until
the C code completes and returns, then reads JavaPreview.ps and
displays the preview. In cases where one needs progress indicators, one needs
to multithread the Java code and display a continually updating progress
file. Phylip 3.698 has no need of multithreading but it may be implemented
in Phylip 4.0.
This set of Frequently Asked Questions, and their answers, is from the
PHYLIP web site. A more up-to-date version can be found there, at:
At the upper-undergraduate level:
and as graduate-level texts:
For more mathematically-oriented readers, there is the book
Best of all is of course my own book on phylogenies, which
covers the subject for many data types, at a
graduate course level:
There are also some recent books that take a more practical hands-on approach,
and give some detailed information on how to use programs, including PHYLIP
programs. These include:
In addition, one of these three review articles may help:
I have already mentioned above that there is an excellent guide to using PHYLIP 3.6
for molecular analyses available. It is by Jarno Tuimala:
Felsenstein, J. 2009. PHYLIP (Phylogeny Inference Package) version 3.698.
Distributed by the author. Department of Genome Sciences, University of
Washington, Seattle.
or if the editor for whom you are writing insists that the citation must be to
a printed publication, you could cite a notice for version 3.2 published in
Cladistics:
Felsenstein, J. 1989. PHYLIP - Phylogeny Inference Package (Version 3.2).
Cladistics 5: 164-166.
(This citation has been so commonly made that this is the most-cited paper
ever in the journal Cladistics, with more than twice as
many citations as the next-most-cited paper. I am the second most-cited author ever in that
journal, after J. S. Farris, and these 1989 citations are responsible for more than 10% of the
impact factor of that journal!).
For a while a printed version of the PHYLIP documentation was available and one
could cite that. This is no longer true. Other than that, this is difficult,
because I have never written a paper announcing PHYLIP! My 1985b paper in
Evolution on the bootstrap method contains a
one-paragraph Appendix describing the availability of this package, and that
can also be cited as a reference for the package, although it was
distributed since 1980 while the bootstrap paper is 1985. A paper on PHYLIP
is needed mostly to give people something to cite, as word-of-mouth, references
in other people's papers, and electronic newsgroup postings have spread the
word about PHYLIP's existence quite effectively.
(The following four questions, once
common, have finally disappeared, I am pleased to report. I include them to
give you some idea of what kinds of requests I had to cope with.)
Version 3.6 has many new features:
There are many more, lesser features added as well.
Version 3.7 has some new features:
There are some obvious deficiencies in this version. Some of these
holes will be filled in the next few releases (starting with version
4.0). They include:
There will also be many future developments in the programs that treat
continuously-measured data (quantitative characters) and morphological
or behavioral data with discrete states, as I have new ideas for
analyzing these data in ways that connect to within-species
quantitative genetic analyses. This will compete with parsimony
analysis.
Here are some comments people have made in print about PHYLIP. Explanatory
material in square brackets is my own. They fall naturally into three groups:
(note also W. Fink's critical remarks (1986) on version 2.8 of PHYLIP).
In the documentation files that follow I frequently refer to papers
in the literature. In order to centralize the references they are given
in this section. If you want to find further papers beyond these, my
book (Felsenstein, 2004) lists more than 1,000 further references.
Adams, E. N. 1972. Consensus techniques and the comparison of
taxonomic trees. Systematic Zoology 21: 390-397.
Adams, E. N. 1986. N-trees as nestings: complexity, similarity, and
consensus. Journal of Classification 3: 299-317.
Archie, J. W. 1989. A randomization test for phylogenetic information in
systematic data. Systematic Zoology 38: 239-252.
Backeljau, T., L. De Bruyn, H. De Wolf, K. Jordaens, S. Van Dongen,
and B. Winnepenninckx.
1996. Multiple UPGMA and neighbor-joining trees and the performance of some
computer packages. Molecular Biology and Evolution 13: 309-313.
Barry, D., and J. A. Hartigan. 1987. Statistical analysis of hominoid
molecular evolution. Statistical Science 2: 191-210.
Baum, B. R. 1989. PHYLIP: Phylogeny Inference Package. Version 3.2. (Software
review). Quarterly Review of Biology 64: 539-541.
Bourque, M. 1978. Arbres de Steiner et reseaux dont certains sommets sont
à localisation variable. Ph. D. Dissertation, Université de
Montréal, Quebec.
Bron, C., and J. Kerbosch. 1973. Algorithm 457: Finding all cliques
of an undirected graph. Communications of the Association for Computing Machinery 16: 575-577.
Camin, J. H., and R. R. Sokal. 1965. A method for deducing branching
sequences in phylogeny. Evolution 19: 311-326.
Carpenter, J. 1987a. A report on the Society for the Study of Evolution
workshop "Computer Programs for Inferring Phylogenies". Cladistics 3:
363-375.
Carpenter, J. 1987b. Cladistics of cladists. Cladistics 3: 363-375.
Cavalli-Sforza, L. L., and A. W. F. Edwards. 1967. Phylogenetic
analysis: models and estimation procedures. Evolution 21: 550-570
(also American Journal of Human Genetics 19: 233-257).
Cavender, J. A. and J. Felsenstein. 1987. Invariants of phylogenies in a
simple case with discrete states. Journal of Classification 4: 57-71.
Churchill, G.A. 1989. Stochastic models for heterogeneous DNA sequences.
Bulletin of Mathematical Biology 51: 79-94.
Conn, E. E. and P. K. Stumpf. 1963. Outlines of Biochemistry. John Wiley
and Sons, New York.
Day, W. H. E. 1983. Computationally difficult parsimony problems in
phylogenetic systematics. Journal of Theoretical Biology 103:
429-438.
Dayhoff, M. O. and R. V. Eck. 1968. Atlas of Protein Sequence
and Structure 1967-1968. National Biomedical Research Foundation,
Silver Spring, Maryland.
Dayhoff, M. O., R. M. Schwartz, and B. C. Orcutt. 1979. A model of
evolutionary change in proteins. pp. 345-352 in Atlas of
Protein Sequence and Structure, volume 5, supplement 3, 1978, ed.
M. O. Dayhoff. National Biomedical Research Foundation, Silver Spring, Maryland.
Dayhoff, M. O. 1979. Atlas of Protein Sequence and Structure, Volume 5,
Supplement 3, 1978. National Biomedical Research Foundation, Washington, D.C.
DeBry, R. W. and N. A. Slade. 1985. Cladistic analysis of restriction
endonuclease cleavage maps within a maximum-likelihood framework.
Systematic Zoology 34: 21-34.
Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. Maximum
likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B 39: 1-38.
Eck, R. V., and M. O. Dayhoff. 1966. Atlas of Protein Sequence and
Structure 1966. National Biomedical Research Foundation, Silver
Spring, Maryland.
Edwards, A. W. F., and L. L. Cavalli-Sforza. 1964. Reconstruction of
evolutionary trees. pp. 67-76 in Phenetic and Phylogenetic
Classification, ed. V. H. Heywood and J. McNeill. Systematics
Association Volume No. 6. Systematics Association, London.
Estabrook, G. F., C. S. Johnson, Jr., and F. R. McMorris. 1976a. A
mathematical foundation for the analysis of character
compatibility. Mathematical Biosciences 23: 181-187.
Estabrook, G. F., C. S. Johnson, Jr., and F. R. McMorris. 1976b. An
algebraic analysis of cladistic characters. Discrete Mathematics 16: 141-147.
Estabrook, G. F., F. R. McMorris, and C. A. Meacham. 1985. Comparison of
undirected phylogenetic trees based on subtrees of four evolutionary units.
Systematic Zoology 34: 193-200.
Faith, D. P. 1990. Chance marsupial relationships. Nature 345: 393-394.
Faith, D. P. and P. S. Cranston. 1991. Could a cladogram this short have
arisen by chance alone?: On permutation tests for cladistic
structure. Cladistics 7: 1-28.
Farris, J. S. 1977. Phylogenetic analysis under Dollo's Law. Systematic Zoology 26: 77-88.
Farris, J. S. 1978a. Inferring phylogenetic trees from chromosome
inversion data. Systematic Zoology 27: 275-284.
Farris, J. S. 1981. Distance data in phylogenetic analysis. pp. 3-23
in Advances in Cladistics: Proceedings of the first meeting of the
Willi Hennig Society, ed. V. A. Funk and D. R. Brooks. New York
Botanical Garden, Bronx, New York.
Farris, J. S. 1983. The logical basis of phylogenetic analysis. pp. 1-47
in Advances in Cladistics, Volume 2, Proceedings of the Second Meeting of
the Willi Hennig Society. ed. Norman I. Platnick and V. A. Funk. Columbia
University Press, New York.
Farris, J. S. 1985. Distance data revisited. Cladistics 1: 67-85.
Farris, J. S. 1986. Distances and statistics. Cladistics 2: 144-157.
Farris, J. S. [“T. N. Nayenizgani”]. 1990. The systematics association
enters its golden years (review of Prospects in Systematics, ed. D.
Hawksworth). Cladistics 6: 307-314.
Farris, J. S., V. A. Albert, M. Källersjö, D.
Lipscomb, and A. G. Kluge. 1996. Parsimony jackknifing outperforms
neighbor-joining. Cladistics 12: 99-124.
Felsenstein, J. 1973a. Maximum likelihood and minimum-steps methods
for estimating evolutionary trees from data on discrete characters.
Systematic Zoology 22: 240-249.
Felsenstein, J. 1973b. Maximum-likelihood estimation of evolutionary
trees from continuous characters. American Journal of Human Genetics 25:
471-492.
Felsenstein, J. 1978a. The number of evolutionary trees. Systematic Zoology 27: 27-33.
Felsenstein, J. 1978b. Cases in which parsimony and compatibility
methods will be positively misleading. Systematic Zoology 27:
401-410.
Felsenstein, J. 1979. Alternative methods of phylogenetic inference
and their interrelationship. Systematic Zoology 28: 49-62.
Felsenstein, J. 1981a. Evolutionary trees from DNA sequences: a
maximum likelihood approach. Journal of Molecular Evolution 17: 368-376.
Felsenstein, J. 1981b. A likelihood approach to character weighting
and what it tells us about parsimony and compatibility. Biological Journal of the Linnean Society 16: 183-196.
Felsenstein, J. 1981c. Evolutionary trees from gene frequencies and
quantitative characters: finding maximum likelihood estimates.
Evolution 35: 1229-1242.
Felsenstein, J. 1982. Numerical methods for inferring evolutionary
trees. Quarterly Review of Biology 57: 379-404.
Felsenstein, J. 1983b. Parsimony in systematics: biological and
statistical issues. Annual Review of Ecology and Systematics 14: 313-333.
Felsenstein, J. 1984a. Distance methods for inferring phylogenies: a
justification. Evolution 38: 16-24.
Felsenstein, J. 1984b. The statistical approach to inferring
evolutionary trees and what it tells us about parsimony and
compatibility. pp. 169-191 in: Cladistics: Perspectives in the
Reconstruction of Evolutionary History, edited by T. Duncan and T. F.
Stuessy. Columbia University Press, New York.
Felsenstein, J. 1985a. Confidence limits on phylogenies with a molecular
clock. Systematic Zoology 34: 152-161.
Felsenstein, J. 1985b. Confidence limits on phylogenies: an approach
using the bootstrap. Evolution 39: 783-791.
Felsenstein, J. 1985c. Phylogenies from gene frequencies: a statistical
problem. Systematic Zoology 34: 300-311.
Felsenstein, J. 1985d. Phylogenies and the comparative method. American Naturalist 125: 1-12.
Felsenstein, J. 1986. Distance methods: a reply to Farris. Cladistics 2:
130-144.
Felsenstein, J. and E. Sober. 1986. Parsimony and likelihood: an
exchange. Systematic Zoology 35: 617-626.
Felsenstein, J. 1988a. Phylogenies and quantitative characters. Annual Review of Ecology and Systematics 19: 445-471.
Felsenstein, J. 1988b. Phylogenies from molecular sequences: inference and
reliability. Annual Review of Genetics 22: 521-565.
Felsenstein, J. 1992. Phylogenies from restriction sites, a
maximum likelihood approach. Evolution 46: 159-173.
Felsenstein, J. and G. A. Churchill. 1996.
A hidden Markov model approach to variation among sites in rate of evolution.
Molecular Biology and Evolution 13: 93-104.
Felsenstein, J. 2004. Inferring Phylogenies. Sinauer Associates,
Sunderland, Massachusetts.
Felsenstein, J. 2005. Using the threshold model of quantitative genetics
for inferences within and between species. Philosophical Transactions
of the Royal Society of London, Series B 360: 1427-1434.
Felsenstein, J. 2008. Comparative methods with sampling error and within-species
variation: contrasts revisited and revised. American Naturalist 171: 713-725.
Fink, W. L. 1986. Microcomputers and phylogenetic analysis. Science 234: 1135-1139.
Fitch, W. M., and E. Markowitz. 1970. An improved method for determining
codon variability in a gene and its application to the rate of fixation of
mutations in evolution. Biochemical Genetics 4: 579-593.
Fitch, W. M., and E. Margoliash. 1967. Construction of phylogenetic
trees. Science 155: 279-284.
Fitch, W. M. 1971. Toward defining the course of evolution: minimum
change for a specified tree topology. Systematic Zoology 20: 406-416.
Fitch, W. M. 1975. Toward finding the tree of maximum parsimony. pp. 189-230
in Proceedings of the Eighth International Conference on Numerical Taxonomy,
ed. G. F. Estabrook. W. H. Freeman, San Francisco.
George, D. G., L. T. Hunt, and W. C. Barker. 1988. Current methods in
sequence comparison and analysis. pp. 127-149 in Macromolecular Sequencing
and Synthesis, ed. D. H. Schlesinger. Alan R. Liss, New York.
Gilmour, R. 2000. Taxonomic markup language: applying XML to systematic data.
Bioinformatics 16: 406-407.
Goldman, N., and Z. Yang. 1994.
A codon-based model of nucleotide substitution for protein-coding DNA
sequences.
Molecular Biology and Evolution 11: 725-736.
Goldstein, D. B., A. Ruíz-Linares, M. Feldman, and L. L. Cavalli-Sforza.
1995. Genetic absolute dating based on microsatellites and the origin of
modern humans. Proceedings of the National Academy of Sciences USA
92: 6720-6727.
Gomberg, D. 1968. "Bayesian" post-diction in an evolution process.
unpublished manuscript, University of Pavia, Italy.
Graham, R. L., and L. R. Foulds. 1982. Unlikelihood that minimal
phylogenies for a realistic biological study can be constructed in
reasonable computational time. Mathematical Biosciences 60: 133-142.
Hasegawa, M. and T. Yano. 1984a. Maximum likelihood method of phylogenetic
inference from DNA sequence data. Bulletin of the Biometric Society of Japan No. 5: 1-7.
Hasegawa, M. and T. Yano. 1984b. Phylogeny and classification of
Hominoidea as inferred from DNA sequence data. Proceedings of the Japan Academy 60 B: 389-392.
Hasegawa, M., Y. Iida, T. Yano, F. Takaiwa, and M. Iwabuchi. 1985a.
Phylogenetic relationships among eukaryotic kingdoms as inferred from
ribosomal RNA sequences. Journal of Molecular Evolution 22: 32-38.
Hasegawa, M., H. Kishino, and T. Yano. 1985b. Dating of the human-ape
splitting by a molecular clock of mitochondrial DNA. Journal of Molecular
Evolution 22: 160-174.
Hendy, M. D., and D. Penny. 1982. Branch and bound algorithms to
determine minimal evolutionary trees. Mathematical Biosciences 59: 277-290.
Higgins, D. G. and P. M. Sharp. 1989. Fast and sensitive
multiple sequence alignments on a microcomputer. Computer Applications in the Biological Sciences (CABIOS) 5: 151-153.
Hochbaum, D. S. and A. Pathria. 1997. Path costs in evolutionary
tree reconstruction. Journal of Computational Biology 4: 163-175.
Holmquist, R., M. M. Miyamoto, and M. Goodman. 1988. Higher-primate
phylogeny - why can't we decide? Molecular Biology and Evolution 5: 201-216.
Inger, R. F. 1967. The development of a phylogeny of frogs.
Evolution 21: 369-384.
Jin, L. and M. Nei. 1990. Limitations of the evolutionary parsimony method
of phylogenetic analysis. Molecular Biology and Evolution 7: 82-102.
Jones, D. T., W. R. Taylor and J. M. Thornton. 1992. The rapid generation of
mutation data matrices from protein sequences. Computer Applications
in the Biosciences (CABIOS) 8: 275-282.
Jukes, T. H. and C. R. Cantor. 1969. Evolution of protein molecules. pp.
21-132 in Mammalian Protein Metabolism, ed. H. N. Munro. Academic Press, New
York.
Kidd, K. K. and L. A. Sgaramella-Zonta. 1971. Phylogenetic analysis: concepts
and methods. American Journal of Human Genetics 23: 235-252.
Kim, J. and M. A. Burgman. 1988. Accuracy of phylogenetic-estimation
methods using simulated allele-frequency data. Evolution 42: 596-602.
Kimura, M. 1980. A simple model for estimating evolutionary rates of base
substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution 16: 111-120.
Kimura, M. 1983. The Neutral Theory of Molecular Evolution. Cambridge
University Press, Cambridge.
Kishino, H. and M. Hasegawa. 1989. Evaluation of the maximum likelihood
estimate of the evolutionary tree topologies from DNA sequence data, and the
branching order in Hominoidea. Journal of Molecular Evolution 29: 170-179.
Kluge, A. G., and J. S. Farris. 1969. Quantitative phyletics and the
evolution of anurans. Systematic Zoology 18: 1-32.
Kosiol, C., and N. Goldman. 2005. Different versions of the Dayhoff rate
matrix. Molecular Biology and Evolution 22: 193-199.
Kuhner, M. K. and J. Felsenstein. 1994. A simulation comparison of
phylogeny algorithms under equal and unequal evolutionary rates.
Molecular Biology and Evolution 11: 459-468 (Erratum 12: 525 1995).
Künsch, H. R. 1989. The jackknife and the bootstrap for general stationary
observations. Annals of Statistics 17: 1217-1241.
Lake, J. A. 1987. A rate-independent technique for analysis of nucleic acid
sequences: evolutionary parsimony. Molecular Biology and Evolution 4: 167-191.
Lake, J. A. 1994. Reconstructing evolutionary trees from DNA and protein
sequences: paralinear distances.
Proceedings of the National Academy of Sciences, USA 91: 1455-1459.
Le Quesne, W. J. 1969. A method of selection of characters in
numerical taxonomy. Systematic Zoology 18: 201-205.
Le Quesne, W. J. 1974. The uniquely evolved character concept and its
cladistic application. Systematic Zoology 23: 513-517.
Lewis, H. R., and C. H. Papadimitriou. 1978. The efficiency of
algorithms. Scientific American 238: 96-109 (January issue).
Lockhart, P. J., M. A. Steel, M. D. Hendy, and D. Penny. 1994.
Recovering evolutionary trees under a more realistic model of sequence
evolution. Molecular Biology and Evolution 11: 605-612.
Luckow, M. and D. Pimentel. 1985. An empirical comparison of
numerical Wagner computer programs. Cladistics 1: 47-66.
Lynch, M. 1990. Methods for the analysis of comparative data in evolutionary
biology. Evolution 45: 1065-1080.
Maddison, D. R. 1991. The discovery and importance of multiple islands of
most-parsimonious trees. Systematic Zoology 40: 315-328.
Margush, T. and F. R. McMorris. 1981. Consensus n-trees. Bulletin of Mathematical Biology 43: 239-244.
Muse, S. V. and B. S. Gaut. 1994.
A likelihood approach for comparing synonymous and nonsynonymous nucleotide
substitution rates, with application to the chloroplast genome.
Molecular Biology and Evolution 11: 715-724.
Nelson, G. 1979. Cladistic analysis and synthesis: principles and definitions,
with a historical note on Adanson's Familles des Plantes
(1763-1764). Systematic Zoology 28: 1-21.
Nei, M. 1972. Genetic distance between populations. American Naturalist 106: 283-292.
Nei, M. and W.-H. Li. 1979. Mathematical model for studying genetic variation
in terms of restriction endonucleases. Proceedings of the National Academy of Sciences, USA 76: 5269-5273.
Nei, M. and T. Gojobori. 1986. Simple methods for estimating the numbers of
synonymous and nonsynonymous nucleotide substitutions. Molecular Biology and
Evolution 3: 418-426.
Nielsen, R., and Z. Yang. 1998. Likelihood models for detecting positively
selected amino acid sites and applications to the HIV-1 envelope gene.
Genetics 148: 929-936.
Nixon, K. C. 1999. The parsimony ratchet, a new method for rapid parsimony
analysis. Cladistics 15: 407-414.
Page, R. D. M. 1989. Comments on component-compatibility in historical
biogeography. Cladistics 5: 167-182.
Penny, D. and M. D. Hendy. 1985. Testing methods of evolutionary tree
construction. Cladistics 1: 266-278.
Platnick, N. 1987. An empirical comparison of microcomputer parsimony
programs. Cladistics 3: 121-144.
Platnick, N. 1989. An empirical comparison of microcomputer parsimony
programs. II. Cladistics 5: 145-161.
Reynolds, J. B., B. S. Weir, and C. C. Cockerham. 1983. Estimation of the
coancestry coefficient: basis for a short-term genetic
distance. Genetics 105: 767-779.
Robinson, D. F. and L. R. Foulds. 1979. Comparison of weighted
labelled trees. pp. 119-126 in Combinatorial Mathematics VI. Proceedings
of the Sixth Australian Conference on Combinatorial Mathematics, Armidale,
Australia, August, 1978, ed. A. F. Horadam and W. D. Wallis. Lecture Notes in Mathematics, No. 748. Springer-Verlag, Berlin.
Robinson, D. F. and L. R. Foulds. 1981. Comparison of phylogenetic trees.
Mathematical Biosciences 53: 131-147.
Rohlf, F. J. and M. C. Wooten. 1988. Evaluation of the restricted maximum
likelihood method for estimating phylogenetic trees using simulated
allele-frequency data. Evolution 42: 581-595.
Rzhetsky, A., and M. Nei. 1992. Statistical properties of the ordinary
least-squares, generalized least-squares, and minimum-evolution methods
of phylogenetic inference. Journal of Molecular Evolution 35: 367-375.
Saitou, N., and M. Nei. 1987. The neighbor-joining method: a new method for
reconstructing phylogenetic trees. Molecular Biology and Evolution 4: 406-425.
Sanderson, M. J. 1990. Flexible phylogeny reconstruction: a review of
phylogenetic inference packages using parsimony. Systematic Zoology 39: 414-420.
Sankoff, D. D., C. Morel, and R. J. Cedergren. 1973. Evolution of 5S RNA and
the nonrandomness of base replacement. Nature New Biology 245: 232-234.
Shimodaira, H. and M. Hasegawa. 1999. Multiple comparisons of log-likelihoods
with applications to phylogenetic inference. Molecular Biology and
Evolution 16: 1114-1116.
Shimodaira, H. 2002. An approximately unbiased test of phylogenetic
tree selection. Systematic Biology 51: 492-508.
Smouse, P. E. and W.-H. Li. 1987. Likelihood analysis of mitochondrial
restriction-cleavage patterns for the human-chimpanzee-gorilla trichotomy.
Evolution 41: 1162-1176.
Sober, E. 1983a. Parsimony in systematics: philosophical issues. Annual Review of Ecology and Systematics 14: 335-357.
Sober, E. 1983b. A likelihood justification of parsimony. Cladistics 1: 209-233.
Sober, E. 1988. Reconstructing the Past: Parsimony, Evolution,
and Inference. MIT Press, Cambridge, Massachusetts.
Sokal, R. R., and P. H. A. Sneath. 1963. Principles of Numerical
Taxonomy. W. H. Freeman, San Francisco.
Steel, M. A., P. J. Lockhart, and D. Penny. 1993. Confidence in evolutionary trees from biological sequence data. Nature 364: 440-442.
Steel, M. A. 1994. Recovering a tree from the leaf colourations
it generates under a Markov model. Applied Mathematics Letters
7: 19-23.
Studier, J. A. and K. J. Keppler. 1988. A note on the neighbor-joining
algorithm of Saitou and Nei. Molecular Biology and Evolution 5: 729-731.
Swofford, D. L. and G. J. Olsen. 1990. Phylogeny reconstruction. Chapter
11, pages 411-501 in Molecular Systematics, ed. D. M. Hillis and C. Moritz.
Sinauer Associates, Sunderland, Massachusetts.
Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996.
Phylogenetic inference. pp. 407-514 in Molecular Systematics, 2nd ed.,
ed. D. M. Hillis, C. Moritz, and B. K. Mable. Sinauer Associates, Sunderland,
Massachusetts.
Templeton, A. R. 1983. Phylogenetic inference from restriction endonuclease
cleavage site maps with particular reference to the evolution of humans and the
apes. Evolution 37: 221-244.
Thompson, E. A. 1975. Human Evolutionary Trees. Cambridge University
Press, Cambridge.
Veerassamy, S., A. Smith and E. R. M. Tillier. 2003.
A transition probability model for amino acid substitutions from Blocks.
Journal of Computational Biology 10: 997-1010.
Wright, S. 1934. An analysis of variability in number of digits in an inbred
strain of guinea pigs. Genetics 19: 506-536.
Wu, C. F. J. 1986. Jackknife, bootstrap and other resampling plans in
regression analysis. Annals of Statistics 14: 1261-1295.
Yang, Z. 1993. Maximum-likelihood estimation of phylogeny from DNA sequences
when substitution rates differ over sites. Molecular Biology and
Evolution 10: 1396-1401.
Yang, Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences
with variable rates over sites: approximate methods. Journal of Molecular
Evolution 39: 306-314.
Yang, Z. 1995. A space-time process model for the evolution of DNA sequences.
Genetics 139: 993-1005.
Yang, Z. 1998. Likelihood ratio tests for detecting positive selection and
application to primate lysozyme evolution. Molecular Biology and
Evolution 15: 568-573.
Yang, Z., and R. Nielsen. 1998. Synonymous and nonsynonymous rate variation in
nuclear genes of mammals. Journal of Molecular Evolution 46: 409-418.
Yang, Z. 2006. Computational Molecular Evolution. Oxford University
Press, Oxford.
Zharkikh, A. and W.-H. Li. 1995. Estimation of confidence in phylogeny:
the complete-and-partial bootstrap technique. Molecular Phylogenetics and
Evolution 4: 44-63.
Over the years various granting agencies have contributed to the
support of the PHYLIP project (at first without knowing it). They are:
However, starting in April, 2009 there is no grant support for
PHYLIP.
I am particularly grateful to past program administrators William Moore,
Irene Eckstrand, Peter Arzberger, and Conrad Istock, who have
gone beyond the call of duty to make sure that PHYLIP continued.
Booby prizes for funding are awarded to:
The original Camin-Sokal parsimony program and the polymorphism parsimony
program were written by me in 1977 and 1978. They were Pascal versions of
earlier FORTRAN programs I wrote in 1966 and 1967 using the same algorithm to
infer phylogenies under the Camin-Sokal and polymorphism parsimony
criteria. Harvey Motulsky worked for me as a programmer in 1971 and wrote
FORTRAN programs to carry out the Camin-Sokal, Dollo, and polymorphism
methods (he is better-known these days as the author of the scientific
data analysis package GraphPad). But most of the early work on PHYLIP other
than my own was by Jerry
Shurman and Mark Moehring. Jerry Shurman worked for me in the summers of
1979 and 1980, and Mark Moehring worked for me in the summers of 1980 and
1981. Both wrote original versions of many of the other programs, based on
the original versions of my Camin-Sokal parsimony program and my polymorphism
parsimony program. These
formed the basis of Version 1 of the Package, first distributed in October,
1980.
Version 2, released in the spring of 1982, involved a fairly complete rewrite
by me of many of those programs. For version 3.3, Hisashi Horino
reworked some parts of the programs Clique and Consense
to make their output more comprehensible, and added some code to the
tree-drawing programs Drawgram and Drawtree as well. He also worked on
some of the Drawtree and Drawgram driver code.
Later programmers Akiko Fuseki, Sean Lamont,
Andrew Keeffe, Daniel Yek, Dan Fineman, Patrick Colacurcio,
Mike Palczewski, Doug Buxton, Ian Robertson, Marissa LaMadrid, Eric Rynes,
and Elizabeth Walkup gave
me substantial help with the 3.6 releases, and their excellent work is
greatly appreciated. Akiko, in over 10 years of excellent work, did much of the
hard work of adding
new features and changing old ones in the 3.4 and 3.5 releases,
centralized many of the C routines in support files, and is responsible for the
new versions of Dnapars and Pars. Andrew
prepared the Macintosh version, wrote Retree, added the ray-tracing
and PICT code to the Draw... programs and has since done much other work. Sean
was central to the conversion to
C, and tested it extensively. Mike Palczewski reorganized the code and
centralized routines, bringing us closer to object-oriented structure.
My (then) postdoctoral fellow
Mary Kuhner and her associate Jon Yamato created Neighbor, the
neighbor-joining and UPGMA program, for the current release, for which I am
also grateful (Naruya Saitou and Li Jin kindly encouraged us to use some of the
code from their own implementation of this method). Lucas Mix created
the protein likelihood programs Protml and Protmlk. Elisabeth Tillier
provided the code for her PMB amino acid model. My current programmers
Jim McGill and Bob Giansiracusa have made a great contribution to
getting the current version working.
I am very grateful to over 400
users for algorithmic suggestions, complaints about features (or lack of
features), and information about the behavior of their operating systems
and compilers. A list of some of their names will be found at
the
credits page on the PHYLIP web site which is at
A major contribution to this package has been made by others
writing programs or parts of programs. Chris Meacham contributed the
important program Factor, long demanded by users, and the even more
important ones PLOTREE and PLOTGRAM. Important parts of the code in
Drawgram and Drawtree were taken over from those two programs.
Kent Fiala wrote
function "reroot" to do outgroup-rooting, which was an essential part of many
programs in earlier versions. Someone at the Western Australia Institute of
Technology suggested the name PHYLIP (by writing it on the label on the
outside of a magnetic tape). Probably it was the late Julian Ford
(I've lost the relevant letter).
The distribution of the package also owes much to Buz Wilson and Willem Ellis,
who put a lot of effort into the early distributions of the PCDOS and
Macintosh versions respectively. Christopher Meacham and Tom Duncan for three
versions distributed a printed version of these documentation files (they could
not continue to do so), and I am
very grateful to them for those efforts. William H.E. Day and F. James Rohlf
were very helpful in setting up the listserver news bulletin service which
succeeded the PHYLIP newsletter for a time.
I also wish to thank the people who have made computer resources available to
me, mostly in the loan of use of microcomputers. These include Jeremy
Field, Clem Furlong, Rick Garber, Dan Jacobson, Rochelle Kochin, Monty Slatkin,
Jim Archie, Jim Thomas, and George Gilchrist.
I should also note the computers used to develop this package:
These include a CDC 6400, two DECSystem 1090s, my trusty old SOL-20, my
old Osborne-1, a VAX 11/780, a VAX 8600, a MicroVAX I, a DECstation
3100, my old Toshiba 1100+, my
DECstation 5000/200, a DECstation 5000/125, a Compudyne 486DX/33, a
Trinity Genesis 386SX, a Zenith Z386, a Mac Classic, a DEC Alphastation 400
4/233, a Pentium 120, a Pentium 200, a PowerMac 6100, and a Macintosh G3.
(One of the reasons
we have been successful in achieving compatibility between different computer
systems is that I have had to run them myself under so many different operating
systems and compilers).
A comprehensive list of phylogeny programs is maintained at the PHYLIP
web site on the Phylogeny Programs pages:
Here we will simply mention some of the major general-purpose programs. For
many more and much more, see those web pages.
PAUP* A comprehensive program with parsimony, likelihood, and
distance matrix methods. In about 1995 it succeeded PHYLIP as the program
responsible for the most published trees. Written by David Swofford, now of Duke
University. Distributed for free from
the PAUP* website
at https://paup.phylosolutions.com/.
MrBayes The leading program for Bayesian inference of
phylogenies. It uses Markov Chain Monte Carlo inference to assess
support for clades and to infer posterior distributions of parameters.
Produced by John Huelsenbeck and Fredrik Ronquist, it is available at
https://nbisweden.github.io/MrBayes/ as a MacOS or Windows
executable, or in source code in C.
MEGA A program by Sudhir Kumar of Arizona State University
(written together with Koichiro Tamura, Joel Dudley and Masatoshi Nei).
It can carry out parsimony and distance matrix methods
for DNA sequence data. Version 4 for Windows, Macintosh, and Linux
can be downloaded from
the MEGA web site
at https://www.megasoftware.net.
PAML Ziheng Yang of the Department of Genetics and Biometry at
University College, London has written this package of programs to
carry out likelihood analysis of DNA and protein sequence data.
It is one of the few packages able to use the codon model for protein
sequence data which takes the genetic code reasonably fully into account.
PAML is particularly strong in the options for coping with variability of rates
of evolution from site to site, though it is less able than some other
packages to search effectively for the best tree. It is available as
C source code and as MacOS and Windows executables from its web site at
http://abacus.gene.ucl.ac.uk/software/paml.html
Phyml
Stephane Guindon, currently of the University of Auckland, New Zealand,
has written Phyml, a fast likelihood program for molecular sequence data.
It is available as binaries from its page at the ATGC site in France:
https://www.atgc-montpellier.fr/phyml/binaries.ph
Source code for Phyml, including later developments of the program,
is available at
https://github.com/stephaneguindon/phyml/
RAxML Alexis Stamatakis, of the Exelixis Lab at the
Technische Universität München has written RAxML, a very fast
likelihood program for molecular sequences. It is available from
his Github software web page: https://github.com/stamatak/standard-RAxML
It seems to be the fastest implementation of likelihood for molecular data.
TNT This program, by Pablo Goloboff, J. S. Farris, and Kevin Nixon,
is for searching large data sets for most parsimonious trees.
The authors are respectively at the Instituto Miguel Lillo in Tucumán,
Argentina, the Naturhistoriska Riksmuseet in Stockholm, Sweden, and the
Hortorium, Cornell University, Ithaca, New York.
TNT is described
as faster than other methods, though not faster than NONA for small to
medium data sets. It is distributed for free as
Windows, Linux, and MacOS executables, and
some support files including documentation,
from its download site:
http://www.lillo.org.ar/phylogeny/tnt/
DAMBE A package written by Xuhua Xia of the
Department of Biology of the University of Ottawa.
Its initials stand for Data Analysis in Molecular Biology and Evolution.
DAMBE is a general-purpose package for DNA and protein sequence phylogenies.
It can read and
convert a number of file formats, has many features for
descriptive statistics, can compute a number of commonly-used
distance matrix measures, and can infer phylogenies by parsimony, distance,
or likelihood methods, including bootstrapping and jackknifing. A number
of kinds of statistical tests of trees are available, and it
can also display phylogenies. DAMBE includes a copy of ClustalW as well.
DAMBE consists of Windows executables. It is available from its
web site at http://dambe.bio.uottawa.ca/DAMBE/dambe.aspx.
These are only a few of the over 400 different phylogeny packages that
were available when this was written (as of July, 2010 - the number keeps increasing). Prior to
2013, a list of all known programs and packages was available at
my Phylogeny Programs web pages at the address given above. That list has
not been updated since then owing to the burden of doing so.
Simply let me know of any problems you have had adapting the
programs to your computer. I can often make "transparent" changes that, by
making the code avoid the wilder, woolier, and less standard parts of
C, not only help others who have your machine but even improve the
chance of the programs functioning on new machines. I would like fairly
detailed information on what gave trouble, on what operating system,
machine, and (if relevant) compiler, and what had to be done to make the
programs work. Electronic mail is the best
way for me to be asked about problems, as you can include your
input and output files so I can see what is going on (please do not
send them as Attachments, but as part of the body of a message). I'd really
like these programs to be
able to run with only routine changes on absolutely everything, down to
and possibly including the Amana Touchmatic Radarange Microwave Oven
which was an Intel 8080 system (in fact, early versions of this package did
run successfully on Intel 8080 systems running the CP/M operating system).
Versions for Android and iOS have been contemplated too.
I would also like to know timings of programs from the package, when
run on the three test input files provided above, for various computer and
compiler combinations, so that I can provide this information in the
section on speeds of this document.
For the phylogeny plotting programs Drawgram and Drawtree,
I am particularly interested in knowing what has to be done
to adapt them for other graphic file formats.
You can also be helpful to PHYLIP users in your part of the world by
helping them get the latest version of PHYLIP from our web site
and by helping them with any
problems they may have in getting PHYLIP working on their data.
Your help is appreciated. I am always happy to hear suggestions
for features and programs that ought to be incorporated in the package,
but please do not be upset if I turn out to have already considered the
particular possibility you suggest and decided against it.
Read The (documentation) Files Meticulously ("RTFM"). If that doesn't solve the
problem, please check the Frequently Asked Questions web page at the
PHYLIP web site:
http://evolution.gs.washington.edu/phylip/faq.html
and the PHYLIP Bugs web page at that site:
http://evolution.gs.washington.edu/phylip/bugs.html
If none of these answers your question, get in touch with me. My email address
is given below. If you do ask about a problem, please specify the program
name, version of the package, computer operating system, and
send me your data file so I can test the problem. Also it will help if you
have the relevant output and documentation files so that you
can refer to them in any correspondence.
Particularly if you are in a part of the world distant from me, you may also
want to try to get in touch with other users of PHYLIP nearby. I can also,
if requested, provide a list of nearby users.
Electronic mail addresses: joe (at) gs.washington.edu
Running the programs in background or under control of a command file
An example (Unix, Linux or MacOS)
sequences.dat
Y
Subtleties (in Unix, Linux, or MacOS)
seqboot < input1 > screenout
mv outfile infile
dnapars < input2 >> screenout
mv outtree intree
consense < input3 >> screenout
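The command sequence above can be wrapped so that a missing keyboard response file stops the run instead of leaving a program waiting for input. This is only a minimal sketch, assuming a POSIX shell; run_step is a hypothetical helper and not part of PHYLIP, while input1, input2, input3, and screenout are the file names used in the example.

```shell
#!/bin/sh
# run_step PROGRAM RESPONSES: hypothetical wrapper (not part of PHYLIP).
# Feeds the keyboard response file to the program and appends whatever
# the program prints to the file "screenout".
run_step() {
    prog=$1
    responses=$2
    if ! test -r "$responses"; then
        echo "cannot read response file: $responses" >&2
        return 1
    fi
    "$prog" < "$responses" >> screenout
}

# Intended use, mirroring the sequence above (each step runs only if
# the previous one succeeded):
#   : > screenout
#   run_step seqboot input1 && mv outfile infile &&
#   run_step dnapars input2 && mv outtree intree &&
#   run_step consense input3
```

Chaining the steps with && means that a failed step leaves the earlier files in place for inspection rather than renaming them.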
An example (Windows)
sequences.dat
Y
Testing for existence of files
if test -e fubarfile
then
rm fubarfile
fi
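The existence test above generalizes to a small helper that clears several stale files before a batch run. This is a minimal sketch, assuming a POSIX shell; the name remove_if_exists is hypothetical, and outfile and outtree are simply the output names that appear in the examples in this document.

```shell
#!/bin/sh
# remove_if_exists FILE...: hypothetical helper (not part of PHYLIP).
# Deletes each named file only if it is present, so that an absent
# file does not cause an error message from rm.
remove_if_exists() {
    for f in "$@"; do
        if test -e "$f"; then
            rm -- "$f"
        fi
    done
}

# Example: clear the usual output names before starting a run.
remove_if_exists outfile outtree
```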
Prototyping keyboard response files
R
U
Y
R
Preparing Input Files
Input and output files
                  ,-------------------.
                  |                   |
infile ---------> |                   |
                  |                   |
intree ---------> |                   | -----------> outfile
                  |                   |
weights --------> |      program      | -----------> outtree
                  |                   |
categories -----> |                   | -----------> plotfile
                  |                   |
fontfile -------> |                   |
                  |                   |
                  `-------------------'
dnaml: can't find input file "infile"
Please enter a new file name>
Where the files are
Data file format
6 13
Archaeopt CGATGCTTAC CGC
HesperorniCGTTACTCGT TGT
BaluchitheTAATGTTAAT TGT
B. virginiTAATGTTCGT TGT
BrontosaurCAAAACCCAT CAT
B.subtilisGGCAGCCAAT CAC
6 39
Archaeopt CGATGCTTAC CGCCGATGCT
HesperorniCGTTACTCGT TGTCGTTACT
BaluchitheTAATGTTAAT TGTTAATGTT
B. virginiTAATGTTCGT TGTTAATGTT
BrontosaurCAAAACCCAT CATCAAAACC
B.subtilisGGCAGCCAAT CACGGCAGCC
TACCGCCGAT GCTTACCGC
CGTTGTCGTT ACTCGTTGT
AATTGTTAAT GTTAATTGT
CGTTGTTAAT GTTCGTTGT
CATCATCAAA ACCCATCAT
AATCACGGCA GCCAATCAC
1101
0011001101
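A quick consistency check of a data file against its first line can catch miscounted species or sites before a program is run. The following is a sketch only, assuming a POSIX shell with awk and assuming the simplest layout shown in the first example above: each species on a single line, with the name in the first 10 columns. It does not handle interleaved files or sequences continued on later lines, and check_phylip is a hypothetical name.

```shell
#!/bin/sh
# check_phylip FILE: hypothetical helper (not part of PHYLIP).
# Assumes one line per species: a 10-column name, then the sequence.
# Exits with status 0 if every species line has the number of sites
# announced on the first line, and with status 1 otherwise.
check_phylip() {
    awk '
        NR == 1 { ntax = $1; nsites = $2; next }
        NR <= ntax + 1 {
            seq = substr($0, 11)      # skip the 10-column species name
            gsub(/[ \t]/, "", seq)    # drop blanks inside the sequence
            if (length(seq) != nsites) {
                printf "line %d: %d sites, expected %d\n", NR, length(seq), nsites
                bad = 1
            }
        }
        END {
            if (NR < ntax + 1) { print "too few species lines"; bad = 1 }
            exit (bad ? 1 : 0)
        }
    ' "$1"
}

# Usage (commented out so the sketch has no side effects):
#   check_phylip infile && echo "infile looks consistent"
```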
The Menu
DNA parsimony algorithm, version 3.695
Setting for this run:
U Search for best tree? Yes
S Search option? More thorough search
V Number of trees to save? 10000
J Randomize input order of sequences? No. Use input order
O Outgroup root? No, use as outgroup species 1
T Use Threshold parsimony? No, use ordinary parsimony
N Use Transversion parsimony? No, count all steps
W Sites weighted? No
M Analyze multiple data sets? No
I Input sequences interleaved? Yes
0 Terminal type (IBM PC, ANSI, none)? ANSI
1 Print out the data at start of run No
2 Print indications of progress of run Yes
3 Print out tree Yes
4 Print out steps in each site No
5 Print sequences at all nodes of tree No
6 Write out trees onto tree file? Yes
Y to accept these or type the letter for one to change
The Output File
+-------------------Gibbon
+----------------------------2
! ! +------------------Orang
! +------4
! ! +---------Gorilla
+-----3 +--6
! ! ! +---------Chimp
! ! +----5
--1 ! +-----Human
! !
! +-----------------------------------------------Mouse
!
+------------------------------------------------Bovine
+-----------------------------------------------Mouse
!
+---------4 +------------------Orang
! ! +------3
! ! ! ! +---------Chimp
---6 +----------------------------1 ! +----2
! ! +--5 +-----Human
! ! !
! ! +---------Gorilla
! !
! +-------------------Gibbon
!
+-------------------------------------------Bovine
remember: this is an unrooted tree!
                   +--Human
                +--5
             +--4  +--Chimp
             !  !
          +--3  +-----Gorilla
          !  !
       +--2  +--------Orang
       !  !
    +--1  +-----------Gibbon
    !  !
  --6  +--------------Mouse
    !
    +-----------------Bovine
remember: this is an unrooted tree!
 Between        And            Length      Approx. Confidence Limits
 -------        ---            ------      ------- ---------- ------
    1          Bovine          0.90216     (  0.50346,   1.30086) **
    1          Mouse           0.79240     (  0.42191,   1.16297) **
    1          2               0.48553     (  0.16602,   0.80496) **
    2          3               0.12113     (     zero,   0.24676) *
    3          4               0.04895     (     zero,   0.12668)
    4          5               0.07459     (  0.00735,   0.14180) **
    5          Human           0.10563     (  0.04234,   0.16889) **
    5          Chimp           0.17158     (  0.09765,   0.24553) **
    4          Gorilla         0.15266     (  0.07468,   0.23069) **
    3          Orang           0.30368     (  0.18735,   0.41999) **
    2          Gibbon          0.33636     (  0.19264,   0.48009) **
      *  = significantly positive, P < 0.05
      ** = significantly positive, P < 0.01
steps in each site:
         0   1   2   3   4   5   6   7   8   9
     *-----------------------------------------
    0!       2   2   2   2   1   1   2   2   1
   10!   1   2   3   1   1   1   1   1   1   2
   20!   1   2   2   1   2   2   1   1   1   2
   30!   1   2   1   1   1   2   1   3   1   1
   40!   1
The Tree File
((Mouse,Bovine),(Gibbon,(Orang,(Gorilla,(Chimp,Human)))));
(A,(B,(C,D)),(E,F));
((cat:47.14069,(weasel:18.87953,((dog:25.46154,(raccoon:19.19959,
bear:6.80041):0.84600):3.87382,(sea_lion:11.99700,
seal:12.00300):7.52973):2.09461):20.59201):25.0,monkey:75.85931);
The Options and How To Invoke Them
Common options in the menu
((Alligator,Bear),((Cow,(Dog,Elephant)),Ferret));
((Alligator,Bear),(((Cow,Dog),Elephant),Ferret));
((Alligator,Bear),((Cow,Dog),(Elephant,Ferret)));
5 6
Alpha CCACCA
Beta CCAAAA
Gamma CAACCA
Delta AACAAC
Epsilon AACCCA
5 6
Alpha CACACA
Beta CCAACC
Gamma CAACAC
Delta GCCTGG
Epsilon TGCAAT
The Algorithm for Constructing Trees
Local rearrangements
T1 T2 T3
\ / /
\ / /
\ / /
\/ /
* /
* /
* /
* /
*
!
!
T3 T2 T1 T1 T3 T2
\ / / \ / /
\ / / \ / /
\ / / \ / /
\ / / \ / /
\ / \ /
\ / \ /
\ / \ /
\ / \ /
! !
! !
! !
Global rearrangements
Multiple jumbles
Saving multiple tied trees
Strategy for finding the best tree
Nixon's search strategy
It will not necessarily be fast to do this, as the last step may be slow.
But the resampling will cause emphasis on different sets of characters in
the initial searches, allowing the process to explore regions of tree
space not usually examined by conventional rearrangement strategies.
A Warning on Interpreting Results
Relative Speed of Different Programs and Machines
Relative speed of the different programs
10 40
A CACACACAAAAAAAAAAACAAAAAAAAAAAAAAAAAAAAA
B CACACAACAAAAAAAAAACAAAAAAAAAAAAAAAAAAAAA
C CACAACAAAAAAAAAAAACAAAAAAAAAAAAAAAAAAAAA
D CAACAAAACAAAAAAAAACAAAAAAAAAAAAAAAAAAAAA
E CAACAAAAACAAAAAAAACAAAAAAAAAAAAAAAAAAAAA
F ACAAAAAAAACACACAAAACAAAAAAAAAAAAAAAAAAAA
G ACAAAAAAAACACAACAAACAAAAAAAAAAAAAAAAAAAA
H ACAAAAAAAACAACAAAAACAAAAAAAAAAAAAAAAAAAA
I ACAAAAAAAAACAAAACAACAAAAAAAAAAAAAAAAAAAA
J ACAAAAAAAAACAAAAACACAAAAAAAAAAAAAAAAAAAA
10 40
MesohippusAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
HypohippusAAACCCCCCCAAAAAAAAACAAAAAAAAAAAAAAAAAAAA
ArchaeohipCAAAAAAAAAAAAAAAACACAAAAAAAAAAAAAAAAAAAA
ParahippusCAAACAACAACAAAAAAAACAAAAAAAAAAAAAAAAAAAA
MerychippuCCAACCACCACCCCACACCCAAAAAAAAAAAAAAAAAAAA
M. secunduCCAACCACCACCCACACCCCAAAAAAAAAAAAAAAAAAAA
Nannipus CCAACCACAACCCCACACCCAAAAAAAAAAAAAAAAAAAA
NeohippariCCAACCCCCCCCCCACACCCAAAAAAAAAAAAAAAAAAAA
Calippus CCAACCACAACCCACACCCCAAAAAAAAAAAAAAAAAAAA
PliohippusCCCACCCCCCCCCACACCCCAAAAAAAAAAAAAAAAAAAA
10 40
A CACACAACCAAACAAACCACAAAAAAAAAAAAAAAAAAAA
B AAACCACACACACAAACCCAAAAAAAAAAAAAAAAAAAAA
C ACAAAACCAAACCACCCACAAAAAAAAAAAAAAAAAAAAA
D AAAAACACAACACACCAAACAAAAAAAAAAAAAAAAAAAA
E AAACAACCACACACAACCAAAAAAAAAAAAAAAAAAAAAA
F CCCAAACACCCCCAAAAAACAAAAAAAAAAAAAAAAAAAA
G ACACCCCCACACCCACCAACAAAAAAAAAAAAAAAAAAAA
H AAAACAACAACCACCCCACCAAAAAAAAAAAAAAAAAAAA
I ACACAACAACACAAACAACCAAAAAAAAAAAAAAAAAAAA
J CCAAAAACACCCAACCCAACAAAAAAAAAAAAAAAAAAAA
            Hennigian Data    Horses Data    Random Data
Protpars    0.00500           0.00670        0.01289
Dnapars     0.01050           0.00940        0.00980
Dnapenny    0.01400           0.00860        1.71100
Dnacomp     0.00240           0.00250        0.00590
Dnaml       0.17749           0.23970        0.21350
Dnamlk      0.21740           0.19450        0.24400
Proml       1.3527            3.2085         2.0055
Promlk      3.3567            8.6078         4.4886
Dnainvar    0.00020           0.00020        0.00020
Dnadist     0.00140           0.00080        0.00150
Protdist    0.09220           0.09210        0.09310
Restml      0.14560           0.28810        0.21540
Restdist    0.00110           0.00090        0.00080
Fitch       0.00760           0.01280        0.00880
Kitsch      0.00180           0.00260        0.00280
Neighbor    0.00020           0.00050        0.00050
Contml      0.01310           0.01500        0.01780
Gendist     0.00070           0.00070        0.00070
Pars        0.00780           0.00610        0.02930
Mix         0.00360           0.00410        0.00610
Penny       0.00190           0.00470        0.8060
Dollop      0.00480           0.00450        0.00820
Dolpenny    0.00200           0.01060        1.1270
Clique      0.00100           0.00070        0.00130
Speed with different numbers of species
Relative speed of different machines
Machine                     Operating System  Compiler               Total      Time      Relative Speed
Toshiba T1100+              MSDOS             Turbo Pascal 3.01A     (269)      10542     0.00009486
Apple Mac Plus              Mac OS            Lightspeed Pascal 2    (175.84)   6891      0.00014511
Toshiba T1100+              MSDOS             Turbo Pascal 5.0       (162)      6349      0.00015750
Macintosh Classic           Mac OS            Think Pascal 3         (160)      6271      0.00015947
Macintosh Classic           Mac OS            Think C                (43.0)     4771      0.0002096
IBM PS2/60                  MSDOS             Turbo Pascal 5.0       (58.76)    2303      0.0004343
80286 (12 Mhz)              MSDOS             Turbo Pascal 5.0       (47.09)    1845.4    0.0005419
Apple Mac IIcx              Mac OS            Think Pascal 3         (42)       1645.5    0.0006077
Apple Mac SE/30             Mac OS            Think Pascal 3         (42)       1645.6    0.0006077
Apple Mac IIcx              Mac OS            Lightspeed Pascal 2    (39.84)    1561.6    0.0006404
Apple Mac IIcx              Mac OS            Lightspeed Pascal 2#   (39.69)    1555.0    0.0006431
Zenith Z386 (16MHz)         MSDOS             Turbo Pascal 5.0       (38.27)    1539.0    0.0006498
Macintosh SE/30             Mac OS            Think C                (13.6)     1508.4    0.0006630
386SX (16 MHz)              MSDOS             Turbo Pascal 6.0       (34)       1333.6    0.0007498
386SX (16 MHz)              MSDOS             Microsoft Quick C      (12.01)    1333.6    0.0007499
Sequent-S81                 DYNIX             Silicon Valley Pascal  (13.0)     509.0     0.0019646
VAX 11/785                  Unix              Berkeley Pascal        (11.9)     466.3     0.002144
80486-33                    MSDOS             Turbo Pascal 6.0       (11.46)    449.0     0.002227
Sun 3/60                    SunOS             Sun C                  (3.93)     435.7     0.002295
NeXT Cube (68030)           Mach              Gnu C                  (2.608)    289.3     0.003456
Sequent S-81                DYNIX             Sequent Symmetry C     (2.604)    288.9     0.003461
VAXstation 3500             Unix              Berkeley Pascal        (7.3)      286.5     0.003491
Sequent S-81                DYNIX             Berkeley Pascal        (5.6)      219.5     0.004557
Unisys 7000/40              Unix              Berkeley Pascal        (5.24)     205.3     0.004870
VAX 8600                    VMS               DEC VAX Pascal         (3.96)     155.23    0.006442
Sun SPARC IPX               SunOS             Gnu C version 2.1      (1.28)     142.04    0.007040
VAX 6000-530                VMS               DEC C                  (0.858)    95.14     0.010511
VAXstation 4000             VMS               DEC C                  (0.809)    89.81     0.011135
IBM RS/6000 540             AIX               XLP Pascal             (2.276)    89.14     0.011219
NeXTstation(040/25)         Mach              Gnu C                  (0.75)     83.15     0.012027
Sun SPARC IPX               SunOS             Sun C                  (0.68)     75.43     0.01326
486DX (33 MHz)              Linux             Gnu C #                (0.63)     69.95     0.01430
Sun SPARCstation-1          Unix              Sun Pascal             (1.7)      66.62     0.01501
DECstation 5000/200         Unix              DEC Ultrix C           (0.45)     49.97     0.02001
Sun SPARC 1+                SunOS             Sun C                  (0.40)     44.37     0.02254
DECstation 3100             Unix              DEC Ultrix Pascal      (0.77)     30.11     0.03321
IBM 3090-300E               AIX               Metaware High C        (0.27)     29.98     0.03336
DECstation 5000/125         Unix              DEC Ultrix C           (0.267)    29.58     0.03381
DECstation 5000/200         Unix              DEC Ultrix C           (0.256)    28.38     0.03524
Sun SPARC 4/50              SunOS             Sun C                  (0.249)    27.62     0.03621
DEC 3000/400 AXP            Unix              DEC C                  (0.224)    24.85     0.04024
DECstation 5000/240         Unix              DEC Ultrix C           (0.1889)   20.96     0.04771
SGI Iris R4000              Unix              SGI C                  (0.184)    20.41     0.04898
IBM 3090-300E               VM                Pascal VS              (0.464)    18.12     0.05519
DECstation 5000/200         Unix              DEC Ultrix Pascal      (0.39)     15.188    0.06583
Pentium 120                 Linux             Gnu C                  1.848      11.953    0.08366
Pentium Pro 180             Linux             Gnu C                  1.009      6.527     0.1532
Pentium 266 MMX             Linux             Gnu C (PHYLIP 3.5)     (0.054)    5.996     0.1668
Pentium 266 MMX             Linux             Gnu C                  0.927      5.996     0.1668
Pentium 200                 Linux             Gnu C                  0.853      5.517     0.1812
SGI PowerChallenge          Irix              Gnu C                  0.844      5.459     0.1832
DEC Alpha 400 4/233         DUNIX             Digital C (cc -fast)   0.730      4.722     0.2118
Pentium II 500              Linux             Gnu C                  0.368      2.380     0.4201
Dual 448/633 MHz Pentiums   Linux             gcc                    0.3069     1.985     0.5037
Sun Ultra 10                Solaris 8         gcc                    0.25848    1.672     0.5981
Macintosh G3 300 MHz        Mac OS X          Gnu C (-O 3)           0.2330     1.5071    0.6635
Compaq/Digital Alpha 500au  DUNIX             Digital C (cc -fast)   0.167      1.080     0.9257
AMD Athlon 1.2 GHz          Linux             gcc                    0.1546     1.0       1.0
Intel Pentium 4 2.26 GHz    Windows XP        Cygwin gcc             0.1078     0.6973    1.434
Pentium 4 1700 MHz          Linux             Gnu C                  0.10730    0.6940    1.441
SGI Fuel R16000/700MHz      IRIX 6.5.30       MipsPro 7.4.4          0.09       0.58      1.72
Macintosh G4 1.2GHz         Mac OS X          Gnu C (-O 3)           0.0582     0.3765    2.656
AMD Athlon 2800 2.1 GHz     Linux             gcc (-O 3)             0.0455     0.2943    3.398
iMac 2 Ghz Intel Core Duo   Mac OS X          gcc (-O 3)             0.0300     0.1940    5.153

Machine                     Operating System  Compiler               Seconds    Time      Relative Speed
386SX 16 Mhz                PCDOS             Turbo Pascal 6         (7826)     1027.55   0.0009732
386SX 16 Mhz                PCDOS             Quick C                (6549.79)  1027.55   0.0009732
Compudyne 486DX/33          Linux             Gnu C                  (1599.9)   251.0     0.003984
SUN Sparcstation 1+         SunOS             Sun C                  (1402.8)   220.1     0.004543
Everex STEP 386/20          PCDOS             Turbo Pascal 5.5       (1440.8)   189.17    0.005286
486DX/33                    PCDOS             Turbo C++              (1107.2)   173.70    0.005757
Compudyne 486DX/33          PCDOS             Waterloo C/386         (1045.78)  164.07    0.006094
Sun SPARCstation IPX        SunOS             Gnu C                  (960.2)    150.64    0.006638
NeXTstation(68040/25)       Mach              Gnu C                  (916.6)    143.80    0.006954
486DX/33                    PCDOS             Waterloo C/386         (861.0)    135.08    0.007403
Sun SPARCstation IPX        SunOS             Sun C                  (787.7)    123.58    0.008091
486DX/33                    PCDOS             Gnu C                  (650.9)    102.12    0.009792
VAX 6000-530                VMS               DEC C                  (637.0)    99.94     0.01001
DECstation 5000/200         Unix              DEC Ultrix RISC C      (423.3)    66.41     0.01506
IBM 3090-300E               AIX               Metaware High C        (201.8)    31.65     0.03159
Convex C240/1024            Unix              C                      (101.6)    15.940    0.06274
DEC 3000/400 AXP            Unix              DEC C                  (98.29)    15.42     0.06485
Pentium 120                 Linux             Gnu C                  25.26      19.230    0.05200
Pentium Pro 180             Linux             Gnu C                  18.88      14.372    0.06957
Pentium 200                 Linux             Gnu C                  16.51      12.569    0.07956
SGI PowerChallenge          IRIX              Gnu C                  12.446     9.475     0.10554
DEC Alpha 400 4/233         Linux             Gnu C (cc -fast)       8.0418     6.122     0.16335
Pentium MMX 266             Linux             Gnu C (PHYLIP 3.5)     (36.15)    5.671     0.17632
Pentium MMX 266             Linux             Gnu C                  7.45       5.671     0.17632
Pentium II 500              Linux             Gnu C                  6.02       4.583     0.2182
Dual 448/633 MHz Pentiums   Linux             Gnu C                  3.7225     2.834     0.3529
Sun Ultra 10                Solaris 8         Gnu C                  3.7101     2.824     0.3541
Pentium 4 1.7 GHz           Linux             Gnu C                  2.0668     1.5734    0.6356
Macintosh G3 300 MHz        Mac OS X          Gnu C (-O 3)           1.805      1.3741    0.7278
Intel Pentium 4 2.26 GHz    Windows XP        Cygwin gcc             1.55457    1.1834    0.8450
AMD Athlon 1.2 GHz          Linux             Gnu C                  1.3136     1.0       1.0
Compaq/Digital Alpha 500au  Linux             Gnu C (cc -fast)       0.9383     0.7143    1.4000
Macintosh G4 1.2 GHz        Mac OS X          Gnu C (-O 3)           0.7080     0.5390    1.8554
SGI Fuel R16000/700Mhz      IRIX 6.5.30       MipsPro 7.4.4          0.55       0.41      2.43
AMD Athlon 2800 2.1 GHz     Linux             gcc (-O 3)             0.3065     0.2333    4.286
iMac 2 Ghz Intel Core Duo   Mac OS X          gcc (-O 3)             0.2535     0.1930    5.182
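
For the rows whose Seconds value is not in parentheses, the Time and Relative Speed columns above appear to be derived from Seconds: Time is the run time divided by that of the AMD Athlon 1.2 GHz row, and Relative Speed is its reciprocal. This reading is inferred from the numbers rather than stated here, and it does not hold for the parenthesized Seconds values. The sketch below is written in Python purely for illustration (PHYLIP itself is in C), and the function name normalize is hypothetical.

```python
def normalize(seconds, reference_seconds):
    """Return (time, relative_speed) for one run against a reference run."""
    time = seconds / reference_seconds   # run time in units of the reference machine
    relative_speed = 1.0 / time          # reciprocal, so larger means faster
    return time, relative_speed


# Seconds copied from two rows of the table above.
ATHLON_SECONDS = 1.3136   # AMD Athlon 1.2 GHz, the row with Time 1.0
IMAC_SECONDS = 0.2535     # iMac 2 Ghz Intel Core Duo

time, speed = normalize(IMAC_SECONDS, ATHLON_SECONDS)
print(f"Time {time:.4f}  Relative speed {speed:.3f}")
```

Rounded to the table's precision, this gives the 0.1930 and 5.182 shown in the iMac row.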
General Comments on Adapting the Package to Different Computer Systems
ln -s /usr/local/bin/font1 font1
ln -s /usr/local/bin/font2 font2
ln -s /usr/local/bin/font3 font3
ln -s /usr/local/bin/font4 font4
ln -s /usr/local/bin/font5 font5
ln -s /usr/local/bin/font6 font6
http://www.cygwin.com
and follow the instructions there. To install it, download their
setup.exe program, which downloads the rest when it is run.
You will need a lot of disk space for it (about a gigabyte).
set MSVC=Path
where Path is where Microsoft Visual Studio is installed.
On our Windows XP system, "Path" is
C:\Program Files\Microsoft Visual Studio 9.0
PATH=%PATH%;%MSVC%\VC\bin;%MSVC%\Common7\IDE
(The "7" is correct here; it is not a typo.)
MSVCPATH="C:\Program Files\Microsoft Visual Studio 9.0\VC"
will need to be changed so that
it points to wherever Microsoft Visual Studio is installed, followed by
\VC (for Visual Studio 9.0).
PATH=%PATH%;%MSVC%\Vc7\bin;%MSVC%\Common7\IDE
MSVCPATH="C:\Program Files\Microsoft Visual Studio .NET\Vc7"
If this does not work with your Visual C++ 7.0 compiler,
then the most likely reason is that your installation
was not placed into the folder C:\Program Files,
or has a name that is not exactly identical to
Microsoft Visual Studio .NET. In that case,
you will need to find the correct path to the Visual C++
7.0 installation on your system, and supply this in the
MSVC variable above, and also in the Makefile. (Note
that in the Makefile, you will need to follow this path
with \Vc7.)
-L"c:\Borland\Bcc55\lib"
-I"c:\Borland\Bcc55\include"
-L"c:\Borland\Bcc55\lib"
set BORLAND=Path
PATH=%PATH%;%BORLAND%\Bin
/* parallelize here */
In particular, within these innermost loops of the programs there are often
scalar quantities that are used for temporary bookkeeping. These quantities
(such as sum1, sum2, zz, z1, yy, y1, aa, bb, cc, sum, and denom in procedure
makenewv of Dnaml, and similar quantities in procedure nuview) are there to
minimize the number of array references. For vectorizing and parallelizing
compilers it will be better to replace them by arrays so that processing can
occur simultaneously.
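
The replacement of scalar temporaries by arrays described above can be sketched as follows. This is only an illustration of the shape of the change, written in Python for brevity. In PHYLIP the edit would be made in the C source, and the arithmetic here is a placeholder, not the computation in makenewv or nuview. The names sum1 and zz are borrowed from the text; the function names are hypothetical.

```python
def with_scalar_temporaries(x, y):
    """Loop that reuses scalar temporaries on every pass."""
    out = []
    for i in range(len(x)):
        sum1 = x[i] + y[i]   # scalar temporary, overwritten each iteration
        zz = x[i] * y[i]     # second scalar temporary
        out.append(sum1 * zz)
    return out


def with_array_temporaries(x, y):
    """Same arithmetic, with each temporary expanded into an array."""
    n = len(x)
    sum1 = [x[i] + y[i] for i in range(n)]   # one slot per iteration
    zz = [x[i] * y[i] for i in range(n)]     # no value is shared between iterations
    return [sum1[i] * zz[i] for i in range(n)]


x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
result = with_array_temporaries(x, y)
```

In the second form no iteration writes to a location that another iteration reads, which is the property a vectorizing or parallelizing compiler needs. The cost is the extra array storage that the scalar form was written to avoid.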
Problems that are encountered
How to make it do various things
(a) In Mix, make up an extra character with states 0 for all the outgroups
and 1 for all the ingroups. If using Dnapars the ingroup can have (say)
G and the outgroup A.
(b) Assign this character an enormous weight (such as Z for 35) using the W
option, all other characters getting weight 1, or whatever weight they had
before.
(c) If it is available, use the A (Ancestral states) option to designate that
for that new character the state found in the outgroup is the ancestral
state.
(d) In Mix do not use the O (Outgroup) option.
(e) After the tree is found, the designated ingroup should have been held
together by the fake character. The tree will be rooted somewhere in the
outgroup (the program may or may not have a preference for one place in
the outgroup over another). Make sure that you subtract from the total
number of steps on the tree all steps in the new character.

In programs like Dnapars, you cannot use this method as weights of sites
cannot be greater than 1. But you can do an analogous trick, by adding a
largish number of extra sites to the data, with one nucleotide state ("A")
for the ingroup and another ("G") for the outgroup. You will then have to
use Retree to manually reroot the tree in the desired place.
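
Steps (a) and (b) for Mix can be sketched as a small data-preparation helper. This is a hedged illustration in Python, not part of PHYLIP: the function name, the species names, and the character strings are all made up, and the sketch only builds the character strings and the weights string, leaving the layout of the actual input and weights files to the program documentation.

```python
def add_marker_character(characters, outgroup, old_weights=None, marker_weight="Z"):
    """Append a 0/1 marker character and extend the weights string to match.

    characters:  dict mapping species name -> string of 0/1 states
    outgroup:    set of species names that get state 0 (all others get 1)
    old_weights: existing weights, one symbol per character (default: all 1)
    """
    n_chars = len(next(iter(characters.values())))
    if old_weights is None:
        old_weights = "1" * n_chars          # every original character has weight 1
    augmented = {
        name: states + ("0" if name in outgroup else "1")
        for name, states in characters.items()
    }
    return augmented, old_weights + marker_weight   # Z stands for weight 35


# Hypothetical five-species data set with six binary characters.
species = {
    "Alpha":   "110110",
    "Beta":    "110000",
    "Gamma":   "100110",
    "Delta":   "001001",
    "Epsilon": "001110",
}
augmented, weights = add_marker_character(species, outgroup={"Delta", "Epsilon"})
```

After the run, the steps counted for the added last character still have to be subtracted from the total by hand, as step (e) says.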
For restriction sites (rather than fragments) life is a bit
easier: they evolve nearly independently so bootstrapping is possible
and Restml can be used, as well as restriction sites distances
computed in Restdist. Also directionality of change
is less ambiguous when parsimony is used. A more complete tour of the
issues for restriction sites and restriction fragments is given in chapter
15 of my book (Felsenstein, 2004).
Background information needed:
"How do I use the programs? I can't find any documentation!"
(No, I am not confining my list largely to books put out by
my own publisher, Oxford University Press. It happens that it
not only has published many books in this area, it has bought out
Sinauer Associates and its extensive line of books on
molecular evolution and phylogeny.)
and as we mentioned above, although it is no longer distributed from
there, it is
available as a PDF at the main PHYLIP website.
Questions about distribution and citation:
Questions about documentation
Additional Frequently Asked Questions, or:
"Why didn't it occur to you to ...
(Fortunately) obsolete questions
"Why didn't it occur to you to ...
New Features in This Version
Coming Attractions, Future Plans
Endorsements
From the pages of Cladistics:
"Under no circumstances can we recommend PHYLIP/WAG [their name for the
Wagner parsimony option of Mix]."
"PHYLIP has not proven very effective in implementing parsimony (Luckow and
Pimentel, 1985)."
"... PHYLIP. This is the computer program where every newsletter concerning
it is mostly bug-catching, some of which have been put there by previous
corrections. As Platnick (1987) documents, through dint of much labor useful
results may be attained with this program, but I would suggest an
easier way: FORMAT b:"
"PHYLIP is bug-infested and both less effective and orders of
magnitude slower than other programs ...."
"Hennig86 [by J. S. Farris] provides such substantial improvements over
previously available programs (for both mainframes and microcomputers) that
it should now become the tool of choice for practising systematists."
... in the pages of other journals:
"The availability, within PHYLIP of distance, compatibility, maximum likelihood,
and generalized `invariants' algorithms (Cavender and Felsenstein, 1987) sets
it apart from other packages .... One of the strengths of PHYLIP is its
documentation ...."
"This package of programs has gradually become a basic necessity to anyone
working seriously on various aspects of phylogenetic inference .... The package
includes more programs than any other known phylogeny package. But it is not
just a collection of cladistic and related programs. The package has great
value added to the whole, and for this it is unique and of extreme
importance .... its various strengths are in the great array of methods
provided ...."
... and in the comments made by users when they register:
"a program on phylogeny --
PHYLOGENY INTERFERENCE PACKAGE (PHYLIP). We would therefore like to ask ..."
"I am struglling with your clever programs."
"I'm famously computer illiterate - I look forward to many frustrating hours trying to run this program"
"I am a brave man. PHYLIP is a brave program. We'll do fine together."
"The Mahabarata of phylogenetics looks better than ever."
"I love phylip. Tastes great and less filling!"
References for the Documentation Files
Years      Agency                       Grant or Contract Number
2005-2009  NIH NIGMS                    R01 GM071639
2003-2007  NIH NIGMS                    R01 GM51929-05 (PI: Mary Kuhner)
1999-2003  NSF                          BIR-9527687
1999-2002  NIH NIGMS                    R01 GM51929-04
1999-2001  NIH NIMH                     R01 HG01989-01
1995-1999  NIH NIGMS                    R01 GM51929-01
1992-1995  National Science Foundation  DEB-9207558
1992-1994  NIH NIGMS Shannon Award      2 R55 GM41716-04
1989-1992  NIH NIGMS                    1 R01-GM41716-01
1990-1992  National Science Foundation  BSR-8918333
1987-1990  National Science Foundation  BSR-8614807
1979-1987  U.S. Department of Energy    DE-AM06-76RLO2225 TA DE-AT06-76EV71005
http://evolution.gs.washington.edu/phylip/credits.html
Other Phylogeny Programs Available Elsewhere
How You Can Help Me
In Case of Trouble
Joe Felsenstein
Department of Genome Sciences
University of Washington
Box 355065
Seattle, Washington 98195-5065, U.S.A.