2019-05-21 - Using ASTRAL to build a species tree from multiple gene trees

ASTRAL is a Java program that builds a species tree from a set of unrooted gene trees.

No installation is required to run ASTRAL, but it needs a Java environment.

ASTRAL has no graphical interface and must be run from the command line.

After running it, you can see the list of ASTRAL options. If it runs without errors, the setup works.

The -o option sets the output file.

The input file is in Newick format and contains all the gene trees. The input gene trees are treated as unrooted, whether or not they are rooted. ASTRAL's output should also be treated as an unrooted tree. The input gene trees may contain multifurcations (polytomies).

The output is in Newick format, which can be viewed by many programs.
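A basic run can be sketched as follows. This is only an illustration: the jar name astral.jar and the file names in.tre and out.tre are placeholders, and the snippet builds the command line without running it. The -i and -o flags are the input and output options mentioned in these notes.

```python
import shlex


def astral_command(gene_trees, output, jar="astral.jar"):
    """Build (but do not run) a basic ASTRAL command line.

    jar, gene_trees and output are placeholder names for illustration.
    """
    return ["java", "-jar", jar, "-i", gene_trees, "-o", output]


cmd = astral_command("in.tre", "out.tre")
print(shlex.join(cmd))  # java -jar astral.jar -i in.tre -o out.tre
```

Building the command as a list keeps file names with spaces intact if it is later passed to subprocess.run.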

ASTRAL measures branch lengths in coalescent units, and its branch support is not the bootstrap value we usually think of.

The -q parameter

The -q parameter scores an existing species tree and reports the quartet score, branch lengths, and branch support values. A normalized quartet score of 0.9 means that 90% of the quartet trees induced by the gene trees are present in the species tree. To score a tree, pass the species tree with -q and the gene trees with -i.

In the example, the gene trees in simulated_14taxon.gene.tre are used to score the species tree simulated_14taxon.default.tre.

The reported score means that 4803 quartet trees from the gene trees are present in the species tree, which is 47.98% of all quartet trees. This dataset has a high level of ILS (incomplete lineage sorting), so the discordance between the gene trees and the species tree is high, which leads to this low score.
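The percentage can be checked by hand. The sketch below assumes that the gene tree file holds 10 gene trees and that every gene tree contains all 14 taxa; the count of 10 is an assumption inferred from the reported percentage, not something stated at this point in the notes.

```python
from math import comb

n_taxa = 14          # taxa in the simulated dataset
n_gene_trees = 10    # assumption: inferred from the reported percentage
quartet_score = 4803

quartets_per_tree = comb(n_taxa, 4)                 # 1001 quartets per gene tree
total_quartets = quartets_per_tree * n_gene_trees   # 10010 quartets in total
normalized = quartet_score / total_quartets

print(f"{normalized:.2%}")  # 47.98%
```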

When you estimate a species tree or score a tree with the -q parameter, you get the branch length and local posterior support for each branch. In addition to these default values, other branch information can be output. Each internal branch of an unrooted tree has four groups around it: the first child (L), the second child (R), the sister group (S), and everything else (O). Pairing these groups two by two gives three topologies. One of them is the topology of the current tree; the other two are the alternatives. ASTRAL can compute the local posterior probability not only of the current topology but also of the two alternatives. This extra output is requested with the -t argument.

Read through the values reported for a few branches and make sure you understand them.
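The three topologies around a branch can be enumerated directly. This is a small illustration of the pairing idea only; the function name is made up and the snippet has nothing to do with how ASTRAL stores trees.

```python
def quartet_topologies(groups=("L", "R", "S", "O")):
    """Return the three ways to split four groups into two pairs."""
    first, *rest = groups
    topologies = []
    for partner in rest:
        others = tuple(g for g in rest if g != partner)
        topologies.append(((first, partner), others))
    return topologies


for left, right in quartet_topologies():
    print("".join(left) + "|" + "".join(right))
# LR|SO  (the current tree: the two children together)
# LS|RO
# LO|RS
```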

ASTRAL calculates the local posterior probabilities and branch lengths using a Yule prior model for the branch lengths of the species tree. The default speciation rate (in coalescent units) of the Yule process is 0.5, which results in a flat prior on quartet frequencies over the range [1/3, 1]. This hyper-parameter can be adjusted with the -c option (I have not fully understood this part).
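The [1/3, 1] range can be made concrete with the standard multispecies coalescent relationship between the length d of an internal branch (in coalescent units) and the expected frequency of the matching quartet topology, f = 1 - (2/3)e^(-d). The sketch below only illustrates that relationship. It is not ASTRAL's own estimator, which also involves the prior, and the function names are made up for this example.

```python
import math


def quartet_frequency(branch_length):
    """Expected frequency of the quartet topology that matches the species
    tree, for an internal branch of the given length (coalescent units)."""
    return 1.0 - (2.0 / 3.0) * math.exp(-branch_length)


def branch_length(frequency):
    """Invert quartet_frequency; valid for 1/3 <= frequency < 1."""
    if not (1.0 / 3.0 <= frequency < 1.0):
        raise ValueError("frequency must lie in [1/3, 1)")
    return -math.log(1.5 * (1.0 - frequency))


# A zero-length branch gives frequency 1/3; longer branches approach 1.
for d in (0.0, 0.5, 1.0, 5.0):
    print(d, quartet_frequency(d))
```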

ASTRAL can output branch support values without bootstrapping. According to the authors' data, this support is more reliable than bootstrapping. Still, you may want to bootstrap. ASTRAL can perform multi-locus bootstrapping; to do so, it needs access to the bootstrap replicate trees for each gene.

For example:

You need to provide the locations of all gene tree bootstrap replicates. To try bootstrapping on the test data:

1. Go to the test_data directory.

2. Unzip the file song_mammals.424genes.bs-trees.zip.

3. Then run

This performs 100 bootstrap replicates.

1. -i gives the file with all the ML gene trees (the same input you would provide without bootstrapping).

2. -b tells ASTRAL to perform bootstrapping. -b is followed by the file bs-files, which lists the paths to the gene tree bootstrap files, one gene per line, e.g.

424genes/100/raxmlboot.gtrgamma/RAxML_bootstrap.allbs
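A list file like this can be generated rather than typed by hand. The helper below is a hypothetical sketch: it assumes every gene has a file called RAxML_bootstrap.allbs somewhere under a common root directory, as in the example path above, and the demo builds a throwaway directory tree just to show the result.

```python
import tempfile
from pathlib import Path


def write_bs_files(root, list_path, filename="RAxML_bootstrap.allbs"):
    """Write one bootstrap-file path per line, one line per gene.

    root is searched recursively for files called `filename`; this
    directory layout is an assumption based on the example path.
    """
    paths = sorted(str(p) for p in Path(root).rglob(filename))
    Path(list_path).write_text("".join(p + "\n" for p in paths))
    return paths


# Demo on a temporary directory with two made-up genes.
with tempfile.TemporaryDirectory() as tmp:
    for gene in ("100", "101"):
        gene_dir = Path(tmp) / "424genes" / gene / "raxmlboot.gtrgamma"
        gene_dir.mkdir(parents=True)
        (gene_dir / "RAxML_bootstrap.allbs").write_text("")
    listed = write_bs_files(Path(tmp) / "424genes", Path(tmp) / "bs-files")
    n_lines = len((Path(tmp) / "bs-files").read_text().splitlines())

print(n_lines)  # 2
```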

The output file contains:

1. 100 bootstrapped replicate trees.

2. A greedy consensus of the 100 bootstrapped replicate trees; this tree has support values drawn on branches based on the bootstrap replicate trees. Support values show the percentage of bootstrap replicates that contain a branch.

3. The "main" ASTRAL tree; this is the result of running ASTRAL on the best_ml input gene trees. This main tree also includes support values, which are again drawn based on the 100 bootstrap replicate trees. (I don't fully get this.)

Note: support values are presented as percentages, whereas local posterior probabilities are numbers between 0 and 1. When ASTRAL performs bootstrapping, it first outputs the ASTRAL tree for every bootstrap replicate. So, if the number of replicates is 100, it outputs 100 trees, followed by the greedy consensus of those 100 bootstrapped trees. Finally, it runs the main analysis (on the file given with -i) and draws branch support on the main tree. In this example the output therefore holds 102 trees.
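To work with the three parts separately, the output can be split by position. This is a minimal sketch that assumes one Newick tree per line, in the order described above (replicate trees, then the greedy consensus, then the main tree); the function name and the toy strings are made up.

```python
def split_bootstrap_output(lines, replicates=100):
    """Split the output into (replicate trees, consensus tree, main tree).

    Assumes one tree per non-empty line and replicates + 2 trees in total.
    """
    trees = [line.strip() for line in lines if line.strip()]
    if len(trees) != replicates + 2:
        raise ValueError(f"expected {replicates + 2} trees, found {len(trees)}")
    return trees[:replicates], trees[replicates], trees[replicates + 1]


# Toy input with 3 replicates; the strings stand in for Newick trees.
toy = ["rep1;", "rep2;", "rep3;", "consensus;", "main;"]
reps, consensus, main = split_bootstrap_output(toy, replicates=3)
print(len(reps), consensus, main)  # 3 consensus; main;
```

For a real file, the lines would come from open(path) and replicates would match the value used for the run.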

The default number of replicates is 100; the -r parameter sets it to any other number. Make sure that each gene's bootstrap file contains at least as many replicates as the number given to -r.

By default ASTRAL performs site-only resampling; the -g parameter enables gene/site resampling.

With gene/site resampling we need more gene tree replicates. With -g -r 100, some genes might need 150 replicates, because when genes are resampled some genes are drawn more often than others.
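The reason some genes need more replicates can be seen in a toy simulation. This is a simplified model and not ASTRAL's implementation: it assumes that in each bootstrap replicate the genes are resampled with replacement, and that every draw of a gene uses up one of that gene's replicate trees. The function name and the seed are arbitrary.

```python
import random
from collections import Counter


def replicate_trees_used(n_genes, n_replicates, seed=1):
    """Count how many replicate trees each gene would use in this toy model."""
    rng = random.Random(seed)
    used = Counter()
    for _ in range(n_replicates):
        # Resample genes with replacement for this bootstrap replicate.
        used.update(rng.choices(range(n_genes), k=n_genes))
    return used


used = replicate_trees_used(n_genes=424, n_replicates=100)
# The average per gene is exactly 100, so the busiest gene needs at least 100.
print(max(used.values()))
```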

ASTRAL can also perform gene-only bootstrapping with the --gene-only option. It needs just one input file, given with the -i argument; do not use the -b argument in this mode.

Since bootstrapping involves a random process, we can give ASTRAL a seed number to ensure reproducibility. The seed is set with -s; the default is 692.

ASTRAL has both an exact and a heuristic version. The exact version runs in a reasonable time when the number of taxa is small, but it cannot handle more than 37 taxa.

The -x parameter turns on the exact version; the run takes about 30 seconds. For comparison, we can run the default heuristic version on the same data.

That takes only about 1 second. So how do the two results differ? In this case they are actually identical.

The default primate dataset we used in the previous step had 424 genes and 14 taxa. Since we have a relatively large number of gene trees, we could reasonably expect the exact and heuristic versions to generate identical output. The key point here is that as the number of genes increases, the probability that each bipartition of the species tree appears in at least one input gene tree increases. Thus, with 424 genes all bipartitions from the species tree are in at least one input gene tree, and therefore, the exact and the heuristic versions are identical.
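The effect of the number of genes can be illustrated with a back-of-the-envelope calculation. The sketch assumes that each gene tree contains a given bipartition independently with the same probability p; both the independence and the value p = 0.05 are assumptions chosen for illustration, not measurements from this dataset.

```python
def prob_in_at_least_one(p_single, n_genes):
    """Probability that a bipartition appears in at least one of n_genes
    gene trees, if each tree contains it independently with p_single."""
    return 1.0 - (1.0 - p_single) ** n_genes


# Compare a small and a large number of genes for a rarely seen bipartition.
for n in (10, 424):
    print(n, prob_in_at_least_one(0.05, n))
```

Under these assumptions, a bipartition that is rare in any single gene tree is often missing from a set of 10 gene trees but almost never from a set of 424.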

We tried hard to find a subset of genes in the biological primates dataset where the exact and the heuristic versions did not match. We couldn't! So we simulated a 14-taxon dataset with extreme levels of ILS (average 87% RF distance between gene trees and the species tree). Now, with this simulated dataset, if you take only 10 genes, something interesting happens.

Run:

At this point the scores differ slightly, and the topologies differ as well. So the difference between the two algorithms can be observed in extreme cases: higher levels of ILS, more gene tree error, or few gene trees relative to the number of taxa (for example, only 10 genes for 14 taxa, compared with the 424 genes used before).

To expand the search space, run:

Here the -e parameter supplies a set of extra trees that expand ASTRAL's search space. This file provides 200 bootstrap replicates for the 10 simulated genes. The -f argument is used instead when the extra trees have species labels rather than gene labels.

For large datasets (>500 taxa), increase the memory available to Java. This is done with the JVM option -Xmx (for example, -Xmx3000M).

Run:

-m: Removes genes with fewer than the specified number of leaves. Useful for requiring a certain level of taxon occupancy. The number is given after the option.

-k completed: To build the set X (and not to score the species tree), ASTRAL internally completes the gene trees. Use this option to see the completed gene trees. It is usable only together with -o. (I don't understand this part.)

-k bootstrapped and -k bootstraps_norun: these options output the bootstrap replicate inputs to ASTRAL. These are useful if you want to run ASTRAL separately on each bootstrap replicate on a cluster.

-k searchspace_norun: Exports the search space and exits.

--polylimit:

--samplingrounds: For multi-individual datasets, this option controls how many rounds of individual sampling are used in building the constraint set. Adjust it to reduce or increase the search space for multi-individual datasets.

Article reference: [/smirarab/astral/blob/master/astral-tutorial.md#running-on-a-multi-individual-datasets]
