Unbiased probabilistic taxonomic classification for DNA barcoding

We present a probabilistic method for taxonomical classification (PROTAX) of DNA sequences. Given a pre-defined taxonomical tree structure that is partially populated by reference sequences, PROTAX partitions a total probability of one among the set of all possible outcomes. PROTAX accounts for species that are present in the taxonomy but lack reference sequences, for the possibility of unknown taxonomical units, and for mislabeled reference sequences. PROTAX is based on a statistical multinomial regression model, and it can use any kind of sequence similarity measure, or the outputs of other classifiers, as predictors.

Results: We demonstrate the performance of PROTAX by using as predictors the output from BLAST, the phylogenetic classification software TIPP, and the RDP classifier. We show that PROTAX improves the predictions of the baseline implementations of the TIPP and RDP classifiers, and that it is able to combine complementary information provided by BLAST and TIPP, resulting in accurate and unbiased classifications even in very challenging cases such as 50% mislabeling of reference sequences.

Availability and implementation: A Perl/R implementation of PROTAX is available at http://www.helsinki.fi/science/metapop/Software.htm.

Contact: panu.somervuo@helsinki.fi

Supplementary information: Supplementary data are available at Bioinformatics online.
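The probability decomposition described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' Perl/R implementation: the toy taxonomy, the two predictors (a has-reference indicator and a best-similarity value), the weights, and the names `softmax` and `decompose` are all hypothetical, chosen only to show how a multinomial (softmax) model at each node can split a total probability of one among known taxa, taxa without reference sequences, and an "unknown" outcome.

```python
import math

def softmax(scores):
    """Convert per-outcome linear scores into probabilities that sum to one."""
    m = max(scores.values())
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}

def decompose(tree, features, weights, node="root", mass=1.0, out=None):
    """Recursively split probability `mass` among the outcomes below `node`."""
    if out is None:
        out = {}
    children = tree.get(node, [])
    if not children:
        # Leaf taxon (e.g. a species): it keeps all the mass that reached it.
        out[node] = out.get(node, 0.0) + mass
        return out
    # Linear score per child; the "unknown" outcome is the baseline with score 0.
    scores = {c: sum(w * x for w, x in zip(weights, features[c])) for c in children}
    unknown_key = f"{node}/unknown"
    scores[unknown_key] = 0.0
    for outcome, p in softmax(scores).items():
        if outcome == unknown_key:
            out[unknown_key] = mass * p
        else:
            decompose(tree, features, weights, outcome, mass * p, out)
    return out

# Hypothetical toy taxonomy: two genera, three species.
tree = {"root": ["genusA", "genusB"], "genusA": ["sp1", "sp2"], "genusB": ["sp3"]}
# Hypothetical predictors per taxon: [has_reference (0/1), best similarity to query].
features = {
    "genusA": [1.0, 0.95], "genusB": [1.0, 0.70],
    "sp1": [1.0, 0.95],
    "sp2": [0.0, 0.0],   # in the taxonomy but without reference sequences
    "sp3": [1.0, 0.70],
}
weights = [-2.0, 6.0]    # assumed values; in practice these would be estimated from data

result = decompose(tree, features, weights)
for outcome, p in sorted(result.items(), key=lambda kv: -kv[1]):
    print(f"{outcome:16s} {p:.3f}")
```

Because every node normalizes over its children plus an explicit unknown outcome, the probabilities over all leaves and unknown branches sum to one by construction. A taxon without reference sequences (here `sp2`) still receives non-zero probability.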
Source: Bioinformatics. Category: Bioinformatics. Tags: Sequence Analysis. Source type: research.