OrfM: a fast open reading frame predictor for metagenomic data
Summary: Finding and translating stretches of DNA lacking stop codons is a task common in the analysis of sequence data. However, the computational tools for finding open reading frames are sufficiently slow that they are becoming a bottleneck as the volume of sequence data grows. This computational bottleneck is especially problematic in metagenomics when searching unassembled reads, or screening assembled contigs for genes of interest. Here, we present OrfM, a tool to rapidly identify open reading frames (ORFs) in sequence data by applying the Aho–Corasick algorithm to find regions uninterrupted by stop codons. Ben...
Source: Bioinformatics - August 31, 2016 Category: Bioinformatics Authors: Woodcroft, B. J., Boyd, J. A., Tyson, G. W. Tags: SEQUENCE ANALYSIS Source Type: research
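
As a rough illustration of the task the OrfM summary describes (finding regions uninterrupted by stop codons), the sketch below scans all six reading frames with a plain codon-by-codon loop. It is not OrfM's implementation and does not use the Aho–Corasick algorithm; the function names, the `min_codons` parameter and the standard stop codon set (TAA, TAG, TGA) are assumptions made for this example.

```python
# Illustrative sketch only: a naive six-frame scan, NOT OrfM's Aho-Corasick approach.
# All names and the min_codons threshold are hypothetical choices for this example.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumed)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_codons):
    """Yield (start, end) offsets of stop-free codon runs in one reading frame."""
    run_start = frame
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            if (i - run_start) // 3 >= min_codons:
                yield run_start, i
            run_start = i + 3  # next run begins after the stop codon
    # Close the final run at the last complete codon in this frame.
    end = frame + (max(len(seq) - frame, 0) // 3) * 3
    if (end - run_start) // 3 >= min_codons:
        yield run_start, end


def find_orfs(seq, min_codons=2):
    """Return (strand, frame, dna) tuples for stop-free runs in all six frames."""
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                results.append((strand, frame, s[start:end]))
    return results


if __name__ == "__main__":
    for strand, frame, dna in find_orfs("ATGAAATAAGGGCCC"):
        print(strand, frame, dna)
```

Here an ORF is simply any run of complete codons containing no stop codon, so a start codon is not required; this matches the "regions uninterrupted by stop codons" wording of the summary, but other definitions exist.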

BAM-matcher: a tool for rapid NGS sample matching
The standard method used by high-throughput genome sequencing facilities for detecting mislabelled samples is to use independently generated high-density SNP data to determine sample identity. However, as it has now become commonplace to have multiple samples sequenced from the same source, such as for analysis of somatic variants using matched tumour and normal samples, we can directly use the genotype information inherent in the sequence data to match samples and thus bypass the need for additional laboratory testing. Here we present BAM-matcher, a tool that can rapidly determine whether two BAM files represent samples f...
Source: Bioinformatics - August 31, 2016 Category: Bioinformatics Authors: Wang, P. P. S., Parker, W. T., Branford, S., Schreiber, A. W. Tags: SEQUENCE ANALYSIS Source Type: research

BRAT-nova: fast and accurate mapping of bisulfite-treated reads
Summary: In response to increasing amounts of sequencing data, faster and faster aligners need to become available. Here, we introduce BRAT-nova, a completely rewritten and improved implementation of the mapping tool BRAT-BW for bisulfite-treated reads (BS-Seq). BRAT-nova is very fast and accurate. On the human genome, BRAT-nova is 2–7 times faster than state-of-the-art aligners, while maintaining the same percentage of uniquely mapped reads and space usage. On synthetic reads, BRAT-nova is 2–8 times faster than state-of-the-art aligners while maintaining similar mapping accuracy, methylation call accuracy, met...
Source: Bioinformatics - August 31, 2016 Category: Bioinformatics Authors: Harris, E. Y., Ounit, R., Lonardi, S. Tags: SEQUENCE ANALYSIS Source Type: research

FastHiC: a fast and accurate algorithm to detect long-range chromosomal interactions from Hi-C data
Motivation: How chromatin folds in three-dimensional (3D) space is closely related to transcription regulation. As powerful tools to study such 3D chromatin conformation, the recently developed Hi-C technologies enable a genome-wide measurement of pair-wise chromatin interaction. However, methods for the detection of biologically meaningful chromatin interactions, i.e. peak calling, from Hi-C data, are still under development. In our previous work, we have developed a novel hidden Markov random field (HMRF) based Bayesian method, which through explicitly modeling the non-negligible spatial dependency among adjacent pairs o...
Source: Bioinformatics - August 31, 2016 Category: Bioinformatics Authors: Xu, Z., Zhang, G., Wu, C., Li, Y., Hu, M. Tags: GENOME ANALYSIS Source Type: research

Bioinformatics programs are 31-fold over-represented among the highest impact scientific papers of the past two decades
Motivation: To analyze the relative proportion of bioinformatics papers and their non-bioinformatics counterparts in the top 20 most cited papers annually for the past two decades. Results: When defining bioinformatics papers as encompassing both those that provide software for data analysis or methods underlying data analysis software, we find that over the past two decades, more than a third (34%) of the most cited papers in science were bioinformatics papers, which is approximately a 31-fold enrichment relative to the total number of bioinformatics papers published. More than half of the most cited papers during this sp...
Source: Bioinformatics - August 31, 2016 Category: Bioinformatics Authors: Wren, J. D. Tags: DATA AND TEXT MINING Source Type: research
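
The two figures quoted in the excerpt above (34% of the most cited papers, roughly 31-fold enrichment) can be related by simple arithmetic. The sketch below back-calculates the baseline share of bioinformatics papers that those two numbers would imply; that baseline is not stated in the excerpt, so it is a derived assumption, and the function name is hypothetical.

```python
# Back-of-the-envelope check, assuming fold enrichment = observed share / baseline share.
# The baseline share is NOT given in the excerpt; it is inferred from the two quoted figures.
def fold_enrichment(observed_share, baseline_share):
    """Ratio of the share among top-cited papers to the share among all papers."""
    return observed_share / baseline_share


observed_share = 0.34   # share of most cited papers that are bioinformatics (from the excerpt)
reported_fold = 31      # approximate enrichment (from the excerpt)

# Implied share of bioinformatics papers among all papers, about 1.1%.
implied_baseline = observed_share / reported_fold

if __name__ == "__main__":
    print(f"implied baseline share: {implied_baseline:.4f}")
    print(f"fold enrichment: {fold_enrichment(observed_share, implied_baseline):.1f}")
```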

Computational discovery and in vivo validation of hnf4 as a regulatory gene in planarian regeneration
Conclusion: These results suggest that hnf4 is a regulatory gene in planarian regeneration, validate the computational predictions of the reverse-engineered dynamic model, and demonstrate the automated methodology for the discovery of novel genes, pathways and experimental phenotypes. Contact: michael.levin@tufts.edu
Source: Bioinformatics - August 31, 2016 Category: Bioinformatics Authors: Lobo, D., Morokuma, J., Levin, M. Tags: SYSTEMS BIOLOGY Source Type: research

A computational model to predict the immune system activation by citrus-derived vaccine adjuvants
Motivation: Vaccines represent the most effective and cost-efficient weapons against a wide range of diseases. Nowadays, new-generation vaccines based on subunit antigens reduce adverse effects in high-risk individuals. However, vaccine antigens are often poor immunogens when administered alone. Adjuvants represent a good strategy to overcome such hurdles: they can enhance the immune response, allow antigen sparing, accelerate the specific immune response, and increase vaccine efficacy in vulnerable groups such as newborns, the elderly or immunocompromised people. However, due to safety concerns and adverse re...
Source: Bioinformatics - August 31, 2016 Category: Bioinformatics Authors: Pappalardo, F., Fichera, E., Paparone, N., Lombardo, A., Pennisi, M., Russo, G., Leotta, M., Pappalardo, F., Pedretti, A., De Fiore, F., Motta, S. Tags: SYSTEMS BIOLOGY Source Type: research

Drug repositioning based on comprehensive similarity measures and Bi-Random walk algorithm
Motivation: Drug repositioning, which aims to identify new indications for existing drugs, offers a promising alternative to reduce the total time and cost of traditional drug development. Many computational strategies for drug repositioning have been proposed, which are based on similarities among drugs and diseases. Current studies typically use either only drug-related properties (e.g. chemical structures) or only disease-related properties (e.g. phenotypes) to calculate drug or disease similarity, respectively, while not taking into account the influence of known drug–disease association information on the simila...
Source: Bioinformatics - August 31, 2016 Category: Bioinformatics Authors: Luo, H., Wang, J., Li, M., Luo, J., Peng, X., Wu, F.-X., Pan, Y. Tags: SYSTEMS BIOLOGY Source Type: research

RCP: a novel probe design bias correction method for Illumina Methylation BeadChip
Motivation: The Illumina HumanMethylation450 BeadChip has been extensively utilized in epigenome-wide association studies. This array and its successor, the MethylationEPIC array, use two types of probes—Infinium I (type I) and Infinium II (type II)—in order to increase genome coverage but differences in probe chemistries result in different type I and II distributions of methylation values. Ignoring the difference in distributions between the two probe types may bias downstream analysis. Results: Here, we developed a novel method, called Regression on Correlated Probes (RCP), which uses the existing correlatio...
Source: Bioinformatics - August 31, 2016 Category: Bioinformatics Authors: Niu, L., Xu, Z., Taylor, J. A. Tags: GENE EXPRESSION Source Type: research

Calculating and scoring high quality multiple flexible protein structure alignments
This article describes several novel improvements to the Kpax algorithm which allow high quality flexible MSAs to be calculated. This article also introduces a new Gaussian-based MSA quality measure called ‘M-score’, which circumvents the pitfalls of RMSD-based quality measures. Results: As well as calculating flexible MSAs, the new version of Kpax can also score MSAs from other aligners and from previously aligned reference datasets. Results are presented for a large-scale evaluation of the Homstrad, SABmark and SISY benchmark sets using Kpax and Matt as examples of state-of-the-art flexible aligners and 3DCOM...
Source: Bioinformatics - August 31, 2016 Category: Bioinformatics Authors: Ritchie, D. W. Tags: STRUCTURAL BIOINFORMATICS Source Type: research

Confidence assignment for mass spectrometry based peptide identifications via the extreme value distribution
Motivation: There is a growing trend for biomedical researchers to extract evidence and draw conclusions from mass spectrometry based proteomics experiments, the cornerstone of which is peptide identification. Inaccurate assignments of peptide identification confidence thus may have far-reaching and adverse consequences. Although some peptide identification methods report accurate statistics, they have been limited to certain types of scoring function. The extreme value statistics based method, while more general in the scoring functions it allows, demands accurate parameter estimates and requires, at least in its original...
Source: Bioinformatics - August 31, 2016 Category: Bioinformatics Authors: Alves, G., Yu, Y.-K. Tags: STRUCTURAL BIOINFORMATICS Source Type: research

Benchmarking the next generation of homology inference tools
Conclusion: Profile methods are superior at inferring remote homologs but the difference in accuracy between methods is relatively small. PHMMER and CSBLAST stand out with the highest accuracy, yet still at a reasonable computational cost. Additionally, we show that less than 0.1% of Swiss-Prot protein pairs considered homologous by one database are considered non-homologous by another, implying that these classifications represent equivalent underlying biological phenomena, differing mostly in coverage and granularity. Availability and Implementation: Benchmark datasets and all scripts are placed at (http://sonnhammer.org...
Source: Bioinformatics - August 31, 2016 Category: Bioinformatics Authors: Saripella, G. V., Sonnhammer, E. L. L., Forslund, K. Tags: SEQUENCE ANALYSIS Source Type: research

ProbFold: a probabilistic method for integration of probing data in RNA secondary structure prediction
Motivation: Recently, new RNA secondary structure probing techniques have been developed, including Next Generation Sequencing based methods capable of probing transcriptome-wide. These techniques hold great promise for improving structure prediction accuracy. However, each new data type comes with its own signal properties and biases, which may even be experiment specific. There is therefore a growing need for RNA structure prediction methods that can be automatically trained on new data types and readily extended to integrate and fully exploit multiple types of data. Results: Here, we develop and explore a modular probab...
Source: Bioinformatics - August 31, 2016 Category: Bioinformatics Authors: Sahoo, S., Switnicki, M. P., Pedersen, J. S. Tags: SEQUENCE ANALYSIS Source Type: research

PERMANOVA-S: association test for microbial community composition that accommodates confounders and multiple distances
Motivation: Recent advances in sequencing technology have made it possible to obtain high-throughput data on the composition of microbial communities and to study the effects of dysbiosis on the human host. Analysis of pairwise intersample distances quantifies the association between the microbiome diversity and covariates of interest (e.g. environmental factors, clinical outcomes, treatment groups). In the design of these analyses, multiple choices for distance metrics are available. Most distance-based methods, however, use a single distance and are underpowered if the distance is poorly chosen. In addition, distance-bas...
Source: Bioinformatics - August 31, 2016 Category: Bioinformatics Authors: Tang, Z.-Z., Chen, G., Alekseyenko, A. V. Tags: GENOME ANALYSIS Source Type: research

A two-part mixed-effects model for analyzing longitudinal microbiome compositional data
Motivation: The human microbial communities are associated with many human diseases such as obesity, diabetes and inflammatory bowel disease. High-throughput sequencing technology has been widely used to quantify the microbial composition in order to understand its impacts on human health. Longitudinal measurements of microbial communities are commonly obtained in many microbiome studies. A key question in such microbiome studies is to identify the microbes that are associated with clinical outcomes or environmental factors. However, microbiome compositional data are highly skewed, bounded in [0,1), and often sparse with m...
Source: Bioinformatics - August 31, 2016 Category: Bioinformatics Authors: Chen, E. Z., Li, H. Tags: GENOME ANALYSIS Source Type: research