Conpair: concordance and contamination estimator for matched tumor-normal pairs
Motivation: Sequencing of matched tumor and normal samples is the standard study design for reliable detection of somatic alterations. However, even very low levels of cross-sample contamination significantly impact calling of somatic mutations, because contaminant germline variants can be incorrectly interpreted as somatic. There are currently no sequence-only based methods that reliably estimate contamination levels in tumor samples, which frequently display copy number changes. As a solution, we developed Conpair, a tool for detection of sample swaps and cross-individual contamination in whole-genome and whole-exome tum...
Source: Bioinformatics - October 2, 2016 Category: Bioinformatics Authors: Bergmann, E. A., Chen, B.-J., Arora, K., Vacic, V., Zody, M. C. Tags: SEQUENCE ANALYSIS Source Type: research

PEP_scaffolder: using (homologous) proteins to scaffold genomes
Motivation: Recovering gene structures is one of the important goals of genome assembly. In low-quality assemblies, and even in some high-quality assemblies, certain gene regions remain incomplete; thus, novel scaffolding approaches are required to complete them. Results: We developed an efficient and fast genome scaffolding method called PEP_scaffolder, which uses proteins to scaffold genomes. The pipeline aims to recover protein-coding gene structures. We tested the method on human contigs; using human UniProt proteins as guides, N50 size increased by 17% with an accuracy of ~97%. PEP_scaffolder...
Source: Bioinformatics - October 2, 2016 Category: Bioinformatics Authors: Zhu, B.-H., Song, Y.-N., Xue, W., Xu, G.-C., Xiao, J., Sun, M.-Y., Sun, X.-W., Li, J.-T. Tags: GENOME ANALYSIS Source Type: research
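
For readers unfamiliar with the N50 statistic quoted in this abstract, the following is a minimal sketch of how it is commonly computed: the length L such that sequences of length L or longer contain at least half of all assembled bases. The function name and the contig lengths are illustrative assumptions and are not taken from PEP_scaffolder.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together cover at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical toy data: joining the two longest contigs into one
# scaffold keeps the total size but raises the N50.
contigs = [100, 80, 60, 40, 20]
scaffolds = [180, 60, 40, 20]

print(n50(contigs), n50(scaffolds))
```

Because scaffolding joins sequences without adding bases, a rise in N50 reflects improved contiguity rather than a larger assembly.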

TADtool: visual parameter identification for TAD-calling algorithms
Summary: Eukaryotic genomes are hierarchically organized into topologically associating domains (TADs). The computational identification of these domains and their associated properties critically depends on the choice of suitable parameters of TAD-calling algorithms. To reduce the element of trial-and-error in parameter selection, we have developed TADtool: an interactive plot to find robust TAD-calling parameters with immediate visual feedback. TADtool allows the direct export of TADs called with a chosen set of parameters for two of the most common TAD calling algorithms: directionality and insulation index. It can be u...
Source: Bioinformatics - October 2, 2016 Category: Bioinformatics Authors: Kruse, K., Hug, C. B., Hernandez-Rodriguez, B., Vaquerizas, J. M. Tags: GENOME ANALYSIS Source Type: research

A simple model predicts UGT-mediated metabolism
Motivation: Uridine diphosphate glucuronosyltransferases (UGTs) metabolize 15% of FDA-approved drugs. Lead optimization efforts benefit from knowing how candidate drugs are metabolized by UGTs. This paper describes a computational method for predicting sites of UGT-mediated metabolism on drug-like molecules. Results: XenoSite correctly predicts test molecules’ sites of glucuronidation in the Top-1 or Top-2 predictions at a rate of 86 and 97%, respectively. In addition to predicting common sites of UGT conjugation, like hydroxyl groups, it can also accurately predict the glucuronidation of atypical sites, such as carbon...
Source: Bioinformatics - October 2, 2016 Category: Bioinformatics Authors: Dang, N. L., Hughes, T. B., Krishnamurthy, V., Swamidass, S. J. Tags: DATA AND TEXT MINING Source Type: research
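
The Top-1 and Top-2 rates in this abstract can be read as a per-molecule top-k metric. The sketch below assumes a molecule counts as correctly predicted when at least one experimentally observed site appears among its k highest-scoring atoms; the abstract does not spell out the definition, so this rule, the function names, and the toy scores are all assumptions rather than XenoSite's own code.

```python
def top_k_hit(scores, true_sites, k):
    """True if any observed site is among the k highest-scoring atoms."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return any(atom in true_sites for atom in ranked[:k])


def top_k_accuracy(molecules, k):
    """Fraction of molecules with at least one observed site in the top k."""
    hits = sum(top_k_hit(scores, sites, k) for scores, sites in molecules)
    return hits / len(molecules)


# Hypothetical data: (per-atom scores, set of observed site indices).
molecules = [
    ([0.9, 0.2, 0.1], {0}),
    ([0.3, 0.8, 0.6], {2}),
    ([0.1, 0.2, 0.7], {0}),
]

print(top_k_accuracy(molecules, 1), top_k_accuracy(molecules, 2))
```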

A probabilistic approach for collective similarity-based drug-drug interaction prediction
Motivation: As concurrent use of multiple medications becomes ubiquitous among patients, it is crucial to characterize both adverse and synergistic interactions between drugs. Statistical methods for prediction of putative drug–drug interactions (DDIs) can guide in vitro testing and cut down significant cost and effort. With the abundance of experimental data characterizing drugs and their associated targets, such methods must effectively fuse multiple sources of information and perform inference over the network of drugs. Results: We propose a probabilistic approach for jointly inferring unknown DDIs from a network ...
Source: Bioinformatics - October 2, 2016 Category: Bioinformatics Authors: Sridhar, D., Fakhraei, S., Getoor, L. Tags: DATA AND TEXT MINING Source Type: research

Network-based pathway enrichment analysis with incomplete network information
Motivation: Pathway enrichment analysis has become a key tool for biomedical researchers to gain insight into the underlying biology of differentially expressed genes, proteins and metabolites. It reduces complexity and provides a system-level view of changes in cellular activity in response to treatments and/or in disease states. Methods that use existing pathway network information have been shown to outperform simpler methods that only take into account pathway membership. However, despite significant progress in understanding the association amongst members of biological pathways, and expansion of databases containing...
Source: Bioinformatics - October 2, 2016 Category: Bioinformatics Authors: Ma, J., Shojaie, A., Michailidis, G. Tags: SYSTEMS BIOLOGY Source Type: research

Local versus global biological network alignment
Motivation: Network alignment (NA) aims to find regions of similarities between species’ molecular networks. There exist two NA categories: local (LNA) and global (GNA). LNA finds small highly conserved network regions and produces a many-to-many node mapping. GNA finds large conserved regions and produces a one-to-one node mapping. Given the different outputs of LNA and GNA, when a new NA method is proposed, it is compared against existing methods from the same category. However, both NA categories have the same goal: to allow for transferring functional knowledge from well- to poorly-studied species between conserv...
Source: Bioinformatics - October 2, 2016 Category: Bioinformatics Authors: Meng, L., Striegel, A., Milenkovic, T. Tags: SYSTEMS BIOLOGY Source Type: research

Estimating and testing high-dimensional mediation effects in epigenetic studies
Motivation: High-dimensional DNA methylation markers may mediate pathways linking environmental exposures with health outcomes. However, there is a lack of analytical methods to identify significant mediators for high-dimensional mediation analysis. Results: Based on sure independent screening and minimax concave penalty techniques, we use a joint significance test for mediation effect. We demonstrate its practical performance using Monte Carlo simulation studies and apply this method to investigate the extent to which DNA methylation markers mediate the causal pathway from smoking to reduced lung function in the Normative...
Source: Bioinformatics - October 2, 2016 Category: Bioinformatics Authors: Zhang, H., Zheng, Y., Zhang, Z., Gao, T., Joyce, B., Yoon, G., Zhang, W., Schwartz, J., Just, A., Colicino, E., Vokonas, P., Zhao, L., Lv, J., Baccarelli, A., Hou, L., Liu, L. Tags: GENETICS AND POPULATION ANALYSIS Source Type: research
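
As a point of reference for the joint significance test mentioned in this abstract, here is a minimal sketch of its commonly used form for a single candidate mediator: the mediation effect is declared significant only if both the exposure-to-mediator path and the mediator-to-outcome path are significant, so the test's p-value is the larger of the two path p-values. The screening and penalized-selection steps are not shown, and the function name and numbers are hypothetical; whether the authors use exactly this form is an assumption.

```python
def joint_significance_p(p_exposure_to_mediator, p_mediator_to_outcome):
    """P-value of the joint significance (max-P) test for one mediator.

    Both paths must be significant for the mediation effect to be
    significant, so the combined p-value is the larger of the two.
    """
    for p in (p_exposure_to_mediator, p_mediator_to_outcome):
        if not 0.0 <= p <= 1.0:
            raise ValueError("p-values must lie in [0, 1]")
    return max(p_exposure_to_mediator, p_mediator_to_outcome)


# Hypothetical path p-values for two candidate methylation markers.
marker_a = joint_significance_p(0.001, 0.02)   # both paths significant
marker_b = joint_significance_p(0.0001, 0.40)  # second path not significant

print(marker_a, marker_b)
```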

AutoSite: an automated approach for pseudo-ligands prediction--from ligand-binding sites identification to predicting key ligand atoms
Motivation: The identification of ligand-binding sites from a protein structure facilitates computational drug design and optimization, and protein function assignment. We introduce AutoSite: an efficient software tool for identifying ligand-binding sites and predicting the pseudo-ligand corresponding to each binding site identified. Binding sites are reported as clusters of 3D points called fills, in which every point is labelled as hydrophobic or as a hydrogen bond donor or acceptor. From these fills AutoSite derives feature points: a set of putative positions of hydrophobic and hydrogen-bond-forming ligand atoms. Results: We...
Source: Bioinformatics - October 2, 2016 Category: Bioinformatics Authors: Ravindranath, P. A., Sanner, M. F. Tags: STRUCTURAL BIOINFORMATICS Source Type: research

pSumo-CD: predicting sumoylation sites in proteins with covariance discriminant algorithm by incorporating sequence-coupled effects into general PseAAC
Motivation: Sumoylation is a post-translational modification (PTM) process in which a small ubiquitin-related modifier (SUMO) is attached by covalent bonds to a substrate protein. It is critical to many different biological processes, such as genome replication, gene expression, and protein localization and stabilization; unfortunately, it is also involved in many major disorders, including Alzheimer’s and Parkinson’s diseases. Therefore, for both basic research and drug development, it is important to identify the sumoylation sites in proteins. Results: To address this problem, we developed a predictor called pSumo-C...
Source: Bioinformatics - October 2, 2016 Category: Bioinformatics Authors: Jia, J., Zhang, L., Liu, Z., Xiao, X., Chou, K.-C. Tags: SEQUENCE ANALYSIS Source Type: research

GeneCodeq: quality score compression and improved genotyping using a Bayesian framework
Motivation: The exponential reduction in cost of genome sequencing has resulted in a rapid growth of genomic data. Most of the entropy of short read data lies not in the sequence of read bases themselves but in their Quality Scores—the confidence measurement that each base has been sequenced correctly. Lossless compression methods are now close to their theoretical limits and hence there is a need for lossy methods that further reduce the complexity of these data without impacting downstream analyses. Results: We here propose GeneCodeq, a Bayesian method inspired by coding theory for adjusting quality scores to impro...
Source: Bioinformatics - October 2, 2016 Category: Bioinformatics Authors: Greenfield, D. L., Stegle, O., Rrustemi, A. Tags: SEQUENCE ANALYSIS Source Type: research
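
The quality scores discussed in this abstract are conventionally stored as Phred values, where Q = -10 log10(p) and p is the estimated probability that the base call is wrong. The sketch below shows only this standard conversion and the common FASTQ ASCII offset of 33; it is not GeneCodeq's Bayesian adjustment, and the function names and example string are illustrative assumptions.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def decode_quality_string(qual, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    An ASCII offset of 33 is assumed (Sanger-style encoding).
    """
    return [ord(ch) - offset for ch in qual]


# Hypothetical three-base quality string.
scores = decode_quality_string("I5!")
probs = [phred_to_error_prob(q) for q in scores]

print(scores, probs)
```

Each base carries one such score, drawn from a few dozen possible values.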

iPTM-mLys: identifying multiple lysine PTM sites and their different types
Motivation: Post-translational modification, abbreviated as PTM, refers to the change of the amino acid side chains of a protein after its biosynthesis. Owing to its significance for in-depth understanding of various biological processes and for developing effective drugs, prediction of PTM sites in proteins has become a hot topic in bioinformatics. Although many computational methods have been established to identify various single-label PTM types and their occurrence sites in proteins, no method has ever been developed for multi-label PTM types. As one of the most frequently observed PTMs, the K-PTM, namely, the modificat...
Source: Bioinformatics - October 2, 2016 Category: Bioinformatics Authors: Qiu, W.-R., Sun, B.-Q., Xiao, X., Xu, Z.-C., Chou, K.-C. Tags: SEQUENCE ANALYSIS Source Type: research

Accurate in silico prediction of species-specific methylation sites based on information gain feature optimization
As one of the most important reversible types of post-translational modification, protein methylation catalyzed by methyltransferases plays a pivotal role in many essential biological functions and processes. Identification of methylation sites is a prerequisite for decoding methylation regulatory networks in living cells and understanding their physiological roles. Experimental methods are labor-intensive and time-consuming. In silico approaches offer a cost-effective, high-throughput way to predict potential methylation sites, but previous predictors only have a mixed model and ...
Source: Bioinformatics - October 2, 2016 Category: Bioinformatics Authors: Wen, P.-P., Shi, S.-P., Xu, H.-D., Wang, L.-N., Qiu, J.-D. Tags: SEQUENCE ANALYSIS Source Type: research

RTCR: a pipeline for complete and accurate recovery of T cell repertoires from high throughput sequencing data
Motivation: High Throughput Sequencing (HTS) has enabled researchers to probe the human T cell receptor (TCR) repertoire, which consists of many rare sequences. Distinguishing between true but rare TCR sequences and variants generated by polymerase chain reaction (PCR) and sequencing errors remains a formidable challenge. The conventional approach to handle errors is to remove low quality reads, and/or rare TCR sequences. Such filtering discards a large number of true and often rare TCR sequences. However, accurate identification and quantification of rare TCR sequences is essential for repertoire diversity estimation. Res...
Source: Bioinformatics - October 2, 2016 Category: Bioinformatics Authors: Gerritsen, B., Pandit, A., Andeweg, A. C., de Boer, R. J. Tags: SEQUENCE ANALYSIS Source Type: research

ACE: adaptive cluster expansion for maximum entropy graphical model inference
Motivation: Graphical models are often employed to interpret patterns of correlations observed in data through a network of interactions between the variables. Recently, Ising/Potts models, also known as Markov random fields, have been productively applied to diverse problems in biology, including the prediction of structural contacts from protein sequence data and the description of neural activity patterns. However, inference of such models is a challenging computational problem that cannot be solved exactly. Here, we describe the adaptive cluster expansion (ACE) method to quickly and accurately infer Ising or Potts mode...
Source: Bioinformatics - October 2, 2016 Category: Bioinformatics Authors: Barton, J. P., De Leonardis, E., Coucke, A., Cocco, S. Tags: SEQUENCE ANALYSIS Source Type: research