SSEP-Domain: protein domain prediction by alignment of secondary structure elements and profiles.

MOTIVATION: The prediction of protein domains is a crucial task for functional classification, homology-based structure prediction and structural genomics. In this paper, we present the SSEP-Domain protein domain prediction approach, which is based on the application of secondary structure element alignment (SSEA) and profile-profile alignment (PPA) in combination with InterPro pattern searches. SSEA allows rapid screening for potential domain regions while PPA provides us with the necessary specificity for selecting significant hits. The combination with InterPro patterns allows finding domain regions without solved structural templates if sequence family definitions exist. RESULTS: A preliminary version of SSEP-Domain was ranked among the top-performing domain prediction servers in the CASP 6 and CAFASP 4 experiments. Evaluation of the final version shows further improvement over these results together with a significant speed-up. AVAILABILITY: The server is available at http://www.bio.ifi.lmu.de/SSEP/
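
The abstract does not say how secondary structure is encoded before alignment. As a minimal sketch, assuming a per-residue string over the three states H (helix), E (strand) and C (coil), one plausible first step is to collapse the string into a list of elements with their lengths. The function name `to_elements` and the three-letter alphabet are illustrative assumptions, not part of SSEP-Domain.

```python
from itertools import groupby


def to_elements(ss):
    """Collapse a per-residue secondary structure string into (state, length) elements.

    Assumption: `ss` uses one character per residue (e.g. H, E, C).
    """
    return [(state, len(list(run))) for state, run in groupby(ss)]


elements = to_elements("CCHHHHHHCCEEEEC")
print(elements)
```

Aligning such short element lists instead of full residue strings is one way a fast pre-screen could work, which is consistent with the role the abstract gives to SSEA.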

The ontology of biological taxa.

MOTIVATION: The classification of biological entities in terms of species and taxa is an important endeavor in biology. Although a large number of the statements encoded in current biomedical ontologies are taxon-dependent, there is no obvious or standard way of introducing taxon information into an integrative ontology architecture, supposedly because of ongoing controversies about the ontological nature of species and taxa. RESULTS: In this article, we discuss different approaches to representing biological taxa using existing standards for biomedical ontologies such as the description logic OWL DL and the Open Biomedical Ontologies Relation Ontology. We demonstrate how hidden ambiguities of the species concept can be dealt with and existing controversies can be overcome. A novel approach is to envisage taxon information as qualities that inhere in biological organisms, organism parts and populations. AVAILABILITY: The presented methodology has been implemented in the domain top-level ontology BioTop, openly accessible at http://purl.org/biotop. BioTop may help to improve the logical and ontological rigor of biomedical ontologies and further provides a clear architectural principle to deal with biological taxa information.

Combining Pareto-optimal clusters using supervised learning for identifying co-expressed genes.

BACKGROUND: The landscape of biological and biomedical research is changing rapidly with the invention of microarrays, which enable a simultaneous view of the transcription levels of a huge number of genes across different experimental conditions or time points. Using microarray data sets, clustering algorithms have been actively utilized in order to identify groups of co-expressed genes. This article poses the problem of fuzzy clustering in microarray data as a multiobjective optimization problem which simultaneously optimizes two internal fuzzy cluster validity indices to yield a set of Pareto-optimal clustering solutions. Each of these clustering solutions possesses some amount of information regarding the clustering structure of the input data. Motivated by this fact, a novel fuzzy majority voting approach is proposed to combine the clustering information from all the solutions in the resultant Pareto-optimal set. This approach first identifies the genes which are assigned to some particular cluster with high membership degree by most of the Pareto-optimal solutions. Using this set of genes as the training set, the remaining genes are classified by a supervised learning algorithm. In this work, we have used a Support Vector Machine (SVM) classifier for this purpose. RESULTS: The performance of the proposed clustering technique has been demonstrated on five publicly available benchmark microarray data sets, viz., Yeast Sporulation, Yeast Cell Cycle, Arabidopsis thaliana, Human Fibroblasts Serum and Rat Central Nervous System. Comparative studies of the use of different SVM kernels and several widely used microarray clustering techniques are reported. Moreover, statistical significance tests have been carried out to establish the statistical superiority of the proposed clustering approach. Finally, biological significance tests have been carried out using a web-based gene annotation tool to show that the proposed method is able to produce biologically relevant clusters of co-expressed genes. CONCLUSION: The proposed clustering method has been shown to perform better than other well-known clustering algorithms in finding clusters of co-expressed genes efficiently. The clusters of genes produced by the proposed technique are also found to be biologically significant, i.e., they consist of genes which belong to the same functional groups. This indicates that the proposed clustering method can be used efficiently to identify co-expressed genes in microarray gene expression data. SUPPLEMENTARY WEBSITE: The pre-processed and normalized data sets, the MATLAB code and other related materials are available at http://anirbanmukhopadhyay.50webs.com/mogasvm.html.
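
The voting step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' MATLAB code: cluster labels are assumed to be already aligned across the Pareto-optimal solutions, and the names `majority_vote`, `threshold` and `majority` as well as their default values are hypothetical. Genes that pass the vote form the training set; the rest would be handed to a supervised classifier (an SVM in the paper), which is not implemented here.

```python
from collections import Counter


def majority_vote(memberships, threshold=0.5, majority=0.5):
    """Split genes into a labeled training set and a list of unlabeled genes.

    memberships[s][g][k] is the fuzzy membership of gene g in cluster k
    according to solution s. Assumption: cluster k means the same cluster
    in every solution (labels already aligned).
    """
    n_solutions = len(memberships)
    n_genes = len(memberships[0])
    training, remaining = {}, []
    for g in range(n_genes):
        votes = Counter()
        for solution in memberships:
            row = solution[g]
            best = max(range(len(row)), key=row.__getitem__)
            # A solution only votes if its top membership is high enough.
            if row[best] >= threshold:
                votes[best] += 1
        if votes:
            label, count = votes.most_common(1)[0]
            if count > majority * n_solutions:
                training[g] = label
                continue
        remaining.append(g)
    return training, remaining


# Three hypothetical solutions, three genes, two clusters.
solutions = [
    [[0.90, 0.10], [0.20, 0.80], [0.55, 0.45]],
    [[0.80, 0.20], [0.30, 0.70], [0.40, 0.60]],
    [[0.95, 0.05], [0.60, 0.40], [0.50, 0.50]],
]
training, remaining = majority_vote(solutions, threshold=0.7)
print(training, remaining)
```

In this toy input, genes 0 and 1 receive a confident label from at least two of the three solutions, while gene 2 never reaches the membership threshold and is left for the classifier.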

Using multi-data hidden Markov models trained on local neighborhoods of protein structure to predict residue-residue contacts.

MOTIVATION: Correct prediction of residue-residue contacts in proteins that lack good templates with known structure would take ab initio protein structure prediction a large step forward. The lack of correct contacts, and in particular long-range contacts, is considered the main reason why these methods often fail. RESULTS: We propose a novel hidden Markov model (HMM)-based method for predicting residue-residue contacts from protein sequences using as training data homologous sequences, predicted secondary structure and a library of local neighborhoods (local descriptors of protein structure). The library consists of recurring structural entities incorporating short-, medium- and long-range interactions and is general enough to reassemble the cores of nearly all proteins in the PDB. The method is tested on an external test set of 606 domains with no significant sequence similarity to the training set as well as 151 domains with SCOP folds not present in the training set. Considering the top 0.2 x L predictions (L = sequence length), our HMMs obtained an accuracy of 22.8% for long-range interactions in new fold targets, and an average accuracy of 28.6% for long-, medium- and short-range contacts. This is a significant performance increase over currently available methods when comparing against results published in the literature. AVAILABILITY: http://predictioncenter.org/Services/FragHMMent/.
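
The evaluation criterion quoted above (accuracy of the top 0.2 x L predictions) can be illustrated with a short sketch. The function name, the `(i, j, score)` input layout and the `min_sep` parameter are assumptions made for the example; the abstract does not state the sequence separation used to define long-range contacts, so it is left as a parameter here.

```python
def top_fraction_accuracy(predictions, true_contacts, seq_len, fraction=0.2, min_sep=0):
    """Fraction of the top-scoring predicted contacts that are true contacts.

    predictions: list of (i, j, score) tuples for residue pairs.
    true_contacts: set of (i, j) pairs with i < j.
    min_sep: minimum |i - j| for a pair to be considered (assumed parameter).
    """
    eligible = [p for p in predictions if abs(p[0] - p[1]) >= min_sep]
    eligible.sort(key=lambda p: p[2], reverse=True)
    n_top = max(1, int(fraction * seq_len))
    top = eligible[:n_top]
    if not top:
        return 0.0
    hits = sum(1 for i, j, _ in top if (min(i, j), max(i, j)) in true_contacts)
    return hits / len(top)


# Hypothetical predictions for a protein of length L = 10, so 0.2 x L = 2 pairs.
preds = [(0, 9, 0.90), (1, 8, 0.80), (2, 3, 0.95), (0, 5, 0.10)]
truth = {(0, 9), (2, 3)}
print(top_fraction_accuracy(preds, truth, seq_len=10))
print(top_fraction_accuracy(preds, truth, seq_len=10, min_sep=5))
```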

Variable slope normalization of reverse phase protein arrays.

MOTIVATION: Reverse phase protein arrays (RPPA) measure the relative expression levels of a protein in many samples simultaneously. A set of identically spotted arrays can be used to measure the levels of more than one protein. Protein expression within each sample on an array is estimated by borrowing strength across all the samples, but using only within array information. When comparing across slides, it is essential to account for sample loading, the total amount of protein printed per sample. Currently, total protein is estimated using either a housekeeping protein or the sample median across all slides. When the variability in sample loading is large, these methods are suboptimal because they do not account for the fact that the protein expression for each slide is estimated separately. RESULTS: We propose a new normalization method for RPPA data, called variable slope (VS) normalization, that takes into account that quantification of RPPA slides is performed separately. This method is better able to remove loading bias and recover true correlation structures between proteins. AVAILABILITY: Code to implement the method in the statistical package R and anonymized data are available at (http://bioinformatics.mdanderson.org/supplements.html).
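
For orientation, the baseline that the abstract calls suboptimal (estimating loading from the sample median across all slides) can be sketched as below. This is not the proposed VS method, whose details are not given in the abstract. The sketch assumes log-scale expression estimates arranged as one row per sample and one column per slide; the function name and the data layout are illustrative.

```python
from statistics import median


def median_loading_normalize(log_expr):
    """Subtract each sample's median across slides from its log-expression values.

    log_expr[s][p] is the estimated log expression of sample s on slide p.
    Assumption: values are on a log scale, so loading acts as an additive offset.
    """
    normalized = []
    for row in log_expr:
        loading = median(row)  # per-sample loading estimate
        normalized.append([value - loading for value in row])
    return normalized


# Two hypothetical samples that differ only by a constant loading offset.
data = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]
print(median_loading_normalize(data))
```

After the correction the two samples become identical, which is the intended effect when loading is a pure additive offset shared by all slides. The abstract's point is that this assumption breaks down because each slide is quantified separately.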

WebGBrowse–a web server for GBrowse.

SUMMARY: The Generic Genome Browser (GBrowse) is one of the most widely used tools for visualizing genomic features along a reference sequence. However, the installation and configuration of GBrowse are not trivial for biologists. We have developed a web server, WebGBrowse, that allows users to upload genome annotation in the GFF3 format, configure the display of each genomic feature by simply using a web browser and visualize the configured genomic features with the integrated GBrowse software. AVAILABILITY: WebGBrowse is accessible via http://webgbrowse.cgb.indiana.edu/ and the system is also freely available for local installations.
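
To make the input format concrete: a GFF3 feature line has nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes), with attributes written as semicolon-separated `key=value` pairs. The sketch below parses a single line; it is not part of WebGBrowse, and the function name and the example feature are hypothetical.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict; return None for blanks and comments."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    if record["attributes"] != ".":
        for pair in record["attributes"].split(";"):
            if "=" in pair:
                key, value = pair.split("=", 1)
                # GFF3 percent-encodes reserved characters in attribute values.
                attributes[unquote(key)] = unquote(value)
    record["attributes"] = attributes
    return record


example = "chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene0001;Name=demo%20gene"
feature = parse_gff3_line(example)
print(feature["type"], feature["start"], feature["end"], feature["attributes"])
```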

ViaComplex: software for landscape analysis of gene expression networks in genomic context.

ViaComplex is an open-source application that builds landscape maps of gene expression networks. The motivation for this software comes from two previous publications (Nucleic Acids Res., 35, 1859-1867, 2007; Nucleic Acids Res., 36, 6269-6283, 2008). The first article presents a network-based model of genome stability pathways where we defined a set of genes that characterizes each genetic system. In the second article we analyzed this model by projecting functional information from several experiments onto the gene network topology. In order to systematize the methods developed in these articles, ViaComplex provides tools that may help potential users to assess different high-throughput experiments in the context of six core genome maintenance mechanisms. This model illustrates how different gene networks can be analyzed by the same algorithm. AVAILABILITY: (http://lief.if.ufrgs.br/pub/biosoftwares/viacomplex).

VARNA: Interactive drawing and editing of the RNA secondary structure.

DESCRIPTION: VARNA is a tool for the automated drawing, visualization and annotation of the secondary structure of RNA, designed as companion software for web servers and databases. FEATURES: VARNA implements four drawing algorithms, supports input/output using the classic formats dbn, ct, bpseq and RNAML and exports the drawing in five picture formats, either pixel-based (JPEG, PNG) or vector-based (SVG, EPS and XFIG). It also allows manual modification and structural annotation of the resulting drawing using an interactive point-and-click approach, within a web server or through command-line arguments. AVAILABILITY: VARNA is free software, released under the terms of the GPLv3.0 license and available at http://varna.lri.fr. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
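
As an illustration of the dbn (dot-bracket) input format mentioned above, the following sketch converts a structure string into a list of base pairs using a stack. It is independent of VARNA's own parser, handles only plain `(`, `)` and `.` characters (no pseudoknot brackets), and the function name is an assumption made for the example.

```python
def parse_dot_bracket(structure):
    """Return sorted (i, j) base pairs (0-based) encoded by a dot-bracket string."""
    stack, pairs = [], []
    for pos, char in enumerate(structure):
        if char == "(":
            stack.append(pos)
        elif char == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))
        elif char != ".":
            raise ValueError(f"unsupported character {char!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


print(parse_dot_bracket("((..))"))
```

Each opening bracket is pushed onto the stack and paired with the next unmatched closing bracket, so nested helices are recovered in a single pass over the string.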