IEEE Final Year Projects 2011-2012 ::Computational Biology:: Elysium Technologies Pvt Ltd

Apr 07, 2018

sunda_pass
  • 8/6/2019 IEEE Final Year Projects 2011-2012 ::Computational Biology:: Elysium Technologies Pvt Ltd

    Elysium Technologies Private Limited
    ISO 9001:2008
    A leading Research and Development Division
    Madurai | Chennai | Trichy | Coimbatore | Kollam | Singapore
    Website: elysiumtechnologies.com, elysiumtechnologies.info
    Email: [email protected]

    IEEE Project List 2011 - 2012

    Madurai

    Elysium Technologies Private Limited

    230, Church Road, Annanagar,

    Madurai , Tamilnadu 625 020.

    Contact : 91452 4390702, 4392702, 4394702.

    eMail: [email protected]

    Trichy

    Elysium Technologies Private Limited

    3rd Floor,SI Towers,

    15 ,Melapudur , Trichy,

    Tamilnadu 620 001.

    Contact : 91431 - 4002234.

    eMail: [email protected]

    Kollam

    Elysium Technologies Private Limited

    Surya Complex,Vendor junction,

    kollam,Kerala 691 010.

    Contact : 91474 2723622.

    eMail: [email protected]

    Abstract: COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011 - 2012

    01 3D Shape Reconstruction of Loop Objects in X-Ray Protein Crystallography

    Knowledge of the shape of crystals can benefit data collection in X-ray crystallography. A preliminary step is the

    determination of the loop object, i.e., the shape of the loop holding the crystal. Based on the standard set-up of

    experimental X-ray stations for protein crystallography, the paper reviews a reconstruction method merely requiring 2D object contours and presents a dedicated novel algorithm. Properties of the object surface (e.g., texture) and depth

    information do not have to be considered. The complexity of the reconstruction task is significantly reduced by slicing the

    3D object into parallel 2D cross-sections. The shape of each cross-section is determined using support lines forming

    polygons. The slicing technique allows the reconstruction of concave surfaces perpendicular to the direction of projection.

    In spite of the low computational complexity, the reconstruction method is resilient to noisy object projections caused by

    imperfections in the image-processing system extracting the contours. The algorithm developed here has been

    successfully applied to the reconstruction of shapes of loop objects in X-ray crystallography.

    02 A Biologically Inspired Measure for Coexpression Analysis

    Two genes are said to be coexpressed if their expression levels have a similar spatial or temporal pattern. Ever since the

    profiling of gene microarrays has been in progress, computational modeling of coexpression has acquired a major focus.

    As a result, several similarity/distance measures have evolved over time to quantify coexpression similarity/dissimilarity

    between gene pairs. Of these, correlation coefficient has been established to be a suitable quantifier of pairwise

    coexpression. In general, correlation coefficient is good for symbolizing linear dependence, but not for nonlinear

    dependence. In spite of this drawback, it outperforms many other existing measures in modeling the dependency in

    biological data. In this paper, for the first time, we point out a significant weakness of the existing similarity/distance

    measures, including the standard correlation coefficient, in modeling pairwise coexpression of genes. A novel measure,

    called BioSim, which assumes values between -1 and 1 corresponding to negative and positive dependency and 0 for

    independency, is introduced. The computation of BioSim is based on the aggregation of stepwise relative angular deviation

    of the expression vectors considered. The proposed measure is analytically suitable for modeling coexpression as it

    accounts for the features of expression similarity, expression deviation and also the relative dependence. It is

    demonstrated how the proposed measure is better able to capture the degree of coexpression between a pair of genes as

    compared to several other existing ones. The efficacy of the measure is statistically analyzed by integrating it with several

    module-finding algorithms based on coexpression values and then applying it on synthetic and biological data. The

    annotation results of the coexpressed genes as obtained from gene ontology establish the significance of the introduced

    measure. By further extending the BioSim measure, it has been shown that one can effectively identify the variability in the

    expression patterns over multiple phenotypes. We have also extended BioSim to figure out pairwise differential expression

    pattern and coexpression dynamics. The significance of these studies is shown based on the analysis over several real-life

    data sets. The computation of the measure by focusing on stepwise time points also makes it effective to identify partially

    coexpressed genes. On the whole, we put forward a complete framework for coexpression analysis based on the BioSim

    measure.
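The remark in this abstract that the correlation coefficient captures linear but not nonlinear dependence can be illustrated with a small sketch. It is not an implementation of BioSim, whose formula the abstract does not give; the function name `pearson` and the toy expression vectors are assumptions made for illustration only.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

# Hypothetical expression profiles over five time points.
gene_a = [-2, -1, 0, 1, 2]
linear_partner = [2 * v + 1 for v in gene_a]   # perfectly linear in gene_a
quadratic_partner = [v * v for v in gene_a]    # fully determined by gene_a, but nonlinear

print(pearson(gene_a, linear_partner))     # close to 1
print(pearson(gene_a, quadratic_partner))  # close to 0 despite full dependence
```

The second profile is a deterministic function of the first, yet the coefficient is near zero, which is the kind of limitation that motivates alternative coexpression measures.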

    03 A cDNA Microarray Gene Expression Data Classifier for Clinical Diagnostics Based on Graph Theory

    Despite great advances in discovering cancer molecular profiles, the proper application of microarray technology to routine

    clinical diagnostics is still a challenge. Current practices in the classification of microarray data show two main

    limitations: the reliability of the training data sets used to build the classifiers, and the classifiers' performance, especially

    when the sample to be classified does not belong to any of the available classes. In this case, state-of-the-art algorithms

    usually produce a high rate of false positives that, in real diagnostic applications, are unacceptable. To address this

    problem, this paper presents a new cDNA microarray data classification algorithm that is based on graph theory and is able to overcome most of the limitations of known classification methodologies. The classifier works by analyzing gene expression

    data organized in an innovative data structure based on graphs, where vertices correspond to genes and edges to gene

    expression relationships. To demonstrate the novelty of the proposed approach, the authors present an experimental

    performance comparison between the proposed classifier and several state-of-the-art classification algorithms.

    04 A Comprehensive Statistical Model for Cell Signaling

    Protein signaling networks play a central role in transcriptional regulation and the etiology of many diseases. Statistical

    methods, particularly Bayesian networks, have been widely used to model cell signaling, mostly for model organisms and

    with focus on uncovering connectivity rather than inferring aberrations. Extensions to mammalian systems have not

    yielded compelling results, due likely to greatly increased complexity and limited proteomic measurements in vivo. In this

    study, we propose a comprehensive statistical model that is anchored to a predefined core topology, has a limited

    complexity due to parameter sharing and uses microarray data of mRNA transcripts as the only observable components of

    signaling. Specifically, we account for cell heterogeneity and a multilevel process, representing signaling as a Bayesian

    network at the cell level, modeling measurements as ensemble averages at the tissue level, and incorporating patient-to-

    patient differences at the population level. Motivated by the goal of identifying individual protein abnormalities as potential

    therapeutical targets, we applied our method to the RAS-RAF network using a breast cancer study with 118 patients. We

    demonstrated rigorous statistical inference, established reproducibility through simulations and the ability to recover

    receptor status from available microarray data.

    05 A Consensus Tree Approach for Reconstructing Human Evolutionary History and Detecting Population Substructure

    The random accumulation of variations in the human genome over time implicitly encodes a history of how human

    populations have arisen, dispersed, and intermixed since we emerged as a species. Reconstructing that history is a

    challenging computational and statistical problem but has important applications both to basic research and to the

    discovery of genotype-phenotype correlations. We present a novel approach to inferring human evolutionary history from

    genetic variation data. We use the idea of consensus trees, a technique generally used to reconcile species trees from

    divergent gene trees, adapting it to the problem of finding robust relationships within a set of intraspecies phylogenies

    derived from local regions of the genome. Validation on both simulated and real data shows the method to be effective in

    recapitulating known true structure of the data closely matching our best current understanding of human evolutionary

    history. Additional comparison with results of leading methods for the problem of population substructure assignment

    verifies that our method provides comparable accuracy in identifying meaningful population subgroups in addition to

    inferring relationships among them. The consensus tree approach thus provides a promising new model for the robust

    inference of substructure and ancestry from large-scale genetic variation data.

    07 A Continuous-Time, Discrete-State Method for Simulating the Dynamics of Biochemical Systems

    Computational systems biology is largely driven by mathematical modeling and simulation of biochemical networks, via

    continuous deterministic methods or discrete event stochastic methods. Although the deterministic methods are efficient

    in predicting the macroscopic behavior of a biochemical system, they are severely limited by their inability to represent the

    stochastic effects of random molecular fluctuations at lower concentration. In this work, we have presented a novel method

    for simulating biochemical networks based on a deterministic solution with a modification that permits the incorporation of

    stochastic effects. To demonstrate the feasibility of our approach, we have tested our method on three previously reported

    biochemical networks. The results, while staying true to their deterministic form, also reflect the stochastic effects of

    random fluctuations that are dominant as the system transitions into a lower concentration. This ability to adapt to a

    concentration gradient makes this method particularly attractive for systems biology-based applications.
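For context on the "discrete event stochastic methods" that this abstract contrasts with deterministic ones, here is a minimal sketch of the standard Gillespie direct method for a single decay reaction A -> (nothing). It is not the hybrid method the paper proposes; the function name `gillespie_decay`, the rate constant, and the initial count are assumptions chosen for illustration.

```python
import random

def gillespie_decay(n0, k, t_end, seed=0):
    """Simulate A -> (nothing) with rate constant k, starting from n0 molecules.

    Returns a list of (time, molecule_count) pairs, one per reaction event.
    """
    rng = random.Random(seed)
    t, n = 0.0, n0
    trajectory = [(t, n)]
    while n > 0:
        propensity = k * n                 # total rate at which the reaction fires
        t += rng.expovariate(propensity)   # exponentially distributed waiting time
        if t > t_end:
            break
        n -= 1                             # one molecule of A is consumed
        trajectory.append((t, n))
    return trajectory

traj = gillespie_decay(n0=50, k=1.0, t_end=100.0)
print(len(traj), traj[-1])
```

At high counts the trajectory stays close to the deterministic curve n0 * exp(-k t); at low counts the random waiting times dominate, which is the regime the abstract says deterministic methods cannot represent.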

    08 A Fast Algorithm for Computing Geodesic Distances in Tree Space

    Comparing and computing distances between phylogenetic trees are important biological problems, especially for models

    where edge lengths play an important role. The geodesic distance measure between two phylogenetic trees with edge

    lengths is the length of the shortest path between them in the continuous tree space introduced by Billera, Holmes, and

    Vogtmann. This tree space provides a powerful tool for studying and comparing phylogenetic trees, both in exhibiting a

    natural distance measure and in providing a Euclidean-like structure for solving optimization problems on trees. An

    important open problem is to find a polynomial time algorithm for finding geodesics in tree space. This paper gives such an

    algorithm, which starts with a simple initial path and moves through a series of successively shorter paths until the

    geodesic is attained.
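Two easy cases of distance in this tree space can be sketched without the full algorithm. If two trees share the same topology they lie in the same orthant and the geodesic is the ordinary Euclidean segment. For any two trees, shrinking every edge of the first to zero and then growing the edges of the second (the "cone path" through the origin) gives a valid path and therefore an upper bound. Treating the cone path as the "simple initial path" is an assumption here, not something the abstract states; the dictionary representation and the function names are likewise hypothetical.

```python
import math

def norm(tree):
    """Euclidean norm of a tree given as {split_label: edge_length}."""
    return math.sqrt(sum(length ** 2 for length in tree.values()))

def same_orthant_distance(tree_a, tree_b):
    """Geodesic distance when both trees have exactly the same splits."""
    if tree_a.keys() != tree_b.keys():
        raise ValueError("trees must have the same topology")
    return math.sqrt(sum((tree_a[s] - tree_b[s]) ** 2 for s in tree_a))

def cone_path_length(tree_a, tree_b):
    """Length of the path through the origin; an upper bound on the geodesic."""
    return norm(tree_a) + norm(tree_b)

# Hypothetical five-leaf trees described by their internal splits.
t1 = {"ab|cde": 3.0, "abc|de": 4.0}
t2 = {"ab|cde": 6.0, "abc|de": 8.0}   # same topology as t1
t3 = {"ac|bde": 6.0, "acd|be": 8.0}   # no splits in common with t1

print(same_orthant_distance(t1, t2))  # 5.0
print(cone_path_length(t1, t3))       # 15.0
```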

    09 A Fast Hierarchical Clustering Algorithm for Functional Modules Discovery in Protein Interaction Networks

    With advances in technologies for predicting protein interactions, huge data sets portrayed as networks have become

    available. Identification of functional modules from such networks is crucial for understanding principles of cellular

    organization and functions. However, protein interaction data produced by high-throughput experiments are generally

    associated with high false positives, which makes it difficult to identify functional modules accurately. In this paper, we

    propose a fast hierarchical clustering algorithm HC-PIN based on the local metric of edge clustering value which can be

    used both in the unweighted network and in the weighted network. The proposed algorithm HC-PIN is applied to the yeast

    protein interaction network, and the identified modules are validated by all the three types of Gene Ontology (GO) Terms:

    Biological Process, Molecular Function, and Cellular Component. The experimental results show that HC-PIN is not only

    robust to false positives, but also can discover the functional modules with low density. The identified modules are

    statistically significant in terms of three types of GO annotations. Moreover, HC-PIN can uncover the hierarchical

    organization of functional modules as its parameter value varies, an organization that corresponds approximately to

    the hierarchical structure of GO annotations. Compared to other previous competing algorithms, our algorithm HC-PIN is

    faster and more accurate.
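The abstract does not define the edge clustering value, so the following is only a sketch of one plausible local edge score: the squared size of the shared closed neighborhood of the two endpoints, divided by the product of the two neighborhood sizes. Both this formula and the name `edge_clustering_value` are assumptions for illustration and may differ from the definition used by HC-PIN.

```python
def edge_clustering_value(adj, u, v):
    """Neighborhood-overlap score for edge (u, v) in an unweighted graph.

    adj maps each vertex to the set of its neighbors. Closed neighborhoods
    (each vertex counted in its own neighborhood) are an assumption.
    """
    near_u = adj[u] | {u}
    near_v = adj[v] | {v}
    shared = len(near_u & near_v)
    return shared * shared / (len(near_u) * len(near_v))

# Hypothetical interaction network: a triangle a-b-c with a pendant protein d.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

print(edge_clustering_value(adj, "a", "b"))  # 1.0, edge inside the dense triangle
print(edge_clustering_value(adj, "c", "d"))  # 0.5, edge leading out of it
```

A hierarchical procedure could then process edges in decreasing order of such a score, so that tightly knit proteins are merged first.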

    10 A Framework for Semisupervised Feature Generation and Its Applications in Biomedical Literature Mining

    Feature representation is essential to machine learning and text mining. In this paper, we present a feature coupling

    generalization (FCG) framework for generating new features from unlabeled data. It selects two special types of features,

    i.e., example-distinguishing features (EDFs) and class-distinguishing features (CDFs) from original feature set, and then

    generalizes EDFs into higher-level features based on their coupling degrees with CDFs in unlabeled data. The advantage is:

    EDFs with extreme sparsity in labeled data can be enriched by their co-occurrences with CDFs in unlabeled data so that the

    performance of these low-frequency features can be greatly boosted and new information from unlabeled data can be

    incorporated. We apply this approach to three tasks in biomedical literature mining: gene named entity recognition (NER),

    protein-protein interaction extraction (PPIE), and text classification (TC) for gene ontology (GO) annotation. New features

    are generated from over 20 GB unlabeled PubMed abstracts. The experimental results on BioCreative 2, AIMED corpus, and

    TREC 2005 Genomics Track show that 1) FCG can make good use of the sparse features ignored by supervised learning; 2) it

    improves the performance of supervised baselines by 7.8 percent, 5.0 percent, and 5.8 percent, respectively, in the three

    tasks; and 3) our methods achieve F-scores of 89.1 and 64.5 and a normalized utility of 60.1 on the three benchmark data sets.

    11 A General Framework for Analyzing Data from Two Short Time-Series Microarray Experiments

    We propose a general theoretical framework for analyzing differentially expressed genes and behavior patterns from two

    homogeneous short time-course data sets. The framework generalizes the recently proposed Hilbert-Schmidt Independence

    Criterion (HSIC)-based framework [34], [35] adapting it to the time-series scenario by utilizing tensor analysis for data

    transformation. The proposed framework is effective in yielding criteria that can identify both the differentially expressed

    genes and time-course patterns of interest between two time-series experiments without requiring explicit clustering of the

    data. The results, obtained by applying the proposed framework with a linear kernel formulation to various data sets, are

    found to be both biologically meaningful and consistent with published studies.

    12 A Genetic Optimization Approach for Isolating Translational Efficiency Bias

    The study of codon usage bias is an important research area that contributes to our understanding of molecular evolution,

    phylogenetic relationships, respiratory lifestyle, and other characteristics. Translational efficiency bias is perhaps the most

    well-studied codon usage bias, as it is frequently utilized to predict relative protein expression levels. We present a novel

    approach to isolating translational efficiency bias in microbial genomes. There are several existing methods for isolating

    translational efficiency bias. Previous approaches are susceptible to the confounding influences of other potentially

    dominant biases. Additionally, existing approaches to identifying translational efficiency bias generally require both

    genomic sequence information and prior knowledge of a set of highly expressed genes. This novel approach provides more

    accurate results from sequence information alone by resisting the confounding effects of other biases. We validate this

    increase in accuracy in isolating translational efficiency bias on 10 microbial genomes, five of which have proven

    particularly difficult for existing approaches due to the presence of strong confounding biases.

    13 A Markov-Blanket-Based Model for Gene Regulatory Network Inference

    An efficient two-step Markov blanket method for modeling and inferring complex regulatory networks from large-scale

    microarray data sets is presented. The inferred gene regulatory network (GRN) is based on the time series gene expression

    data capturing the underlying gene interactions. For constructing a highly accurate GRN, the proposed method performs: 1)

    discovery of a gene's Markov Blanket (MB), 2) formulation of a flexible measure to determine the network's quality, 3)

    efficient searching with the aid of a guided genetic algorithm, and 4) pruning to obtain a minimal set of correct interactions.

    Investigations are carried out using both synthetic and yeast cell cycle gene expression data sets. The realistic

    synthetic data sets validate the robustness of the method by varying topology, sample size, time delay, noise, vertex in-

    degree, and the presence of hidden nodes. It is shown that the proposed approach has excellent inferential capabilities and

    high accuracy even in the presence of noise. The gene network inferred from yeast cell cycle data is investigated for its

    biological relevance using well-known interactions, sequence analysis, motif patterns, and GO data. Further, novel

    interactions are predicted for the unknown genes of the network and their influence on other genes is also discussed.
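The Markov blanket used in step 1 has a simple graph-theoretic definition when the network is already known: a node's parents, its children, and the other parents of those children. The sketch below computes it for a given directed acyclic graph. It does not show how the paper discovers the blanket from expression data; the function name and the toy gene labels are assumptions.

```python
def markov_blanket(parents, node):
    """Markov blanket of `node` in a DAG.

    parents maps each child to the set of its parents; nodes without
    parents may be omitted from the mapping.
    """
    children = {child for child, ps in parents.items() if node in ps}
    co_parents = {p for child in children for p in parents[child]}
    blanket = set(parents.get(node, set())) | children | co_parents
    blanket.discard(node)  # a node is not part of its own blanket
    return blanket

# Hypothetical regulatory network: g1, g2 -> g3; g3 -> g4; g3, g6 -> g5.
parents = {
    "g3": {"g1", "g2"},
    "g4": {"g3"},
    "g5": {"g3", "g6"},
}

print(sorted(markov_blanket(parents, "g3")))  # ['g1', 'g2', 'g4', 'g5', 'g6']
print(sorted(markov_blanket(parents, "g1")))  # ['g2', 'g3']
```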

    14 A Max-Flow-Based Approach to the Identification of Protein Complexes Using Protein Interaction and Microarray Data

    The emergence of high-throughput technologies leads to abundant protein-protein interaction (PPI) data and microarray

    gene expression profiles, and provides a great opportunity for the identification of novel protein complexes using

    computational methods. By combining these two types of data, we propose a novel Graph Fragmentation Algorithm (GFA)

    for protein complex identification. Adapted from a classical max-flow algorithm for finding the (weighted) densest

    subgraphs, GFA first finds large (weighted) dense subgraphs in a protein-protein interaction network, and then, breaks

    each such subgraph into fragments iteratively by weighting its nodes appropriately in terms of their corresponding log-fold

    changes in the microarray data, until the fragment subgraphs are sufficiently small. Our tests on three widely used protein-

    protein interaction data sets and comparisons with several recent methods for protein complex identification demonstrate

    the strong performance of our method in predicting novel protein complexes in terms of its specificity and efficiency. Given

    the high specificity (or precision) that our method has achieved, we conjecture that our prediction results imply more than

    200 novel protein complexes.

    15 A Note on the Fixed Parameter Tractability of the Gene-Duplication Problem

    The NP-hard gene-duplication problem takes as input a collection of gene trees and seeks a species tree that requires the

    fewest number of gene duplications to reconcile the input gene trees. An oft-cited, decade-old result by Stege states that

    the gene-duplication problem is fixed parameter tractable when parameterized by the number of gene duplications

    necessary for the reconciliation. Here, we uncover an error in this fixed parameter algorithm and show that this error cannot

    be corrected without sacrificing the fixed parameter tractability of the algorithm. Furthermore, we show a link between the

    gene-duplication problem and the minimum rooted triplets inconsistency problem which implies that the gene-duplication

    problem is 1) W[2]-hard when parameterized by the number of gene duplications necessary for the reconciliation and 2)

    hard to approximate to better than a logarithmic factor.
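The quantity being minimized can be made concrete for a fixed species tree. Under the usual least-common-ancestor mapping, each gene-tree node is mapped to the smallest species-tree cluster containing its species, and an internal gene node counts as a duplication when it maps to the same cluster as one of its children. The sketch below only evaluates this cost for a given species tree and does not search over species trees, which is the hard part; the nested-tuple tree encoding and the function names are assumptions.

```python
def species_clusters(tree):
    """Leaf sets of all nodes in a tree given as nested tuples of strings."""
    found = []
    def walk(node):
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(walk(child) for child in node))
        found.append(leaves)
        return leaves
    walk(tree)
    return found

def count_duplications(gene_tree, species_tree):
    """Number of duplications needed to reconcile gene_tree with species_tree.

    Gene-tree leaves are labeled directly with species names.
    """
    clusters = species_clusters(species_tree)
    duplications = 0

    def lca(species):
        # Smallest species-tree cluster that contains all the given species.
        return min((c for c in clusters if species <= c), key=len)

    def walk(node):
        nonlocal duplications
        if isinstance(node, str):
            species = frozenset([node])
            return species, lca(species)
        child_info = [walk(child) for child in node]
        species = frozenset().union(*(info[0] for info in child_info))
        mapped = lca(species)
        if any(info[1] == mapped for info in child_info):
            duplications += 1
        return species, mapped

    walk(gene_tree)
    return duplications

species = (("A", "B"), "C")
print(count_duplications((("A", "B"), "C"), species))  # 0, topologies agree
print(count_duplications((("A", "C"), "B"), species))  # 1, conflict forces a duplication
print(count_duplications((("A", "A"), "B"), species))  # 1, two copies within species A
```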

    16 A Partial Set Covering Model for Protein Mixture Identification Using Mass Spectrometry Data

    Protein identification is a key and essential step in mass spectrometry (MS) based proteome research. To date, there are

    many protein identification strategies that employ either MS data or MS/MS data for database searching. While MS-based

    methods provide wider coverage than MS/MS-based methods, their identification accuracy is lower since MS data have less

    information than MS/MS data. Thus, it is desired to design more sophisticated algorithms that achieve higher identification

    accuracy using MS data. Peptide Mass Fingerprinting (PMF) has been widely used to identify single purified proteins from

    MS data for many years. In this paper, we extend this technology to protein mixture identification. First, we formulate the

    problem of protein mixture identification as a Partial Set Covering (PSC) problem. Then, we present several algorithms that

    can solve the PSC problem efficiently. Finally, we extend the partial set covering model to both MS/MS data and the

    combination of MS data and MS/MS data. The experimental results on simulated data and real data demonstrate the

    advantages of our method: 1) it outperforms previous MS-based approaches significantly; 2) it is useful in the MS/MS-based

    protein inference; and 3) it combines MS data and MS/MS data in a unified model such that the identification performance is

    further improved.
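The partial set covering formulation can be sketched with a generic greedy heuristic: each candidate protein explains a set of observed peaks, and proteins are picked one at a time until a required number of peaks is explained. This is a textbook-style heuristic offered as an illustration, not necessarily one of the algorithms the paper presents; the protein names, peak identifiers, and the `min_covered` parameter are assumptions.

```python
def greedy_partial_cover(peaks, candidates, min_covered):
    """Pick candidates until at least min_covered peaks are explained.

    peaks: set of observed peak identifiers.
    candidates: dict mapping a protein name to the set of peaks it explains.
    Returns (chosen protein names in order, set of covered peaks).
    """
    covered = set()
    chosen = []
    while len(covered) < min_covered:
        # Pick the protein that explains the most still-unexplained peaks;
        # sorting makes ties resolve alphabetically.
        best = max(sorted(candidates),
                   key=lambda name: len((candidates[name] & peaks) - covered))
        gain = (candidates[best] & peaks) - covered
        if not gain:
            break  # no candidate explains anything new; target is unreachable
        chosen.append(best)
        covered |= gain
    return chosen, covered

peaks = set(range(1, 11))
candidates = {
    "P1": {1, 2, 3, 4, 5},
    "P2": {4, 5, 6, 7},
    "P3": {8},
    "P4": {9, 10},
}
chosen, covered = greedy_partial_cover(peaks, candidates, min_covered=8)
print(chosen, len(covered))  # ['P1', 'P2', 'P4'] 9
```

Requiring only part of the peaks to be covered leaves room for noise peaks that no true protein explains, which is why a partial rather than a full cover is a natural fit for spectra.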

    17 A Practical Algorithm for Reconstructing Level-1 Phylogenetic Networks

    Recently, much attention has been devoted to the construction of phylogenetic networks which generalize phylogenetic

    trees in order to accommodate complex evolutionary processes. Here, we present an efficient, practical algorithm for

    reconstructing level-1 phylogenetic networks (a type of network slightly more general than a phylogenetic tree) from

    triplets. Our algorithm has been made publicly available as the program LEV1ATHAN. It combines ideas from several known

    theoretical algorithms for phylogenetic tree and network reconstruction with two novel subroutines, namely an

    exponential-time exact algorithm and a greedy algorithm, both of which are of independent theoretical interest. Most importantly,

    LEV1ATHAN runs in polynomial time and always constructs a level-1 network. If the data are consistent with a phylogenetic

    tree, then the algorithm constructs such a tree. Moreover, if the input triplet set is dense and, in addition, is fully consistent

    with some level-1 network, it will find such a network. The potential of LEV1ATHAN is explored by means of an extensive

    simulation study and a biological data set. One of our conclusions is that LEV1ATHAN is able to construct networks

    consistent with a high percentage of input triplets, even when these input triplets are affected by a low to moderate level of

    noise.
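The notion of a structure being "consistent with a high percentage of input triplets" can be illustrated in the simpler case of a tree. A rooted triplet ab|c is consistent with a rooted tree exactly when some node's leaf set contains a and b but not c. The sketch below measures that percentage for a tree only; it is not LEV1ATHAN and does not handle level-1 networks. The nested-tuple encoding and the function names are assumptions.

```python
def clusters(tree):
    """Leaf sets of all nodes in a rooted tree given as nested tuples."""
    found = []
    def walk(node):
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(walk(child) for child in node))
        found.append(leaves)
        return leaves
    walk(tree)
    return found

def is_consistent(tree_clusters, triplet):
    """True if triplet ((a, b), c), read as ab|c, is displayed by the tree."""
    (a, b), c = triplet
    return any(a in cl and b in cl and c not in cl for cl in tree_clusters)

def consistent_fraction(tree, triplets):
    """Fraction of the input triplets that the tree is consistent with."""
    tree_clusters = clusters(tree)
    hits = sum(is_consistent(tree_clusters, t) for t in triplets)
    return hits / len(triplets)

tree = (("a", "b"), ("c", "d"))
triplets = [
    (("a", "b"), "c"),  # ab|c, displayed by the tree
    (("c", "d"), "a"),  # cd|a, displayed by the tree
    (("a", "c"), "b"),  # ac|b, conflicts with the tree
]
print(consistent_fraction(tree, triplets))
```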

    18 A Spectral Approach to Protein Structure Alignment

    A new intrinsic geometry based on a spectral analysis is used to motivate methods for aligning protein folds. The geometry

    is induced by the fact that a distance matrix can be scaled so that its eigenvalues are positive. We provide a mathematically

    rigorous development of the intrinsic geometry underlying our spectral approach and use it to motivate two alignment

    algorithms. The first uses eigenvalues alone and dynamic programming to quickly compute a fold alignment. Family

    identification results are reported for the Skolnick40 and Proteus300 data sets. The second algorithm extends our spectral

    method by iterating between our intrinsic geometry and the 3D geometry of a fold to make high-quality alignments. Results

    and comparisons are reported for several difficult fold alignments. The second algorithm's ability to correctly identify fold

    families in the Skolnick40 and Proteus300 data sets is also established.


    19 A Survey on Methods for Modeling and Analyzing Integrated Biological Networks

    Understanding how cellular systems build up integrated responses to their dynamically changing environment is one of the

    open questions in Systems Biology. Despite their intertwinement, signaling networks, gene regulation and metabolism have

    been frequently modeled independently in the context of well-defined subsystems. For this purpose, several mathematical

    formalisms have been developed according to the features of each particular network under study. Nonetheless, a deeper

    understanding of cellular behavior requires the integration of these various systems into a model capable of capturing how

    they operate as an ensemble. With the recent advances in omics technologies, more data are becoming available and,

    thus, recent efforts have been driven toward this integrated modeling approach. We herein review and discuss

    methodological frameworks currently available for modeling and analyzing integrated biological networks, in particular

    metabolic, gene regulatory and signaling networks. These include network-based methods and Chemical Organization

    Theory, Flux-Balance Analysis and its extensions, logical discrete modeling, Petri Nets, traditional kinetic modeling, Hybrid

    Systems and stochastic models. Comparisons are also established regarding data requirements, scalability with network

    size and computational burden. The methods are illustrated with successful case studies in large-scale genome models and

    in particular subsystems of various organisms.

    20 A Theoretical Analysis of the Prodrug Delivery System for Treating Antibiotic-Resistant Bacteria

    Simulations were carried out to analyze a promising new antimicrobial treatment strategy for targeting antibiotic-resistant

    bacteria called the β-lactamase-dependent prodrug delivery system. In this system, the antibacterial drugs are delivered as

    inactive precursors that only become activated after contact with an enzyme characteristic of many species of antibiotic-

    resistant bacteria (β-lactamase enzyme). The addition of an activation step contributes an extra layer of complexity to the

    system that can lead to unexpected emergent behavior. In order to optimize for treatment success and minimize the risk of

    resistance development, there must be a clear understanding of the system dynamics taking place and how they impact on

    the overall response. It makes sense to use a systems biology approach to analyze this method because it can facilitate a

    better understanding of the complex emergent dynamics arising from diverse interactions in populations. This paper

    contains an initial theoretical examination of the dynamics of this system of activation and an assessment of its therapeutic

    potential from a theoretical standpoint using an agent-based modeling approach. It also contains a case study comparison

    with real-world results from an experimental study carried out on two prodrug candidate compounds in the literature.

    21 A Weighted Principal Component Analysis and Its Application to Gene Expression Data

    In this work, we introduce in the first part new developments in Principal Component Analysis (PCA) and in the second part

    a new method to select variables (genes in our application). Our focus is on problems where the values taken by each

    variable do not all have the same importance and where the data may be contaminated with noise and contain outliers, as is

    the case with microarray data. The usual PCA is not appropriate to deal with this kind of problem. In this context, we

    propose the use of a new correlation coefficient as an alternative to Pearson's. This leads to a so-called weighted PCA


    (WPCA). In order to illustrate the features of our WPCA and compare it with the usual PCA, we consider the problem of

    analyzing gene expression data sets. In the second part of this work, we propose a new PCA-based algorithm to iteratively

    select the most important genes in a microarray data set. We show that this algorithm produces better results when our

    WPCA is used instead of the usual PCA. Furthermore, by using Support Vector Machines, we show that it can compete with

    the Significance Analysis of Microarrays algorithm.

    22 Accurate Construction of Consensus Genetic Maps via Integer Linear Programming

    We study the problem of merging genetic maps, when the individual genetic maps are given as directed acyclic graphs. The

    computational problem is to build a consensus map, which is a directed graph that includes and is consistent with all (or,

    the vast majority of) the markers in the input maps. However, when markers in the individual maps have ordering conflicts,

    the resulting consensus map will contain cycles. Here, we formulate the problem of resolving cycles in the context of a

    parsimonious paradigm that takes into account two types of errors that may be present in the input maps, namely, local

    reshuffles and global displacements. The resulting combinatorial optimization problem is, in turn, expressed as an integer

    linear program. A fast approximation algorithm is proposed, and an additional speedup heuristic is developed. Our

    algorithms were implemented in a software tool named MERGEMAP which is freely available for academic use. An

    extensive set of experiments shows that MERGEMAP consistently outperforms JOINMAP, which is the most popular tool

    currently available for this task, both in terms of accuracy and running time. MERGEMAP is available for download at

    http://www.cs.ucr.edu/~yonghui/mgmap.html.
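
The way ordering conflicts turn into cycles can be illustrated with a short sketch. It is not the MERGEMAP algorithm: it treats each input map as a simple linear marker order (a special case of a directed acyclic graph), takes the union of the edges, and uses Kahn's algorithm to test whether the union is still acyclic. All names and marker labels are hypothetical.

```python
from collections import defaultdict, deque


def merge_maps(maps):
    """Union the consecutive-marker edges of several linear maps into one directed graph."""
    graph = defaultdict(set)
    for order in maps:
        for marker in order:
            graph[marker]  # touching the key registers every marker as a node
        for left, right in zip(order, order[1:]):
            graph[left].add(right)
    return graph


def topological_order(graph):
    """Kahn's algorithm: return one marker order, or None if the graph contains a cycle."""
    indegree = {node: 0 for node in graph}
    for node in graph:
        for succ in graph[node]:
            indegree[succ] += 1
    queue = deque(sorted(n for n, d in indegree.items() if d == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for succ in sorted(graph[node]):
            indegree[succ] -= 1
            if indegree[succ] == 0:
                queue.append(succ)
    # Nodes on a cycle never reach indegree zero, so they are missing from the order.
    return order if len(order) == len(graph) else None


agreeing = merge_maps([["m1", "m2", "m3"], ["m1", "m3", "m4"]])
conflicting = merge_maps([["m1", "m2", "m3"], ["m3", "m1"]])
consensus = topological_order(agreeing)
no_consensus = topological_order(conflicting)
```

The two agreeing maps yield a consensus order, whereas the second pair places m3 both after and before m1 and therefore yields a cycle. Deciding which edges to drop in that case is the optimization problem the abstract formulates as an integer linear program.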

    23 Accurate Reconstruction for DNA Sequencing by Hybridization Based on a Constructive Heuristic

    Sequencing by hybridization is a promising cost-effective technology for high-throughput DNA sequencing via microarray

    chips. However, due to the effects of spectrum errors rooted in experimental conditions, an accurate and fast

    reconstruction of original sequences has become a challenging problem. In the last decade, a variety of analyses and

    designs have been tried to overcome this problem, where different strategies have different trade-offs in speed and

    accuracy. Motivated by the idea that the errors could be identified by analyzing the interrelation of spectrum elements, this

    paper presents a constructive heuristic algorithm, featuring an accurate reconstruction guided by a set of well-defined

    criteria and rules. Instead of directly reconstructing the original sequence, the new algorithm first builds several accurate

    short fragments, which are then carefully assembled into a whole sequence. The experiments on benchmark instance sets

    demonstrate that the proposed method can reconstruct long DNA sequences with higher accuracy than current approaches

    in the literature.
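
As a point of reference for the terminology, the sketch below builds the spectrum of a sequence (the set of all its l-mers) and reconstructs the sequence in the ideal, error-free case where each extension is unique. It is not the constructive heuristic of the paper. The function names are hypothetical, and the sketch assumes, as the classical formulation does, that the first l-mer and the target length are known.

```python
def spectrum(sequence, l):
    """Return the set of all length-l substrings (l-mers) of the sequence."""
    return {sequence[i:i + l] for i in range(len(sequence) - l + 1)}


def reconstruct(spec, start, length):
    """Extend `start` one base at a time using overlaps of l - 1 bases (requires l >= 2).

    Returns None as soon as the extension is ambiguous or impossible, which is
    what happens when the spectrum contains errors or repeated l-mers.
    """
    l = len(start)
    unused = set(spec) - {start}
    result = start
    while len(result) < length:
        suffix = result[-(l - 1):]
        candidates = [kmer for kmer in unused if kmer.startswith(suffix)]
        if len(candidates) != 1:
            return None
        result += candidates[0][-1]
        unused.remove(candidates[0])
    return result


original = "ACGTTGCA"
ideal = spectrum(original, 3)
rebuilt = reconstruct(ideal, "ACG", len(original))

# A negative error (one missing l-mer) already breaks the naive extension.
damaged = ideal - {"GTT"}
failed = reconstruct(damaged, "ACG", len(original))
```

The failure on the damaged spectrum shows why spectrum errors make reconstruction hard and why the abstract first builds reliable short fragments before assembling them.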

    24 An Approximation Algorithm for the Noah's Ark Problem with Random Feature Loss


    The phylogenetic diversity (PD) of a set of species is a measure of their evolutionary distinctness based on a phylogenetic

    tree. PD is increasingly being adopted as an index of biodiversity in ecological conservation projects. The Noah's Ark

    Problem (NAP) is an NP-Hard optimization problem that abstracts a fundamental conservation challenge in asking to

    maximize the expected PD of a set of taxa given a fixed budget, where each taxon is associated with a cost of conservation

    and a probability of extinction. Only simplified instances of the problem, where one or more parameters are fixed as

    constants, have as of yet been addressed in the literature. Furthermore, it has been argued that PD is not an appropriate

    metric for models that allow information to be lost along paths in the tree. We therefore generalize the NAP to incorporate a

    proposed model of feature loss according to an exponential distribution and term this problem NAP with Loss (NAPL). In

    this paper, we present a pseudopolynomial time approximation scheme for NAPL.
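
The quantity being maximized in the plain NAP can be sketched as follows. An edge contributes its length to the PD whenever at least one taxon below it survives, so under independent extinctions the expected PD is the sum over edges of length times (1 - product of the extinction probabilities below the edge). The sketch does not include the feature-loss model of NAPL or any budget optimization, and the tree encoding `(edge_length, children_or_leaf_name)` is an assumption made for illustration.

```python
def expected_pd(node, extinction):
    """Return (expected PD of the subtree including its stem edge,
    probability that every leaf below the stem edge goes extinct).

    node is (edge_length, leaf_name) for a leaf and
    (edge_length, [child, ...]) for an internal node.
    """
    length, content = node
    if isinstance(content, str):
        subtotal = 0.0
        p_all_extinct = extinction[content]
    else:
        subtotal = 0.0
        p_all_extinct = 1.0
        for child in content:
            child_pd, child_extinct = expected_pd(child, extinction)
            subtotal += child_pd
            p_all_extinct *= child_extinct
    # The stem edge counts only if at least one leaf below it survives.
    return subtotal + length * (1.0 - p_all_extinct), p_all_extinct


# Hypothetical three-taxon tree: ((a:2, b:2):1, c:3) with a zero-length root edge.
example = (0.0, [(1.0, [(2.0, "a"), (2.0, "b")]), (3.0, "c")])
risky, _ = expected_pd(example, {"a": 0.5, "b": 0.5, "c": 0.0})
safe, _ = expected_pd(example, {"a": 0.0, "b": 0.0, "c": 0.0})
```

With no extinction risk the value equals the total edge length of the tree. In the NAP, spending budget on a taxon lowers its extinction probability, and the goal is to choose the spending that maximizes this expectation.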

    25 An Improved Heuristic Algorithm for Finding Motif Signals in DNA Sequences

    The planted (l, d)-motif search problem is a mathematical abstraction of the DNA functional site discovery task. In this

    paper, we propose a heuristic algorithm that can find planted (l, d)-signals in a given set of DNA sequences. Evaluations

    on simulated data sets demonstrate that the proposed algorithm outperforms current widely used motif finding algorithms.

    We also report the results of experiments on real biological data sets.
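
To make the problem statement concrete, the sketch below solves tiny instances exhaustively: an l-mer is a solution if every input sequence contains a substring of length l within Hamming distance d of it. This is a brute-force baseline over all 4^l candidates, not the heuristic proposed in the paper, and the function names and sequences are hypothetical.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(motif, sequence, d):
    """True if some window of the sequence differs from the motif in at most d positions."""
    l = len(motif)
    return any(hamming(motif, sequence[i:i + l]) <= d
               for i in range(len(sequence) - l + 1))


def planted_motifs(sequences, l, d):
    """List every l-mer that occurs in each sequence with at most d mismatches."""
    found = []
    for letters in product("ACGT", repeat=l):
        motif = "".join(letters)
        if all(occurs_within(motif, s, d) for s in sequences):
            found.append(motif)
    return found


# Each sequence carries a copy of ACGT with exactly one substitution.
sequences = ["GGACGAGG", "CCTCGTCC", "TTACTTTT"]
with_one_mismatch = planted_motifs(sequences, 4, 1)
exact_only = planted_motifs(sequences, 4, 0)
```

The running time grows as 4^l times the input size, which is why heuristics are used for realistic motif lengths.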

    26 Asymmetric Comparison and Querying of Biological Networks

    Comparing and querying the protein-protein interaction (PPI) networks of different organisms is important to infer

    knowledge about conservation across species. Known methods that perform these tasks operate symmetrically, i.e., they

    do not assign a distinct role to the input PPI networks. However, in most cases, the input networks are indeed

    distinguishable on the basis of how the corresponding organism is biologically well characterized. In this paper a new idea

    is developed, that is, to exploit differences in the characterization of organisms at hand in order to devise methods for

    comparing their PPI networks. We use the PPI network (called Master) of the best characterized organism as a fingerprint to

    guide the alignment process to the second input network (called Slave), so that generated results preferably retain the

    structural characteristics of the Master network. Technically, this is obtained by generating from the Master a finite

    automaton, called alignment model, which is then fed with (a linearization of) the Slave for the purpose of extracting, via the

    Viterbi algorithm, matching subgraphs. We propose an approach able to perform global alignment and network querying,

    and we apply it to PPI networks. Our tests show that the results it returns are biologically relevant.

    27 Bayesian Models and Algorithms for Protein Beta-Sheet Prediction

    Prediction of the 3D structure greatly benefits from the information related to secondary structure, solvent accessibility,

    and nonlocal contacts that stabilize a protein's structure. We address the problem of Beta-sheet prediction defined as the


    Genetic regulatory networks usually encompass a multitude of complex, interacting feedback loops. Being able to model

    and analyze their behavior is crucial for understanding their function. However, state space explosion is becoming a

    limiting factor in the formal analysis of genetic networks. This paper explores a modular approach for verification of

    reachability properties. A framework for component-based modeling of genetic regulatory networks, based on a modular

    discrete abstraction, is introduced. Then a compositional algorithm to efficiently analyze reachability properties of the

    model is proposed. A case study on embryonic cell differentiation involving several hundred cells shows the potential of

    this approach.
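
For contrast with the modular approach, the following sketch checks a reachability property by exploring the full state space of a small Boolean network. This monolithic search visits up to 2^n states for n genes, which is the state space explosion the abstract refers to. The asynchronous update scheme (one gene changes per step), the function names, and the two-gene example are assumptions made for illustration and are not taken from the paper.

```python
from collections import deque


def reachable(initial, update_rules, target):
    """Breadth-first search over asynchronous updates of a Boolean network.

    initial      -- tuple of 0/1 values, one per gene
    update_rules -- one function per gene, mapping a state to that gene's next value
    target       -- predicate on states
    """
    seen = {initial}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if target(state):
            return True
        for i, rule in enumerate(update_rules):
            new_value = rule(state)
            if new_value != state[i]:
                successor = state[:i] + (new_value,) + state[i + 1:]
                if successor not in seen:
                    seen.add(successor)
                    queue.append(successor)
    return False


# Hypothetical toggle switch: each gene is expressed only if the other is off.
toggle = [lambda s: 1 - s[1], lambda s: 1 - s[0]]
can_switch = reachable((1, 1), toggle, lambda s: s == (0, 1))
can_shut_down = reachable((1, 1), toggle, lambda s: s == (0, 0))
```

Starting with both genes on, either gene can win, but the state with both genes off is never reached.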

    31 Computing a Smallest Multilabeled Phylogenetic Tree from Rooted Triplets

    We investigate the computational complexity of inferring a smallest possible multilabeled phylogenetic tree (MUL tree)

    which is consistent with each of the rooted triplets in a given set. This problem has not been studied previously in the

    literature. We prove that even the very restricted case of determining if there exists a MUL tree consistent with the input and

    having just one leaf duplication is an NP-hard problem. Furthermore, we show that the general minimization problem is

    difficult to approximate, although a simple polynomial-time approximation algorithm achieves an approximation ratio close

    to our derived inapproximability bound. Finally, we provide an exact algorithm for the problem running in exponential time

    and space. As a by-product, we also obtain new, strong inapproximability results for two partitioning problems on directed

    graphs called ACYCLIC PARTITION and ACYCLIC TREE-PARTITION.

    32 Data Mining on DNA Sequences of Hepatitis B Virus

    Extraction of meaningful information from large experimental data sets is a key element in bioinformatics research. One of

    the challenges is to identify genomic markers in Hepatitis B Virus (HBV) that are associated with HCC (liver cancer)

    development by comparing the complete genomic sequences of HBV among patients with HCC and those without HCC. In

    this study, a data mining framework, which includes molecular evolution analysis, clustering, feature selection, classifier

    learning, and classification, is introduced. Our research group has collected HBV DNA sequences, either genotype B or C,

    from over 200 patients specifically for this project. In the molecular evolution analysis and clustering, three subgroups have

    been identified in genotype C and a clustering method has been developed to separate the subgroups. In the feature

    selection process, potential markers are selected based on Information Gain for further classifier learning. Then,

    meaningful rules are learned by our algorithm called the Rule Learning, which is based on Evolutionary Algorithm. Also, a

    new classification method by Nonlinear Integral has been developed. Good performance of this method comes from the use

    of the fuzzy measure and the relevant nonlinear integral. The nonadditivity of the fuzzy measure reflects the importance of

    the feature attributes as well as their interactions. These two classifiers give explicit information on the importance of the

    individual mutated sites and their interactions toward the classification (potential causes of liver cancer in our case). A

    thorough comparison study of these two methods with existing methods is detailed. For genotype B, genotype C

    subgroups C1, C2, and C3, important mutation markers (sites) have been found, respectively. These two classification

    methods have been applied to classify never-seen-before examples for validation. The results show that the classification


    methods have more than 70 percent accuracy and 80 percent sensitivity for most data sets, which are considered high as

    an initial scanning method for liver cancer diagnosis.
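
The feature selection step based on Information Gain can be sketched with the standard definition: the entropy of the class labels minus the expected entropy after splitting on the nucleotide observed at one site. The data below are invented for illustration and the function names are hypothetical. The rule learning and nonlinear integral classifiers of the study are not shown.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on one feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for v, lab in zip(feature_values, labels) if v == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Invented toy data: nucleotide at two genome sites for four patients.
labels = ["HCC", "HCC", "control", "control"]
site_informative = ["A", "A", "G", "G"]
site_uninformative = ["A", "G", "A", "G"]
gain_high = information_gain(site_informative, labels)
gain_low = information_gain(site_uninformative, labels)
```

Sites are then ranked by their gain, and the top-ranked ones serve as candidate markers for classifier learning.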

    33 Determination of Glycan Structure from Tandem Mass Spectra

    Glycans are molecules made from simple sugars that form complex tree structures. Glycans constitute one of the most

    important protein modifications and identification of glycans remains a pressing problem in biology. Unfortunately, the

    structure of glycans is hard to predict from the genome sequence of an organism. In this paper, we consider the problem of

    deriving the topology of a glycan solely from tandem mass spectrometry (MS) data. We study how to generate glycan tree

    candidates that sufficiently match the sample mass spectrum, avoiding the combinatorial explosion of glycan structures.

    Unfortunately, the resulting problem is known to be computationally hard. We present an efficient exact algorithm for this

    problem based on fixed-parameter algorithmics that can process a spectrum in a matter of seconds. We also report some

    preliminary results of our method on experimental data, combining it with a preliminary candidate evaluation scheme. We

    show that our approach is fast in applications, and that we can reach very good de novo identification results. Finally, we

    show how to count the number of glycan topologies for a fixed size or a fixed mass. We generalize this result to count the

    number of (labeled) trees with bounded out-degree, improving on results obtained using Pólya's enumeration theorem.

    34 Discriminative Motif Finding for Predicting Protein Subcellular Localization

    Many methods have been described to predict the subcellular location of proteins from sequence information. However,

    most of these methods either rely on global sequence properties or use a set of known protein targeting motifs to predict

    protein localization. Here, we develop and test a novel method that identifies potential targeting motifs using a

    discriminative approach based on hidden Markov models (discriminative HMMs). These models search for motifs that are

    present in a compartment but absent in other, nearby, compartments by utilizing a hierarchical structure that mimics the

    protein sorting mechanism. We show that both discriminative motif finding and the hierarchical structure improve

    localization prediction on a benchmark data set of yeast proteins. The motifs identified can be mapped to known targeting

    motifs and they are more conserved than the average protein sequence. Using our motif-based predictions, we can identify

    potential annotation errors in public databases for the location of some of the proteins. A software implementation and the

    data set described in this paper are available from http://murphylab.web.cmu.edu/software/2009_TCBB_motif/.

    35 Disturbance Analysis of Nonlinear Differential Equation Models of Genetic SUM Regulatory Networks

    Noise disturbances and time delays are frequently met in cellular genetic regulatory systems. This paper is concerned with

    the disturbance analysis of a class of genetic regulatory networks described by nonlinear differential equation models. The

    mechanisms of genetic regulatory networks to amplify (attenuate) external disturbance are explored, and a simple measure

    of the amplification (attenuation) level is developed from a nonlinear robust control point of view. It should be noted that the

    conditions used to measure the disturbance level are delay-independent or delay-dependent, and are expressed within the

    framework of linear matrix inequalities, which can be characterized as convex optimization, and computed by the interior-


    point algorithm easily. Finally, by the proposed method, a numerical example is provided to illustrate how to measure the

    attenuation of proteins in the presence of external disturbances.

    36 Efficient Formulations for Exact Stochastic Simulation of Chemical Systems

    One can generate trajectories to simulate a system of chemical reactions using either Gillespie's direct method or Gibson

    and Bruck's next reaction method. Because one usually needs many trajectories to understand the dynamics of a system,

    performance is important. In this paper, we present new formulations of these methods that improve the computational

    complexity of the algorithms. We present optimized implementations, available from http://cain.sourceforge.net/, that offer

    better performance than previous work. There is no single method that is best for all problems. Simple formulations often

    work best for systems with a small number of reactions, while some sophisticated methods offer the best performance for

    large problems and scale well asymptotically. We investigate the performance of each formulation on simple biological

    systems using a wide range of problem sizes. We also consider the numerical accuracy of the direct and the next reaction

    method. We have found that special precautions must be taken in order to ensure that randomness is not discarded during

    the course of a simulation.
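
A plain, unoptimized version of Gillespie's direct method is sketched below as a baseline. Each step draws an exponentially distributed waiting time from the total propensity and then picks the reaction whose cumulative propensity first exceeds a uniform threshold (a linear search). It does not reproduce the optimized formulations of the paper or the Cain implementation. The reaction encoding, a list of `(propensity_function, state_change)` pairs, is an assumption made for illustration.

```python
import math
import random


def gillespie_direct(state, reactions, t_end, rng):
    """Simulate one trajectory; returns a list of (time, state) snapshots."""
    t = 0.0
    state = dict(state)
    trajectory = [(t, dict(state))]
    while True:
        propensities = [prop(state) for prop, _ in reactions]
        total = sum(propensities)
        if total <= 0.0:
            break  # no reaction can fire any more
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        t += -math.log(1.0 - rng.random()) / total
        if t > t_end:
            break
        # Linear search for the reaction selected by the uniform threshold.
        threshold = rng.random() * total
        cumulative = 0.0
        for propensity, (_, change) in zip(propensities, reactions):
            cumulative += propensity
            if threshold < cumulative:
                break
        for species, delta in change.items():
            state[species] += delta
        trajectory.append((t, dict(state)))
    return trajectory


# Hypothetical decay reaction A -> (nothing) with rate constant 0.5.
decay = [(lambda s: 0.5 * s["A"], {"A": -1})]
trajectory = gillespie_direct({"A": 10}, decay, 1e6, random.Random(0))
```

The linear search costs time proportional to the number of reactions per step, which is one reason simple formulations suit small systems while more elaborate data structures pay off for large ones.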

    37 Encoding Molecular Motions in Voxel Maps

    This paper builds on the combination of robotic path planning algorithms and molecular modeling methods for computing

    large-amplitude molecular motions, and introduces voxel maps as a computational tool to encode and to represent such

    motions. We investigate several applications and show results that illustrate the interest of such representation.

    38 Ensemble Learning with Active Example Selection for Imbalanced Biomedical Data Classification

    In biomedical data, the imbalanced data problem occurs frequently and causes poor prediction performance for minority

    classes. This is because the trained classifiers are mostly derived from the majority class. In this paper, we describe an

    ensemble learning method combined with active example selection to resolve the imbalanced data problem. Our method

    consists of three key components: 1) an active example selection algorithm to choose informative examples for training the

    classifier, 2) an ensemble learning method to combine variations of classifiers derived by active example selection, and 3)

    an incremental learning scheme to speed up the iterative training procedure for active example selection. We evaluate the

    method on six real-world imbalanced data sets in biomedical domains, showing that the proposed method outperforms

    both the random undersampling and the ensemble with undersampling methods. Compared to other approaches to

    solving the imbalanced data problem, our method excels by 0.03-0.15 points in AUC measure.
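
For orientation, the random undersampling baseline that the proposed method is compared against can be sketched in a few lines: every class is reduced to the size of the smallest class by sampling without replacement. The active example selection and ensemble steps of the paper are not shown, and the function name and toy data are hypothetical.

```python
import random


def random_undersample(examples, labels, rng):
    """Return a class-balanced list of (example, label) pairs.

    Every class is cut down to the size of the smallest class by
    sampling without replacement.
    """
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    smallest = min(len(items) for items in by_class.values())
    balanced = []
    for y, items in by_class.items():
        for x in rng.sample(items, smallest):
            balanced.append((x, y))
    rng.shuffle(balanced)
    return balanced


# Toy data: eight majority examples (label 0) and two minority examples (label 1).
examples = list(range(10))
labels = [0] * 8 + [1] * 2
balanced = random_undersample(examples, labels, random.Random(42))
```

The baseline discards most majority-class examples at random, whereas the abstract's approach selects informative examples instead.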


    39 Estimating Genome-Wide Gene Networks Using Nonparametric Bayesian Network Models on Massively Parallel Computers

    We present a novel algorithm to estimate genome-wide gene networks consisting of more than 20,000 genes from gene

    expression data using nonparametric Bayesian networks. Due to the difficulty of learning Bayesian network structures,

    existing algorithms cannot be applied to more than a few thousand genes. Our algorithm overcomes this limitation by

    repeatedly estimating subnetworks in parallel for genes selected by neighbor node sampling. Through numerical

    simulation, we confirmed that our algorithm outperformed a heuristic algorithm in a shorter time. We applied our algorithm

    to microarray data from human umbilical vein endothelial cells (HUVECs) treated with siRNAs, to construct a human

    genome-wide gene network, which we compared to a small gene network estimated for the genes extracted using a

    traditional bioinformatics method. The results showed that our genome-wide gene network contains many features of the

    small network, as well as others that could not be captured during the small network estimation. The results also revealed

    master-regulator genes that are not in the small network but that control many of the genes in the small network. These

    analyses were impossible to realize without our proposed algorithm.

    40 Estimating Haplotype Frequencies by Combining Data from Large DNA Pools with Database Information

    We assume that allele frequency data have been extracted from several large DNA pools, each containing genetic material

    of up to hundreds of sampled individuals. Our goal is to estimate the haplotype frequencies among the sampled individuals

    by combining the pooled allele frequency data with prior knowledge about the set of possible haplotypes. Such prior

    information can be obtained, for example, from a database such as HapMap. We present a Bayesian haplotyping method for

    pooled DNA based on a continuous approximation of the multinomial distribution. The proposed method is applicable when

    the sizes of the DNA pools and/or the number of considered loci exceed the limits of several earlier methods. In the

    example analyses, the proposed model clearly outperforms a deterministic greedy algorithm on real data from the HapMap

    database. With a small number of loci, the performance of the proposed method is similar to that of an EM-algorithm, which

    uses a multinormal approximation for the pooled allele frequencies, but which does not utilize prior information about the

    haplotypes. The method has been implemented using Matlab and the code is available upon request from the authors.

    41 EvoMD: An Algorithm for Evolutionary Molecular Design

    Traditionally, Computer-Aided Molecular Design (CAMD) uses heuristic search and mathematical programming to tackle the

    molecular design problem. However, these techniques do not handle large, nonlinear search spaces well. To overcome

    these drawbacks, graph-based evolutionary algorithms (EAs) have been proposed to evolve molecular design by mimicking

    chemical reactions on the exchange of chemical bonds and components between molecules. For these EAs to perform their

    tasks, known molecular components, which can serve as building blocks for the molecules to be designed, and known

    chemical rules, which govern chemical combination between different components, have to be introduced before the

    evolutionary process can take place. To automate molecular design without these constraints, this paper proposes an EA

    called Evolutionary Algorithm for Molecular Design (EvoMD). EvoMD encodes molecular designs in graphs. It uses a novel

    crossover operator that does not require chemistry rules to be known in advance, and it uses a set of novel mutation

    operators. EvoMD uses atomics-based and fragment-based approaches to handle molecules of different sizes, and the value

    of the fitness function it uses is made to depend on the property descriptors of the design encoded in a molecular graph. It

    has been tested with different data sets and has been shown to be very promising.

    42 Extensions and Improvements to the Chordal Graph Approach to the Multistate Perfect Phylogeny Problem

    The multistate perfect phylogeny problem is a classic problem in computational biology. When no perfect phylogeny exists,

    it is of interest to find a set of characters to remove in order to obtain a perfect phylogeny in the remaining data. This is

    known as the character removal problem. We show how to use chordal graphs and triangulations to solve the character

    removal problem for an arbitrary number of states, which was previously unsolved. We outline a preprocessing technique

    that speeds up the computation of the minimal separators of a graph. Minimal separators are used in our solution to the

    missing data character removal problem and to Gusfield's solution of the perfect phylogeny problem with missing data.

    43 F2Dock: Fast Fourier Protein-Protein Docking

    The functions of proteins are often realized through their mutual interactions. Determining a relative transformation for a

    pair of proteins and their conformations which form a stable complex, reproducible in nature, is known as docking. It is an

    important step in drug design, structure determination, and understanding function and structure relationships. In this

    paper, we extend our nonuniform fast Fourier transform-based docking algorithm to include an adaptive search phase

    (both translational and rotational) and thereby speed up its execution. We have also implemented a multithreaded version

    of the adaptive docking algorithm for even faster execution on multi-core machines. We call this protein-protein docking

    code F2Dock (F2 = Fast Fourier). We have calibrated F2Dock based on an extensive experimental study on a list of

    benchmark complexes and conclude that F2Dock works very well in practice. Though all docking results reported in this

    paper use shape complementarity and Coulombic-potential-based scores only, F2Dock is structured to incorporate

    Lennard-Jones potential and reranking of docking solutions based on desolvation energy.

    44 Fast Surface-Based Travel Depth Estimation Algorithm for Macromolecule Surface Shape Description

    Travel Depth, introduced by Coleman and Sharp in 2006, is a physical interpretation of molecular depth, a term frequently

    used to describe the shape of a molecular active site or binding site. Travel Depth can be seen as the physical distance a

    solvent molecule would have to travel from a point of the surface, i.e., the Solvent-Excluded Surface (SES), to its convex

    hull. Existing algorithms providing an estimation of the Travel Depth are based on a regular sampling of the molecule

    volume and the use of Dijkstra's shortest-path algorithm. Since Travel Depth is only defined on the molecular surface,

    this volume-based approach is characterized by a large computational complexity due to the processing of unnecessary

    samples lying inside or outside the molecule. In this paper, we propose a surface-based approach that restricts the

    processing to data defined on the SES. This algorithm significantly reduces the complexity of Travel Depth estimation and

    makes high-resolution shape description of large macromolecule surfaces feasible. Experimental results

    show that compared to existing methods, the proposed algorithm achieves accurate estimations with considerably reduced

    processing times.
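
    The shortest-path step that the existing estimators rely on can be sketched as a multi-source Dijkstra search. This is a minimal illustration and not the surface-based algorithm proposed in the paper. It assumes the sample points are given as a weighted graph and that the points lying on the convex hull (depth zero) are already known. The names `travel_depth`, `toy_graph`, and the edge lengths are hypothetical.

```python
import heapq


def travel_depth(graph, hull_nodes):
    """Multi-source Dijkstra: shortest distance from every node to the hull.

    graph maps a node to a list of (neighbor, edge_length) pairs;
    hull_nodes are the nodes assumed to lie on the convex hull (depth 0).
    """
    depth = {node: float("inf") for node in graph}
    heap = []
    for node in hull_nodes:
        depth[node] = 0.0
        heapq.heappush(heap, (0.0, node))
    while heap:
        dist, node = heapq.heappop(heap)
        if dist > depth[node]:
            continue  # stale queue entry, a shorter path was already found
        for neighbor, length in graph[node]:
            candidate = dist + length
            if candidate < depth[neighbor]:
                depth[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return depth


# Toy graph: "a" is on the hull, "d" sits at the bottom of a pocket.
toy_graph = {
    "a": [("b", 1.0), ("c", 2.5)],
    "b": [("a", 1.0), ("c", 1.0), ("d", 3.0)],
    "c": [("a", 2.5), ("b", 1.0), ("d", 1.5)],
    "d": [("b", 3.0), ("c", 1.5)],
}
depths = travel_depth(toy_graph, ["a"])
```

    The cost of such a search grows with the number of nodes and edges in the graph, which is why restricting the samples to the surface rather than the whole volume can reduce processing time.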

    45 FEAST: Sensitive Local Alignment with Multiple Rates of Evolution

    We present a pairwise local aligner, FEAST, which uses two new techniques: a sensitive extension algorithm for identifying

    homologous subsequences, and a descriptive probabilistic alignment model. We also present a new procedure for training

    alignment parameters and apply it to the human and mouse genomes, producing a better parameter set for these

    sequences. Our extension algorithm identifies homologous subsequences by considering all evolutionary histories. It has

    higher maximum sensitivity than Viterbi extensions, and better balances specificity. We model alignments with several

    submodels, each with unique statistical properties, describing strongly similar and weakly similar regions of homologous

    DNA. Training parameters using two submodels produces superior alignments, even when we align with only the

    parameters from the weaker submodel. Our extension algorithm combined with our new parameter set achieves sensitivity

    0.59 on synthetic tests. In contrast, LASTZ with default settings achieves sensitivity 0.35 with the same false positive rate.

    Using the weak submodel as parameters for LASTZ increases its sensitivity to 0.59 with high error. FEAST is available at

    http://monod.uwaterloo.ca/feast/.

    46 Finding Significant Matches of Position Weight Matrices in Linear Time

    Position weight matrices are an important method for modeling signals or motifs in biological sequences, both in DNA and

    protein contexts. In this paper, we present fast algorithms for the problem of finding significant matches of such matrices.

    Our algorithms are of the online type, and they generalize classical multipattern matching, filtering, and superalphabet

    techniques of combinatorial string matching to the problem of weight matrix matching. Several variants of the algorithms

    are developed, including multiple matrix extensions that perform the search for several matrices in one scan through the

    sequence database. Experimental performance evaluation is provided to compare the new techniques against each other as

    well as against some other online and index-based algorithms proposed in the literature. Compared to the brute-force

    O(mn) approach, our solutions can be faster by a factor that is proportional to the matrix length m. Our multiple-matrix

    filtration algorithm had the best performance in the experiments. On a current PC, this algorithm finds significant matches

    (p-value 0.0001) of the 123 JASPAR matrices in the human genome in about 18 minutes.
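
    For orientation, the brute-force baseline that the abstract compares against can be sketched as follows: every window of length m in a sequence of length n is scored column by column, giving O(mn) work. This is a minimal illustration and not one of the authors' algorithms. The function name `scan_pwm`, the toy matrix, and the score threshold are hypothetical; a real search would derive the threshold from the chosen p-value.

```python
def scan_pwm(sequence, pwm, threshold):
    """Return (position, score) for every window scoring >= threshold.

    pwm is a list of m dicts mapping a symbol to its weight
    (a toy representation of a position weight matrix).
    """
    m = len(pwm)
    hits = []
    # Slide a window of length m over the sequence: O(m * n) work in total.
    for start in range(len(sequence) - m + 1):
        score = 0.0
        for offset in range(m):
            score += pwm[offset][sequence[start + offset]]
        if score >= threshold:
            hits.append((start, score))
    return hits


# Toy 3-column matrix that favours the motif "TAG" (illustrative values only).
toy_pwm = [
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 2.0},
    {"A": 2.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 2.0, "T": -1.0},
]
hits = scan_pwm("CCTAGGTAGA", toy_pwm, threshold=5.0)
```

    Here the two occurrences of "TAG" are reported at positions 2 and 6, each with score 6.0.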

    47 Fuzzy ARTMAP Prediction of Biological Activities for Potential HIV-1 Protease Inhibitors Using a Small Molecular Data Set

    Obtaining satisfactory results with neural networks depends on the availability of large data samples. The use of small

    training sets generally reduces performance. Most classical Quantitative Structure-Activity Relationship (QSAR) studies for

    a specific enzyme system have been performed on small data sets. We focus on the neuro-fuzzy prediction of biological

    activities of HIV-1 protease inhibitory compounds when inferring from small training sets. We propose two computational

    intelligence prediction techniques which are suitable for small training sets, at the expense of some computational

    overhead. Both techniques are based on the FAMR model. The FAMR [1] is a Fuzzy ARTMAP (FAM) incremental learning

    system used for classification and probability estimation. During the learning phase, each sample pair is assigned a

    relevance factor proportional to the importance of that pair. The two proposed algorithms in this paper are: 1) The GA-

    FAMR algorithm, which is new, consists of two stages: a) During the first stage, we use a genetic algorithm (GA) to optimize

    the relevances assigned to the training data. This improves the generalization capability of the FAMR. b) In the second

    stage, we use the optimized relevances to train the FAMR. 2) The Ordered FAMR is derived from a known algorithm. Instead

    of optimizing relevances, it optimizes the order of data presentation using the algorithm of Dagher et al. [2], [3]. In our

    experiments, we compare these two algorithms with an algorithm not based on the FAM, the FS-GA-FNN introduced in [4],

    [5]. We conclude that when inferring from small training sets, both techniques are efficient, in terms of generalization

    capability and execution time. The computational overhead introduced is compensated by better accuracy. Finally, the

    proposed techniques are used to predict the biological activities of newly designed potential HIV-1 protease inhibitors.

    48 Genetic Networks and Soft Computing

    The analysis of gene regulatory networks provides enormous information on various fundamental cellular processes

    involving growth, development, hormone secretion, and cellular communication. Their extraction from available gene

    expression profiles is a challenging problem. Such reverse engineering of genetic networks offers insight into cellular

    activity toward prediction of adverse effects of new drugs or possible identification of new drug targets. Tasks such as

    classification, clustering, and feature selection enable efficient mining of knowledge about gene interactions in the form of

    networks. It is known that biological data is prone to different kinds of noise and ambiguity. Soft computing tools, such as

    fuzzy sets, evolutionary strategies, and neurocomputing, have been found to be helpful in providing low-cost, acceptable

    solutions in the presence of various types of uncertainties. In this paper, we survey the role of these soft methodologies

    and their hybridizations, for the purpose of generating genetic networks.

    49 Graph Comparison by Log-Odds Score Matrices with Application to Protein Topology Analysis

    A TOPS diagram is a simplified description of the topology of a protein using a graph where nodes are α-helices and

    β-strands, and edges correspond to chirality relations and parallel or antiparallel bonds between strands. We present a

    matching algorithm between two TOPS diagrams where the likelihood of a match is measured according to previously

    known matches between complete 3D structures. This totally new 3D training is recorded on transition matrices that count

    the likelihood that a given TOPS feature, or combination thereof, is replaced by another feature on homologs. The new

    algorithm outperforms existing ones on a benchmark database. Some biologically significant examples are discussed as

    well. The method can be used whenever frequencies of edge relationship matches are known, as is the case for several

    biopolymer structures.
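
    A log-odds score compares how often two features are observed matched with how often they would be matched by chance. The sketch below uses the standard construction, log2 of the observed pair frequency over the product of background frequencies. Whether the paper normalises its transition matrices in exactly this way is an assumption, and the function name and counts are hypothetical.

```python
import math
from collections import Counter


def log_odds_matrix(pair_counts):
    """Build log2-odds scores from counts of matched feature pairs.

    pair_counts maps an unordered pair (a, b), stored with a <= b, to how
    often feature a was matched with feature b in the training data.
    """
    total = sum(pair_counts.values())
    # Background frequency of each feature: every pair contributes two ends.
    ends = Counter()
    for (a, b), count in pair_counts.items():
        ends[a] += count
        ends[b] += count
    background = {f: c / (2 * total) for f, c in ends.items()}

    scores = {}
    for (a, b), count in pair_counts.items():
        observed = count / total
        # An unordered pair of distinct features can arise in two orders.
        expected = background[a] * background[b] * (1 if a == b else 2)
        scores[(a, b)] = math.log2(observed / expected)
    return scores


# Hypothetical counts for two edge types, "P" (parallel) and "A" (antiparallel).
toy_counts = {("A", "A"): 40, ("A", "P"): 20, ("P", "P"): 40}
scores = log_odds_matrix(toy_counts)
```

    With these counts, matching like with like scores positive and matching "A" with "P" scores negative, because the mixed pair is observed less often than chance would predict.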

    50 ICGA-PSO-ELM Approach for Accurate Multiclass Cancer Classification Resulting in Reduced Gene Sets in Which Genes Encoding Secreted Proteins Are Highly Represented

    A combination of Integer-Coded Genetic Algorithm (ICGA) and Particle Swarm Optimization (PSO), coupled with the neural-

    network-based Extreme Learning Machine (ELM), is used for gene selection and cancer classification. ICGA is used with

    PSO-ELM to select an optimal set of genes, which is then used to build a classifier to develop an algorithm

    (ICGA-PSO-ELM) that can handle sparse data and sample imbalance. We evaluate the performance of ICGA-PSO-ELM and

    compare our results with existing methods in the literature. An investigation into the functions of the selected genes, using

    a systems biology approach, revealed that many of the identified genes are involved in cell signaling and proliferation. An

    analysis of these gene sets shows a larger representation of genes that encode secreted proteins than found in randomly

    selected gene sets. Secreted proteins constitute a major means by which cells interact with their surroundings. Mounting

    biological evidence has identified the tumor microenvironment as a critical factor that determines tumor survival and

    growth. Thus, the genes identified by this study that encode secreted proteins might provide important insights into the

    nature of the critical biological features in the microenvironment of each tumor type that allow these cells to thrive and

    proliferate.

    51 Identifiability of Two-Tree Mixtures for Group-Based Models

    Phylogenetic data arising on two possibly different tree topologies might be mixed through several biological mechanisms,

    including incomplete lineage sorting or horizontal gene transfer in the case of different topologies, or simply different

    substitution processes on characters in the case of the same topology. Recent work on a 2-state symmetric model of

    character change showed that for 4 taxa, such a mixture model has nonidentifiable parameters, and thus, it is theoretically

    impossible to determine the two tree topologies from any amount of data under such circumstances. Here, the question of

    identifiability is investigated for two-tree mixtures of the 4-state group-based models, which are more relevant to DNA

    sequence data. Using algebraic techniques, we show that the tree parameters are identifiable for the JC and K2P models.

    We also prove that generic substitution parameters for the JC mixture models are identifiable, and for the K2P and K3P

    models obtain generic identifiability results for mixtures on the same tree. This indicates that the full phylogenetic signal

    remains in such mixtures, and the 2-state symmetric result is thus a misleading guide to the behavior of other models.

    52 Identification and Modeling of Genes with Diurnal Oscillations from Microarray Time Series Data

    Behavior of living organisms is strongly modulated by the day and night cycle giving rise to a cyclic pattern of activities.

    Such a pattern helps the organisms to coordinate their activities and maintain a balance between what could be performed

    during the day and what could be relegated to the night. This cyclic pattern, called the Circadian Rhythm, is a

    biological phenomenon observed in a large number of organisms. In this paper, our goal is to analyze transcriptome data

    from Cyanothece for the purpose of discovering genes whose expressions are rhythmic. We cluster these genes into

    groups that are close in terms of their phases and show that genes from a specific metabolic functional category are tightly

    clustered, indicating perhaps a preferred time of the day/night when the organism performs this function. The proposed

    analysis is applied to two sets of microarray experiments performed under varying incident light patterns. Subsequently,

    we propose a model with a network of three phase oscillators together with a central master clock and show that it closely

    approximates a set of circadian-controlled genes.
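
    The abstract does not give the model equations, so the following is only a generic sketch of phase oscillators entrained by a master clock. It assumes a Kuramoto-style sinusoidal coupling, a 24-hour master period, and explicit Euler integration. The function name, coupling strength, and phase offsets are all hypothetical.

```python
import math


def simulate_entrainment(offsets, period=24.0, coupling=1.0, dt=0.01, steps=2000):
    """Euler-integrate phase oscillators pulled toward a master clock.

    Each oscillator i follows
        dtheta_i/dt = omega + coupling * sin(theta_master - theta_i - offsets[i])
    so that it settles at a fixed phase lag offsets[i] behind the master.
    """
    omega = 2.0 * math.pi / period
    master = 0.0
    phases = [0.0 for _ in offsets]
    for _ in range(steps):
        # Derivatives are evaluated with the current master phase (explicit Euler).
        rates = [
            omega + coupling * math.sin(master - theta - lag)
            for theta, lag in zip(phases, offsets)
        ]
        phases = [theta + dt * rate for theta, rate in zip(phases, rates)]
        master += dt * omega
    # Report how far each oscillator trails the master clock, in radians.
    return [master - theta for theta in phases]


# Three oscillators with assumed preferred lags of 0.5, 1.5 and 2.5 radians.
lags = simulate_entrainment([0.5, 1.5, 2.5])
```

    In this toy setting each oscillator locks to the master clock at its own lag, which mirrors the idea of gene groups peaking at different preferred times of the day.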

    53 Identifying Relevant Data for a Biological Database: Handcrafted Rules versus Machine Learning

    With well over 1,000 specialized biological databases in use today, the task of automatically identifying novel, relevant data

    for such databases is increasingly important. In this paper, we describe practical machine learning approaches for

    identifying MEDLINE documents and Swiss-Prot/TrEMBL protein records, for incorporation into a specialized biological

    database of transport proteins named TCDB. We show that both learning approaches outperform rules created by hand by

    a human expert. As one of the first case studies involving two different approaches to updating a deployed database, both

    the methods compared and the results will be of interest to curators of many specialized databases.

    54 Image-Based Surface Matching Algorithm Oriented to Structural Biology

    Emerging technologies for structure matching based on surface descriptions have demonstrated their effectiveness in

    many research fields. In particular, they can be successfully applied to in silico studies of structural biology. Protein

    activities, in fact, are related to the external characteristics of these macromolecules and the ability to match surfaces can

    be important to infer information about their possible functions and interactions. In this work, we present a surface-

    matching algorithm, based on encoding the outer morphology of proteins in images of local description, which allows us to

    establish point-to-point correlations among macromolecular surfaces using image-processing functions. Discarding

    methods relying on biological analysis of atomic structures and expensive computational approaches based on energetic

    studies, this algorithm can successfully be used for macromolecular recognition by employing local surface features.

    Results demonstrate that the proposed algorithm can be employed both to identify surface similarities in the context of

    macromolecular functional analysis and to screen possible protein interactions to predict pairing capability.

    55 Improving the Computational Efficiency of Recursive Cluster Elimination for Gene Selection
