Decision support based on genomics: integration of data- and knowledge-driven reasoning

Int. J. Biomedical Engineering and Technology, Vol. x, No. x, xxxx 1

Copyright © 200x Inderscience Enterprises Ltd.

S. Sfakianakis* The Foundation for Research and Technology-Hellas, Institute of Computer Science, P.O. Box 1385, GR-71110, Heraklion, Crete, Greece E-mail: [email protected] *Corresponding author

M. Blazantonakis, I. Dimou and M. Zervakis The Technical University of Crete, Department of Electronics and Computer Engineering, Chania 73100, Crete, Greece E-mail: [email protected] E-mail: [email protected] E-mail: [email protected]

M. Tsiknakis and G. Potamias The Foundation for Research and Technology-Hellas, Institute of Computer Science, P.O. Box 1385, GR-71110, Heraklion, Crete, Greece E-mail: [email protected] E-mail: [email protected]

D. Kafetzopoulos

The Foundation for Research and Technology-Hellas, Institute of Molecular Biology and Biotechnology, P.O. Box 1385, GR-71110, Heraklion, Crete, Greece E-mail: [email protected]

D. Lowe The Neural Computing Research Group, Aston University, Aston Triangle, Birmingham B4 7ET, UK E-mail: [email protected]

Abstract: The breadth and depth of available clinico-genomic information present an enormous opportunity for improving our ability to study disease mechanisms and to meet the needs of individualised medicine. A difficulty arises when the results are to be transferred ‘from bench to bedside’. Diversity of methods is one of the causes, but the most critical one relates to our inability to share and jointly exploit data and tools. This paper presents a perspective on the current state of the art in the analysis of clinico-genomic data and its relevance to medical decision support. It is an attempt to investigate the issues related to data and knowledge integration.

Keywords: integration of data and knowledge; clinico-genomic knowledge discovery; bioinformatics; personalised medicine; decision support.

Reference to this paper should be made as follows: Sfakianakis, S., Blazantonakis, M., Dimou, I., Zervakis, M., Tsiknakis, M., Potamias, G., Kafetzopoulos, D. and Lowe, D. (xxxx) ‘Decision support based on genomics: integration of data- and knowledge-driven reasoning’, Int. J. Biomedical Engineering and Technology, Vol. x, No. x, pp.xxx–xxx.

Biographical notes: S. Sfakianakis received his BSc in Computer Science in 1995 and his MSc with highest distinction in Advanced Information Systems from the University of Athens in 1998. In January 2000, he joined the ICS-FORTH's Biomedical Informatics Laboratory (BMI). His interests include the semantic integration and composition of services in state-of-the-art computational environments such as the Grid and the Semantic Web and the employment of statistical and computational approaches based on machine learning and data-mining techniques for the analysis of high-throughput experiments, such as gene-expression profiling and genomic sequencing. He is actively involved in the ACGT-integrated project.

M. Blazantonakis is a PhD Candidate in the Department of Electronics and Computer Engineering, Technical University of Crete. His research interests include applications of pattern recognition in bioinformatics.

I. Dimou is a PhD Candidate in the Department of Electronics and Computer Engineering, Technical University of Crete. His research interests include applications of pattern recognition in biomedical engineering.

M. Zervakis holds a PhD from the Department of Electrical Engineering, University of Toronto, since 1990. He joined the Technical University of Crete in January 1995, where he is currently Full Professor at the Department of Electronic and Computer Engineering. He is the Director of the Digital Image and Signal Processing Laboratory (DISPLAY) and is involved in research on modern aspects of signal processing, including estimation and constrained optimisation, multi-channel and multi-band signal processing, wavelet analysis for data/image processing and compression, biomedical imaging applications, neural networks and fuzzy logic in automation applications.

Manolis Tsiknakis received a BEng in Electronic Engineering, an MSc in Microprocessor Engineering, and a PhD in Systems Engineering from the University of Bradford, UK. In 1992, he joined FORTH-ICS where he is currently a Principal Researcher in many collaborative R&D projects, Head of the Centre of eHealth Technologies, and is currently the Scientific Coordinator of the ACGT-integrated project. He is the initiator and Chair of the ERCIM Biomedical Informatics Working Group. His current research interests are in the areas of biomedical informatics, component-based software engineering, information integration, ambient intelligence in eHealth and mHealth service platforms and signal processing and analysis. He is a member of IEEE and ACM.

George Potamias received a BSc in Mathematics, and a PhD in Artificial Intelligence from the University of Patras, Greece. In 1992, he joined FORTH-ICS where he is currently a Principal Researcher leading the data-mining and knowledge discovery activity of the biomedical informatics laboratory at FORTH-ICS. His R&D interests include the development and customisation of data-mining algorithms, tools and systems, and their utilisation in the biomedical domain. He is an affiliate member of IEEE.

Dimitris Kafetzopoulos studied Biology at the University of Thessaloniki and Biochemistry at the Graduate School of the University of Toronto. In 1994, he was awarded the Doctorate Degree in Applied Biology and Biotechnology from the Department of Biology of the University of Crete for research in enzyme conversion of cell wall polysaccharides. Since 1997, he has been a Researcher at the FORTH-IMBB, leading the research group of Post-Genomic Applications. His research interests include drug development methodologies, molecular classification using DNA microarrays and multianalyte approaches in genotyping. He has participated in and coordinated several national and European research projects, including research contracts with the pharmaceutical industry, multidisciplinary research and technology foresight projects.

D. Lowe is Professor of Information Engineering at Aston University and a leading innovator in the research and development of applications for neural networks, forecasting and pattern recognition. While at the Royal Signals and Radar Establishment in Malvern, UK, Professor Lowe specialised in automatic speech recognition and generic pattern processing. In 1993, he joined Aston University as a founding Professor in the Neural Computing Research Group. Professor Lowe is a former Chairman of the IEE International Conference on Artificial Neural Networks, and remains active on various committees of international neural network, biomedical and financial conferences. He is also a member of several strategic government advisory committees on technology development.

1 Introduction

1.1 Individualised medicine in the post-genomic era

Biomedical research has entered a new phase. The completion of the Human Genome Project sparked the development of many new tools for today’s biomedical researcher to use in finding the mechanism behind disease.

Coupled with the sequencing and annotation of many model organisms, our ability to risk-stratify patients using a collection of phenotypic and genotypic information may come to fruition in the foreseeable future. Whereas the goal is clear, the path to such discoveries has been fraught with roadblocks in terms of technical, scientific, and sociological challenges. The deluge of data that large-scale sequencing, transcriptomic and proteomic studies have produced to date is a case in point. In addition to the sheer volume, data collected using a variety of laboratory technologies and techniques are often published without the background information (method of capture, sample preparation, statistical techniques applied) that is needed to reproduce results.

This data problem has pushed the biological community to partition and compartmentalise its data for easy digestion and maintenance. Whereas this approach worked in the past for simple systems containing a relatively small number of interactions, modelled by a small number of data sets, bioinformaticians are finding it difficult to model more complex systems. The simplicity and digestibility of the compartments described earlier have made it almost impossible to cross compartmental boundaries without consulting an expert.

To alleviate this burden, bioinformaticians are starting to apply sophisticated computational approaches in the areas of statistics, data mining, signal processing and artificial intelligence to discover relationships between such compartments. However, the community is quickly discovering glaring inconsistencies in language, methodology and computational models used to describe a particular organism, pathway, interaction, annotation, and so forth. The repercussions of the compartmental approach have produced a bottleneck in the road to discovery.

To make this more concrete, computational approaches to data analysis and discovery typically rely on formalism in terms of syntax, context, and format to perform reproducible and consistent experiments – the backbone of hypothesis-driven science. These formal definitions are severely lacking in the biological sciences. They will remain a burden to the process of biological discovery unless the biological community takes action.

Much of the genomic data of clinical relevance generated so far are in a format that is inappropriate for diagnostic testing. Very large epidemiological population samples followed prospectively (over a period of years) and characterised for their biomarker and genetic variation will be necessary to demonstrate the clinical utility of these tools.

Diagnostic medicine that includes predisposition testing, early detection, individualised therapy and therapeutic monitoring is neither systematically applied nor well taught in the current healthcare system. Its implementation will require not just clear data demonstrating its benefits, but also demand by patients and acceptance by healthcare professionals. This will not come quickly. This approach is also more likely to put particular financial pressure on different components of the healthcare system. The opportunities for clinical genetics to become a mainstream component of clinical medicine are now apparent. This move to the clinic appears to be inevitable (Bell, 2004).

The vision requires common standards of data storage at each level of investigation, new frameworks for integration and cross-referencing terms and their biological contexts (‘ontologies’) between disparate types of data, and new tools to analyse and mine data at all levels. The benefits will be numerous: Quicker routes to identifying patients’ individual characteristics that make one treatment more appropriate than another, easier integration of genomics research into clinical trials, and much readier access by basic molecular and cell biologists to the early lessons that can be drawn from even a few patients, as well as from large-scale, randomised clinical trials (Nature, 2004).


Individuals respond differently to drugs and sometimes the effects are unpredictable. Differences in DNA that alter the expression or function of proteins that are targeted by drugs can contribute significantly to variation in the responses of individuals. Many of the genes examined in early studies were linked to highly penetrant, single-gene traits, but future advances hinge on the more difficult challenge of elucidating multi-gene determinants of drug response. This intersection of genomics and medicine has the potential to yield a new set of molecular diagnostic tools that can be used to individualise and optimise drug therapy (Evans and Relling, 2004). In the same context, it is now time to use genomic data and genomics technology developed to generate them (such as the high-throughput technology) to combat major diseases (mainly their molecular mechanisms), such as cancer. To accelerate our understanding of the molecular mechanisms of these diseases, and to produce targeted therapies, further basic mutational and functional genomic information is required, with the collaboration of the academic and commercial (e.g., pharmaceutical companies) sectors (Strausberg et al., 2004).

1.2 Powerful new observational techniques that transform science

Breakthroughs in genomics, proteomics, instrumentation and related technologies have created unprecedented abilities to observe, collect and generate data. These advances are transforming the life sciences from small-scale, hypothesis-driven experimental sciences into large-scale, data- and discovery-driven knowledge factories.

Some of the (relatively) recent advances in genomics research that relate to the vision of genome-enabled individualised medicine are presented here:

• A large number of genomes are fully sequenced and public. Genomic databases are growing exponentially and now contain tens of higher organisms, hundreds of model and economically important species, thousands of microbial pathogens and almost all important viral genomes. Comparative genomics allows the identification of conserved structural and regulatory elements within the genomes.

• All kinds of proteins are deduced from the various genome projects. Within them, conserved or variant regions, functional and structural elements, features and domains are identified. The continuum of life forms becomes clearer and the differences are measurable. The network of molecular interactions and complex biological processes becomes available for modelling and in silico experimentation.

• Gene-expression profiles allow clear identification, monitoring and classification of various organisms (e.g., pathogen strains), different tissues and tumours, health and disease states. Profiling highlights specific macromolecules and metabolic pathways (e.g., surface antigens) that could allow targeting of drugs or therapies.

• High-throughput screenings of hundreds of targets are generating new functional coordinates within the chemical space. Classification of chemical compounds and targets into functional groups, identification of relations between distant targets and drug effects and knowledge visualisation for chemical structures and properties become the main tools for knowledge-based drug discovery.

• A variety of biosensors that allow simultaneous monitoring of several metabolites and biological signals become widely available, portable and distributed. Additionally, molecular-imaging techniques and other functional imaging methods, such as PET and functional-MRI, are assuming new, important roles in the molecular-genetic imaging of cell metabolic states and in the in vivo monitoring of protein interactions and gene expression.

• Functional genomics and genetic studies elucidate the function of unknown genes, mostly by the use of holistic post-genomic approaches. The genetic determinants of multigenic diseases are analysed and evaluated. Pharmacogenetics identifies the genetic basis of drug efficacy and adverse effects. Pharmacogenomic information from clinical trials is generating the basis of the future “targeted pharmacotherapy”: the right drugs in the right doses to the right patient.

• Correlations between genotypes, gene regulatory networks and biochemical pathways allow the intervention and metabolic readjustments for combating complex diseases.

• Targeting and quantification of disease-specific proteins with metabolomic and proteomic profiling approaches is a recent and promising approach to the discovery and validation of disease biomarkers. The human plasma proteome holds the promise of a revolution in disease diagnosis and therapy. One major breakthrough should come from the detection of multiprotein disease biomarkers including isoforms. The EU project LOCCANDIA1 aims towards the integration of a full proteomics analysis chain. It includes an innovative patented lab-on-chip detection apparatus, and targets the early pancreatic cancer diagnosis (Paulus et al., 2007). It also provides an integrated clinico-proteomics information technology platform to ease the integration of clinical and mass-spec data and their intelligent analysis (Kalaitzakis et al., 2008).

The above-mentioned achievements portray the potential of genomic and individualised medicine. Genomic medicine aims to explain life and disease in terms of the presence and regulation of molecular entities. Individualised medicine applies genotypic knowledge to identify predisposition to disease and develops therapies adapted to the genotype of a patient. The former is driven towards gaining knowledge about the disease, while the latter tries to identify and clinically utilise individual genetic information.

In addition, we are now approaching a threshold where advances in knowledge of genomic processes and impact on medical healthcare will soon be incorporated into this picture. This raises additional issues such as the possibility of personalised drug and therapy design, how to combine micro and macro-levels of biomedical data for coherent decision-making, and ethical dilemmas of who should have access to such personalised information, and for what purposes.

Current trends indicate that exploiting developments in technology for individualised healthcare will require commensurate developments in our ability to process this data deluge and refine it into manageable streams of information and knowledge capable of human interpretation. This need for integration is to some extent clear in the case of complex, multifactorial diseases/traits, such as obesity, diabetes, hypertension, schizophrenia (and other diseases of the nervous system, including Parkinson’s and Alzheimer’s) and cancer.

It is obvious that a complete knowledge of the underlying biological processes requires the integration and analysis of the massive amounts of data being collected from current genomic, proteomic and metabolomic platforms (Hood et al., 2004). But it is not just the multiplicity of the factors (and cellular levels) contributing to a particular disease framework that calls for a systematic approach to the problem. Even for Mendelian genetic disorders, nearly all of which have now been correlated with a specific gene or set of genes (Hoh and Ott, 2004) owing to remarkable advances in gene mapping and bioinformatics, the relationship between genotype and phenotype is not as simple as expected (or currently treated) (Lai and Klapa, 2004).

In this paper, we present a perspective on the current state of the art in automated genomic data analysis and its relevance to medical decision support based on population and personal data. In Section 2.1, we provide a set of taxonomies of data-processing approaches, together with our own suggestions for how these techniques can be viewed as part of a common hierarchy. We also indicate failings in our methods and future challenges that need to be addressed if we are to capitalise on augmenting diverse patient information. In Section 3, we survey the existing and emerging technologies and methodologies for the integration of data and knowledge and we explore the challenges we face in such endeavours. Finally, Section 4 concludes this paper.

2 Data mining for genomics knowledge discovery

2.1 Data mining: a key-technology for knowledge discovery

The tasks posed are complex and most of the time difficult to tackle. A key technology relates to data-mining systems, methods and tools. Data mining is a step in the process of generating knowledge in databases. It includes techniques for querying databases, online analytical processing, and machine-learning algorithms. In the medical area, many applications have been created for decision support to address issues such as image and signal analysis and outlining clinical prognoses for patient conditions. In biology, efforts have been centred on research issues such as the prediction of protein structures and drug studies.

Both types of predictive exercises present considerable challenges for future research. Text mining is a discipline that aims to extract data, information or knowledge from texts. Finding information in biomedical databases using text mining and information-retrieval techniques is expected to leverage a substantial amount of biomedical information that has escaped analysis until now.

Some of the key biomedical tasks to be tackled with data mining are as follows.

• Genome Database Mining. Genome database mining is an emerging technology, which is based on extracting useful information from genome databases. One of the main tasks is the computational annotation of genomes that consists of two sequential processes (Bork et al., 1998; Rouze et al., 1999):

• Structural annotation – refers to the identification of hypothetical genes termed Open Reading Frames (ORFs) in a DNA sequence using computational gene discovery algorithms.

• Functional annotation – refers to the assignment of functions to the predicted genes using sequence-similarity searches against other genes of known function.

• Computational/Mining for Gene Discovery. Locating genes on a genome is a complex task. The regions that code for proteins (exons) are only a tiny fraction of the genome. These regions can be predicted making use of the biological properties and the particular statistical composition that characterise these regions. Computational gene discovery techniques are able to find these dispersed coding exons in a sequence and to provide the best tentative gene models. Exon recognition algorithms exhibit performance trade-offs between increasing sensitivity – ability to detect true positives, and decreasing specificity – ability to exclude false positives (http://www.nslij-genetics.org/gene/).

• Sequence-Similarity Searching. Sequence-similarity searching is an important methodology in computational molecular biology. Initial clues to understanding the structure or function of a molecular sequence arise from similarity to other molecules that have been previously studied. Sequence database searches reveal biologically significant sequence relationships and suggest future investigation strategies (http://www.ebi.ac.uk/Tools/similarity.html). Sequence similarity searches are mainly exploited for Comparative Genomics (http://www.ornl.gov/sci/techresources/Human_Genome/faq/compgen.shtml) – the analysis and comparison of genomes from different species. The purpose is to gain a better understanding of how species have evolved and to determine the function of genes and non-coding regions of the genome. Comparative genomics involves the use of computer programmes that can line up multiple genomes and look for regions of similarity among them. Some of these sequence-similarity tools are accessible to the public over the internet. One of the most widely used is BLAST (http://www.ncbi.nlm.nih.gov/BLAST/), which is available from the National Center for Biotechnology Information. BLAST is a set of programmes designed to perform similarity searches on all available sequence data.

• Gene-Expression Mining is defined as the use of quantitative messenger RNA (mRNA)-level measurements of gene expression to characterise biological processes and elucidate the mechanisms of gene transcription. Changes in gene expression under the influence of drug or disease perturbations can be studied. The identification of differential gene expression associated with biological processes is a central research problem in molecular genetics. High-throughput gene expression assays enable the simultaneous monitoring of thousands of genes in parallel and generate vast amounts of gene-expression data. The large-scale investigation of gene expression attaches functional activity to structural genetic maps and therefore is an essential milestone in the paradigm shift from static structural genomics to dynamic functional genomics. Gene-expression database mining is used to identify intrinsic patterns and relationships in gene-expression data. The identification of patterns in complex gene expression data sets provides two benefits:


• Generation of insight into gene transcription conditions.

• Characterisation of multiple gene-expression profiles in complex biological processes, e.g., pathological states.

Data visualisation is used to display snapshots of cluster analysis results generated from large gene-expression data sets (http://www.computational-genomics.net/genomics_9.html).

• Proteomics and Data Mining. The study of the proteome is important because proteins represent the actual functional molecules in the cell. Proteomics covers a number of different aspects of protein function, including the following:

• Structural proteomics – the large-scale analysis of protein structures.

• Expression proteomics – the large-scale analysis of protein expression; this can help to identify the main proteins found in a particular sample and proteins differentially expressed in related samples, such as diseased vs. healthy tissue.

• Interaction proteomics – the large-scale analysis of protein interactions; the characterisation of protein–protein interactions helps to determine protein functions and can also show how proteins assemble in larger complexes (http://www.wellcome.ac.uk/en/genome/thegenome/hg03b002.html).

• Metabolomics and Data Mining. Proteomic analysis methods such as mass spectrometry allow the abundance and distribution of many proteins to be determined simultaneously. Mass-spectral raw data are pre-processed to convert them to a form usable by data-mining algorithms. Here, the mining task is that of feature extraction and classification, i.e., peak detection and peak calibration, and then clustering – clusters of (detected) mass-spectra peaks become the extracted features. Once the mass spectral data have been pre-processed and their features extracted, one can proceed to biomarker discovery using methods such as support vector machines, neural networks, decision trees and more (Hilario et al., 2004).

• SNP Identification and Genotyping. Discovery and characterisation of Single Nucleotide Polymorphisms (SNPs), which are DNA sequence variations that occur when a single nucleotide in the genome sequence is altered, are multi-step processes, and systematic approaches were centred on the emerging sequence of the human genome (Kwok and Gu, 1999). Most current SNP analysis methods rely on Polymerase Chain Reaction (PCR) amplification of the sequence of interest, which is then tested for the presence or absence of the polymorphism using an assay system. The volume of known genetic variations lends itself well to an informatics approach. Bioinformaticians have become very good at functional inference methods derived from functional and structural genomics (Mooney, 2005). Today, the primary database of polymorphisms is dbSNP (http://www.ncbi.nlm.nih.gov/snp/), which currently contains more than 5,000,000 validated human SNPs. Disease-associated polymorphisms are available from databases such as OMIM (Hamosh et al., 2000), Swiss-Prot (Boeckmann et al., 2003), the Human Gene Mutation Database (HGMD; Stenson et al., 2003) and HGVBase (Fredman et al., 2004). Together, these databases represent more than 40,000 non-synonymous, synonymous and non-coding polymorphisms.


• Prediction of Functional SNPs. Much effort has been invested in predicting the function of non-synonymous (ns) mutations, based on evidence that regulatory and coding SNPs are most likely to affect disease and on the wide availability of functional data on proteins. Researchers have taken several approaches to predict the function of nsSNPs. Almost all methods use categorical, discrete or continuous-valued features to predict a deleterious mutation. These features include sequence-based properties, physical properties of the wild-type and mutant amino acids, protein structural properties and evolutionary properties derived from a phylogeny or sequence alignment. To classify whether a mutation will be tolerated, a training set is usually constructed of mutations known to be deleterious. Currently, the state-of-the-art classification tools are based on SVMs or decision trees and the best features for classification are based on structural and evolutionary properties. Structurally, solvent accessibility has consistently been shown to be important in determining whether a mutation will be tolerated (Chasman and Adams, 2001; Saunders and Baker, 2002; Sunyaev et al., 2001; Ramensky et al., 2002). Evolutionarily, inferring non-tolerated mutations with a position-specific scoring matrix (PSSM) generally performs better than positional conservation approaches (Saunders and Baker, 2002).

• Prediction of Functional Non-Coding variations. Non-coding variation has not received the attention that non-synonymous SNPs and disease-associated mutations have. This is due to difficulties in collecting functional variation information, not to lack of importance. Understanding how variation affects gene expression has been called one of the key challenges in human genetics (Hudson, 2003). The challenge arises from the difficulty in separating regulatory variation (cis-acting factors) from the cellular environment and variation on other chromosomes (trans-acting factors) and the environment. A recent review succinctly summarises the efforts to understand this challenging problem (Knight, 2004). A key problem for bioinformatics is developing methods to predict variation that is more likely to affect expression levels. Cowles et al. (2002) addressed the problem of removing trans-acting factors by focusing their studies on the expression levels in an F1 hybrid mouse derived from two inbred mouse strains. This allowed them to remove trans-regulation from the results. Pastinen et al. (2004) examined 129 genes to identify 23 genes that had allele-dependent expression levels. Additionally, Wittkopp et al. (2004) compared differences in gene expression between closely related Drosophila species and found that most of the genes with significant expression-level differences had cis-regulatory differences. Another resource is rSNP_Guide (Ponomarenko et al., 2003; http://wwwmgs.bionet.nsc.ru/mgs/systems/rsnp/), which contains annotations of SNPs based on potential effects to regulation. When well-annotated databases begin to take form, regulatory-relevant polymorphism classification will become possible. PupaSNP Finder (Conde et al., 2004; http://pupasnp.bioinfo.cnio.es/) is a tool for identifying SNPs that could have an effect on transcription. 


• Pharmacogenomics. The study of how genetic differences influence the variability in patients’ responses to drugs holds great promise for the optimisation of new drug development and the individualisation of clinical therapeutics. It is predicted that the use of pharmacogenetics in pre-marketing clinical trials will enable a greater percentage of those trials to produce significant results, because patients whose genetic profile suggests that the drug will be harmful or ineffective for them will be intentionally excluded (Pfost et al., 2000). In addition, physicians will be able to use genetic testing to predict the patient’s response to a drug, which can aid in the individual dosing of medications or the avoidance of side effects (Roses, 2002; Pfost et al., 2000; Chakravarti, 2001).

With the advancement of molecular biology technology, new whole-genome approaches for the study of genetic variations are coming into play. Copy Number Variation (CNV), a form of structural variation in the genome that refers to differences in the number of copies of a particular region in the genome, is one of them (Sebat et al., 2004; Iafrate et al., 2004). Comparative Genomic Hybridisation arrays (aCGH) present a powerful tool for the assessment of CNVs within any given DNA sample. The technique is derived from the concept of conventional CGH, which has contributed greatly to the molecular characterisation of both somatic and constitutional genomic DNA mutations in the last decade. Array-CGH technology has proved to be very useful in the characterisation and classification of various disease genotypes (Snijders et al., 2005).

3 Integration of data and knowledge in genomics

The vast wealth of data produced by state-of-the-art high-throughput technologies like DNA microarrays (Kuo et al., 2004), mass spectrometry, and gel electrophoresis presents new challenges for the data analysis and knowledge discovery processes. Furthermore, an additional challenge is the maintenance, storage, and integration of these massive data, so that novel data-mining methodologies can be applied. The requirements for biological data management are very demanding because of the size and complexity of the data, their quality properties (missing values or noisy data are frequent), and the inherent heterogeneity of the domain. A variety of data types exist, ranging from genomic and proteomic sequences of nucleic and amino acids, which are categorical, to gene expression and mass spectrometry data, which are numeric. In this setting, the need for integration is both vertical and horizontal: vertical integration is required for bridging data sources like UniProt (Apweiler et al., 2004), GenBank (Benson et al., 2005) and Pfam (Bateman et al., 2002) that manage the same ‘type’ of data, whereas horizontal integration is required so that the multilevel modelling of an organism, from molecules to cells, tissues, organs, organ systems, etc., is amenable to analysis (Figure 1).


Figure 1 Integration of biological data, methods and tools should be achieved at all levels, from the molecular to the system and population levels

From an IT perspective, the integration of heterogeneous data has been extensively studied and a number of approaches, which are also applicable to the domain of bioinformatics, have been proposed, with each one exhibiting different qualities (Hernandez and Kambhampati, 2004; Louie et al., 2007):

• In the navigational or link-based integration approach, integration is achieved through shared common references and ids in the sources or cross references that are automatically generated, e.g., by similarity tools like BLAST. The supported interaction with the final system then follows the “browsing” paradigm where the links between the sources guide the users to extend their query and find the ultimate result. Examples of such systems are SRS and Entrez (Maglott et al., 2005).

• In the data warehouse integration approach, data are imported from the different data sources and persisted in a single data store with a unified schema. In order for the diverse data to be aligned with this unified schema, a transformation pre-processing step is required. Atlas is one such data warehouse integrating in a single relational data store a multitude of the existing biological databases like GenBank, RefSeq, UniProt, Human Protein Reference Database (HPRD), Biomolecular Interaction Network Database (BIND), Molecular Interactions Database (MINT), IntAct, NCBI Taxonomy, Gene Ontology (GO), Online Mendelian Inheritance in Man (OMIM), etc. (Shah et al., 2005).

• In the mediator-based integration, no local copies of the data sources are maintained. The different data sources are accessed remotely and queries sent to the mediator are mapped and adapted to the local schema of each database. Examples of mediator-based integration systems include TAMBIS (Stevens et al., 1999) and BioMediator.
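As a rough illustration of the mediator-based approach, the following Python sketch rewrites a query expressed against a global schema into the local field names of two sources and merges the answers. It is a minimal sketch only: the source names, field names and records are hypothetical and do not reflect the actual schemas or interfaces of TAMBIS, BioMediator or any real database.

```python
# Minimal sketch of mediator-based integration (all names are hypothetical).
# Each wrapper exposes a local source through its own field names; the mediator
# forwards a global-schema query to every wrapper and merges the answers.

class SourceWrapper:
    def __init__(self, name, records, field_map):
        self.name = name
        self.records = records        # stands in for a remotely accessed database
        self.field_map = field_map    # global field name -> local field name

    def query(self, global_field, value):
        local_field = self.field_map.get(global_field)
        if local_field is None:       # this source cannot answer the query
            return []
        inverse = {local: glob for glob, local in self.field_map.items()}
        hits = [r for r in self.records if r.get(local_field) == value]
        # Translate the matching records back into the global schema.
        return [{inverse[k]: v for k, v in r.items() if k in inverse} for r in hits]


class Mediator:
    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, global_field, value):
        results = []
        for wrapper in self.wrappers:
            for row in wrapper.query(global_field, value):
                results.append(dict(row, source=wrapper.name))
        return results


# Toy sources with different local schemas (invented records).
protein_db = SourceWrapper(
    "protein_db",
    [{"acc": "P1", "gene_name": "TP53"}, {"acc": "P2", "gene_name": "BRCA1"}],
    {"gene": "gene_name", "protein_id": "acc"},
)
sequence_db = SourceWrapper(
    "sequence_db",
    [{"locus": "TP53", "seq_id": "S9"}],
    {"gene": "locus", "sequence_id": "seq_id"},
)

mediator = Mediator([protein_db, sequence_db])
answers = mediator.query("gene", "TP53")
```

No local copy of the data is kept by the mediator in this sketch; a data warehouse would instead apply the same field mappings once, at import time, and store the transformed records in a single unified store.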


The diversity in the genomic and proteomic data sources has given rise to a few efforts for building common information and data models. Examples of such efforts are the Minimum Information about Microarray Experiments (MIAME; Brazma et al., 2001) for microarray data and the Minimum Information about a Proteomics Experiment (MIAPE; Taylor et al., 2007) for proteomics, or the Polymorphism Markup Language (PML; Sugawara et al., 2004) to support interoperability and seamless sharing of SNP information and genotyping identification. Nevertheless, and in spite of the success of the integration methodologies presented earlier, a new requirement has emerged for achieving true semantic integration in a formal, extensible, reusable, and dynamic way (Gardner, 2005). This led to the adoption of ontologies for knowledge representation, abstracting away from the differences in the syntactic data formats and trying to represent the underlying meaning, context, quality, and other properties of the biological entities.

An ontology describes “the objects, concepts, and other entities that are presumed to exist in some area of interest and the relationships that hold among them” and therefore ontologies have a very broad application domain and are more expressive than vocabularies, taxonomies, and thesauri. In the biological domain, a number of ontologies have been developed in recent years (Bard and Rhee, 2004), with Gene Ontology (GO) being the most popular. GO (Ashburner et al., 2000) aims to represent our knowledge about biological processes, molecular functions, and cell components. Additional examples of biological ontologies are the Microarray Gene Expression Data (MGED) ontology (Whetzel et al., 2006) for annotating microarray experiments, the Foundational Model of Anatomy (FMA) for modelling human ‘canonical’ anatomy (Rosse and Mejino, 2003) and many others referenced on the Open Biomedical Ontology web site (http://www.obofoundry.org).

In addition to the data integration requirements, knowledge integration is an additional dimension to cater for. There are mainly two sources of knowledge that are relevant to this discussion: the domain knowledge of clinical and biology experts, and the knowledge extracted from the wealth of information encapsulated in genomic and proteomic data that is subject to mining and other data analysis processes. The need for integrating these different knowledge sources in a complementary way is of great importance for a number of reasons. First, the mechanised elicitation of hidden information in data, and especially genomic data with small sample size and huge feature space, always entails the danger of overfitting and the identification of patterns that are considered systematic while being sample-specific. Second, it is always important to have a way to test the knowledge of the domain experts and possibly to enhance and enrich it with the findings of the data-mining processes.

The incorporation of expert prior knowledge aims to guide the knowledge discovery process or increase its performance and accuracy. An aspect of such exploitation of the existing knowledge is the definition of similarity metrics based on the contents of an ontology. In Azuaje and Bodenreider (2004), such semantic similarities suggested by GO were used to deduce correlation of gene-expression data. Bolshakova et al. (2006) suggest using the GO as the domain knowledge to validate clustering results and to determine the number of clusters in gene-expression analysis. Khatri and Drăghici (2005) present a number of tools for GO-based analysis of gene-expression data.
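To make the idea of an ontology-derived similarity metric more concrete, the sketch below computes a similarity between genes from the terms that annotate them. It is an illustrative assumption only: the toy ontology, the gene annotations and the shared-ancestor (Jaccard) measure are chosen for brevity and are not the measures used in the studies cited above.

```python
# Toy ontology as a directed acyclic graph: term -> set of parent terms.
# Term names and gene annotations are invented for illustration.
PARENTS = {
    "process": set(),
    "metabolism": {"process"},
    "signalling": {"process"},
    "lipid_metabolism": {"metabolism"},
    "sugar_metabolism": {"metabolism"},
}

ANNOTATIONS = {
    "geneA": {"lipid_metabolism"},
    "geneB": {"sugar_metabolism"},
    "geneC": {"signalling"},
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def term_similarity(t1, t2):
    """Jaccard overlap of the two ancestor sets (1.0 for identical terms)."""
    a1, a2 = ancestors(t1), ancestors(t2)
    return len(a1 & a2) / len(a1 | a2)


def gene_similarity(g1, g2):
    """Similarity of the best-matching pair of annotation terms."""
    return max(term_similarity(t1, t2)
               for t1 in ANNOTATIONS[g1] for t2 in ANNOTATIONS[g2])


sim_ab = gene_similarity("geneA", "geneB")  # sibling terms under "metabolism"
sim_ac = gene_similarity("geneA", "geneC")  # related only through the root term
```

In this toy setting geneA is judged closer to geneB than to geneC; such a semantic similarity could then be compared with, or combined with, an expression-based distance when clustering or validating clusters.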

An additional dimension in this knowledge integration direction is the fusion of knowledge produced by a variety of sources. Literature search is one of these additional sources where unanticipated relationships and patterns can be revealed by mining the text content of publications (Cohen and Hersh, 2005). Ontologies present an even further knowledge source in this case as well. As an example, Tiffin et al. (2005) use the eVOC anatomical ontology to integrate text-mining of biomedical literature and data mining of available human gene-expression data to link the expression phenotype and the genomic sequences.

3.1 Supporting clinico-genomic knowledge discovery: a framework for integrating data mining and knowledge discovery methodologies

The integration of multi-layer biological data gives rise to new opportunities for mining and identifying latent causal links and relationships. In this section, we briefly overview a Clinico-Genomic Knowledge Discovery (CGKD) scenario for the linkage of patients’ gene-expression (microarray) and clinical data. The whole approach constitutes a ‘screening’ scenario for the careful identification of those patient cases and genes that are most suitable to feed a gene-selection process. Such approaches can be based on the smooth integration of three distinct data-mining methodologies (Figure 2):

• Clustering: With this approach, the clusters of genes that best describe the available patient cases are selected, i.e., clusters that cover an adequate number of genes and for which an adequate number of samples show significant ‘low’ or ‘high’ gene-expression values; we call them strong-clusters.

• Association Rules Mining: This step aims at the discovery of ‘causal’ relations (rules with high confidence) between genes and patients’ attributes, and operates on the genes and patient cases covered by strong-clusters.

• Feature Selection: This step aims to select the most discriminant genes, i.e., the genes that are able to distinguish with high accuracy between patients’ pre-specified classes – disease state, survival category, etc.

Figure 2 A Clinico-Genomic Knowledge Discovery (CGKD) scenario enabled by the smooth integration of different data-mining methods

The clustering phase is important for identifying co-regulated groups, or clusters, of genes, and as a method to reduce the dimensionality and complexity of gene-expression data (Eisen et al., 1998; Xu and Wunsch, 2005). A sample with a strong gene-expression profile for a specific cluster of genes is one that exhibits ‘high’ or ‘low’ gene-expression levels in an adequate percentage of that cluster’s genes. A pre-processing step is required before clustering to convert the continuous expression values into nominal ones by discretising them according to Lopez et al. (2000). Clustering of genes can then be performed using the discretised k-means approach (Kanterakis and Potamias, 2006; Potamias et al., 2004) or similar approaches (Gupta et al., 1999; San et al., 2004). We are interested in ‘strong’ clusters because we want to identify potential subsets of samples that tend to exhibit mainly ‘high’ or ‘low’ expression levels for the respective cluster’s genes. The genes of a cluster, accompanied by the ‘strong-samples’, may be interpreted as a combined ‘clinico-genomic attribute’ linking patient cases and their genomic (gene-expression) profiles.
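The notions of discretised expression levels and ‘strong’ samples can be sketched as follows. This is a simplified illustration under stated assumptions: the fixed cut-offs stand in for the dynamic discretisation of Lopez et al. (2000), the 75% and 50% thresholds are assumed values for what counts as ‘adequate’, and the expression data are invented.

```python
# Sketch: discretise expression values and find the 'strong' samples of a cluster.
# The fixed-threshold discretisation is a simplification chosen for brevity; it is
# not the dynamic discretisation of Lopez et al. (2000). All data are invented.

def discretise(value, low_cut=-1.0, high_cut=1.0):
    """Map a continuous expression value to a nominal level."""
    if value <= low_cut:
        return "low"
    if value >= high_cut:
        return "high"
    return "medium"


def strong_samples(expression, cluster_genes, min_fraction=0.75):
    """Return {sample index: level} for samples in which at least `min_fraction`
    of the cluster's genes share the same 'high' or 'low' level."""
    n_samples = len(next(iter(expression.values())))
    result = {}
    for s in range(n_samples):
        levels = [discretise(expression[g][s]) for g in cluster_genes]
        for level in ("high", "low"):
            if levels.count(level) / len(levels) >= min_fraction:
                result[s] = level
    return result


# Rows are genes, columns are samples 0..3 (hypothetical log-ratios).
expression = {
    "g1": [2.1, -1.8, 0.2, 1.5],
    "g2": [1.7, -2.2, -0.1, 0.3],
    "g3": [1.9, -1.1, 0.4, 1.2],
    "g4": [1.2, -1.6, 1.4, -0.2],
}
cluster = ["g1", "g2", "g3", "g4"]
strong = strong_samples(expression, cluster)

# A cluster could be called 'strong' when enough of its samples are strong;
# the 50% cut-off used here is an assumed value.
is_strong_cluster = len(strong) / 4 >= 0.5
```

Here samples 0 and 1 are strong (‘high’ and ‘low’ respectively), so the cluster together with these two samples would form one combined clinico-genomic attribute.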

The Self-Organising Map (SOM; Kohonen, 2001) based clustering approach presents a promising alternative that fits our needs to identify ‘strong’ gene clusters. SOM is a clustering algorithm that is used to map a multi-dimensional data set onto a (typically) two-dimensional surface. This surface (a map) is an ordered interpretation of the probability distribution of the available genes/samples of the input data set. SOMs have been used extensively in many domains, including the exploratory data analysis of gene-expression patterns. SOMs work somewhat like k-means, but are a little richer. With k-means, one chooses the number of clusters to fit the data into. For an SOM, the shape and size of a network of clusters is chosen. Like k-means, SOM initially populates its nodes or clusters by randomly sampling the data, and then refines the nodes in a systematic fashion. Unlike k-means, however, an SOM will not force there to be exactly as many clusters as there are nodes, because it is possible for a node to end up without any associated cluster items when the map is complete. A further difference from k-means clustering is that the SOM automatically provides some information on the similarity between nodes – i.e., how strongly certain nodes resemble each other. Moreover, as visualisation has typically been a difficult matter for high-dimensional data, SOMs can be used to explore the groupings and relations within high-dimensional data by projecting the data onto a two-dimensional image that clearly indicates regions of similarity. Even if visualisation is not the goal of applying SOM to a data set, the clustering ability of the SOM is very useful.
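The behaviour described above can be illustrated with a very small SOM written in plain Python. This is a minimal sketch, not a reference implementation: the 2 × 2 grid, the linear decay of the learning rate and neighbourhood radius, the Gaussian neighbourhood function and the toy data are all assumptions made for the example.

```python
import math
import random


def best_matching_unit(nodes, x):
    """Grid position of the node whose weight vector is closest to x."""
    return min(nodes, key=lambda p: sum((w - v) ** 2 for w, v in zip(nodes[p], x)))


def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a small SOM and return {(row, col): weight vector}."""
    rng = random.Random(seed)
    # Populate every node by randomly sampling the data.
    nodes = {(r, c): list(rng.choice(data)) for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs            # decays from 1 towards 0
        lr = lr0 * frac
        radius = max(radius0 * frac, 0.5)
        for x in rng.sample(data, len(data)):  # one pass in random order
            bmu = best_matching_unit(nodes, x)
            for pos, w in nodes.items():
                grid_d2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * radius ** 2))  # neighbourhood weight
                for i in range(len(w)):
                    w[i] += lr * h * (x[i] - w[i])
    return nodes


# Two well-separated groups of invented three-sample expression profiles.
group_a = [[0.1, 0.0, 0.2], [0.0, 0.2, 0.1], [0.2, 0.1, 0.0]]
group_b = [[9.9, 10.0, 10.1], [10.1, 9.8, 10.0], [10.0, 10.2, 9.9]]
som = train_som(group_a + group_b, rows=2, cols=2)

nodes_a = {best_matching_unit(som, x) for x in group_a}
nodes_b = {best_matching_unit(som, x) for x in group_b}
empty_nodes = set(som) - nodes_a - nodes_b   # nodes left without any items
```

The two groups are mapped to different nodes, while any node listed in `empty_nodes` illustrates the point that an SOM does not force every node to hold a cluster. Because neighbouring nodes are updated together, the distance between node positions on the grid also carries information about the similarity of the clusters they represent.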

The quest now is about causal relationships between genomic and clinical profiles. To this end, Association Rule Mining (ARM) aims to induce causal relationships between the genes and the clinical attributes (Creighton and Hanash, 2003; Rodriguez et al., 2005; Agrawal et al., 1993). The identification of the most ‘interesting’ associations is usually based on the ARM results and experts’ advice. Each rule contains a combination of clinico-genomic attributes that uncovers not only significant but also causal relations between genomic and clinical patient profiles. Then, we may proceed by taking each association rule as a medium to focus on the genes and patient cases covered by it. The expert (molecular biologist or physician) may inspect the discovered association rules and focus on the ones that seem interesting for his/her research. Then, a gene-selection process may operate just on the sets of genes and patient cases covered by the selected association rules, to identify genes that distinguish between patient case classes. For example, these classes may be “survival greater than five years” vs. “survival less than five years”.
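The support and confidence measures that underlie association rules can be sketched with a brute-force search over a handful of patient cases. The items, the cases and the two thresholds below are invented for illustration, and the exhaustive pairwise enumeration is a simplification, not the Apriori-style search used by practical ARM tools.

```python
from itertools import combinations

# Each patient case is a set of nominal clinico-genomic items (invented data).
cases = [
    {"clusterA=high", "ER=positive", "survival>5y"},
    {"clusterA=high", "ER=positive", "survival>5y"},
    {"clusterA=high", "ER=negative", "survival>5y"},
    {"clusterA=low", "ER=negative", "survival<5y"},
    {"clusterA=low", "ER=positive", "survival<5y"},
    {"clusterA=high", "ER=positive", "survival<5y"},
]


def support(itemset):
    """Fraction of cases that contain every item of the itemset."""
    return sum(itemset <= case for case in cases) / len(cases)


def mine_rules(min_support=0.3, min_confidence=0.7):
    """Enumerate rules with one antecedent and one consequent (brute force)."""
    items = sorted(set().union(*cases))
    rules = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            s = support({lhs, rhs})
            if s >= min_support and s / support({lhs}) >= min_confidence:
                # (antecedent, consequent, support, confidence)
                rules.append((lhs, rhs, s, s / support({lhs})))
    return rules


rules = mine_rules()
```

One of the rules found is “clusterA=high → survival>5y”, with support 0.5 and confidence 0.75. Confidence only measures co-occurrence in the sample, which is why the rules are subsequently inspected by the expert before being used to focus the gene-selection process.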

With the subsequent filtering approach, clusters of genes that exhibit strong gene-expression profiles in an adequate number of samples are selected. The gene-selection methodology proposed in Potamias et al. (2004) is composed of four main modules:

• gene ranking

• grouping of genes

• consecutive feature elimination

• class prediction.


In the gene ranking and grouping stages, the genes are ranked with respect to their power to distinguish between the different disease states or other classes, and a greedy gene-groups elimination process is consecutively applied on the ordered list of grouped genes to select the most discriminant ones. Finally, a greedy feature-elimination (or addition) is performed, accompanied by a classification metric, to predict patient class categories (the final outcome).
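The ranking, elimination and prediction stages can be sketched in a few lines of Python. This is a hedged illustration rather than the published algorithm: the agreement-based ranking score, the majority-vote predictor and the use of accuracy on the same samples are assumptions made to keep the example short, the grouping stage is omitted, and the discretised data are invented.

```python
def orientation_and_score(levels, classes):
    """Score a discretised gene by how well 'h' vs 'l' separates two classes."""
    agree = sum((lv == "h") == (c == 1) for lv, c in zip(levels, classes))
    acc = agree / len(classes)
    # Orientation +1: 'h' votes for class 1; -1: 'h' votes for class 0.
    return (1, acc) if acc >= 0.5 else (-1, 1 - acc)


def predict(sample_levels, genes, orient):
    """Majority vote of the selected genes (ties go to class 0)."""
    votes = sum(orient[g] * (1 if sample_levels[g] == "h" else -1) for g in genes)
    return 1 if votes > 0 else 0


def accuracy(data, classes, genes, orient):
    hits = sum(predict({g: data[g][i] for g in genes}, genes, orient) == classes[i]
               for i in range(len(classes)))
    return hits / len(classes)


def select_genes(data, classes):
    scored = {g: orientation_and_score(lv, classes) for g, lv in data.items()}
    orient = {g: o for g, (o, _) in scored.items()}
    ranked = sorted(data, key=lambda g: scored[g][1], reverse=True)  # gene ranking
    selected = list(ranked)
    best = accuracy(data, classes, selected, orient)
    for g in reversed(ranked):            # try to drop the weakest genes first
        if len(selected) == 1:
            break
        trial = [x for x in selected if x != g]
        acc = accuracy(data, classes, trial, orient)
        if acc >= best:                   # keep the removal if accuracy holds
            selected, best = trial, acc
    return selected, best


# Discretised expression ('h' = high, 'l' = low) of four genes in six samples.
classes = [1, 1, 1, 0, 0, 0]
data = {
    "g1": ["h", "h", "h", "l", "l", "l"],
    "g2": ["h", "h", "l", "l", "l", "h"],
    "g3": ["h", "l", "l", "h", "l", "l"],
    "g4": ["l", "l", "l", "h", "h", "h"],
}
selected, best_accuracy = select_genes(data, classes)
```

With so few samples, accuracy measured on the training cases is optimistic; in practice the classification metric would be estimated with cross-validation or an independent test set.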

4 Discussion and conclusions

The breadth and depth of information already available in both the medical and genomic research communities present an enormous opportunity for improving our ability to study disease mechanisms, reduce mortality, improve therapies and meet the demanding needs of individualised care.

Up to now, the lack of a common infrastructure has prevented clinical research institutions from being able to mine and analyse disparate data sources. This inability to share both data and technologies developed by the medical informatics (MI) and bioinformatics (BI) research communities, and by different vendors and scientific groups, can therefore severely hamper the research process. Similarly, the lack of a unifying architecture can prove to be a major roadblock to a researcher’s ability to mine huge, distributed and heterogeneous information and data sources. Most critically, however, even within a single laboratory, researchers have difficulty integrating data from different technologies because of a lack of common standards and because of other technological, medico-legal and ethical issues.

Furthermore, biomedical data present new challenges for machine learning and knowledge discovery. Whereas in traditional applications of pattern recognition and data mining there are a large number of samples with a small feature space, in today’s bioinformatics the mass of data produced exhibits the exact reverse characteristics: small sample size with a large number of features (e.g., the estimated number of genes is above 20,000). In this environment, standard statistical and machine-learning methods are more likely to over-fit the structures in the data, and the presence of ‘noise’ puts statistical analysis and inference under fire (Ioannidis, 2005).
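The risk of over-fitting with few samples and many features can be made tangible with a small simulation. The sketch below is an illustration under assumed numbers (10 samples, 20,000 binary pure-noise features, a fixed random seed); it contains no real expression data.

```python
import random

rng = random.Random(0)
n_samples, n_features = 10, 20000
labels = [1] * 5 + [0] * 5   # two arbitrary patient classes

perfect = 0
for _ in range(n_features):
    feature = [rng.randint(0, 1) for _ in range(n_samples)]
    # A pure-noise feature 'separates' the classes if it equals the labels
    # or their complement on every one of the samples.
    if feature == labels or all(f != y for f, y in zip(feature, labels)):
        perfect += 1

# Each random feature matches with probability 2 / 2**n_samples, so roughly
# 39 'perfect' features are expected by chance alone.
expected = n_features * 2 / 2 ** n_samples
```

Even though none of the simulated features carries any information, some of them are expected to discriminate the two classes perfectly on these ten samples, which is exactly the kind of sample-specific pattern that independent validation and prior knowledge are meant to filter out.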

Resolving these issues requires discipline, multiple validations by independent studies, and integration of any available prior or other independently generated knowledge. Ontologies seem able to serve as an integration-enabling infrastructure, but we are still in the initial stage of such developments. The integration and exploitation of the data and information generated at all levels by the disciplines of bioinformatics, medical informatics, medical imaging and clinical epidemiology require a new synergetic approach that enables a bi-directional dialogue between these scientific disciplines and integration in terms of data, methods, technologies, tools and applications.

The major building elements actually exist today, even with some rough edges: biomedical ontologies, knowledge-based integration through these ontologies, and powerful computational frameworks capitalising on modern technologies such as the Grid for the fast and efficient processing of biomedical data. Nevertheless, the main challenge remains the incorporation of these methods, tools, and techniques into the clinical decision process in an optimal way (Ioannidis, 2007) and the development of methods that have been genuinely devised to support medical decision-making.


References

Agrawal, R., Imielinski, T. and Swami, A.N. (1993) ‘Mining association rules between sets of items in large databases’, Proc. of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, DC, USA, pp.207–216.

Apweiler, R., Bairoch, A., Wu, C., Barker, W., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M., Natale, D., O’Donovan, C., Redaschi, N. and Yeh, L. (2004) ‘UniProt: the universal protein knowledgebase’, Nucleic Acids Research, (32 Database), pp.115–119.

Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin, G.M. and Sherlock, G. (2000) ‘Gene ontology: tool for the unification of biology. The Gene Ontology Consortium’, Nat. Genet., Vol. 25, No. 1, pp.25–29.

Azuaje, F. and Bodenreider, O. (2004) ‘Incorporating ontology-driven similarity knowledge into functional genomics: an exploratory study’, Bioinformatics and Bioengineering, BIBE 2004. Proceedings. Fourth IEEE Symposium on, Taichung, Taiwan, pp.317–324.

Bard, J.B.L. and Rhee, S.Y. (2004) ‘Ontologies in biology: design, applications and future challenges’, Nature Reviews Genetics, Vol. 5, No. 3, pp.213–222.

Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller, L., Eddy, S.R., Griffiths-Jones, S., Howe, K.L., Marshall, M. and Sonnhammer, E.L.L. (2002) ‘The pfam protein families database’, Nucleic Acids Research, Vol. 30, No. 1, pp.276–280.

Bell, J. (2004) ‘Predicting disease using genomics’, Nature, Vol. 429, pp.453–456.

Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J. and Wheeler, D.L. (2005) ‘GenBank: update’, Nucleic Acids Research, Vol. 32, No. 90001, pp.23–26.

Boeckmann, B., Bairoch, A., Apweiler, R. et al. (2003) ‘The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003’, Nucleic Acids Res., Vol. 31, No. 1, pp.365–370.

Bolshakova, N., Azuaje, F. and Cunningham, P. (2006) ‘Incorporating biological domain knowledge into cluster validity assessment’, EvoWorkshops, pp.13–22.

Bork, P., Dandekar, T., Diaz-Lazcoz, Y., Eisenhaber, F., Huynen, M. and Yuan, Y. (1998) ‘Predicting function: from genes to genomes and back’, J. Mol. Biol., Vol. 283, No. 4, pp.707–725.

Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert, C., Aach, J., Ansorge, W., Ball, C.A., Causton, H.C., Gaasterland, T., Glenisson, P., Holstege, F.C., Kim, I.F., Markowitz, V., Matese, J.C., Parkinson, H., Robinson, A., Sarkans, U., Schulze-Kremer, S., Stewart, J., Taylor, R., Vilo, J. and Vingron, M. (2001) ‘Minimum information about a microarray experiment (MIAME)-toward standards for microarray data’, Nat. Genet., Vol. 29, No. 4, pp.365–371.

Chakravarti, A. (2001) ‘Single nucleotide polymorphisms… to a future of genetic medicine’, Nature, Vol. 409, No. 6822, pp.822–823.

Chasman, D. and Adams, R.M. (2001) ‘Predicting the functional consequences of non-synonymous single nucleotide polymorphisms: structure-based assessment of amino acid variation’, J. Mol. Biol., Vol. 307, No. 2, pp.683–706.

Cohen, A.M. and Hersh, W.R. (2005) ‘A survey of current work in biomedical text mining’, Brief Bioinform., Vol. 6, No. 1, pp.57–71.

Conde, L., Vaquerizas, J.M., Santoyo, J., Al-Shahrour, F., Ruiz-Llorente, S., Robledo, M. and Dopazo, J. (2004) ‘PupaSNP Finder: a web tool for finding SNPs with putative effect at transcriptional level’, Nucleic Acids Res., Vol. 32, web server issue, pp.W242–W248.

Cowles, C.R., Hirschhorn, J.N., Altshuler, D. and Lander, E.S. (2002) ‘Detection of regulatory variation in mouse genes’, Nat. Genet., Vol. 32, No. 3, pp.432–437.


Creighton, C. and Hanash, S. (2003) ‘Mining gene expression databases for association rules’, Bioinformatics, Vol. 19, No. 1, pp.79–86.

Eisen, M., Spellman, P.T., Botstein, D. and Brown, P.O. (1998) ‘Cluster analysis and display of genome-wide expression patterns’, Proc. Natl. Acad. Sci., USA, Vol. 96, pp.14863–14867.

Evans, W.E. and Relling, M.V. (2004) ‘Moving towards individualized medicine with pharmacogenomics’, Nature, Vol. 429, pp.464–468.

Fredman, D., Munns, D., Rios, F., Sjoholm, F., Siegfried, M., Lenhard, B., Lehvaslaiho, H. and Brookes, A.J. (2004) ‘HGVbase: a curated resource describing human DNA variation and phenotype relationships’, Nucleic Acids Res., Vol. 32, Database issue, pp.D516–D519.

Gardner, S.P. (2005) ‘Ontologies and semantic data integration’, Drug Discovery Today, Vol. 10, No. 14, pp.1001–1007.

Gupta, S.K., Rao, S. and Bhatnagar, V. (1999) ‘K-means clustering algorithm for categorical attributes’, LNCS, Vol. 1676, pp.203–208.

Hamosh, A., Scott, A.F., Amberger, J., Bocchini, C.A. and McKusick, V.A. (2000) ‘Online Mendelian Inheritance in Man (OMIM)’, Human Mutat., Vol. 15, No. 1, pp.57–61.

Hernandez, T. and Kambhampati, S. (2004) ‘Integration of biological sources: current systems and challenges ahead’, ACM SIGMOD Record, Vol. 33, No. 3, pp.51–60.

Hilario, M., Kalousis, A., Prados, J. and Binz, P.A. (2004) ‘Data mining for mass-spectra based diagnosis and biomarker discovery’, Biosilico Journal, Vol. 2, No. 5, pp.171–222.

Hoh, J. and Ott, J. (2004) ‘Genetic dissection of diseases: design and methods’, Curr. Opin. Gen. Dev., Vol. 14, pp.229–232.

Hood, L., Heath, J.R., Phelps, M.E. and Lin, B. (2004) ‘Systems biology and new technologies enable predictive and preventative medicine’, Science, Vol. 306, pp.640–643.

Hudson, T.J. (2003) ‘Wanted: regulatory SNPs’, Nat. Genet., Vol. 33, No. 4, pp.439–440.

Iafrate, A., Feuk, L., Rivera, M.N., Listewnik, M.L., Donahoe, P.K., Qi, Y., Scherer, S.W. and Lee, C. (2004) ‘Detection of large-scale variation in the human genome’, Nature Genetics, Vol. 36, pp.949–951.

Ideker, T., Thorsson, V., Ranish, J.A., Christmas, R., Buhler, J., Eng, J.K., Bumgarner, R., Goodlett, D.R., Aebersold, R. and Hood, L. (2001) ‘Integrated genomic and proteomic analyses of a systematically perturbed metabolic network’, Science, Vol. 292, pp.929–934.

Ioannidis, J. (2005) ‘Microarrays and molecular research: noise discovery?’, The Lancet, Vol. 365, No. 9458, pp.454–455.

Ioannidis, J. (2007) ‘Is molecular profiling ready for use in clinical decision making’, The Oncologist, Vol. 12, No. 3, p.301.

Kalaitzakis, M., Kritsotakis, V., Kondylakis, H., Potamias, G., Tsiknakis, M. and Kafetzopoulos, D. (2008) ‘An integrated clinico-proteomics information management and analysis platform’, 21st IEEE International Symposium on Computer-Based Medical Systems (CBMS 2008), University of Jyväskylä, Finland, pp.218–220.

Kanterakis, A. and Potamias, G. (2006) ‘Supporting clinico-genomic knowledge discovery: a multi-strategy data mining process’, Procs. of the 4th Hellenic Conference on AI (SETN 2006), LNAI, Vol. 3955, pp.520–524.

Khatri, P. and Drăghici, S. (2005) ‘Ontological analysis of gene expression data: current tools, limitations, and open problems’, Bioinformatics, Vol. 21, No. 18, pp.3587–3595.

Knight, J.C. (2004) ‘Allele-specific gene expression uncovered’, Trends Genet., Vol. 20, No. 3, pp.113–116.

Kohonen, T. (2001) Self-Organizing Maps, Springer Series in Information Sciences, Vol. 30, Springer, Berlin, Heidelberg, New York, 1995, 1997, 2001. ISBN 3-540-67921-9, ISSN 0720-678X

Kuo, W.P., Kim, E.Y., Trimarchi, J., Jenssen, T.K., Vinterbo, S.A. and Ohno-Machado, L. (2004) ‘A primer on gene expression and microarrays for machine learning researchers’, Journal of Biomedical Informatics, Vol. 37, pp.293–303.


Kwok, P-Y and Gu, Z. (1999) ‘Single nucleotide polymorphism libraries: why and how are we building them?’, Mol Med Today, Vol. 5, pp.538–543.

Lai, K. and Klapa, M. (2004) ‘Alternative pathways of galactose assimilation: could inverse metabolic engineering provide an alternative to galactosemic patients?’, Metab. Eng., Vol. 6, pp.239–244.

Lopez, L.M., Ruiz, I.F., Bueno, R.M. and Ruiz, G.T. (2000) ‘Dynamic discretisation of continuous values from time series’, in Mantaras, R.L. and Plaza, E. (Eds.): Proc. 11th European Conference on Machine Learning (ECML 2000), LNAI 1810, pp.290–291.

Louie, B., Mork, P., Martin-Sanchez, F., Halevy, A. and Tarczy-Hornoch, P. (2007) ‘Data integration and genomic medicine’, Journal of Biomedical Informatics, Vol. 40, No. 1, pp.5–16.

Maglott, D., Ostell, J., Pruitt, K.D. and Tatusova, T. (2005) ‘Entrez gene: gene-centered information at NCBI’, Nucleic Acids Research, Vol. 33, Database issue, pp.D54–D58.

Mooney, S. (2005) ‘Bioinformatics approaches and resources for single nucleotide polymorphism functional analysis’, Brief Bioinform., Vol. 6, No. 1, pp.44–56.

Nature (2004) ‘Making data dreams come true (editorial)’, Nature, Vol. 428, No. 6980, p.239.

Pastinen, T., Sladek, R., Gurd, S., Sammak, A., Ge, B., Lepage, P., Lavergne, K., Villeneuve, A., Gaudin, T., Brandstrom, H., Beck, A., Verner, A., Kingsley, J., Harmsen, E., Labuda, D., Morgan, K., Vohl, M-C., Naumova, A.K., Sinnett, D. and Hudson, T.J. (2004) ‘A survey of genetic and epigenetic variation affecting human gene expression’, Physiol. Genomics, Vol. 16, No. 2, pp.184–193.

Paulus, C., Bonnet, S., Gerfault, L., Mery, E., Strubel, G., Ricoul, F. and Grangeat, P. (2007) ‘Chromatographic alignment combined with chemometrics profile reconstruction approaches applied to LC-MS data’, 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Lyon, France, pp.5984–5987.

Pfost, D.R., Boyce-Jacino, M.T. and Grant, D.M. (2000) ‘A SNPshot: pharmacogenetics and the future of drug therapy’, Trends Biotechnol., Vol. 18, No. 8, pp.334–338.

Pomeroy, S.L. et al. (2002) ‘Prediction of central nervous system embryonal tumour outcome based on gene expression’, Nature, Vol. 415, No. 6870, pp.436–442.

Ponomarenko, J.V., Merkulova, T.I., Orlova, G.V., Fokin, O.N., Gorshkova, E.V., Frolov, A.S., Valuev, V.P. and Ponomarenko, M.P. (2003) ‘rSNP_Guide, a database system for analysis of transcription factor binding to DNA with variations: application to genome annotation’, Nucleic Acids Res., Vol. 31, No. 1, pp.118–121.

Potamias, G., Koumakis, L. and Moustakis, V. (2004) ‘Gene selection via discretized gene-expression profiles and greedy feature-elimination’, Lecture notes in Artificial Intelligence-LNAI, Vol. 3025, pp.256–266.

Ramensky, V., Bork, P. and Sunyaev, S. (2002) ‘Human non-synonymous SNPs: server and survey’, Nucleic Acids Res., Vol. 30, No. 17, pp.3894–3900.

Rodriguez, A., Carazo, J.M. and Trelles, O. (2005) ‘Mining association rules from biological databases’, Journal of the American Society for Information Science and Technology, Vol. 56, No. 5, pp.493–504.

Roses, A.D. (2002) ‘Pharmacogenetics place in modern medical science and practice’, Life Sci., Vol. 15, pp.1471–1480.

Rosse, C. and Mejino, J. (2003) ‘A reference ontology for biomedical informatics: the foundational model of anatomy’, Journal of Biomedical Informatics, Vol. 36, pp.478–500.

Rouze, P., Pavy, N. and Rombauts, S. (1999) ‘Genome annotation: which tools do we have for it?’, Curr. Opin. Plant Biol., Vol. 2, No. 2, pp.90–95.

San, O.M., Huynh, V-N. and Nakamori, Y. (2004) ‘An alternative extension of the k-means algorithm for clustering categorical data’, Int. J. Appl. Math. Comput. Sci., Vol. 14, No. 2, pp.241–247.


Saunders, C.T. and Baker, D. (2002) ‘Evaluation of structural and evolutionary contributions to deleterious mutation prediction’, J. Mol. Biol., Vol. 322, No. 4, pp.891–901.

Sebat, J., Lakshmi, B., Troge, J., Alexander, J., Young, J., Lundin, P., Månér, S., Massa, H., Walker, M., Chi, M., Navin, N., Lucito, R., Healy, J., Hicks, J., Ye, K., Reiner, A., Gilliam, T.C., Trask, B., Patterson, N., Zetterberg, A. and Wigler, M. (2004) ‘Large-scale copy number polymorphism in the human genome’, Science, Vol. 305, pp.525–528.

Shah, S.P., Huang, Y., Xu, T., Yuen, M.M., Ling, J. and Ouellette, B.F. (2005) ‘Atlas-a data warehouse for integrative bioinformatics’, BMC Bioinformatics, Vol. 6, p.34.

Snijders, A.M., Schmidt, B.L., Fridlyand, J., Dekker, N., Pinkel, D., Jordan, R.C. and Albertson, D.G. (2005) ‘Rare amplicons implicate frequent deregulation of cell fate specification pathways in oral squamous cell carcinoma’, Oncogene, 16 June, Vol. 24, No. 26, pp.4232–4242.

Stenson, P.D., Ball, E.V., Mort, M., Phillips, A.D., Shiel, J.A., Thomas, N.S.T., Abeysinghe, S., Krawczak, M. and Cooper, D.N. (2003) ‘Human Gene Mutation Database (HGMD): 2003 update’, Human Mutat., Vol. 21, No. 6, pp.577–581.

Stevens, R., Goble, C.A., Paton, N.W., Bechhofer, S., Ng, G., Baker, P. and Brass, A. (1999) ‘Complex query formulation over diverse information sources using an ontology’, in Bornberg-Bauer, E., De Beuckelaer, A., Kummer, U. and Rost, U. (Eds.): Workshop on Computation of Biochemical Pathways and Genetic Networks, European Media Lab (EML), Berlin, Germany, August, pp.83–88.

Strausberg, R.L., Simpson, A.J.G., Old, L.J. and Riggins, G.J. (2004) ‘Oncogenomics and the development of new cancer therapies’, Nature, Vol. 429, pp.469–474.

Sugawara, H., Mizushima, H., Kano, T., Shigemoto, Y., Hashimoto, Y., Tomabechi, I., Ikawa, M., Sakagami, N., Katagiri, T. and Oroguchi, T. (2004) ‘Polymorphism Markup Language (PML) for the interoperability of data on SNPs and other sequence variations’, 15th International Conference on Genome Informatics, Yokohama Pacifico, Japan, pp.16–18.

Sunyaev, S., Ramensky, V., Koch, I., Lathe III, W., Kondrashov, A.S., and Bork, P. (2001) ‘Prediction of deleterious human alleles’, Human Mol. Genet., Vol. 10, No. 6, pp.591–597.

Taylor, C.F., Paton, N.W., Lilley, K.S., Binz, P.A., Julian Jr., R.K., Jones, A.R., Zhu, W., Apweiler, R., Aebersold, R., Deutsch, E.W., Dunn, M.J., Heck, A.J., Leitner, A., Macht, M., Mann, M., Martens, L., Neubert, T.A., Patterson, S.D., Ping, P., Seymour, S.L., Souda, P., Tsugita, A., Vandekerckhove, J., Vondriska, T.M., Whitelegge, J.P., Wilkins, M.R., Xenarios, I., Yates 3rd, J.R. and Hermjakob, H. (2007) ‘The minimum information about a proteomics experiment (MIAPE)’, Nat. Biotechnol., Vol. 25, No. 8, pp.887–893.

Tiffin, N. et al. (2005) ‘Integration of text- and data-mining using ontologies successfully selects disease gene candidates’, Nucleic Acids Research, Vol. 33, No. 5, pp.1544–1552.

Whetzel, P., Parkinson, H., Causton, H., Fan, L., Fostel, J., Fragoso, G., Game, L., Heiskanen, M., Morrison, N., Rocca-Serra, P., Sansone, S., Taylor, S., White, J. and Stoeckert, C. (2006) ‘The MGED ontology; a resource for semantics-based description of microarray experiments’, Bioinformatics, Vol. 22, pp.866–873.

Wittkopp, P.J., Haerum, B.K. and Clark, A.G. (2004) ‘Evolutionary changes in cis and trans gene regulation’, Nature, Vol. 430, No. 6995, pp.85–88.

Xu, R. and Wunsch, D. (2005) ‘Survey of clustering algorithms’, IEEE Transactions on Neural Networks, Vol. 16, No. 3, pp.645–678.

Note

1 http://www.loccandia.eu/
