GENE MINING

1. DEFINITION:- Gene mining is the process of exploiting the deoxyribonucleic
acid (DNA) sequence of one genotype to isolate useful genes from related
genotypes.

HISTORICAL BACKGROUND

EVENTS                                                     YEAR
Sequencing of first plant genome (Arabidopsis thaliana)    2000
Completion of the Human Genome Project                     2003
First work on gene mining                                  2003

SOME ORGANISMS WHOSE GENOMES HAVE BEEN SEQUENCED COMPLETELY

EUKARYOTES (Mb)                      PROKARYOTES (Mb)
Arabidopsis thaliana (114.5)         Bacillus subtilis (4.20)
Saccharomyces cerevisiae (12)        Haemophilus influenzae (1.83)
Oryza sativa (466)                   Escherichia coli (4.6)
Homo sapiens                         Vibrio cholerae (4.0)
Drosophila melanogaster (120)        Mycobacterium tuberculosis (4.40)
Plasmodium falciparum (23)           Treponema pallidum (1.14)

PROGRAMMES / SOFTWARE / DATABASES USED FOR GENE MINING

BLAST

2. INTRODUCTION:-

CENTRAL DOGMA:-

WHERE IS GENE LOCATED?

INTRODUCTION TO GENE MINING:-

The principal reason for gene mining is to identify and isolate genes that are
characterised as conferring essential traits. The widespread use and availability
of molecular biological techniques have allowed for the rapid development and
identification of nucleic acid derived sequences. With the integration of
laboratory equipment and advanced computer software, researchers are able to
conduct advanced quantitative analyses, database comparisons and computational
algorithms to seek and identify gene sequences. Genetic databases for organisms
such as Escherichia coli, Haemophilus influenzae, Mycoplasma genitalium and
Mycoplasma pneumoniae, to name a few, are publicly available. These biological
databases store information that is searchable and from which biological
information may be retrieved.

This work illustrates the exploitation of publicly available sequence databases
on the Internet for the identification of useful genes. The Internet is readily
accessible to scientists worldwide. The main resources used are GenBank at the
National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/) and
the A. thaliana TC database at TIGR (http://www.tigr.org/tdb/tgi/agi/). This work
was carried out using an Internet connection, a DOS-based sequence analysis
software package, and the facilities of a basic molecular biology laboratory for
PCR verification. It is recommended to carry out this type of sequence analysis in
a Windows or Apple Macintosh environment, since these are the most readily
available platforms and software packages for sequence analysis are now available
in the public domain.

3. VARIOUS METHODS ADOPTED FOR GENE MINING:

1. DNA extraction and PCR-based gene mining
2. Data mining
3. Using a genetic algorithm
4. Peptide mass fingerprinting
5. From biomedical literature: weighing protein-protein interactions and connectivity
6. With the help of GenoWatch
7. ORIEL
8. DNA chip analysis (microarray)

1. DNA Extraction and PCR BASED Gene mining

Plant Material

The germplasm used in this study of allele mining is leaf material of the
genotypes concerned, collected from the Genetic Resources Centers.

DNA Extraction

Total genomic DNA is isolated from fresh green leaves (approx 5 g) according to

the methodology of Dellaporta et al. with minor modifications. The quality and

quantity of the extracted DNA is confirmed to be consistent both

spectrophotometrically and by running the extracted DNA on 1.0% agarose gels

stained with ethidium bromide.

PCR Analysis

PCR amplification of genomic DNA was carried out using gene-specific primers.

The PCR amplification consisted of a total of 40 cycles of melting (94°C for 1

min), annealing (55°C for 1 min 15 s), and elongation (72°C for 3 min 5 s). The

PCR-amplified products were electrophoresed in a 1.4% agarose gel in 1X
Tris-acetate-ethylenediaminetetraacetic acid (TAE) buffer. The gels were
photographed under an ultraviolet transilluminator.
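As a quick check on what this cycling programme implies for instrument time, the hold times stated above can be summed. This is only a minimal sketch: ramp times between temperatures and any initial denaturation or final extension step are not given in the text and are ignored here, and the function names are illustrative.

```python
# Cycling parameters as stated in the text (hold times in seconds).
CYCLES = 40
STEP_SECONDS = {
    "melting (94 C)": 60,       # 1 min
    "annealing (55 C)": 75,     # 1 min 15 s
    "elongation (72 C)": 185,   # 3 min 5 s
}


def total_cycling_seconds(cycles: int, steps: dict) -> int:
    """Return the hold time summed over all cycles (ramp times ignored)."""
    return cycles * sum(steps.values())


def as_hms(seconds: int) -> str:
    """Format a number of seconds as hours, minutes and seconds."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours} h {minutes} min {secs} s"


total = total_cycling_seconds(CYCLES, STEP_SECONDS)
print(total, "s =", as_hms(total))  # 12800 s = 3 h 33 min 20 s
```

Each cycle holds for 320 s, so 40 cycles take 12,800 s of hold time, a little over three and a half hours before ramping is counted.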

5' 3' AND NC PRIMERS USED FOR ALLELE MINING

Gene-specific primers amplify the DNA of each accession, and the amplified

product represents either the entire allele or some functional component of the

allele, such as the promoter or the coding sequences.

PROBLEMS

Amplification of more than one gene (lack of specificity)

Failure to amplify alleles in distantly related genera.
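The first problem, lack of specificity, can be screened for in silico before any bench work. The following is a minimal sketch, assuming exact primer matches on a single strand of a known template. The sequences are invented toy data, and real primer checks would also allow mismatches and search both strands.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(template: str, primer: str) -> list:
    """Return every 0-based start of an exact primer match (overlaps allowed)."""
    sites, start = [], template.find(primer)
    while start != -1:
        sites.append(start)
        start = template.find(primer, start + 1)
    return sites


def predicted_amplicons(template: str, forward: str, reverse: str) -> list:
    """List (start, end) spans that the primer pair could amplify.

    The reverse primer appears on the searched strand as its reverse complement.
    """
    fwd_sites = find_sites(template, forward)
    rev_sites = find_sites(template, reverse_complement(reverse))
    spans = []
    for f in fwd_sites:
        for r in rev_sites:
            if r >= f + len(forward):  # reverse site must lie downstream
                spans.append((f, r + len(reverse)))
    return spans


# Toy template in which both primer sites occur twice (hypothetical data).
template = "CCATGGCAAAGCTAACCATGGCTTTGCTAAGG"
amplicons = predicted_amplicons(template, forward="ATGGC", reverse="TTAGC")
print(amplicons)  # [(2, 15), (2, 30), (17, 30)]
```

More than one predicted span, as here, signals that the primer pair may amplify more than one locus. An empty list would correspond to the second problem, failure to amplify.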

2. DATA Mining

Definitions of DATA Mining

Data mining is about extracting information and knowledge from data, including
text.

2 Definitions:

1. Any operation related to gathering and analyzing text from external sources
for business intelligence purposes.

2. Discovery of knowledge previously unknown to the user in data or text.

Data mining is the process of compiling, organizing, and analyzing large document
collections to support the delivery of targeted types of information to analysts
and decision makers and to discover relationships between related facts that span
wide domains of inquiry.

Data Mining PROBLEMS

Data mining systems induce knowledge from datasets which are huge, noisy

(incorrect), incomplete, inconsistent, imprecise (fuzzy), and uncertain.

The problem is that existing systems use a limited attribute-value language for
representing the training examples and induced knowledge.

Furthermore, some important patterns are ignored because they are statistically

insignificant.

3. Using Genetic Algorithm

The rapid growth of data available in digital format increases the need for
methods to analyze them, so research on topics such as text classification,
information retrieval and automatic text summarization has become an important
field. Researchers in Knowledge Discovery in Databases (KDD) have provided new
tools for analyzing and accessing data in databases. Some of these tools are based
on term frequency and are used in text processing. The goal of an automatic text
summarization system is to generate a summary of the original text that allows
users to obtain the main pieces of information available in that text, but with a
much shorter reading time. In addition, an important data preprocessing task for
effective classification is attribute selection, which consists of selecting the
most relevant attributes for classification purposes.
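The section heading names a genetic algorithm, but the text does not spell one out. The following is a minimal sketch of how a genetic algorithm could drive the attribute selection task described above. The operators (tournament selection, one-point crossover, bit-flip mutation, elitism), all parameter values and the toy fitness function are assumptions made for illustration, not the method of any particular study.

```python
import random


def evolve_attribute_mask(fitness, n_attributes, pop_size=20, generations=30,
                          mutation_rate=0.05, seed=0):
    """Search for a 0/1 attribute mask that scores well under fitness(mask)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_attributes)]
                  for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        next_gen = [population[0][:]]  # elitism: the best mask always survives
        while len(next_gen) < pop_size:
            # Tournament selection: best of three random individuals, twice.
            mum = max(rng.sample(population, 3), key=fitness)
            dad = max(rng.sample(population, 3), key=fitness)
            cut = rng.randrange(1, n_attributes)  # one-point crossover
            child = mum[:cut] + dad[cut:]
            # Bit-flip mutation with a small per-attribute probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            next_gen.append(child)
        population = next_gen
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


# Toy problem: attributes 0, 3 and 5 are "informative" (hypothetical).
RELEVANT = {0, 3, 5}


def toy_fitness(mask):
    """Reward selected informative attributes, penalize the rest."""
    hits = sum(mask[i] for i in RELEVANT)
    extras = sum(mask) - hits
    return 2 * hits - extras


best_mask, history = evolve_attribute_mask(toy_fitness, n_attributes=8)
print(best_mask, history[-1])
```

In a real setting the fitness function would be the cross-validated accuracy of a classifier trained on the selected attributes. Because of elitism, the best fitness never decreases from one generation to the next.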

4. PEPTIDE MASS FINGERPRINTING

Public protein sequence databases such as SWISS-PROT are used in practice for
protein identification from Matrix-Assisted Laser Desorption Ionization-Time Of
Flight (MALDI-TOF) data, one of the popular approaches in proteomic studies.
However, because these databases lack protein information for specific plant
species, a private protein database containing sufficient protein information must
be constructed to interpret the massive peptide mass fingerprinting (PMF) results
for each specific plant species. Thus we tried to build the protein database by
translating the enormous number of coding region sequences obtained from EST
analysis, together with PMF software that works on these databases. Therefore, in
this study, we tried first to build individual systems for EST-based data
analysis, regulatory motif information from chromosomal mapping of ESTs,
microarray data and PMF information from bench work, and finally to integrate
these individual systems.
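The core of a PMF search against such a translated database can be sketched as follows: each candidate protein is digested in silico, theoretical peptide masses are computed, and observed peaks are matched within a tolerance. This is a minimal illustration, assuming trypsin as the protease, monoisotopic residue masses, neutral peptide masses (no proton added) and an arbitrary 0.2 Da tolerance; none of these choices is stated in the text.

```python
import re

# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def tryptic_peptides(protein: str) -> list:
    """Cleave after K or R, except when the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide: str) -> float:
    """Neutral monoisotopic mass of a peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def match_count(observed: list, protein: str, tolerance: float = 0.2) -> int:
    """Count observed masses that match any theoretical tryptic peptide."""
    theoretical = [peptide_mass(p) for p in tryptic_peptides(protein)]
    return sum(any(abs(m - t) <= tolerance for t in theoretical)
               for m in observed)


print(tryptic_peptides("AKRPGK"))       # ['AK', 'RPGK']
print(round(peptide_mass("GG"), 5))     # 132.05348
```

Ranking every translated EST sequence by `match_count` for a given peak list would give a crude candidate list; real PMF software uses more careful scoring.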

5. From Biomedical Literature: weighing protein-protein interactions and
connectivity

An initial set of genes and proteins is obtained from gene-disease relationships

extracted from PubMed abstracts using natural language processing. Interactions

involving the corresponding proteins are similarly extracted and integrated with

interactions from curated databases (such as BIND and DIP), assigning a

confidence measure to each interaction depending on its source. The augmented

list of genes and gene products is then ranked combining two scores: one that

reflects the strength of the relationship with the initial set of genes and incorporates

user-defined weights and another that reflects the importance of the gene in

maintaining the connectivity of the network. We can apply the method to

atherosclerosis to assess its effectiveness.

The method can be summarized as follows:

1. Obtain a list of genes or gene products known to be involved with the target

disease from the CBioC[5] database.

2. Apply heuristics to unify variants of extracted names, and use HUGO [8] to

normalize both the set obtained in the previous step and the names stored in

CBioC. This will be referred to as the initial set.

3. Apply nearest-neighbor expansion to the initial set to build a protein interaction

network using data from the CBioC database and curated databases. Analyze the

connectivity of the network. The genes and proteins in this network (derived from

the interactions) form the extended set.

4. Apply a heuristic scoring formula to the extended set to predict the proteins most

likely related to the disease.
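Steps 3 and 4 can be sketched in a few lines. The text does not give the scoring formula, so the version below is an assumption: the relationship score is the sum of interaction confidences to initial-set genes, the connectivity score is a normalized degree standing in for a proper connectivity measure, and the two are mixed by a user-defined weight. Gene names and confidence values are invented.

```python
from collections import defaultdict


def build_network(interactions):
    """interactions: iterable of (protein_a, protein_b, confidence in [0, 1])."""
    graph = defaultdict(dict)
    for a, b, conf in interactions:
        # Keep the highest confidence when a pair is reported by several sources.
        graph[a][b] = max(conf, graph[a].get(b, 0.0))
        graph[b][a] = graph[a][b]
    return graph


def extended_set(graph, initial):
    """Initial set plus every direct interaction partner (nearest neighbours)."""
    extended = set(initial)
    for gene in initial:
        extended.update(graph.get(gene, {}))
    return extended


def rank_candidates(graph, initial, weight=0.5):
    """Rank the newly added genes by a weighted sum of the two scores."""
    initial = set(initial)
    members = extended_set(graph, initial)
    max_degree = max((len(graph.get(g, {})) for g in members), default=1) or 1
    scores = {}
    for gene in members - initial:
        relation = sum(c for nb, c in graph[gene].items() if nb in initial)
        connectivity = len(graph[gene]) / max_degree
        scores[gene] = weight * relation + (1 - weight) * connectivity
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical interactions: G1 and G2 form the initial set.
network = build_network([
    ("G1", "G2", 0.9), ("G1", "P1", 0.4), ("G2", "P2", 0.8),
    ("G1", "P2", 0.6), ("P2", "P3", 0.5),
])
ranked = rank_candidates(network, {"G1", "G2"})
print([name for name, _ in ranked])  # ['P2', 'P1']
```

P3 is not ranked because it interacts only with P2, which is outside the initial set; a second round of expansion would bring it in.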

6. GenoWatch: a disease gene mining browser for association study.

A human gene association study often involves several genomic markers such as

single nucleotide polymorphisms (SNPs) or short tandem repeat polymorphisms,

and many statistically significant markers may be identified during the study.

GenoWatch can efficiently extract up-to-date information about multiple markers

and their associated genes in batch mode from many relevant biological databases

in real-time. The comprehensive gene information retrieved includes gene

ontology, function, pathway, disease, related articles in PubMed and so on.

Subsequent SNP functional impact analysis and primer design of a target gene for

re-sequencing can also be done in a few clicks. The presentation of results has been

carefully designed to be as intuitive as possible to all users. The GenoWatch is

available at the website http://genepipe.ngc.sinica.edu.tw/genowatch.

7. ORIEL

Introduction

The ORIEL Project (Online Research Information Environment for the Life
Sciences) is a European project that will develop tools and procedures to promote
access to and integration of a wide range of information resources in the life
sciences.

The tools developed through ORIEL will:

• enable effective linking of different types of biological information
(literature, factual and multimedia databases)

• make navigation easy, thereby encouraging the creative exploration of the
information landscape

• facilitate communication by making data presentation and information
visualisation user-friendly.

Project Description

Aims

The ORIEL Project (IST-2001-32688), funded by the EU and coordinated by the

European Molecular Biology Organisation (EMBO) aims to provide research

communities with tools to manage large, complex, multimedia datasets and to

navigate through an increasingly intricate and potentially confusing information

landscape.

Methodologies

ORIEL's methodologies will be tested and applied in a critical user

environment represented by the EU-funded E-BioSci platform (QLRI-CT-2001-

30266).

Developments

• Methods leading to the creation of new concepts of the scientific literature,
based on machine-understandable documents.

• Technologies permitting effective linking of a wide range of biological digital
information sources, including molecular, genomic and multi-dimensional image
databases, promoting ease of cross-database navigation, leading to creative
exploration of the information landscape.

• Protocols facilitating effective data representation and information
visualisation through the construction of adaptive interfaces that meet the needs
of individual users.

Background

The emerging fields of genomics and bio-informatics are having far-reaching

effects on all aspects of the Life Sciences. Additionally, biotechnology and

biomedicine, that will benefit enormously from new genome-based technologies,

have become important growth areas in the European life sciences industry.

Genomics research is characterized by the production of vast amounts of raw and

derived data. The integration of the exponentially growing amounts of these and

associated biological information in digital form (publications, sequence and

sequence-related information, digital image data) is presenting one of the most

demanding current challenges to information technology. There is an urgent need

to better exploit the potential of the Internet and other communication networks to

develop novel technology and intelligent middleware for the integration of large,

complex and disparate information resources.

Objectives

The ORIEL project will explore and further develop methods, technologies and

protocols aimed at the integration, dissemination and exploitation of large,

complex and disparate digital information resources. With a view to making such
technologies widely available, it will focus on the Life Sciences as a
data-intensive and highly demanding testbed that will:

- permit effective linking of different types of biological information
displaying complex inter-relationships (literature, factual and multi-media image
databases)

- promote ease of navigation leading to creative exploration of the information
landscape and facilitate user-friendly data presentation and information
visualisation.

Milestones

The development of new concepts that will enhance the efficiency of integration of

different types of biological data currently maintained in a wide spectrum of digital

collections and resources across Europe.

The development and optimization of interactive and adaptive user interfaces to

promote intelligent access to, retrieval and analysis of data stored in digital form.

8. DNA chip analysis (Microarray)

This is a recently developed technique for the analysis of gene expression and
has the following features.

• The expression of many genes can be investigated at the same time (i.e. in one

experiment)

• This requires the availability of many cloned genes

• Allows the elucidation of complex responses

• Based on two RNA samples, a control and a sample of interest (e.g. heat stressed/

mutant)

Limitations:-

• High tech.

• Expensive: requires fancy equipment and expensive reagents

• Analysis not straight forward and still under development

• Available at Purdue at the Genome Center in the basement of Whistler.

[Figures: instrument used; output of scanner]

4. APPLICATIONS OF GENE MINING

1. Allele Mining for Stress Tolerance Genes in Oryza Species and Related

Germplasm.

2. Allele mining and sequence diversity at the wheat powdery mildew resistance

locus Pm3.

3. Gene-mining the Arabidopsis thaliana genome: applications for biotechnology in

Africa

4. Isolation of Nucleic Acid Molecules Related to Integrin

5. Gene mining strategies of drug discovery

6. Isolation of a Known Gene to Validate System

7. Mining colon tumor-relevant genes

8. Mining molecular signatures for leukemia subtypes

9. Gene mining: classification of biological types

10. Metagenomics (Uncultivable Microbes & Novel Genes)

11. Genomics Industry - Gene Mining Companies

12. Gene mining in African rice germplasm to improve drought resistance in rainfed

production systems for resource-poor farmers of Africa

13. Mining the Epigenome for Methylated Genes in Lung Cancer

1. Allele Mining for Stress Tolerance Genes in Oryza Species and

Related Germplasm.

The international project to sequence the genome of Oryza sativa L. cv.
Nipponbare has made allele mining possible for all genes of rice. Scientists used
a rice calmodulin gene, a rice gene encoding a late embryogenesis-associated
protein, and a salt-inducible rice gene to optimize the polymerase chain reaction
(PCR) for allele mining of stress tolerance genes on identified accessions of rice
and related germplasm. Two sets of PCR primers were designed for each gene.
Primers based on the 5' and 3' untranslated regions of the genes were found to be
sufficiently conserved to be effective over the entire range of rice germplasm for
which the concept of allelism is applicable. However, the primers based on the
adjacent amino (N) and carboxy (C) termini amplified additional loci.

Field-based phenotyping of germplasm identifies tolerant accessions, biochemical
and physiological analysis groups them, and the existing and emerging tools of
genomics and proteomics help to identify key genes or key members of a gene family
involved in each mechanism. The technique of choice for allele mining is PCR.
Gene-specific primers amplify the DNA of each accession, and the amplified product
represents either the entire allele or some functional component of the allele,
such as the promoter or the coding sequences.

HOW TO FIND NOVEL GENE

2. Allele mining and sequence diversity at the wheat powdery mildew

resistance locus Pm3.

The production of wheat is threatened by a constantly changing population of

pathogen races. Considering the capability of many pathogens to overcome genetic

resistance, the identification and implementation of new sources of resistance is

essential. Landraces and wild relatives of wheat have played an important role as

genetic resources for the improvement of disease resistance. Here, we discuss the

allele mining approach to characterize and utilize the naturally occurring resistance

diversity in wheat. This study is a large scale systematic allele mining, including

1320 hexaploid wheat landraces selected on the basis of ecogeographical

parameters favouring growth of powdery mildew. The landraces were infected

with a set of differential powdery mildew isolates, which allowed the selection of

resistant lines. The molecular tools derived from Pm3 haplotype studies were

applied to study the genetic diversity at this locus. From the known Pm3 R alleles,

Pm3b was the only one frequently identified. In the same set, we also found a high

frequency of landraces carrying a susceptible haplotype. This analysis allowed the

identification of candidate resistant lines that were further tested for the presence

of new potentially functional alleles. Based on transient expression assays as well

as Virus Induced Gene Silencing (VIGS), we conclude that we have identified at

least two new functional Pm3 alleles. The new interesting and functional alleles

can be transferred to susceptible but economically important wheat varieties as

single genes or R-gene cassettes to achieve efficient control of mildew. This study

contributes to targeted use of genetic diversity resources for research and breeding.

3. Gene-mining the Arabidopsis thaliana genome: applications for

biotechnology in Africa

Plant science research has reached the post-genome era with the completion

of the genome sequences of both a dicotyledonous (Arabidopsis thaliana) (The

Arabidopsis Genome Initiative 2000) and a monocotyledonous (rice: Oryza sativa)

species (Yu et al. 2002). These genome sequences obtained through publicly

funded research have been made available through the Internet with new sequence

information appearing each day. In addition, large collections of cDNA libraries

derived from different plant tissues or growth conditions have been subjected to

single-pass sequencing, often from the 3' ends, to derive expressed sequence tag

(EST) databases (Bennetzen 1999, Quackenbush et al. 2000). All these data

present an opportunity for researchers to enhance studies on non-model crop plants

by identifying homologues in the more tractable model species. This can lead to

design of experiments, such as the study of mutants in the model plant, which can

provide rapid answers to gene function in the crop plant. The plant cell wall, often

containing a matrix of pectic components, is the first line of defence against fungal

pathogens (Esquerre-Tugaye et al. 2000). In addition, pectic fragments broken

down from plant cell walls are elicitors of the plant defense response (Boudart et

al. 1998). Several lines of evidence indicate that the polygalacturonase inhibiting

protein (PGIP), which is associated with the cell walls of many plants has a role to

play in plant resistance to fungal pathogens (De Lorenzo and Cervone 1997). This

work was initiated to identify a homologue of the gene for PGIP in A. thaliana

since it has relevance as a model system for protein-protein interactions, as well as

practical application in engineering fungal resistance in crop plants (Powell et al.

2000). PGIPs have been identified in a variety of plant species such as bean, pear

and apple (Toubart et al. 1992, Stotz et al. 1993, Arendse et al. 1999) . PGIPs are

characterised by their ability to bind to fungal polygalacturonases (PGs) and this

has led to the hypothesis that PGIP plays a role in the plant defence response by

modulating the activity of endo- PGs produced by invading fungi (Cervone et al.

1989). In addition, PGIPs are interesting for protein-protein interaction studies,

since they are made up of leucine rich repeats (LRRs) (De Lorenzo and Cervone

1997). The main model system for studying this process has been the interaction

between the Fusarium moniliforme PG and the bean PGIP (Desiderio et al. 1997).

These studies have given in vitro evidence for this protein-protein interaction and

enabled identification of specific PGIP amino acids in this interaction (Leckie et al.

1999). However, this is a heterologous system employing use of a tobacco

expression system for production of variants of the bean PGIP. Testing of the

hypothesis in vivo has been hampered by lack of a model plant system, which can

be readily transformed and manipulated. This provided the rationale for searching

for a pgip homologue in the model dicotyledon A. thaliana.

4. Isolation of Nucleic Acid Molecules Related to Integrin

The integrin family of cell adhesion receptors plays a fundamental role in

the processes involved in cell division, differentiation and movement. The specific

function identified was that the target be an integral membrane protein involved in

cytoskeletal formation. The localization selected was that the protein be expressed

in the midgut of an organism. These structural-functional parameters were then

used to target potential genes based on the function identified from the PubMed

database on all organisms. The primer design software was the MacVector

software, and following an initial round of sequence determination, the primer

design was improved.

5. Gene mining strategies of drug discovery

The strategies for identifying the limited number of genes that will be

relevant to any given disease (i.e., "gene mining") have been evolving at a rapid

pace. Until recently, technological developments restricted the genomics

specialist's sense of accomplishment to the mere compiling of "possibly relevant"

genes. The most common approach was to define the "mutant" (or "diseased") and

"wild-type" (or "normal") sets of genes in terms of the ensemble of mRNAs

produced by cells under a given circumstance (e.g., drug treatment). Because of a

certain focus on technology, a tidal wave of descriptive information (e.g., a long

list of mRNAs whose levels differed by twofold or more between experimentally

fixed conditions) threatened to obscure the identification of truly relevant genes. A

more systematic approach to pharmacogenomics, and in particular to the

pharmacogenomics of cystic fibrosis (CF), can now benefit from hypothesis-driven bioinformatic

tools to identify disease- and drug-specific patterns of gene expression. In an

iterative scheme, a hypothesis is developed and used to design investigations of

cells or tissues. Microarrays are used to analyze the samples, and the resulting data

are installed into a database. Once a CF database is generated, specific algorithms

are used as bioinformatic tools, extracting meaning out of the data. Some of these

tools, such as the hierarchical clustering algorithm (see below), are available within

the public domain of the Internet. We have developed two additional tools, called

GRASP (for Gene Ratio Analysis Paradigm) and GENESAVER (Gene Space

Vector). Both are hypothesis-driven techniques, which we shall describe in detail

as they have been applied to CF. The analyzed data must then be integrated into

the larger scope of bioinformation available through the Internet. In this way, a

new, refined hypothesis can be developed for the next cycle of investigation. In

our experience, several tactical cycles through this strategic approach have been

necessary to develop insight into a given problem.
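The purely descriptive step mentioned above, listing mRNAs whose levels differ by twofold or more between two conditions, amounts to a simple filter on log ratios. The following is a minimal sketch with invented gene names and intensities. It is not GRASP or GENESAVER, whose details are not given here.

```python
import math


def log2_ratios(control: dict, treated: dict) -> dict:
    """Per-gene log2(treated / control) for genes with positive values in both."""
    return {gene: math.log2(treated[gene] / control[gene])
            for gene in control
            if gene in treated and control[gene] > 0 and treated[gene] > 0}


def changed_genes(control: dict, treated: dict, fold: float = 2.0) -> list:
    """Genes whose level rises or falls by at least the given fold change."""
    threshold = math.log2(fold)
    ratios = log2_ratios(control, treated)
    return sorted(g for g, r in ratios.items() if abs(r) >= threshold)


# Hypothetical signal intensities for a control and a treated sample.
control = {"geneA": 100, "geneB": 250, "geneC": 80, "geneD": 40}
treated = {"geneA": 210, "geneB": 240, "geneC": 20, "geneD": 79}
print(changed_genes(control, treated))  # ['geneA', 'geneC']
```

Working on the log2 scale treats a twofold increase and a twofold decrease symmetrically (+1 and -1). As the passage notes, such a list is only a starting point for hypothesis-driven analysis.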

6. Isolation of a Known Gene to Validate System

In order to validate the system, it was used to isolate a known gene; in this case

the Manduca sexta aminopeptidase gene. Aminopeptidase is involved in the

modulation of various cellular responses, especially in cell-cell adhesion and signal

transduction. Aminopeptidase is directly involved in resistance by insects to

insecticidal toxins of Bacillus thuringiensis. The M. sexta aminopeptidase gene

was mined based on nucleotide and amino acid sequence alignment with the

existing aminopeptidase-related sequences.
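Mining by sequence alignment rests on scoring how well a candidate sequence lines up with known ones. The following is a minimal sketch of a global (Needleman-Wunsch) alignment with a percent-identity readout. The scoring values are arbitrary assumptions and the sequences are toy strings. In practice a tool such as BLAST, with a substitution matrix for amino acids, would be used.

```python
def global_align(a: str, b: str, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment; returns the two gapped strings."""
    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic-programming score matrix.
    score = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        score[i][0] = i * gap
    for j in range(1, len(b) + 1):
        score[0][j] = j * gap
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b, i, j = [], [], len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Identical, non-gap columns as a percentage of alignment length."""
    same = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * same / len(aligned_a)


aligned = global_align("ACGT", "AGT")
print(aligned, percent_identity(*aligned))  # ('ACGT', 'A-GT') 75.0
```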

7. Mining molecular signatures for leukemia subtypes

Here, the target phenotypes are two distinct leukemia subtypes, AML and ALL.

Thus, an ensemble decision analysis is conducted to identify the significant

molecular signatures (subtype-relevant genes) that underpin the complex molecular

mechanisms for distinction between the two subtypes. These data contain

measurements corresponding to ALL and AML samples from bone marrow and

peripheral blood.

Leukemia: Acute Lymphoblastic (ALL) vs. Acute Myeloid (AML)

[Figure panels: AML, ALL]

Visually similar, but genetically very different

8. Gene mining: classification of biological types

Working on the same data allows us to show the differences between the two

targets. As a result of the ensemble gene subset selection, three best subsets

determined by their classification performance on the holdout samples using

equation 3 are obtained, all with a χ² value of 9.118 (P = 0.003). Best subset 1

(Best tree 1) contains four genes: M26383 (human monocyte-derived neutrophil-

activating protein mRNA, MONAP), T51849 (tyrosine-protein kinase receptor

ELK precursor, R.norvegicus), Z24727 (Homo sapiens tropomyosin isoform
mRNA) and H55758 (H.sapiens α-enolase). Best subset 2 also contains four genes:
M26383, T94993 (H.sapiens fibroblast growth factor receptor 2 precursor),

T58861 (60S ribosomal protein L30E, Kluyveromyces lactis) and R39465

(eukaryotic initiation factor 4A, Oryctolagus cuniculus). Best subset 3 contains

five genes: M63391 (H.sapiens desmin gene), D14812 (H.sapiens KIAA0026

mRNA, complete cds), H44011 (myosin heavy chain, non-muscle type A,

H.sapiens), T58861 and H55933 (H.sapiens mRNA homolog of yeast ribosomal

protein L41). To unravel the relationships between the two targets, classify the

biological types and mine disease-relevant genes, we construct a classification rule

using all 20 colon tumor-relevant genes. To allow for (lessen) selection bias due to

either the same approach being used for feature gene selection and prediction or

the induced rule being tested on tissue samples that had been used in the first

instance to select the feature genes, we perform a de novo validation procedure

called external cross-validation, with newly permutated data sets and with separate

classifiers from that used for feature gene subset selection. The classifiers

considered are a SVM with five different kernel functions, FLD, LNR, KNN and

MD, reflecting the diversity of discriminant methods putatively useful for

microarray data analysis.

Although there are some variations between the different classifiers, on average all

four subsets (three best trees and the feature set with the 20 relevant genes)

identified by our ensemble approach perform comparably with or better than the

feature set with all 2000 genes. Best tree 3, although it does not include the top

gene (M26383), achieves the highest performance across the multiple external

classifiers and even performs better than the feature set of the top 20 colon tumor-

relevant genes, with the highest performance (92.1%) attained using a SVM with a

polynomial 2-D kernel, which is the highest attainable so far. The second best

feature set is the top 20 relevant genes, reflecting the fact that the relevant genes

are extracted from trees, which are in turn built with a target of high classification

performance given a data structure. Nevertheless, this feature set is neither

necessarily the most economical (minimal) nor the most efficient set for

classification or prediction because there are ‘redundant’ features among the top 20

genes (e.g. the two replicates of R39465). Indeed, mining these ‘redundant’ genes
is one of the major goals for ensemble decision analysis of microarrays.

9. Mining colon tumor-relevant genes

A 5-fold cross-validation (CV) resampling approach is used to construct the
training and test sets. First, colon tumor and normal samples are randomly divided
into five non-overlapping subsets of roughly equal size, i.e. tumor subsets Di
(i = 1, 2, ..., 5) and normal subsets Ni (i = 1, 2, ..., 5). The resampling is
repeated 20 times to obtain 500 pairs of training and test sets. The proposed gene
extraction approach is then applied to each pair. In order to obtain a statistical
measure of significance for each gene, a null distribution FV0 is constructed, as
described previously. An empirical threshold of 0.035 at the significance level of
0.01 is chosen, denoted as FV0β = 0.035 (β = 0.01). The colon tumor-relevant genes
of high significance (P < 0.01) are extracted by analyzing the 500 pairs.
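The resampling scheme can be sketched as below. An ordinary 5-fold split repeated 20 times would give only 100 pairs. One reading that reproduces the stated 500 is that every tumor subset Di is combined with every normal subset Nj to form a test set (5 × 5 = 25 pairs per resampling). That pairing is an assumption, as are the sample identifiers and sample counts used here.

```python
import random


def split_into_folds(items: list, k: int, rng: random.Random) -> list:
    """Shuffle a copy of the items and deal them into k near-equal subsets."""
    shuffled = items[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def resampled_pairs(tumor: list, normal: list, k=5, repeats=20, seed=0) -> list:
    """Return (training set, test set) pairs over repeated k-fold splits."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(repeats):
        d_folds = split_into_folds(tumor, k, rng)    # tumor subsets D1..Dk
        n_folds = split_into_folds(normal, k, rng)   # normal subsets N1..Nk
        for d in d_folds:
            for n in n_folds:
                test = set(d) | set(n)               # assumed pairing Di + Nj
                train = [s for s in tumor + normal if s not in test]
                pairs.append((train, sorted(test)))
    return pairs


# Placeholder sample identifiers; the counts are illustrative only.
tumor_samples = [f"T{i}" for i in range(40)]
normal_samples = [f"N{i}" for i in range(22)]
pairs = resampled_pairs(tumor_samples, normal_samples)
print(len(pairs))  # 500
```

Every training set excludes all samples in its test set, so a gene extracted on one pair is always evaluated on held-out tissue.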

10. Metagenomics (Uncultivable Microbes & Novel Genes)

Modern biotechnology has a steadily increasing demand for novel genes for
application in various industrial processes and the development of genetically
modified organisms. The need to identify, isolate and clone novel genes at a
reasonable pace is the main driving force behind the development of unprecedented
experimental approaches. Metagenomics is one such approach for discovering novel
genes. The metagenomes of complex microbial communities (both cultivable and
uncultivable) are a rich source of novel genes for biotechnological purposes.

11. Genomics Industry & Gene Mining Companies

Now that sequence data are available in the public

domain, companies have been created to "mine" the data, that is, to analyze the

genomic sequences to identify genes, their function, and their relationships to

health and disease processes. Companies pioneering in this area included Sequana

and Millennium Pharmaceuticals. Worldwide efforts are ongoing in optimizing

medical treatment by searching for the right medicine at the right dose for the

individual. Drug metabolism is influenced by polymorphisms, which may be tested by

relatively simple SNP analysis, although this requires DNA from the individuals tested.

Target genes for the efficacy of a given medicine or predisposition to a given

disease are also subject to population studies, e.g., in Iceland, Estonia and Sweden.

For hypothesis testing and generation, several bio-banks with samples from

patients and healthy persons within the pharmaceutical industry have been

established during the past 10 years. Thus, more than 100,000 samples are stored

in the freezers of either the pharmaceutical companies or their contractual partners

at universities and test institutions.

Ethical issues related to data protection of the individuals providing samples to

bio-banks are several: nature and extent of information prior to consent, coverage

of the consent given by the study person, labeling and storage of the sample and

data (coded or anonymized). In general, genetic test data, once obtained, are

permanent and cannot be changed. The test data may imply information that is not

beneficial to the patient and his/her family (e.g., employment opportunities,

insurance, etc.). Furthermore, there may be a long latency between the analysis of

the genetic test and the clinical expression of the disease and wide differences in

the disease patterns. Consequently, information about some genetic test data may

stigmatize patients leading to poor quality of life. This has raised the issue of

‘genetic exceptionalism’ justifying specific regulation of use of genetic

information.

Discussions on how to handle sampling and data are ongoing within the industry

and the regulatory sphere, the European Agency for the Evaluation of Medicinal

Products (EMEA) having issued a position paper, the Council for International

Organizations of Medical Sciences (CIOMS) having a working group on this issue,

and the European Society of Human Genetics preparing a background paper on

‘Polymorphic sequence variants in medicine: Technical, social, legal and ethical

issues. Pharmacogenetics as an example’. Within the European project Privacy in

Research Ethics and Law (PRIVIREAL), recommendations for common European

guidelines for membership in research ethical committees have been discussed,

balancing the interests and assuring independence and legal competence. Good

decision making, assurance of the legality of protocols and assessment of data protection

are suggested to be part of any evaluation of protocols.

12. Gene mining in African rice germplasm to improve drought resistance in rainfed production systems for resource-poor farmers of Africa

Rice has been cultivated in western and central Africa for centuries and is now a

staple food in the region. But drought is a major problem as it severely depresses

yield in upland and rainfed lowlands, where most of the producers are resource-

poor farmers. Drought resistance in plants, however, is a complex trait, controlled

by the interaction of many genes, as it involves several physiological, phenological

and morphological mechanisms. Consequently, conventional breeding for drought

resistance in Africa has had only limited success. DNA markers and genetic

mapping are expected to provide impetus not only in gaining a better

understanding of the traits associated with drought but also in contributing to

enhanced selection efficiency.

The project seeks to 1) characterize drought in different environments and

identify the most important traits associated with drought resistance, 2) select

and characterize sources of drought resistance for genetic mapping and

quantitative trait locus (QTL) analysis, and 3) develop advanced lines combining

drought resistance with high yield and agronomic and quality characters

acceptable to farmers and consumers. To achieve these objectives the project will

exploit a core germplasm pool (Oryza glaberrima and O. sativa) of 1) drought-

resistant O. glaberrima accessions, collected and screened in Mali by the Institut

d’Economie Rurale (IER), 2) drought-tolerant interspecific breeding lines

developed by the Africa Rice Centre (WARDA) from crosses between O.

glaberrima and O. sativa, and 3) a range of traditional O. glaberrima and O. sativa

accessions from WARDA’s gene bank. Confirmed sources of resistance among this

core germplasm will be crossed with elite but drought-susceptible O. sativa lines

to develop interspecific and intraspecific populations segregating for drought

resistance. These populations will be phenotyped in replicated field trials in

different environments in Mali and Nigeria. QTL analysis will be performed to

identify, across environments, alleles that improve drought resistance for use in future breeding. In

other populations, selection will be conducted to generate agronomically superior

drought-resistant lines.
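As a rough illustration of what the QTL step involves, the sketch below performs single-marker analysis, one of the simplest forms of QTL testing: lines are grouped by the allele they carry at one marker and their mean trait values are compared. This is an assumption made for illustration; the project description does not say which QTL method will be used. The function name, the allele labels "A" and "B" and the phenotype values are all hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def single_marker_effect(genotypes, phenotypes):
    """Compare trait means between the two allele classes at one marker.

    genotypes: list of "A" or "B" (allele carried by each line; hypothetical labels)
    phenotypes: list of trait values, in the same order
    Returns (mean difference A - B, Welch t statistic).
    """
    a = [p for g, p in zip(genotypes, phenotypes) if g == "A"]
    b = [p for g, p in zip(genotypes, phenotypes) if g == "B"]
    effect = mean(a) - mean(b)
    std_err = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return effect, effect / std_err


# Toy data, invented for illustration: yield under drought for six lines.
genotypes = ["A", "A", "A", "B", "B", "B"]
phenotypes = [5.0, 6.0, 7.0, 2.0, 3.0, 4.0]
effect, t_stat = single_marker_effect(genotypes, phenotypes)
print(effect, round(t_stat, 3))
```

Repeating the test at every marker and in every environment points to chromosome regions where one parental allele is consistently associated with better performance under drought.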

13. Mining the Epigenome for Methylated Genes in Lung Cancer

Lung cancer has become a global public health burden, further substantiating the

need for early diagnosis and more effective targeted therapies. The key to

accomplishing both these goals is a better understanding of the genes and

pathways disrupted during the initiation and progression of this disease. Gene

promoter hypermethylation is an epigenetic modification of DNA at promoter

CpG islands that together with changes in histone structure culminates in loss of

transcription. The fact that gene promoter hypermethylation is a major

mechanism for silencing genes in lung cancer has stimulated the development of

screening approaches to identify additional genes and pathways that are

disrupted within the epigenome. Some of these approaches include restriction

landmark scanning, methylation CpG island amplification coupled with

representational difference analysis, and transcriptome-wide screening. Genes

identified by these approaches, their function, and prevalence in lung cancer are

described. Recently, we used global screening approaches to interrogate 43 genes

in and around the candidate lung cancer susceptibility locus, 6q23–25. Five genes,

TCF21, SYNE1, AKAP12, IL20RA, and ACAT2, were methylated at 14 to 81%

prevalence, but methylation was not associated with age at diagnosis or stage of

lung cancer. These candidate tumor suppressor genes likely play key roles in

contributing to sporadic lung cancer. The realization that methylation is a

dominant mechanism in lung cancer etiology and its reversibility by

pharmacologic agents has led to the initiation of translational studies to develop

biomarkers in sputum for early detection and the testing of demethylating agents and

histone deacetylase inhibitors for treatment of lung cancer. Lung cancer has

become a global public health burden, with 1.5 million deaths expected by 2010.

The high mortality from this disease stems from the lack of an effective screening

approach for early diagnosis and the refractoriness of advanced cancers to

conventional therapies, substantiating the need to develop more effective

targeted therapies and chemoprevention. Although smoking cessation does

reduce risk for lung cancer, approximately half of lung cancers diagnosed are in

former smokers. Adenocarcinoma is the major histologic type of cancer diagnosed

in smokers in the United States and now Europe. Incidence rates of 40% and up

to 80% have been reported for this histologic type of cancer in smokers and never

smokers, respectively, diagnosed with lung cancer. Non–small cell lung cancer

(NSCLC, comprising mainly adenocarcinoma, squamous cell carcinoma, and large cell carcinoma) is

diagnosed in approximately 80% of patients, while the remaining 20% of tumors

appear to be small cell lung cancer (SCLC). The detection of numerous cytogenetic

changes provided the first link to the molecular pathogenesis of lung cancer.

Mapping of chromosomal sites for rearrangement, breakpoints, and losses

revealed both common and distinct changes in SCLC and NSCLC. The recurrence

of allelic loss at specific regions in the genome suggested the presence of tumor

suppressor genes (TSGs) within these loci. The retinoblastoma gene was the first

TSG linked to lung cancer. Loss of function of this gene through either deletion or

point mutation occurs in 90% of SCLCs, while fewer than 15% of NSCLCs harbor

changes in this TSG. The second major TSG inactivated in lung cancer is p53.

Although p53 inactivation is common across many malignancies, the mutation

spectrum within this gene tracks with specific tumor types.

In lung cancer, the most common mutation seen is the G:C to T:A transversion, an

alteration potentially stemming from the inability to repair DNA damage caused

by polyaromatic hydrocarbons such as benzo[a]pyrene, which is present in

tobacco. Consistent with this hypothesis, the prevalence of transversion

mutations increased in tumors with increasing cumulative exposure to cigarette

smoke. Mutations in p53 are found in 70% of SCLC, 65% of squamous cell cancer,

and 33% of adenocarcinoma. In lung cancer, the search for TSGs inactivated

through the two-hit mechanism of loss of one allele and mutation of the

remaining allele has not identified any genes whose prevalence of inactivation

approaches that seen for the retinoblastoma and p53 genes. The exception to this

is the LKB1 gene, which is mutated exclusively in adenocarcinomas, in approximately one-third of

cases. The most commonly mutated oncogene in lung cancer is K-ras

with approximately 30 to 40% of adenocarcinomas harboring an activating

mutation, while mutations in squamous cell carcinoma and SCLC are rarely observed.

Mutations are localized to codons 12, 13, and 61, with the majority (> 85%)

occurring within codon 12. Nearly 70% of the mutations seen are G to T

transversions within codon 12 that change a glycine codon (GGT) to valine (GTT)

or cysteine (TGT), which may reflect DNA adducts formed by metabolism of

polyaromatic hydrocarbons in tobacco. Recently, a whole-genome approach was

taken to address how many mutations are seen in cancer.
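The codon 12 changes described above can be checked with a small worked example. The sketch builds the standard genetic code table and enumerates every single G to T substitution in the glycine codon GGT. The function name is hypothetical and the code only illustrates the codon arithmetic; it makes no claim about how the cited studies analysed their data.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ..., GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def g_to_t_transversions(codon):
    """Map each single G->T substitution of `codon` to the amino acid it encodes."""
    results = {}
    for pos, base in enumerate(codon):
        if base == "G":
            mutant = codon[:pos] + "T" + codon[pos + 1:]
            results[mutant] = CODON_TABLE[mutant]
    return results


print(CODON_TABLE["GGT"])           # G (glycine)
print(g_to_t_transversions("GGT"))  # {'TGT': 'C', 'GTT': 'V'}
```

Only two G to T substitutions are possible in GGT, and they give exactly the cysteine (TGT) and valine (GTT) codons named in the text.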

These studies were focused on breast and colon cancer, but most likely reflect the

paradigm seen in lung cancer studies that have evaluated candidate genes

discovered through various screening modalities. In this whole genome

sequencing study, approximately 80 gene mutations were identified that alter

amino acids. What was surprising was that the prevalence of the majority of these

mutations in primary tumors was less than 5%. The authors concluded that these

minor mutations would each be associated with a "small fitness advantage" that

would drive tumor progression, and thus, it is not the most common genetic

changes but these rare changes that dominate the cancer genome landscape.

While this is an interesting hypothesis, the emergence of epigenetic modifications

of critical regulatory genes indicates that the epigenome may play an equal, if not

greater role in driving cancer initiation and progression than genetic mutations.

The most common epigenetic change in cancer is methylation of DNA at the fifth

position of the cytosine ring. Cytosine located 5' to guanine (CpG) is the prime

target of methylation in the mammalian genome, and this dinucleotide occurs

at a much higher frequency than expected from a random genome-wide

distribution in regions called CpG islands. About 50% of human promoters contain

CpG islands that often extend into exon 1 of many critical regulatory genes. When

DNA hypermethylation occurs within a CpG island located in the promoter region

of a gene, it is also accompanied by histone modifications (such as acetylation,

methylation, or phosphorylation of histone tails) within the island. Together, these

two epigenetic changes create a closed chromatin configuration around the

promoter region denying access to RNA polymerase and regulatory proteins

needed for transcription.
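The notion of a CpG island as a region where CpG occurs more often than expected can be made concrete with a short calculation. The sketch below computes the GC fraction and the observed/expected CpG ratio of a sequence. The cut-offs used (length of at least 200 bp, GC content above 50%, observed/expected ratio above 0.6) are the commonly quoted Gardiner-Garden and Frommer criteria; they are an assumption brought in for illustration and are not given in the text above. Function names are hypothetical.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")  # CpG dinucleotides read 5' to 3' on this strand
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count if C and G were placed independently is c * g / n.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the assumed length, GC content and observed/expected cut-offs."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction > min_gc and obs_exp > min_obs_exp


print(cpg_stats("CGCGCGCG"))            # (1.0, 2.0)
print(looks_like_cpg_island("CG" * 100))  # True
print(looks_like_cpg_island("AT" * 100))  # False
```

In practice such a test is applied in a sliding window along a promoter region rather than to a whole sequence at once.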

The end result of this process is loss of gene transcription and hence "silencing of

gene function." With the development of the methylation-specific PCR assay that

can screen for gene methylation in specific promoters, there has been

tremendous growth over the past decade in the identification of genes that are

silenced in lung cancer through promoter hypermethylation. Transcriptional

silencing by CpG island hypermethylation now rivals genetic changes that affect

coding sequence as a critical trigger for neoplastic development and progression.

Genes responsible for all types of normal cellular function are targeted for

inactivation by methylation at prevalences of 15 to 80% in lung tumors. These

include genes involved in cell cycle regulation (e.g., p16), apoptosis (e.g., death

associated protein kinase), DNA repair (e.g., O6-methylguanine-DNA

methyltransferase), cell adhesion (e.g., H-cadherin), signal transduction (e.g., ras

effector homolog 1 [RASSF1A]), and cell differentiation (e.g., RAR-β). Importantly,

many of these genes appear to be inactivated at the earliest histologic stage of

lung cancer and in cytologically normal-appearing bronchial epithelial cells from

smokers. Understanding which pathways are inactivated in the tumor cell and

bronchial epithelium of smokers will be essential for developing targeted therapy

for lung cancer and cancer prevention. The realization of gene promoter

methylation as a major alteration in the cancer cell has stimulated the

development of screening approaches to identify additional genes and pathways

that are disrupted within the epigenome. The sections below describe some of

the high-throughput genome screening approaches used to identify methylated

genes in cancer and a recent study by our group evaluating promoter methylation

of genes in and around the candidate lung cancer susceptibility locus 6q23–25

using a combination of screening approaches.

5. FUTURE SCENARIO:-

Advancement in gene mining companies

Metagenomics for mining new genetic resources of microbial communities

Gene mining with the hierarchical clustering algorithm

"gene-mining" strategies of drug discovery

Mining the mouse genome

Global gene mining and the pharmaceutical industry

Synthetic life and gene mining:

Synthetic life and gene mining:

Our DNA may contain errors that make us more likely to develop a disease such as

breast cancer, diabetes or a mental illness. Medical science has become fairly good

at finding them, but until now it has only been scratching the surface. The field of

genetics in medicine is about to expand dramatically, not just in cancer but in

many different fields, though particularly in cancer. It gives us the

opportunity to target our very expensive, and often toxic, drugs

at the people who need them most and

are most likely to respond. Genes are being sought that are associated not

just with cancer but with a whole range of diseases, from obesity to heart disease. Less

than a decade ago it cost billions of dollars for the first human genome to be

sequenced. Today it might cost several hundred thousand dollars, and within a

few years, that could be as little as $1,000. Once that happens, we'll all have

access to an amazing amount of information about our future health. Within the

last two years over 100 new genes have been identified that are associated with

the risk of developing different types of disease: not just cancer, but other diseases

that we often regard as lifestyle diseases, such as diabetes,

obesity, high blood pressure and neurological diseases.

These new genes offer tremendous opportunities for prevention and

for better therapy. There are also big ethical questions still to be answered

about tinkering with these basic building blocks of life.

Mining the mouse genome

The published mouse genome sequence has already made a huge impact on the

research community. Although only a draft, it is clear that the sequence is a very

high-quality product, with excellent coverage and reliability over large genomic

expanses. It is a huge asset to researchers, and its significance matches that of the

human genome. In the past six months, for example, the Ensembl genome

browser of the Sanger/European Bioinformatics Institute dealt with 2.6 million

requests for detailed information about the mouse genome, and 3.2 million

queries about the human sequence.

But there is one important difference between these two resources — the mouse

genome encodes an experimentally tractable organism. This means that it is now

truly possible to determine the function of each and every component gene by

experimental manipulation and evaluation, in the context of the whole organism.

6. CONCLUSION :-

Work on gene mining is expanding rapidly worldwide. Indian scientists are also working

hard, but some bottlenecks keep them from matching the pace. Gene mining is

a boon not only for plant biotechnology but equally for the animal sciences. Gene

mining provides molecular biologists with a powerful and usable tool for

extracting disease-relevant genes, a major theme in the post-genomic era. The

technique still leaves open questions about target-driven gene function.

BLAST TYPES:-

BLASTp - Compares an amino acid query sequence against a protein sequence

database.

BLASTn - Compares a nucleotide query sequence against a nucleotide sequence

database.

BLASTx - Compares the six-frame conceptual translation products of a nucleotide

query sequence against a protein sequence database.

tBLASTn - Compares a protein query sequence against a nucleotide sequence

database dynamically translated in all six reading frames.

tBLASTx - Compares the six-frame translations of a nucleotide query

sequence against the six-frame translations of a nucleotide sequence database.
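The five program types differ only in the kind of query and database they accept and in whether either is translated first. The sketch below encodes that as a lookup table and shows what a six-frame conceptual translation is. It is an illustrative helper only: it does not call BLAST, and the function names are hypothetical. Sequences are assumed to contain only A, C, G and T; any other codon is translated as "X".

```python
# (query type, database type) for each program described in the list above.
BLAST_PROGRAMS = {
    "blastp": ("protein", "protein"),
    "blastn": ("nucleotide", "nucleotide"),
    "blastx": ("translated nucleotide", "protein"),
    "tblastn": ("protein", "translated nucleotide"),
    "tblastx": ("translated nucleotide", "translated nucleotide"),
}


def candidate_programs(query_is_nucleotide, database_is_nucleotide):
    """List the programs that accept the given raw query and database types."""
    def is_nucleotide(kind):
        return "nucleotide" in kind
    return [name for name, (query, database) in BLAST_PROGRAMS.items()
            if is_nucleotide(query) == query_is_nucleotide
            and is_nucleotide(database) == database_is_nucleotide]


BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ..., GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from the start of seq; unknown codons become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translation(seq):
    """Translate three frames on each strand, keyed as +1, +2, +3, -1, -2, -3."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


print(candidate_programs(True, True))   # ['blastn', 'tblastx']
print(six_frame_translation("ATGGCC"))
```

For a nucleotide query against a nucleotide database there are two candidates: a direct comparison (BLASTn) or a comparison of the translated products (tBLASTx), the latter being more sensitive for distantly related coding sequences but considerably slower.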

7. RESEARCH PAPERS PUBLISHED:

1. R. Latha, L. Rubia, J. Bennett and M. S. Swaminathan. 2004. Allele Mining for

Stress Tolerance Genes in Oryza Species and Related Germplasm.

Molecular Biotechnology 27: 101-108.

2. N. Kaur, K. Street, M. Mackay, N. Yahiaoui and B. Keller. Allele mining and

sequence diversity at the wheat powdery mildew resistance locus Pm3.

Plant Molecular Biology 65: 93-106.

3. D. K. Berger. 2004. Gene-mining the Arabidopsis thaliana genome: applications

for biotechnology in Africa. South African Journal of Botany 70(1): 173–180.

4. Graciela Gonzalez, Juan C. Uribe, Luis Tari, Colleen Brophy and Chitta Baral.

2007. Mining gene-disease relationships from biomedical literature:

weighting protein-protein interactions and connectivity measures. Pacific

Symposium on Biocomputing 12: 28-39.

5. Seokkyung Chung, Jongeun Jun and Dennis McLeod. 2004. Mining Gene

Expression Datasets using Density-based Clustering. CIKM '04: 8–13.

6. Gerard R. Lazo, Debbie Laudencia-Chingcuanco, Yong Q. Gu and Olin D.

Anderson. 2004. Gene Mining for Conserved cis Elements in Model

Genomes Using Gene Expression Patterns. In: The NCBI Handbook. 106-109.

7. S. M. Khalessizadeh, R. Zaefarian, S. H. Nasseri and E. Ardil. 2006. Genetic

Mining: Using Genetic Algorithm for Topic based on Concept Distribution.

World Academy of Science, Engineering and Technology 13: 144-147.

8. Patents (program)

Peptide Mass Fingerprinting Database Management program Using

AMWISE and fBIND technique,

Registration Number: 2004-01-12-835

Peptide Mass Fingerprinting program Using AMWISE and fBIND technique,

Registration Number: 2004-01-12-836

cDNA Microarray data Classification tool

Registration Number: 2004-01-22-839

cDNA Microarray data Clustering tool

Registration Number: 2004-01-22-840
