The Physical and Genetic Mapping of the Mucin Genes Located on Chromosomes 7 and 11 by Alexander Stuart Hill A thesis submitted for the degree of Doctor of Philosophy University of London MRC Human Biochemical Genetics Unit Department of Biology University College London March, 1997
ProQuest Number: 10046191
4.28. Composite sequence of VEC1, VEC3, VEC4 and SIB172.
4.29. Diagrammatic representation of the speculative model of MUC3.
4.30. Sequence alignments of the sequences SIB172, SIB219, SIB223, SIB221, SIB217, SIB236, SIB227, SIB209, SIB235 and the vectorette sequence.
List of tables
2.1. Table showing the sequence, locus, melting temperature and application of the primers used during the course of the research described in this thesis.
4.1. Table showing the pairwise lod scores at maximum likelihood recombination fractions θ in males (M) and females (F) for MUC3 with a selection of chromosome 7 markers which have been localised to regions of chromosome 7 using physical methods.
4.2. Table showing the sizes of the MUC3 alleles detected with SIB124 on genomic DNA digested with PvuII from seven individuals.
4.3. Table showing the sizes of fragments detected using the cDNA probes SIB124, clone 20 and SIB172U on genomic DNA digested with PstI, PvuII and HindIII from a single individual.
4.4. Table showing the size of fragments detected using the cDNA probe on PFGE blots of genomic DNA digested with NotI, BssHII, NaeI, SmaI, SfiI, SacII, NruI and MluI from the cell line K562.
Abbreviations
ACHE Acetylcholinesterase gene
ALF Automated laser fluorescence
BAC Bacterial artificial chromosome
BrDU 5-bromo deoxyuridine
BSA Bovine serum albumin
CCD Charge-coupled device
CEPH Centre d'Etude du Polymorphisme Humain
CHEF Clamped homogeneous electric field
COL1A2 Collagen, type I, alpha 2 gene
COL2A1 Collagen, type II, alpha 1 gene
CRI-MAP Multipoint analysis computer package
DAPI 4',6-diamidino-2-phenylindole
DGGE Denaturing gradient gel electrophoresis
EPO Erythropoietin gene
ERV3 Endogenous retroviral sequence 3
EUROGEM European genome mapping initiative
FCS Foetal calf serum
FIGE Field inversion gel electrophoresis
FIM Frog integumentary mucin
FISH Fluorescent in situ hybridisation
FITC Fluorescein isothiocyanate
Gal Galactose
GalNAc N-acetylgalactosamine
GDB Genome database
GlcNAc N-acetylglucosamine
HBB Haemoglobin, beta gene
HBGU Human biochemical genetics unit
HGMP Human genome mapping project
HMFG Human milk fat globule
HRAS Harvey rat sarcoma viral oncogene homolog
ICRF Imperial cancer research fund
IL6 Interleukin 6 gene
INS Insulin gene
LMP Low melting point
lod Log of the odds
MET Met proto-oncogene (hepatocyte growth factor receptor)
MRC Medical research council
MUC1-7 The human mucin genes
MVR Minisatellite variant repeats
NIH National Institutes of Health
OFAGE Orthogonal field-alternating gel electrophoresis
PAI1 Plasminogen activator inhibitor, type I gene
PEM Polymorphic epithelial mucin
PFGE Pulsed field gel electrophoresis
PGM Phosphoglucomutase
PI Propidium iodide
PMSF Phenylmethylsulfonylfluoride
PUM Peanut lectin binding urinary protein
RFLP Restriction fragment length polymorphism
RT Reverse transcriptase
SDS Sodium dodecyl sulphate
SSCA Single stranded conformation analysis
TCRB T-cell receptor, beta cluster
TCRG T-cell receptor, gamma cluster
TEMED N,N,N',N'-Tetramethylethylenediamine
TH Tyrosine hydroxylase gene
TRITC Tetramethylrhodamine isothiocyanate
UVP Universal vectorette primer
VNTR Variable number of tandem repeats
vWF von Willebrand factor
YAC Yeast artificial chromosome
Acknowledgements
I would like to thank my supervisor, Dr. Dallas Swallow, for her advice,
support and encouragement during my time in the MRC Human Biochemical
Genetics Unit.
I would also like to express my appreciation of the friendship and help I
received from all my colleagues in the Galton laboratory. In particular I thank the
colleagues whose collaborative work I have included: Wendy Pratt for help with the
Southern blots and the sequencing; Lynne Vinall for sizing of the major MUC3
alleles; Yangxi Wang for help with the RT-PCR; Margaret Fox for introducing me to,
and assistance with, fluorescent in situ hybridisation; and John Attwood for his
invaluable help in the construction of the genetic maps of chromosomes 7 and 11.
I also wish to acknowledge the collaboration of: Dr. Jim Gum who provided
many of the clones used and sequence data, without which much of this work would
not have been possible; Dr. Jean-Pierre Aubert and Nicole Porchet for providing
clones for MUC5, used in the work described in chapter 3, and physical mapping data
of the region 11p15; Dr. Thecla Lesuffleur for providing the clone L31 used in the
work described in chapter 3; Dr. Stephen Scherer and Dr. Eric Green who isolated the
YAC clones from chromosome 7, the characterisation of which is described in
chapter 4; and Dr. Soreq and Dr. Getman who supplied the cosmid clones containing
ACHE used in the work also described in chapter 4.
I acknowledge CEPH and EUROGEM for the family DNAs and the MRC
Human Genome Mapping Project for providing the studentship as well as other
support.
1. Introduction
This thesis is concerned with the physical and genetic mapping of human
mucin genes. The introduction is divided into two sections: the first deals with
genetic polymorphism and the various techniques used for mapping; the second
considers mucin glycoproteins and the genes which correspond to specific mucins in
humans and other organisms.
1.1. Genetic variation in humans
The classical definition of a polymorphism is a variable characteristic for
which the frequency of the variant allele in the population is greater than that
produced by random mutations. This is commonly accepted to be when a variant
allele is detected at a frequency of at least 1 in 50 for a population of unrelated
individuals. Prior to gene cloning it was already clear that polymorphisms could be
detected in many proteins and that distinct allele products could be separated by
electrophoresis on the basis of their surface charge differences as in the case of
phosphoglucomutase (PGM) for example (Spencer et al. 1964). The basis of most of
these polymorphisms is variation in the coding region of the gene, which leads to
amino acid substitutions which may or may not have functional consequences. There
is an even larger number of polymorphisms in non-coding DNA, most of which have
no functional significance.
The recent advances in techniques for analysing DNA have led to a rapid
increase in the number of polymorphisms which can be used as markers and for other
genetical purposes such as the construction of maps. The first type of DNA
polymorphism detected was the restriction fragment length polymorphism (RFLP).
These polymorphisms are usually caused by small scale changes in the DNA such as
base substitutions and deletions which cause changes in the recognition sequences of
restriction enzymes resulting in restriction fragments of altered size (Jeffreys 1979;
Cooper et al. 1984). The nature of this type of polymorphism means that there are
usually only two alleles which in turn means that the maximum heterozygosity† is only
50% and is often less. The likelihood of detecting an RFLP at a given locus is quite
low: it has been estimated that the mean heterozygosity of human DNA is about
0.001 per nucleotide and many mutations will not result in the alteration of a
restriction site (Jeffreys 1979). Furthermore it is often impractical to screen with an
exhaustive selection of enzymes.
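The ceiling of 50% for a two-allele marker follows from the usual definition of heterozygosity as one minus the sum of the squared allele frequencies. The short Python sketch below illustrates this; the function name and the example frequencies are illustrative assumptions, not data from this thesis.

```python
def heterozygosity(allele_freqs):
    """Expected heterozygosity: 1 minus the sum of squared allele frequencies."""
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_freqs)


# Hypothetical allele frequencies, chosen only to illustrate the formula.
two_equal_alleles = heterozygosity([0.5, 0.5])    # best case for a two-allele RFLP
two_skewed_alleles = heterozygosity([0.9, 0.1])   # a rarer second allele
ten_equal_alleles = heterozygosity([0.1] * 10)    # a multi-allele marker
```

With two alleles the value cannot exceed 0.5, whereas a marker with many alleles of similar frequency approaches 1, which is why multi-allele loci are more informative.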
More recently a number of other techniques have been developed to detect
point mutations. One such method, single stranded conformation analysis (SSCA),
relies on the fact that single stranded DNA will take up various conformations
dependent on its sequence (Orita et al. 1989). These different conformations may
have different electrophoretic mobilities. A second method, denaturing gradient gel
electrophoresis (DGGE), is also dependent on differences in electrophoretic mobility
of DNA of the same size but slightly different sequence (Myers et al. 1985). In this
case the DNA is left double stranded and is run on an acrylamide gel which contains a
gradient of a denaturing chemical such as formamide. As the DNA moves through
the gel it will start to melt at a particular point in the gradient and there will be a sharp
reduction in electrophoretic mobility. This melting is determined by the sequence of
the regions with the lowest melting points. These techniques have been particularly
useful for identifying disease causing mutations and have occasionally been very
useful for revealing additional heterozygosity (Johnson et al. 1992; Harvey et al.
1995).
The discovery of hypervariable regions, often referred to as 'minisatellites', in
human DNA gave a significant boost to genetic analysis in humans (Jeffreys et al.
1985). A number of these loci were found close to genes such as HRAS* and
COL2A1‡ (Capon et al. 1983; Stoker et al. 1985). These hypervariable regions are
composed of tandem repeats of short sequences and the different alleles are the result
of variation of the number of these tandem repeats (VNTR). The number of alleles
detected for minisatellites ranges from 6 to 80 (Sykes et al. 1985; Balazs et al. 1986).
The repeat units of one set of minisatellites contain core sequences which are
conserved over a number of loci scattered throughout the genome and can be detected
with a probe for this sequence at low stringency (Jeffreys et al. 1985). This approach
can be used in order to produce a pattern of bands which is specific for an individual
i.e. a 'genetic fingerprint' but this is not useful for linkage. However probes
which are specific for a particular minisatellite locus, where the allelic relationships
can be determined, are useful for linkage studies (Nakamura et al. 1987; Wong et al.
1987).
† Heterozygosity = 1 − Σ(population frequencies of the alleles)². Allele frequency = Nᵢ/2N, where Nᵢ is the number of i alleles and N is the total number of individuals.
* Harvey rat sarcoma viral oncogene homolog. ‡ Collagen, type II, alpha 1.
There is also a further source of variation within these loci, namely small
sequence differences between the repeats (Jeffreys et al. 1990). These minisatellite
variant repeats (MVRs) are nucleotide substitutions (and other changes) distributed
along the minisatellite. Sequence analysis of the locus 5' to insulin has shown up to 9
MVRs per VNTR allele. The variable distribution of these MVRs and their ability to
be analysed using PCR based techniques means that the informativeness of a locus is
greatly enhanced (Jeffreys et al. 1991). Analysis of the allelic variation of these loci
has revealed a remarkably high mutation rate of up to 15% per gamete (Vergnaud et
al. 1991).
The precise mechanism involved in producing these mutations is not yet fully
understood but the lack of recombination in closely linked flanking markers suggests
that it is unlikely to be due to unequal crossing over between homologous
chromosomes during meiosis (Wolff et al. 1988; Wolff et al. 1989; Vergnaud, Mariat
et al. 1991). Detailed analysis of the structure of mutant minisatellite alleles and the
non reciprocal nature of the exchange of repeats indicates that processes such as
slippage during replication, gene conversion and unequal sister chromatid exchanges
are involved (Jeffreys, Neumann et al. 1990; Armour et al. 1993; Berg et al. 1993;
Desmarais et al. 1993; Buard et al. 1994; Jeffreys et al. 1994). Further analysis of the
distribution of MVRs showed that in some loci there was a certain polarity i.e. the
MVRs of each different type are clustered together at one end, although this is not
true in all cases and may suggest different processes are involved in generating and
maintaining these polymorphisms (Neil et al. 1993; Armour et al. 1996).
Since the discovery of minisatellites other types of VNTR have been
described such as di-, tri- and tetranucleotide repeats (Weber et al. 1989; Edwards et
al. 1991). These have proved very useful as they can often be typed using PCR based
techniques, which enables large numbers of samples to be screened relatively easily
and quickly. They have also served as sequence tagged sites (STS) for
the human genome mapping project.
The advances in both the different types of DNA polymorphisms and the
techniques for analysing them have had a dramatic effect on the mapping of the
human genome especially with respect to linkage analysis.
1.2. Human gene mapping
Some of the first genes mapped were X linked because of the ease of
interpreting the segregation in families. These include the Xg blood group (Mann et
al. 1962). The first genes were mapped to the autosomes using somatic cell hybrids
together with linkage analysis and analysis of cytogenetic abnormalities. Once the
first cDNAs had been cloned regional assignments were made by in situ
hybridisation. Further refinements to these techniques and the recent advances in
molecular genetics have led to the generation of information ranging from maps of
whole chromosomes to the structure and sequence of individual genes. These and
associated techniques will be discussed in this section.
1.2.1. Linkage analysis
Linkage analysis is used to measure the extent of non independent segregation
of loci in families. In order to detect linkage the loci must show a detectable variation
which is inherited and at least one of the parents must be doubly heterozygous for
each pair of loci to be tested. If we consider the simplest case of two loci on the same
chromosome that are close to each other and suppose that one locus has the alleles A
and a and the other locus has the alleles B and b, then if the parent has the alleles AB
on one chromosome and ab on the homologous chromosome the offspring will inherit
either AB or ab from that parent. If the loci were on different chromosomes (not
linked) then there would be equal numbers of AB, ab, Ab, aB offspring.
However in practice even when loci are linked some offspring with the
genotype Ab or aB may be detected due to exchange of genetic material by
recombination at meiosis. Meiotic recombination happens at a relatively low rate,
somewhere between 0 and 3 recombinations per chromosome. This means that many
of the chromosomes inherited by the offspring will be a mixture of each of the
parent’s pair of homologous chromosomes. The position on the homologous
chromosomes where recombination occurs is variable, though it appears not to be
entirely random. So although most individuals will have recombinations between
different loci, in the population as a whole there does seem to be localised clustering
or 'hot spots' of recombination. However for linkage analysis it is assumed that
recombination is a random process.
Although the phenomenon of recombination means that two loci on the same
chromosome separated by a large distance will almost inevitably be separated by a
recombination and thus not appear to be linked, information about the distance
between linked loci can be obtained. For example, if one again considers the simplest
case of two loci on the same chromosome, then the further apart the loci are the
greater the chance that a recombination will take place between them. Therefore in a
population the number of recombinants compared to non recombinants for two
particular loci is related to their physical separation. The term recombination fraction
describes the proportion of the total number of offspring that are recombinants and is
a measure of genetic distance.
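As a small illustration of the definition above, the Python sketch below counts recombinant and non-recombinant gametes passed on by a doubly heterozygous parent whose phase is known to be AB/ab. The gamete counts are hypothetical and the function name is an assumption made for this example.

```python
from collections import Counter


def recombination_fraction(gametes, parental=("AB", "ab")):
    """Proportion of transmitted gametes that are recombinant.

    gametes  -- haplotypes transmitted by the doubly heterozygous parent
    parental -- the two non-recombinant haplotypes (phase assumed known)
    """
    counts = Counter(gametes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no gametes scored")
    non_recombinants = sum(counts[h] for h in parental)
    return (total - non_recombinants) / total


# Hypothetical scoring of 20 offspring: 18 parental types and 2 recombinants.
scored = ["AB"] * 9 + ["ab"] * 9 + ["Ab", "aB"]
theta = recombination_fraction(scored)
```

For unlinked loci the four gamete types are expected in equal numbers, so the same calculation would give a value close to 0.5.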
The detection of linkage and the measurement of recombination fractions
between loci is easier in organisms such as mice as opposed to humans because the
mating can be controlled so that the family is fully informative for the loci being
tested. Also they tend to have large families and short generation times which
enables statistically significant results to be obtained from a few families in a short
space of time.
The inability to carry out crosses between humans means that the population
must be searched in order to find families which are informative for the loci being
tested. In addition the small size of human families means that the results from a
number of families need to be pooled in order to get a statistically significant measure
of the recombination fraction between loci. These problems are further complicated
by the long generation time which means that it is unlikely that an investigator would
be able to observe more than three generations.
The data obtained from three generation families is extremely useful as the
phase of loci can be deduced i.e. which alleles for the loci co-segregate from
grandparent to parent. This means that the amount of recombination in the children
can be directly determined. However three generation families which are informative
for the loci being tested are often either not available or are too few to give
statistically significant data. Indeed often only two generations are available for
study and because the phase of these families is not known the amount of
recombination in the children cannot be directly measured. However various
statistical methods have been devised in order to determine the recombination in
families indirectly.
The most commonly used statistical method is that of lod scores which
enables the data from two and three generation families to be combined (Morton
1955). This method not only detects linkage but also gives a measure of the
recombination fraction. The lod score is a measure of the likelihood that you would
obtain the offspring observed if the loci are linked at a given recombination fraction
compared with the loci not being linked, and can be calculated using the equation:
Z(θ) = log10 [L(θ)/L(1/2)] where Z = the lod score, θ = the recombination fraction and
L = the likelihood. Usually the lod scores for a range of values of θ are calculated and
the highest value of Z (Zmax) taken to be the lod score for the particular loci at the
corresponding recombination fraction. The main advantage of lod scores is that the
data from different families can be combined by simple addition of the Z values.
Traditionally the Z values were obtained using tables devised by Maynard-Smith,
Penrose and Smith, although these only dealt with families with up to 7 children
(Maynard-Smith et al. 1961). The development of computer programs such as
HANDLINK (written in this laboratory by J. Attwood) has enabled families of any
size to be analysed. A lod score of 3 is often accepted as the minimum value at which
two loci are considered to be linked i.e. the odds are 1000 to 1 in favour of
linkage. Under certain circumstances, such as when there is physical data to link
the loci, a lod score of less than 3 will be acceptable. Conversely if a genome
wide search for linkage is undertaken then there is a chance that false linkages with
lod scores of 3 will be detected and some researchers suggest that a minimum lod
score of 4 is more appropriate in this instance.
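The calculation can be sketched in Python for the simplest situation, in which phase is known so that every meiosis can be scored as recombinant or non-recombinant. Under that assumption the likelihood is taken to be L(θ) = θ^R (1 − θ)^(N − R) for R recombinants in N meioses. The function names and the family counts are hypothetical; this is not the HANDLINK implementation.

```python
import math


def lod_score(recombinants, non_recombinants, theta):
    """Z(theta) = log10[L(theta) / L(1/2)] for phase-known meioses (assumed model)."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    n = recombinants + non_recombinants
    if theta == 0.0:
        # Any recombinant rules out theta = 0; otherwise L(0) = 1.
        return float("-inf") if recombinants > 0 else n * math.log10(2.0)
    log_l_theta = (recombinants * math.log10(theta)
                   + non_recombinants * math.log10(1.0 - theta))
    log_l_half = n * math.log10(0.5)
    return log_l_theta - log_l_half


def max_lod(families, steps=50):
    """Sum Z over families on a grid of theta values and return (theta, Zmax)."""
    best_theta, best_z = 0.5, 0.0
    for i in range(steps + 1):
        theta = 0.5 * i / steps
        z = sum(lod_score(r, nr, theta) for r, nr in families)
        if z > best_z:
            best_theta, best_z = theta, z
    return best_theta, best_z


# Hypothetical (recombinants, non-recombinants) counts for three families.
families = [(1, 9), (0, 8), (1, 11)]
theta_hat, z_max = max_lod(families)
```

With these made-up counts the summed lod score peaks near θ = 0.07 at a value of about 5.8, above the conventional threshold of 3. The summation inside `max_lod` is the simple addition of Z values across families described above.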
So far I have only considered the case of two loci. However the information
obtained from linkage analysis can be used to predict the order of multiple loci on a
chromosome i.e. multipoint analysis. Again because of the ability to perform
controlled mating in mice for example, multi locus maps can be constructed relatively
easily by examining the recombination patterns of the offspring. The situation in
humans is complicated because, often when considering a number of loci, not all will
be informative in the family. Indeed the more loci considered the greater the
likelihood of uninformative loci in any particular family. This means that to obtain a
reliable order for the loci under test, data from a large number of families needs to be
combined, and even then the deduced order will only be the one with the highest
probability based on that particular data set within a given set of parameters. The
complexity of the calculations involved meant that the manual construction of large
scale maps of chromosomes was not practical and it was only the advent of computers
which made this a realistic possibility.
The process of linkage and multipoint analysis has been greatly enhanced by
the availability of resources such as DNA from the 60 large families collected by the
Centre d'Etude du Polymorphisme Humain (CEPH) and the development of powerful
computer programs such as CRI-MAP (Donis-Keller et al. 1987). CRI-MAP is in
fact a collection of programs for the manipulation and analysis of family data, which
can be selected from the various options presented in the main menu. The main
purpose of CRI-MAP is the construction of multi locus genetic maps using the
multipoint analysis program 'build'. The program first ranks the loci being tested in
order of their informativeness i.e. the number of informative meioses. The two most
informative loci are then used as the basis for the map. It then tries to insert the next
most informative locus by creating three new orders with the locus in one of the three
possible positions in each order. The maximum log likelihood is then calculated for
each order by varying the distances between all the loci. If the log likelihood of one
order exceeds that of each of the other orders by more than a predetermined
threshold, usually three, then this order is chosen and used for the next locus. If
none of the orders exceeds the others by more than three
then the locus is left out and the program moves onto the next locus. This process is
repeated until all the chosen loci have been tested. Then using the option 'flips' the
local support of groups of markers in the order from the build output can be checked.
This is done by comparing all the different permutations of groups of up to 5 markers
to see if an alternative to the original order of this group is more likely i.e. increases
the overall likelihood of the whole order. The option 'twopoint' allows you to
calculate lod scores for pairs of loci. The option 'chrompic' is able to create
diagrams of the chromosomes which show the parental origin of the alleles and thus
the meiotic breakpoints. The ability to identify specific meiotic breakpoints in
individuals is very useful as it enables the rapid positioning of loci within a
pre-existing order without having to rebuild the whole map again.
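The stepwise insertion strategy described for 'build' can be sketched as follows. This is only an outline of the ordering logic under stated assumptions, not CRI-MAP code: the `log10_likelihood` argument is a hypothetical callable standing in for the maximised likelihood of an order, and the toy scorer below simply penalises the total map length implied by invented marker positions.

```python
def build_order(ranked_loci, log10_likelihood, threshold=3.0):
    """Greedy insertion of loci, most informative first.

    ranked_loci      -- loci already ranked by informativeness
    log10_likelihood -- hypothetical callable: order -> log10 likelihood
    threshold        -- support required over the next best order
    Returns (order, skipped).
    """
    order = list(ranked_loci[:2])
    skipped = []
    for locus in ranked_loci[2:]:
        # Try the new locus in every interval, including both ends.
        candidates = [order[:i] + [locus] + order[i:]
                      for i in range(len(order) + 1)]
        scored = sorted(((log10_likelihood(c), c) for c in candidates),
                        key=lambda pair: pair[0], reverse=True)
        (best_score, best_order), (next_score, _) = scored[0], scored[1]
        if best_score - next_score > threshold:
            order = best_order
        else:
            skipped.append(locus)  # placement not sufficiently supported
    return order, skipped


# Invented marker positions, used only by the toy scorer below.
positions = {"M1": 0, "M2": 10, "M3": 20, "M4": 30, "M5": 40, "M6": 1}


def toy_score(order):
    """Stand-in for a likelihood: shorter implied maps score higher."""
    length = sum(abs(positions[a] - positions[b])
                 for a, b in zip(order, order[1:]))
    return -0.5 * length


order, skipped = build_order(["M1", "M4", "M2", "M3", "M5", "M6"], toy_score)
```

In this toy example M6 lies so close to M1 that its two best placements differ by less than the threshold, so it is left out of the map, mirroring the behaviour described for loci whose position is not well supported.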
Genetic maps have now been generated for all the 22 human autosomes and
the X chromosome. Much of this work has been carried out by dedicated centres
such as GENETHON, EUROGEM and the CEPH consortium.
1.2.2. Somatic cell hybrids
Somatic cell hybrids have proved to be a useful tool for the localisation of loci
to specific chromosomes and even to particular regions on the chromosome.
These hybrid cells are produced by fusion of human cells with permanent
rodent cell lines. This mapping technique exploits the fact that there is loss of whole
human chromosomes or fragments of chromosomes from the fused human/rodent
hybrid cell lines (Ruddle 1973). In the early studies the presence or absence of a
specific human gene product was correlated with the presence or absence of a
chromosome. More recently however Southern hybridisation and PCR techniques
have been used in order to determine the presence or absence of genes by testing the
DNA directly.
Hybrid cells are produced by mixing the human cells with the rodent cells in
the presence of polyethylene glycol or Sendai virus to enhance the fusion process.
Various selection techniques are used in which only the hybrid cell line can survive,
one of the most popular being the HAT selection system (Littlefield 1964). When the
fused cells divide human chromosomes are lost. After several rounds of division the
cells stabilise and stop losing human chromosomes and clones can be isolated. Each
clone used to establish a cell line contains a different selection of human
chromosomes. In order to assign a locus to a single chromosome a panel of cell lines
is usually studied and the presence or absence of a particular locus in the various cell
lines can then be correlated with the presence or absence of a chromosome
throughout the same panel. However a number of single human chromosome hybrids
are also available which often avoids the testing of an extended panel of hybrids.
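The correlation of a locus with a chromosome across a hybrid panel amounts to counting discordant cell lines for each chromosome. A minimal Python sketch is given below; the panel contents, the cell line names and the function name are all hypothetical.

```python
def discordance_by_chromosome(panel, locus_present):
    """Count, for each chromosome, the cell lines in which the locus and
    the chromosome are not present or absent together.

    panel         -- {cell line: set of human chromosomes retained}
    locus_present -- {cell line: True if the locus was detected}
    """
    chromosomes = set().union(*panel.values())
    return {
        chrom: sum((chrom in retained) != locus_present[line]
                   for line, retained in panel.items())
        for chrom in chromosomes
    }


# Hypothetical four-hybrid panel and typing results for one locus.
panel = {
    "hybrid_A": {"7", "11", "X"},
    "hybrid_B": {"11", "21"},
    "hybrid_C": {"7", "21"},
    "hybrid_D": {"X"},
}
locus_present = {"hybrid_A": True, "hybrid_B": False,
                 "hybrid_C": True, "hybrid_D": False}

scores = discordance_by_chromosome(panel, locus_present)
assigned = min(scores, key=scores.get)
```

A chromosome with no discordant hybrids is the candidate assignment; in practice a panel must be large enough that only one chromosome shows complete concordance.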
Hybrids which contain translocated chromosomes and X-ray induced
chromosome fragments are useful in increasing the resolution of the localisation of
loci (Burgerhout et al. 1973). The presence or absence of a particular locus in
hybrids which contain a fragment of a chromosome characterised by defined
breakpoints can be used to provide a regional assignment for that locus, though the
interpretation of results from these hybrids can sometimes be difficult because the
rearrangements which take place are quite complex. The results obtained from
hybrids containing X-ray irradiated chromosomes can be used to give a measure of
the distance between syntenic loci (loci that are on the same chromosome but are not
linked) because the frequency with which loci are separated is proportional to the
distance between them. The data can then be used in much the same way as
recombination fractions to determine an order of loci along the chromosome. Indeed
a recent map containing 6000 genes was constructed using radiation hybrids (Schuler
et al. 1996).
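The idea that markers lying further apart are separated more often can be illustrated with a crude count of the hybrids that retain one marker but not the other. The sketch below is a simplification made for illustration: the retention patterns are invented, and a real radiation hybrid analysis would also take account of the retention frequency of each marker.

```python
def separation_frequency(retention_a, retention_b):
    """Fraction of hybrids retaining exactly one of two markers."""
    if len(retention_a) != len(retention_b) or not retention_a:
        raise ValueError("need two equal-length, non-empty retention patterns")
    separated = sum(1 for a, b in zip(retention_a, retention_b) if a != b)
    return separated / len(retention_a)


# Invented retention patterns (1 = marker present) across ten hybrids.
marker_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
marker_b = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
marker_c = [0, 0, 1, 0, 1, 0, 1, 1, 1, 0]

a_to_b = separation_frequency(marker_a, marker_b)
a_to_c = separation_frequency(marker_a, marker_c)
```

Here markers A and B are separated in fewer hybrids than A and C, which would place B closer to A than C is.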
Somatic cell hybrids have to some extent been superseded by the development
of in situ hybridisation which will be discussed in the next section. However this
technique still provides a relatively cheap and, in conjunction with PCR, rapid method
of mapping loci, and is sometimes preferable for mapping cDNAs.
1.2.3. In situ hybridisation
The major application of in situ hybridisation for mapping purposes is the use
of DNA probes to localise homologous sequences with respect to the banding patterns
produced by the chromosome staining procedures. The classic chromosome stain
used was Giemsa. A reproducible pattern of light and dark bands along metaphase
chromosomes can be seen when viewed with a high power visible light microscope
and is commonly referred to as G banding (Seabright 1971). The combination of the
number and thickness of the bands produced is specific for each of the chromosomes.
This can be used to distinguish each pair of homologous chromosomes from the
others and divides the chromosome into defined regions. More recently however the
DAPI (4',6-diamidino-2-phenylindole) stain has been used which is fluorescent and is
visualised using a UV light source. The banding patterns are also useful in
identifying translocations and the presence of extra chromosomes. In addition to
hybridisation to metaphase chromosomes, interphase nuclei and stretched chromatin
are also used for particular applications.
The first probes used were radioactively labelled but these days they are
usually fluorescently labelled. One detection method uses avidin conjugated with a
fluorescent dye, usually FITC (fluorescein isothiocyanate), to detect the biotinylated
probe, although some workers use digoxigenin and others use probes directly
labelled with fluorescent dyes. The use of fluorescent dyes has meant that by using a
different colour such as TRITC (tetramethylrhodamine isothiocyanate) two or more
probes can be used simultaneously. Multiple probes labelled with different dyes can
be used for measuring the physical distance between two loci and determining the
order of loci directly on the chromatin (Trask et al. 1989). The chromosomes and
probes can then be visualised using a UV illuminated microscope, although more
recently confocal laser microscopes and CCD (charge-coupled device) cameras have
enabled the data to be fed directly into computers for image analysis. One of the
most useful aspects is the ability to depict the chromosomes in one colour and the
signal from the probe or probes in other distinct colours. Until recently it was only
possible to visualise different probes with two different colours. However with the
development of cooled CCD cameras, which are more sensitive, and more
sophisticated image analysis programs different probes can be distinguished on the
basis of the proportions of the two colours with which each probe has been labelled.
The computer will then display the signal from each probe as a different 'false' colour.
In situ hybridisation has not only been extremely useful for mapping
applications in terms of chromosomal localisation of loci, it is also a useful tool for
checking the integrity of clones, especially YACs, which seem to suffer from
relatively high levels of chimerism.
1.2.4. Cloning
The size of the human genome has been estimated to be around 3 × 10⁹ base
pairs (bp) and the genes are thought to occupy approximately 5% (Fields et al. 1994).
It has also been estimated that there are between 50 000 and 100 000 genes, so in order to
study and manipulate genes and other regions of interest it is extremely useful to be
able to isolate specific sequences from the rest of the genome. The usual approach is
cloning where fragments of DNA are inserted into a vector which enables the DNA to
be taken up by a host organism, usually bacteria. The bacteria will then replicate the
recombinant vector as they divide and large amounts of the desired DNA fragment can
be recovered from a culture of the transformed bacteria. Clones are usually isolated
by screening libraries comprised of a large number of different clones that as a whole
represent the entire sequence of the DNA used in its construction.
1.2.4.1. cDNA clones
Complementary DNAs (cDNAs) corresponding to the exon sequences of
genes are frequently isolated from expression libraries using antiserum raised against
the product of the gene of interest. Alternatively libraries can be screened by colony or
plaque hybridisation with radioactive DNA/RNA probes or antibodies. When choosing a
library the expression pattern of the gene should be considered because screening a
library of a tissue with a high level of expression will increase the chances of
isolating a clone containing the desired sequence. The positions of intron/exon
boundaries can be determined by comparison of cDNA sequences and genomic
sequences.
1.2.4.2. Genomic clones
Genomic clones are important in the study of the genetic structure of a gene as
they contain not only the coding regions but the noncoding regions such as introns
and promoter regions. The classical method of obtaining genomic clones is the
screening of libraries with cDNA clones. Libraries of genomic clones are also useful
for the positional cloning of genes by chromosome walking.
The two most commonly used vectors for construction of genomic libraries
during the time frame of this project were cosmids and YACs (yeast artificial
chromosomes), which are useful because of their relatively large insert size of
approximately 50kb and up to 1Mb respectively. However in the last few years the
reliability of YAC clones has come into question. This is due to the discovery that a
relatively large proportion of clones in the libraries are the product of
recombination events between quite unrelated sequences; indeed some YAC libraries
have been estimated to consist of as much as 40 to 60% chimeras. The use of FISH to
identify chimeric clones has alleviated this problem to a certain extent but is not
suitable for identifying deletions or duplications of sequences. These limitations have
led to the development of new vectors such as P1 and BACs (bacterial artificial
chromosomes) which have a capacity of about 100kb.
A useful genomic DNA library would probably contain a range of random
overlapping fragments with a size of greater than about 20kb. Such libraries may be
constructed from total genomic DNA or from selected regions such as single human
chromosomes sorted using FACS (fluorescent automated chromosome sorting)
machines. The larger the insert size, the fewer the number of clones which need to be
screened in order to obtain the desired sequence coverage. The cloning of large
amounts of flanking sequence is especially useful for identifying control regions such
as promoters and for chromosome walking applications. The standard method of
producing these fragments for cosmid libraries is to do a partial digest using Sau3A.
In the case of YACs the process is similar except that the restriction enzymes used cut
less frequently i.e. NotI or SmaI. The large size of inserts has made it possible to
create gridded arrays of libraries in microtitre well plates where each well contains a
single clone. This can then be screened by either making gridded filters which can be
hybridised with radioactively labelled probes or PCR of pools of clones organised in
such a way that the well position of the positive clones can be identified.
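The relationship between insert size and the number of clones that must be screened can be illustrated with the widely used Clarke and Carbon estimate, N = ln(1 - P) / ln(1 - f), where f is the insert size expressed as a fraction of the genome and P is the desired probability that a given sequence is represented. The Python sketch below is illustrative only: the formula is a standard estimate rather than one taken from this thesis, the genome size of 3 x 10^9 bp and P = 0.99 are assumed values, and the function name is hypothetical.

```python
import math


def clones_needed(insert_bp, genome_bp, probability=0.99):
    """Clarke and Carbon estimate of the number of random clones required
    for a given sequence to be present with the stated probability."""
    fraction = insert_bp / genome_bp  # share of the genome held by one clone
    return math.ceil(math.log(1.0 - probability) / math.log(1.0 - fraction))


GENOME_BP = 3_000_000_000  # assumed haploid human genome size

# Approximate insert sizes quoted in the text for each vector type.
for vector, insert_bp in [("cosmid", 50_000), ("P1/BAC", 100_000), ("YAC", 1_000_000)]:
    print(vector, clones_needed(insert_bp, GENOME_BP))
```

Under these assumptions a cosmid library requires a few hundred thousand clones whereas a YAC library with 1Mb inserts requires of the order of ten thousand, which is what makes gridded arrays practical.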
1.2.4.3. Other vectors used in the manipulation and sequencing of cloned
DNA
Once clones which contain the sequences of interest have been obtained they
are often subcloned into plasmid based vectors which can be more easily cultured and
the sequence of interest recovered. Plasmid based vectors such as the pUC series are
often used for applications such as making probes for hybridisation and detailed
restriction mapping. These are double stranded, have a multiple cloning site and can
easily be propagated in E. coli. There is also a class of vectors called phagemids which
have two origins of replication; one derived from ColE1 and the other from the f1
phage. Normally the ColE1 origin is used for plasmid replication, however in the
presence of phage f1 infection the other origin is used and the plasmid is replicated as
single stranded DNA. One of the most popular phagemid clones is the pBluescript
series which also has T7 and T3 phage promoters either side of the multiple cloning
site which allow expression in either orientation. The M13mp series of vectors based
on M13 filamentous coliphage have routinely been used for sequencing because the
DNA is single stranded which means there is no interference from the complementary
strand. These vectors are still popular when large amounts of DNA are to be
sequenced but recent advances in sequencing have enabled high quality sequence to be
obtained from double stranded DNA with relative ease.
1.2.5. The polymerase chain reaction (PCR)
Recently the polymerase chain reaction (PCR) has allowed the development
of techniques which enable specific sequences to be amplified and has increased the
repertoire of approaches to gene characterisation and sequencing. The main
advantages of PCR are the speed and ease with which specific fragments of DNA can
be amplified. However there are limitations, of which the most important is the
requirement of fairly detailed sequence information.
The process can be split into three stages i.e. denaturation of the template,
annealing of the sequence specific primers and finally extension. This is achieved by
cycling the reaction through different temperatures, for example 95°C (denaturation),
50-60°C (annealing/extension) and 72°C (extension). This is repeated a number of
times, usually 30 to 40, resulting in the production of very many copies: theoretically
the number of copies is doubled during each cycle e.g. after 30 cycles there would be
approximately n × 10^9 copies from n starting molecules. This enormous amplification means that PCR is extremely sensitive,
indeed it is possible to amplify a single target molecule. This extreme sensitivity
however means that contamination can be a significant problem.
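The theoretical doubling described above can be checked with a short calculation. The following Python sketch is illustrative only; the function name and the optional efficiency parameter are assumptions introduced here to show how a less than perfect doubling per cycle would reduce the yield.

```python
def pcr_copies(start_copies, cycles, efficiency=1.0):
    """Theoretical copy number after a given number of PCR cycles.

    efficiency=1.0 corresponds to perfect doubling in every cycle; lower
    values are a simple, assumed way of modelling imperfect amplification.
    """
    return start_copies * (1.0 + efficiency) ** cycles


# Copies produced from a single template molecule with perfect doubling.
for cycles in (10, 20, 30, 40):
    print(cycles, f"{pcr_copies(1, cycles):.3g}")
```

With perfect doubling a single template molecule gives 2^30, or roughly 10^9, copies after 30 cycles, which also illustrates why even trace contamination can be amplified to detectable levels.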
A number of applications based on PCR have been developed in recent years;
these include vectorette PCR, which has been used during the course of this research.
Vectorette PCR is a technique that enables specific fragments of unknown sequence
to be amplified from a complex source such as a large clone or even genomic DNA
(Fig. 1.1). This is made possible by the use of a specific primer to a known piece of
sequence and the so called vectorette unit. The vectorette unit is comprised of two
synthetic oligonucleotides annealed to each other which have complementary
sequence at each end separated by a stretch of mismatched sequence. These
vectorette units come with a variety of sticky or blunt ends which can be ligated to
DNA digested with different enzymes. Like normal PCR two primers are required,
the sequence specific primer and the universal vectorette primer (UVP). The UVP
has the same sequence as one side of the mismatch portion of the vectorette unit
which means that it cannot prime until the complementary strand is synthesised. The
complementary strand can only be synthesised if the fragment of DNA ligated to the
vectorette contains sequence identical to the specific primer. Once the
complementary strand has been synthesised the UVP can prime in the next round of
amplification and the PCR reaction can proceed as normal.
STAGE 1. Ligation of the vectorette unit to DNA (genomic or cloned) digested with a suitable restriction enzyme.
STAGE 2. During the first round of amplification only the specific primer (red arrow) is able to anneal. This synthesises the complementary strand to one side of the target DNA and the ligated strand of the vectorette, including the mismatch region.
STAGE 3. In the second round of amplification the vectorette primer, which has identical sequence to the mismatch region, can anneal to the complementary strand primed by the specific primer and the reverse strand is synthesised.
STAGE 4. During subsequent amplification cycles a specific product is amplified in the usual way.
Figure 1.1.
Diagrammatic representation of the vectorette PCR process. The complementary
strands are depicted in red and black. The mismatch strand of the vectorette unit is
depicted in green. The specific primer is represented as a red arrow and the universal
vectorette primer is represented as a black arrow.
1.2.6. Restriction enzyme analysis of DNA
Restriction enzymes are used to cut DNA at specific sites into different sized
fragments which are separated on agarose or acrylamide gels. The fragments
produced by different restriction enzymes cover a range of sizes from a few hundred
bases to a few megabases. The length and sequence specificity of the restriction site
determine the frequency with which the DNA will be cleaved. For example BamHI
which has the recognition sequence GGATCC would on average cut a random piece
of DNA every 4000 nucleotides. However in practice the size of fragments is
extremely variable: in the case of the phage λ genome, of 48.5kb in size, there are
only 5 sites which is a reflection of the GC content being somewhat less than 50%.
The less frequently cutting enzymes tend to have longer recognition sequences and
some such as NotI only have G and C in the recognition sequence and produce
fragments of around 500bp to 1Mb when human genomic DNA is digested.
Restriction enzymes which produce fragments of 50kb or more are often referred to
as rare cutters. Restriction enzymes such as NotI are useful for identifying potential
CpG islands because the recognition sequence is composed of G's and C's. CpG
islands are regions in the genome that are relatively undermethylated and have a high
GC content with a higher proportion of CpG than the rest of the genome. These
regions are often associated with the 5' ends of genes (Craig et al. 1994).
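The dependence of cutting frequency on the length and base composition of the recognition site, discussed earlier in this section, can be illustrated numerically. The Python sketch below treats DNA as a random sequence with a given GC fraction, which is a deliberate simplification since real genomes are not random. The function name and the 40% GC figure are assumptions made purely for illustration; the BamHI site GGATCC is taken from the text, and GCGGCCGC is the recognition sequence of NotI.

```python
def expected_spacing(site, gc_fraction=0.5):
    """Mean distance in bp between occurrences of `site` in random
    sequence with the given GC fraction (a simplifying assumption)."""
    base_probability = {
        "G": gc_fraction / 2, "C": gc_fraction / 2,
        "A": (1 - gc_fraction) / 2, "T": (1 - gc_fraction) / 2,
    }
    probability = 1.0
    for base in site:
        probability *= base_probability[base]
    return 1.0 / probability


print(expected_spacing("GGATCC"))           # BamHI, equal base frequencies
print(expected_spacing("GCGGCCGC"))         # NotI, equal base frequencies
print(expected_spacing("GCGGCCGC", 0.4))    # NotI, assumed 40% GC content
print(48_500 / expected_spacing("GGATCC"))  # naive BamHI site count for 48.5kb
```

For equal base frequencies the six base site is expected once every 4^6 = 4096bp and the eight base site once every 4^8 = 65536bp. Lowering the GC fraction makes a site composed only of G and C rarer still, in keeping with the large fragments produced by rare cutters.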
Some restriction enzymes are sensitive to methylation of the DNA which
results in either increased or decreased efficiency of cleavage. The methylation of
DNA can result in partial digestion of the DNA which can be useful for map
construction and can show the relative positions of two or more restriction sites for
the same enzyme. However if the partial digestion is a problem then the use of
different cell lines which will have different methylation patterns can help. K562 was
chosen in this project as it is considered in general to be relatively undermethylated
(Guyonnet-Duperat 1993). Isoschizomers are useful pairs of enzymes which
recognise the same restriction site but in which the action of one is not affected by
methylation.
Because the size of fragments produced by the various restriction enzymes
ranges from tens of bases to a number of megabases, different electrophoresis
conditions are used i.e. standard agarose or acrylamide electrophoresis and pulsed
field gel electrophoresis (PFGE).
Standard electrophoresis uses a gel comprised of buffer and agarose or
acrylamide to act as a molecular sieve through which molecules, in this case DNA
fragments, migrate at different rates depending on size, when an electric field is
applied. In general the larger the fragment the slower it will move through the gel.
The concentration of the gel is also important as the DNA can move more easily
through lower percentage gels, with low percentage gels more suitable for the
resolution of larger fragments. For example the most commonly used concentrations
of agarose gels range from approximately 0.8% to 3% and the range of sizes which
can be realistically resolved using standard agarose gel electrophoresis is
approximately 200bp to 40 OOObp. Below 200bp acrylamide gels are used as they can
separate fragments that differ in size by a single nucleotide.
Under standard electrophoresis conditions the agarose gel matrix seems
unable to resolve fragments above approximately 50kb with the result that these large
fragments appear to co-migrate through the gel. However separation of DNA up to a
number of megabases can be obtained using pulsed field gel electrophoresis (PFGE).
Essentially this technique relies on alternating the direction of the electric field across
the gel. The current theory is that when the direction of the field is changed the DNA
must reorientate in order to move in a new direction through the gel and the larger the
fragment the longer the reorientation time. However there is no definitive model
which describes accurately the processes involved and the interactions that occur
between the DNA and gel matrix during PFGE. Indeed the precise nature of the gel
itself is not yet fully understood. However a number of systems have been developed
to exploit this phenomenon. The simplest is field inversion gel electrophoresis
(FIGE) in which the standard two electrode configuration is used (Carle et al. 1986).
The direction of the electric field is periodically inverted with the time in the desired
direction for migration of the DNA being longer. However the resolution range of
this technique is still fairly limited with an upper limit of about 700kb. If the
alternating electric fields are at an angle to the net direction of migration a larger
range of sizes could be resolved (up to many megabases). One of the most popular
systems is contour clamped homogeneous electric fields (CHEF) (Chu et al. 1986).
The electrodes are arranged hexagonally to create an electric field which is very even
across the gel at an angle of 120°. The direction is then switched from one side to the
other for equal lengths of time so that although the DNA zig zags down the gel, the
net result is a fairly straight run in contrast with other systems such as orthogonal
field-alternating gel electrophoresis (OFAGE) (Carle et al. 1984).
In order to identify individual restriction fragments in genomic DNA so called
Southern blots of these gels can be hybridised with specific probes ranging in size
from a few tens of bases to hundreds of kilobases. The DNA is immobilised onto a
solid support of either nitrocellulose, or more commonly now, robust nylon
membrane. The DNA can be transferred onto the membrane by a number of methods
i.e. capillary blotting, electroblotting or vacuum blotting. Once the DNA has been
fixed to the support it can be hybridised with a radioactively labelled probe which
will detect all the fragments with homologous sequence. The size of the fragments
can be determined by comparison with a molecular size standard.
The construction of all restriction maps is in principle the same. The DNA is
cut with a number of different enzymes and the sizes of the fragments detected
determined. This will show the distance between pairs of the same restriction sites.
Then double digests are done where the DNA is digested with two different enzymes
to show where there is a restriction site for one enzyme within a fragment produced by
another enzyme. The relative order of the fragments can then be determined by
constructing a model which fits all the data; the model can then be added to or
changed as new data becomes available. The position within the map of specific
sequences of interest and their orientation can be determined by Southern analysis.
The construction of detailed restriction maps for cloned DNA is more straightforward
than for genomic DNA because all the fragments produced by a particular enzyme
will be seen, not just those with homologous sequence to a probe used on a Southern
blot. The scale of a particular restriction map is dependent on the enzymes used and
the source of DNA. Fairly detailed maps of single genes tend to be constructed by
the digestion of clones with four and six cutters. However by using restriction
enzymes, such as NotI, and PFGE the approximate physical distances between loci
and their order over a region of a few megabases can be determined.
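The logic of fitting a model to single and double digest data, as described above, can be sketched as follows. This Python example is a minimal illustration rather than the procedure used in this work: the 10kb linear clone, the site positions and the enzyme labels are all hypothetical, and the function names are introduced here only for the purpose of the example. A candidate map is accepted if the fragment sizes it predicts match the observed sizes for every digest.

```python
def fragments(length, *site_lists):
    """Fragment sizes (largest first) from cutting a linear molecule of
    the given length at every position in the supplied site lists."""
    cuts = sorted({position for sites in site_lists for position in sites})
    edges = [0] + cuts + [length]
    return sorted((end - start for start, end in zip(edges, edges[1:])), reverse=True)


def consistent(length, model, observed):
    """True if the candidate map predicts the observed sizes for each digest.

    `model` maps enzyme name to site positions; `observed` maps a tuple of
    enzyme names (one for a single digest, two for a double) to fragment sizes.
    """
    return all(
        fragments(length, *(model[enzyme] for enzyme in digest)) == sorted(sizes, reverse=True)
        for digest, sizes in observed.items()
    )


LENGTH = 10_000                                 # hypothetical 10kb linear clone
model = {"EcoRI": [2_000, 7_000], "BamHI": [4_500]}  # hypothetical site positions
observed = {
    ("EcoRI",): [5_000, 3_000, 2_000],
    ("BamHI",): [5_500, 4_500],
    ("EcoRI", "BamHI"): [3_000, 2_500, 2_500, 2_000],
}
print(fragments(LENGTH, model["EcoRI"], model["BamHI"]))
print(consistent(LENGTH, model, observed))
```

In this hypothetical case the double digest shows that the 5kb fragment from the first enzyme is split into two 2.5kb fragments, placing the site for the second enzyme in the middle of that fragment.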
1.2.7. Sequencing
There are a number of techniques which have been developed in order to
determine the precise nucleotide sequence of DNA but the most commonly used
techniques are based on the chain termination method developed by Sanger (Sanger et
al. 1977). The basis of this technique is the use of 2', 3'-dideoxyribonucleoside
triphosphates (ddNTPs). When these ddNTPs are incorporated into the strand being
synthesised they are unable to form phosphodiester bonds which results in
termination of synthesis of that particular strand. By adding a small amount of a
specific ddNTP to a reaction containing all four deoxyribonucleoside triphosphates
(dNTPs) the ddNTP will be incorporated in a random manner. This will create a
range of different sized fragments all of which have the particular ddNTP at their 3'
end. The use of a specific primer ensures that synthesis will start at the same place
each time.
In order to sequence a piece of DNA four reactions must be done where each
reaction contains one of the four ddNTPs. The products are detected by either
incorporating a radioactively labelled dNTP, using fluorescently labelled ddNTPs or
dNTPs or by labelling the sequencing primer. The products are then run in four
adjacent lanes on an acrylamide gel to separate the fragments. Each lane on the gel
will show the relative positions of the dNTPs in the template, which correspond to the
specific ddNTP used in that particular reaction. Comparison of the four lanes enables
the sequence to be determined by noting in which of the four lanes the next largest
fragment is found.
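The way in which the four lanes are read can be illustrated with a simple simulation. The Python sketch below is an idealised illustration: it assumes that every possible termination product is present, measures fragment lengths as the number of bases added after the primer, and uses a hypothetical sequence and function names.

```python
def sanger_lanes(synthesised):
    """For each ddNTP reaction, list the lengths of the terminated fragments.

    `synthesised` is the strand made from the primer onwards; a fragment of
    length n ends in the lane matching the nth base of that strand.
    """
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(synthesised, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Read the sequence by noting which lane holds each successively
    larger fragment, from the smallest upwards."""
    lane_of_length = {length: base for base, lengths in lanes.items() for length in lengths}
    return "".join(lane_of_length[length] for length in sorted(lane_of_length))


lanes = sanger_lanes("GATTACA")  # hypothetical synthesised strand
print(lanes)
print(read_gel(lanes))
```

Reading the gel from the smallest fragment upwards recovers the sequence of the synthesised strand, which is complementary to the template.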
Two of the most significant developments in sequencing over the last few
years are the development of fluorescent labels, which has enabled the automation of
sequencing, and cycle sequencing. Several systems for fluorescent automated
sequencing have been developed. One system utilises four differently coloured labels
in which each colour corresponds to one of the four bases, meaning that all four
reactions can be run in the same lane, which counteracts differences between the
speed of migration which can vary across the gel.
Cycle sequencing is based on PCR although the enzyme used is not Taq but a
thermostable version of Sequenase v 2. The main advantage is that comparatively
small amounts of DNA are required for the reaction e.g. traditional sequencing
usually required 1 to 2 µg of template whereas cycle sequencing needs as little as
0.1 µg of template.
1.3. Mucins
Mucins are a major component of the visco-elastic mucus gels coating the
epithelium of a variety of tissues. They are high molecular weight glycoproteins of
which 50% to 80% is composed of carbohydrate side chains. The mucus secreted by
particular tissues is usually comprised of a number of different mucins. The physical
and chemical properties of the mucus gel are probably determined by the mucin
composition. The functions of these mucus gels are thought to include lubrication,
protection from proteolysis, maintenance of tissue hydration and to act as a barrier to
potentially harmful chemicals and organisms (Allen 1984; Rose 1992).
The analysis of mucin glycoproteins using classical biochemical techniques
has proved rather difficult due to the large size of the molecules, the relatively high
level of glycosylation and the heterogeneity of mucins. Much of the information
about the primary structure of mucins has come from peptide sequences inferred from
the sequence of cDNA clones corresponding to a number of mucin genes.
These glycoproteins are thought to be comprised of highly glycosylated
regions, that are resistant to proteolysis, and relatively unglycosylated regions which
alternate along the molecule (Sheehan et al. 1991). Analysis of the amino acid
content of these molecules showed a high proportion of threonine, serine, proline,
alanine and glycine. These regions have a high proportion of hydroxyl amino acids
such as threonine and serine which are able to form O-glycosidic linkages and may
correspond to the highly glycosylated regions of the mature protein (Van Klinken et
al. 1995). Secondly there are the so-called cysteine rich domains and it has been
suggested that some of these cysteine rich domains are involved in the polymerisation
of mucin molecules.
One area where the traditional biochemical techniques have provided a
significant amount of information is the investigation of the structure of the
carbohydrate side chains. These polysaccharides can be considered in terms of three
domains i.e. 'peripheral', 'backbone' and 'core' regions (Hounsell et al. 1982)
(Fig. 1.2).
[Figure 1.2 diagram labels: polypeptide, core region, backbone region, peripheral region.]
Figure 1.2.
Diagrammatic representation of the structure of the mucin carbohydrate side chains,
taken from Hounsell et al. (1982).
The core regions are characterised by the attachment of
N-acetylgalactosamine (GalNAc) to the oxygen of serine and threonine to form the
O-glycosidic linkages. Further elongation can occur with the addition of galactose (Gal)
and/or N-acetylglucosamine (GlcNAc) which result in four possible types of core
structure. The backbone consists of alternating Gal and GlcNAc residues. This can
be extended by the addition of Gal-GlcNAc units. These units can be divided into
two groups on the basis of the linkage between the Gal and GlcNAc i.e. type 1,
Galβ1-3GlcNAc and type 2, Galβ1-4GlcNAc.
The peripheral regions which have antigen activities analogous to the blood
group antigen H, A, B, Lewis a and Lewis b are the best characterised. The blood
group H antigen is formed by the addition of a fucose by a specific
α1-2fucosyltransferase to the terminal Gal of type 1 or 2 backbone structures or to the
Gal of the core residues. The blood group A and B antigens are formed by the
addition of GalNAc or a Gal to the H antigen. The expression of the H, A and B
antigens is regulated by the secretor gene which encodes one of two
α1-2fucosyltransferases. Approximately 75% of the population have a functional
secretor gene which means that glycoproteins in the epithelia and secretions of these
individuals will express the H, A and B antigens found on their erythrocytes; these
people are termed secretors. Those who do not possess a functional secretor gene and
thus have a low level of α1-2fucosyltransferase in epithelial cells do not express the
blood group antigens on their secreted glycoproteins. This is because the H antigen
which is required by the A and B glycosyltransferases cannot be made in these cells.
The Lewis a antigen is formed by the attachment of a fucose to the penultimate
GlcNAc residue of a type 1 backbone structure by the Lewis enzyme. The Lewis b
antigen is formed by the addition of two fucose residues to a type 2 backbone by the
H and Lewis enzymes. Other terminal modifications include the addition of sialic
acid residues.
The reason for the high level of glycosylation is not clearly understood but the
addition of these polysaccharide side chains results in extension of the molecule and
this may be important in the formation of the mucus gel matrix. Also the
glycosylation makes the molecule very hydrophilic which would obviously be vital as
mucus gels contain a large proportion of water. The diversity of these side chains
indicates that there is a possibility of interactions between micro-organisms and the
mucus gel which may play a role in colonisation of mucosae. Indeed there is some
evidence for this, for example a number of micro-organisms which include H. pylori
appear able to bind the Lewis a structure (Essery et al. 1994).
1.4. The human mucin genes
These genes were defined by partial cDNAs isolated using polyclonal and
monoclonal antibodies raised against deglycosylated mucins to screen libraries
produced from various tissues. A number of separate gene loci which encode mucin
glycoproteins have been distinguished on the basis of their chromosomal location and
pattern of tissue expression. The mucin genes have been assigned the symbol MUC
followed by a number which relates to the order in which they were cloned. These
genes are expressed at different levels in different tissues. In most cases sequencing
of these cDNAs has revealed the presence of tandem repeats of sequence. Usually
the tandem repeats correspond to a fixed number of codons which leads to repetition
of the peptide sequence. Southern blot analysis of DNA, digested with a variety of
enzymes, using mucin cDNA probes detects a high level of polymorphism. Evidence
suggests that this polymorphism is mainly due to the occurrence of variable numbers
of tandem repeats similar to those found in the non coding "minisatellite" regions of
human DNA (Jeffreys, Wilson et al. 1985). VNTR polymorphisms have so far been
described in MUC1 (Swallow et al. 1987), MUC2 (Toribara et al. 1991) and
proposed for MUC3 (Fox, Lahbib et al. 1992), MUC4 (Porchet et al. 1991), MUC6
(Toribara et al. 1993) and MUC7 (Bobek et al. 1996). In the following section which
describes the various mucin genes and their products, the genes which map to
chromosome 11 are considered together, because of their close proximity and
probable relationships.
1.4.1. Chromosome 1q21: MUC1
MUC1 is expressed in the mammary glands and many other tissues. Full
length cDNA clones have been obtained and the gene structure is known (Gendler et al. 1990;
Lan et al. 1990). Historically this protein has had a number of different names e.g. PUM,
peanut lectin binding urinary protein, (Karlsson et al. 1983), PEM, polymorphic
epithelial mucin, (Gendler et al. 1988), episialin, formerly MAM6, (Ligtenberg et al.
1990). This mucin carries a number of antigenic determinants recognised by
monoclonal antibodies raised against tumour associated antigens e.g. Ca1 to 3,
HMFG (human milk fat globule) 1 and 2 (Swallow et al. 1986), NCRC-11 (Price et
al. 1987). Like many of the other mucin genes subsequently identified MUC1 shows
a high level of variation and even before the gene had been cloned polymorphism of
the MUC1 glycoprotein had been detected using SDS polyacrylamide gel
electrophoresis and radio-iodinated lectins (Karlsson et al. 1983) or with a
number of antibodies which included Ca1 (Swallow et al. 1986).
The cloning of MUC1 and the isolation of a partial cDNA containing tandem
repeat sequence enabled the genetic basis of the polymorphism detected with Ca1 to
be determined; it was shown to be due to variation in the number of tandem repeats
(Swallow et al. 1987).
The MUC1 polypeptide, deduced from the cDNA, is composed of three
regions: an amino terminus consisting of a putative signal peptide and degenerate
tandem repeats; a tandem repeat region composed of 60bp repeat units encoding a 20
amino acid repetitive peptide rich in proline, serine and threonine with the consensus
sequence GSTAPPAHGVTSAPDTRPAP; and the carboxyl terminus consisting of
degenerate tandem repeats and a unique sequence containing a transmembrane anchor
(Ligtenberg et al. 1992).
MUC1 also has a genetic polymorphism due to a G/A substitution in exon 2
which results in different splice variants (Ligtenberg et al. 1990). The proteins
encoded by these variants have differences in the signal sequences and in the extreme
amino terminal regions of the "mature" proteins. Another polymorphism of MUC1
has been identified in the non repetitive region 3' to the tandem repeats and is the
result of variable numbers of CA repeats in intron 6 (Pratt et al. 1996). It is
interesting to note that the common alleles of all these polymorphisms are associated
which suggests that the MUC1 VNTR polymorphism is not due to the unequal
crossing over between homologous chromosomes.
The MUC1 gene has been mapped to chromosome 1q21 (Swallow et al. 1987;
Middleton-Price et al. 1988). Although the presence of a transmembrane anchor on
the MUC1 glycoprotein and its wide pattern of tissue expression distinguishes it from
mucins as originally defined, the glycoprotein is present in secretions; this probably
results from proteolytic cleavage (Hilkens et al. 1988).
Although epitopes of MUC1 glycoproteins can act as 'tumour markers' there is
now abundant evidence to show that the MUC1 gene is widely expressed in healthy
tissues. Indeed it was first detected as a normally occurring urinary component in the
early studies from this laboratory (Karlsson, Swallow et al. 1983). Nevertheless the
over-expression of MUC1 epitopes in cancer has considerable diagnostic applications
(Balague et al. 1995; Weiss et al. 1996). Indeed other changes in MUC1 expression
have been noted such as the alternative splicing of MUC1 mRNA which leads to the
loss of the tandem repeats in breast cancer, although the functional significance, if any,
is unknown (Wreschner et al. 1994). These observations have led to a search for a
role for MUC1 and attempts to understand its significance in health and disease. To
this end a MUC1 transgenic mouse strain has been developed (Peat et al. 1992). The
preliminary results from a Muc1 knockout mouse are rather curious as the mice
appear to be as healthy as those with a normally functioning Muc1 gene (Gendler et
al. 1994). This would seem to imply that Muc1 is not vital in the development of the
mouse. However if, as has been suggested, mucins are involved in defence then it
would be interesting to determine if the lack of the Muc1 glycoprotein makes them
more susceptible to damage from external agents.
1.4.2. Chromosome 11p15.5: MUC2, MUC5 and MUC6
So far four mucin genes have been localised to chromosome 11p15. The
MUC2 gene was the first to be localised to this region using somatic cell hybrids,
linkage analysis and in situ hybridisation using the cDNA SMUC41 isolated from an
intestinal library (Griffiths et al. 1990). Soon after, three clones, JER58, JER47 and
JER57, were isolated from a tracheobronchial library and were mapped to the same
chromosome band, although JER47 also hybridised to a minor BamHI fragment
which maps to chromosome 13 (Nguyen et al. 1990). Since these clones might have
come from the same gene they were provisionally given a single symbol MUC5 by
the human gene mapping nomenclature committee although more recently they have
been shown to correspond to two genes (see below). MUC6 was the mucin gene
most recently localised to chromosome 11p15 by in situ hybridisation using a
cDNA isolated from a stomach library (Toribara, Roberton et al. 1993).
1.4.2.1. MUC2
MUC2 is expressed in the intestine (Gum et al. 1989) but has also been
reported to be expressed in the bronchus (Jany et al. 1991). The full length cDNA
sequence of MUC2 has been determined (Toribara, Gum et al. 1991; Gum et al.
1994). The gene contains two sets of tandem repeats. The largest region towards the
3' end is comprised of perfect 69bp repeats which vary in number between different
alleles. The tandem repeats span a region of approximately 7kb in the common
alleles and these repeat units encode a 23 amino acid repetitive motif rich in serine,
threonine and proline with the consensus sequence
PTTTPITTTTTVTPTPTPTGTQT. The smaller 5' region consists of 48bp repeats
interrupted by 21-24bp segments and there is no evidence of variation in the number
of repeats. This sequence of "imperfect" repeats is separated from the larger tandem
repeat array by a region of unique sequence.
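The correspondence between repeat unit length and repetitive peptide length quoted above for MUC1 (60bp, 20 amino acids) and MUC2 (69bp, 23 amino acids) follows directly from the triplet code, and the same arithmetic gives a rough repeat number for the MUC2 array. The Python sketch below is illustrative only; the function names are hypothetical, and the figure of 7kb is the approximate size quoted in the text for the common alleles, so the resulting repeat number is only an estimate.

```python
CODON_BP = 3  # nucleotides per codon


def repeat_unit_residues(repeat_bp):
    """Amino acids encoded by one repeat unit, provided the unit is a
    whole number of codons so that the peptide sequence also repeats."""
    if repeat_bp % CODON_BP:
        raise ValueError("repeat unit is not a whole number of codons")
    return repeat_bp // CODON_BP


def approximate_repeat_count(array_bp, repeat_bp):
    """Rough number of repeat units in a tandem array of the given size."""
    return round(array_bp / repeat_bp)


MUC1_CONSENSUS = "GSTAPPAHGVTSAPDTRPAP"
MUC2_CONSENSUS = "PTTTPITTTTTVTPTPTPTGTQT"

print(repeat_unit_residues(60), len(MUC1_CONSENSUS))  # MUC1 repeat unit
print(repeat_unit_residues(69), len(MUC2_CONSENSUS))  # MUC2 3' repeat unit
print(approximate_repeat_count(7_000, 69))            # repeats in a ~7kb array
```

A difference of one repeat unit between two alleles would therefore change the size of a restriction fragment spanning the array by 60bp in MUC1 or 69bp in MUC2.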
The MUC2 mucin contains regions at the amino and carboxyl termini which
are cysteine rich. These cysteine rich regions are composed of repetitive elements
which show some similarity to the cysteine rich D domains in von Willebrand factor
which have been implicated in protein/protein interactions (Gum et al. 1994).
A cysteine rich region at the carboxyl terminus of both MUC2 and von Willebrand
factor has also been identified which is similar to the cystine knot region found
in the Norrie protein (Meitinger et al. 1993). It is thought that this region is able to
dimerise through the formation of intermolecular disulphide bridges (Sheehan
et al. 1991; Gum et al. 1994). However, some of the cysteine rich
domains found in mucins such as the frog integumentary mucins (FIM) have some
similarity to other protein motifs such as the P domains found in FIM-A.1 and
FIM-C.1 which are similar to the trefoil motif (Hauser et al. 1992). These domains are not
thought to be involved in the formation of intermolecular disulphide bridges but
instead it has been suggested that they may be involved in noncovalent
protein/protein interactions. Whether these structures are present in mature mucins is not
known but if they are it may be that they are involved in interactions with cell surface
receptors although what functional significance this would have is unknown.
MUC2 appears to be the predominant mucin secreted in the intestine (Tytgat
et al. 1994). The mucin seems to be synthesised as a precursor which is subsequently
glycosylated and secreted as a glycoprotein. Two dimensional electrophoresis and
pulse chase experiments indicate that the MUC2 peptide does indeed form a
disulphide bond stabilised dimer (Asker et al. 1995). However it has not been
determined whether the dimerisation is head to tail like vWF or the extent to which
further polymerisation occurs.
DNA polymorphisms of MUC2 have been detected with HinfI, Sau3A, TaqI
and Hae III (Gum et al. 1989; Griffiths et al. 1990). A simple but
variable pattern of bands was detected with HinfI and is due to VNTR polymorphism
(Griffiths, Mathews et al. 1990). The more complex pattern of bands detected with
TaqI is due to sequence polymorphisms resulting in the presence or absence of certain
TaqI sites within the large 3' tandem repeat region (Toribara, Gum et al. 1991). The
complex patterns observed with Sau3A and Hae III are reminiscent of the TaqI
polymorphism and are probably also due to polymorphic restriction sites within the
repeats.
Prior to the start of this project The Hinfl polymorphism had been analysed
for linkage and MUC2 has been shown to be linked with INS'TH^ HR AS and HBB^
which are also located on chromosome 1 Ip 15 (Griffiths, Mathews et al. 1990)
1.4.2.2. MUC5
The MUC5 locus is a region which codes for two or more tracheobronchial
mucins. A series of cDNA clones was initially identified from a tracheobronchial
lambda gt11 library; these clones mapped to 11p15.5 and were tentatively divided into
three clone families on the basis of partial sequence information. These families were
provisionally called MUC5A, B, and C. MUC5B cDNA clones are composed of
degenerate 87bp tandem repeats (Dufosse et al. 1993). This encodes a peptide rich in
serine, threonine and proline without tandem repeats which has alternating
hydrophilic and hydrophobic domains. The repetitive structure has been destroyed by
numerous insertions and deletions at the DNA level. At the outset of this project we
were supplied with the MUC5B clone, JER57, and also clones for MUC5A (JER47)
and MUC5C (JER58). We were surprised to observe that the MUC5A and C clones
recognised the same bands on Southern blots. When clones from MUC5A and C
were hybridised to DNA digested with restriction enzymes which cut within CpG
islands they detected the same set of fragments (Guyonnet-Duperat et al. 1995).
However when MUC5B was tested a different set of fragments was detected with
these enzymes. CpG islands are often associated with the 5' promoter regions of
genes (Craig and Bickmore 1994). These results indicated that MUC5B was under the
dCTP and 7.5µM dTTP), 0.5µl α-labelled dATP (Amersham), 1.75µl enzyme dilution
buffer (10mM Tris-HCl pH7.5, 5mM DTT and 0.5mg/ml) and 0.25µl Sequenase
version 2.0 enzyme (13 units/µl) per sequencing reaction.
The termination reactions are carried out in four separate tubes each of which
contains 2.5µl of one of the four termination mixtures (80µM dATP, 80µM dCTP,
80µM dGTP, 80µM dTTP, 50mM NaCl and 8µM of one of the four ddNTPs). 3µl of
the sequencing reaction was then added to the termination mixtures and incubated for
5 mins at 37°C.
For the DNA bound to the beads the reaction was stopped by separating the
termination mix from the beads and then resuspending the beads in 4µl of stop
solution (95% formamide, 20mM EDTA, 0.05% bromophenol blue and 0.05% xylene
cyanol FF). The solution was then heated to 85°C for 2 mins and the supernatant
separated from the beads and stored in fresh tubes. For DNA in solution 4µl of stop
solution is added to the termination reaction.
2.7.2. Cycle sequencing
This method is based on the protocol supplied by the manufacturers of the
Thermo Sequenase cycle sequencing kit (Amersham). The template was prepared
using Reagent Pack for use with Sequenase PCR product sequencing (Amersham)
under the conditions specified by the manufacturer.
The sequencing reaction was carried out in two steps, the labelling step and
then the chain termination reactions.
The design of the primer is important in the labelling step because only three
of the four dNTPs are used so that extension will only proceed for a few nucleotides. It
is therefore important to design the primer in such a way that at least 2 or 3
α-labelled dATPs (Amersham) will be incorporated before the extension is terminated.
Each sequencing reaction comprised 1µl of sequencing primer (0.5pmol/µl), 1µl of
PCR product prepared in the manner described above, 2µl reaction buffer (260mM
Tris-HCl pH9.5 and 65mM MgCl2), 0.25µl α-labelled dATP (10µCi/µl), 1µl of each of
two of the remaining three 3µM dNTPs and 9.25µl of water to a final volume of
17.5µl. The reaction was overlaid with 15µl of paraffin and then cycled 50 times
between 95°C for 15 secs and 60°C for 30 secs.
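As an arithmetic check on the labelling reaction described above, the listed component volumes can be tallied as in the minimal sketch below. The dictionary keys are illustrative labels only, not names taken from the kit protocol. The listed components sum to 15.5µl, so 2µl of the stated 17.5µl final volume is not itemised in the text; it may correspond to the polymerase, but that is an assumption.

```python
# Volumes (in microlitres) of the labelling-step components as listed in the text.
# The key names are illustrative labels only.
labelling_components_ul = {
    "sequencing primer (0.5 pmol/ul)": 1.0,
    "PCR product": 1.0,
    "reaction buffer": 2.0,
    "labelled dATP": 0.25,
    "first unlabelled dNTP (3 uM)": 1.0,
    "second unlabelled dNTP (3 uM)": 1.0,
    "water": 9.25,
}

# Final volume stated in the text.
STATED_FINAL_VOLUME_UL = 17.5

listed_total_ul = sum(labelling_components_ul.values())
# Volume not accounted for by the listed components.
unlisted_ul = STATED_FINAL_VOLUME_UL - listed_total_ul

print(f"listed components: {listed_total_ul} ul")
print(f"not itemised:      {unlisted_ul} ul")
```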
The termination reaction is carried out in four separate tubes each of which
contains 4µl of one of the four termination mixes, which comprise 150µM
dATP, 150µM dCTP, 150µM 7-deaza-dGTP, 150µM dTTP and 1.5µM of either
ddATP, ddCTP, ddGTP or ddTTP. 3.5µl of the labelling mix was added to the
termination mixes and the reaction overlaid with 10µl of paraffin and cycled 50 times
between 95°C for 30 secs and 60 to 72°C for 60 secs. The reaction is stopped by the
addition of 4µl of stop solution (95% formamide, 20mM EDTA, 0.05% bromophenol
blue and 0.05% xylene cyanol FF).
2.7.3. Sequencing gel
The products from both sequencing methods described above were run on
acrylamide gels using apparatus supplied by BIORAD with 50cm wedge spacers
(0.4mm to 1.2mm). Gels were prepared using a 6% acrylamide solution (19:1
acrylamide:bis-acrylamide, 7M urea and 1 x TBE) supplied by Severn Biotech, and the
acrylamide was polymerised by the addition of 1/500 volume of 1M ammonium
peroxodisulphate (APS) and 1/500 volume of N,N,N',N'-tetramethylethylenediamine
(TEMED). The samples were denatured at 85°C for 2 mins and 2 to 3µl was then loaded
into wells formed by placing a sharks tooth comb (BIORAD) with the tips of the teeth
in contact with the top of the set gel. Electrophoresis was carried out for 4 to 8 hours
at 2500V whilst the current was varied in order to maintain the gel at 50 to 55°C. The
gel was then transferred to 3MM paper covered with cling film and dried using a model
583 gel drier (BIORAD). When the gel was completely dry the cling film was removed
and the gel placed in a light tight cassette (Fuji) with a piece of photographic film
(KODAK, Biomax BMR) next to the gel. Autoradiography was carried out for between
1 day and 1 week at room temperature. The film was developed using the Compact X2
automatic developer (X-OGRAPH).
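The 1/500 volume additions of APS and TEMED used to polymerise the gel can be worked out for any gel volume as in the short sketch below. The function name and the 100ml gel volume are hypothetical; the text does not state how much acrylamide solution was poured.

```python
def polymerisation_volumes_ul(gel_volume_ml: float, divisor: int = 500) -> dict:
    """Return the APS and TEMED volumes (in microlitres) for a given gel volume.

    Each reagent is added at 1/divisor of the gel volume (1/500 in the text).
    """
    volume_ul = gel_volume_ml * 1000 / divisor
    return {"APS (1M)": volume_ul, "TEMED": volume_ul}


# Hypothetical gel volume, for illustration only.
volumes = polymerisation_volumes_ul(gel_volume_ml=100)
print(volumes)
```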
2.8. Fluorescent in situ hybridisation (FISH)
The initial characterisation of YAC and cosmid clones isolated during the
course of this project was carried out using FISH to metaphase chromosomes
conducted as described previously (Pinkel et al. 1986; Gharib et al. 1993).
2.8.1. Stock solutions
Iscoves: Iscoves medium (Sigma), 1x glutamine (Gibco-Life Technologies), 1x penicillin
(Gibco-Life Technologies)
Proteinase K buffer (10x): 200mM Tris-HCl pH7.4, 20mM CaCl2.
SSPE (20x): 3M NaCl, 200mM NaH2PO4 hydrate, 20mM EDTA adjusted to pH7.4 with NaOH.
Antifade: 1ml Vectashield antifade (Vector Labs), 1µl of 10mg/ml propidium iodide
(PI, Sigma), 10µl of 0.2mg/µl 4,6-diamidino-2-phenylindole (DAPI, Sigma)
2.8.2. Preparation of cells from blood
A culture was set up which comprised 16mls Iscoves, 2mls fetal calf serum
(FCS), 0.3ml phytohaemagglutinin (Gibco-Life Technologies) and 1ml whole blood, and
was incubated for 72 hours at 37°C in 5% CO2 in a moist atmosphere. Then 200µl of
30mg/ml thymidine was added and the culture incubated for a further 17 hours. The
thymidine 'block' was removed by pelleting the cells at 179 x g for 5 mins and
removing all but 0.5ml of the supernatant. The cells were then resuspended in the
0.5mls remaining and 5mls of Iscoves, 10% FCS. The cells were again pelleted and
the supernatant discarded. The cells were resuspended in 5mls of Iscoves, 10% FCS
and then 50µl of 1mg/ml 5-bromo deoxyuridine (BrDU) was added; this was then
incubated for 4 hours and 35 mins. 25 mins before harvesting of the cells 50µl of
10µg/ml colcemid (Gibco-Life Technologies) was added. The cells were then
pelleted and the supernatant discarded and the cells resuspended in 8mls of
prewarmed 75mM KCl and incubated for 8 mins. The cells were again pelleted and
7.5mls of supernatant removed. The cells were resuspended in the remaining 0.5mls,
then a fix solution (3:1 methanol:acetic acid) was added and the solution left at 4°C
for 30 mins. The fix solution was changed until there was no brown tinge to the cell
suspension and then left overnight at 4°C.
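The pelleting step above is quoted as a relative centrifugal force (179 x g) rather than a rotor speed. The sketch below converts one to the other using the standard relation RCF = 1.118 x 10^-5 x r x rpm^2, with r in centimetres. The rotor radius of 16cm is a hypothetical value chosen for illustration; the centrifuge and rotor used are not named in the text.

```python
import math


def rpm_for_rcf(rcf_g: float, rotor_radius_cm: float) -> float:
    """Rotor speed (rpm) that gives the requested relative centrifugal force.

    Rearranges the standard relation RCF = 1.118e-5 * r_cm * rpm**2.
    """
    return math.sqrt(rcf_g / (1.118e-5 * rotor_radius_cm))


# 179 x g is the force given in the text; the 16 cm radius is an assumed rotor.
speed = rpm_for_rcf(179, rotor_radius_cm=16)
print(round(speed))
```

With the assumed 16cm radius this works out at roughly 1000 rpm; a different rotor would need a different speed to reach the same force.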
2.8.3. Slide preparation
The slides were prepared by cleaning with methanol to which a few drops of
concentrated HCl had been added. The cells were pelleted once again and the
supernatant discarded. Enough fix solution was then added to produce a 'cloudy' cell
suspension. The slide was then removed from the methanol wash and wiped with a
lint free cloth so that it was still damp. Then using a Pasteur pipette a single drop of
the cell suspension was allowed to fall from a height of 30 to 50 cm onto the slide
held at an angle of approximately 30 degrees. The slide was then dried using a fan
and when dry flooded with 1ml of 70% acetic acid and left for a few seconds after
which the acetic acid was poured off and the slide left to dry. The slides were then
dehydrated in an ethanol series consisting of 70%, 90% and then 100% for 3 mins
each. The slides could then be stored at -20°C until required.
2.8.4. Prehybridisation
The cells on the slides were treated with 200µl of a solution which comprised
100µg/ml RNase (Sigma) in 2 x SSC pH7.0 under a cover slip and incubated in a
moist atmosphere at 37°C for 1 hour. The cover slips were discarded and the slides
washed four times in 2 x SSC in a coplin jar followed by dehydration in an ethanol
series consisting of 70% ethanol for 3 mins, 90% for 3 mins, 100% for 5 mins and then
left to air dry. A coplin jar containing 50mls of 1x proteinase K buffer was
prewarmed to 37°C and the slides were incubated in this solution for 10 mins. The
slides were transferred to the proteinase K solution comprised of 0.035µg/ml of
proteinase K (Boehringer Mannheim) in 1x proteinase K buffer and incubated for 7
mins at 37°C. The slides were washed for 5 mins in PBS and given a postfix
treatment of 0.05M MgCl2.6H2O, 1% formaldehyde in PBS for 10 mins. Another
wash in PBS for 5 mins was followed by dehydration in an ethanol series. The slides
were then denatured with 100µl of 70% formamide in 2 x SSC under a coverslip at
75°C for 5 mins. The coverslip was removed and the slides placed in ice cold 70%
ethanol for 3 mins and then passed through 90% and 100% ethanol for 3 mins each
and finally left to air dry.
2.8.5. Probe preparation using competition with Cot-1 DNA and
hybridisation
1µg of whole clone was biotinylated by nick translation using the kit supplied
by BRL. The probe was purified using a G-25 medium grade Sephadex column
(Pharmacia) and eluted in a volume of 1ml. For each hybridisation 200ng of probe
was combined with 10µg of Cot-1 DNA (1mg/ml), 50µg herring sperm DNA
(10mg/ml), 1/10 volume of 3M NH4 acetate and two volumes of 100% ethanol; this was
then incubated at -70°C for 30 mins. The DNA was pelleted by spinning in a
microcentrifuge at 13 000 rpm for 5 mins and the pellet freeze dried for 10 to 15
mins. The pellet was then resuspended in 10µl of 50% formamide, 10% dextran
sulphate in 2 x SSPE pH7.0. The probe was denatured at 75°C for 5 mins followed
by incubation at 37°C for 30 mins to preanneal repetitive components in the probe;
the preannealing was stopped by placing the probe on ice. This hybridisation mix
was then placed on the slide and covered with a circular coverslip and the edges
sealed with cow gum and incubated overnight at 37°C in a sealed moist environment.
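The volumes implied by the precipitation step above can be worked through as in the sketch below. Two points are assumptions and not statements from the text: that all 1µg of labelled probe is recovered in the 1ml eluate (giving 1ng/µl), and that the "two volumes" of ethanol are measured against the DNA mix after the ammonium acetate has been added. All variable names are illustrative.

```python
# Quantities taken from the text.
PROBE_NG = 200           # probe per hybridisation
COT1_UG = 10             # Cot-1 DNA per hybridisation
HERRING_UG = 50          # herring sperm DNA per hybridisation
COT1_UG_PER_UL = 1       # 1 mg/ml stock is 1 ug/ul
HERRING_UG_PER_UL = 10   # 10 mg/ml stock is 10 ug/ul

# Assumption: all 1 ug (1000 ng) of labelled probe is recovered in 1 ml (1000 ul).
probe_ng_per_ul = 1000 / 1000

probe_ul = PROBE_NG / probe_ng_per_ul
cot1_ul = COT1_UG / COT1_UG_PER_UL
herring_ul = HERRING_UG / HERRING_UG_PER_UL
dna_mix_ul = probe_ul + cot1_ul + herring_ul

# 1/10 volume of 3M ammonium acetate, relative to the DNA mix.
acetate_ul = dna_mix_ul / 10
# Assumption: two volumes of ethanol relative to DNA mix plus acetate.
ethanol_ul = 2 * (dna_mix_ul + acetate_ul)
total_ul = dna_mix_ul + acetate_ul + ethanol_ul

print(f"DNA mix {dna_mix_ul} ul, acetate {acetate_ul} ul, "
      f"ethanol {ethanol_ul} ul, total {total_ul} ul")
```

Under these assumptions the whole precipitation fits comfortably in a standard 1.5ml microcentrifuge tube.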
2.8.6. Post hybridisation washes
The cover slip was discarded and the slides washed three times in a solution of
50% formamide in 2 x SSC at 42°C for 5 mins each wash. The slides were then
washed five times in 2 x SSC at 42°C for 2 mins each wash. If the probe was a
cosmid then a more stringent wash was used which comprised three washes in 50%
formamide in 2 x SSC at 45°C for 5 mins each wash followed by two washes at 45°C
in 2 x SSC for 2.5 mins each wash and finally two washes at 60°C in 0.1 x SSC for
2.5 mins each wash.
2.8.7. Signal detection
The slides were washed in 0.05% Tween 20 (SIGMA) in 4 x SSC for 5 mins.
Preincubation of the slides was carried out in 5% milk powder (Marvel) in 4 x SSC
for 20 mins. The slides were then incubated with 100µl of 5µg/ml avidin-FITC
(Vector labs), 5% Marvel in 4 x SSC under a coverslip for 20 mins; the slides were
protected from the light for all further steps. The coverslip was discarded and the
slides washed three times in 0.05% Tween 20 in 4 x SSC for 5 mins each wash. The
slides were then incubated with 100µl of 5µg biotinylated anti-avidin (Vector labs),
5% Marvel in 4 x SSC for 20 mins. The slides were again washed with Tween 20 in
4 x SSC. The slides were then incubated with the avidin-FITC mix once again for a
further 20 mins which was followed by two washes with PBS for 5 mins each wash.
The slides were finally dehydrated with an ethanol series and 15µl of antifade
solution under a coverslip placed over the chromosomes.
The propidium iodide (PI) and diamidinophenylindole (DAPI) in the antifade
solution counterstained the chromosomes to produce R-banding and the images were
collected by confocal laser microscopy (BIORAD MRC 600).
2.9. Computer resources
Linkage analysis was carried out on marker data generated in this lab and
marker data from a copy of the CEPH database stored on the hard disc of a DEC
Station 5000/25 (Digital) in this laboratory. The analysis was conducted using the
CRI-MAP (Donis-Keller, Green et al. 1987) package running on a SPARC station 10
(SUN Microsystems) in this laboratory. Nucleic acid and protein sequences were
analysed using various computer packages, such as the GCG suite, at the UK HGMP
resource centre (Cambridge) available via the internet.
3. The mucin gene family on chromosome
11p15.5: results and discussion
The main emphasis of this work was the genetic mapping of the family of
mucin genes on chromosome 11p15. This work was done in collaboration with
Wendy Pratt who probed many of the Southern blot filters. Whilst this work was in
progress our collaborators in Lille undertook the task of producing a physical map
using PFGE.
3.1. Families analysed
All the families used in this study were from the CEPH series. Southern blots
of the CEPH family DNA samples digested with various enzymes were prepared and
provided by the EU funded EUROGEM consortium. The CEPH families are
comprised of a number of distinct populations located in Utah and France and one
family from Venezuela and an Amish family. The initial MUC2 data used in this
study was provided by D. Matthews (MRC HBGU) and had previously been
submitted to CEPH (and was thus on version 7.0 of the database). Some of the
families used to obtain the original MUC2 data were not included in the
EUROGEM panel of CEPH families and not all of the EUROGEM families were
tested at the time.
3.2. Search for and analysis of polymorphisms of the mucin
genes on chromosome 11p15.5
Each of the genes was analysed using probes corresponding to the main
tandem repeat region of that gene.
Polymorphisms of MUC2 and MUC6 had previously been described (Gum,
Byrd et al. 1989; Griffiths, Mathews et al. 1990; Toribara, Roberton et al. 1993).
Both genes show evidence of VNTR polymorphism described in sections 1.4.2.1 and
1.4.2.3. The original MUC2 data was obtained by probing Southern blots of DNA
from 37 CEPH families digested with HinfI with SMUC41 (MUC2). HinfI
was the enzyme of choice for the analysis of MUC2 because most other enzymes
showed more complicated patterns which are harder to interpret, as described in
section 1.4.2.1. While this work was in progress the EUROGEM filters of the CEPH
family DNA samples digested with HinfI became available and were probed with
SMUC41 (Fig. 3. 1). This was done to fill in gaps in the data and improve the
informativeness since the resolution of the fragments on the EUROGEM filters is
better than on the original Southern blots.
The original paper on MUC6 described a polymorphism with TaqI. However
in this study Southern blots of CEPH family DNA digested with PvuII became
available first and thus these were tested with the MUC6 probe. This revealed a
variable allele length polymorphism presumably due to the reported VNTR
polymorphism (Fig. 3. 2). This interpretation of the polymorphism was subsequently
supported by comparison of the pattern of relative mobilities detected with PvuII and
TaqI (Fig 3. 2). HinfI was not a suitable restriction enzyme for this analysis as a cut
site for this enzyme is present within each of the tandem repeats.
Figure 3. 1.
Autoradiograph of a Southern blot of DNA from CEPH family 884 digested with Hinf I
and probed with SMUC41 (MUC2). Lanes: FF, MF, F, C1, C2, C3, C4, C5, C6, C7, C8, C9,
C10, C11, M, FM, MM. The sizes of the two alleles shown are 6.5 and 6.95 kilobases.
Key: FF=father of the father, MF=mother of the father, F=father, C1, C2, C3,
etc.=children, M=mother, FM=father of the mother and MM=mother of the mother.
Figure 3. 2.
Autoradiographs of two Southern blots of DNA from CEPH family 1416 digested
with Pvu II and Taq I probed with MUC6. Lanes: FF, MF, F, C1, C2, C3, S, C4, C5, C7,
C6, C8, C9, C10, M, FM, MM. Size markers (kb): 18.5, 15.0, 9.0, 7.4. Key: S=size
marker lane and the sizes in kilobases are shown on the right hand side, FF=father of
the father, MF=mother of the father, F=father, C1, C2, C3, etc.=children, M=mother,
FM=father of the mother and MM=mother of the mother. A 2 kb fragment is also
detected with TaqI (data not shown) and in some individuals a 2.2 kb fragment is seen.
The search for polymorphism in the genes MUC5B and MUC5AC was split
between our collaborators in Lille and this lab respectively. A number of enzymes
were tested in an attempt to identify polymorphism of MUC5AC, which included
ScaI, EcoRI, TaqI, MspI and PvuII. Large invariant bands of 20 to 30kb were detected
with EcoRI and ScaI (Fig. 3. 11). TaqI, MspI and PvuII produced complex patterns
with considerable person to person variation which are reminiscent of the MUC2 TaqI
polymorphism previously described in section 1.4.2.1 and are probably not due to
straightforward VNTR (Fig. 3. 3). Although HinfI and PstI showed simpler patterns
comprising one or two bands in each individual (Fig. 3. 4), PvuII was used to type
the CEPH families because it was more informative. The pattern of fragments
detected with PvuII consists of two sets of variable fragments, which do not appear to
be obviously associated, together with a number of constant smaller fragments (Fig 3. 3).
The variable fragments were treated as separate polymorphisms for ease of analysis.
Two-point linkage analysis using the 'twopoint' option of CRI-MAP showed that the
two polymorphic zones are tightly linked with a LOD score of 49.07 at θ = 0.
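To give a feel for the size of a LOD score of this magnitude, the sketch below evaluates the textbook two-point LOD for the simplest case of phase-known, fully informative meioses. This is a simplification for illustration and is not how CRI-MAP computes its likelihoods, which are obtained across whole pedigrees where phase is often unknown; the function names and the figure of 163 meioses are hypothetical.

```python
import math


def _count_times_log10(count: int, p: float) -> float:
    """Return count * log10(p), using the convention 0 * log10(0) = 0."""
    if count == 0:
        return 0.0
    if p == 0.0:
        return float("-inf")
    return count * math.log10(p)


def two_point_lod(n_meioses: int, n_recombinants: int, theta: float) -> float:
    """LOD score for phase-known, fully informative meioses.

    LOD(theta) = log10(L(theta) / L(0.5)), where
    L(theta) = theta**k * (1 - theta)**(n - k) for k recombinants in n meioses.
    """
    k = n_recombinants
    n = n_meioses
    log_l_theta = _count_times_log10(k, theta) + _count_times_log10(n - k, 1.0 - theta)
    log_l_null = n * math.log10(0.5)
    return log_l_theta - log_l_null


# Hypothetical case: 163 informative meioses with no recombinants, scored at theta = 0.
print(round(two_point_lod(163, 0, 0.0), 2))
```

Under this simplified model each non-recombinant meiosis contributes log10(2), about 0.30, at θ = 0, so a score near 49 corresponds to the order of 160 such meioses. The number of meioses actually contributing to the reported score is not given in the text.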
Figure 3. 3.
Autoradiographs of three Southern blots of DNA from CEPH family 1424 digested
with Pvu II, Msp I and Taq I probed with JER58 (MUC5AC). Lanes in each panel: FF,
MF, F, C1, C2, C3. Size markers (kb): 7.4, 5.6, 2.9, 2.3, 1.4, 1.3. The variable alleles
detected with Pvu II range in size from 6.5 to 7.5 kilobases (upper set) and 2.3 to 2.5
kilobases (lower set), whereas with Msp I they range from 2.7 to 3.3 kilobases (upper
set) and 1.3 to 1.4 kilobases (lower set). Key: FF=father of the father, MF=mother of
the father, F=father, C1, C2, C3, etc.=children, M=mother, FM=father of the mother
and MM=mother of the mother.
Figure 3. 4.
Autoradiographs of two Southern blots of DNA from CEPH family 1424 digested
with Pst I and Hinf I probed with JER58. Lanes: FF, MF, F, C1, C2, C3, C4, C5, C6, C7,
C8, M, FM, MM. The sizes of the two variant alleles detected
with Hinf I are 6.9 and 7.5 kilobases. Key: FF=father of the father, MF=mother of
the father, F=father, C1, C2, C3, etc.=children, M=mother, FM=father of the mother
and MM=mother of the mother.
Our collaborators in Lille have identified a number of polymorphisms for
MUC5B. These include PstI, TaqI and BglII (P. Pigny et al. personal
communication), but in each case the fragment sizes are rather small, and the
heterozygosities rather low. For the purpose of this study the PstI polymorphism was
selected for testing because it was the most suitable with respect to the fragment
sizes and availability of filters.
The frequencies of the different length alleles of both MUC2 and MUC6 were
determined for the EUROGEM series of grandparents, and parents when
grandparents were unavailable (Fig. 3. 5). The distribution observed for MUC6
appears to be unimodal (Fig. 3. 5 A) and that for MUC2 possibly bimodal, although the
peak for the smaller sizes is considerably smaller than the main peak (Fig. 3. 5 B).
Figure 3. 5.
Two histograms showing the allele size distributions of MUC2 (A) and MUC6 (B).
The y axis shows the number of alleles which fall into the arbitrary size range of the
categories which span 0.5 kilobases on the x axis (allele size/kb). In each histogram
the data are subdivided into Utah, France and others. The x axis for MUC6 runs from
8.5 to 13.5 kilobases.
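The grouping of allele sizes into 0.5 kilobase categories used for these histograms can be sketched as follows. The function name and the example sizes are hypothetical and are not the data plotted in the figure.

```python
import math
from collections import Counter


def bin_allele_sizes(sizes_kb, bin_width_kb=0.5):
    """Count alleles per size category; each key is the lower edge of a bin in kb."""
    counts = Counter()
    for size in sizes_kb:
        lower_edge = math.floor(size / bin_width_kb) * bin_width_kb
        counts[lower_edge] += 1
    # Return the categories in ascending size order.
    return dict(sorted(counts.items()))


# Hypothetical allele sizes in kilobases, for illustration only.
example_sizes_kb = [6.5, 6.95, 6.7, 11.1, 7.2]
print(bin_allele_sizes(example_sizes_kb))
```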
Heterozygosities were calculated for MUC2, MUC6 and MUC5AC by
dividing the number of observed heterozygous individuals by the total number of
individuals. MUC2 has a heterozygosity of 0.64 and MUC6 of 0.70; a value of 0.60 was
obtained for the larger set of polymorphic fragments of MUC5AC and 0.36 for the
smaller set. It is interesting to note that no new mutations were detected in the 40
CEPH EUROGEM families for MUC6 or MUC5AC, whereas 3 had been previously
observed in families 1333, 1331 and 1413 probed with SMUC41 (MUC2) (D.
Matthews unpublished). Two of these MUC2 mutations were clearly evident in the
new analysis (Fig. 3. 6). In the case of family 1333 a large mutant allele can be seen
in the mother which is not present in either of the grandparents. The mutant band is
approximately twice the size of the grandparental alleles i.e. 11.1kb compared with
6.7kb. This may indicate a duplication of the tandem repeat region in this family member.
The mutation in child C11 of family 1331 appears to be the lack of a paternal allele.
The most likely explanation is that the paternal allele has lost tandem repeats and is
either the same or nearly the same size as the allele inherited from the mother. The
third mutation originally detected, consisting of a faint extra band, was not seen in
this analysis. One possibility is that there was a population of cells which contained
the mutant allele which was not represented in the DNA sample used for the
EUROGEM Southern blot, or that there was contamination of the original sample.
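The heterozygosity calculation described at the start of this paragraph, observed heterozygotes divided by the total number of individuals, can be sketched as below. The function name and the genotypes are hypothetical and are given for illustration only.

```python
def observed_heterozygosity(genotypes):
    """Fraction of individuals whose two alleles differ."""
    if not genotypes:
        raise ValueError("at least one genotype is required")
    heterozygous = sum(1 for allele_a, allele_b in genotypes if allele_a != allele_b)
    return heterozygous / len(genotypes)


# Hypothetical genotypes, each a pair of allele sizes in kilobases.
example_genotypes = [
    (6.5, 6.95),
    (6.5, 6.5),
    (6.7, 6.95),
    (6.95, 6.95),
    (6.5, 6.7),
]
print(observed_heterozygosity(example_genotypes))
```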
Figure 3. 6.
Autoradiographs of two Southern blots of DNA from CEPH families 1331 and 1333
digested with Hinf I and probed with SMUC41 (MUC2). Lanes for family 1333: FF, MF,
F, C1, C2, C3, C4, C5, C6, C7, C8, C9, M, FM, MM, S. Lanes for family 1331: FF, MF, F,
C1, C2, C3, C4, C5, C6, C7, C8, C9, C10, C11, M, FM, MM. Size markers (kb): 18.5,
15.0, 9.0, 7.4, 5.6. The large mutant allele can
clearly be seen in the mother (M) of family 1333 and has been inherited by children
C1, C2, C5, C7 and C9. The mutation in child C11 of family 1331 can be seen as the
apparent lack of a paternal allele. Key: S=size marker lane and sizes are in kilobases,
FF=father of the father, MF=mother of the father, F=father, C1, C2, C3,
etc.=children, M=mother, FM=father of the mother and MM=mother of the mother.
3.3. Linkage analysis
Initially MUC5AC and MUC6 were analysed together with other markers on
chromosome 11 from the CEPH database version 7.0, which included the MUC2 data
generated previously in the MRC HBGU. Two point lod scores for each of the MUC
genes with all the other markers on this version of the database were obtained using
the 'twopoint' option of the CRI-MAP computer program (Appendix I). All the
markers which had a lod score of greater than 3 with MUC5AC, MUC6 or MUC2
were then used to generate a genetic map of chromosome 11 using the 'build' option
of CRI-MAP. This map contained both MUC6 and MUC2 and was supported at odds
of greater than 1000 to 1 when adjacent groups of 5 markers in the map were
permuted using the 'flips 5' option of CRI-MAP. Using the 'chrompic' option of CRI-
MAP all the putative recombinant chromosomes could be identified. A single
recombination between MUC6 and MUC2 in individual 1424-03 was identified
which had enabled CRI-MAP to orientate these genes with respect to the other
markers in the map i.e. MUC6 goes with HRAS (towards the telomere) and MUC2
goes with D11S1000 and the other more centromeric markers. MUC5AC, however,
was not informative for this family and could not be unambiguously inserted into this
map but was shown to be in the same region as MUC6 and MUC2. In an attempt to
make MUC5AC informative for this family a number of other enzymes were tested:
TaqI, MspI, HinfI, Hae III, PstI and EcoRI. Although these enzymes all detected
polymorphism with the JER58 tandem repeat probe the critical parent was
homozygous in each case and was therefore not informative. Two enzymes, PstI and
HinfI, appeared to detect the same polymorphism, suggesting that both enzymes were
detecting the same VNTR variation (Fig. 3. 4).
The 'chrompic' option of CRI-MAP was used to identify all the chromosomes
with apparent recombinations in this region. A total of 24 recombinations in 20
families were identified (Appendix II). Because of the existence of errors within the
CEPH database and the incomplete nature of some of the data the families which
showed these recombinations were tested further in an attempt to provide support and
increase precision for this region of the map. To this end the EUROGEM filters were
reprobed with pEJ6.6 (HRAS) (Fig. 3. 7). The families were also tested for some
additional markers. D11S150, detected with probe 2.1 (Brookes et al. 1989), was
selected because it had been localised to this region by PFGE and D11S2071 was
used because it is the most telomeric marker reported (Redeker et al. 1994) (Fig. 3. 8
and 3. 9).
There were clearly problems with the original HRAS data in two families
(1413 and 23) and the new results did not support recombinations in these families.
Of the remaining recombinant chromosomes 11 were well supported with at least two
informative markers on either side of the breakpoint. The results for D11S2071
agreed with HRAS in every case where both loci were informative and supported the
recombination between HRAS and MUC6 in individual 1413-03. D11S150, where
informative, segregates with MUC2 (Fig. 3. 10).
Figure 3.7.
Autoradiograph of a Southern blot of DNA from CEPH family 1413 digested with
Msp I and probed with pEJ6.6 (HRAS). Lanes: MF, F, C1, C2, C3, C4, C5, C6, C7, C8,
C9, C10, C11, C12, C13, C14, C15, M, MM. The variant alleles range in size from 1.15
to 2.6 kilobases. Key: FF=father of the father, MF=mother of the father, F=father, C1,
C2, C3, etc.=children, M=mother, FM=father of the mother and MM=mother of the
mother.
Figure 3.8.
Autoradiograph of a Southern blot of DNA from CEPH family 1413 digested with Pst
I and probed with probe 2.1 (D11S150). Lanes: MF, F, C6, C7, C8, C9, C10, C11, C12,
C13, C14, C15, M, MM. The variant alleles range in size from 1.8 to
7.4 kilobases. Key: FF=father of the father, MF=mother of the father, F=father, C1,
C2, C3, etc.=children, M=mother, FM=father of the mother and MM=mother of the
mother.
Figure 3. 9.
An example of the results obtained with the ALP system showing the electrophoretic
analysis of the D11S2071 microsatellite using DNA samples from members of CEPH
family 1424. An arbitrary phenotype for the members of the family can be deduced by
comparison of the relative positions of the major peaks. The deduced phenotypes are
shown in brackets below the family member symbol. Key: FF=father of the father,
MF=mother of the father, F=father, C1, C2, C3, etc.=children, M=mother,
FM=father of the mother and MM=mother of the mother.
Figure 3. 10.
A diagrammatic representation of the eleven most informative meiotic breakpoints in
the region of chromosome 11p15. Each recombination is supported by at least two
informative markers either side of the breakpoint. The genes are shown in order and the
parental and grandparental origin of each chromosomal region indicated as is the CEPH
family number and individual number. The possible positions of MUC5AC and
D11S150 are shown and the individual results for these markers are given below the
main diagram.
Marker order shown: D11S2071, HRAS, MUC6, MUC2, D11S1000, INS/TH, D11S1318,
D11S868, D11S454, HBB; intermarker distances are given in cM on the diagram.
Individuals shown: 133211, 141303, 142403, 141809, 134906, 134907, 133212, 134103,
133111, 10209, 1329106.
Key: P = paternal chromosome, M = maternal chromosome; further symbols mark the
grandmaternal chromosome, the grandpaternal chromosome and uninformative family or
missing data.
These results were combined with data from the CEPH database version 7.1.
Haplotypes were constructed by inspection of the individuals with the recombinant
chromosomes using the order generated by CRI-MAP for the markers i.e. from
telomere to centromere HRAS, MUC6, MUC2, D11S1000, INS, TH, D11S1318,
D11S868 and D11S454 (Fig. 3. 10). D11S2071 and HBB were added to this order to
provide support for the most telomeric and centromeric breakpoints respectively (Fig.
3. 10). It should be noted that each breakpoint is supported by at least two
informative markers on either side. The results for MUC5AC and probe 2.1 are
shown underneath the recombinant chromosomes of the critical individuals (Fig. 3.
10). It was not possible to insert them unambiguously into the map and their most
likely position is indicated by the vertical bars on the left hand side of the diagram
(Fig. 3. 10). An attempt was also made to insert MUC5B into the map. However
none of the recombinant families were informative for this gene. The revised data
supports the order originally generated by CRI-MAP and the additional MUC2 data
reveals another recombination between MUC6 and MUC2 in individual 1418-09.
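The criterion applied above, that a breakpoint counts as well supported only when at least two informative markers lie on either side of it, can be expressed as a short routine. This is a minimal sketch and not the procedure used by the 'chrompic' option of CRI-MAP; the function name, the "GM"/"GP" labels for grandmaternal and grandpaternal origin and the example chromosomes are all hypothetical.

```python
from typing import List, Optional, Tuple

# (marker name, grandparental origin); None marks an uninformative marker.
Call = Tuple[str, Optional[str]]


def supported_breakpoints(calls: List[Call], min_support: int = 2):
    """List breakpoints along an ordered chromosome.

    Each result is (marker before, marker after, supported), where supported is
    True when at least min_support consecutive informative markers of the same
    origin lie on both sides of the switch.
    """
    informative = [(marker, origin) for marker, origin in calls if origin is not None]
    results = []
    for i in range(1, len(informative)):
        if informative[i][1] == informative[i - 1][1]:
            continue  # same grandparental origin, no recombination here
        # Count the run of matching markers immediately to the left.
        left = 1
        j = i - 2
        while j >= 0 and informative[j][1] == informative[i - 1][1]:
            left += 1
            j -= 1
        # Count the run of matching markers immediately to the right.
        right = 1
        j = i + 1
        while j < len(informative) and informative[j][1] == informative[i][1]:
            right += 1
            j += 1
        supported = left >= min_support and right >= min_support
        results.append((informative[i - 1][0], informative[i][0], supported))
    return results


# Hypothetical chromosome, markers ordered from telomere to centromere.
example_chromosome = [
    ("D11S2071", "GM"), ("HRAS", "GM"), ("MUC6", "GM"),
    ("MUC2", "GP"), ("D11S1000", "GP"), ("INS", None), ("TH", "GP"),
]
print(supported_breakpoints(example_chromosome))
```

In the hypothetical example the switch of origin falls between MUC6 and MUC2 and is flagged as supported because three informative markers agree on each side; the uninformative INS result is skipped.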
3.4. Characterisation of a putative C terminal MUC5AC clone.
This work was done in collaboration with Theda Lesuffleur who isolated and
sequenced the partial cDNA L31 (Lesuffleur et al. 1995). The clone was isolated
from a HT29 MTX (mucus secreting cell line) expression library using polyclonal
serum raised against normal gastric mucus. The DNA sequence of this clone showed
a high level of similarity (98.6%) to the NP3a clone which had previously been
reported as the 3' end of ‘MUC5’ (Meerzaman et al. 1994). Interestingly less
similarity was observed between the predicted peptides due to changes in reading
frame. The clone L31 was also localised to chromosome 11p15 using FISH by
Margaret Fox in this lab. The clone was thus located to the same region as the cluster
of mucin genes containing MUC5AC, MUC5B, MUC2 and MUC6 described in
section 1.4.2. The expression pattern of L31 was similar to MUC5AC when
compared with MUC2, MUC5B and MUC6 on northern blots of a variety of tissues
(Lesuffleur, Roche et al. 1995). These results suggested that the L31 clone may be
part of the non tandem repeat sequence of MUC5AC.
Thus I attempted to use Southern blot analysis of human genomic DNA to
pursue this further. Southern blots of DNA from 4 individuals digested with Seal and
EcoRI probed with both L31 and JER58 (MUC5AC). These enzymes were chosen
because they did not cut within the L31 sequence and because JER58 detects large
single fragments with these enzymes. L31 detected a single Seal fragment of 9.5kb
and when the same filter was probed with JER58 a single fragment of greater than
18kb was detected (Fig. 3. 11). However on DNA digested with EcoRI a single
fragment of approximately 20kb was detected with both L31 and JER58 (Fig. 3. 11)
The Lille group had detected evidence of polymorphism with JER58 using
DNA digested with HindIII and XbaI (Pigny et al. 1995). Both enzymes detect two
large alleles in some individuals which were not detected in the samples we tested,
run under the standard electrophoretic conditions used initially in the laboratory or by
EUROGEM. The Lille group supplied us with DNA from two individuals who were
heterozygous when probed with JER58 for both HindIII and XbaI to use as a control.
A DNA sample from one of these individuals digested with HindIII was run using the
phosphate buffer system recommended by the Lille group together with DNA from
individuals of unknown genotype and Southern blots of these gels were probed with
L31 and then JER58. Unfortunately the separation was not as good as that achieved
by the group in Lille and L31 is a poor probe, although it did seem that L31 detected
the same two alleles as JER58 (results not shown). These results were not conclusive
and further experiments are needed.
The results from Southern blots probed with JER58 and L31 indicate that
they are physically very close (18-30kb). These results together with the evidence of
the expression studies suggest that L31 is part of the MUC5AC gene and may
correspond to the 3' end as indicated by the presence of a poly A tail in the cDNA
clone.
[Figure 3.11: autoradiograph. Panels: EcoR I and Sca I digests, each probed with L31 and JER58. Size markers (kb): 48.5, 18.5, 15.0, 9.0, 7.4.]
Figure 3.11.
Autoradiograph of a Southern blot of four individual human genomic DNA samples
digested with EcoR I and four with Sca I probed with L31 and JER58 (MUC5AC)
cDNAs. Marker tracks are labelled S and sizes are in kilobases.
3.5. Discussion
The linkage analysis described in this section resulted in the construction of a
genetic map for chromosome 11p15 and the identification of a panel of recombinant
individuals (Fig. 3. 10). This recombinant panel will be useful in the mapping of
other markers in this region by testing the most informative families, as was done in
the case of D11S150 which appears to map with MUC2 (Fig. 3. 10).
While this work was in progress collaborators in Lille produced a physical
map using PFGE (Fig. 3. 12). They showed that all four mucin genes and D11S150
lie in a region of approximately 400kb. MUC5AC and MUC5B have been localised
to a 220kb Swa I fragment and D11S150 appears to lie between MUC6 and MUC2
on the same 180kb Swa I fragment. Very recent sequence data indicates that
D11S150 is located in one of the introns of MUC2 (Pratt unpublished). The genetic
map presented here is in agreement with the order of the genes deduced from the
PFGE data i.e. MUC6 is at one end of the cluster followed by D11S150, MUC2,
MUC5AC and then MUC5B, although the MUC5 genes could not be placed
unambiguously in the linkage map. Evidence for the orientation of the gene cluster
with respect to other genes on chromosome 11 and thus the telomere and centromere
came from a few large PFGE fragments i.e. HRAS localised to the same 750kb Cla I
fragment as MUC6 and MUC2. This agrees with the orientation of MUC6 and
MUC2 in the linkage map, in which MUC6 goes with HRAS (towards the telomere)
and MUC2 goes with genes which lie towards the centromere.
Figure 3. 12.
A diagrammatic representation of the map of the mucin genes in the region of
chromosome 11p15.5 as determined by PFGE (adapted from Pigny 1996).
[Figure 3.12: PFGE restriction map. Loci shown: HRAS, IGF2, MUC6, D11S150, MUC2, MUC5AC and MUC5B. Restriction sites shown for Mlu I, Sac II, BssH II, Pac I, Not I, Swa I and Cla I. Distances labelled: 500 kb, 400 kb, 1200 kb, 60 kb, 180 kb and 220 kb. Legend entries: tandem repeats; complete gene.]
Neither of the families which had recombinations between MUC6 and
D11S150/MUC2 were informative for MUC5AC. Indeed no recombinations were
identified between MUC5AC and MUC6 or D11S150/MUC2. It is perhaps
somewhat surprising that two recombinations were identified between MUC6 and
D11S150/MUC2 considering the close proximity of these two genes, indeed the
PFGE data suggests that the distance between D11S150 and MUC6 may only be
60kb. This suggests that the observed level of recombination between MUC6 and
MUC2 is quite high. Indeed it appears that this region is relatively recombination
rich although as can be seen in Figure 3. 10 there does not seem to be a particular hot
spot but rather the recombinations are scattered along this region of chromosome 11.
It is interesting to note that of the 11 well supported recombinations in this region
only 1 is on the maternal side which contrasts with the observation that there is more
recombination in the female genome compared to the male (Haldane 1922). This
relative increase in the amount of recombination in the telomeric regions of males
compared with females was observed when sex specific chiasmata density maps of
mouse chromosome 2 were compared (Povey et al. 1992). Another example comes
from a chiasmata density and interference map of human chromosome 9 in males
which again shows obvious clustering of meiotic breakpoints at the terminal regions
(Povey, Smith et al. 1992). Unfortunately chiasmata data for human females are not
available so direct comparisons are not possible. Male bias at the telomeres has also
been reported for a number of other chromosomes such as chromosome 21 (Blouin et
al. 1995). The increase in the availability of genetic mapping data and the publication
of sex specific maps will enable more detailed studies of this phenomenon.
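As a rough check on the paternal excess noted above, the chance of seeing so few maternal events can be estimated with a simple binomial calculation. The sketch below is illustrative only: it assumes that, in the absence of any sex bias, each of the 11 recombinations would be equally likely to be maternal or paternal, and it ignores any difference in the number of informative meioses between the sexes. The function name is hypothetical.

```python
from math import comb


def binomial_tail_le(k, n, p=0.5):
    """Probability of k or fewer successes in n independent trials with success probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Counts taken from the text: 11 well supported recombinations, 1 of them maternal.
n_recombinations = 11
n_maternal = 1

# Assumption for illustration: under the null hypothesis a recombination is
# equally likely to be maternal or paternal (p = 0.5).
one_sided = binomial_tail_le(n_maternal, n_recombinations)
two_sided = min(1.0, 2 * one_sided)

print(f"one-sided P = {one_sided:.4f}")
print(f"two-sided P = {two_sided:.4f}")
```

Under these assumptions the one-sided probability is 12/2048, so an excess of paternal recombinations of this size would be unlikely to arise by chance alone. The figure should be read as indicative, since the true null proportion depends on how many meioses were informative in each sex.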
The mucin clones used for the analysis of the polymorphisms and linkage
analysis were amongst the first isolated and are mostly partial cDNA clones
comprised of tandem repeats. A considerable amount of effort has been devoted to the
cloning of the complete cDNA sequences of the chromosome 11 mucin genes and
more recently clones such as L31 containing ‘unique’ sequences have been isolated
(Lesuffleur, Roche et al. 1995). Prior to the cloning of L31 a very similar clone NP3a
had been identified and characterised by D. Meerzaman et al., who claimed that it
represents the 3' end of ‘MUC5’ (Meerzaman, Charles et al. 1994). The NP3a clone
was isolated using degenerate primers based on the peptide sequence reported
by Rose et al. (Rose et al. 1989). The C terminal of this peptide contains a sequence of
14 amino acids which is also present in a 22 amino acid sequence deduced from the
‘unique’ sequences shared by some of the MUC5AC clones (Aubert et al. 1991).
NP3a also contains a stop codon that is followed by a putative polyadenylation signal
16 nucleotides upstream of the poly (A) tail of 18 nucleotides which suggests that it
corresponds to the 3' end of the gene.
Analysis of the peptide sequence encoded by NP3a shows some similarity to
MUC2, vWF, bovine and porcine submaxillary mucin and rat mucin like protein
especially in the conservation of the number and position of the cysteines and it
mapped to chromosome 11. The clone L31 shows a high level of identity to NP3a at
the nucleotide level i.e. 98.6% (Appendix III). However there is less identity at the
level of the peptide due to shifts in the reading frame caused by the small number of
nucleotide differences (Fig. 3. 13). However the peptide sequence of L31 shows
some similarity to the carboxyl terminal of MUC2, especially in the conservation of
the number and position of the cysteine residues. It is interesting to note that there is
also some similarity to the cystine knot found in the Norrie disease protein
(Meitinger, Meindl et al. 1993). When the number and positions of the cysteines in
the peptide sequences of MUC2, NP3a and L31 are compared there appears to be
better agreement between L31 and MUC2 (Fig. 3. 13). In particular one of the
cysteines in the cystine knot like region is present in both L31 and MUC2 but not in
NP3a (Fig. 3. 13).
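The way in which a very small number of nucleotide differences can produce a much larger difference at the peptide level can be shown with a toy example. The sketch below uses an invented 54 base sequence (not taken from L31 or NP3a) and a variant that differs from it by one deleted base and one inserted base. All names are hypothetical and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumed here).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a DNA string in frame 1, ignoring any trailing partial codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def peptide_identity(a, b):
    """Fraction of positions at which two equal-length peptides agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Invented reference sequence, written codon by codon for readability.
ref = "".join([
    "TGT", "CCA", "ACT", "ACC", "ACA", "GTC", "ACC", "CCA", "ACA",
    "CCT", "ACT", "GGC", "ACA", "CAG", "ACC", "TGC", "CCT", "AAA",
])

# Variant: base 7 deleted (frame lost), one G inserted before base 42 (frame restored).
variant = ref[:7] + ref[8:42] + "G" + ref[42:]

ref_pep = translate(ref)
var_pep = translate(variant)
print(ref_pep)
print(var_pep)
print(f"peptide identity = {peptide_identity(ref_pep, var_pep):.2f}")
```

Only two single base changes separate the two invented sequences, yet every residue between the deletion and the compensating insertion is altered, including positions that might otherwise carry conserved cysteines. This is the kind of effect that could account for the low peptide identity between L31 and NP3a despite their high nucleotide identity.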
Figure 3. 13.
Sequence alignments of the predicted peptide sequences of carboxyl terminal of MUC2
and the cDNA clones L31 and NP3a. The conserved cysteine residues have been
underlined. The sequence in italics is where NP3a goes out of reading frame with
respect to L31.
[Figure 3.13: three-way alignment of the predicted peptide sequences of L31, NP3a and the carboxyl terminal of MUC2, numbered from position 1 to 1189 in blocks of 50 residues. The residue-level alignment is not legible in this copy.]
The relationship between clones NP3a and L31 is not clear. There are a
number of possible explanations for the differences between these clones i.e. genetic
polymorphism, the existence of more than one very similar gene or repeated exons.
The number of differences and the changes in reading frame make it seem unlikely
that the differences between these two sequences are due to polymorphism. The high
level of similarity between L31 and NP3a indicates that there would be cross
hybridisation so that if there were two genes, probes made from either clone
would detect both genes. However a single EcoRI fragment of 20+kb is detected
with both L31 and JER58 which suggests that the 3' ends and all the tandem repeat
sequences of both genes would have to be located within a region of about 20 kb
which seems unlikely (Fig. 3. 11) although it is conceivable that two fragments might
comigrate. If L31 does detect the same HindIII polymorphism as JER58 this makes
that possibility less likely. L31 also detects an 8 kb ScaI fragment (Fig. 3. 11)
suggesting that both the L31 and NP3a sequences would have to be located within
less than 8 kb of each other. It is interesting to note that the position and number of
the cysteine residues are better conserved between the predicted peptide sequence of
L31 and MUC2 than NP3a and MUC2 (Fig. 3. 13). When the peptide sequences of
other mucin genes such as MUC2 were compared with their homologues in other
species the conservation of the non repetitive sequences especially with respect to the
cysteine residues was extremely high (Gum, Hicks et al. 1994). This indicates that
the number and position of the cysteine residues is important for the function of the
glycoprotein. All these observations together suggest another possible explanation
for the differences between L31 and NP3a, namely that mistakes were made during
the sequencing of the NP3a clone.
The restriction fragment length polymorphisms detected with tandem repeat
probes for MUC2, MUC6 and MUC5AC and a wide variety of enzymes appear to be
mainly due to variation in the number of tandem repeats.
Polymorphism of MUC2 detected with TaqI has been well characterised by
our collaborators in San Francisco and is discussed in section 1.4.2.1. The
polymorphism detected with TaqI is interesting as it shows not only is there VNTR
variation but also polymorphic restriction sites located within the tandem repeats
themselves which produces complex patterns. This may have some relevance to the
interpretation of complex polymorphisms detected with TaqI, MspI and PvuII for
MUC5AC, although the structure of the gene is somewhat different in that there are
regions of tandem repeats separated by so called unique sequences which also appear
to be repeated as described in section 1.4.2.2. The patterns detected with TaqI, MspI
and PvuII suggest that there may in fact be two major polymorphic regions within the
MUC5AC gene and comparison of the relative mobilities is suggestive of VNTR
variation. Figure 3. 3 shows the correspondence of the relative mobilities of the
smaller set of fragments. The relationship between the patterns of the larger set of
fragments detected with the different enzymes is more complex and may indicate
further polymorphism arising from polymorphic restriction sites. Evidence for
VNTR variation of the larger fragments is provided by the simple patterns detected
with the enzymes PstI and HinfI which both show similar relative mobilities (Fig.
3. 4). Further work carried out in this laboratory indicates that HinfI is cutting
outside the large set of tandem repeats while PstI cuts outside all the tandem repeat
regions.
The similarity between the relative mobilities of the polymorphic fragments
detected with the MUC6 probe on DNA digested with TaqI and PvuII suggests that the
polymorphism is due to VNTR variation of a single region of tandem repeats with
the restriction sites located in flanking regions.
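The reasoning behind this inference can be illustrated with a simple model in which each fragment consists of a fixed amount of flanking DNA plus a variable number of repeat units. The sketch below uses invented values throughout (repeat length, flank sizes and repeat numbers are all hypothetical) and is only meant to show why the size differences between alleles are the same for any enzyme that cuts outside the repeat array.

```python
def fragment_size(n_repeats, repeat_bp, flank_bp):
    """Size in bp of a fragment made of flanking DNA plus n tandem repeat units."""
    return flank_bp + n_repeats * repeat_bp


# Hypothetical values chosen for illustration only.
REPEAT_BP = 500                           # length of one repeat unit
FLANK_BP = {"TaqI": 1500, "PvuII": 4000}  # total flanking DNA per enzyme
ALLELE_REPEATS = [15, 18, 22, 26]         # repeat numbers of four alleles

# Fragment sizes for every allele with each enzyme.
sizes = {
    enzyme: [fragment_size(n, REPEAT_BP, flank) for n in ALLELE_REPEATS]
    for enzyme, flank in FLANK_BP.items()
}

# Differences between successive alleles depend only on the repeat number.
steps = {
    enzyme: [b - a for a, b in zip(frags, frags[1:])]
    for enzyme, frags in sizes.items()
}

for enzyme in FLANK_BP:
    print(enzyme, sizes[enzyme], steps[enzyme])
```

With both enzymes the alleles fall in the same order and are separated by the same number of base pairs, although the absolute sizes differ by the constant flank difference. A polymorphic restriction site within or beside the repeats would break this correspondence, which is the pattern described above for the larger MUC5AC fragments.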
A feature of the three mucin genes is their hypervariability which is illustrated
by the large number of alleles. However due to the limits of resolution of the
Southern blots it is not possible to determine the exact number of distinct alleles,
although in the case of MUC2 and MUC6 it is possible to place the alleles in a
number of size categories. Thus the number of distinct alleles of MUC2 is probably
more than 9 and for MUC6 more than 11. It would seem likely that the
hypervariability is a direct consequence of a high mutation rate. Thus it is interesting
that two MUC2 mutations were detected in the EUROGEM series of CEPH families
which corresponds to approximately 400 offspring (Fig. 3. 6). Given the small
sample size it is not significant that no mutations were observed with MUC6 or
MUC5AC. Indeed the two mutations observed with MUC2 are comparable to the
number observed with some minisatellites (Jeffreys et al. 1988).
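The statement that the absence of mutations at MUC6 and MUC5AC is not significant can be illustrated numerically. The sketch below is a minimal calculation under stated assumptions: that approximately 400 offspring correspond to about 800 transmitted gametes, that mutations follow a Poisson distribution, and that the other loci were scored in the same number of offspring and share the rate estimated from MUC2. None of these assumptions is taken from the analysis itself.

```python
from math import exp

# Figures from the text: two MUC2 mutations in approximately 400 offspring.
n_offspring = 400
observed_muc2_mutations = 2

# Assumption: each offspring represents two transmitted gametes.
n_gametes = 2 * n_offspring
rate_per_gamete = observed_muc2_mutations / n_gametes

# Assumption: a second locus has the same rate and the same sample size, and
# mutations are Poisson distributed, so P(0 events) = exp(-expected).
expected_mutations = rate_per_gamete * n_gametes
p_no_mutations = exp(-expected_mutations)

print(f"estimated rate per gamete = {rate_per_gamete:.4f}")
print(f"P(no mutations at a locus with the same rate) = {p_no_mutations:.3f}")
```

Under these assumptions a locus mutating at the same rate as MUC2 would show no mutations at all in roughly one sample in seven, so the absence of events at MUC6 and MUC5AC does not in itself indicate a lower mutation rate.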
The heterozygosities calculated for MUC2 (0.64), MUC6 (0.70) and
MUC5AC (0.60 for the upper set of bands and 0.36 for the lower set) are fairly high
although not as high as those observed for most minisatellites (Vergnaud, Mariat et
al. 1991).
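For reference, heterozygosity is commonly estimated from allele frequencies as one minus the sum of the squared frequencies. The sketch below shows that calculation; whether the values quoted above were obtained in this way or by direct counting of heterozygous individuals is not stated here, and the allele frequencies used are invented for illustration.

```python
def expected_heterozygosity(allele_freqs):
    """Expected heterozygosity, 1 - sum(p_i^2), for allele frequencies summing to 1."""
    total = sum(allele_freqs)
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"allele frequencies sum to {total}, expected 1")
    return 1.0 - sum(p * p for p in allele_freqs)


# Hypothetical frequencies for six allele size classes (not data from this study).
example_freqs = [0.5, 0.2, 0.1, 0.1, 0.05, 0.05]
h = expected_heterozygosity(example_freqs)
print(f"expected heterozygosity = {h:.3f}")
```

The example illustrates that a locus with several alleles can still have a moderate heterozygosity if one allele size class predominates, which may be relevant to the comparison with minisatellites made above.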
The 40 CEPH EUROGEM families are comprised of geographically distinct
populations mostly from France and Utah and the allele distributions shown in figure
3. 5 may obscure differences in the distribution of alleles between different
populations. When the distributions for the largest sub population (from
Utah) were compared to the overall patterns, for both MUC2 and MUC6, they were
broadly in agreement. This indicates that there is probably no significant difference
in the allele distributions between the various populations comprising the EUROGEM
series.
Interestingly the allele length distribution of MUC2 appears to show a
bimodal distribution whereas the distribution of MUC6 allele sizes appears to be
approximately unimodal (Fig. 3. 5) although a trimodal distribution has been reported
in a Portuguese population (F. Carvalho and L. David personal communication). The
bimodal and trimodal distributions may indicate large scale mutations such as
duplications of portions of DNA including tandem repeat sequences. Possible
mechanisms for the duplication of large regions of DNA are through unequal crossing
over during meiosis or sister chromatid exchange. These mechanisms may be
responsible for the germ line mutation observed in the mother of family 1333 and
inherited by a number of her offspring (Fig. 3. 6). The mutant allele appears to be
approximately twice the size of the alleles present in both the maternal grandparents.
If this mutant allele is due to unequal crossing over during meiosis then one would
expect markers flanking MUC2 to be recombined. Unfortunately without data for the
great-grandparents or the mother's siblings it is not possible to determine whether the
flanking markers are recombined. Interestingly the cluster of smaller alleles of
MUC2, shown in figure 3. 5 A, are approximately half the size of the main peak.
Possibly the larger alleles arose after duplication of one of the smaller alleles.
The relative lack of recombination between markers flanking minisatellites
and analysis of MVRs indicate that processes such as slippage and sister chromatid
exchange are responsible for the generation and maintenance of VNTR
polymorphisms, as discussed in section 1.1. Indeed the specific example of the
analysis of two polymorphisms in MUC1 flanking the major tandem repeat region
also suggests that unequal crossing over during meiosis may not be a major cause of
VNTR variation in mucin genes, discussed in section 1.4.1. The large size
differences between the mucin gene alleles suggest that unequal gene conversion or
sister chromatid exchange is perhaps more likely than slippage. Interestingly the
mutation in child C11 of family 1331 also supports the notion that unequal
recombination is not involved as no recombination is detected between the flanking
markers (Appendix II).
Thus it would seem that although there is a relatively high level of meiotic
recombination in this region of chromosome 11 other processes such as unequal
exchange between sister chromatids or unequal gene conversion may be responsible
for the maintenance of the VNTR polymorphisms in the mucin genes and possibly for
the duplication of genes.
The conservation of repetitive sequences between different mucin genes is
very poor even though the unique sequences appear well conserved (Pemberton,
Taylor-Papadimitriou et al. 1992). However one fairly common feature of the peptide
sequences predicted from the repetitive DNA is the presence of hydroxyl amino acids
which could be potential O-glycosylation sites. This indicates that the role of the
repeat sequences is to provide a backbone to which the large numbers of carbohydrate
side chains associated with mucins are attached.
It is clear that particular regions are conserved between different mucin genes
in the same organism and homologous mucin genes in different organisms, especially
the cysteine rich regions which suggests that they are important in the function of
these molecules. This is perhaps not surprising as it has been speculated that the
cysteine rich regions of mucins may be involved in crosslinking of mucin
glycoproteins to produce the mucus gel and/or as receptor binding motifs which
recognise peptides expressed on the cell surface (Meitinger, Meindl et al. 1993).
4. Genetic and physical mapping of MUC3 located
on chromosome 7q22: results and discussion.
The mucin gene MUC3 was localised to chromosome 7q22 by in situ
hybridisation using the tandem repeat probe SIB124 (Fox, Lahbib et al. 1992) prior to
the outset of this project. Polymorphism had also been detected by Southern blot
analysis (Fig. 4. 1), as previously reported (Fox, Lahbib et al. 1992). The first two
sections of these results, 4.1.1 and 4.1.2, deal with the analysis of MUC3
polymorphisms and the genetic analysis of chromosome 7 with particular interest in
the chromosomal region around MUC3. The remaining sections, 4.1.3 and 4.1.4, are
concerned with the physical characterisation of the chromosomal region which
contains MUC3 and the attempts made to investigate the physical structure and
sequence of MUC3 itself.
4.1. Results
4.1.1. Analysis of the MUC3 polymorphisms and two-point linkage
analysis
When Southern blots of DNA digested with PvuII and PstI were probed with
the cDNA clone SIB124, which is comprised of tandem repeat sequence, a restriction
fragment length polymorphism was detected which produces a distinctive pattern of
bands (Fig. 4. 2). The patterns in each case comprise two sets of bands with one set
of fragments larger than 18.5kb and the other set of fragments smaller than 15kb. All
40 CEPH families in the EUROGEM series were tested with PvuII and all the parents
together with 6 complete families were tested with PstI. Each set of PvuII bands
shows independent allelic variation and there was no apparent association between
the two polymorphic regions in either case i.e. the variation seen in one set of
fragments is not dependent on that seen in the other set. A similar situation was
observed with PstI. This suggests that the polymorphic regions of DNA are
physically separated in the genome and do not arise from common restriction sites.
The high level of variation together with the broad similarity of the patterns
observed with both PvuII and PstI initially indicated that the polymorphism is simply
due to variation in the number of tandem repeats (VNTR) in the two zones (Fig. 4. 2).
However the analysis of unrelated individuals shows differences between the relative
mobilities of the bands detected with PvuII and PstI for the smaller set of fragments.
This can clearly be seen in family 1341 in figure 4. 2 where the smaller set of bands
detected with PvuII appears to be homozygous in the children C1, C3, C5, C6 and the
mother, MM, but heterozygous with PstI, while the pattern in the father, F, appears
to be heterozygous with both PvuII and PstI.
[Figure 4.1: autoradiographs. Lanes: FM, MF, FF, C1 to C8, M, F. Panels: Pvu II and Pst I. Size markers (kb): 48.5, 18.5, 15.0, 9.0.]
Figure 4. 1.
Autoradiographs of two Southern blots of DNA from a CEPH family digested with
Pvu II and Pst I probed with SIB124 (MUC3) (taken from Fox et al. 1992). Key:
FF=father of the father, MF=mother of the father, F=father, C1, C2, C3,
etc.=children, M=mother, FM=father of the mother and MM=mother of the mother.
Figure 4. 2.
Autoradiographs of two Southern blots of DNA from the CEPH family 1341
digested with Pvu II and Pst I probed with SIB 124 (MUC3). parents are shown. Key:
FF=father of the father, MF=mother of the father, F=father, C1, C2, C3,
etc.=children, M=mother, FM=father of the mother and MM=mother of the mother.
The simplest interpretation of these observations is that there is some VNTR
variation with additional polymorphism of a PstI site in the region to one side of the
smaller tandem repeat region which causes a major change in size. This site could
be located within the tandem repeats themselves but no reciprocal small fragment was
detected with the tandem repeat probe (SIB124) which indicates the polymorphic PstI
site is located in the ‘unique’ sequence.
No recombination was observed between the two polymorphic zones detected
with PvuII in the 40 CEPH EUROGEM families tested. Two-point linkage analysis
using the 'twopoint' option of CRI-MAP showed that these two zones are tightly
linked with a lod score of 31 at θ = 0. Two-point analysis using the 'twopoint'
option of CRI-MAP was carried out with all the chromosome 7 markers in the CEPH
data base version 6 and MUC3; a selection of results is shown in Table 4. 1. These
results confirmed that MUC3 is situated in the linkage group which includes COL1A2¹.
Since MUC3 had been assigned to 7q22 by in situ hybridisation this provided another
physically assigned marker in this linkage group.
¹ Collagen type I alpha 2.
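To give a sense of scale for a lod score of this size, the sketch below evaluates the textbook two-point lod score for fully informative, phase-known meioses. This is a simplification made for illustration: CRI-MAP works from the pedigree data and also handles phase-unknown and partially informative meioses, which is not attempted here. The function name and the count of 103 meioses are assumptions, not values taken from the analysis.

```python
from math import log10


def lod_phase_known(n_meioses, n_recombinants, theta):
    """Two-point lod score for phase-known meioses at recombination fraction theta.

    Z(theta) = log10[ theta^k * (1 - theta)^(n - k) / 0.5^n ]
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    n, k = n_meioses, n_recombinants
    if theta == 0.0:
        # Any observed recombinant is impossible at theta = 0.
        return float("-inf") if k > 0 else n * log10(2)
    return k * log10(theta) + (n - k) * log10(1 - theta) + n * log10(2)


# Hypothetical count: about 103 informative meioses with no recombinants
# would be needed to reach a lod score of 31 at theta = 0 under this model.
lod = lod_phase_known(n_meioses=103, n_recombinants=0, theta=0.0)
print(f"lod = {lod:.1f}")
```

Each non-recombinant phase-known meiosis contributes log10(2), or about 0.3, to the lod score at θ = 0, so a score of 31 reflects the large number of informative meioses available when both zones are typed in all 40 families.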
Gene locus   Location        θ F    θ M    lod score
D7S64        7q21-q22        0.12   0.05   21.8
COL1A2       7q21.3-q22      0.11   0.00   21.6
D7S82        7pter-q22       0.04   0.03   27.6
D7S78        7q21.3-q31      0.18   0.03   11.3
D7S13        7q22.3-q31.2    0.08   0.06   16.5
D7S8         7q31            0.21   0.10   10.3
Table 4. 1.
Table showing the pairwise lod scores at maximum likelihood recombination
fractions θ in males (M) and females (F) for MUC3 with a selection of chromosome 7
markers which have been localised to regions of chromosome 7 using physical
methods.
4.1.2. Genetic mapping of chromosome 7
When this work was started the most recent genetic maps available of
chromosome 7 were from GENETHON and NIH/CEPH collaborative map (1992;
Weissenbach et al. 1992). The GENETHON map did not contain any genes and
although the NIH/CEPH map had a number of genes the markers were of the less
informative RFLP type. Also the two maps shared no markers and neither contained
MUC3. Thus it was of some interest to try and integrate genes and other markers
from these maps together with MUC3 and possibly identify loci close enough for
long range PFGE mapping.
A genetic map of chromosome 7 was built using the multipoint analysis
options of CRI-MAP. The 'build' option of this program was used to generate the
map. The 'automatic build' used all the markers in the CEPH database version 6 in
order of their informativeness. Unfortunately the program crashed before all the
markers had been tested because the map exceeded the capacity of the computer that
was available at this time. However the partial map constructed provided an excellent
starting point. The map presented here was constructed using a combination of the
preliminary output from the automatically constructed map together with a selection
of markers which were shown to have a high probability of being located near MUC3
i.e. a two point lod score of at least 3 and a small θ value (Appendix IV). The final
combined map was checked using an option called 'flips 5' which showed that the
map was supported at odds of 1000 to 1 when groups of 5 markers were permuted
(Fig. 4. 3).
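The idea behind checking a map by permuting groups of adjacent markers can be sketched as follows. This is not CRI-MAP code: the likelihood function is a toy stand-in (labelled as such), and only the enumeration of permuted orders and the 1000:1 criterion (a log10 likelihood difference of at least 3) are illustrated.

```python
from itertools import permutations

def flipped_orders(order, window=5):
    """Yield each distinct alternative order made by permuting `window` adjacent loci."""
    order = tuple(order)
    seen = {order}
    for start in range(len(order) - window + 1):
        block = order[start:start + window]
        for perm in permutations(block):
            candidate = order[:start] + perm + order[start + window:]
            if candidate not in seen:
                seen.add(candidate)
                yield candidate

def supported(order, log10_likelihood, window=5, min_log10_odds=3.0):
    """True if `order` beats every permuted alternative by at least `min_log10_odds`."""
    best = log10_likelihood(tuple(order))
    return all(best - log10_likelihood(alt) >= min_log10_odds
               for alt in flipped_orders(order, window))

# Toy stand-in for a multipoint likelihood (hypothetical): each locus out of
# place relative to TRUE_ORDER costs 3.5 log10 units.
TRUE_ORDER = ("D7S524", "D7S15", "D7S527", "D7S491", "D7S554", "MUC3")

def toy_log10_likelihood(order):
    mismatches = sum(a != b for a, b in zip(order, TRUE_ORDER))
    return -3.5 * mismatches
```

A single window of five loci has 5! − 1 = 119 alternative orders, which is why the number of likelihood evaluations grows quickly with the window size.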
This map included the genes MUC3, IL6, TCRG, ERV3, TCRB and a large
number of markers from both the previously published GENETHON and NIH/CEPH
collaborative maps (1992; Weissenbach, Gyapay et al. 1992). This data was
published in abstract form (Hill et al. 1994) and the map included in the report of The
First International Workshop on Human Chromosome 7 Mapping 1993 (Grzeschik et
Marker order shown in Figure 4. 3: D7S531, D7S21, D7S517, D7S481, D7S513, D7S75, D7S503, D7S493, IL6, D7S529, D7S62, D7S526, D7S497, TCRG, D7S485, D7S521, D7S519, D7S506, D7S499, D7S502, ERV3, D7S398, D7S524, D7S15, D7S527, D7S491, D7S554, MUC3, D7S496, D7S523, D7S486, D7S480, D7S97, D7S487, D7S512, D7S500, D7S509, TCRB, D7S498, D7S450, D7S505, D7S483, D7S68, D7S22, D7S468, D7S104
Figure 4. 3.
A diagrammatic representation of the framework map of chromosome 7 based on the
order predicted by the computer program CRI-MAP supported at odds of greater than
1000:1. An ideogram of chromosome 7 is shown alongside the gene order and the
physical localisation of a selection of genes is indicated as known in October 1993.
This map was used as the basis for constructing a more detailed map of
chromosome 7q using data from a more recent version of the CEPH database,
version 7 (Fig. 4. 4). The section of the total chromosome 7 map from ERV3 (which
maps to the centromere) to TCRB, located near the q arm telomere, was used as the
initial order for the 7q map. The program CRI-MAP then attempted to insert all the
new markers from the CEPH database version 7 which had a lod score of greater than
3 with MUC3. This second map was subsequently used at the second International
Chromosome 7 Workshop together with maps of chromosome 7 from other workers
to obtain a consensus order for 77 markers along the entire length of the chromosome
(Tsui, Donis-Keller et al. 1995).
The map of the q arm of chromosome 7 indicated that the two closest genes
genetically are COL1A2 and MET; however the genetic distances between COL1A2
and MUC3 (6.6 cM) and between MET and MUC3 (21.7 cM) are quite large
and indicate that these two genes are not suitable for physical mapping. By the
second International Workshop it was evident from the physical data that there were
problems with the physical mapping of this region of chromosome 7 (Tsui,
Donis-Keller et al. 1995). The region appears to be fairly gene rich and contains the genes
EPO, PAI1 and ACHE but is not covered by any contig maps and no YAC clones had
been isolated (Watkins et al. 1986; Klinger et al. 1987; Getman et al. 1992). It was
therefore of some interest to try to establish the genetic and physical locations of
Total map length: 115.3 cM average, 154.0 cM female, 80.5 cM male
Figure 4. 4.
A diagrammatic representation of a higher resolution genetic map of the q arm of
chromosome 7 supported at odds of greater than 1000:1. An ideogram of the q arm is
shown alongside the gene order and the physical localisation of a selection of genes
is indicated.
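Genetic distances in cM such as those on this map are related to recombination fractions through a map function. The sketch below shows the two standard inverse map functions; which function applies to the distances quoted in this chapter is an assumption to be checked against the CRI-MAP settings used, and the function names are illustrative.

```python
import math

def haldane_theta(distance_cm: float) -> float:
    """Recombination fraction for a map distance under Haldane's function (no interference)."""
    d = distance_cm / 100.0  # convert cM to Morgans
    return 0.5 * (1.0 - math.exp(-2.0 * d))

def kosambi_theta(distance_cm: float) -> float:
    """Recombination fraction for a map distance under Kosambi's function."""
    d = distance_cm / 100.0
    return 0.5 * math.tanh(2.0 * d)

for name, cm in (("COL1A2-MUC3", 6.6), ("MET-MUC3", 21.7)):
    print(f"{name}: {cm} cM, theta = {haldane_theta(cm):.3f} (Haldane), "
          f"{kosambi_theta(cm):.3f} (Kosambi)")
```

As a rough rule of thumb, 1 cM corresponds on average to about 1 Mb in the human genome, although this varies widely between regions and between the sexes, so distances of several cM are beyond the range of a single PFGE fragment.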
4.1.2.1. Mapping of the gene PAI1 using a panel of chromosomes with
defined meiotic breakpoints
PAI1 was selected for genetic mapping since a dinucleotide repeat
polymorphism had already been identified with the primers HGMP ID No. 6031 and
6032 (GDB accession ID; GDB:512834). In order to analyse this polymorphism
using an automated sequencing system the primer 6031 was fluorescently labelled
using the kit supplied by Vistra and PCR was carried out on DNA samples from the
members of the desired families (Fig. 4. 5).
A panel of chromosomes with defined meiotic breakpoints was constructed
using the program ‘CROSSFIND’ (Attwood et al. 1996). The program utilises the
'chrompic' output of CRI-MAP produced using information from the CEPH data base
V.7. The order of the loci used was based on the consensus order published in the
report of the second International Chromosome 7 Workshop (Tsui, Donis-Keller et al.
1995). The conditions used to construct the initial diagram were fam_like_tol=0.3,
female_map_tol=20, male_map_tol=20, min_density=3.0, min_score=250 and
puk_score_factor=0.5 (Fig. 4. 6). The fam_like_tol is the minimum likelihood for the
predicted phase relationships between loci in any particular phase unknown family
that the program will accept and is calculated by CRI-MAP when the 'chrompic' file is
created. The female_map_tol and male_map_tol values are the minimum allowable
distance in cM between double recombinants. The min_density is the minimum number of
informative loci per 10 cM. The min_score is the minimum allowable value of a
calculation which measures the support for a particular breakpoint. This value is
calculated by assigning an overall value to the 10 adjacent markers either side. For
example if the marker next to the breakpoint is informative then it scores highly.
However the further from the breakpoint the informative loci are the lower they score.
The maximum 'score' is 500 with the values heavily weighted in favour of the three
loci closest to the breakpoint. The puk_score_factor is the number by which the
program multiplies the individual 'score' values assigned to loci for which the phase is
not known.
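The scoring scheme described above can be sketched as follows. The actual weights used by CROSSFIND are not given in the text, so the SIDE_WEIGHTS values below are purely hypothetical; they were chosen only so that the maximum score is 500 and the three loci nearest the breakpoint dominate. The function names are likewise illustrative.

```python
# Hypothetical weights for the 10 markers on one side of a breakpoint, nearest
# first. They sum to 250, so two fully informative sides give the maximum of 500.
SIDE_WEIGHTS = (100, 60, 40, 15, 10, 8, 6, 5, 4, 2)

def side_score(markers, puk_score_factor=0.5):
    """Score one side of a breakpoint.

    `markers` lists the flanking loci nearest first; each entry is "known"
    (informative, phase known), "unknown" (informative, phase unknown)
    or None (uninformative).
    """
    total = 0.0
    for weight, status in zip(SIDE_WEIGHTS, markers):
        if status == "known":
            total += weight
        elif status == "unknown":
            total += weight * puk_score_factor  # down-weight phase unknown loci
    return total

def breakpoint_score(left, right, puk_score_factor=0.5):
    """Total support for a breakpoint from the markers on both sides."""
    return side_score(left, puk_score_factor) + side_score(right, puk_score_factor)

def accept_breakpoint(left, right, min_score=250, puk_score_factor=0.5):
    """Apply the min_score threshold to a candidate breakpoint."""
    return breakpoint_score(left, right, puk_score_factor) >= min_score
```

Under these assumed weights a breakpoint flanked on both sides only by phase unknown loci scores exactly 250 with puk_score_factor=0.5, which is the min_score threshold quoted in the text.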
Deduced phenotypes shown in the figure: (2,4), (1,2), (1,2), C4 (1,4), C5 (3,4), C7 (3,4), C8 (3,4), C9 (1,2), FF (2,3), MF (1,4), FM (1,3)
Figure 4. 5.
An example of the results obtained with the ALF system showing the electrophoretic
analysis of the PAI1 microsatellite using DNA samples from members of CEPH
family 1347. An arbitrary phenotype for the members of the family can be deduced
by comparison of the relative positions of the major peaks. The deduced phenotype is
shown in brackets below the family member symbol. Key: FF=father of the father,
MF=mother of the father, F=father, C1, C2, C3, etc.=children, M=mother,
FM=father of the mother and MM=mother of the mother.
Figure 4. 6.
Output from the program CROSSFIND using the consensus order of markers on
chromosome 7, taken from the report of the Second International Chromosome 7
Workshop [Tsui, 1995], showing a selection of chromosomes with defined meiotic
breakpoints. The conditions for this diagram were: female_map_tol=20,
male_map_tol=20, min_density=3.0, min_score=250 and puk_score_factor=0.5. The
most likely position of PAI1 is indicated by a vertical bar on the left of the gene order.
The parental origin of each chromosome is indicated by the suffix M (maternal) or P
(paternal) to the CEPH family and individual number. The grandparental origin of each
chromosomal region is indicated by black (grandpaternal) and white (grandmaternal)
squares; if the grandparental origin cannot be determined the square is grey.
Figure 4. 6 shows the output from the program ‘CROSSFIND’ which
identified breakpoints in the chromosomes of the children from the two CEPH
families 1347 and 1331. These two families provided a fairly even spread of
breakpoints along chromosome 7.
Comparison of the results for PAI1 with the breakpoints defined in Figure 4. 6
indicates that PAI1 maps between D7S630 and D7S525, confirming its position within
the same genetic region as COL1A2 and MUC3.
The fine mapping of PAI1 was done by creating a diagram under less
stringent conditions i.e. min_density=2.0. This identified a panel of recombinants
across the region from D7S630 to D7S525 (Fig. 4. 7). The informative recombinant
individuals identified in CEPH families 884, 1332, 1362, 1413 and 1416, together
with their parents and grandparents, were tested for PAI1. Comparison of the PAI1
results in these families with the panel of recombinants places the gene with D7S554
and MUC3 (Fig. 4. 7).
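The logic of placing a new locus against a panel of chromosomes with defined breakpoints can be sketched as below. This is an illustration of the reasoning only, not the CROSSFIND program: the function name is an assumption and the observations in the example are hypothetical, not the results obtained for PAI1.

```python
def localise(order, observations):
    """Narrow down the interval containing a new locus.

    `order` is the framework marker order. Each observation is a tuple
    (left_marker, right_marker, side): a recombinant chromosome with a
    breakpoint between two adjacent framework markers, where `side` is
    "proximal" if the new locus shares grandparental origin with the markers
    before the breakpoint and "distal" if with those after it.
    Returns the pair of flanking framework markers (None marks a map end).
    """
    lo, hi = 0, len(order)  # candidate gaps: gap g lies between order[g-1] and order[g]
    for left, right, side in observations:
        i = order.index(left)
        if i + 1 >= len(order) or order[i + 1] != right:
            raise ValueError(f"{left} and {right} are not adjacent in the order")
        gap = i + 1  # the gap that contains this breakpoint
        if side == "proximal":
            hi = min(hi, gap)
        elif side == "distal":
            lo = max(lo, gap)
        else:
            raise ValueError("side must be 'proximal' or 'distal'")
    if lo > hi:
        raise ValueError("observations are inconsistent with the marker order")
    left_flank = order[lo - 1] if lo > 0 else None
    right_flank = order[hi] if hi < len(order) else None
    return left_flank, right_flank

# Hypothetical example using part of the consensus order.
order = ["D7S630", "D7S492", "COL1A2", "D7S554", "MUC3", "D7S515", "D7S496", "D7S525"]
panel = [("COL1A2", "D7S554", "distal"), ("MUC3", "D7S515", "proximal")]
print(localise(order, panel))
```

With these two hypothetical recombinants the new locus is confined to the interval between COL1A2 and D7S515; each additional recombinant with a breakpoint inside that interval narrows it further.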
Figure 4. 7.
Output from the program CROSSFIND using the consensus order of markers on
chromosome 7, taken from the report of the Second International Chromosome 7
Workshop [Tsui, 1995], showing a selection of chromosomes with defined meiotic
breakpoints in the region 7q22. The most likely position of PAI1 within the consensus
order is indicated together with the results from the ALF analysis. The conditions for
this diagram were: fam_like_tol=0.3, female_map_tol=20, male_map_tol=20,
min_density=2.0, min_score=250 and puk_score_factor=0.5. PAI1 has been placed in
its most likely position (between D7S554 and MUC3) and the individual results for this
marker are shown. The parental origin of each chromosome is indicated by the suffix
M (maternal) or P (paternal) to the CEPH family number and individual number. The
grandparental origin of each chromosomal region is indicated by black (grandpaternal)
and white (grandmaternal) squares; if the grandparental origin cannot be determined the
square is grey.
Marker order shown alongside the diagram: D7S596, D7S531, D7S21, D7S462, D7S517, 11Com/Com2, D7S481, D7S641, D7S513, D7S664, D7S103, D7S503, D7S507, D7S488, D7S493, IL6/INFB2, D7S529, D7S516, D7S435, D7S526, D7S497, D7S528, TCRG, D7S485, D7S510, D7S521, D7S891, GCK, D7S674, D7S506, EGFRe, D7S499, D7S494, D7S502, D7S639, D7S645, D7S653, D7S669, D7S440, D7S524, D7S630, D7S492, COL1A2-1, D7S554, PAI1, MUC3, D7S515, 12Com/Com2, D7S496, D7S525, D7S471, MET-4, D7S480, D7S650, D7S490, D7S466, D7S514, D7S530, D7S512, D7S500, D7S509, D7S72
Autoradiograph of a Southern blot of a pulsed field gel electrophoresis of K562 DNA
digested with Sma I, Sfi I, BssH II, Nae I, Not I, Sac II, Nru I, and Mlu I restriction
enzymes probed with SIB 124. The size markers are shown on the right hand side and
are in kilobases.
Enzyme   Size of fragments/kb
NotI     370
BssHII   355
NaeI     50
SmaI     80, 45
SfiI     160, 85
SacII    330, 250, 115, 70, 40
NruI     885, 640, 320, 270
MluI     310, 115, 310
Table 4. 4.
Table showing the size of fragments detected using the cDNA probe on PFGE
blots of genomic DNA digested with NotI, BssHII, NaeI, SmaI, SfiI, SacII, NruI and
MluI from the cell line K562.
Multiple bands are observed with the three remaining enzymes SacII (330kb,
250kb, 115kb, 70kb and 40kb), NruI (885kb, 640kb, 320kb and 270kb) and MluI
(310kb, 165kb and 115kb). The most likely explanation for these patterns of bands is
partial digestion although some may be due to restriction sites for these enzymes
being present between the tandem repeat regions.
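How partial digestion produces a ladder of bands with a single probe can be sketched as follows. The site positions in the example are hypothetical, chosen only to show how three bands could arise from one probe; they are not a restriction map of the MUC3 region, and the function name is an assumption.

```python
def probe_fragments(sites, probe_pos, max_uncut=2):
    """Sizes of all fragments containing `probe_pos` when at most `max_uncut`
    of the restriction sites inside the fragment are left uncut.

    `sites` and `probe_pos` are coordinates in kb. With max_uncut=0 only the
    complete digestion fragment is returned.
    """
    sites = sorted(sites)
    left = [s for s in sites if s <= probe_pos]
    right = [s for s in sites if s > probe_pos]
    sizes = set()
    for skipped_left, start in enumerate(reversed(left)):   # nearest site first
        for skipped_right, end in enumerate(right):         # nearest site first
            if skipped_left + skipped_right <= max_uncut:
                sizes.add(end - start)
    return sorted(sizes, reverse=True)

# Hypothetical site positions (kb) around a probe at 250 kb.
print(probe_fragments([0, 145, 195, 310], probe_pos=250))
print(probe_fragments([0, 145, 195, 310], probe_pos=250, max_uncut=0))
```

In this sketch every partial digestion band is the complete digestion fragment plus one or more neighbouring fragments, so all bands must be larger than the smallest band detected by the probe.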
Southern blots of genomic DNA digested with NotI and SwaI were probed
with SIB 124 and SIB172U (Fig. 4. 12). All three probes detected a single 350 to
400kb NotI and 200kb SwaI fragment.
In order to investigate the possibility of large scale differences in the region
around MUC3 a Southern blot of DNA from 5 individuals digested with SwaI was
probed with SIB 124 (Fig. 4. 13). The result clearly shows that the fragment detected
in each individual is indistinguishable. Differences of up to 20kb, such as those due
to VNTR variation, would however not be resolved under these conditions.
These results indicate that the whole MUC3 genetic region is located on a
200kb SwaI fragment which shows no large scale variation in size between
individuals. The NotI and BssHII results indicate that there are potential CpG island
sequences surrounding the MUC3 gene which may be associated with the 5' end of
the MUC3 gene.
Figure 4. 12.
Autoradiograph of a Southern blot of pulsed field gel electrophoresis of K562 DNA
digested with Not I and Swa I probed with SIB 124 and SIB172U. The size markers
are shown on the left hand side and are in kilobases.
Figure 4. 13.
Autoradiograph of a Southern blot of pulsed field gel electrophoresis of DNA from 5
individuals digested with Swa I probed with SIB 124. Size markers are in kilobases.
4.1.3.5. Cloning MUC3
The isolation of large genomic clones such as YACs and cosmids would
greatly aid the elucidation of the genomic structure of MUC3. To this end a number
of YAC and cosmid libraries were screened by ourselves and our collaborators in an
attempt to isolate large genomic clones containing MUC3.
4.1.3.6. Isolation and analysis of genomic clones
GM3 was isolated in California by Dr Jim Gum by screening a human
genomic library in λFIXII with SIB 124. This clone has an insert of approximately
20kb and contains the sequences in both clones 20 and 23 at one end of the insert.
Comparison of the cDNA and genomic clone sequences indicates that there are at
least two introns located at the 3' end of MUC3 (Fig. 4. 14). Clone 20 contains
478bp of sequence 5' to the first 'small' intron, the whole 184bp of the next exon and
1243bp of sequence 3' to the second intron. This is comprised of 139bp of translated
sequence and 1104bp of untranslated sequence. The first 'small' intron has been
completely sequenced and is 106bp long but only a small amount of the second intron
at the 5' and 3' ends has been sequenced.
Figure 4. 14.
A diagrammatic representation of the restriction map of the genomic clone GM3. The
approximate position of various primers is indicated together with the position of Pst I
and Pvu II sites identified by sequencing.
Labels in Figure 4. 14: scale 20 kb; restriction sites Sph I, Sma I, Bgl II, Sst I and EcoR I; cDNA clones 20 and 23; segment sizes 104 bp, 1529 bp, 2-2.5 kb, 1021 bp, 139 bp, 106 bp and 184 bp; primers MUC323S, MUC31S, MUC3INS, MUC3INA, MUC3F2A, MUC323A and MUC31A. Key: tandem repeats; 3' non repetitive coding region; non coding region; intron.
Primers were designed for RT-PCR which were either side of the ‘small’
intron, MUC31S and MUC3F2A, and spanning both the ‘small’ and ‘large’ introns,
MUC31S and MUC31A (Fig. 4. 14). RT-PCR was carried out on cDNA samples
from tissues and cell lines, which included small intestine, colon and HT29-MTX
cells (cell line HT29 treated with methotrexate causing secretion of mucus). A
product was detected in colon, small intestine, the Caco 2 cell line and HT29-MTX
cells, although the expression is quite low with the highest level in small intestine
(Fig. 4. 15). The product obtained with both sets of primers was consistent with that
predicted from the cDNA sequence. The larger product of approximately 380bp
detected in colon, the cell line MCF-7 and Caco-2 cells with the primers MUC31S
and MUC3F2A (Fig. 4. 15) is probably due to contamination by genomic DNA. This
380bp fragment corresponds to that predicted from the genomic sequence of the clone GM3. Indeed primers designed for the human lactase gene amplified a product from the reverse transcriptase free control, suggesting the presence of genomic DNA in trace amounts.
When we received the GM3 clone the size of the second intron was unknown
and the precise position of the final 139bp of translated and 1104bp of untranslated
sequence within the restriction map of GM3 was unknown. The size of the intron
was therefore estimated using 'medium/long hot start' PCR described in section
2.6.3.4. The primers used were MUC323S and MUC31A designed from the 'unique'
sequence contained in cDNA clones 20 and 23 (Fig. 4. 14). The primers span both
the introns and a product of approximately 4.4kb was detected on an agarose gel
stained with ethidium bromide (Fig. 4. 16). This confirmed that the clone 23 and 20
sequences are contiguous and when all the known sequence is subtracted gives an
intron of 2.4kb (Fig. 4. 14).
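The arithmetic behind these size estimates can be written out as a small sketch. The 2000 bp of known sequence between the primers is not stated in the text; it is simply the value implied by a 4.4kb product and a 2.4kb intron, and is labelled as an assumption in the code. The function names are illustrative.

```python
def intron_size(genomic_product_bp: int, known_sequence_bp: int) -> int:
    """Intron size estimated as the genomic PCR product minus all known sequence
    between the primers (exons plus any fully sequenced intron)."""
    return genomic_product_bp - known_sequence_bp

def cdna_product_size(genomic_product_bp: int, intron_sizes_bp) -> int:
    """Expected RT-PCR product size once the introns spanned by the primers are spliced out."""
    return genomic_product_bp - sum(intron_sizes_bp)

# Large intron: approximately 4.4kb product with MUC323S and MUC31A.
# ASSUMPTION: about 2000 bp of known sequence lies between the primers
# (implied by the 2.4kb estimate in the text, not stated directly).
print(intron_size(4400, 2000))        # approximate size of the second intron in bp

# Small intron: the approximately 380bp genomic product with MUC31S and
# MUC3F2A spans the 106bp intron, so the spliced product is about 106bp shorter.
print(cdna_product_size(380, [106]))  # approximate cDNA product in bp
```

Because the product sizes are read from agarose gels, these values are approximate.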
Figure 4. 15.
Reverse transcriptase (RT) PCR of cDNA samples prepared from colon (CO), small
intestine (SI), M614 normal stomach (M6), MZPC-4 pancreas cancer cell line (MZ),
SKPC-3 pancreas cancer cell line (SK), MCF-7 breast cancer cell line (MC), Caco 2
colon cancer cell line (CA) and HT29-MTX colon cancer cell line treated with
methotrexate (HT), with primers MUC31S and MUC3F2A. Size markers (S) are in
basepairs.
Figure 4. 16.
Medium length hot start PCR of genomic (G) and genomic clone GM3 (M) DNA
with MUC323S and MUC31A primers. The size markers (S) are in base pairs.
4.1.3.7. Isolation of YAC clones
Two pairs of PCR primers for MUC3, MUC323A and S and MUC3INA
and S, were designed from sequence data supplied by J Gum and their approximate
positions are shown in Fig. 4. 14. The conditions of the PCR were adjusted to detect
the MUC3 gene in approximately 1 ng of genomic DNA.
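To put 1 ng of genomic DNA into perspective, the number of template copies it contains can be estimated as below. The genome size and the mean mass per base pair are commonly used approximations rather than values from the thesis, and the function name is illustrative.

```python
AVOGADRO = 6.02214076e23        # molecules per mole
MEAN_BP_MASS = 650.0            # g/mol per base pair (common approximation)
HAPLOID_GENOME_BP = 3.1e9       # approximate size of the human haploid genome

def haploid_copies(mass_ng: float, genome_bp: float = HAPLOID_GENOME_BP) -> float:
    """Approximate number of haploid genome copies in `mass_ng` of genomic DNA."""
    genome_mass_g = genome_bp * MEAN_BP_MASS / AVOGADRO
    return mass_ng * 1e-9 / genome_mass_g

print(round(haploid_copies(1.0)))  # copies of a single copy locus in 1 ng
```

On these assumptions 1 ng corresponds to roughly 300 copies of a single copy locus, which gives an idea of the sensitivity needed when screening pooled YAC DNA by PCR.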
These primers were used to amplify human genomic DNA, a chromosome 7
only somatic cell mouse/human hybrid, C121, and a sample of mouse genomic DNA.
A product of the correct size was detected with both human genomic and C121 DNA
but not with the mouse genomic DNA (data not shown). This indicated that the
primers were specific for chromosome 7. These primers were used to screen the ICI
YAC library (Anand et al. 1989) supplied by Dr B. Carritt. Unfortunately all the first
level YAC pools were negative.
Both sets of primers and the SIB 124 probe were supplied to two other groups
who used them to screen their YAC libraries for MUC3 sequences. One YAC clone
(ICRF900A07107) from the ICRF reference library was isolated by Dr. Stephen
Scherer from Toronto.
Four YAC clones were also isolated in Eric Green's lab by screening three
libraries with the MUC3 PCR primers. Clones YWSS2050 and YWSS3480 were
obtained from a chromosome 7 only hybrid cell line library and clones YWSS2717
and YWSS2782 were isolated from the CEPH mega YAC library.
4.1.3.8. Initial characterisation of the YAC clones
Initial characterisation of the YAC clone ICRF900A07107 by Southern blot
analysis was carried out by Dr Stephen Scherer. Southern blots of the YAC clone
digested with EcoRI probed with SIB 124 detected two fragments, although
previously Southern blots of genomic DNA digested with EcoRI probed with SIB 124
had revealed up to 11 bands (data not shown). Also when a Southern blot of
undigested DNA from the YAC clone was probed with vector sequence to check the
insert size two fragments were detected, one of about 500kb and the other of about
360kb. When the same Southern blot was probed with SIB 124 only the 500kb
fragment was detected. This indicates that there may be a mixture of two clones or
that the MUC3 tandem repeat containing clone is unstable and recombines down from
500kb to 360kb. Also the EcoRI results indicated that the YAC clone only contains a
fragment of the tandem repeat region.
Samples of DNA prepared from the YAC clone ICRF900A07107 were used
for fluorescent in situ hybridisation (FISH) experiments. The strongest signal was
detected on chromosome 1p, a faint signal on chromosome 7q22 and a number of
faint signals on a variety of other chromosomes (Fig. 4. 17). The results of these
experiments indicate that the YAC is highly chimaeric.
Before any further work was done the clones YWSS2050, YWSS3480,
YWSS2717 and YWSS2782 were tested using FISH. The clone YWSS3840 gave a
strong signal on chromosome 7q22 (Fig. 4. 18), whereas YWSS2050 mapped to
7q22-31, YWSS2717 mapped to 4q32-33 and YWSS2782 mapped to 3p14 and
13q22-31 (Fig. 4. 19). Thus YWSS3840 was the only promising clone for further
investigation of MUC3.
4.1.3.9. Further characterisation of YAC YWSS3840
Three pairs of primers were tested on a sample of DNA from YWSS3840 and
genomic DNA. The primer pairs used were: MUC3FP1A and S designed from
SIB 172 sequence, MUC323A and S designed from clone 23 sequence and
MUC3INA and S designed from GM3 large intron sequence. The primer
sequences are shown in Table 2. 1. These primers all amplified the YAC DNA and
produced the same size product with both the YAC and genomic DNA (Fig. 4. 20).
These results indicate that MUC3 sequences 5' and 3' to the tandem repeats are
present in YWSS3840.
Figure 4. 17.
A metaphase spread showing fluorescent in situ hybridisation of the YAC clone
ICRF900A07107. The chromosomes are counter stained red with PI. The probe
localisation can be seen as green spots; the strongest signal is coincident with
chromosome 1p with a number of weaker signals detected on other chromosomes
including chromosome 7q22.
Figure 4. 18.
A metaphase spread showing fluorescent in situ hybridisation of the YAC clone
YWSS3840. The chromosomes are counter stained red with PI. The probe
localisation can be seen as green spots coincident with chromosome 7q22. The
chromosome 7s are also shown enlarged in the lower left hand corner.
Figure 4. 19.
Three metaphase spreads A, B, and C showing fluorescent in situ hybridisation of the
YAC clones YWSS2050 (spread A), YWSS2717 (spread B) and YWSS2782 (spread
C). The chromosomes are counter stained red with PI.
A. The probe YWSS2050 localisation can be seen as green spots coincident with
the border of bands 7q22 and q31.
B. The probe YWSS2717 localisation can be seen as green spots coincident with
the bands 4q32-33.
C. The probe YWSS2782 localisation can be seen as green spots coincident with
the bands 3p14 and 13q22-31.
Figure 4. 20.
Standard hot start PCR of genomic (G) and YWSS3840 (Y) DNA with primers for
MUC3; 1. MUC323A and MUC323S, 2. MUC3INA and MUC3INS and 3.
MUC3FP1A and MUC3FP1S. The size markers are in base pairs.
In order to further characterise this clone agarose blocks containing DNA
from YWSS3840 and the cell line K562 were made for use in PFGE experiments.
Southern blots of K562 and YWSS3840 DNA digested with PvuII, NotI, SmaI and
SwaI were probed with SIB 124 (Fig. 4. 21). The genomic DNA fragments detected
were entirely consistent with those observed previously (Table 4. 4). The fragments
detected with the YAC however were consistently smaller than the genomic
fragments (Fig. 4. 21). Indeed it appears that the fragments from the YAC produced
by PvuII were so small that they have run off the end of the gel. The exact size of the
undigested YAC was not easily determined because there appeared to be two weak
bands of approximately 100kb and 200kb.
The YWSS3840 clone was also tested with primers designed from genomic
sequences of the genes EPO (HGMP ID No. 6029 and 6030), PAI1 (HGMP ID No.
6031 and 6032) and ACHE (HGMP ID No. 6033 and 6034). These genes were
selected because they were shown to map to the same region as MUC3 at the Second
International Chromosome 7 Workshop, although no information was available
concerning their positional relationships to each other. A product of the correct size
was detected with primers for ACHE (Fig. 4. 22) indicating that ACHE is in close
proximity to MUC3. However no product was detected with primers for EPO and
PAI1 on DNA from YWSS3840.
Figure 4. 21.
Autoradiograph of a PFGE Southern blot of K562 (G) and YWSS3840 (Y) DNA
digested with Pvu II, Not I, Sma I and Swa I probed with SIB 124 (MUC3). The size
markers are shown on the left hand side and are in kilobases.
Figure 4. 22.
Standard hot start PCR of genomic (G) and YAC YWSS3840 (Y) DNA samples with
primers for: 1. ACHE (HGMP ID No. 6033 and 6034), 2. PAI1 (HGMP ID No. 6031
and 6032) and 3. EPO (HGMP ID No. 6029 and 6030). The size markers are shown
on the left hand side and are in base pairs.
The analysis of this clone by PCR indicates that the 5' and 3' 'unique'
sequences of MUC3 are intact and that the gene ACHE may be within 100kb to
200kb of MUC3. The differences in size of the restriction fragments detected with
genomic DNA and the YAC, together with the two bands detected with the undigested
YAC, indicate that there may be a certain level of rearrangement or deletion of
sequences in the YAC, or two different YACs present in the same yeast cells.
The presence of sequence 5' to one set of tandem repeats and the fact that there are no
detectable PvuII fragments indicate that the small size of the fragments detected in the
clone is not due to a portion of MUC3 being located at the end of the insert. The most
likely cause is instability in the tandem repeat sequences resulting in the loss of repeats,
although rearrangements in the 'unique' sequences cannot be ruled out. As there is
evidence of rearrangement in YWSS3840 the close proximity of ACHE to MUC3
suggested by the presence of ACHE sequences in the YAC needs to be confirmed.
This could be done by probing PFGE Southern blots of genomic DNA digested with a
variety of 'rare' cutting restriction enzymes.
4.1.3.10. Cosmid clones
The two cosmid libraries screened were; a library of total genomic DNA
(Cachon-Gonzalez 1991) and an ICRF chromosome 7 only library (library no. 113
(L4/FS7)).
The total genomic library was screened using the SIB 124 repeat cDNA clone.
A total of 500 000 colonies were tested in the primary screen and 6 signals were
detected. Individual clones were then isolated at the secondary screening stage (Fig.
4. 23).
Figure 4. 23.
An example of an autoradiograph of a colony blot probed with SIB 124 from the total
genomic cosmid library (Cachon-Gonzalez 1991) at the secondary screening stage.
The two clones which gave the strongest signals were picked and cultured.
Southern blot analysis indicated that the clones did not contain intact copies of the
MUC3 gene because none of the fragments detected with various enzymes
corresponded to those detected with genomic DNA (data not shown). Fluorescent in
situ hybridisation gave an unexpected result i.e. DNA samples from the two clones
did not hybridise to chromosome 7 but a signal was detected on chromosome 8qter
(Fig. 4. 24). This may indicate the presence of a related gene in this region of
chromosome 8. However this did not seem particularly likely because the tandem
repeat sequences of mucin genes appear to be the least well conserved regions of the
genes. Furthermore no signal was detected in this region of chromosome 8 with any
of the other MUC3 clones used for in situ hybridisation experiments. These clones
were thus most probably chimaeric and were therefore not pursued further.
The gridded chromosome 7 library provided by the ICRF was also screened
using the SIB 124 clone however no positive clones were detected. The filters were
reprobed when the 5' cDNA clone SIB 172 became available but again no positive
clones were detected.
The positive result obtained with primers for ACHE with YWSS3840
suggested that MUC3 and ACHE might be in close enough proximity for cosmid
clones containing ACHE to also contain MUC3 sequences.
Two cosmid clones A- and p18D1-1 provided by Dr. Soreq and Dr. Getman
were tested with MUC3FP1A and S and MUC323A and S which correspond to
sequences 5' and 3' to the tandem repeat regions. However no amplification was
observed for either set of primers with cosmid clones A- or p18D1-1. These results
indicate that neither of the cosmids contains any MUC3 sequences (Fig. 4. 25).
The genomic clone GM3 was also tested for the presence of ACHE as it
contains approximately 15kb of sequence flanking the 3' end of MUC3, however no
amplification was observed (Fig. 4. 25).
Figure 4. 24.
Two metaphase spreads A and B showing fluorescent in situ hybridisation of the
cosmid clones MUC3C2 (spread A) and MUC3C6 (spread B). The chromosomes are
counter stained red with PI.
A. The probe MUC3C2 localisation can be seen as green spots coincident with
the band 8qter.
B. The probe MUC3C6 localisation can be seen as green spots coincident with
the band 8qter.
Figure 4. 25.
Standard hot start PCR of genomic (G), genomic MUC3 clone GM3 (M), ACHE
cosmids A- (A) and p18D-1 (P) with primers for: 1. ACHE (HGMP ID No. 6033 and
6034), 2. MUC3 (MUC323A and MUC323S) and 3. MUC3 (MUC3FP1A and
MUC3FP1S). The size markers are shown on the left hand side and are in basepairs.
4.1.4. Sequencing
In order to complement the sequence information obtained from the cDNA
clones vectorette PCR was used to obtain genomic sequence. Five vectorette
'libraries' (section 2.6.3.6) were constructed from genomic DNA digested with
BamHI, ClaI, AluI, EcoRI and HindIII ligated to the appropriate vectorette ends.
Genomic DNA was used because of the probable problem with rearrangements in
even the most hopeful YAC clones. Primers were designed from the sequence of
SIB 172 which contains ‘unique’ sequence 5' to a region of tandem repeats (Appendix
V).
A product (VEC1) of approximately 600bp was amplified using the specific
primer MUC3FP3A and universal vectorette primer (UVP) (Fig. 4. 26) from the
HindIII vectorette library. There was a certain amount of non specific product
associated with the distinct band (not shown) which was not present when this was
reamplified (Fig. 4. 27) using the specific primer MUC3FP2A which is 5' (nested) to
MUC3FP3A (Fig. 4. 26). In order to sequence VEC1 using the biotinylated
sequencing method described in section 2.7.1. the biotinylated primer B-MUC3FP2A
was used to produce a biotinylated PCR product. VEC1 was then sequenced from
both ends using the specific MUC3FP2A primer and the universal vectorette
sequencing primer (UVseqP) together with internal sequencing primers
(MUC3FP5S, MUC3FP660S, MUC3FP11A, MUC3FP12A, MUC3FP15A and
MUC3FP15) (Fig. 4. 26). This produced a contiguous sequence of 556bp (Fig. 4.
28).
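A conceptual translation of such sequence can be checked with a few lines of code. The sketch below uses the standard genetic code and, as an example, the first 60 nucleotides of the composite sequence shown in Fig. 4. 28; the function and variable names are illustrative assumptions.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

THREE_LETTER = {
    "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe", "G": "Gly",
    "H": "His", "I": "Ile", "K": "Lys", "L": "Leu", "M": "Met", "N": "Asn",
    "P": "Pro", "Q": "Gln", "R": "Arg", "S": "Ser", "T": "Thr", "V": "Val",
    "W": "Trp", "Y": "Tyr", "*": "Ter",
}

def translate(dna: str, frame: int = 0) -> str:
    """Translate `dna` in the given reading frame (0, 1 or 2) into three letter codes."""
    dna = dna.upper().replace(" ", "")
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(THREE_LETTER[CODON_TABLE[codon]] for codon in codons)

# Nucleotides 1-60 of the composite sequence (Fig. 4. 28).
first_60 = "CTTCACTTCTTCAACCAGTCTACTCCACAGCCAGCACACTACACAACTGCCATCACCTCA"
print(translate(first_60))
```

Translating in the other two frames (frame=1 or frame=2) is a quick way to confirm that only one frame remains open across a new stretch of sequence.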
Labels in Figure 4. 26: VEC4, VEC3, VEC1 and SIB 172; vectorette sequence; UVP. Primers: 1. MUC3FP7A, 2. MUC3FP6A, 3. MUC3FP10S, 4. MUC3FP5S, 5. MUC3FP5A, 6. MUC3FP4A, 7. MUC3FP660S, 8. MUC3FP11A, 9. MUC3FP12A, 10. MUC3FP1S, 11. MUC3FP15A, 12. MUC3FP2S, 13. MUC3FP2A, 14. MUC3FP3A, 15. MUC3FP1A, UVP. Universal Vectorette Primer
Figure 4. 26.
Diagrammatic representation of the composite vectorette and SIB 172 sequence
showing the direction and position of primers used for vectorette PCR and
sequencing.
Figure 4. 27.
Vectorette PCR products VEC1 (V1), VEC3 (V3) and VEC4 (V4) run on a 2% agarose
gel. The size markers are shown on the left hand side and are in base pairs.
  1 CTTCACTTCTTCAACCAGTCTACTCCACAGCCAGCACACTACACAACTGCCATCACCTCA 60
    C T
    LeuHisPhePheAsnGlnSerThrProGlnProAlaHisTyrThrThrAlaIleThrSer
 61 GTTCCCACTACGTTGGGTACCATGGTGACTTCTACATCCAGGATCTCATCTAGTGTGAGT 120
    C T C C C C
    ValProThrThrLeuGlyThrMetValThrSerThrSerArgIleSerSerSerValSer
    Met Pro Thr Leu
121 ACAGGTATCCCTACCT^Cj^CC^ 180
    A ~~ ’ T C
    ThrGlyIleProThrSerGlnProThrThrIleThrProSerSerValGlyIleSerGly
    Asp Thr
181 TCATTACCTATGATCACAGACCTCACCTCAGTTGTACACAGTCTC 240
    SerLeuProMetMetThrAspLeuThrSerValTyrThrValSerSerMetSerAlaArg
241 CC^Cf^GT?GTCATTCClT^ATCTCCCACroTCCAGAATACA 300
    ProThrSerValIleProSerSerProThrValGlnAsnThrGluThrSerIlePheVal
301 AGCATGATCTCTGCTACCACTCCCAGTCGAGGATC^CTTTCAC^GT^ 360
    SerMetMetSerAlaThrThrProSerGlyGlySerThrPheThrSerThrGluAsnThr
361 CC^Cf^GGTCCCTCCTGAC^GCTTTCCAGT^CACAlT^ 420
    ProThrArgSerLeuLeuThrSerPheProValThrHisSerPheSerSerSerMetSer
421 GCCAGCAGTCTAGGGACCACTCACACCCAGAGTATCTCCTCACCCCCAGCCATCACCAGT 480
    AlaSerSerValGlyThrThrHisThrGlnSerIleSerSerProProAlaIleThrSer
481 ACACTCCACACAACAGCTGAATCCACCCCATCACCTACAACCACCATGTCATTCACAACA 540
    ThrLeuHisThrThrAlaGluSerThrProSerProThrThrThrMetSerPheThrThr
541 TTTACAAAGATGGAAACACCTTCATCCACTGTAGCAACTACAGGCACAGGTCAGACTACA 600
    PheThrLysMetGluThrProSerSerThrValAlaThrThrGlyThrGlyGlnThrThr
601 TTCACCAGTTCAACAGGCACATCCCCTAAGACCACCACACTGACTCCTACCTCTGACATT 660
    PheThrSerSerThrAlaThrSerProLysThrThrThrLeuThrProThrSerAspIle
661 TCCACAGGATCTTTCAAAACAGCCGTGAGTTCTACTCCCCCCATCACTTCTTCAATCACC 720
    SerThrGlySerPheLysThrAlaValSerSerThrProProIleThrSerSerIleThr
721 T^QhQhThThQQQTQhQTTCGATGACAACTACCACCCCTCTAGGGCCCACAGCCACTAAT 780
    SerThrTyrThrValThrSerMetThrThrThrThrProLeuGlyProThrAlaThrAsn
781 A^TTAQCAT(^TTTAKASTAGCGTTTCATCTTCTACGCCTGTCCCAAGTACAGAAGCG 840
    ThrLeuProSerPheThrSerSerValSerSerSerThrProValProSerThrGluAla
841 ATCACGAGTCGTACCACAAACACCaCœCTCTAŒrTACArroGTtSACCACATirrrCCAAT 900
    IleThrSerGlyThrThrAsnThrThrProLeuSerThrLeuValThrThrPheSerAsn
901 TCCGACACCAGTTCTACACCTAC^TCTGACiACCACCTACCCyrACyrTCTCTTACTAa'Trx-^ 960
    SerAspThrSerSerThrProThrSerGluThrThrTyrProThrSerLeuThrSerAla
961 CTCACAGATTCCACGACCAGAACNACCTATTCCA 994
    LeuThrAspSerThrThrArgThrThrTyrSer
Figure 4. 28.
Composite sequence of VEC1, VEC3, VEC4 and SIB 172.
Two more primers were designed from the 5' end of the VEC1 sequence, i.e.
MUC3FP4A and the nested primer MUC3FP5A (Fig. 4. 26). Using these primers a
product (VEC3) of approximately 350bp was amplified from the HindIII vectorette
library (Fig. 4. 27). This was reamplified using MUC3FP5A together with the
biotinylated universal vectorette primer (B-UVP). This product was sequenced from
both ends using the specific primer MUC3FP5A and UVseqP together with internal
sequencing primers (MUC3FP10S and MUC3FP6A) (Fig. 4. 26). This produced a
contiguous sequence of 358bp which had a 21bp overlap with the VEC1 sequence
(Fig. 4. 28).
The product VEC4 (approximately 150bp) was amplified from the AluI
vectorette library using primers designed from the 5' end of VEC3, i.e. MUC3FP6A
and the nested primer MUC3FP7A (Figs. 4. 25 and 4. 24). The product was
sequenced from both ends using the cycle sequencing method described in section
2.7.2. and the primers MUC3FP7A and UVseqP. A sequence of 207bp was obtained
with an overlap of 106bp with the VEC3 sequence (Fig. 4. 28).
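The way in which overlapping vectorette products are joined into a composite sequence can be illustrated with a short Python sketch. The function names and the toy reads are hypothetical and are not the software or data used in this work; only the lengths of 207bp and 358bp and the 106bp overlap are taken from the text.

```python
def merge_with_overlap(upstream: str, downstream: str, min_overlap: int = 3) -> str:
    """Join two reads where the 3' end of `upstream` matches the 5' end of `downstream`."""
    longest = min(len(upstream), len(downstream))
    # Try the longest possible overlap first and shorten it until one fits.
    for size in range(longest, min_overlap - 1, -1):
        if upstream.endswith(downstream[:size]):
            return upstream + downstream[size:]
    raise ValueError("no overlap of at least %d bases found" % min_overlap)


def contig_length(read_lengths, overlaps):
    """Length of a contig built from consecutive reads sharing the given overlaps."""
    if len(overlaps) != len(read_lengths) - 1:
        raise ValueError("need exactly one overlap per adjacent pair of reads")
    return sum(read_lengths) - sum(overlaps)


# Toy reads (hypothetical sequences) sharing the four bases TGCA.
print(merge_with_overlap("ACGTTGCA", "TGCAGG"))
# VEC4 (207bp) joined to VEC3 (358bp) through their 106bp overlap.
print(contig_length([207, 358], [106]))
```

Exact suffix matching is the simplest possible choice; real sequence assembly would also have to tolerate base-calling errors in the overlap.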
The sequences of VEC1, 3 and 4 form a contiguous sequence of 994bp which
extends 739bp 5' to the SIB 172 sequence and has now been sequenced completely on
both strands (Fig. 4. 28). The single open reading frame codes for a 331 amino acid
polypeptide which is rich in threonine (29.6%), serine (21.8%) and proline (9.3%),
which together account for 60.7% of amino acids in the peptide (Fig. 4. 28). It is
interesting to note that in the VEC4 nucleotide sequence shown in Figure 4. 26 there
are a number of positions where it was not possible to distinguish between two
alternative nucleotides. In a number of cases the alternative nucleotides also lead to
an alternative amino acid but do not create any stop codons (Fig. 4. 28). This
suggests that two distinct species are present in the PCR product, both of which are
very similar at the nucleotide level. The nucleotide sequence and both versions of the
peptide sequence have been used to search the sequence databases at the HGMP
resource centre, which include GenBank and SwissProt. However no significant
homologies were detected with any of the DNA or protein sequences in these
databases. Also both the peptide sequences and nucleotide sequence were analysed
using the program ‘Repeat’ but no significant repetitive structure was found.
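The figures quoted for the open reading frame can be checked with some simple arithmetic, sketched below in Python. The helper names and the ten-residue peptide are hypothetical and serve only to show how a composition is calculated; the 994bp length and the three percentages are those given above.

```python
from collections import Counter


def codons_in(n_bases: int):
    """Return (complete codons, leftover bases) for a reading frame of n_bases."""
    return divmod(n_bases, 3)


def percent_composition(peptide: str) -> dict:
    """Percentage of each residue in a peptide given in one-letter code."""
    counts = Counter(peptide)
    return {aa: 100.0 * n / len(peptide) for aa, n in counts.items()}


# 994bp gives 331 complete codons with one base left over.
print(codons_in(994))
# The three quoted percentages add up to the quoted total.
print(round(29.6 + 21.8 + 9.3, 1))
# Hypothetical ten-residue peptide, used only to show the calculation.
toy = percent_composition("TTSPTSATPS")
print(toy["T"], toy["S"], toy["P"])
```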
4.2. Discussion
The MUC3 polymorphisms detected with SIB 124 on DNA digested with PstI
and PvuII indicate that there are two separate regions of tandem repeats separated by
‘unique’ sequence, in which PstI and PvuII sites are located, and that at least part of
the variation is due to VNTR. However it seems that the PstI polymorphism is not
simply the result of VNTR variation but may also result from a polymorphic flanking
PstI site. The most likely interpretation is that there is an additional polymorphic PstI
site closer to the tandem repeats. This is reminiscent of the situation with MUC2 where
there are polymorphic restriction sites as well as the VNTR polymorphism, although
in the case of MUC2 they are within the tandem repeats (Toribara, Gum et al. 1991).
Like many of the mucin genes, MUC3 shows a high level of variation which is
illustrated by the large number of alleles observed, although the resolution of the
Southern blots used did not allow the precise number to be determined. It seems
likely that the generation and maintenance of the VNTR polymorphism would be due
to mutational processes such as unequal sister chromatid exchange as discussed in
section 1.1. It is worth noting that no new mutations were observed in the 40 CEPH
EUROGEM families tested.
The high variability of MUC3 meant that it was an excellent marker for
genetic mapping. Although MUC3 had been physically assigned to chromosome
7q22 very little was known about the relative order and distances of the markers in
this region especially with respect to MUC3 at the outset of this project. The initial
intention for generating the map of chromosome 7 shown in Figure 4. 3 was to try
and create a map with as many markers as possible and to try and integrate markers
from the GENETHON and NIH/CEPH maps (1992; Weissenbach, Gyapay et al.
1992). It was also hoped that the markers included in the region containing MUC3
might prove useful for long range physical mapping. When the order of the markers
shared between the map I had constructed and each of the published maps was
compared, the orders were found to be in agreement (Appendix VI). Also the
order of the markers which had been mapped using physical methods to particular
regions on chromosome 7 agreed with that predicted from my map (Fig. 4. 3).
However the genetic distances of the markers flanking MUC3 suggested that they
were probably separated by large physical distances and would not be useful for
PFGE.
Using the order from the chromosome 7 map shown in Figure 4. 3 a map of
the q arm was constructed when version 7.1 of the CEPH database became available
which had improved data for a number of the markers on version 6 and many new
markers (Fig. 4. 4). This map included a number of new markers which map closer to
MUC3 such as D7S1493, D7S515, UT7164 and the genes COL1A2 and MET (Fig. 4.
4). Unfortunately the two genes (COL1A2 and MET) which would be the most
interesting for long range physical mapping are still a considerable genetic distance
from MUC3, i.e. COL1A2 is 6.6 cM and MET is 21.7 cM from MUC3, and are thus
not suitable. However this map together with maps from other laboratories was used
to construct a consensus map of chromosome 7 at the second International Workshop
(Tsui, Donis-Keller et al. 1995).
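To give a feel for what genetic distances such as 6.6 cM and 21.7 cM imply in terms of recombination, the sketch below converts map distance to recombination fraction using the Haldane and Kosambi map functions. Which map function underlies the distances quoted here is not stated in this section, so applying either of them to these figures is an assumption made purely for illustration.

```python
import math


def haldane_theta(cm: float) -> float:
    """Recombination fraction from map distance (cM), assuming no interference."""
    return 0.5 * (1.0 - math.exp(-2.0 * cm / 100.0))


def kosambi_theta(cm: float) -> float:
    """Recombination fraction from map distance (cM) under Kosambi's function."""
    return 0.5 * math.tanh(2.0 * cm / 100.0)


# Distances from MUC3 as quoted in the text.
for locus, cm in [("COL1A2", 6.6), ("MET", 21.7)]:
    print(locus, round(haldane_theta(cm), 3), round(kosambi_theta(cm), 3))
```

For short distances both functions give a recombination fraction close to the distance in Morgans; they diverge as the distance increases.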
It is interesting to note that the maps of chromosome 7 calculated for each sex
are of different lengths, i.e. male 224.2 cM and female 342.7 cM (Fig. 4. 3), although
the total map length is likely to have been increased by the inclusion of errors. The
longer female map reflects the fact that on average there is more recombination in
females than males throughout the genome. This is in accordance with Haldane's
observation that crossing over is more frequent in the homogametic sex, e.g. XX, than
in the heterogametic sex, e.g. XY (Haldane 1922). There are however regional
differences in the relative recombination rate and it is possible that this will also
prove to be the case for chromosome 7 (Keith et al. 1990). The distances between loci
towards the telomeres of the male map appear to increase, which indicates more
recombination, whereas they seem more evenly spread along the chromosome in the
female map. This may be an indication of a similar phenomenon to that observed
with chromosome 11p15, chromosome 9, mouse chromosome 2 and chromosome 21
as discussed in section 3.5.
The physical mapping of the region of chromosome 7 to which MUC3 has
been localised has proved quite difficult, which is reflected in the lack of large genomic
clones such as YACs and cosmids for this region. A number of genes were located in
the same region of chromosome 7 in the second International Workshop report for
which the relative order was not known (1992; Weissenbach, Gyapay et al. 1992).
This group included MUC3, PAI1, ACHE and EPO. Because of the lack of physical
mapping resources the most straightforward method was to use genetic information
for mapping loci by comparison with panels of defined meiotic breakpoints as
described in section 3.5. One difference however was the use of the program
‘CROSSFIND’ written by John Attwood in this laboratory (Attwood and Povey
1996) which has been designed to create breakpoint diagrams using the entire map of
a chromosome. The map used was based on the consensus order of the whole of
chromosome 7 which includes MUC3 but not PAI1, EPO and ACHE. The most
suitable candidate for mapping in this way was PAI1 because an extremely variable
tetranucleotide repeat polymorphism had already been described for this gene (GDB
accession ID: GDB:512834). PAI1 had already been included in a map of
chromosome 7 by the Donis-Keller group under its old name PLANH1 (Tsui,
Donis-Keller et al. 1995). However these workers had not been able to insert MUC3
into the same map but had shown that it probably mapped somewhere in the same
region. Comparison of the results obtained from analysis of the PAI1 polymorphism
with the defined meiotic breakpoints shows that the most likely position for PAI1 is
between MUC3 and D7S554.
This demonstrates a rapid and relatively straightforward method of genetic
mapping. Although this method does not give an indication of the genetic distances it
may be possible to use the order determined by this method together with a program
such as CRI-MAP. However because the families used in this analysis were not
selected randomly but on the basis that they contained a recombination, map
distances based on these data would be artificially high. The close proximity of PAI1
to MUC3 indicates that this might possibly be a useful marker for PFGE analysis.
These panels of chromosomes will also be useful for the future mapping of other loci
on chromosome 7 and in particular the precise mapping of markers within the same
region as MUC3 and PAI1.
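The logic of placing a new locus against a panel of defined meiotic breakpoints can be sketched in a few lines of Python. This is not the CROSSFIND program and makes no claim about how that program works; it is a simplified illustration which assumes a known marker order, fully informative markers and exactly one crossover per recombinant chromosome. The function names and the panel data are hypothetical.

```python
def crossover_slot(phases: str) -> int:
    """Return k when the single crossover lies between marker k-1 and marker k."""
    switches = [k for k in range(1, len(phases)) if phases[k] != phases[k - 1]]
    if len(switches) != 1:
        raise ValueError("expected exactly one crossover per chromosome")
    return switches[0]


def candidate_slots(n_markers: int, panel) -> list:
    """Slots (0 = before the first marker, n = after the last) consistent with the panel.

    Each panel entry is (phases, locus_phase): the grandparental origin ('A' or 'B')
    at every ordered marker and at the new locus on one recombinant chromosome.
    """
    possible = set(range(n_markers + 1))
    for phases, locus_phase in panel:
        k = crossover_slot(phases)
        if locus_phase == phases[0]:
            # The locus shares its phase with the first marker, so it cannot lie
            # beyond the slot containing the crossover.
            possible &= set(range(0, k + 1))
        else:
            # The locus shares its phase with the last marker.
            possible &= set(range(k, n_markers + 1))
    return sorted(possible)


# Hypothetical panel of three recombinant chromosomes over four ordered markers.
panel = [("AABB", "A"), ("ABBB", "B"), ("BBAA", "A")]
print(candidate_slots(4, panel))
```

Each recombinant chromosome excludes the slots on one side of its breakpoint, so a small number of well placed breakpoints is enough to narrow the position of the new locus to a single interval, although, as noted above, no genetic distance is obtained.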
The genetic analysis of MUC3 shows that the two sets of polymorphic bands
detected with PstI and PvuII are tightly linked, indicating that the two sets of tandem
repeats are in close proximity to each other. This suggests two possible scenarios:
one is that there is a single MUC3 gene with two regions of tandem repeats separated
by unique sequence, the other is that there are two genes. Attempts were made to
investigate the physical structure of the MUC3 gene locus using a variety of
techniques.
The physical mapping of MUC3 has been complicated by the lack of genomic
clones. However a certain amount of information has come from the use of the cDNA
clones (SIB 124, clone 20 and SIB172U) for Southern blot analysis of genomic DNA
separated using PFGE and standard gel electrophoresis.
In situ hybridisation using probes for the tandem repeat cDNA clone SIB 124,
the ‘unique’ sequence cDNA clone 23 and the genomic clone GM3 indicates that
MUC3 maps to a single locus on chromosome 7q22. Furthermore all the MUC3
sequences appear to be located on a single 400kb NotI fragment and a single 200kb
Swa I fragment (Fig. 4. 12). However the Swa I fragment is the smallest single
fragment that has been detected and may indicate that the MUC3 genetic region
covers a large region of DNA. There does not appear to be any very large scale
interindividual variation in this region and the variation due to VNTR is presumably
too small to be detected given the resolution of PFGE (Fig. 4. 13). The NotI sites
may indicate the presence of CpG islands as a similar size fragment is also detected
with BssHII (Table 4. 4). CpG islands are often associated with the 5’ regions of
genes (Craig and Bickmore 1994). If MUC3 has a CpG island associated with its 5’
end then the fact that only a single fragment is detected with NotI and BssHII might
hint that there is either a single MUC3 gene in which the duplicated sequences are
tandemly arrayed or that the two genes are inverted with respect to each other.
PFGE also shows that two fragments are detected with SIB 124 on genomic
DNA digested with other enzymes such as SmaI (80kb and 45kb) and SfiI (160kb and
85kb) (Table 4. 4). These fragments may indicate the presence of cut sites for these
enzymes in the sequences flanking and between the tandem repeat regions. The
multiple fragments detected with the enzymes SacII, NruI and Mlu I also indicate
multiple cut sites within MUC3. However the intensity of the bands is variable and
overall the hybridisation does not appear to be as good as with the other enzymes
(Fig. 4. 11). Since the DNA blocks were all made at the same time and should be
virtually identical in their DNA content this may indicate that the digestion of the
agarose embedded DNA was not complete, possibly due to technical reasons.
However it should be noted that some of these enzymes are methylation sensitive and
methylation of the DNA may have resulted in partial digestion to produce the
multiple fragments detected.
When Southern blots of genomic DNA digested with PvuII, PstI and HindIII,
run under standard conditions, were probed with clone 20 and SIB172U a number of
bands were detected as well as the larger polymorphic bands detected with SIB 124
(Fig. 4. 10). When PvuII digests are used, clone 20 and SIB172U detect both sets
of polymorphic bands detected by SIB 124, which would suggest that these ‘unique’
sequences are also repeated. The common 1.8kb PvuII fragment detected by both
clone 20 and SIB172U would seem to indicate that these clones, or sequences very
similar to them, are physically close.
It should also be noted that there is some variation of the size of these
additional bands with PstI and this may be related to the proposed polymorphic PstI
site as no such variation is observed with PvuII.
The 5’ end of clone 20 shares identical overlapping sequence with clone 23
which contains a number of tandem repeats at its 5’ end, whereas the SIB172U
‘unique’ sequence is located 5’ to one of the regions of tandem repeat. This may
indicate that the copies of clone 20-like sequence are associated with the 3’ ends of
both tandem repeat regions and that SIB 172-like sequences are associated with the 5’
ends. Also the close proximity of clone 20-like and SIB 172-like sequences indicated
by the 1.8kb PvuII fragment suggests that the MUC3 duplicated sequences are
tandemly arrayed. If these are two tandemly arrayed genes, the two genes must be
extremely close, no more than 0.5 to 1.0kb apart. It is thus perhaps more probable
that there is a single gene with tandemly arrayed internal duplications. There is
precedent for this in the case of MUC5AC which appears to show two or more major
regions of tandem repeat and multiple copies of certain cysteine rich regions
(Guyonnet-Duperat, Audie et al. 1995).
It is interesting to note that clone 20 probably corresponds to the 3’ end of a
MUC3 gene due to the presence of a stop codon and a long untranslated region. Also
more recently a polyadenylation signal has been identified in sequence from the
genomic clone GM3 which contains identical sequence corresponding to the whole of
the clone 20 cDNA (Jim Gum personal communication). Indeed when GM3 was
tested with primers designed from SIB172U sequence no amplification was detected
(Fig. 4. 25). This indicates that clone 20 and GM3 correspond to the 3’ end of the
entire MUC3 genetic region.
A speculative model of MUC3 based on this and other data presented in this
thesis is shown in Fig. 4. 29. The whole MUC3 genetic region is contained within a
200kb Swa I fragment. It is proposed that the region between the tandem repeats
contains sequences similar to those in clone 23, clone 20 and SIB 172. A PvuII and a
HindIII restriction site are located upstream of the SIB 172 sequence and were
identified by the vectorette sequencing described earlier. The structure of the 3' end is
based on that described earlier with the two PstI and PvuII sites present in the
sequence shown in their approximate locations. All the other restriction sites shown
in the diagram are hypothetical and their relative positions are not based on actual
physical distances. The most 3' HindIII site has been placed outside of the known
sequence as clone 20 only detects the same two fragments of 20+kb and 12+kb
detected with SIB 124. The polymorphic PstI site has been placed between the two
regions of tandem repeats although it may be present in the flanking DNA.
Figure 4. 29
Diagrammatic representation of the speculative model of MUC3.
[Diagram labels: probes SIB 172U, SIB 124, Clone 20 and Clone 23; two regions of
Tandem Repeats; restriction sites Swa I, Pst I, Pvu II and Hind III.
Key: Tandem Repeats; 'Unique' cDNA Sequence; Hypothetical Sequence Comprising
Coding and Non Coding; Region Covered by DNA Probe; Possible Region Covered by
DNA Probe; Intron.]
Obtaining genomic sequence information and the determination of the genetic
structure of MUC3 have been hampered by the lack of genomic clones. Until recently
the only genomic clone of MUC3 available was the 3’ clone GM3.
A significant effort has been made in this laboratory and by other groups to
screen a number of cosmid and YAC libraries to obtain large MUC3 genomic clones.
This has met with limited success with the recent isolation of the YAC clone
YWSS3840. The other YAC clones analysed all turned out to be rearranged or
mapped to different chromosomes or chromosomal regions (Figs. 4. 15 and 4. 17).
The instability of YAC clones is a widely recognised problem. Indeed the proportion
of chimaeric clones in some YAC libraries has been estimated to be as much as
60%. However it was disappointing that the cosmid clones MUC3C2 and MUC3C6,
which initially appeared promising, localised to chromosome 8pter, suggesting that
these clones were also rearranged and that the instability may be a feature of this
genomic region, possibly the MUC3 gene itself.
The clone YWSS3840 was localised to chromosome 7q22 using FISH (Fig. 4.
18). Analysis of this clone using three pairs of primers showed that it contained
sequences corresponding to clone 20, clone 23 and SIB 172. However when DNA samples
of the clone digested with a variety of enzymes were compared with genomic DNA
digested with the same enzymes on PFGE blots it was obvious that the MUC3 gene
or genes were not intact (Fig. 4. 20). The fragments from the clone detected with the
repeat probe SIB 124 were consistently smaller than those of genomic DNA.
Furthermore both the PvuII fragments appear to have run off the end of
the gel, suggesting that the most likely explanation is that the tandem repeat
sequences are unstable, leading to loss of these sequences. If this is the case then the
‘unique’ MUC3 sequences may conceivably be intact, together with the flanking
genomic regions.
As has been mentioned earlier a number of genes have been located in the
same region as MUC3 but little was known about their relative positions. This YAC
clone was tested using primers for a selection of these genes, i.e. ACHE, EPO and
PAI1 (which has been mapped genetically as described earlier). A product was
amplified with the pair of primers corresponding to the ACHE gene. The undigested
YWSS3840 is 100 to 200kb in size which indicates that the ACHE gene is located
within 100 to 200kb of MUC3 and may in fact be physically closer than PAI1.
Two cosmid clones containing the ACHE gene have been isolated by two
groups. Samples of DNA from these two clones, cosmid A- (Gnatt et al. 1991) and
p l 8 D -l (Getman, Eubanks et al. 1992), were tested with primers from the SIB 172
and the 3’ end of MUC3 but no amplification was observed (Fig. 4. 25). This
indicates that MUC3 sequences are not present within these clones. Indeed GM3 was
also tested with primers for ACHE and again no amplification was observed (Fig. 4.
25).
The lack of genomic clones has meant that most of the MUC3 sequence has
come from cDNA clones such as SIB 172, SIB 124, clone 20 and clone 23 and some
genomic sequence from GM3. In order to determine the intron/exon structure of MUC3,
genomic sequences are required to compare with the cDNA sequence. The GM3
sequence has been used to determine the intron/exon structure of the 3’ end of
MUC3. However no genomic sequence had been obtained from the 5’ side of any of
the tandem repeats. The traditional method would have been to subclone the YAC
into a suitable vector and screen this with the various probes. However this presented
certain problems, not least of which was the small size of the MUC3 gene in the
YAC. Also it appears that this region of the genome is difficult to clone and this may
well affect the subcloning.
Vectorette PCR offered a method of obtaining unknown genomic sequence in
a directed way from specific sequences. It avoids the problem of rearrangement or
deletion of sequences in the YAC as total genomic DNA was used as the template.
This proved to be a relatively successful approach and a contiguous sequence of
994bp was generated which extended the SIB 172 sequence by 739bp in the 5’
direction (Fig. 4. 28). This sequence has a single open reading frame which codes for
a 331 amino acid peptide, which indicates that there are no introns in this sequence.
The peptide is rich in threonine (29.6%), serine (21.8%) and proline (9.3%), which is
characteristic of mucin glycoproteins. The results of database searches also indicate
that this is indeed novel sequence. Also there does not appear to be any repeat
structure or motifs in either the nucleotide or peptide sequences.
It is interesting to note that in the region covered by the VEC4 product there
are a number of nucleotide positions where it was not possible to distinguish between
two different nucleotides (Fig. 4. 28). In some instances the alternative nucleotides
result in alternative possible amino acids but not a stop codon. This may indicate that
there are two distinct species in the vectorette PCR product which share a high level
of similarity but are from different, coding, parts of the gene. This would seem to fit
with the other results which indicate the presence of more than one copy of the
‘unique’ sequences.
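The effect of an ambiguous nucleotide on the encoded residue can be explored with the sketch below, which expands a codon written with IUPAC ambiguity codes into every codon it could represent and translates each with the standard genetic code. The function names and the example codons are hypothetical and were chosen only to illustrate the check; they are not positions read from Fig. 4. 28.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code in one-letter notation; '*' marks a stop codon.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
# IUPAC nucleotide codes and the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def possible_residues(codon: str) -> set:
    """All residues an ambiguous codon could encode."""
    options = [IUPAC[base] for base in codon.upper()]
    return {CODON_TABLE["".join(bases)] for bases in product(*options)}


def may_encode_stop(codon: str) -> bool:
    """True if at least one reading of the codon is a stop codon."""
    return "*" in possible_residues(codon)


# A or G at the first position gives either methionine or valine, never a stop.
print(sorted(possible_residues("RTG")), may_encode_stop("RTG"))
# A or G at the second position of TNG-type codons can introduce a stop.
print(may_encode_stop("TRG"))
```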
Indeed a number of cDNA clones have recently been isolated and sequenced by
Jim Gum which appear to have varying degrees of similarity to the vectorette
sequence (Fig. 4. 30). These clones are very similar in their sequences but can be
divided into three groups on the basis of the differences between them. The clones
SIB 172, SIB219, SIB223, SIB221 and SIB211 show almost 100% similarity to the
vectorette sequence and are probably clones from the same region of the gene. The
clone SIB217 can probably be included in this group due to the 100% identity of 165
nucleotides at the 3’ end even though the sequence of the remaining 108 nucleotides
at the 5’ end is not identical. This may indicate that the clone is chimaeric, as in the
case of SIB219 which was found to contain a portion of a mitochondrial sequence at
its 5’ end. However when the databases were searched with SIB217 no significant
similarities were found. The sequences of the clones in the second group, SIB236
and SIB227, overlap by 124 nucleotides which show 100% identity. The clones
SIB209 and SIB235 are probably the same clone as each other due to their identical
sequence and length and they comprise the third group.
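The percentage identities used to group the clones can be calculated from aligned sequences as sketched below. The function is a hypothetical illustration, not the alignment software used here, and it assumes that the two sequences have already been aligned to equal length, with '-' marking a gap and any column containing a gap counted as a difference. The toy sequences are invented.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned columns in which both sequences carry the same base."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, non-empty and of equal length")
    matches = sum(
        1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a == b and a != "-"
    )
    return 100.0 * matches / len(seq_a)


# One substitution in eight aligned positions.
print(percent_identity("ACGTACGT", "ACGTACGA"))
# One single-base gap in six aligned positions.
print(percent_identity("ACG-TT", "ACGATT"))
```

Counting gapped columns as differences is only one possible convention; excluding them from the denominator would give slightly higher values for sequences that differ by insertions or deletions.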
Figure 4. 30.
Sequence alignments of the sequences SIB 172, SIB219, SIB223, SIB221, SIB217,
SIB236, SIB227, SIB209, SIB235 and the VEC.COMP sequence which is comprised
The sequences of these 3 groups of clones share a high level of similarity but
are distinguishable from each other by a number of substitutions, insertions and
deletions (Fig. 4. 30). It seems unlikely that these differences are due to
polymorphism, given the number of differences, or to errors in sequencing, as in each
case there are at least two overlapping clones or the same clone has been sequenced twice.
It also seems unlikely that the differences are cloning artefacts as they are spread
evenly along the sequences. The high level of similarity between these sequences
means that they would all hybridise to the SIB172U probe.
This evidence together with the vectorette sequence and the Southern analysis
strongly supports the notion that the so called ‘unique’ sequences are in fact repeated
and possibly more than twice. The most likely explanation is that this region of DNA
has undergone at least one large duplication event of an ancestral MUC3 gene with
possibly other small scale duplications as well.
The precise role of the mucin encoded by the MUC3 gene is unknown.
MUC3 appears to be expressed in the small intestine (Fig. 4. 15), in both goblet cells
and villus columnar cells, although the expression appears to be higher in the
columnar cells (Lesuffleur, Zweibaum et al. 1994). It does not appear to be highly
expressed in the colon (Fig. 4. 15) (Lesuffleur, Zweibaum et al. 1994) and probably
does not form a significant component of colonic mucus. Indeed MUC2 appears to be
the mucin gene predominantly expressed in the colon.
The heterogeneity of the mucus preparations together with the high level of
glycosylation has meant that direct estimates of the size of the mucin peptide
backbones have not been possible. Indeed it is only recently becoming possible to
determine which specific mucins are present in mucus preparations. In the case of
MUC1, MUC2 and MUC7, for which complete cDNA sequences have been
published, it is possible to deduce the size of the peptide, i.e. MUC1 encodes a
peptide of between 874 and 2954 amino acids, the most common allele of MUC2
codes for a protein containing some 5100 residues and MUC7 a protein of about 780
amino acids (Gendler, Lancaster et al. 1990; Gum, Hicks et al. 1994; Bobek, Liu et al.
1996). However only partial cDNA clones have been isolated for the other mucin
genes including MUC3. Accurate estimates of the size of the MUC3 mRNA have
proved difficult to obtain due to the ‘polydisperse’ transcripts detected on Northern
blots. These smeared signals appear to be a common feature of mucin genes although
the cause is unknown and may merely be due to degradation; however mechanisms
such as alternative splicing cannot be ruled out.
A major transcript of approximately 13kb has been detected with MUC3 on
Northern blots of RNA from the cell line HT29 (Lesuffleur et al. 1993), which would
correspond to a protein of about 4330 residues (approximate Mr of 400 000 to 500
000). It is interesting to note that if the VNTR regions were transcribed in their
entirety the differences in size between the various alleles would be larger than the 13kb
transcript detected in HT29 cells. Thus it seems likely that the tandem repeats are
interrupted by an intron or introns.
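The relationship between the 13kb transcript, the figure of about 4330 residues and the approximate Mr can be reproduced with a short calculation. The sketch below treats the whole transcript as coding, which gives an upper bound, and uses an average residue mass of 110 Da; that value is a rough general-purpose average assumed here for illustration and is not taken from this thesis.

```python
AVERAGE_RESIDUE_MASS = 110.0  # Da; assumed rough average for an amino acid residue


def max_residues(transcript_bases: int) -> int:
    """Upper bound on the number of residues if the whole transcript were coding."""
    return transcript_bases // 3


def approx_mass(residues: int, residue_mass: float = AVERAGE_RESIDUE_MASS) -> float:
    """Approximate mass in daltons of the unglycosylated peptide."""
    return residues * residue_mass


residues = max_residues(13_000)
print(residues, approx_mass(residues))
```

Under these assumptions the estimate falls within the range of 400 000 to 500 000 quoted above; the untranslated regions of the transcript would reduce it somewhat.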
A possible model for this is the FIM-B.1 and FIM-C.1 mucins which have
repetitive elements encoded by separate exons (Hoffmann and Hauser 1993),
although in the case of MUC3 it seems more likely that the repeats are in clusters
separated by introns. This model raises the possibility of a higher order of repeats in
which the repeat unit is comprised of an exon containing a number of tandem repeats
together with an intron. Thus not only would it be possible for there to be variation in
the number of 51 bp repeats in the exon but also variation in the number of
exon-intron repeats.
5. General Discussion
As described in section 1.3 of the Introduction of this thesis the biochemical
analysis of mucins has proved difficult due to their large size and enormous
heterogeneity. The isolation of cDNA clones corresponding to different mucin genes
led to a certain amount of optimism that DNA cloning would increase the rate of
progress in the understanding of these glycoproteins. To a certain extent this has
happened. So far at least seven human mucin genes have been cloned and expression
studies show that the mucus gels secreted by many tissues are composed of more than
one mucin, such as in the small intestine where both MUC2 and MUC3 appear to be
expressed (Lesuffleur, Zweibaum et al. 1994). This together with the high level of
genetic polymorphism found in many of the mucin genes and their glycosylation
presumably accounts for a significant proportion of the heterogeneity of mucus gels.
Also the determination of partial cDNA sequences for these genes and complete
sequences in the case of MUC1, MUC2 and MUC7 has enabled the sequence of the
peptide backbones to be deduced. This has led to the production of highly specific
monoclonal antibodies using synthetic peptides as antigens, as in the case of MUC2
and MUC5AC (Durrant et al. 1994; Hovenberg et al. 1996), which can be used for
identifying the components of the mucus itself.
It seemed that the isolation of cDNA clones would rapidly lead to the isolation
of genomic clones and thus provide tools for the analysis of the genomic structure of
the mucin genes. However the analysis of the genomic structure of these genes has
by no means been straightforward. The isolation of large scale clones such as
cosmids and YACs has been especially troublesome. Indeed it is worth noting that
there are still no cosmid or YAC contigs covering the chromosomal regions which
contain MUC3 and the cluster on 11p15. Specifically in the case of MUC3, YAC
libraries were extensively screened and although a few clones were isolated these
were shown to be unstable, including the clone containing genuine MUC3 sequences.
It may be that a greater effort could have been made in obtaining cosmid clones
but the nature of the construction of the libraries using the Sau3A partial digestion
and the presence of Sau3A sites in each of the tandem repeats of MUC3 did not bode
well. Also the very nature of the mucin genes and in particular the tandem repeat
regions may be responsible for instability of these sequences when cloned. Indeed
the instability of repetitive sequences in clones even when recombination suppressed
cell lines are used has been widely recognised. Unfortunately at the time this work
was carried out libraries constructed using other vectors such as P1 and BACs, which
may have proved to be more suitable, were not available.
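The reason why a Sau3A site in every tandem repeat is a problem for library construction can be seen from the sketch below, which locates the Sau3A recognition sequence (GATC) in a sequence and lists the fragments produced by complete digestion. The 12 base repeat unit is invented for illustration and is not the MUC3 repeat; the function names are likewise hypothetical.

```python
SAU3A_SITE = "GATC"  # Sau3A recognition sequence; the enzyme cuts 5' of the G


def find_sites(sequence: str, site: str = SAU3A_SITE) -> list:
    """0-based start positions of every occurrence of `site` in `sequence`."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


def complete_digest(sequence: str, site: str = SAU3A_SITE) -> list:
    """Fragment lengths after cutting at the start of every site."""
    cuts = [0] + find_sites(sequence, site) + [len(sequence)]
    return [b - a for a, b in zip(cuts, cuts[1:]) if b > a]


# Hypothetical 12 base repeat unit containing one site; NOT the MUC3 repeat.
unit = "AAGATCTTCCGG"
array = unit * 3
print(find_sites(array))
print(complete_digest(array))
```

With a site in every repeat unit, digestion that proceeds too far reduces the tandem array to fragments of one repeat length, so only a carefully controlled partial digestion could leave the array intact in a clone.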
This meant that techniques such as linkage analysis and PFGE proved extremely
useful in the analysis of the chromosomal regions 7q22 and 11p15. The analysis of
the genes themselves has required the use of both traditional techniques such as
Southern blot analysis and newer techniques like vectorette PCR. It seems likely that
the elucidation of the structure of the mucin genes will require the use of a wide range
of techniques as demonstrated in this thesis.
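Linkage evidence of the kind referred to above is conventionally summarised as a lod score, with a value greater than 3 taken as significant evidence for linkage. As a minimal sketch, assuming phase-known and fully informative meioses, the two-point lod score at a recombination fraction theta can be computed as follows. This is a simplified textbook form and does not reproduce the pedigree-based, sex-specific calculations performed by CRI-MAP; the function names are hypothetical.

```python
import math


def two_point_lod(recombinants, meioses, theta):
    """Lod score Z(theta) for phase-known, fully informative meioses.

    Z(theta) = log10[ theta^k * (1 - theta)^(n - k) / 0.5^n ]
    where k is the number of recombinants among n meioses.
    """
    k, n = recombinants, meioses
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need meioses > 0 and 0 <= recombinants <= meioses")
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    if theta == 0.0:
        # With theta = 0 any observed recombinant excludes linkage at that value.
        return n * math.log10(2) if k == 0 else float("-inf")
    log_linked = k * math.log10(theta) + (n - k) * math.log10(1.0 - theta)
    log_unlinked = n * math.log10(0.5)
    return log_linked - log_unlinked


def max_lod(recombinants, meioses):
    """Maximum-likelihood theta (capped at 0.5) and the lod score at that value."""
    if meioses <= 0:
        raise ValueError("meioses must be positive")
    theta_hat = min(recombinants / meioses, 0.5)
    return theta_hat, two_point_lod(recombinants, meioses, theta_hat)


if __name__ == "__main__":
    # Ten informative meioses with no recombinants just exceed the threshold of 3.
    print(max_lod(0, 10))
    print(max_lod(1, 10))
```

Under these assumptions ten non-recombinant meioses are the minimum needed to pass a lod score of 3, which gives a feel for the amount of family data behind each reported score.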
So far mucin genes have been localised to chromosomes 1, 3, 4, 7 and 11.
The determination of the genomic structure of these genes and their physical
relationships to one another will be useful for investigating the evolutionary basis of
the mucin genes and whether there is any functional significance in their position and
structure.
The mucin gene cluster on chromosome 11p15 is particularly interesting in
this respect now that the order of the genes and the orientation of the cluster have been
determined using physical and genetic mapping techniques. It has been speculated
that the order of the genes may be related to the pattern of expression (Pigny,
Guyonnet-Duperat et al. 1996). It was noted that there seems to be a correspondence
between the order of the genes and the preferential expression of particular genes in
specific tissues, i.e. the genes towards the centromere are preferentially expressed in
the epithelia of anterior tissues such as bronchus, while the more telomeric genes
show preferential expression in the epithelia of posterior tissues such as colon.
Also the similarities between the ‘unique’ sequences of these genes indicate that they
arose from a common ancestor.
It might therefore be tempting to think that all the mucin genes on the
different chromosomes arose from a single ancestral gene. However, although the
mucin genes share some characteristics such as the tandem repeats there are
significant differences. MUC1, for instance, is widely expressed in a large number of
tissues and has a transmembrane region which has not been found in any other
mucins (Gendler, Lancaster et al. 1990). There does not appear to be a very high level
of similarity between the unique sequences of mucins on different chromosomes. For
example, the cystine knot-like sequence found in the deduced carboxyl terminal
peptide sequence of MUC2 and MUC5AC (Meitinger, Meindl et al. 1993; Lesuffleur,
Roche et al. 1995) is not present in MUC1, MUC3 or MUC7. Also, the cysteine
residues present in the MUC2 peptide which are able to form disulphide bridges
implicated in gel formation (Gum et al. 1992) are not present in MUC1 or MUC7. It
is not known whether they are present in the ‘unique’ sequences of MUC4 and MUC3
which have not yet been cloned. MUC4 expression, like that of MUC3, is not limited to the
goblet cells in the tissues in which it is expressed, unlike that of MUC2 for instance
(Lesuffleur, Zweibaum et al. 1994). It is curious to note that a signal was detected at
the tip of the q arm of chromosome 3, the region to which MUC4 maps, when the
‘unique’ MUC3 clone 23 was used in in situ hybridisation experiments at lower than
normal stringency (M. Fox, unpublished). It is tempting to speculate that there may
be an evolutionary and/or functional relationship between MUC3 and MUC4.
It would seem that, although the mucin peptides share some characteristics
such as tandem repeats rich in threonine and serine, this may be coincidental,
the apparent similarity having arisen by ‘convergent’ evolution. Indeed, the
differences in expression and the inability of some mucins to form gels indicate that
the different mucins fulfil different functions but all require a high level of
glycosylation. It may be that the mucin gene family seen today arose both by
‘divergent’ evolution, as in the cluster on chromosome 11p15, and by ‘convergent’
evolution, accounting for the genes on other chromosomes. Thus the determination
of the gene structure as well as the sequence of these genes will be invaluable in
unravelling the complex evolutionary relationships of the mucin gene family.
Appendix I
All the lod scores greater than 3 for the genes MUC6, MUC2 and MUC5AC with all the other chromosome 11 markers in the CEPH database version 7.1, calculated using the ‘twopoint’ option of CRI-MAP.
Pedigrees of the 15 CEPH families which show recombinations in the region chromosome 11p15. The phenotypes for each locus are shown below the individual.
KEY:
Locus      Alias            Probe       Polymorphism
D11S2071   D11S2071         j)194b      (CA)n
HRAS       HRAS (correct)   pEJ6.6      Msp I
HRAS       HRAS             pTBB-2      Taq I
HRAS       HRAS1            pTBB-2      Taq I
MUC6       MUC6             MUC6        Pvu II
MUC5AC     MUC5A            JER58       Pvu II (‘upper’ set of bands)
MUC5AC     MUC5B            JER58       Pvu II (‘lower’ set of bands)
D11S150    D11S150          probe2.1    Pst I
MUC2       MUC2             SMUC41      Hinf I
MUC2       MUC2new          SMUC41      Hinf I (EUROGEM CEPH filters)
D11S1000   CEB41            CEB41       Pvu II
D11S1000   COS32A8          CEB41       Hae III
INS        INSa             pINS-310    Pvu II
INS        INSb             pINS-310    Pvu II
TH         THa              J4.7        Taq I
TH         THb              J4.7        Taq I
D11S1318   D11S1318         AFM218xe1   (CA)n
D11S868    c o s ll la      CEB18       Hae III
D11S868    c o s ll lb      CEB18       Hae III
HBB        HBB              EC per
NP3a  TGTGCCTCAACTACGAGGTGCGCGTGCTCTGCTGCGAGACCCœAGAGGCTGCCCGGTGA
NP3a  CCTCTGTGACCCCATATGGGACTTCTCCTACCAATGCTCTGTATCCTTCCCTGTCTACTT
NP3a  CCATGGTATCCGCCTCCGTGGCATCCACCTCTGTGGCATCCAGCTCTGTGGCATCCAGCT
NP3a  GATCCACCATATACCGCCACAGAGACCTCGCTGGCCATTGCTATTATGCCCTGTGTAGCC
NP3a  ATGCGGTTCCCCœAGAAAGAAAGGTGAGACCTGGGCCACACCCAACTGCTCCGAGGCCA
NP3a  CCTGTGAGGGCAACAACGTCATCTCCCTGÇGCCCGCÇSACGTGCCCGAGGGTGGAGAAGC
NP3a  CCACTTGTGCCAACGCGTACCCGGCTGTGAAGGTGGCTGACCAAGATGGCTGCTG-CATC
NP3a  GCACCTACTACACCTTCCTGGACAACTGCACGTACGCTG--GGTGCAGCAGATTGTGCCC
NP3a  GTGTATGGCCACTTCCGCGTGCTCGTCGACAACTACTTCTGCGGTGCGGAGGACGGGCTC
NP3a  CCAGTCCACGGGGTGTAGACAAACGAGATCATCTTCAACAACAAGGTGGTCAGCCCCGGC
NP3a  TTCC -GAAAAACGGCATCGTGGTCTCGCGCATCGGCGTCAAGATGTACGCGACCATCCCG
NP3a  AAGTTTGCCAACAACACCGAGGGCCAGTGCGGCACTTGCACCAACGACAGGAAGGATGAG
NP3a  TGCCGCACGCCTAGGGGGACGGTGGTCGCTTCCTGCTCCGAGATGTCCGGCCTCTGGAAC
NP3a  GTGAGO TCCCTGACCAGCCAGCCTGCCACCGGCCTCACCCGACGCCCACCACGGTCGGG
NP3a  CCCACCACAGTTGGGTCTACCACGGTCGGGCCCACCACAGTTGGGTCTACCACCGTCGGG
NP3a  GTCTTTGAGCCGTGCCACACTGTGATCCCCCCACTGCTGTTCTATGAGGGCTGCGTCTTT
NP3a  GAœGGTGCCACATGACGGACCTGGATGTGGTGTGCTCCAGCCTGGAGCTGTACGCGÇGA
NP3a  CTCTGTGCGTCCCACGACATCTGCATCGATTGGAGAGGCCGGACCCGdGACATGTGCCCA
NP3a  -TCACCTGCCCAGCCGACAAGGTGTACCAGCCCTGC-GCCCGAGCAACCCCTCCTACTGC
NP3a  TACGGGAATGACAGCGCCAGCCTCGGGGCTCT£CGGGAGGCCGGCCCCATCACCGAAGGC
NP3a  TGCTTCTGTCCGGAGGGCATGACCCTCTTCAGCACCAGTGCCCAAGTCTGCGTGCCCACG
NP3a  GGCTGCCCCAGGTGTCTGGGGCCCCACGGAGAGCCGGTGAAGGTGGGCCACACCGTCGGC
NP3a  CTCTGCCCGCTGCCCCCTQCCTGCCCCCTGCCCGGCTTCGTGCCTGTGCCTGCAGCCCCA
NP3a  CCCGTGCGGTGTCCTGAGGGCGCCCGCCGGATCCCGACCTACCAGGAGGGQGCCTGCTGC
NP3a  CCAGTCCAAAACTGCAGCTGGACAGTGTGCAGCATCAACGGGACCCTGTACCAGCCCGGC
NP3a  GCCGTGGTCTCCTCGAGCCTGTGCGAAACCTGCAGGTGTGAGCTGCCGGGTGGCCœCCA
NP3a  TCGGACGCGTTTGTGGTCAGCTGTGAGACCCAGATCTGCAACACACACTGCCCTGTGÇGG
NP3a  TTCGAGTACCAGGAGCAGA8-GCGCAGTGCTGT0GCACCTC?rGTGCAGGTCGCCTGTGTC
NP3a  ACCAACACCAGCAAGAGCCCCGCCCACCTCTTCTACCCTGGCGAGÇACCTGGTCAGACGC
NP3a  AGGGAACCACTGTGTGACCCACCAGTGTGAGAAGCACCAGGATGGGCTCGTGGTGGTCAC
NP3a  CACGAAGAAGGCGTGCCCCCCGCTCAGCTGTTCTCTGGACGAGGCCCGCATGAGCAAGGA
NP3a  CGGCTGCTGCCGCITCTGCCCX3CTGCCX:CCGCCCCX:GTACCAGAACCAGTCGACCTGTQC
NP3a  TGTGTACCATAGGAGCCTGATCATCCAGCAGCAGGGCTeGAGCTCCTCGGAGCCCGTGCG
NP3a  CACGGTGGAGCACAGGTGCCAGTGCTGCCAGGAGCTGCGGACCTCGCTGAGGAATGTGAC
NP3a  CCTGCACTGCACCGACGGCTCCAGCCGGGCCTTCAGCTACACCGAGGTGGAAGAGTGCGG
NP3a  CTCGGCTTCCAAGGCCAGTGGAACTTGTGCCCCTGTCCAGGCGGCTGCAGCTTTGAACAC
NP3a  ACTGTCCACGCCCGCTTTCTTGTGGAGGGTGTGGGCTATGGGTCACCTGCTGCCTGGAGG
NP3a  AGGGGCCCTTACCCACCCCGCCTGCAGCCACCTCTCAGGA GCCCGGGGCTGGCCGA
NP3a  CAGGAATGACGCTTGGACATGGTGATCAGCTGCCTGGTGGCTGCAGGAGGAAGAACCTCA
NP3a  CTCCTACCTCAGCCCTCAGCCTGCGCTCCCCTCCTCAGTACACGGCCAATCTGTTGCATA
            3470      3480      3490      3500      3510      3520

         3270      3280     3289
L31   AATACACTTGAGCATTTTGCAAAAAA
      ||||||||||||||||||||||||||
NP3a  AATACACTTGAGCATTTTGCAAAAAAAAAAAAAAAAAA
            3530      3540      3550      3560
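The vertical bars in an alignment of this kind mark columns in which the two sequences carry the same base. As a minimal sketch, assuming two sequences that are already aligned and of equal length, the match line and the percentage identity can be computed as below. The two strings are the L31 and NP3a sequences from the final block of the alignment above; trimming NP3a to the overlapping region is an assumption made for the example, and the function names are hypothetical.

```python
def match_line(seq_a, seq_b):
    """Return '|' under identical aligned bases and ' ' elsewhere (gaps never match)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return "".join(
        "|" if a == b and a != "-" else " "
        for a, b in zip(seq_a.upper(), seq_b.upper())
    )


def percent_identity(seq_a, seq_b):
    """Percentage of ungapped aligned columns in which the two bases are identical."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != "-" and b != "-"
    ]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    l31 = "AATACACTTGAGCATTTTGCAAAAAA"
    # Only the part of NP3a that overlaps L31 is compared (assumption for this example).
    np3a = "AATACACTTGAGCATTTTGCAAAAAAAAAAAAAAAAAA"[: len(l31)]
    print(l31)
    print(match_line(l31, np3a))
    print(np3a)
    print(percent_identity(l31, np3a))
```

Gapped columns are excluded from the identity calculation here; other conventions count gaps as mismatches, which would lower the figure for alignments containing insertions or deletions.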
Appendix IV
All the lod scores greater than 3 for MUC3 with loci on chromosome 7 from the CEPH database version 6, calculated using the ‘twopoint’ option of CRI-MAP.
Sequence from the cDNA clone SIB172 showing the positions of primers used in standard and vectorette PCR applications. The sequence and position of each primer are indicated by a line either above the sense strand (sense primer) or below the antisense strand (antisense primer); a double line indicates where two primers overlap.
Allen, A. (1984). Structure and Function of Gastrointestinal Mucus. Physiology of the Gastrointestinal Tract. Ed. J. R. Leonard. Pub. New York, Raven Press. 617-639.
Anand, R., Villasante, A. and Tyler-Smith, C. (1989). Construction of yeast artificial chromosome libraries with large inserts using fractionation by pulsed-field gel electrophoresis. Nucleic Acids Res. 17, 3425-3433.
Armour, J., Crosier, M. and Jeffreys, A. (1996). Distribution of tandem repeat polymorphism with minisatellite MS621 (D5S110). Ann. Hum. Genet. 60, 11-20.
Armour, J. A., Harris, P. C. and Jeffreys, A. J. (1993). Allelic diversity at minisatellite MS205 (D16S309): evidence for polarized variability. Hum. Mol. Genet. 2, 1137-45.
Asker, N., Baeckstrom, D., Axelsson, M. A. B., Carlstedt, I. and Hansson, G. C. (1995). The human MUC2 mucin apoprotein appears to dimerise before O-glycosylation and shares epitopes with the 'insoluble' mucin of rat small intestine. Biochem. J. 308, 873-880.
Attwood, J. and Povey, S. (1996). CROSSFIND; Software for detecting and displaying well-characterised meiotic breakpoints in human family data. Ann. Hum. Genet. In press.
Aubert, J. P., Porchet, N., Crepin, M., Duterque-Coquillaud, M., Vergnes, G., Mazzuca, M., Debuire, B., Petitprez, D. and Degand, P. (1991). Evidence for different human tracheobronchial mucin peptides deduced from nucleotide cDNA sequences. Am. J. Respir. Cell Mol. Biol. 5, 178-185.
Balague, C., Audie, J. P., Porchet, N. and Real, F. X. (1995). In situ hybridization shows distinct patterns of mucin gene expression in normal, benign, and malignant pancreas tissues. Gastroenterology. 109, 953-964.
Balazs, I., Baird, M., Wexler, K. and Wyman, A. (1986). Characterisation of the polymorphic DNA fragments detected with a new probe derived from the D14S1 locus. Am. J. Hum. Genet. 39, A229.
Berg, E. S. and Olaisen, B. (1993). Characterization of the COL2A1 VNTR polymorphism. Genomics. 16, 350-4.
Bhargava, A. K., Woitach, J. T., Davidson, E. A. and Bhavanandan, V. P. (1990). Cloning and cDNA sequence of a bovine submaxillary gland mucin-like protein containing two distinct domains. Proc. Natl. Acad. Sci. USA. 87, 6798-6802.
Blouin, J. L., Christie, D. H., Gos, A., Lynn, A., Morris, M. A., Ledbetter, D. H., Chakravarti, A. and Antonarakis, S. E. (1995). A new dinucleotide repeat polymorphism at the telomere of chromosome 21q reveals a significant difference between male and female rates of recombination. Am. J. Hum. Genet. 57, 388-94.
Bobek, L., Liu, J., Sait, S., Shows, T., Bobek, Y. and Levine, M. (1996). Structure and chromosomal localization of the human salivary mucin gene, MUC7. Genomics. 31, 277-282.
Bobek, L. A., Tsai, H., Biesbrock, A. R. and Levine, M. J. (1993). Molecular cloning, sequence, and specificity of expression of the gene encoding the low molecular weight human salivary mucin (MUC7). J. Biol. Chem. 268, 20563-9.
225
different human tracheobronchial mucin peptides deduced from nucleotide cDNA
sequences. Am. J. Respir. Cell Mol. Biol. 5, 178-185.
Balague, C., Audie, J. P., Porchet, N. and Real, F. X. (1995). In situ hybridization
shows distinct patterns of mucin gene expression in normal, benign, and malignant
pancreas tissues. Gastroenterology. 109, 953-964.
Balazs, I., Baird, M., Wexler, K. and Wyman, A. (1986). Characterisation of the
polymorphic DNA fragments detected with a new probe derived from the D14S1
locus. Am. J. Hum. Genet. 39, A229.
Berg, E. S. and Olaisen, B. (1993). Characterization of the C0L2A1 VNTR
polymorphism. Genomics. 16, 350-4.
Bhargava, A. K., Woitach, J. T., Davidson, E. A. and Bhavanandan, V. P. (1990).
Cloning and cDNA sequence of a bovine submaxillary gland mucin-like protein
containing two distinct domains. Proc. Natl. Acad. Sci. USA. 87, 6798-6802.
Blouin, J. L., Christie, D. H., Gos, A., Lynn, A., Morris, M. A., Ledbetter, D. H.,
Chakravarti, A. and Antonarakis, S. E. (1995). A new dinucleotide repeat
polymorphism at the telomere of chromosome 2 1 q reveals a significant difference
between male and female rates of recombination. Am . J. Hum. Genet. 57, 388-94.
Bobek, L., Liu, J., Sait, S., Shows, T., Bobek, Y. and Levine, M. (1996). Structure
and chromosomal localization of the human salivary mucin gene, MUC7. Genomics.
31, 277-282.
226
Bobek, L. A., Tsai, H,, Biesbrock, A. R. and Levine, M. J. (1993). Molecular cloning,
sequence, and specificity of expression of the gene encoding the low molecular
weight human salivary mucin (MUC7). J . B io l. Chem . 268, 20563-9.
Braga, V. M., Pemberton, L. F., Duhig, T. and Gendler, S. J. (1992). Spatial and
temporal expression of an epithelial mucin, MUC1, during mouse development.
Development. 115, 427-37.
Brookes, A. J., Hedge, P. H. and Solomon, E. (1989). A highly polymorphic locus on
chromosome 11 which has homology to a collagen triple-helix coding sequence.
Nucleic Acids Res. 17, 1792.
Buard, J. and Vergnaud, G. (1994). Complex recombination events at the