GENOMICS, MUTATIONS AND THE INTERNET:

The Naming and Use of Parts

Charles R. Scriver Piotr M. Nowacki

Departments of Human Genetics
Biology and Pediatrics, McGill University
Montreal, Quebec, Canada
Address: CR Scriver
deBelle Laboratory
McGill University-Montreal Children's Hospital Research Institute
2300 Tupper Street
Montreal, QC H3H 1P3, Canada
tel. 514-934-4417
fax. 514-934-4329
e-mail mc77@musica.mcgill.ca

Summary

Mutations are the source of genetic variation and diversity; by their effect, some are neutral, others are pathogenic. In contemporary genetics, mutations appear at the interface between genomics (structural and functional) and genetics (heredity), where they serve gene discovery and mapping (genomics) and generate challenges to modify their phenotypic effects (medical genetics). Assuming the human genome harbours 80,000 transcribed genes, each possessing at least 100 different (germ-line) alleles in a typical population, how then to record and recover data on at least 8 million human alleles? Bioinformatics is the essential resource to create the corresponding accessible digital libraries (genomic and locus-specific mutation databases) for this purpose, a goal to which the HUGO Mutation Database Initiative (Science 279:10-11, 1998) aspires. Guidelines now exist for naming alleles (Hum Mut 11:1-3, 1998). The principles behind the practice are illustrated by PAHdb, a prototype locus-specific mutation database (NAR 26:220-225, 1998), and by prototype genomic mutation databases: HGMD (NAR 26:285-287, 1998), the EBI mutation database, and OMIM.

 

Science is an assault on ignorance (Ridley, 1991). Its legacies are arrays of concepts, databases and technologies. Ignorance is a powerful entity and understanding is its counterpart. Coleridge in his philosophical mode (adapted apparently in this case from Schelling) observed that until he could understand another writer's ignorance, he remained ignorant of the author's understanding (Coleridge, 1817). His point of view serves science equally well.

 

I. THE GENOME PROJECT

It was fashionable during the early stages of The Human Genome Project to criticize it as mindless technology. The critics had missed the point because, like most people, they actually knew rather little about genomes, human or otherwise. Accordingly, the Genome Project was truly an assault on ignorance, and it was science; it was feasible because it was being served by a new generation of technology. If he had been here to observe, Coleridge might have said: "I understand our ignorance of genomes and until the project is complete, I remain ignorant of their understanding".

 

The Human Genome Project (how homocentric the name for a project that embraces genomes both human and non human) is now entering the final stages of its structural phase. Venter and colleagues (Venter et al. 1998) propose that "shotgun sequencing" of the human genome will produce an accurate ordered nucleotide sequence covering more than 99.9% of the human genome by the year 2001, by which time, all human genes will probably have been mapped to precise positions on chromosomes. A major legacy of this assault on our ignorance will be a comprehensive human genome database generated by analysis of multiple individual human genomes (Venter et al. 1998) displaying the obligatory allelic variation that characterizes biological species.

 

Mapping and sequencing will bring to a close "structural genomics" in the Genome Project. Meanwhile, this phase will have described the genomes of many other self-replicating organisms, each interesting in its own way, with corresponding insights on the evolutionary genetics behind different life forms. The first product of the whole-genome shotgun assault on self-replicating organisms was the genome of Haemophilus influenzae (Fleischmann et al. 1995); other genomes quickly succumbed (Fraser et al. 1995; Bult et al. 1996; Tomb et al. 1997; Klenk et al. 1997; Fraser et al. 1997; Smith et al. 1997; Deckert et al. 1998), and most recently it was the genome of Treponema pallidum (Fraser et al. 1998) which, when expressed in its villainous spirochete, was capable of reducing Lord Randolph Churchill and Friedrich Nietzsche to mumbling incoherence and Franz Schubert to agonized silence.

 

Function as well as structure.

In the post-structural era of genomics, there is the prospect of "functional genomics" (Lander, 1996; Hieter and Boguski, 1997; Fields, 1997); the term may be new but it describes the old domain we know as "physiology" (and "pathology"). The relevance of functional genomics was highlighted when the initial structural phase of the Yeast (Saccharomyces cerevisiae) Genome Project came to completion (Goffeau et al. 1996; Oliver, 1996; Miklos and Rubin, 1996). Yeast is an ideal "model" organism enabling the analysis of gene function along with genome cross-referencing for the analysis of genes mutated in human disease (Bassett, Jr. et al. 1997). In this regard, "expression genomics" becomes particularly relevant; it includes systematic analysis and documentation of gene expression in other organisms (transgenics), and it reveals the patterns of host gene expression in organs, tissues and cell types during development, and at maturity of the organism (Strachan et al. 1997).

 

Among the most admired of "model organisms" is Homo sapiens. Francois Jacob (Jacob, 1982) reminds us that the human genome, like all others, is the product of evolutionary tinkering. Ernst Mayr (Mayr, 1982) knows that Homo sapiens, like all other living organisms, has emergent properties where the functioning whole is more than the sum of its genomic parts. These thoughtful gentlemen have pointed us toward the domains of physiology and homeostasis. Accordingly we need many resources to understand functions and their evolutionary origins; those resources include, for example: model organisms (Goodfellow, 1997); the above-mentioned cross-referenced genomic database for yeast and man (Bassett, Jr. et al. 1997); "The Oxford grid" describing syntenic regions of homology in mouse and human structural genomes (Searle et al. 1989; Searle et al. 1994); and the bioinformatic tools to identify clusters of orthologous groups (COGs) consisting of individual orthologous sets of proteins (Tatusov et al. 1997) and orthologous sets of paralogs (Henikoff et al. 1997).

 

It is no accident that Victor McKusick displayed the Oxford Grid in recent editions of the Catalogs of Mendelian Inheritance in Man (McKusick, 1994); nor that he is a co-author of early papers on comparative genomics in yeast and man (Tugendreich et al. 1994); nor that he has maintained a human genomic database both in print and on line (OMIM). The ultimate focus of the McKusick-style genomic database is a "neoVesalian human anatomy" (Scriver, 1976), both normal and morbid, the latter highlighted by Mendelian variation (McKusick, 1986; McKusick, 1987). Mutation has so often been the means to reveal a particular (human) locus.

 

II. GENOMICS, GENETICS, MUTATIONS, AND DATABASES

Genetics is the study of inheritance; genomics is the study of genomes (Goodfellow, 1997). Mutations are studied at the interface between genetics and genomics (Cotton et al. 1998; Scriver et al. 1998), and the tools for mutation detection, in both scanning and diagnostic modes, are improving steadily (Cotton, 1993; Grompe, 1993; Cotton, 1997). The result is that the rate of mutation discovery, in the human genome, for example, exceeds the ability of the print literature to keep pace. Online digital databases are tools for biological taxonomy and they will serve mutation repositories equally well (Cotton et al. 1998; Scriver et al. 1998). "In silico genetics" is the latest step in the journey of modern human genetics, where cytogenetics, somatic cell genetics, molecular genetics, and transgenics are earlier technological milestones along the way (VA McKusick, personal communication).

 

The term "mutation" has many meanings; here it means "allelic variant", where an allele is a unique change in a nucleotide sequence in the DNA molecule. It might be a "pathogenic" allele (disease-causing or phenotype-modifying) or it might be "neutral", without apparent effect on phenotype. A pathogenic allele is likely to be disadaptive and to occur at low frequency in the population; a neutral allele is likely to be polymorphic and, by definition, to occur at a frequency greater than 0.01.

 

If there are 80,000 human genes (a reasonable estimate) (Fields et al. 1994), and if each gene harbours at least 100 alleles (of any type, again a reasonable number based on current experience), then the human genome will contain at least 8 million different germline alleles, and more if somatic mutations are included. How does one capture, record and distribute information of this magnitude, in just one genome (for Homo sapiens), let alone for all other genomes of equivalent interest? There can be no other approach but the one available in informatics. But first there is the problem of nomenclature. All taxonomy requires the naming of parts; chromosomes, genes and alleles now have their conventions.

 

Nomenclature

(The beginning of wisdom is calling things by their right name - Chinese proverb; cited in White (White et al. 1997)).

 

Chromosomes. The chromosomal constitution of an individual is named and described according to the International System of Cytogenetic Nomenclature (Mitelman, 1995). It describes the diploid number, the sex chromosome constitution, and the variation in number and structure of chromosomes. The location of an individual gene is assigned to a banded region on an arm of a particular chromosome.

 

Genes. How to name genes has been the topic of two recent workshops. A search for standards in naming homologous genes in different organisms occupied the authors of one report (Blake et al. 1997); it contains a useful list of URLs for primary databases about various organisms and for various purposes. The particular problem of naming human genes is addressed in the other report (White et al. 1997); it describes: i) general rules for naming genes, including requirements for a gene symbol, style of symbol, name for a known gene, and naming of arbitrary genes and loci; ii) guidelines for symbol construction by taking into account hierarchical symbols, gene families and series, homologies with other species, along with preferred abbreviations for different species (e.g. for Homo sapiens, use HSA), genes identified only by their sequence information, genes with known protein or enzyme products (for EC numbers, see Nomenclature Committee of the International Union of Biochemistry and Molecular Biology), and genes for clinical disorders which can be taken into consideration when constructing the symbol. Investigators, authors, reviewers, and editors can do mutual service if we all become familiar with these guidelines and use them.

 

Alleles. Nomenclature for alleles (mutations) is the subject of recently published guidelines (Antonarakis and the Nomenclature Working Group, 1998). EBI proposes a controlled vocabulary for databases to describe an allele and its components, for example, a vocabulary for DNA sequence, codon change and amino acid substitution for a missense allele.

The use of nomenclature guidelines for naming alleles is strongly recommended.

 

The systematic approach to naming alleles is centered on the change in nucleotide sequence. In this system, the number comes first and the letters follow. The number indicates the nucleotide by its position in the DNA sequence; the nucleotide sequence is numbered from its reference sequence, which can be retrieved from the corresponding database. The letters in the name represent the wildtype and substituted nucleotides respectively. The systematic name for a major PKU-causing allele is c.1222C->T, where c. indicates that the nucleotide sequence is the cDNA, available under accession number U49897 in the GenBank database.
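
The structure of a systematic name can be made concrete with a small parser. What follows is a minimal sketch in Python, not part of the published guidelines: the regular expression, the function name parse_systematic_name and the Substitution record are illustrative assumptions, and the pattern handles only simple single-nucleotide substitutions written in the form used above.

```python
import re
from typing import NamedTuple


class Substitution(NamedTuple):
    """Parts of a systematic name for a single-nucleotide substitution."""
    level: str      # sequence type prefix, e.g. "c" for cDNA
    position: int   # nucleotide position in the reference sequence
    wildtype: str   # nucleotide in the reference sequence
    variant: str    # substituted nucleotide


# Hypothetical pattern: prefix, dot, position, wildtype base, "->", variant base.
_PATTERN = re.compile(r"^([a-z])\.(\d+)([ACGT])->([ACGT])$")


def parse_systematic_name(name: str) -> Substitution:
    """Split a name such as 'c.1222C->T' into its components."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a simple substitution name: {name!r}")
    level, position, wildtype, variant = match.groups()
    return Substitution(level, int(position), wildtype, variant)


if __name__ == "__main__":
    print(parse_systematic_name("c.1222C->T"))
```

More complex alleles (deletions, insertions, splice variants) would need a richer grammar than this sketch attempts.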

 

When the reference nucleotide sequence is not accessible (or its use for naming the allele is not preferred), the mutation can be described by an alternative name. The "trivial name" is used as a convenience despite the ambiguities it will entertain (Antonarakis and the Nomenclature Working Group, 1998), but in some cases its use has become a firm convention (e.g. the ΔF508 CFTR allele). The trivial name "R408W" describes a mutation in the human PAH gene corresponding to the systematic name given above (c.1222C->T); here the first letter (R), preceding the number, indicates the wildtype amino acid (arginine), the number is the codon, and the second letter (W) is the amino acid substitution (tryptophan). Table 1 summarizes how the most prevalent allele in the human gene for hepatic phenylalanine hydroxylase would be coded according to existing guidelines for context (the species and corresponding genome), chromosome, locus, gene, allele, gene product, and associated disease (with its OMIM number).
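
The correspondence between the trivial and systematic names can be checked with a little arithmetic. The sketch below is illustrative only; the function names are hypothetical, and it assumes that cDNA nucleotide 1 is the A of the initiator ATG, so that codon n occupies nucleotides 3(n-1)+1 to 3n. Under that assumption codon 408 begins at nucleotide 1222, which agrees with the systematic name c.1222C->T.

```python
import re

# One-letter codes for the two residues named in the text.
AMINO_ACIDS = {"R": "arginine", "W": "tryptophan"}


def parse_trivial_name(name):
    """Split a trivial name such as 'R408W' into (wildtype, codon, substitute)."""
    match = re.match(r"^([A-Z])(\d+)([A-Z])$", name)
    if match is None:
        raise ValueError(f"not a simple missense name: {name!r}")
    return match.group(1), int(match.group(2)), match.group(3)


def codon_nucleotide_range(codon):
    """First and last cDNA positions of a codon.

    Assumption: numbering starts at the A of the initiator ATG (codon 1).
    """
    first = 3 * (codon - 1) + 1
    return first, first + 2


if __name__ == "__main__":
    wildtype, codon, substitute = parse_trivial_name("R408W")
    first, last = codon_nucleotide_range(codon)
    print(f"{AMINO_ACIDS[wildtype]} -> {AMINO_ACIDS[substitute]} at codon {codon}, "
          f"cDNA nucleotides {first}-{last}")
```

A check of this kind catches a trivial name and a systematic name that have drifted apart, although it cannot say which of the two is wrong.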

 

The example in Table 1 is a simple one for the naming of parts. More complex mutations raise problems that are still under consideration (Blake et al. 1997; Antonarakis and the Nomenclature Working Group, 1998); see also http://www.2.ebi.ac.uk/mutations/recommendations/naming.html and http://www.ncbi.nlm.nih.gov/collab/FT/index.html

 

Mutation Databases

Upon completion of the structural phase of genomics, and when the functional phase is well underway, alleles that either serve as markers for loci or modify the function of genes will remain important entities in biological taxonomy. Documentation of alleles in genomes will then be a parallel activity equal in importance to that of structural and functional genomics (our prediction). It will be a responsibility of the HUGO Mutation Database Initiative for which RGH Cotton has been the driving force (Cotton et al. 1998; Scriver et al. 1998).

 

The intended outcome of the HUGO initiative will be an omnifarious public database centered on the genome of a particular species containing a record of all known alleles (germline and somatic) and their biological significance. Progress with the initiative is being recorded at Cotton's dedicated website.

 

The ultimate genomic mutation database will be an annotated genomic nucleotide sequence in which the biology intrinsic to the sequence, and its significance in any other aspect, will be documented. This "modest proposal" has elsewhere been offered as a role for HUGO, in the case of the human genome (Little, 1998), anticipated by the announcement of HUGO MDI (Cotton et al. 1998), and evident in several initiatives to create human genomic mutation databases already underway:

    - The Cardiff Human Gene Mutation Database (HGMD)

    - The EBI Mutation Database

    - The OMIM disease/gene Database

    - The SWISS-PROT database, which provides annotated mutant sequences

    As examples of annotated genomic mutation databases, each of the above offers a different approach. OMIM limits itself to documenting the first 25 alleles (mainly pathogenic, and with a few exceptions as to the number), but it provides pointers to the corresponding locus-specific mutation databases, when they exist; OMIM is a respected source of information about Mendelian phenotypes and the corresponding source literature. HGMD (Cooper et al. 1998) documents mutations with their published or reported sources along with additional information about phenotypes. The EBI mutation webpages offer recommendations for database design and provide links to many resources. None of these databases is sufficient by itself for needs in medical genetics; each is a complementary resource and, at the present time, all benefit from the presence of annotated locus-specific mutation databases. The latter serve genomic needs when they are linked to genomic databases or when there is a search engine that can parse data from the locus-specific database and deposit them in the genomic counterpart. Meantime, locus-specific mutation databases support particular needs of corresponding user groups, which in several cases include both investigators and patients. Note that the relevance of nomenclature is here made apparent: without systematic nomenclature, there can be no universal merging of data from different databases to create an integrated and annotated view of all mutations in a particular genome.

     

    Locus-specific mutation databases.

    The present need for locus-specific mutation databases is illustrated by the following statistic given to the authors by David Cooper in Feb. 1998. At the time, HGMD contained data on over 12,500 different alleles in 692 different human genes; the database was increasing by over 2500 new entries annually; 93% of the loci documented in HGMD described fewer than 25 alleles per locus and none of these was supported by a locus-specific mutation database. Hence the relevance of HGMD. On the other hand, 45% of known alleles in the human genome were being documented and annotated in locus-specific databases, the majority of which were linked by pointers with HGMD (and also to EBI and OMIM). Some locus-specific mutation databases contain hundreds of alleles each annotated with auxiliary data. The locus-specific databases contain a vast array of annotations and are valuable in their own right. A directory of mutation databases will appear in the forthcoming 8th ed. of MMBID and is presently available as it develops on line.

     

    Design and Content. Databases are created; they develop, evolve, decay and redevelop. The process can be complicated; a document serves as the formal record of it and becomes a separate component of the database, providing a mechanism for continuity and longevity. The curator is responsible for the document which, at a minimum, provides a textual printout of tables and fields if the database is relational in design, or a listing of objects and descriptors for other database types (example document). The curator (i.e. editor) is also responsible for accuracy of content.

     

    Content of a mutation database comprises entities and attributes. An entity is a real-world concept, such as "mutation". An attribute, such as the "name" of the mutation, describes the entity. A database contains as many entities as are required, with the corresponding descriptors, to meet the needs of the user group. There is need for a core group of entities and a minimum but essential degree of standardization; as mentioned above, compatibility between databases (genomic and locus-specific) requires standardized nomenclature for alleles.

     

    As for the essential core of information, the DNA sequence has a context, a core of data which includes species, name of the gene, and reference nucleotide sequence. The core of entities is "mutation" and "source of the information"; mutation and source of data can be linked when there is a unique identifier for each mutation in the database. Whereas an objective of any system of mutation description is to reconstruct alignment of the variant nucleotide sequence on the reference sequence, its annotation expands logically when the core data include not only nucleotide change in DNA but the transcript change in mature messenger RNA and the change in the polypeptide (Lehvaslaiho et al. 1998).

     

    Design of a mutation database begins by asking questions about who and what the database will serve: What are the needs of its curator and user group? How can it be made compatible with other mutation databases? Design continues first by listing (and describing) the entities to be recorded in the database, then by listing the descriptors to be attached to the entities. If the database is to be relational, a sketch of the relationships between the entities may help; only entities, not attributes, share relationships. Figure 1 depicts an entity relationship diagram.
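
The core described above (a "mutation" entity, a "source" entity, and a unique identifier linking them) can be sketched as a small relational schema. The example below uses Python's standard sqlite3 module purely for illustration; it is not the PAHdb design, which the legend to Figure 1 describes as built on Visual FoxPro. The table names, column names and the placeholder citation are all assumptions.

```python
import sqlite3


def build_sketch_database():
    """Create an in-memory database with two entities and one relationship."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Entity: mutation; its columns are the attributes (descriptors).
        CREATE TABLE mutation (
            mutation_id     INTEGER PRIMARY KEY,   -- unique identifier
            systematic_name TEXT NOT NULL,
            trivial_name    TEXT
        );
        -- Entity: source of the information.
        CREATE TABLE source (
            source_id INTEGER PRIMARY KEY,
            citation  TEXT NOT NULL
        );
        -- Relationship between the two entities (many-to-many).
        CREATE TABLE mutation_source (
            mutation_id INTEGER NOT NULL REFERENCES mutation(mutation_id),
            source_id   INTEGER NOT NULL REFERENCES source(source_id),
            PRIMARY KEY (mutation_id, source_id)
        );
    """)
    # Illustrative rows only; the citation is a placeholder, not real data.
    conn.execute("INSERT INTO mutation VALUES (1, 'c.1222C->T', 'R408W')")
    conn.execute("INSERT INTO source VALUES (1, 'placeholder citation')")
    conn.execute("INSERT INTO mutation_source VALUES (1, 1)")
    return conn


def sources_for(conn, trivial_name):
    """Return the citations linked to a mutation, looked up by trivial name."""
    rows = conn.execute(
        """
        SELECT s.citation
        FROM mutation AS m
        JOIN mutation_source AS ms ON ms.mutation_id = m.mutation_id
        JOIN source AS s ON s.source_id = ms.source_id
        WHERE m.trivial_name = ?
        ORDER BY s.citation
        """,
        (trivial_name,),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(sources_for(build_sketch_database(), "R408W"))
```

Keeping the relationship in its own table reflects the point made above that only entities, not attributes, share relationships.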

     

    EBI provides detailed recommendations for design of mutation databases. Further discussion can be found in emerging HUGO MDI guidelines. Development and deployment of digital mutation databases requires enabling software and a database management system. One approach to design and deployment of a relatively large locus-specific mutation database (PAHdb) is described elsewhere by the present authors (Nowacki et al. 1998); see also PAHdb and its Newsletter available from the authors. PAHdb is linked to a complementary annotated mutation database at SWISS-PROT.

     

     

     

III. UNIFIED ACCESS TO MUTATION DATABASES

    Genomic and locus-specific mutation databases both exist at the present time, and a unified approach is being organized under the Mutation Database Initiative (Cotton et al. 1998). Others (Lehvaslaiho et al. 1998) have proposed a model, which in its ideal form would be a single public human mutation database curated at one or more institutions, serving data on allelic variation in all human genes and their homologues. The curatorial task would be shared between the genomic and locus-specific databases, and the result would be structured data when there is agreement on nomenclature, content and basic design. As a first step, the EBI group (Lehvaslaiho et al. 1998) has parsed over 30 publicly available human databases for somatic and germline mutations, analyzed them for common data types and made them available through a common user interface. The interface for this approach is provided by the Sequence Retrieval System (SRS), a tool to index, view and link independent databases (Etzold et al. 1996). Data presented under the SRS system reflect the exact contents in source databases at the time of acquisition with the addition of the relevant context information.

     

    Contents of source databases can be broken up into searchable fields. The relevance of creating fields common to different databases becomes apparent when one is attempting to create unified access to mutation databases. Nine categories of information essential (or at least useful) for unified access have been identified (Lehvaslaiho et al. 1998):

    • identifiers for mutations and entries
    • category of the variant
    • descriptors of mutations
    • context of mutation
    • methodology by which the mutation was described
    • data (expression analysis, etc) documenting its significance
    • information about mode of inheritance
    • description of clinical phenotype
    • reference(s)
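
The idea of common fields can be illustrated by mapping records from differently structured source databases onto the nine categories listed above. This is a minimal sketch and not the SRS implementation; the category keys, the function to_unified and the source field names in the example are hypothetical.

```python
# The nine categories from the list above, as dictionary keys (names assumed).
UNIFIED_CATEGORIES = (
    "identifiers",
    "variant_category",
    "descriptors",
    "context",
    "methodology",
    "significance_data",
    "inheritance",
    "clinical_phenotype",
    "references",
)


def to_unified(record, field_map):
    """Map a source record onto the common categories.

    field_map translates a source database's own field names into the
    unified category names; categories with no source field stay None.
    """
    unified = {category: None for category in UNIFIED_CATEGORIES}
    for source_field, category in field_map.items():
        if category not in unified:
            raise KeyError(f"unknown category: {category!r}")
        if source_field in record:
            unified[category] = record[source_field]
    return unified


if __name__ == "__main__":
    # Hypothetical record and field names from one imagined source database.
    record = {"mut_id": "M001", "name": "c.1222C->T", "gene": "PAH"}
    field_map = {"mut_id": "identifiers", "name": "descriptors", "gene": "context"}
    print(to_unified(record, field_map))
```

Each source database needs only its own field map; the search interface can then be written once against the nine common categories.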

    Proof of Pathogenicity. Alleles are either pathogenic or neutral. What is the evidence that a mutation affects phenotype (constitutionally or under particular conditions) and how does one provide reliable annotation? First, it is necessary to deal with artefacts and ambiguities; then to settle on criteria that the mutation is a likely cause of phenotype variation (Cotton and Scriver, 1998). Questions arise most often with missense alleles.

     

    Artefacts include errors introduced by PCR; every "new" mutation should be confirmed on a second PCR product. A variant allele may be found but it may not be the allele responsible for the variant phenotype; the whole functional gene should be analyzed; extent of the DNA sequence analyzed should be stated and efficiency of the detection method known. Ambiguity can also be resolved by in vitro expression analysis; see (Waters et al. 1998).

     

    Criteria for pathogenicity of an allele depend on the information at hand, the results of expression analysis and the relative importance of the mutation type. Information at hand includes mutation type; for example, those producing "functional hemizygosity" (Guldberg et al. 1995) are likely to modify phenotype when combined with homologous alleles in autosomal recessive traits. Segregation analysis should reveal consistent association with the variant phenotype. Missense mutations that affect conserved amino acids in the polypeptide product are assumed to have greater functional significance and are more likely to be pathogenic. Frequency of the allele on a panel of 100 normal chromosomes should be stated. A mutation that is polymorphic is less likely to have a significant effect on phenotype; nonetheless, it may be a modifier of gene expression.
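
The frequency criterion lends itself to a short worked example. The sketch below is an illustration under stated assumptions rather than a recommended test: the function names are hypothetical, the threshold is the 0.01 frequency used earlier to define a polymorphic allele, and the last function assumes chromosomes are sampled independently.

```python
POLYMORPHISM_THRESHOLD = 0.01  # frequency above which an allele is called polymorphic


def allele_frequency(observed, panel_size=100):
    """Frequency of an allele seen `observed` times on a panel of chromosomes."""
    if panel_size <= 0 or not 0 <= observed <= panel_size:
        raise ValueError("observed count must lie between 0 and panel_size")
    return observed / panel_size


def is_polymorphic(frequency):
    """True when the frequency exceeds the 0.01 threshold."""
    return frequency > POLYMORPHISM_THRESHOLD


def chance_of_missing(true_frequency, panel_size=100):
    """Probability that a panel contains no copy of the allele.

    Assumes independent sampling of chromosomes: (1 - p) ** n.
    """
    return (1 - true_frequency) ** panel_size


if __name__ == "__main__":
    for count in (0, 1, 2):
        freq = allele_frequency(count)
        print(count, freq, is_polymorphic(freq))
    print(round(chance_of_missing(0.01), 3))
```

Under these assumptions, an allele with a true frequency of 0.01 would be absent from a panel of 100 chromosomes roughly one time in three, which is one reason to state the panel size alongside the observed frequency.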

     

    Information about mutability of the gene in question is also useful. Alastair Brown provides a web-based program.

    The present authors used a program available from Michael Krawczak (Cooper and Krawczak, 1993) to analyze predicted mutability in the exonic nucleotide sequence of PAH (Byck et al. 1997).

     

    Comment

    We have introduced the reader to several URLs and references about bioinformatic resources for mutation databases. Let it be known that one of the authors (CRS) not so long ago had to learn what "URL" means and still can't operate a computer, while the other author (PMN) enjoys the rapid steady acquisition of essential expertise. Our advice for the novice (and at some time and in some particular way, each of us is a novice) is to: i) work with a colleague with expertise to complement our own; ii) take generic guidance from articles (such as (Harper, 1995)) and journals (such as Trends Guide to the Internet/Elsevier). Meantime, on-line mutation databases have put their intellectual property in the public domain; what to do to protect that property is an issue that has not escaped notice (Scriver et al. 1998; Gardner and Rosenbaum, 1998).

    Acknowledgements

    We thank Lynne Prevost, our curatorial colleague, who first created PAHdb in her word processor long ago. Dick Cotton, Heikki Lehvaslaiho and Victor McKusick, among many, have been stimulating colleagues. This work has been supported in part by the Medical Research Council (Canada), the Networks of Centers of Excellence (Canadian Genetic Diseases Network), le Fonds de la Recherche en Santé du Québec (Réseau de Médecine Génétique Appliquée) and the Interuniversity Institute for Population Research (IREP).

    Bibliography

    Antonarakis SE, and the Nomenclature Working Group (1998) Recommendations for a nomenclature system for human gene mutations. Human Mutation 11:1-3
     
    Bassett DE, Jr., Boguski MS, Spencer F, Reeves R, Kim S, Weaver T, Hieter P (1997) Genome cross-referencing and XREFdb: Implications for the identification and analysis of genes mutated in human disease. Nature Genetics 15:339-344
     
    Blake JA, Davisson MT, Eppig JT, Maltais LJ, Povey S, White JA, Womack JE (1997) A report on the International nomenclature workshop held May 1997 at the Jackson Laboratory, Bar Harbor, Maine, USA. Genomics 45:464-468
     
    Bult CJ, White O, Olsen GJ, et al. (1996) Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii. Science 273:1058-1073
     
    Byck S, Tyfield L, Carter K, Scriver CR (1997) Prediction of multiple hypermutable codons in the human PAH gene: Codon 280 contains recurrent mutations in Quebec and other populations. Hum Mut 9:316-321
     
    Coleridge ST (1817) Biographia Literaria. London
     
    Cooper DN, Ball EV, Krawczak M (1998) The human gene mutation database. Nucleic Acids Res 26:285-287
     
    Cooper DN, Krawczak M (1993) Human Gene Mutation. BIOS Scientific Publishers, Oxford, pp. 141-144
     
    Cotton RGH (1993) Current methods of mutation detection. Mutation Research 285:125-144
     
    Cotton, R.G.H. Mutation Detection, New York:Oxford University Press, 1997.
     
    Cotton RGH, McKusick VA, Scriver CR (1998) The HUGO Mutation Database Initiative. Science 279:10-11
     
    Cotton RGH, Scriver CR (1998) Proof of "Disease-Causing" Mutation. Human Mutation 12:1-3
    Deckert G, Warren PV, Gaasterland T, Young WG, Lenox AL, Graham DE, Overbeek R, Snead MA, Keller M, Aujay M, Huber R, Feldman FA, Short JM, Olsen GJ, Swanson RV (1998) The complete genome of the hyperthermophilic bacterium Aquifex aeolicus. Nature 392:353-358
     
    Etzold T, Ulyanov A, Argos P (1996) SRS: Information retrieval system for molecular biology data banks. Methods Enzymol 266:114-128
     
    Fields C, Adams MD, White O, Venter JC (1994) How many genes in the human genome? Nature Genetics 7:345-346
     
    Fields S (1997) The future is function. Nature Genetics 15:325-327
     
    Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269:496-512
     
    Fraser CM, Gocayne JD, White O, Adams MD, Clayton RA, Fleischmann RD, Bult CJ, Kerlavage AR, Sutton G, Kelley JM, et al. (1995) The minimal gene complement of Mycoplasma genitalium. Science 270:397-403
     
    Fraser CM, Casjens S, Huang WM, et al. (1997) Genomic sequence of a Lyme disease spirochaete, Borrelia burgdorferi. Nature 390:580-586
     
    Fraser CM, Norris SJ, Weinstock GM, et al. (1998) Complete genome sequence of Treponema pallidum, the syphilis spirochete. Science 281:375-388
     
    Gardner W, Rosenbaum J (1998) Database protection and access to information. Science 281:786-787
     
    Goffeau A, Barrell BG, Bussey H, et al. (1996) Life with 6000 Genes. Science 274:546-567
     
    Goodfellow P (1997) A celebration and a farewell. Nature Genetics 16:209-210
     
    Grompe M (1993) The rapid detection of unknown mutations in nucleic acids. Nature Genetics 5:111-117
     
    Guldberg P, Mikkelsen I, Henriksen KF, Lou HC, Guttler F (1995) In vivo assessment of mutations in the phenylalanine hydroxylase gene by phenylalanine loading. Characterization of seven common mutations. Eur J Pediatr 154:551-556
     
    Harper R (1995) World Wide Web resources for the biologist. Trends in Genet 11:223-228
     
    Hieter P, Boguski M (1997) Functional Genomics: It's All How You Read It. Science 278:601-602
     
    Henikoff S, Greene EA, Pietrokovski S, Bork P, Attwood TK, Hood L (1997) Gene Families: The Taxonomy of Protein Paralogs and Chimeras. Science 278:609-614
     
    Jacob, F. The Possible and the Actual. (The Jessie and John Danz lectures). Seattle and London:University of Washington Press, 1982.
     
    Klenk HP, Clayton RA, Tomb JF, et al. (1997) The complete genome sequence of the hyperthermophilic, sulphate-reducing archaeon Archaeoglobus fulgidus. Nature 390:364-370
     
    Lander ES (1996) The new genomics: global views of biology. Science 274:536-539
     
    Lehvaslaiho H, Ashburner M, Etzold T (1998) Unified access to mutation databases. Trends in Genetics 14:205-206
     
    Little P (1998) Human genome annotation - a possible role for HUGO? Nature Genetics 19:222
     
    Mayr, E. The growth of biological thought: Diversity, evolution and inheritance, Cambridge, MA:Harvard University Belknap Press, 1982. pp. 63-67.
     
    McKusick VA (1986) The morbid anatomy of the human genome. Medicine 65:1-33
     
    McKusick VA (1987) Toward a complete map of the human genome. Genomics 1:103-106
     
    McKusick VA (1994) Mendelian Inheritance in Man: A Catalog of Human Genes and Genetic Disorders, 11th ed. Johns Hopkins University Press, Baltimore
     
    Miklos GLG, Rubin GM (1996) The role of the genome project in determining gene function: insights from model organisms. Cell 86:521-529
     
    Mitelman F (1995) ISCN - 1995: An International System for Human Cytogenetic Nomenclature. S Karger, Basel
     
    Nowacki P, Byck S, Prevost L, Scriver CR (1998) PAH Mutation Analysis Consortium Database: 1997. Prototype for relational locus-specific mutation databases. Nucleic Acids Res 26:220-225
     
    Oliver S (1996) From DNA sequence to biological function. Nature 379:597-600
     
    Ridley M (1991) A survey of science: The edge of ignorance. The Economist 1-22
     
    Scriver CR (1976) Genetics: Voyage of discovery for everyman. (Presidential Address). Pediat Res 10:865-872
     
    Scriver CR, Nowacki PM, Cotton RG (1998) The HUGO Mutation Database Initiative. Genome Digest 5:8-11
     
    Searle AG, Peters J, Lyon MF, Hall JG, Evans EP, Edwards JH, Buckle VJ (1989) Chromosome maps of man and mouse. IV. Annals of Human Genetics 53:89-140
     
    Searle AG, Edwards JH, Hall JG (1994) Mouse homologues of human hereditary disease. J Med Genet 31:1-19
     
    Smith DR, Doucette-Stamm LA, Deloughery C, et al. (1997) Complete genome sequence of Methanobacterium thermoautotrophicum deltaH: functional analysis and comparative genomics. J Bacteriol 179:7135-7155
     
    Strachan T, Abitbol M, Davidson D, Beckmann JS (1997) A new dimension for the human genome project: towards comprehensive expression maps. Nature Genetics 16:126-132
     
    Tatusov R, Koonin EV, Lipman DJ (1997) A Genomic Perspective on Protein Families. Science 278:631-637
     
    Tomb JF, White O, Kerlavage AR, et al. (1997) The complete genome sequence of the gastric pathogen Helicobacter pylori. Nature 388:539-547
     
    Tugendreich S, Bassett DE Jr, McKusick VA, Boguski MS, Hieter P (1994) Genes conserved in yeast and humans. Human Molecular Genetics 3:1509-1517
     
    Venter JC, Adams MD, Sutton GG, Kerlavage AR, Smith HO, Hunkapiller M (1998) Shotgun sequencing of the human genome. Science 280:1540-1542
     
    Waters PJ, Parniak MA, Nowacki P, Scriver CR (1998) In vitro expression analysis of mutations in phenylalanine hydroxylase: Linking genotype to phenotype and structure to function. Human Mutation 11:4-17
     
    White JA, McAlpine PJ, Antonarakis S, et al. (1997) Guidelines for human gene nomenclature (1997). Genomics 45:468-471

     

     

    Figure Legend.

    An entity-relationship (ER) diagram describing a module in PAHdb, a relational locus-specific mutation database (Nowacki et al. 1998). This portion of a much larger diagram (Nowacki PM. Thesis. McGill Univ. 1998) describes a single entity in the database. An entity is a thing in the real world with its own existence; the entity shown here (in the rectangular box) is "mutation". Entities have attributes that describe them; attributes correspond to fields in data tables, and here they are the descriptors of mutations. The attributes shown here (in the ellipses) are linked to the entity by faint lines. In the relational model, only entities are related. In the diagram, "mutation" is connected to other entities (not shown) by heavy lines. (For the complete ER diagram, visit the website listed below; the textual tables and field listings for this and the other entities will also be found in that document.) PAHdb has a modular design, and each module contains an entity with its attributes. The modules in PAHdb contain the following entities: mutation, reference, nucleotide sequence, association (with population, geographic region, relative frequency), polymorphic haplotype background, and expression analysis in vitro (human and rat). The database management system is Visual FoxPro 5.0, running on Windows NT. For further information about PAHdb tables, see http://www.debelle.mcgill.ca/pahdb/docu/.
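
    The relational idea described in the legend (entities as tables, attributes as fields, relations only between entities) can be sketched as follows. This is a minimal illustration and not the PAHdb schema: SQLite is used in place of Visual FoxPro so that the sketch runs with a standard Python installation, and every table name, column name, and the citation text are hypothetical. Only the allele names are taken from Table 1.

```python
import sqlite3

# SQLite stands in for the Visual FoxPro system named in the legend (an assumption
# made so the sketch runs anywhere). All table and column names are hypothetical.
SCHEMA = """
CREATE TABLE mutation (              -- entity (the rectangular box)
    mutation_id     INTEGER PRIMARY KEY,
    systematic_name TEXT NOT NULL,   -- attribute (an ellipse)
    trivial_name    TEXT             -- attribute (an ellipse)
);
CREATE TABLE reference (             -- a second entity
    reference_id INTEGER PRIMARY KEY,
    citation     TEXT NOT NULL
);
CREATE TABLE mutation_reference (    -- relation between the two entities
    mutation_id  INTEGER NOT NULL REFERENCES mutation(mutation_id),
    reference_id INTEGER NOT NULL REFERENCES reference(reference_id),
    PRIMARY KEY (mutation_id, reference_id)
);
"""


def build_example_db():
    """Create an in-memory database holding one mutation linked to one reference."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Allele names are taken from Table 1; the citation text is a placeholder.
    conn.execute("INSERT INTO mutation VALUES (1, 'c.1222C->T', 'R408W')")
    conn.execute("INSERT INTO reference VALUES (1, 'placeholder citation')")
    conn.execute("INSERT INTO mutation_reference VALUES (1, 1)")
    return conn


def citations_for(conn, trivial_name):
    """Follow the relation from a mutation to the references linked to it."""
    rows = conn.execute(
        """
        SELECT r.citation
        FROM mutation AS m
        JOIN mutation_reference AS mr ON mr.mutation_id = m.mutation_id
        JOIN reference AS r ON r.reference_id = mr.reference_id
        WHERE m.trivial_name = ?
        ORDER BY r.reference_id
        """,
        (trivial_name,),
    ).fetchall()
    return [row[0] for row in rows]
```

    Calling citations_for(build_example_db(), "R408W") follows the heavy-line relation from "mutation" to "reference"; the attributes themselves are never related to one another, only the entities that own them.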

     

    TABLE 1

    THE NAMING OF PARTS; AN ILLUSTRATION

     

    ENTITY                  NAME (and SYMBOL)
    ---------------------------------------------------------
    · Species               H. sapiens (HSA)
    · Chromosome            12
    · Locus                 12q24.1
    · Gene (Symbol)         Phenylalanine Hydroxylase (PAH)
    · Reference Sequence    cDNA, U49897 (GenBank)
    · Allele                c.1222C->T (systematic)
                            R408W (trivial)
    · Product               PAH (EC 1.14.16.1)
    · Disease               PKU and non-PKU HPA
    · OMIM #                261600
    · On-line database      http://www.mcgill.ca/pahdb
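
    The two allele names in Table 1 can be cross-checked mechanically. The sketch below is an illustration only: the function names and regular expressions are hypothetical, it handles single-nucleotide substitutions written as in the table and nothing else, and it assumes that cDNA numbering starts at the A of the ATG initiation codon.

```python
import re

# Simple cDNA substitution as written in Table 1, e.g. "c.1222C->T".
# The "-" is optional, so "c.1222C>T" is accepted as well. Substitutions only.
SYSTEMATIC = re.compile(r"^c\.(\d+)([ACGT])-?>([ACGT])$")

# Trivial (amino-acid) name in one-letter code, e.g. "R408W"; "*" marks a stop.
TRIVIAL = re.compile(r"^([A-Z])(\d+)([A-Z*])$")


def parse_systematic(name):
    """Return (cDNA position, reference base, variant base)."""
    match = SYSTEMATIC.match(name)
    if match is None:
        raise ValueError(f"not a simple cDNA substitution: {name!r}")
    return int(match.group(1)), match.group(2), match.group(3)


def parse_trivial(name):
    """Return (wild-type residue, codon number, variant residue)."""
    match = TRIVIAL.match(name)
    if match is None:
        raise ValueError(f"not a simple amino-acid substitution: {name!r}")
    return match.group(1), int(match.group(2)), match.group(3)


def codon_number(cdna_position):
    """Codon containing a cDNA position, assuming position 1 is the A of ATG."""
    return (cdna_position - 1) // 3 + 1


def names_agree(systematic, trivial):
    """True when both names point at the same codon."""
    position, _, _ = parse_systematic(systematic)
    _, codon, _ = parse_trivial(trivial)
    return codon_number(position) == codon
```

    For the allele in Table 1, nucleotide 1222 falls in codon (1222 - 1) // 3 + 1 = 408, so names_agree("c.1222C->T", "R408W") is true. The check confirms only that the two names refer to the same codon, not that the amino-acid change is the one predicted by the nucleotide change.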

    Addendum: 

    Since this paper was presented to the SSIEM Membership at the 1998 Annual Meeting (York Univ. UK), the following developments should be mentioned:

    1. A database of metabolic pathways for H. sapiens and other organisms is available at: http://genome.ad.jp/kegg/kegg.html

    2. Mutation View is a distributed database system for mutations in human disease genes with software providing chromosome ideograms; links with OMIM and GDB; views of genomic and cDNA structures; views of disease-causing alleles. The database has interesting graphics, is dynamic and can receive new data. (Contact: shimuzu@dmd.med.keio.ac.jp)

       

    3. The quality of mutation data can be improved by a "mutation checker", which compares the reported allelic sequence with the standard (wild-type) sequence, and the wild-type amino acid with the predicted substitution.

      Software developed by Heikki Lehvaslaiho can be accessed at http://www2.ebi.ac.uk/cgi-bin/mutations/check.cgi
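
    The kind of comparison described in item 3 can be sketched as follows. This is not the EBI software; it is a minimal illustration under stated assumptions: the function name and its arguments are hypothetical, the reference is a coding sequence that begins at the ATG, only single-nucleotide substitutions are checked, the standard genetic code is used, and the example sequence in the usage note is a made-up toy, not a PAH sequence.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def check_substitution(cdna, position, ref_base, alt_base, ref_aa, alt_aa):
    """Return a list of inconsistencies; an empty list means the report is consistent.

    `position` is 1-based and counted from the A of the ATG (an assumption).
    """
    cdna = cdna.upper()
    if not 1 <= position <= len(cdna):
        return ["position lies outside the reference sequence"]

    problems = []
    # 1. The reported wild-type base must match the reference sequence.
    if cdna[position - 1] != ref_base:
        problems.append(
            f"reference has {cdna[position - 1]} at {position}, report says {ref_base}"
        )

    # 2. The reported residues must match the translated codons.
    start = (position - 1) // 3 * 3          # 0-based start of the codon
    codon = cdna[start:start + 3]
    if len(codon) < 3:
        problems.append("position falls in an incomplete codon")
        return problems
    offset = position - 1 - start            # 0, 1 or 2 within the codon
    mutated = codon[:offset] + alt_base + codon[offset + 1:]
    if CODON_TABLE[codon] != ref_aa:
        problems.append(f"codon {codon} encodes {CODON_TABLE[codon]}, not {ref_aa}")
    if CODON_TABLE[mutated] != alt_aa:
        problems.append(f"codon {mutated} encodes {CODON_TABLE[mutated]}, not {alt_aa}")
    return problems
```

    With the toy sequence "ATGCGGTGGTAA", the call check_substitution("ATGCGGTGGTAA", 4, "C", "T", "R", "W") returns an empty list, because position 4 is C in the reference and the change converts codon CGG (R) to TGG (W). Reporting the wrong wild-type base or the wrong predicted residue each adds one entry to the list.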

    4. Public databases contain complete genome sequences for organisms mentioned in this article and many others; 18 in all at the time of writing. See Entrez Genomes at www.ncbi.nlm.gov/Entrez/Genome/org.html

      The (new) URL for annotated mapping information on over 30,000 human genes is: www.ncbi.nlm.nih.gov/genemap

    5. The Web provides access to two resources useful for genetic counselling (Medical Genetics Knowledge Base, www.geneclinics.org) and for genetic testing and diagnosis (HELIX: Genetic Testing Resource, http://healthlinks.washington.edu/helix).

    6. The American Journal of Human Genetics now requires the list of references in a research article to be preceded by a list of the databases and informatics resources used to prepare the article.

    7. The Oct. 23, 1998 issue of Science contains an informative article on databases in genomic research (Gelbart, W.M. Science 282: 659-661, 1998). The views given there about the relevance of, and need for, both generalized and specialist databases echo those expressed here about genomic and locus-specific mutation databases (see text above and "Guidelines" at http://www.debelle.mcgill.ca/guidelines or http://data.mch.mcgill.ca/guidelines).

