Download GO & KEGG

[Download 01.09.2007] GO & KEGG ontologies.

The format of the file is the following:

[term_ID, [branch_of_ontology, is_a_list, part_of_list]]

  • term_ID is the identifier of the GO or KEGG term in the form:
    GO:nnnnnnn or KEGG:nnnnn,
  • branch_of_ontology is one of the strings 'molecular_function', 'cellular_component', 'biological_process' or 'KEGG_pathway',
  • is_a_list is a list of the term_IDs that are in an is_a relation with the term_ID,
  • part_of_list is a list of the term_IDs that are in a part_of relation with the term_ID.
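
The record layout described above can be read with a few lines of Python. This is a minimal sketch, not the loader used to build the file: it assumes each record is stored as a Python-style literal with quoted strings, which the description suggests but does not state. The function name parse_record and the term IDs in the sample record are placeholders made up for illustration.

```python
import ast
import re

# Identifier shapes taken from the format description: GO:nnnnnnn or KEGG:nnnnn.
TERM_ID = re.compile(r"^(GO:\d{7}|KEGG:\d{5})$")
BRANCHES = {
    "molecular_function",
    "cellular_component",
    "biological_process",
    "KEGG_pathway",
}


def parse_record(text):
    """Parse one record into (term_id, branch, is_a_list, part_of_list).

    Assumption: the record is a Python-style literal of the form
    [term_ID, [branch_of_ontology, is_a_list, part_of_list]].
    """
    term_id, (branch, is_a, part_of) = ast.literal_eval(text)
    if not TERM_ID.match(term_id):
        raise ValueError(f"unexpected term ID: {term_id!r}")
    if branch not in BRANCHES:
        raise ValueError(f"unexpected branch: {branch!r}")
    return term_id, branch, list(is_a), list(part_of)


# Placeholder record; these IDs are invented and are not real annotations.
record = "['GO:9999902', ['biological_process', ['GO:9999901'], []]]"
print(parse_record(record))
```

ast.literal_eval is used instead of eval so that only plain literals are accepted. If the actual file uses a different serialization, only the first line of parse_record would need to change.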

This ontology file was created from the original plain text files of GO and KEGG.

Introduction to GO & KEGG

The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products in different databases.

The GO project has developed three structured controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner. The use of GO terms by collaborating databases facilitates uniform queries across them. The controlled vocabularies are structured so that they can be queried at different levels: for example, you can use GO to find all the gene products in the mouse genome that are involved in signal transduction, or you can zoom in on all the receptor tyrosine kinases. This structure also allows annotators to assign properties to genes or gene products at different levels, depending on the depth of knowledge about that entity.

As noted above, the three organizing principles of GO are cellular component, biological process and molecular function. A gene product might be associated with or located in one or more cellular components; it is active in one or more biological processes, during which it performs one or more molecular functions. For example, the gene product 'cytochrome c' can be described by the molecular function term oxidoreductase activity, the biological process terms oxidative phosphorylation and induction of cell death, and the cellular component terms mitochondrial matrix and mitochondrial inner membrane.

The building blocks of the Gene Ontology are the terms. Each entry in GO has a unique numerical identifier of the form GO:nnnnnnn, and a term name, e.g. cell, fibroblast growth factor receptor binding or signal transduction. Each term is also assigned to one of the three ontologies, molecular function, cellular component or biological process. The terms in an ontology are linked by two relationships, is_a and part_of. is_a is a simple class-subclass relationship, where A is_a B means that A is a subclass of B; for example, nuclear chromosome is_a chromosome. part_of is slightly more complex; C part_of D means that whenever C is present, it is always a part of D, but C does not always have to be present. An example would be nucleus part_of cell; nuclei are always part of a cell, but not all cells have nuclei.

The ontologies are structured as directed acyclic graphs, which are similar to hierarchies but differ in that a child, or more specialized, term can have many parents, or less specialized, terms. For example, the biological process term hexose biosynthesis has two parents, hexose metabolism and monosaccharide biosynthesis. This is because biosynthesis is a subtype of metabolism, and a hexose is a type of monosaccharide. When any gene involved in hexose biosynthesis is annotated to this term, it is automatically annotated to both hexose metabolism and monosaccharide biosynthesis, because every GO term must obey the true path rule: if the child term describes the gene product, then all its parent terms must also apply to that gene product. See Figure 1 for an example of a small section of the GO graph.

Figure 1: This figure shows a part of GO providing the annotations concerning positive regulation of muscle cell differentiation.
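
The true path rule described above can be sketched as a graph traversal: starting from the term a gene is annotated to, collect every term reachable through its parent links. The sketch below is an illustration under stated assumptions, not part of the GO tooling. It assumes the ontology is held in a dictionary that maps each term to (branch, is_a_list, part_of_list), that both lists point to parent terms, and that part_of parents are followed as well as is_a parents. Term names are used as keys in place of GO identifiers, and the shared grandparent "monosaccharide metabolism" is added only to complete the toy graph.

```python
def ancestors(term, ontology):
    """Return every term reachable from `term` via is_a or part_of links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        # Terms missing from the dictionary are treated as having no parents.
        _branch, is_a, part_of = ontology.get(current, ("", [], []))
        for parent in list(is_a) + list(part_of):
            if parent not in seen:  # a DAG can reach a parent by several paths
                seen.add(parent)
                stack.append(parent)
    return seen


# Toy graph built around the hexose biosynthesis example from the text.
toy = {
    "hexose biosynthesis": (
        "biological_process",
        ["hexose metabolism", "monosaccharide biosynthesis"],
        [],
    ),
    "hexose metabolism": ("biological_process", ["monosaccharide metabolism"], []),
    "monosaccharide biosynthesis": ("biological_process", ["monosaccharide metabolism"], []),
    "monosaccharide metabolism": ("biological_process", [], []),
}

# A gene annotated to "hexose biosynthesis" is implicitly annotated to all of these.
for name in sorted(ancestors("hexose biosynthesis", toy)):
    print(name)
```

The `seen` set matters because a term in a directed acyclic graph can be reached along more than one path; without it, shared ancestors would be visited repeatedly.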

The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a collection of manually drawn pathway maps representing our knowledge of the molecular interaction and reaction networks for Metabolism, Genetic Information Processing, Environmental Information Processing, Cellular Processes and Human Diseases. A metabolic pathway is a series of chemical reactions occurring within a cell, catalyzed by enzymes (the products of genes), resulting in either the formation of a metabolic product to be used or stored by the cell, or the initiation of another metabolic pathway. Therefore, each pathway defines a set of genes collectively performing the same 'job'. KEGG terms are hierarchically organized on three levels with only one type of relationship, is_a. See Figure 2 for an example of a small section of the KEGG hierarchy.

Figure 2: This figure shows a part of KEGG providing the annotations concerning Metabolism.
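
Because KEGG terms use only the is_a relationship, the level of a term in the hierarchy can be found by walking its is_a links up to the top. The following is a rough sketch under assumptions that the text does not confirm: each KEGG term has at most one is_a parent, the top level is counted as level 1, and the ontology is held in the same (branch, is_a_list, part_of_list) dictionary form as the file records. The function name kegg_level and the three identifiers are hypothetical.

```python
def kegg_level(term_id, ontology):
    """Count the levels from `term_id` up to a term with no is_a parent."""
    level = 1
    current = term_id
    while ontology[current][1]:  # index 1 holds the is_a_list
        if level > len(ontology):
            raise ValueError("cycle detected in is_a links")
        current = ontology[current][1][0]  # assumes a single is_a parent
        level += 1
    return level


# Hypothetical three-level chain; the IDs are invented for illustration.
toy_kegg = {
    "KEGG:00001": ("KEGG_pathway", [], []),              # top-level category
    "KEGG:00002": ("KEGG_pathway", ["KEGG:00001"], []),  # subcategory
    "KEGG:00003": ("KEGG_pathway", ["KEGG:00002"], []),  # individual pathway
}

for term in sorted(toy_kegg):
    print(term, kegg_level(term, toy_kegg))
```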