At the heart of the system is the CATH classification of protein domains, derived from integrated semi-automatic processing and manual curation of high-resolution 3D structures in the wwPDB. Protein domains are identified in these structures and compared to detect homology relationships and other structural similarities. The resulting hierarchy can be browsed, and the relationships studied, through the website. CATH also provides a set of tools for general structural comparison.
The CATH superfamilies are then extended to the major protein sequence repositories in three steps: modelling sequence variation within domain superfamilies, searching with the hidden Markov model software HMMER3, and applying an in-house algorithm called DomainFinder to resolve potential matches into a unified multi-domain architecture (referred to as an ‘MDA’). These predicted sequence domains are presented as the Gene3D resource. Gene3D also merges in many different sources of protein function annotation, ranging from pathway data to active sites, and presents these through a web interface with complex querying abilities.
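To make the idea of resolving overlapping matches into a single architecture concrete, here is a minimal sketch. It is not the DomainFinder algorithm, whose details are not described here; it is a simple greedy stand-in that keeps the best-scoring match and discards any lower-scoring match that overlaps one already kept. The `Hit` type, the superfamily labels and the scores are all hypothetical.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    superfamily: str  # hypothetical superfamily label
    start: int        # first residue of the match (inclusive)
    end: int          # last residue of the match (inclusive)
    score: float      # higher is better (assumed, e.g. a bit score)


def overlaps(a: Hit, b: Hit) -> bool:
    """True if the two matches share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def resolve_architecture(hits: List[Hit]) -> List[Hit]:
    """Greedy stand-in: keep hits in score order, skipping any that overlap a kept hit."""
    chosen: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if not any(overlaps(hit, kept) for kept in chosen):
            chosen.append(hit)
    # Report the surviving domains in sequence order, N- to C-terminus.
    return sorted(chosen, key=lambda h: h.start)


# Made-up matches on one sequence; sf_B overlaps the stronger sf_A match.
hits = [
    Hit("sf_A", 1, 120, 85.0),
    Hit("sf_B", 100, 200, 40.0),
    Hit("sf_C", 130, 260, 60.0),
]
mda = resolve_architecture(hits)
print("-".join(h.superfamily for h in mda))
```

A real resolver would also need to handle discontinuous domains and small permitted overlaps, which this sketch ignores.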
For more details on the construction of these resources, we recommend reading the latest NAR papers and the documentation on the sites. We are also happy to answer any direct questions about the data (firstname.lastname@example.org, email@example.com).
There are many different possible starting points for an investigation. Here we are going to start with a classic investigation, tracing from a single structure in the PDB to the protein’s function and distribution of homologues. Along the way you will hopefully discover something of use for the future. CATH typically releases once or twice a year, and so it is possible that the PDB record you wish to study hasn’t yet been curated. There are two possible solutions: You can look at the pre-release information for that record (see below), or you can search the structure using CATHEDRAL, the CATH domain recognition algorithm (also see below).
Let’s say we’re interested in PDB entry 1GCQ, and in particular the domains that can be found in it. You can have a look at the PDB record here: http://www.rcsb.org/pdb/explore.do?structureId=1GCQ
The first step is to go to the CATH home page (http://www.cathdb.info/) and take a quick look. On the right-hand side you can find a column of boxes with links to new developments in CATH. Other links to useful information (references, search tools) can be found in the main part of the page, as well as in the footer, which is found at the bottom of every page. In the top right you can find links to the key functions of the web site and the ‘Quick Search’ box. This is where we are going to start. Enter the PDB code 1gcq into this box and hit the button (or the ‘Enter’ key).
A CATH results page returns all the records that may correspond to the query term. The possible return types are ‘Domain’, ‘Chain’, ‘PDB’ and ‘Node’. ‘Node’ corresponds to a node in the CATH structural hierarchy, e.g. a superfamily. It is worth noting that a CATH domain code is an extension of the PDB chain identifier (e.g. 1gcqA00 is a domain in chain 1gcqA, which is found in record 1gcq).
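The nesting of domain, chain and PDB identifiers can be sketched in a few lines of Python. The layout assumed here (a four-character PDB code, one chain character, then a two-digit domain number) is inferred from the example 1gcqA00 and may not cover every identifier in use; the function and type names are illustrative only.

```python
import re
from typing import NamedTuple


class CathDomainId(NamedTuple):
    pdb: str     # four-character PDB code, e.g. "1gcq"
    chain: str   # PDB code plus chain character, e.g. "1gcqA"
    domain: str  # full domain identifier, e.g. "1gcqA00"


# Assumed layout: PDB code (4 chars) + chain (1 char) + domain number (2 digits).
_DOMAIN_RE = re.compile(r"^([0-9][A-Za-z0-9]{3})([A-Za-z0-9])([0-9]{2})$")


def parse_domain_id(domain_id: str) -> CathDomainId:
    """Split a domain identifier into its PDB, chain and domain parts."""
    match = _DOMAIN_RE.match(domain_id)
    if match is None:
        raise ValueError(f"not a recognised domain identifier: {domain_id!r}")
    pdb, chain_char, _number = match.groups()
    return CathDomainId(pdb=pdb, chain=pdb + chain_char, domain=domain_id)


print(parse_domain_id("1gcqA00"))
```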
How many chains and domains are associated with this PDB? How many domains per chain? And do all the domains belong to different superfamilies (“H-level”) or to the same one?
From the results page for 1gcq, click on the PDB record. Here you can view the structures and sequences for the chains in 1gcq, as well as a simple table of corresponding chains and domains.
Next you’ll look at the pages for the chain 1gcqA, and then finally the domain 1gcqA00. In the summary tabs below for both, you’ll see a ‘History’ tab has been added. This tab details the actions CATH curators have taken with respect to the domain assignments. If you’re looking for an explanation as to why something has been re-defined you can normally find it here.
The domain pages are where the structural definitions meet the CATH hierarchy. As well as an identifier that links to the PDB record, each domain is given a 9-part code specifying its location in the hierarchy. The first four parts, from Class to Homology, are curated while the subsequent levels are based on an automatic sequence clustering protocol.
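As a small illustration of how such a code can be handled, the sketch below splits a dot-separated classification code into its curated part (the first four levels) and its automatically clustered part, and lists the ancestor nodes from the top of the hierarchy downwards. The example code string is made up for illustration and is not the classification of any domain discussed here.

```python
from typing import List, Tuple

CURATED_LEVELS = 4  # Class to Homology are curated, as described above
MAX_LEVELS = 9      # the full code has nine parts


def split_cath_code(code: str) -> Tuple[List[str], List[str]]:
    """Return (curated_parts, automatic_parts) for a dot-separated code."""
    parts = code.split(".")
    if not 1 <= len(parts) <= MAX_LEVELS or not all(p.isdigit() for p in parts):
        raise ValueError(f"unexpected classification code: {code!r}")
    return parts[:CURATED_LEVELS], parts[CURATED_LEVELS:]


def node_path(code: str) -> List[str]:
    """List every ancestor node, from the top level down to the code itself."""
    parts = code.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


example = "1.10.8.10.1.1.1.1.1"  # hypothetical nine-part code
curated, automatic = split_cath_code(example)
print("curated:  ", ".".join(curated))
print("automatic:", ".".join(automatic))
print("path:     ", node_path(".".join(curated)))
```

The same `node_path` helper gives the chain of nodes you would click through when browsing from a domain up to its class.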
From here, follow the links and use the child node summaries on the relevant pages to find out how many members the superfamily contains and how many superfamilies belong to the fold. Feel free to explore a little at this point.
CATHEDRAL is an algorithm for identifying domains in structures by comparing input structures with known domains in CATH. It can handle multi-domain as well as single-domain structures. Find it by clicking on ‘Tools’ at the top, and then the CATHEDRAL server link.
For this test we are going to take a more recent structure for human Vav protein that hasn’t yet been classified: 2vrwB. Normally you would enter this code in the identifier box and click ‘Continue’, and then ‘Continue’ again. But CATHEDRAL takes a while to run, so instead you can go directly to the results with this link:
How many domains are reported and to what superfamilies?
CATH also provides a server for the pairwise comparison method SSAP. This returns a superposition of two structures along with a similarity score. For this tutorial we’ll move on, but feel free to try it out if you have time.
Next we’re going to look at the corresponding records for Vav protein and the superfamilies it belongs to in Gene3D.
Fusing structural annotation with genomes and functions. In this guide you can learn a few things about the types of data in Gene3D, how you can retrieve sets of interest, and what tools are built into the website. There are several ways of beginning your investigation, depending on whether you are interested in particular proteins, superfamilies or genomes, so feel free to jump to the section that best describes what you wish to do and start there.
Gene3D can be queried with most recognised identifiers (e.g. UniProt IDs) along with any gene names provided by these resources. If your query returns more than one sequence, you will be able to choose the appropriate one from the lists provided. Here we want to find out about VAV1 in human. Enter ‘VAV1’ in the proteins search, type 'human' in the taxon filter box (to restrict the results to VAV1 proteins in human) and click 'get proteins' to retrieve the proteins. Direct link to Results.
Looking through the list you will find two distinct records for the search; this is because Gene3D merges resources at the sequence level, so slightly differing sequences for the same protein are treated as distinct records. However, by clicking the 'Get more functional annotation' button we can see that only one of the sequences is found in the Ensembl human genome assembly.
Clicking on the 'Get protein' link for the VAV1 protein that is in Ensembl gives a detailed summary view for this protein. Direct link to Results.
The first tab has a summary page of annotations for the protein. The second tab, ‘Sequence Features’, shows the predicted CATH domains along with sequence annotation from other resources, including other domain databases and UniProt sequence annotation (e.g. active sites).
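The effect of merging at the sequence level can be sketched as follows: records from different sources collapse into one entry only when their sequences are identical, so a one-residue difference yields a separate entry. Using an MD5 digest of the sequence as the merge key is an assumption made for this sketch, and the source names, accessions and sequences are made up.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str, str]  # (source, accession, sequence), all hypothetical


def sequence_key(sequence: str) -> str:
    """Digest of the upper-cased sequence, used as the merge key (assumed scheme)."""
    return hashlib.md5(sequence.upper().encode("ascii")).hexdigest()


def merge_by_sequence(records: Iterable[Record]) -> Dict[str, List[Tuple[str, str]]]:
    """Group (source, accession) pairs that share exactly the same sequence."""
    merged: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for source, accession, sequence in records:
        merged[sequence_key(sequence)].append((source, accession))
    return dict(merged)


records = [
    ("source_1", "ACC_1", "MELWRQCTHW"),
    ("source_2", "ACC_2", "MELWRQCTHW"),   # identical sequence: merged with ACC_1
    ("source_1", "ACC_3", "MELWRQCTHWL"),  # one residue longer: kept separate
]
merged = merge_by_sequence(records)
for key, members in merged.items():
    print(key[:8], members)
```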
'Mouse Over for More'
Clicking on domain images will reveal extra functional information and link-outs for a domain.
By looking around the various tabs and the FunFam assignments you should be able to find annotations from GO and KEGG on the role of VAV1 in the cell and its molecular function. We can also inspect the functions of its interactors to help establish the roles of this protein in the cell.
In the 'Sequence Features' tab for VAV1, click on the link 'Click here for Proteins with similar CATH arrangements'; this will retrieve other proteins with a similar domain organisation. Also on this page is a summary of GO annotations and the associated evidence for all proteins with this domain organisation. You can then retrieve the sequences from the organism of interest, for example Homo sapiens. Direct link to Results. This displays a protein collection page of multiple proteins; further annotation can be obtained from the drop-down menu.
We can find a summary of a superfamily by searching from the “Get superfamily summary” tab on the front page. For example, searching for 22.214.171.124 shows information on functions, domain partners, genome distributions etc. Direct link to Results. If we click on the Domain organisation tab we can see the different domain combinations and the organisms they are found in.
For example, by clicking on the “number of viruses” we can see that this domain is found along with other domains in certain viruses.
We can find a summary of a genome by searching from the “Get genome summary” tab on the front page. For example, searching for taxon id 4932 shows information on the superfamilies, FunFams, domain organisations etc. of a genome. Direct link to Results. From each of these pages it is possible to retrieve individual protein sets.
We can compare two genomes by searching from the “Compare Genomes” tab on the front page. For example, let's compare the human pathogen Plasmodium vivax and the more lethal species Plasmodium falciparum. Direct link to Results. We can click on individual tabs to see superfamilies, FunFams and domain organisations compared between the two genomes by their protein counts. For example, on the FunFams tab we can see that the “Rifin-like domain” is found in several sequences in P. falciparum and is absent from P. vivax. The corresponding proteins can be retrieved for either genome on any of the tabs.
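The kind of count-based comparison shown on these tabs can be sketched offline, assuming you have retrieved one family label per annotated protein for each genome. The labels and counts below are invented for illustration and are not real Gene3D results; the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Row = Tuple[str, int, int]  # (family label, count in genome A, count in genome B)


def compare_counts(genome_a: Iterable[str], genome_b: Iterable[str]) -> List[Row]:
    """Tabulate how many proteins carry each family label in the two genomes."""
    counts_a, counts_b = Counter(genome_a), Counter(genome_b)
    families = sorted(set(counts_a) | set(counts_b))
    # A Counter returns 0 for labels it has never seen, so absences show up as 0.
    return [(family, counts_a[family], counts_b[family]) for family in families]


def unique_to_first(rows: List[Row]) -> List[str]:
    """Families present in genome A but completely absent from genome B."""
    return [family for family, a, b in rows if a > 0 and b == 0]


# Invented annotations: one label per protein.
genome_a = ["family_x", "family_x", "family_x", "family_y"]
genome_b = ["family_y", "family_y", "family_z"]

rows = compare_counts(genome_a, genome_b)
for family, a, b in rows:
    print(f"{family}\t{a}\t{b}")
print("only in A:", unique_to_first(rows))
```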