CATH FAQ

This page answers the questions we receive most frequently at CATH and is the best place to start looking if you have a question about anything to do with the CATH resource.

Please note that these documentation pages are still in their infancy, so some questions do not yet have answers. A question listed without an answer is one that we know is important, and we will document the answer as soon as we can.

FAQ: Introduction

What is CATH?

The CATH database is a hierarchical domain classification of protein structures in the Protein Data Bank. Protein structures are classified using a combination of automated and manual procedures. There are four major levels in this hierarchy:

Class - structures are classified according to their secondary structure composition (mostly alpha, mostly beta, mixed alpha/beta or few secondary structures).

Architecture - structures are classified according to their overall shape, as determined by the orientations of the secondary structures in 3D space, ignoring the connectivity between them.

Topology (fold family) - structures are grouped into fold groups at this level depending on both the overall shape and connectivity of the secondary structures.

Homologous superfamily - this level groups together protein domains which are thought to share a common ancestor and can therefore be described as homologous.

How does it help me?

For any given structure classified in the database, CATH gives you information on the structure and function of that protein. The evolutionary relationships involving the structure of interest and other proteins in the database can also be determined.

CATH also gives an overall view of the known protein structure universe to date. You can find which folds and superfamilies are the most populated, for example, and which structures are rare in nature.

Who maintains CATH?

Maintaining the CATH database is very much a team effort. Most of the members of the Orengo group have helped with the manual curation of the database and some have developed algorithms to aid with the automated aspects of maintaining and updating it.

Core CATH Team

Ian Sillitoe is the CATH Manager. Nicola Bordin is a Research Associate involved in algorithm development for the generation of Functional Families and their applications. Natalie Dawson was the CATH curator and a Research Associate in the group. Vaishali Waman is a Research Associate in the team and is now the CATH curator.

FAQ: CATH Data

What do the letters "C.A.T.H.S.O.L.I.D" mean?

CATH is a tree-like, hierarchical classification that starts off at the tree “trunk” by clustering protein domains into broad categories (e.g. C, or class, where domains are clustered solely based on their general secondary structure content). As the hierarchy moves away from the “trunk” to the “branches”, more stringent clustering criteria are applied to provide clusters of domains with finer granularity of similarity.

Depth	Letter	Name	Clustering criteria
1	C	Class	Secondary structure content
2	A	Architecture	General spatial arrangement of secondary structures
3	T	Topology	Spatial arrangement and connectivity of secondary structures (fold)
4	H	Homologous Superfamily	Manual curation of evidence of evolutionary relationship (at least two criteria from sequence/structure/function must be observed)
5	S	Sequence Family (S35)	>= 35% sequence similarity
6	O	Orthologous Family (S60) *	>= 60% sequence similarity
7	L	“Like” domain (S95) *	>= 95% sequence similarity
8	I	Identical domain (S100)	100% sequence similarity
9	D	Domain counter	Unique domains

* We are aware that the names “Orthologous” and “Like” are by no means perfect descriptions of the clustering criteria that they represent. However we find it useful to provide some kind of label for these clusters and (quite frankly) these are the best we could come up with.

[from Ivan Kon on 20/10/2008]

How does the numbering in the CATH classification work?

CATH is a hierarchical classification that clusters protein structures at differing levels of similarity. The first level, Class, clusters proteins based on their general secondary structure content and is represented by the first number in the CATH code (the 'C' column in the table below).

Domain	CATH code	C	A	T	H	S	O	L	I	D
1nr3A00	3.30.1190.10.1.1.1.1.1	3	30	1190	10	1	1	1	1	1

A more detailed explanation of the numbering involved in sequence clusters (SOLID levels) can be found in this blog entry.
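
As an illustration of the numbering, the following minimal Python sketch splits a dot-separated CATH code into the level letters from the table above. The function name `split_cath_code` is hypothetical and is not part of any CATH software. The sketch also assumes that a shorter code (for example one with four numbers) names a node higher up the hierarchy.

```python
from typing import Dict

# Level letters in order of depth, as in the C.A.T.H.S.O.L.I.D table.
LEVELS = ("C", "A", "T", "H", "S", "O", "L", "I", "D")


def split_cath_code(code: str) -> Dict[str, int]:
    """Map each level letter to its number in a dot-separated CATH code.

    Hypothetical helper for illustration only. Codes with fewer than nine
    numbers are accepted and only the leading levels are returned
    (an assumption, see the text above).
    """
    parts = code.split(".")
    if len(parts) > len(LEVELS):
        raise ValueError(f"too many levels in CATH code: {code!r}")
    # int() raises ValueError if any part is not a whole number.
    return {letter: int(part) for letter, part in zip(LEVELS, parts)}


# The domain 1nr3A00 from the table above.
levels = split_cath_code("3.30.1190.10.1.1.1.1.1")
print(levels["C"], levels["A"], levels["T"], levels["H"])
```

For the example domain this gives Class 3, Architecture 30, Topology 1190 and Homologous Superfamily 10, matching the columns of the table.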

What do the numbers in CATH versions mean?

For a particular CATH version, for example 3.2.0, the first number indicates the most recent major CATH database release (i.e. version 3.0.0), whilst the second number indicates a minor release. Version 3.2.0 is therefore the second update of the major CATH release 3.0.0. The third number is used for internal purposes.
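
The version scheme can be sketched in Python as follows. This is only an illustration: the names `CathVersion`, `parse_cath_version` and `major_release_of` are hypothetical, and the sketch assumes the version is always written with three dot-separated numbers, optionally prefixed with "v".

```python
from typing import NamedTuple


class CathVersion(NamedTuple):
    major: int     # most recent major release
    minor: int     # minor release (update) of that major release
    internal: int  # used for internal purposes


def parse_cath_version(version: str) -> CathVersion:
    """Split a version string such as '3.2.0' or 'v3.2.0' (hypothetical helper)."""
    text = version[1:] if version.startswith("v") else version
    parts = text.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected three numbers in version: {version!r}")
    major, minor, internal = (int(part) for part in parts)
    return CathVersion(major, minor, internal)


def major_release_of(version: str) -> str:
    """Return the major release that a given version updates."""
    return f"{parse_cath_version(version).major}.0.0"


print(parse_cath_version("3.2.0"))
print(major_release_of("3.2.0"))
```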

What do the domain identifiers mean?

A domain identifier is assigned to every classified domain in the CATH database. It consists of a 4-character PDB code, for example 1kcm, followed by the chain name, denoted by a letter, and a two-digit domain number. If there is only one chain, it will be assigned the letter A in the same way as the first chain in a multi-chain structure. If there is only one domain in the chain then 00 is used for the domain number. The structure 1kcm has only a single domain in a single chain; the domain identifier will therefore be 1kcmA00.
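
A minimal Python sketch of how a 7-character domain identifier breaks down is shown below. The names `DomainId` and `parse_domain_id` are hypothetical. The pattern is also an assumption: it accepts any single alphanumeric character as the chain name, which is slightly broader than the "letter" described above.

```python
import re
from typing import NamedTuple


class DomainId(NamedTuple):
    pdb_code: str       # 4-character PDB code, e.g. "1kcm"
    chain: str          # single-character chain name, e.g. "A"
    domain_number: int  # two-digit domain number; 0 means a single-domain chain


# Assumed layout: 4 alphanumeric characters, 1 chain character, 2 digits.
_DOMAIN_ID = re.compile(r"([0-9A-Za-z]{4})([0-9A-Za-z])([0-9]{2})")


def parse_domain_id(domain_id: str) -> DomainId:
    """Split a domain identifier such as '1kcmA00' into its parts."""
    match = _DOMAIN_ID.fullmatch(domain_id)
    if match is None:
        raise ValueError(f"not a 7-character domain identifier: {domain_id!r}")
    pdb_code, chain, number = match.groups()
    return DomainId(pdb_code, chain, int(number))


parsed = parse_domain_id("1kcmA00")
print(parsed.pdb_code, parsed.chain, parsed.domain_number)
```

The two-digit domain number is what makes the identifier seven characters long.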

Why did the domain identifiers change from 6 to 7 characters between v2.6.0 and v3.0.0?

This was implemented due to the emergence of protein structures with more than nine domains. As experimental techniques for solving crystal structures have improved, the determination of protein structures with a large number of separate domains has increased.

Why did the chain identifiers change between v3.1.0 and v3.2.0?

This was due to the wwPDB remediation project. Please click here for further information.

Why did FunFam identifiers change between v4.2.0 and v4.3.0?

FunFams now have a more consistent numbering scheme based on the number of sequences contained in the 'seed' alignments at their time of generation. FunFam 1 has the highest number of sequences, FunFam 2 the second-highest, and so on.

FAQ: Web Pages

How do I search CATH?

A tutorial on how to search CATH can be found here.

How do I include CATH information in my application?

The answer is to use the CATH webservices. However, the CATH webservices are undergoing a major revamp and are still in testing. We will update this section when we move the webservices to production.

How do I get a link to my resource from the CATH database?

If you would like us to link to your resource and there is a natural mapping from one of the CATH entities (PDB, PDB Chain, Domain, Classification, etc.), then please get in touch.

Which is the latest version of CATH?

CATH v4.3 is the latest CATH release. More details can be found in the following paper: Sillitoe I, Bordin N, Dawson N, Waman VP, Ashford P, Scholes HM, Pang CSM, Woodridge L, Rauer C, Sen N, Abbasian M, Le Cornu S, Lam SD, Berka K, Varekova IH, Svobodova R, Lees J, Orengo CA. CATH: increased structural coverage of functional space. Nucleic Acids Res. 2021 Jan 8;49(D1):D266-D273. doi: 10.1093/nar/gkaa1079.
