CATH Documentation
Hooking into the EBI's proteomics resources
I've been at a training course at the EBI all week called “Programmatic access of proteomics resources” (surely “to”..?). The aim of the course was to introduce people to some of the various tools and libraries which enable you to remotely hook into their systems and use their data and services.
What follows is a quick run-down of the practical parts of the course, in case there's anything you might find useful in there. It strikes me that there's an awful lot of data at the EBI which is very easy to retrieve, and which could easily be used to automatically acquire and present additional functional information about proteins or domains in CATH or Gene3D – potentially handy to the users as well as during the curation process.
I'm happy to help out with any queries on this, to the best of my abilities, although for the really tricky questions you'll need to contact the appropriate developers/maintainers. The course was Java-centric, but many of the access methods are cross-platform, and others provide clients in other languages as well (generally Perl at least and often many more). They are running a similar course later in the year which is Perl-centric:
http://www.ebi.ac.uk/training/handson/course_080908_perlwebservices.html
The registration deadline is Monday 11th of August, if you're interested. Leave a comment at the bottom if you need me to explain anything better.
Andrew.
Data services
UniProt
The mother of all protein sequence resources actually contains four separate (but cross-referenced) databases these days:
- UniProtKB – 'classic' UniProt, basically Swiss-Prot and TrEMBL.
Sequences plus as much annotation as is available: ontological terms, database cross-references, evidence attributions etc.
- UniParc – new, revised and obsolete sequences from UniProt and many other databases.
Lets you get an audit trail, see if a sequence has been corrected, refer to a specific version of a sequence, etc. It's basically intended to be the archive of all protein sequences, anywhere, anytime…
- UniMES – sequences from metagenomics and environmental proteomics experiments, like Craig Venter's adventures in the Sargasso Sea.
These aren't necessarily tagged with species information or the other metadata you'd expect in UniProtKB.
- UniRef – non-redundant reference clusters from UniProt and UniParc, clustered at various different levels of sequence similarity.
The course covered two ways to access these resources: a Java API and REST web services.
The former, which requires you to download and install it, hides the network communication layer from you and lets you create UniProt objects and call their methods as if the databases resided on your own machine. All data is populated automatically on demand. You can query by gene or protein name, EC number, keyword etc., or blast your own sequence against the databases.
More here: http://www.ebi.ac.uk/uniprot/remotingAPI/doc.html
The REST services on the other hand provide a simple method to run queries and retrieve data over HTTP, in any programming language. The easiest way to use them is to retrieve a single sequence like so:
http://www.uniprot.org/uniprot/P12345.rdf
http://www.uniprot.org/uniprot/P12345.fasta
http://www.uniprot.org/uniprot/P68441
http://www.uniprot.org/uniprot/P06213.txt
http://www.uniprot.org/uniref/UniRef90_P33810.xml
http://www.uniprot.org/uniparc/UPI000000001F
The first part of the path ('uniprot') is the database to query, the second part ('P12345') is the identifier, and the suffix ('fasta') is the format (except in UniParc for some reason). There are also options for running more complex queries if you don't already know the identifier.
More here: http://www.uniprot.org/faq/28
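To make that concrete, here's a minimal sketch of using the URL pattern above from code (Python here purely for illustration – any language with an HTTP client works the same way):

import urllib.request

accession = "P12345"  # any UniProtKB accession
url = "http://www.uniprot.org/uniprot/%s.fasta" % accession
with urllib.request.urlopen(url) as response:
    print(response.read().decode("utf-8"))  # the FASTA record: header line plus sequence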
InterPro
InterPro describes itself as “a database of protein families, domains, repeats and sites in which identifiable features found in known proteins can be applied to new protein sequences”. It aggregates data from a variety of sources including CATH and Gene3D. Programmatically speaking, InterPro can be searched by identifier, using Dbfetch, EB-eye or SRS (see below), or you can whack a sequence into it using the InterProScan webservice, and get back all the relevant information from the member databases and ontologies (assuming it finds a close enough match).
InterProScan is a standard SOAP webservice but they have kindly provided example clients in about 8 different languages!
More here: http://www.ebi.ac.uk/Tools/webservices/clients/interproscan
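The supplied clients are the quickest route, but if you'd rather roll your own, calling a SOAP service from a script generally amounts to pointing a SOAP library at the WSDL and invoking an operation. Here's a rough sketch in Python using the zeep library – note that the WSDL location and the operation and parameter names below are placeholders rather than the real InterProScan interface, so take the actual details from the clients page above:

from zeep import Client  # a general-purpose Python SOAP library

# Placeholder WSDL location and operation/parameter names -- substitute the
# real ones from the InterProScan web service documentation.
client = Client("http://example.org/WSInterProScan.wsdl")
job_id = client.service.submitJob(sequence="MKTAYIAKQR", email="you@example.org")
print(job_id)  # typically an asynchronous job identifier you then poll for results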
Reactome
Reactome is a knowledgebase of biological pathways, compiled by topic experts with reference to the literature, and including database cross-references for all the biological entities that take part in each pathway. It covers various kinds of pathway, including metabolism, signalling, cell cycle control, viral lifecycles and lots more, with further topics in preparation.
It has a SOAP webservice API which lets you do things like finding all pathways that a given gene or protein is involved in (individually or in batches), or conversely, all molecules which are involved in a given pathway. Interestingly for a webservice, it also provides visualization methods that let you generate a diagram of all or part of a pathway in SVG format.
More here: http://www.reactome.org:8080/caBIOWebApp/docs/caBIG_Reactome_User_Guide.pdf
It also has a really neat visualization tool called SkyPainter, which lets you submit a list of genes, proteins or small molecules, and highlights the pathways they're involved in on a vast map of all the pathways it knows about. This can be invoked via HTTP without having to use the webservices API.
More here: http://www.reactome.org/userguide/skypainter_technical.html
IntAct
Complementary to Reactome, IntAct is a database of experimentally verified protein-protein interactions, with database cross-references and controlled vocabulary terms. Unlike the whole-pathway view taken by Reactome, IntAct is less holistic and works at the level of individual observed interactions – just because two proteins interact in a yeast experiment doesn't mean they're ever expressed in the same tissues in human, etc.
IntAct offers two main ways to access its data from your own code. You can download all or parts of the interaction database in various XML or flatfile formats, and use their supplied Java API to read it, index it and query it locally, or populate a local database with the same schema as theirs. Or you can connect to their database via a SOAP service and query it by protein, interaction type, source species, publication details or any other metadata, using the Molecular Interaction Query Language (MIQL).
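Purely to give a flavour of MIQL, queries are Lucene-style field:value expressions along these lines (the field names are quoted from memory, so double-check them against the MIQL documentation):

species:yeast AND detmethod:"two hybrid"
identifier:P12345 AND type:"physical association"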
More here (slightly sketchy): http://www.ebi.ac.uk/~intact/devsite/
PRIDE
The Proteomics Identification Database is a repository for data submitted by groups doing high-throughput proteomics experiments – smash up some tissue, extract the proteins, identify them with chromatography, mass spec, protein arrays etc. and record the absolute and relative quantities. For a given experiment, you can find out what proteins were found and at what levels, or you can look for all experiments (or species or tissue types or…) where a given protein or set of proteins was found. Reactome's SkyPainter (see above) can be used for visualization, so you can see what pathways were most active in the sample at the time of extraction. Lots of metadata is supplied about experimental methods etc. All the data in PRIDE can be searched or browsed via its own website or accessed via BioMart (see below).
More here: http://www.ebi.ac.uk/pride/prideMartWebService.do
OLS and PICR
These are two useful utilities that I could see saving a lot of effort. OLS (Ontology Lookup Service) lets you browse and query a variety of different ontologies using a web interface or a SOAP service. You can get all terms matching a query string, as well as parent terms, children, root nodes, database cross-references and metadata – essentially all the useful ontology operations. It includes all (I think!) of the OBO ontologies, so GO, ChEBI (chemicals), various anatomical and developmental vocabularies for different organisms, taxonomy, and loads more.
More here: http://www.ebi.ac.uk/ontology-lookup/
The Protein Identifier Cross-Reference service (PICR) is also available through a website or SOAP service, and has a REST interface too. It maps protein IDs/accessions between different databases based on sequence identity, letting you find all equivalent identifiers for a given identifier or even for a given sequence.
More here: http://www.ebi.ac.uk/Tools/picr/
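As a hedged sketch of the REST side (the path and parameter names here are from memory, so verify them against the PICR page above), mapping a single accession to its equivalents boils down to one parameterised GET:

import urllib.parse
import urllib.request

# Endpoint and parameter names are assumptions -- check the PICR documentation.
base = "http://www.ebi.ac.uk/Tools/picr/rest/getUPIForAccession"
params = urllib.parse.urlencode([("accession", "P12345"), ("database", "SWISSPROT")])
with urllib.request.urlopen(base + "?" + params) as response:
    print(response.read().decode("utf-8"))  # XML listing the equivalent identifiers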
APIs
As well as these databases and services themselves, we also covered various access points which allow you to query several different databases.
Dbfetch
This is a generic method for retrieving records from EBI databases by identifier (accession number etc.), in a variety of human- or machine-readable formats. Around 25 'databases' are accessible through this method, although some of these are actually different views over the same data. It can be operated via a web form by a human, but it's trivial to call it in a REST-like way from your scripts or programs just by making normal HTTP GET requests, and choosing an easily-parseable output format:
http://www.ebi.ac.uk/cgi-bin/dbfetch?db=uniprot&id=P12345&format=fasta&style=raw
Yes, there is indeed redundancy between this approach and the REST methods for UniProt above. Such is life…
More here: http://www.ebi.ac.uk/cgi-bin/dbfetch
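From a script, that request is just an ordinary HTTP GET – for example (Python used purely for illustration):

import urllib.request

url = ("http://www.ebi.ac.uk/cgi-bin/dbfetch"
       "?db=uniprot&id=P12345&format=fasta&style=raw")
with urllib.request.urlopen(url) as response:
    print(response.read().decode("utf-8"))  # raw FASTA, ready to parse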
You can also access Dbfetch via a SOAP web service if that's your bag.
More here: http://www.ebi.ac.uk/Tools/webservices/services/dbfetch
EB-eye
This is the search engine (actually Apache Lucene) that powers the search box at the top of each EBI web page. It allows many of the data and literature resources within the EBI, as well as the website itself, to be searched with free-text queries. Most importantly, in this context at least, there's a SOAP webservice API that lets you run your own queries remotely. You can choose which 'domains' (data sources) and fields you want to search in, and what the format and content of the output should be.
More here: http://www.ebi.ac.uk/Tools/webservices/services/eb-eye
SRS
Although it's a little old, SRS is still a pretty powerful way to formulate complex queries, and covers a startling multitude of databases and other resources, split into various biological groups. It has its own query language which allows you to link databases together and restrict and select particular fields, meaning you can ask questions across resources like “show me all the proteases in Swiss-Prot which occur in zebrafish”:
[SWISSPROT-all:protease]<[TAXONOMY-all:"Zebra fish"]
As with Dbfetch, you can send these requests from your scripts or programs as a standard HTTP GET and specify a textual format for the results (no XML though!), but with much more expressiveness than Dbfetch. There are also sample clients provided in various languages to take the hard work out of it for you.
More here: http://www.ebi.ac.uk/~srs/wiki/doku.php?id=guides:linkingtosrs
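A rough sketch of firing the example query above at SRS over HTTP – the host, CGI path and flags are assumptions based on the usual SRS wgetz interface, so take the definitive syntax from the guide linked above:

import urllib.parse
import urllib.request

# The wgetz URL and the '-e' flag (return full entries) are assumptions; see
# the linking-to-SRS guide for the real syntax and options.
query = '[SWISSPROT-all:protease]<[TAXONOMY-all:"Zebra fish"]'
url = "http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-e+" + urllib.parse.quote(query)
with urllib.request.urlopen(url) as response:
    print(response.read().decode("utf-8", errors="replace"))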
Encore
Part of the wider ENFIN project, Encore provides a common mechanism for retrieving annotations for sets of query proteins from multiple databases, including UniProt, IntAct, Reactome, PRIDE, ArrayExpress, GO and KEGG. It works by passing an XML document from database to database via SOAP webservices. Each service parses the document, extracts the original set of proteins and optionally any previously-added annotations that it understands, runs its own queries and adds the results to the document in a pre-defined format.
There is a web front end for Encore here:
http://www.ebi.ac.uk/enfin-srv/envision/index.html
but Encore is designed primarily to be invoked via its API. Because the ENFIN XML format is quite complex, a utility service is provided which generates a valid XML document from a list of supplied protein identifiers. You can then pass this document to any of the Encore services. The services can be chained together pretty much seamlessly, either in a client script or in a workflow manager like Taverna, since they all work with the same ENFIN XML schema and know what to expect from, and what to add to, the document.
More here: http://www.ebi.ac.uk/seqdb/confluence/display/Proteomics/ENFIN+web+services+description
and here: http://www.ebi.ac.uk/seqdb/confluence/display/Proteomics/Java+Code+Samples
BioMart
BioMart is a toolkit, written in Perl, for turning a database into a mini data warehouse. Using a GUI, and without having to write any code by hand, it will let you transform the schema and contents of your database into a special denormalized schema optimized for very fast querying. Also, the resulting mart comes with several extra features for free:
- A standard web interface, allowing the user to build complex queries over your data.
- A simple web service interface using XML over HTTP, with the same functionality.
- A Perl API to make writing web service clients easier (the Java one is thoroughly out of date).
- The ability to federate with other BioMart databases, even at other organizations, allowing distributed queries.
Various databases, including PRIDE, Ensembl, WormBase, Reactome and HapMap, have BioMart implementations – there's a list on the BioMart website, from where you can also run queries against any of them.
More here: http://www.biomart.org/
and here for an example of a mart in action: http://www.ebi.ac.uk/pride/prideMart.do
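To give a feel for the web service side, a query is just an XML document passed as the 'query' parameter of an HTTP request to a mart's martservice endpoint. The sketch below is illustrative only: the endpoint, dataset, filter and attribute names all vary from mart to mart, so take real ones from the mart's own web interface:

import urllib.parse
import urllib.request

# Illustrative query XML -- dataset/filter/attribute names differ between marts.
query_xml = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName="default" formatter="TSV" header="1" uniqueRows="1" count="">
  <Dataset name="hsapiens_gene_ensembl" interface="default">
    <Filter name="chromosome_name" value="21"/>
    <Attribute name="ensembl_gene_id"/>
  </Dataset>
</Query>"""

url = "http://www.biomart.org/biomart/martservice"  # or a specific mart's own endpoint
data = urllib.parse.urlencode({"query": query_xml}).encode("utf-8")
with urllib.request.urlopen(url, data=data) as response:  # sending data makes this a POST
    print(response.read().decode("utf-8"))                # tab-separated results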