Dr. Andrew B. Clegg

Andrew with a half-metre sausage, found (and eaten) on holiday in Germany recently

Senior Research Associate, CATH Development

A member of the Orengo group since June 2008, I am the technical lead on the FuncNet platform, which brings together an ensemble of protein function analysis tools from various groups around Europe. This work is supported by the EU-funded EMBRACE and ENFIN research networks.

I'm involved in various other initiatives to extend the capabilities of CATH and Gene3D and enable them to interoperate better with bioinformatics resources at other organizations.

I also write and maintain biotext.org.uk.



Academic Background

I am a data scientist and software developer with experience in the fields of molecular biology, clinical research, public health and the pharmaceutical industry.

My MSc and PhD projects at Birkbeck were on text mining techniques for bioinformatics, and this is a field I am still involved in. My thesis, supervised by Dr. Adrian Shepherd, was on parsing sentences into phrase structure trees and dependency graphs, and extracting facts about gene regulation from these syntactic structures.

As an undergrad I studied History and Philosophy of Science at UCL and I am still interested in the history and public perception of science, technology and medicine.

Current Research Interests

As of Summer 2010, I'm mostly working on:

  • Visualization of biological networks
  • Information retrieval and text mining
  • Integrative statistical methods for protein function prediction
  • Web service development (SOAP, REST, JSON-RPC)
  • Rich internet applications (AJAX, GWT)

Outside of the lab, I'm the sole developer of the GraphSpider/MPL natural-language processing toolkit, and also involved with Smesh, a platform for analysis of social media data.

Publications


Alison Cuff, Ian Sillitoe, Tony Lewis, Andrew Clegg, Robert Rentzsch, Nicholas Furnham, Marialuisa Pellegrini-Calace, David T. Jones, Janet Thornton and Christine A. Orengo, “Extending CATH: Increasing Coverage of the Protein Structure Universe and Linking Structure with Function”, in Nucleic Acids Research Database Issue 39 (2010).

Juan A. G. Ranea, Ian Morilla, Jon G. Lees, Adam J. Reid, Corin Yeats, Andrew B. Clegg, Francisca Sánchez Jiménez and Christine Orengo, “Finding the 'Dark Matter' in Human and Yeast Protein Network Prediction and Modelling”, in PLoS Computational Biology 6:9 (2010).

Contributor, The management of bacterial meningitis and meningococcal septicaemia in children and young people younger than 16 years in primary and secondary care (National Institute for Health and Clinical Excellence, 2010).

Adam J. Reid, Juan A. G. Ranea, Andrew B. Clegg and Christine A. Orengo, “CODA: Accurate Detection of Functional Associations between Proteins in Eukaryotic Genomes Using Domain Fusion”, in PLoS ONE 5:6 (2010).

Steve Pettifer, Jon Ison, Matus Kalas, Dave Thorne, Philip McDermott, Inge Jonassen, Ali Liaquat, Jose M. Fernandez, Jose M. Rodriguez, INB-Partners, David G. Pisano, Christophe Blanchet, Mahmut Uludag, Peter Rice, Edita Bartaseviciute, Kristoffer Rapacki, Maarten Hekkelman, Olivier Sand, Heinz Stockinger, Andrew B. Clegg, Erik Bongcam-Rudloff, Jean Salzemann, Vincent Breton, Teresa K. Attwood, Graham Cameron and Gert Vriend, “The EMBRACE Web Service Collection”, in Nucleic Acids Research Web Servers Issue (2010).

Corin Yeats, Jon Lees, Oliver Redfern, Andrew Clegg and Christine Orengo, “Gene3D: Merging Structure and Function For a Thousand Genomes”, in Nucleic Acids Research Database Issue 38:D296-D300 (2009).

Pascal Kahlem, Andrew Clegg, Florian Reisinger, Ioannis Xenarios, Henning Hermjakob, Christine Orengo and Ewan Birney, “ENFIN -- A European network for integrative systems biology”, in Comptes Rendus Biologies 332:11 (2009).

Contributor, Reducing differences in the uptake of immunisations (National Institute for Health and Clinical Excellence, 2009).

Jose M. G. Izarzugaza, Anja Baresic, Lisa E. M. McMillan, Corin Yeats, Andrew B. Clegg, Christine A. Orengo, Andrew C. R. Martin and Alfonso Valencia, “An integrated approach to the interpretation of Single Amino Acid Polymorphisms within the framework of CATH and Gene3D”, in BMC Bioinformatics 10 (Suppl 8):S5 (2009).

Renata Kabiljo, Andrew B. Clegg and Adrian J. Shepherd, “A realistic assessment of methods for extracting gene/protein interactions from free text”, in BMC Bioinformatics 10:233 (2009).

Andrew B. Clegg and Adrian J. Shepherd, “Syntactic pattern matching with GraphSpider and MPL”, in Proceedings of the Third International Symposium on Semantic Mining in Biomedicine (SMBM'08) (Turku, Finland: 2008).

Andrew B. Clegg and Debbie Pledge, “Streamlining the clinical guideline production process with fuzzy citation matching”, in Proceedings of the First Conference on Text and Data Mining of Clinical Documents (Louhi'08) (Turku, Finland: 2008).

Andrew B. Clegg and Adrian J. Shepherd, “Text mining”, in Jon Keith (ed.), Bioinformatics Volume II: Structure, Function and Applications (Humana Press, New Jersey: 2008).

Christian Guy, Emma Goddard, Emily Milner, Lisa Murch, and Andrew B. Clegg, “Looking into the core of the sun”, in Hasok Chang and Catherine Jackson (eds.), An Element of Controversy: The Life of Chlorine in Science, Medicine, Technology and War (British Society for the History of Science: 2007).

Andrew B. Clegg and Adrian J. Shepherd, “Benchmarking natural-language parsers for biological applications using dependency graphs”, in BMC Bioinformatics 8:24 (2007).

Andrew B. Clegg and Adrian J. Shepherd, “Evaluating and integrating treebank parsers on a biomedical corpus”, in Proceedings of the Association for Computational Linguistics Workshop on Software (Ann Arbor, Michigan: 2005).

Other Interests

Along with Cass Johnston and Nathan Harmston, I run an informal meet-up group for bioinformatics people and anyone else interested in the technical side of what we do: London BioGeeks. We have monthly-ish technical meetings (with talks) and social nights (with beer). Come along sometime. I'm also involved with the London Java Community and Clojure Dojo.

I'm an occasional peer reviewer for Bioinformatics and the Journal of Biomedical Informatics, and I was on the programme committees for Louhi 2008, the BioNLP 2009 shared task, and the ACL 2010 track on “NLP for biology, medicine, law, etc.”

On a slightly less geeky note (or perhaps not) I am interested in electronica and avant-garde music, linguistics, London and cycling. Not usually at the same time, though. I also DJ occasionally at events like the Drones Club and write music reviews for websites such as Connexion Bizarre.


My main site is at http://biotext.org.uk. My Birkbeck homepage still exists, for historical value only.

Other CATH Team Members

Person Description
benoit Former Member In September 2011 I moved to Osaka, Japan, to work as a Post-Doctoral Fellow in Dr Mizuguchi's group at the National Institute of Biomedical Innovation. Research Interests My main research interests include the study of interactions between proteins and other molecules, both at the structural and network levels.
cuff [Me and my cat] CATH Manager I am responsible for the general management and manual curation of CATH. Academic Background As an undergraduate, I read for a BSc (Hons) degree in Biomedical Sciences at the University of Durham and then, after deciding I wanted to pursue bioinformatics research, I took an MSc degree in Information Technology at the University of Teesside (this was all back in the days before MSc courses in Bioinformatics became available!).
lee Post Doctoral Research Fellow I work for the Midwest Center for Structural Genomics (MCSG). My responsibilities include selecting protein targets for structure determination, monitoring the success of target selection strategies, and providing homology models of relatives of MCSG structures.
lees Gene3D Since arriving in October 06 I've been doing development of the Gene3D database in collaboration with Corin Yeats. I also maintain the current Gene3D website. I am involved in several collaborations with experimentalists. Recently (June 2009) I have started a new post employed by ENFIN coordinating a chromosome condensation prediction project, with Juan Ranea (Malaga) and the Ellenberg group (EMBL) (amongst others). We are using novel high throughput phenotype data (Ellenberg Group) a…
lewis [Me in Malaysia] Senior Programmer I was heavily involved in the complete rewrite of the CATH update procedure that culminated in CATH v3.0.0. I am still involved in maintaining and developing CATH in an ongoing consultancy capacity. Academic Background MSc Intelligent Systems, UCL (2002-2003)
orengo See departmental staff page
perkins [Me] London Pain Consortium PhD Student I am a member of the London Pain Consortium, an initiative formed in 2002 by a grant from the Wellcome Trust. I am currently moving into the first year proper of my PhD, supervised by Christine and based in the CATH lab, having completed a year of 3 rotations, working on projects with different labs.
phil [Me] Role in CATH I am a postdoctoral research associate. One of my responsibilities is the target selection database for the structural genomics project of the Center for Structural Genomics of Infectious Diseases (CSGID). Research Interests CSGID applies state-of-the-art high-throughput structural biology technologies to experimentally characterise the three-dimensional atomic structure of targeted proteins from pathogens in the NIAID Category A-C priority lists and organisms causing emerging and re-eme…
redfern [Posing on the southbank of the Thames] Post-Doctoral Research fellow I work as part of the Midwest Consortium for Structural Genomics, aiding target selection and analysis of the novelty of the protein structures they produce. In parallel, I also develop methods for homology recognition and function prediction from protein structure and sequence.
reid [Me enjoying a traditional Japanese kaiseki meal in a ryokan somewhere outside Kyoto] PhD student I am currently nearing the end of my PhD and planning to submit by the end of the year.
rentzsch [Me] Former PhD student I did my PhD in the lab between 2007 and 2012, funded by an EU grant (ENFIN). The ENFIN Network of Excellence aims at close collaboration between experimental and computational groups throughout Europe. I've also worked as a research assistant here.
sillitoe [Me with one of the Sillitoe clan (I'm the one on the right)] CATH Technical Manager I am responsible for the technical aspect of CATH. This generally involves maintaining and developing both the front-end interfaces (internal and external web pages and webservices) and back-end code library and databases.
studer
yeats Gene3D and BioMiner Gene3D: Design and development, HMM library construction and prediction verification, and web services. Academic Background PhD at the Sanger Institute (2004), supervised by Alex Bateman (Pfam). Thesis: Biological Investigations Through Sequence Analysis.