# So much to do, so little time

Trying to squeeze sense out of chemical data

## Substructure Matching, REST style

I’ve been putting up a number of REST services for a variety of cheminformatics tasks. One that was missing was substructure searching. In many scenarios it’s useful to be able to check whether a target molecule contains a query substructure or not. This can now be done by visiting URL’s of the form

 1 http://rguha.ath.cx/~rguha/cicc/rest/substruct/TARGET/QUERY

where TARGET and QUERY are SMILES and SMARTS (or SMILES) respectively (appropriately escaped). If the query pattern is found in the target molecule then the resultant page contains the string “true” otherwise it contains the string “false”. The service uses OpenBabel to perform the SMARTS matching.

Using this service, I updated the ONS data query page to allow one to filter results by SMARTS patterns. This generally only makes sense when no specific solute is selected. However, filtering all the entries in the spreadsheet (i.e., any solvent, any solute) can be slow, since each molecule is matched against the SMARTS pattern using a separate HTTP requests. This could be easily fixed using POST, but it’s a hack anyway since this type of thing should probably be done in the database (i.e., Google Spreadsheet).

### Update

The substructure search service is now updated to accept POST requests. As a result, it is possible to send in multiple SMILES strings and match them against a pattern all at one go. See the repository for a description on how to use the POST method. (The GET method is still supported but you can only match a pattern against one target SMILES). As a result, querying the ONS data using SMARTS pattens is significantly faster.

Written by Rajarshi Guha

February 3rd, 2009 at 6:01 pm

## Update to the REST Descriptor Services

The current version of the REST interface to the CDK descriptors allowed one to access descriptor values for a SMILES string by simply appending it to an URL, resulting in something like

http://rguha.ath.cx/~rguha/cicc/rest/desc/descriptors/
org.openscience.cdk.qsar.descriptors.molecular.ALOGPDescriptor/c1ccccc1COCC

This type of URL is pretty handy to construct by hand. However, as Pat Walters pointed out in the comments to that post, SMILES containing ‘#’ will cause problems since that character is a URL fragment identifier. Furthermore, the presence of a ‘/’ in a SMILES string necessitates some processing in the service to recognize it as part of the SMILES, rather than a URL path separator. While the service could handle these (at the expense of messy code) it turned out that there were subtle bugs.

Based on Pats’ suggestion I converted the service to use base64 encoded SMILES, which let me simplify the code and remove the bugs. As a result, one cannot append the SMILES directly to the URL’s. Instead the above URL would be rewritten in the form

http://rguha.ath.cx/~rguha/cicc/rest/desc/descriptors/
org.openscience.cdk.qsar.descriptors.molecular.ALOGPDescriptor/YzFjY2NjYzFDT0ND

All the example URL’s described in my previous post that involve SMILES strings, should be rewritten using base64 encoded SMILES. So to get a document listing all descriptors for “c1ccccc1COCC” one would write

http://rguha.ath.cx/~rguha/cicc/rest/desc/descriptors/YzFjY2NjYzFDT0ND

While this makes it a little harder to directly write out these URL’s by hand, I expect that most uses of this service would be programmatic – in which case getting base64 encoded SMILES is trivial.

Written by Rajarshi Guha

January 11th, 2009 at 5:52 pm

## Playing with REST Descriptor Services

As part of my work at IU I have been implementing a number of cheminformatics web services. Initially these were SOAP, but I realized that REST interfaces make life much easier. (also see here) As a result, a number of these services have simple REST interfaces. One such service provides molecular descriptor calculations, using the CDK as the backend. Thus by visiting  (i.e., making a HTTP GET request) a URL of the form

http://rguha.ath.cx/~rguha/cicc/rest/desc/descriptors/CC(=O)

you get a simple XML document containing a list of URL’s. Each URL represents a specific “resource”. In this context, the resource is the descriptor values for the given molecule. Thus by visiting

http://rguha.ath.cx/~rguha/cicc/rest/desc/descriptors/
org.openscience.cdk.qsar.descriptors.molecular.ALOGPDescriptor/CC(=O)C

one gets another simple XML document that lists the names and values of the AlogP descriptor. In this case, the CDK implementation evaluates AlogP, AlogP2 and molar refractivity – so there are actually three descriptor values. On the other hand something like the  molecular weight descriptor gives a single value. To just see the list of available descriptors visit

http://www.chembiogrid.org/cheminfo/rest/desc/descriptors

which gives an XML document containing a series of links. Visiting one of these links gives the “descriptor specification” – information on the vendor, version, reference to a descriptor ontology and so on.

(I should point out that the descriptors available in this service are from a pretty old version of the CDK. I really should update the descriptors to the 1.2.x versions)

### Applications

This type of interface makes it easy to whip up various applications. One example is the PCA analysis of compound collections. Another one I put together today based on a conversation with Jean-Claude was a simple application to plot pairs of descriptor values for a collection of SMILES.

The app is pretty simple (and quite slow, since it uses synchronous GET’s to the descriptor service for each SMILES and has to make two calls for each SMILES – hey, it was a quick hack!). Currently, it’s a bit restrictive – if a descriptor calculates multiple values, it will only use the first value. To see how many values a molecular descriptor calculates, see the list here.

With a little more effort one could easily have a pretty nice online descriptor calculation application rivaling a standalone application such as the the CDK descriptor GUI

Also,if you struggle with nice CSS layouts, the CSS Layout Collection is a fantastic resource. And jQuery rocks.

Written by Rajarshi Guha

January 7th, 2009 at 7:06 am

## Extending the REST PCA Service

I recently described a REST based service for performing PCA-based visualization of chemical spaces. By visiting a URL of the form

http://rguha.ath.cx/~rguha/cicc/rest/chemspace/default/
c1ccccc1,c1ccccc1CC,c1ccccc1CCC,C(=O)C(=O),CC(=O)O

one would get a HTML, plain text or JSON page containing the first two principal components for the molecules specified. With this data one can generate a simple 2D plot of the distributions of molecules in the “default” chemical space.

However, as Andrew Lang pointed out on FriendFeed, one could use SecondLife to look at 3D versions of the PCA results. So I updatesd the service to allow one to specify the number of components in the URL. The above form of the service will still work – you get the first two components by default.

To specify more components use an URL of the form

http://rguha.ath.cx/~rguha/cicc/rest/chemspace/default/3/mol1,mol2,mol3

where mol1, mol2, mol3 etc should be valid SMILES strings. The above URL will return the first three PC’s. To get just the first PC, replace the 3 with 1 and so on. If more components are requested than available, all components are returned.

Currently, the only available space is the “default” space which is 4-dimensional, so you can get a maximum of four components. In general, visit the URL

http://rguha.ath.cx/~rguha/cicc/rest/chemspace/

to obtain a list of currently available chemical spaces, their names and dimensionality.

### Caveat

While it’s easy to get all the components and visualize them, it doesn’t always make sense to do so. In general, one should consider those initial principal components that explain a significant portion of the variance (see Kaisers criterion). The service currently doesn’t provide the eigenvalues, so it’s not really possible to decide whether to go to 3, 4 or more components. For most cases, just looking at the first two principal components will sufficient – especially given the currently available chemical space.

### Update (Jan 13, 2009)

Since the descriptor service now requires that Base64 encoded SMILES, the example usage URL is now invalid. Instead, the SMILES should be replaced by their encoded versions. In other words the first URL above becomes

http://rguha.ath.cx/~rguha/cicc/rest/chemspace/default/
YzFjY2NjYzE=,YzFjY2NjYzFDQw==,YzFjY2NjYzFDQ0M=,
Qyg9TylDKD1PKQ==,Q0MoPU8pTw==

Written by Rajarshi Guha

January 3rd, 2009 at 1:14 am

## A Report from a Stranger in a Strange Land

I just got back from ACoP7, the yearly meeting of the International Society of Pharmacometrics (ISoP). Now, I don’t do any PK/PD modeling (hence the “strange land”) but was invited to talk about our high throughput screening platform for drug combinations. I also hoped to learn a little more about this field as well as get an idea of the state of quantitative systems pharmacology (QSP). This post is a short summary of some aspects of the meeting and the PK/PD field that caught my eye, especially as an outsider to the field (hence the “stranger”).

The practice of PK/PD is clearly quite a bit downstream in the drug development pipeline from where I work, though it can be beneficial to keep PK/PD aspects in mind even at the lead discovery/optimization stages. However I did come across a number of talks and posters that were attempting to bridge pre-clinical and clinical stages (and in some cases, even making use of in vitro) data. As a result the types of problems being considered were interesting and varied – ranging from models of feeding to predict weight loss/gain in neonates to analyzing drug exposure using mechanistic models.

A lot of PK/PD problems are addressed using model based methods, as opposed to machine learning methods (see Breiman, 2001). I have some familiarity with the types of statistics used, but in practice much of my work is better suited for machine learning approaches. However, I did come across nice examples of some methodologies that may be useful in QSAR type settings – including mixed effect models, IRT models and Bayesian methods. It was also nice to see a lot of people using R (ISoP even runs a Shiny server for members’ applications) and companies providing R solutions (e.g., Metrum, Mango) and came across a nice poster (Justin Penzenstadler, UMBC) comparing various R packages for NLME modeling. I also came across Stan, which seems like a good way to get into Bayesian modeling. Certainly worth exploring nore.

The data used in a lot of PK/PD problems is also qualitatively (and quantitatively) different from my world of HTS and virtual screening. Datasets tend to be smaller and noiser, which are challenging to model (hence less focus on purely data driven, distribution-free M/L methods). A number of presentations showed results with quite wide CI’s and significant variance in the observed properties. At the same time, models tend to be smaller in terms of features, which are usually driven by the disease state or the biology being modeled. This is in contrast to the 1000’s of descriptors we deal with in QSAR. However, even with smaller feature sets I got the impression that feature selection (aka covariate selection) is a challenge.

Finally, I was interested in learning more about QSP. Having followed this topic on and off (my initiation was this white paper), I wasn’t really up to date and was a bit confused between QSP and phsyiologically based PK (PBPK) models, and hoped this meeting would clarify things a bit. Some of the key points I was able to garner

• QSP models could be used to model PK/PD but don’t have to. This seems to be the key distinction between QSP and PBPK approaches
• Building a comprehensive model from scratch is daunting, and speaking to a number of presenters, it turns out many tend to reuse published models and tweak them for their specific system. (this also leads one to ask what is “useful”?)
• Some models can be very complex – 100’s of ODE‘s and there were posters that went with such large models but also some that went with smaller simplified models. It seems that one can ask “How big a model should you go for to get accurate results?” as well as “How small a model can you get away with to get accurate results?“. Model reduction/compression seems to be an actively addressed topic
• One of the biggest challenges for QSP models is the parametrization – which appears to be a mix of literature hunting, guesswork and some experiment. Examples where the researcher used genomic or proteomics data (e.g. Jaehee Shim, Mount Sinai) were more familiar to me, but nonetheless, daunting to someone who would like to use some of this work, but is not an expert in the field (or a grad student who doesn’t sleep). PK/PD models tend to require fewer parameters, though PBPK models are more closer to QSP approaches in terms of their parameter space.
• Where does one find models and parameters in reusable (aka machine readable) formats? This is an open problem and approaches such as DDMoRE are addressing this with a repository and annotation specifications.
• Much of QSP modeling is done in Matlab (and many published models are in the form of Matlab code, rather than a more general/abstract model specification). I didn’t really see alternative approaches (e.g., agent based models) to QSP models beyond the ODE approach.
• ISoP has a QSP SIG which looks like an interesting place to hang out. They’ve put out some papers that clarify aspects of QSP (e.g., a QSP workflow) and lay out a roadmap for future activities.

So, QSP is very attractive since it has the promise of supporting mechanistic understanding of drug effects but also allowing one to capture emergent effects. However, it appears to be very problem & condition specific and it’s not clear to me how detailed I’d need to get to reach an informative model. It’s certainly not something I can pull off-the-shelf and include in my projects. But definitely worth tracking and exploring more.

Overall, it was a nice experience and quite interesting to see the current state of the art in PK/PD/QSP and learn about the challenges and successes that people are having in this area. (Also, ISoP really should make abstracts publicly linkable).

Written by Rajarshi Guha

October 26th, 2016 at 9:39 pm