
# So much to do, so little time

Trying to squeeze sense out of chemical data

## Extending the REST PCA Service

I recently described a REST-based service for performing PCA-based visualization of chemical spaces. By visiting a URL of the form

http://rguha.ath.cx/~rguha/cicc/rest/chemspace/default/
c1ccccc1,c1ccccc1CC,c1ccccc1CCC,C(=O)C(=O),CC(=O)O

one would get an HTML, plain text or JSON page containing the first two principal components for the molecules specified. With this data one can generate a simple 2D plot of the distribution of molecules in the “default” chemical space.

However, as Andrew Lang pointed out on FriendFeed, one could use SecondLife to look at 3D versions of the PCA results. So I updated the service to allow one to specify the number of components in the URL. The above form of the service will still work – you get the first two components by default.

To specify more components use a URL of the form

http://rguha.ath.cx/~rguha/cicc/rest/chemspace/default/3/mol1,mol2,mol3

where mol1, mol2, mol3 etc. should be valid SMILES strings. The above URL will return the first three PCs. To get just the first PC, replace the 3 with 1, and so on. If more components are requested than are available, all components are returned.
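As a minimal sketch of how a client might assemble such a URL, the Python below joins the molecule strings with commas and optionally inserts the component count. The helper name `chemspace_url` and its parameters are hypothetical, and the sketch assumes the molecule strings are already in whatever form the service expects in the path (no escaping is applied).

```python
BASE = "http://rguha.ath.cx/~rguha/cicc/rest/chemspace"


def chemspace_url(mols, space="default", ncomp=None):
    """Build a request URL for the PCA service (hypothetical helper).

    mols  : molecule strings exactly as they should appear in the URL
    space : name of the chemical space
    ncomp : number of components; None uses the two-component default form
    """
    if not mols:
        raise ValueError("at least one molecule is required")
    parts = [BASE, space]
    if ncomp is not None:
        if ncomp < 1:
            raise ValueError("ncomp must be >= 1")
        parts.append(str(ncomp))
    parts.append(",".join(mols))
    return "/".join(parts)


print(chemspace_url(["mol1", "mol2", "mol3"], ncomp=3))
```

Leaving `ncomp` out reproduces the original two-component form of the URL.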

Currently, the only available space is the “default” space which is 4-dimensional, so you can get a maximum of four components. In general, visit the URL

http://rguha.ath.cx/~rguha/cicc/rest/chemspace/

to obtain a list of currently available chemical spaces, their names and dimensionality.

### Caveat

While it’s easy to get all the components and visualize them, it doesn’t always make sense to do so. In general, one should consider only those initial principal components that explain a significant portion of the variance (see Kaiser’s criterion). The service currently doesn’t provide the eigenvalues, so it’s not really possible to decide whether to go to 3, 4 or more components. For most cases, just looking at the first two principal components will be sufficient – especially given the currently available chemical space.
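To illustrate how eigenvalues would feed into that decision if they were available, here is a small sketch of Kaiser’s criterion, which retains components whose eigenvalue exceeds 1 (assuming the PCA was run on the correlation matrix). The eigenvalues below are made up for illustration; they are not values from the service.

```python
def kaiser_components(eigenvalues):
    """Count components with eigenvalue > 1 (Kaiser's criterion).

    Assumes the eigenvalues come from a PCA on the correlation matrix.
    """
    return sum(1 for ev in eigenvalues if ev > 1.0)


def explained_variance(eigenvalues):
    """Fraction of the total variance explained by each component."""
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]


# Made-up eigenvalues for a 4-dimensional space, for illustration only
eigenvalues = [2.4, 1.1, 0.3, 0.2]

print("components to keep:", kaiser_components(eigenvalues))
print("explained variance:", [round(f, 3) for f in explained_variance(eigenvalues)])
```

With numbers like these, the first two components would carry most of the variance and a third axis would add little.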

### Update (Jan 13, 2009)

Since the descriptor service now requires Base64-encoded SMILES, the example usage URL is now invalid. Instead, the SMILES should be replaced by their encoded versions. In other words, the first URL above becomes

http://rguha.ath.cx/~rguha/cicc/rest/chemspace/default/
YzFjY2NjYzE=,YzFjY2NjYzFDQw==,YzFjY2NjYzFDQ0M=,
Qyg9TylDKD1PKQ==,Q0MoPU8pTw==
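The encoded strings can be generated with any standard Base64 routine. A minimal sketch in Python follows; the helper name `encode_smiles` is hypothetical, and it assumes the service expects the standard Base64 alphabet with `=` padding, as in the URL above.

```python
import base64


def encode_smiles(smiles):
    """Return the standard Base64 encoding of a SMILES string."""
    return base64.b64encode(smiles.encode("ascii")).decode("ascii")


smiles = ["c1ccccc1", "c1ccccc1CC", "c1ccccc1CCC", "C(=O)C(=O)", "CC(=O)O"]
encoded = [encode_smiles(s) for s in smiles]

url = "http://rguha.ath.cx/~rguha/cicc/rest/chemspace/default/" + ",".join(encoded)
print(url)
```

Note that the standard alphabet can produce `+` and `/`, which have special meaning in URLs; whether the service accepts a URL-safe variant is not something this sketch assumes.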

Written by Rajarshi Guha

January 3rd, 2009 at 1:14 am

### 12 Responses to 'Extending the REST PCA Service'

1. Rajarshi, interesting approach. One of the things that’s bothered me about using SMILES (or other line notations) in URLs is the arbitrary URL size limit that certain browsers impose:

http://support.microsoft.com/kb/208427

It’s not hard to come up with SMILES that approach the limit.

I know this issue has been discussed elsewhere, but what do you think about it – especially if a URL encodes multiple SMILES as in:

http://rguha.ath.cx/~rguha/cicc/rest/chemspace/default/3/mol1,mol2,mol3

Rich Apodaca

3 Jan 09 at 1:47 am

2. Rich, good point. Yes, it’s definitely a problem – in fact for this type of application 10 or 20 molecules would probably reach the limit – not very useful at all!

From what I understand based on searching the web, most REST applications are based on a single ID which defines a specific resource. I haven’t really seen any discussions where people are passing datasets via GET.

On the other hand, from a CGI point of view, arbitrary sized payloads can be sent via POST. In fact, it would be trivial to rework the current services to handle data coming in from a POST request – but such an approach would be against REST principles as I understand them.

REST is handy if what you’re doing is based on a single SMILES – for datasets (i.e., collections of molecules), it may not be the best approach. If you have any insight into this I’d love to know.

Rajarshi Guha

3 Jan 09 at 2:11 am

3. Rajarshi, I think the key is to identify the actual resource being requested by the URI. Encoding the representation in the URI is not necessary at all – and may actually hinder the development of a truly RESTful system.

For example, if we think of a molecular_mass calculation as a RESTful resource, we could define these URIs:

POST /mass_calculations/ # adds a new molecular mass
# to the collection
# by reading form data (SMILES,
# molfile, CML, JSON, etc.)

POSTing to /mass_calculations/ would then redirect to
/mass_calculations/id/, where ‘id’ was generated by the server:

GET /mass_calculations/id/ # reads the molecular mass
# calculation identified by id

So it’s really not necessary to encode molfiles, SMILES, or even InChIs in the URI at all. Notice that you can pass datasets to the server of virtually unlimited size and detail this way.

Rich Apodaca

11 Jan 09 at 6:39 pm

4. Thanks for the description. However it seems that the scheme you described is equivalent to GET’ing with a SMILES directly. So, say I do a POST with a SMILES encoded as form data. According to your scheme the service would generate an ID and then redirect to /mass_calculations/ID – doesn’t this imply that one must store the ID, SMILES, MW tuple?

If one wanted to avoid storing data, then the POST method might as well redirect to /mass_calculations/SMILES, which is what I currently do.

So, it appears that your scheme implies server-side storage of IDs and associated data. This doesn’t seem scalable. Or have I misunderstood something?

Rajarshi Guha

11 Jan 09 at 6:54 pm

5. This Railscast might give you another perspective on the problem:

Although the implementation is in Rails, the concept comes through no matter your background.

You can see something similar that was implemented in the Chempedia query form:

http://chempedia.com/queries/new

Rich Apodaca

11 Jan 09 at 7:01 pm

6. I don’t think there’s anything that says the resource needs to be stored permanently. But doing so might allow you to do some very interesting things. It’s just an example of how developing with REST can change the way you look at your application.

If storage was an issue, how about a single-use resource that’s deleted the first time it’s viewed?

Rich Apodaca

11 Jan 09 at 7:05 pm

7. Rich, thanks for the pointer to the RailsCast. I can see how it’s useful – bookmarked searches are quite nice. Some time back, I did something similar that allowed you to create RSS feeds based on searches of PubChem – though it didn’t require me to store the searches in a db.

More generally, if an application doesn’t need a DB why have it? Sure one could delete the entry after the first view, but that seems hackish rather than elegant.

Rather, the service should be smart enough to figure out what’s coming in from the URL – if it looks like an ID, probably do a DB lookup, if it’s something else, do whatever is required.

You note that “I don’t think there’s anything that says the resource needs to be stored permanently.” and I agree – but in your scheme, if the POST redirects to a URL containing an ID, the service must be able to convert that ID to the actual resource / result – either by a lookup (DB) or by some form of processing. And if the latter, why not just send the actual ‘thing’?

Rajarshi Guha

11 Jan 09 at 7:21 pm

8. “More generally, if an application doesn’t need a DB why have it? Sure one could delete the entry after the first view, but that seems hackish rather than elegant.”

A temporary DB entry is just one method of temporarily caching a result. It’s common to do this sort of thing, for example, with sessions. Maybe it’s a hack, but it’s effective. I’m sure you could come up with several other ways to accomplish the same thing if running a DB was a problem. Flat files come to mind.

Also, with the proliferation of free, high-performance, lightweight, zero-maintenance RDBMSs such as H2:

http://www.h2database.com/html/main.html

And given that storing the jobs as they come in gives you so many more options for how to grow the site/service, it seems pretty compelling to just store all valid requests.

If the resource is computation-intensive to generate, the argument is even more compelling because this offers a caching mechanism.

I’ll concede that there still may be cases in which a db-free web service is the best way to go – I just can’t think of any. You?

“You note that “I don’t think there’s anything that says the resource needs to be stored permanently.” and I agree – but in your scheme, if the POST redirects to a URL containing an ID, the service must be able to convert that ID to the actual resource / result – either by a lookup (DB) or by some form of processing. And if the latter, why not just send the actual ‘thing’?”

If the ‘thing’ takes more than 2083 characters to encode as a URI, you have no choice but to use form data with POST, unless you don’t care about IE 6/7.

Rich Apodaca

11 Jan 09 at 7:56 pm

9. I can certainly see the value of having a backend DB in many scenarios. I agree that for a computation-intensive task a DB would be the way to go. But for something like a depiction service or a simple descriptor service, why bother with a back end? One could argue that doing a DB lookup after the first query, even for these simple services, might be faster than rerunning the calculation. This is something that should be benchmarked.

(At the same time, I think browsers will cache URLs – so in many cases where the request can be encoded in a GET URI, there would be no need to go on the network.)

But over time, given a popular service, the DB will grow. Then one hits the issue of scaling the database. Should we do pruning? Should we partition the DB?

I’m quite a fan of databases, but it seems to me that the decision to use a database (even no-hassle ones like the one you mentioned) implies some serious thinking on what the service is meant to do. For many of the services I’m running, I don’t see a need for a DB.

> If the ‘thing’ takes more than 2083 characters to encode
> as a URI, you have no choice but to use form data with
> POST, unless you don’t care about IE 6/7.

Oh right, I forgot about that. Yes, that’s true – in fact, this is an argument for a backend DB. If the ‘thing’ is large, a caching scheme will save bandwidth.

Rajarshi Guha

11 Jan 09 at 8:07 pm

10. “But over time, given a popular service, the DB will grow. Then one hits the issue of scaling the database. Should we do pruning? Should we partition the DB?”

If a DB just won’t work for your use case, there’s always the ultimate short-term caching solution:

in-memory storage

Under this scenario, the POST to /mass_calculations/ would update an integer id counter and then associate an in-memory record containing the request form data.

Then when the redirect to /mass_calculations/id/ occurs, the resource is created from the id and the in-memory record is destroyed.

Subsequent requests to the same URI would give a 404.

Clean and simple.
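A minimal sketch of this in-memory scheme, with no web framework involved: the class name `SingleUseStore`, its methods, and the `compute` callback are all hypothetical stand-ins, and the status codes are returned as plain integers rather than real HTTP responses.

```python
import itertools


class SingleUseStore:
    """In-memory store whose records are destroyed on first read."""

    def __init__(self):
        self._counter = itertools.count(1)  # integer id counter
        self._pending = {}                  # id -> request form data

    def post(self, form_data):
        """Record the form data and return the path to redirect to."""
        job_id = next(self._counter)
        self._pending[job_id] = form_data
        return f"/mass_calculations/{job_id}/"

    def get(self, job_id, compute):
        """Return (status, body); a second request for the same id gives 404."""
        form_data = self._pending.pop(job_id, None)
        if form_data is None:
            return 404, None
        return 200, compute(form_data)


store = SingleUseStore()
path = store.post({"smiles": "CCO"})

# Stand-in for the real calculation: just echo the submitted SMILES
echo = lambda data: data["smiles"]

print(path)
print(store.get(1, echo))
print(store.get(1, echo))
```

A multi-threaded server would need a lock around the counter and the dictionary; that is left out here to keep the sketch short.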

BTW, the other reason to not encode the request in the URI is if the ‘thing’ can’t be represented conveniently w/ URI-safe characters. That rules out many line notations.

Rich Apodaca

11 Jan 09 at 9:00 pm

11. Rich, good points, though for cases where the URI approach is fine I still see doing a GET directly on a URI as acceptable from a REST point of view. In general, I’d assume that the code could handle either a GET or a POST and do something intelligent with it.

Rajarshi Guha

11 Jan 09 at 10:10 pm

12. The simplest variation of all would be to eliminate the redirect altogether:

# returns mass calculation in json, xml, plaintext, or whatever
POST /mass_calculations/

# get method returns nothing
# GET /mass_calculations/id/

Rich Apodaca

11 Jan 09 at 11:25 pm