
Notes & thoughts from the IU semantics workshop


Over the last two days I attended a workshop titled Exploiting Big Data Semantics for Translational Medicine, held at Indiana University and organized by David Wild, Ying Ding, Katy Borner and Eric Gifford. The stated goals were to explore advances in translational medicine via data and semantic technologies, with a view towards possible fundable ideas and funding opportunities. It was a nicely arranged workshop and pretty intense – minimal breaks, constant thinking – which is a good use of two days. As you can see from the workshop website, the attendees brought a variety of skills and outlooks to the meeting. For me this was one of the most attractive features of the workshop.

This post is a rough dump of some observations & thoughts from the workshop – I’m sure I’ve left out important comments, I provide minimal attribution, and I assume there will be a more thorough report coming out from the organizers. I should also point out that I am an interested bystander to this field and somewhat of a semantic web/technology (SW/T) skeptic – so some views may be naive or just wrong. I like the ideas and concepts and I can see their value, but I have not been convinced to invest significant time and effort into “semantifying” my day to day work. A major motivation for attending this workshop was to learn what the experts are doing and see how I could incorporate some of these ideas into my own work.

The Meeting

The first day started with 5-minute introductions, which were quite useful, and great overview talks by three of the attendees. After this information dump, a major focus of the day was a discussion of opportunities and challenges. This was a very useful session, with attendees listing specific instances of challenges, opportunities, bottlenecks and so on. I was able to take some notes on the challenges, including

  • Funding – lack of it and difficulty in obtaining it (i.e., persuading funders)
  • Cultural and social issues around semantic approaches (e.g., why change what’s already working? etc)
  • Data problems, such as errors being propagated through ontologies and semantic conversion processes (I wonder to what extent this is a result of automated conversion processes such as D2R, versus manual errors introduced during curation. I suspect a mix of both)
  • “Hilbert Problems” – a very nice term coined by Katy to represent grand challenges or open problems that could serve as seeds around which the community could nucleate. (This aspect was of particular interest to me, as I have found it difficult to identify compelling life science use cases that justify a retooling (even a partial one) of current workflows.)

The second day focused on breakout sessions, based on the opportunities and challenges listed the day before. Some notes on some of these sessions:

Bridging molecular data and clinical data – this session focused on challenges and opportunities in using molecular data together with clinical data to inform clinical decision making. Three broad opportunities came out of this, viz., advancing understanding of disease conditions, optimizing data types/measurements for clinical decision making outcomes, and drug repurposing. These are certainly very broad goals, and not particularly focused on SW/T. My impression is that SW/T can play an important role in the standardization and optimization of coding standards to more easily and robustly connect molecular and clinical data sources. But one certainly needn’t invoke SW/T to address these opportunities.

Knowledge discovery – the considerations addressed by this group included the fact that semantified data (vocabularies, ontologies etc) is increasing in volume and availability, that tools are available to go from raw data to semantified forms, and so on. An important point was made that quality is a key consideration at multiple levels – the raw data, the semantic representation and the links between semantic entities. A challenge identified by this group was to identify use cases that SW/T can resolve and traditional technologies cannot.

RDBMS vs semantic databases – this was an interesting session that tried to address the question of when one type of database is better than the other. The consensus seemed to be that certain problems are better suited to one type than the other and that a hybrid solution is usually a sensible approach – but that goes without saying. A comment was made that certain classes of problems that involve identifying paths between terms (nodes) are better suited for semantic (graph) databases – this makes intuitive sense, but there was also a consensus that there weren’t any realistic applications one could point to. I like the idea – keep attributes in an RDBMS and links in a graph database, and use graph queries to identify relations and entities that are then mapped back to the RDBMS (a rough sketch of this is below). My concern with this is that path traversal itself is easy (Neo4j does this quite efficiently) – the problem is the explosion of possible paths between nodes and the fact that the majority of them are trivial at best or nonsensical at worst. This suggests that relevance/ranking is a concern in semantic/graph databases.
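
To make the hybrid idea a bit more concrete, here is a minimal sketch, assuming a hypothetical setup in which entity identifiers and their relationships live in Neo4j while the attribute data sits in a SQLite table keyed on the same identifiers. The node labels, property names, table schema and example identifiers are all made up for illustration.

import sqlite3
from neo4j import GraphDatabase  # official Neo4j Python driver

# Hypothetical connections - adjust to your own setup.
# Assumed SQLite schema: entity(id TEXT PRIMARY KEY, name TEXT, type TEXT)
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
conn = sqlite3.connect("entities.db")

def connecting_entities(src_id, dst_id, max_hops=4):
    """Find the entities on a shortest path in the graph, then pull their
    attributes from the relational store."""
    cypher = (
        "MATCH p = shortestPath((a:Entity {id: $src})-[*..%d]-(b:Entity {id: $dst})) "
        "RETURN [n IN nodes(p) | n.id] AS ids" % max_hops
    )
    with driver.session() as session:
        record = session.run(cypher, src=src_id, dst=dst_id).single()
    if record is None:
        return []
    ids = record["ids"]
    # Map the graph hits back to the RDBMS for the attribute data
    placeholders = ",".join("?" * len(ids))
    cur = conn.execute(
        "SELECT id, name, type FROM entity WHERE id IN (%s)" % placeholders, ids)
    return cur.fetchall()

print(connecting_entities("CHEMBL25", "P35354"))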

The session of most interest to me was that of grand challenges. I think we got to 5 or 6 major challenges:

  • How to represent knowledge (methods for, evaluation of)
  • How do changes in ontologies affect scientific research over time
  • How to construct an ontology from a set of ontologies (i.e. preexisting knowledge) that is better than the individual ones (and so links to how to evaluate an ontology in terms of “goodness”)
  • Error propagation from measurements to representation to analysis
  • Visualization of multi-dimensional/high-dimensional data – while a general challenge, I think it’s correct that visual representations of semantified data (and their supporting infrastructure, such as ontologies) could make the methods and tools much more accessible. It would’ve been nice to have had more discussion of this aspect.

We finally ended with a discussion of concrete projects that attendees would be interested in collaborating on, and this was quite fruitful.

My Opinion

It turns out that a good chunk of the discussion focused on translational medicine (clinical informatics, drug repurposing etc.) and the use of different data types to enable life science research, but largely independent of SW/T. Indeed, the role of SW/T seemed rather fuzzy at times – to some extent, a useful tool, but not indispensable. My impression was that much of the SW/T that was discussed really focused on labeling of knowledge via ontologies and making links between datasets and the challenges faced during these operations (which is fine and important – but does it justify funding?).

I certainly got some conflicting views of the state of the art. Comments from Amit Sheth made it appear that SW/T is well established and the main problems are solved, based on deployed applications in the “enterprise”. But comments from many of the attendees working in the life sciences suggested many problems in dealing and working with semantic data. Sure, Google has its Knowledge Graph and other search engines are employing SW/T under the hood. But if it’s so well established, where are the products, tools and workflows that an informatics-savvy non-expert in SW/T can employ? Does this mean research funding is not really required and it’s more of a productization/monetization issue? Or is this a domain-specific issue – what works for general search doesn’t necessarily work in the life sciences?

My fundamental issue is the absence of a “killer application” – an application or use case that gives a non-trivial result that could not be achieved via traditional means. (I qualify this by asking for such use cases in the life sciences. Maybe bankers have already found their killer applications.) Depending on the semantic technology one considers, there are partial answers: ontologies are an example of such a use case, when used to enable linkages between datasets and sources across domains. To me this makes perfect sense (and is of particular interest and use in current projects such as BARD). But surely there must be more than designing ontologies and annotating data with ontological terms? One of the things that surprised me was that some of the future problems considered for possible collaborations were not really dependent on SW/T – in other words, they could largely be addressed via pre-existing methodologies.

My (admittedly cursory) reading of the SW/T literature suggests that a major promise of this field is “reasoning” over my data. And I’m waiting for non-trivial assertions made on the basis of linked data, ontologies and so on – assertions that really highlight where my SQL tables will fail. It’s not sufficient (to me) to say that what took me 50 lines of Python code takes you 2 lines of SPARQL – I have an investment in my RDBMS, APIs and codebase, and yes, it takes a bit more fiddling, but I can get my answer in 5 minutes because it’s already set up.

Some points were made regarding challenges faced by SW/T, including the complexity of OWL, the difficulty of learning SPARQL, and poorly performing queries. Personally, I don’t see these as valid challenges, and I certainly do not claim that tricky SPARQL queries are preventing me from jumping into SW/T. I’m perfectly willing to wait 5 minutes for a SPARQL query to run, if the outcome is of sufficient value. The bigger issue for me is the value of the outcomes – maybe it’s just too early for truly novel, transformative results to be produced. Or maybe it’s simply one tool amongst others that can be used to tackle a certain class of problems.

Overall, it was a worthwhile two days interacting with a group of interesting people. But there is definitely some fuzziness in terms of what role SW/T can, should or will play in translational life science research.

Written by Rajarshi Guha

March 27th, 2013 at 7:26 pm

ChEMBL in RDF and Other Musings


Earlier today, Egon announced the release of an RDF version of ChEMBL, hosted at Uppsala. A nice feature of this setup is that one can play around with the data via SPARQL queries as well as explore the classes and properties that the Uppsala folks have implemented. Having fiddled with SPARQL on and off, it was nice to play with ChEMBL since it contains such a wide array of data types. For example, the following query finds articles referring to an assay (or experiment) run in mice targeting isomerases:

PREFIX  chembl:  <http://rdf.farmbio.uu.se/chembl/onto/#>
SELECT DISTINCT ?x ?pmid ?pdesc ?DESC WHERE {
?protein chembl:hasKeyword "Isomerase" .
?x chembl:hasTarget ?protein .
?protein chembl:hasDescription ?pdesc .
?x chembl:organism  "Mus musculus" .
?x chembl:hasDescription ?DESC .
?x chembl:extractedFrom ?resource .
?resource <http://purl.org/ontology/bibo/pmid> ?pmid
}
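
If you’d rather run the query programmatically than through a web form, something like the SPARQLWrapper package does the job. A minimal sketch is below – note that the endpoint URL is a placeholder for whatever the Uppsala service actually exposes, and the query is just the one shown above.

from SPARQLWrapper import SPARQLWrapper, JSON

query = """
PREFIX  chembl:  <http://rdf.farmbio.uu.se/chembl/onto/#>
SELECT DISTINCT ?x ?pmid ?pdesc ?DESC WHERE {
?protein chembl:hasKeyword "Isomerase" .
?x chembl:hasTarget ?protein .
?protein chembl:hasDescription ?pdesc .
?x chembl:organism  "Mus musculus" .
?x chembl:hasDescription ?DESC .
?x chembl:extractedFrom ?resource .
?resource <http://purl.org/ontology/bibo/pmid> ?pmid
}
"""

# Placeholder endpoint URL - substitute the actual Uppsala SPARQL endpoint
sparql = SPARQLWrapper("http://rdf.farmbio.uu.se/chembl/sparql")
sparql.setQuery(query)
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["pmid"]["value"], row["DESC"]["value"])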

I’ve been following the discussion on RDF and the Semantic Web for some time. While I can see a number of benefits from this approach, I’ve never been fully convinced as to its utility. In too many cases, the use cases I’ve seen (such as the one above) could have been handled relatively trivially via traditional SQL queries. There hasn’t been a really novel use case that leads to an ‘Aha! So that’s what it’s good for’ moment.

Egon’s announcement today led to a discussion on FriendFeed. I think I finally got the point that SPARQL queries are not magic and could indeed be replaced by traditional SQL. The primary value of RDF is the presence of linked data – which is slowly accumulating in the life sciences (cf. LODD and Bio2RDF).

Of the various features of RDF that I’ve heard about, the ability to define and use equivalence relationships seems very useful. I can see this being used to jump from domain to domain by recognizing properties that are equivalent across domains. Yet, as far as I can tell, this requires that somebody define these equivalences manually. If we have to do that, one could argue that it’s not really different from defining a mapping table to link two RDBMSs.
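
As a toy illustration of what defining such an equivalence looks like in practice, here is a minimal rdflib sketch. The URIs are entirely made up (hence example.org), and in a real setting the mapping would come from a curated source rather than being asserted by hand.

from rdflib import Graph, URIRef, Literal
from rdflib.namespace import OWL, RDFS

g = Graph()

# Two hypothetical identifiers for the same protein, from different sources
chembl_target = URIRef("http://example.org/chembl/target/CHEMBL230")
uniprot_entry = URIRef("http://example.org/uniprot/P35354")

g.add((chembl_target, OWL.sameAs, uniprot_entry))
g.add((uniprot_entry, RDFS.label, Literal("cyclooxygenase-2")))

# Without a reasoner the equivalence is just another triple - conceptually
# not much more than a row in a cross-reference mapping table
for s, p, o in g.triples((None, OWL.sameAs, None)):
    print(s, "owl:sameAs", o)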

But I suppose in the end what I’d like to see is all this RDF data being used to perform automated or semi-automated inferencing. In other words, what non-obvious relationships can be drawn from a collection of facts and relationships? In the absence of that, I am not necessarily pulling out a novel relationship (though I may be pulling out facts that I did not necessarily know) by constructing a SPARQL query. Is such inferencing even possible?
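
For what it’s worth, some of this machinery does exist. The sketch below uses rdflib together with the owlrl reasoner (all URIs made up) to show the basic flavour: a fact asserted about one identifier becomes available via its equivalent identifier once the OWL RL closure is computed. Whether this kind of rule-based expansion ever surfaces a genuinely non-obvious relationship over real data is exactly the open question.

from rdflib import Graph, Namespace
from rdflib.namespace import OWL
import owlrl

EX = Namespace("http://example.org/")

g = Graph()
# Source A asserts a bioactivity-style fact about one identifier
g.add((EX.aspirin, EX.inhibits, EX.COX2))
# Source B asserts an equivalence between that identifier and its own
g.add((EX.COX2, OWL.sameAs, EX.PTGS2))

# Materialize the OWL RL closure; sameAs propagation is one of its rules
owlrl.DeductiveClosure(owlrl.OWLRL_Semantics).expand(g)

# The triple (aspirin, inhibits, PTGS2) is now in the graph even though
# neither source stated it explicitly
print((EX.aspirin, EX.inhibits, EX.PTGS2) in g)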

Along those lines, I considered an interesting set of linked data – could we generate a geographically annotated version of PubMed? Essentially, identify a city and country for each PubMed ID. This could be converted to RDF and linked to other sources. One could then start asking questions such as: are people around me working on a certain topic? Or which proteins are the focus of research in region X? Clearly, such a dataset does not require RDF per se. But given that geolocation data is qualitatively different from, say, UniProt IDs and PubMed IDs, it’d be interesting to see whether anything came of this. As a first step, here’s BioPython code to retrieve the Affiliation field from PubMed entries from 2009 and 2010.

from Bio import Entrez

startYear = 2009
endYear = 2010

# NCBI asks for a contact email with E-utilities requests
Entrez.email = "some@email.id"

# Get all PubMed IDs with a publication date in the given range
h = Entrez.esearch(db='pubmed', term='%d:%d[dp]' % (startYear, endYear), retmax=1000000)
records = Entrez.read(h)['IdList']
print 'Got %d records' % (len(records))

o = open('geo.txt', 'w')
for pmid in records:
    print 'Processing PMID %s' % (pmid)
    # Pull the full record and extract the affiliation, if present
    hf = Entrez.efetch(db='pubmed', id=pmid, retmode='xml', rettype='full')
    details = Entrez.read(hf)[0]
    try:
        aff = details['MedlineCitation']['Article']['Affiliation']
    except KeyError:
        print '%s had no affiliation' % (pmid)
        continue
    try:
        o.write('%s\t%s\n' % (pmid, aff.encode('latin-1')))
    except UnicodeEncodeError:
        print 'Cant encode affiliation for %s' % (pmid)
        continue
o.close()

Using data from the National Geospatial-Intelligence Agency, it shouldn’t be too difficult to link PubMed IDs to geography.
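
As a rough sketch of what that linking step might look like, assume we have distilled the gazetteer into a tab-delimited file of place name and country code pairs (this two-column layout is an assumption for illustration – real gazetteer files typically carry many more fields), and then simply scan each affiliation string for known place names. Crude, but enough for a first pass.

# Naive first pass at geocoding the affiliation strings collected above.
# 'gazetteer.txt' is assumed to look like: <place name>\t<country code>
places = {}
for line in open('gazetteer.txt'):
    name, country = line.rstrip('\n').split('\t')
    places[name.lower()] = country

out = open('geo_mapped.txt', 'w')
for line in open('geo.txt'):
    pmid, aff = line.rstrip('\n').split('\t', 1)
    aff_lower = aff.lower()
    # Prefer the longest matching place name ('New Delhi' over 'Delhi')
    hits = [p for p in places if p in aff_lower]
    if hits:
        best = max(hits, key=len)
        out.write('%s\t%s\t%s\n' % (pmid, best, places[best]))
out.close()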

Written by Rajarshi Guha

February 10th, 2010 at 4:40 am