Archive for the ‘text mining’ tag
I’ve been working for some time with the PubChem Bioassay collection – a set of 1293 assays that cover a range of techniques (enzymatic, phenotypic etc.), targets and sizes (from 20 molecules to 200,000 molecules). In addition, some assays are primary, high-throughput assays whereas a number of them are smaller, confirmatory assays. While an extremely valuable collection, one of the drawbacks is the lack of curation. This has led to some people saying that the data is too noisy to be useful. Yes, the noise is a problem, but I think there’s still useful data to extract and model.
One of the problems that I have faced is that while one can perform a full text search for assays on PubChem, there is no form of annotations on the assays themselves. One effect of this is that it is difficult to link an assay to other biological resources (though for enzymatic assays, one can determine a Pubmed protein identifier). While working on my bioassay network project, I needed annotations and I didn’t want to do it manually.
The other day I was reading a paper and as is my habit, while reading I flip to see what papers are being cited. Since this was an ACS journal, the references are listed in the order that they occur in the text. When the authors were discussing a point in the paper, they’d usually include a number of references. Given the ordering of the references, this implies that related references are grouped together in the bibliography.
This set me thinking – given a set of references and their citations within a paper, we can capture relationships between the references in various ways. Most obviously, one might analyze the cited papers (either in whole, or in part such as just the abstract or title) and draw conclusions.
However, the fact that the authors of the paper considered references X, Y and Z to be related to a specific point already provides us with some information. Thus in a bibliography where references are order based on first occurrence, can we use the “locality” of the references in the list to draw any conclusions? One could employ some form of a sliding window and look at groups of references. The key thing here would be to have a way to characterize a reference – so it’d probably require that you can access the title (or better, the abstract or full text) of the paper being cited. I will admit that I’m not sure what sort of conclusions one might draw from such an analysis – but it was interesting to observe “local behavior” in a list of references.
Not having followed work in bibliometrics, I’m sure someone has already thought of this and looked into it. If anybody has heard of stuff like this, I’d appreciate any pointers.
(Of course this is all moot, if we can’t easily access the paper itself)