Monday, April 20, 2009

Citation Coloring with Google Scholar

Following on from my post on the Google Scholar API and PaperCube, I just wrote a little Ruby script to screen-scrape the first "Cited by" link, and its citation count, from a Google Scholar results page. I'm hoping to create an example of the kinds of things that would become easier if a Google Scholar API were published. Here's the script:

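#!/usr/bin/ruby -w
require 'open-uri'
require 'pp'

# Capture 1 is the "Cited by" link, capture 2 is the citation count.
CITED_BY = /<br><a class=fl href="([^<]*)">Cited by (\d*)<\/a>/

# Print usage only when the title is missing, not on every error.
abort 'usage: scholar.rb <paper-title>' if ARGV.empty?

# Replace every non-word character with '+' to form the query string.
query = ARGV[0].gsub(/\W/, '+')

# URI.open is what newer Rubies need; older ones accepted plain open here.
URI.open("http://scholar.google.com/scholar?num=1&hl=en&lr=&q=%22#{query}%22&btnG=Search") do |f|
  f.each do |line|
    match = CITED_BY.match(line)
    next if match.nil?
    pp match[1] # link to the page of citing papers
    pp match[2] # number of citations
    break       # only the first hit is wanted
  end
end
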
This works by searching on a paper title, assuming the first hit is the correct paper, and then grabbing the number of citations and the link to the page of citations using regular expression matching. Clearly this would be much cleaner if there were an API to hit.

Now I just need to create a web interface or a PDF document JavaScript plugin (some example scripts here), so that I can achieve my goal of color-highlighting all the references in an academic document so that the most highly cited ones stand out. I think this would be really useful for quickly homing in on the key papers in a field. I am not sure if a PDF document plugin can access the web, so I went ahead and created a little Rails app that will accept a list of papers and then try to look them all up in Google Scholar using the above script.
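
As a rough sketch of the coloring step itself, the citation count could be mapped to a highlight color. Everything here is an assumption for illustration: the function name citation_color, the log scale, the yellow-to-red range, and the cap of 1000 citations are not taken from the Rails app.

```ruby
# Maps a citation count to a CSS color running from yellow (few citations)
# to red (many). The log scale and the default cap of 1000 are assumptions.
def citation_color(count, cap = 1000)
  return '#ffffff' if count.nil? || count <= 0 # unknown or uncited: no highlight
  # Position on a log scale, so that 10 versus 100 citations differ visibly.
  t = [Math.log(count + 1) / Math.log(cap + 1), 1.0].min
  green = (255 * (1.0 - t)).round
  format('#ff%02x00', green)
end

# Hypothetical usage: wrap one reference in a highlighted span.
reference = 'Heath, C. and Luff, P. Technology in Action.'
puts "<span style=\"background-color: #{citation_color(120)}\">#{reference}</span>"
```

A log scale seems a reasonable choice because citation counts are heavily skewed, but a linear or rank-based scale would slot in just as easily.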

It's not working wonderfully well yet, because it turns out to be pretty difficult to write a generic regular expression that will extract all the titles from a list of references. My two approaches so far are as follows:

/(\w+\s*,?\s*(&|and)?\s*((\s*(\w\.))+,?\s*)+)+(\(\d\d\d\d\)\.?)?\s*("|“)?([^,\.\)\("”“\d]{2,})\.("|”)?/

/("|“)?((\s?[^\.\)\("”“\s]{2,}\s?){3,})\.?("|”)?/

The former tries to use the fact that authors' surnames usually appear first, and the latter takes a bottom-up approach, trying to extract something that is title-like, i.e. space-separated words. Neither is working quite as well as I would like at the moment. I have to work on other stuff now, so I will come back to this in a few days, but any input on creating a more robust regex for title extraction would be most welcome. Here is the set of references I have been testing on:

1. Erickson, T. & Kellogg, W. A. “Social Translucence: An Approach to Designing Systems that Mesh with Social
Processes.” In Transactions on Computer-Human Interaction. Vol. 7, No. 1, pp 59-83. New York: ACM Press, 2000.
2. Erickson, T. & Kellogg, W. A. “Knowledge Communities: Online Environments for Supporting Knowledge
Management and its Social Context” Beyond Knowledge Management: Sharing Expertise. (eds. M. Ackerman, V.
Pipek, and V. Wulf). Cambridge, MA, MIT Press, in press, 2001.
3. Erickson, T., Smith, D.N. Erickson, T., Smith, D.N., Kellogg, W. A., Laff, M. R., Richards, J. T., and Bradner, E.
(1999). “Socially translucent systems: Social proxies, persistent conversation, and the design of Babble.” Human Factors
in Computing Systems: The Proceedings of CHI ‘99, ACM Press.
4. Goffman, E. Behavior in Public Places: Notes on the Social Organization of Gatherings. New York: The Free Press,
1963.
5. Heath, C. and Luff, P. Technology in Action. Cambridge: Cambridge University Press, 2000.
6. Smith, C. W. Auctions: The Social Construction of Value. New York: Free Press, 1989
7. Whyte, W. H., City: Return to the Center. New York: Doubleday, 1988
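
For anyone who wants to try the patterns against entries like these, here is a minimal test harness, offered as a sketch rather than the code in the Rails app. The helper names split_references and title_candidates are made up for illustration, and the splitter assumes every entry starts with a number, a period and a space, which will misfire on a wrapped line that happens to begin the same way.

```ruby
# The second (bottom-up) pattern from the post, unchanged.
TITLE_LIKE = /("|“)?((\s?[^\.\)\("”“\s]{2,}\s?){3,})\.?("|”)?/

# Splits a numbered reference list into one string per entry, joining
# lines that were wrapped. Assumes each entry starts with "N. ".
def split_references(text)
  entries = []
  text.each_line do |line|
    line = line.strip
    next if line.empty?
    if line =~ /\A\d+\.\s+(.*)/
      entries << $1
    elsif !entries.empty?
      entries.last << ' ' << line
    end
  end
  entries
end

# Returns every title-like candidate in an entry, in order of appearance.
def title_candidates(entry)
  entry.scan(TITLE_LIKE).map { |groups| groups[1].strip }
end

sample = <<~REFS
  4. Goffman, E. Behavior in Public Places: Notes on the Social Organization of Gatherings. New York: The Free Press,
  1963.
  5. Heath, C. and Luff, P. Technology in Action. Cambridge: Cambridge University Press, 2000.
REFS

split_references(sample).each do |entry|
  puts entry
  title_candidates(entry).each { |candidate| puts "  candidate: #{candidate}" }
end
```

Expect noise in the output: fragments of the author list and the publisher can satisfy the pattern as well as the real title, which is the weakness described above.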

4 comments:

Nate75Sanders said...

Hi Sam,

I haven't looked at your regex terribly closely, but I doubt even a monumental effort using only regex is going to be robust across many bib entries. I think you need to use extra information here. I'd maybe try to pick out several putative titles from an article and then come up with a scoring system for these. I think the best single metric for scoring is probably to make some attempt to automatically find a bibtex entry with your putative title in the title field.

-- Nate

badgettrg said...

I do not know what area of academia you are in, but are PMIDs or DOIs reliably present in your field? They are much easier to regex than titles. You have to regex HTML rather than PDFs.

Sam Joseph said...

Hi Nate,

I agree that getting great robustness is hard, but I'd be satisfied with an 80% hit rate. And I think a few non-regex heuristics could make the difference, e.g. exclude all sentences from reference section that have the terms 'Proceedings' or 'Press' in them.
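
A minimal sketch of that exclusion heuristic, assuming the candidate titles have already been pulled out as plain strings (the names VENUE_TERMS and drop_venue_candidates are made up for illustration):

```ruby
# Terms that suggest a venue or publisher rather than a paper title.
VENUE_TERMS = /\b(Proceedings|Press)\b/

# Returns only the candidates that do not mention a venue or publisher term.
def drop_venue_candidates(candidates)
  candidates.reject { |candidate| candidate =~ VENUE_TERMS }
end

candidates = ['Technology in Action',
              'Cambridge University Press',
              'The Proceedings of CHI']
puts drop_venue_candidates(candidates)
```

The obvious cost is that a genuine title containing either word would be thrown away too.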

This seems like a fairly standard problem, and I'd be interested to see what others had tried. I think ideally we'd have some sort of learning system, where you could correct failures and have the matching algorithm improve over time. But that's overkill for the moment. Right now I'd be happy with 8 of 10 or even 6 of 10 entries in a reference section being automatically colored with their Google Scholar citation count.

CHEERS> SAM

Sam Joseph said...

Hi badgettrg,

I think there are some DOIs, however I don't know if Google Scholar will support DOI lookup. Seems it does:

http://scholar.google.com/scholar?num=100&hl=en&lr=&q=10.1000%2F182&btnG=Search
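
A rough sketch of how a DOI could be picked out of reference text and turned into a lookup URL of that shape. The pattern is a loose assumption rather than a full DOI grammar, and the names DOI_PATTERN, find_dois and scholar_doi_url are made up for illustration:

```ruby
require 'cgi'

# Loose DOI pattern: "10.", a 4 to 9 digit registrant code, "/", then a suffix.
# An assumption for illustration; real DOIs can contain characters this misses.
DOI_PATTERN = /\b10\.\d{4,9}\/[^\s"<>]+/

# Returns all DOI-like strings in text, with trailing punctuation trimmed.
def find_dois(text)
  text.scan(DOI_PATTERN).map { |doi| doi.sub(/[.,;)]+\z/, '') }
end

# Builds a Google Scholar search URL for a DOI, URL-encoding the slash.
def scholar_doi_url(doi)
  "http://scholar.google.com/scholar?num=100&hl=en&lr=&q=#{CGI.escape(doi)}&btnG=Search"
end

find_dois('Available at doi:10.1000/182, retrieved 2009.').each do |doi|
  puts scholar_doi_url(doi)
end
```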

Can't see DOIs or other similar things in most of the papers in my field (i.e. computer science) ...

I agree some standardized thing like this would be much easier to match, but I don't see anything being available very soon.

Like I said in my other comment, I'd be happy with an 80% hit rate and leave it open for people to suggest improvements ...

CHEERS> SAM