Monday, April 20, 2009

Citation Coloring with Google Scholar

Following on from my post on the Google Scholar API and PaperCube, I just wrote a little Ruby script that screen-scrapes the first "Cited by" link and its citation count from Google Scholar. I'm hoping to create an example of the kinds of things that would become easier if a Google Scholar API were published. Here's the script:

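#!/usr/bin/ruby -w
require 'open-uri'
require 'pp'

# Group 1 is the "Cited by" link, group 2 the citation count.
CITED_BY = /<br><a class=fl href=\"([^<]*)\">Cited by (\d*)<\/a>([^<]*)/

# Print usage and exit if no paper title was given.
abort 'usage: scholar.rb <paper-title>' if ARGV.empty?

# Replace every non-word character with '+' to build the query string.
query = ARGV[0].gsub(/\W/, '+')

# URI.open (Ruby 2.5+) is used because Kernel#open no longer accepts URLs in Ruby 3.0.
URI.open("http://scholar.google.com/scholar?num=1&hl=en&lr=&q=%22#{query}%22&btnG=Search") do |f|
  f.each do |line|
    link = line[CITED_BY, 1]
    unless link.nil?
      pp link
      pp line[CITED_BY, 2]
      break
    end
  end
end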
This works by searching for a paper title, assuming the first hit is the correct paper, and then grabbing the number of citations and the link to the page of citations using regular expression matching. Clearly this would be much cleaner if there were an API to hit.
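To see the matching step in isolation, here is a minimal sketch that runs the same regular expression over a hard-coded line instead of a live response. The sample HTML (its href and count) is made up for illustration, and `extract_cited_by` is a hypothetical helper name, not part of the script above.

```ruby
# Hypothetical sample line, modelled on the markup the regex expects;
# the href and the count are invented for this example.
SAMPLE_LINE = '<br><a class=fl href="/scholar?cites=12345">Cited by 42</a> - Related articles'

# Same pattern as in the script: group 1 is the link, group 2 the count.
CITED_BY = /<br><a class=fl href="([^<]*)">Cited by (\d*)<\/a>([^<]*)/

# Returns [link, count] for the first "Cited by" link in the line,
# or nil when the line contains no such link.
def extract_cited_by(line)
  match = CITED_BY.match(line)
  return nil if match.nil?
  [match[1], match[2].to_i]
end

p extract_cited_by(SAMPLE_LINE)
```

Keeping the pattern in one helper like this also makes it easy to notice when the page markup changes, since a `nil` result then means "no match" rather than "no citations".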

Now I just need to create a web interface or a PDF document JavaScript plugin (some example scripts here), so that I can achieve my goal of color-highlighting all the references in an academic document so that the most highly cited ones stand out. I think this would be really useful for quickly homing in on the key papers in a field. I am not sure whether a PDF document plugin can access the web, so I went ahead and created a little Rails app that accepts a list of papers and then tries to look them all up in Google Scholar using the above script.
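As a rough idea of what the coloring step could look like once the citation counts are available, here is a small sketch that maps a count to an HTML hex color. The function name `highlight_color`, the white-to-yellow range and the logarithmic scaling are all assumptions made for this example; none of them come from the Rails app.

```ruby
# Maps a citation count to a hex color between white (uncited) and
# saturated yellow (at or above max_count). Log scaling is an assumption,
# chosen because citation counts tend to be heavily skewed.
def highlight_color(count, max_count)
  return '#ffffff' if count <= 0 || max_count <= 0
  ratio = Math.log(1 + count) / Math.log(1 + max_count)
  ratio = 1.0 if ratio > 1.0
  # Less blue means a stronger yellow.
  blue = (255 * (1 - ratio)).round
  format('#ffff%02x', blue)
end

# Hypothetical counts, keyed by reference number.
counts = { 1 => 3, 2 => 250, 3 => 0 }
max = counts.values.max
counts.each do |ref, count|
  puts "reference #{ref}: #{highlight_color(count, max)}"
end
```

A web interface could then use the returned value as the background color of each reference in the list.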

It's not working wonderfully well yet, as it turns out to be pretty difficult to write a generic regular expression that will extract all the titles from a list of references. My two approaches so far are as follows:

/(\w+\s*,?\s*(&|and)?\s*((\s*(\w\.))+,?\s*)+)+(\(\d\d\d\d\)\.?)?\s*("|“)?([^,\.\)\("”“\d]{2,})\.("|”)?/

/("|“)?((\s?[^\.\)\("”“\s]{2,}\s?){3,})\.?("|”)?/

The former tries to use the fact that authors' surnames usually appear first, and the latter takes a bottom-up approach by trying to extract something that is title-like, i.e. space-separated words. Neither is working quite as well as I would like at the moment. I have to work on other stuff now, so I will come back to this in a few days, but any input on creating a more robust regex for title extraction would be most welcome. Here is the set of references I have been testing on:

1. Erickson, T. & Kellogg, W. A. “Social Translucence: An Approach to Designing Systems that Mesh with Social
Processes.” In Transactions on Computer-Human Interaction. Vol. 7, No. 1, pp 59-83. New York: ACM Press, 2000.
2. Erickson, T. & Kellogg, W. A. “Knowledge Communities: Online Environments for Supporting Knowledge
Management and its Social Context” Beyond Knowledge Management: Sharing Expertise. (eds. M. Ackerman, V.
Pipek, and V. Wulf). Cambridge, MA, MIT Press, in press, 2001.
3. Erickson, T., Smith, D.N. Erickson, T., Smith, D.N., Kellogg, W. A., Laff, M. R., Richards, J. T., and Bradner, E.
(1999). “Socially translucent systems: Social proxies, persistent conversation, and the design of Babble.” Human Factors
in Computing Systems: The Proceedings of CHI ‘99, ACM Press.
4. Goffman, E. Behavior in Public Places: Notes on the Social Organization of Gatherings. New York: The Free Press,
1963.
5. Heath, C. and Luff, P. Technology in Action. Cambridge: Cambridge University Press, 2000.
6. Smith, C. W. Auctions: The Social Construction of Value. New York: Free Press, 1989
7. Whyte, W. H., City: Return to the Center. New York: Doubleday, 1988
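
One possible starting point for the title-extraction problem is a two-pass heuristic rather than a single regex. This is a different technique from the two expressions above, and only a sketch: it first looks for a quoted title, and otherwise falls back to the first period-delimited chunk that has no comma and at least two words. The name `guess_title` and both rules are assumptions made for illustration.

```ruby
# Text inside straight or curly double quotes
# (\u201C and \u201D are the left and right curly quotes).
QUOTED_TITLE = /["\u201C]([^"\u201D]+)["\u201D]/

# Returns a best-guess title for one reference string, or nil.
def guess_title(reference)
  # Undo hard line wraps so a title split across lines is rejoined.
  text = reference.gsub(/\s+/, ' ').strip

  # Pass 1: a quoted title, minus any trailing period.
  quoted = text[QUOTED_TITLE, 1]
  return quoted.sub(/\.\z/, '').strip if quoted

  # Pass 2: first period-delimited chunk with no comma and 2+ words,
  # which skips "Surname, I." author fragments and the leading number.
  text.split(/\.\s+/).find do |chunk|
    !chunk.include?(',') && chunk.split.length >= 2
  end
end

puts guess_title('5. Heath, C. and Luff, P. Technology in Action. Cambridge: Cambridge University Press, 2000.')
# prints "Technology in Action"
```

On the test set above this handles the quoted entries and entries 4 to 6, but it returns nil for entry 7, where the comma after the author's initials ("W. H., City") glues the title onto the author block. So it is a fallback to experiment with, not a robust answer.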