Tuesday, April 28, 2009

Citation parsing regular expression breakthrough

So after staring at another paper in a new field where I wasn't sure which of the papers it cited should be the ones to read, I said to myself, I am not going to let these regular expressions get the better of me. By adjusting the granularity of my unit tests I cranked out a new regular expression that reliably extracts the titles of the papers from my test set of seven citations. Here's that expression so far:

SURNAME = '[A-Z][a-z]{1,}'
INITIALS = '((\s[A-Z](\.|,|\.,))(\s?[A-Z](\.|,|\.,))*)'
TITLE = '(([A-Za-z:,\r\n]{2,}\s?){3,})'
REGEX = /([^e][^d][^s][^\.]\s|\d+\.?\s|^)(#{SURNAME},?#{INITIALS})(\s?(,|and|&|,\s?and)?\s?(#{SURNAME},?#{INITIALS}))*\s*(\(?\d\d\d\d\)?\.?)?\s*("|“)?(#{TITLE})\.?("|”)?/
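
For anyone who wants to try the expression, here is a minimal sketch of how the title might be pulled out of a single citation. It assumes the expression is meant to sit on one line (no newline inside the literal) and that, counting parentheses from the left, the title lands in capture group 18. The `TITLE_GROUP` constant and the `extract_title` helper are names made up for illustration, not part of the original code.

```ruby
# Building blocks copied from the post.
SURNAME  = '[A-Z][a-z]{1,}'
INITIALS = '((\s[A-Z](\.|,|\.,))(\s?[A-Z](\.|,|\.,))*)'
TITLE    = '(([A-Za-z:,\r\n]{2,}\s?){3,})'

# Assumption: the expression is a single line, with no line break inside it.
REGEX = /([^e][^d][^s][^\.]\s|\d+\.?\s|^)(#{SURNAME},?#{INITIALS})(\s?(,|and|&|,\s?and)?\s?(#{SURNAME},?#{INITIALS}))*\s*(\(?\d\d\d\d\)?\.?)?\s*("|“)?(#{TITLE})\.?("|”)?/

# Assumption: counting opening parentheses from the left (including the ones
# inside INITIALS, which is interpolated twice), the title is group 18.
TITLE_GROUP = 18

# Hypothetical helper: returns the title of one citation, or nil when the
# expression does not match at all.
def extract_title(citation)
  match = REGEX.match(citation)
  match && match[TITLE_GROUP].strip
end

puts extract_title('5. Heath, C. and Luff, P. Technology in Action. ' \
                   'Cambridge: Cambridge University Press, 2000.')
```

Indexing by number is fragile: adding or removing a single pair of parentheses anywhere to the left of the title shifts the index, so named groups would be a sensible next step if the expression keeps growing.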

Now I am sure that this can be improved upon, but with a little web interface I have cooked up I can take the following:

1. Erickson, T. & Kellogg, W. A. “Social Translucence: An Approach to Designing Systems that Mesh with Social Processes.” In Transactions on Computer-Human Interaction. Vol. 7, No. 1, pp 59-83. New York: ACM Press, 2000.
2. Erickson, T. & Kellogg, W. A. “Knowledge Communities: Online Environments for Supporting Knowledge Management and its Social Context” Beyond Knowledge Management: Sharing Expertise. (eds. M. Ackerman, V. Pipek, and V. Wulf). Cambridge, MA, MIT Press, in press, 2001.
3. Erickson, T., Smith, D.N. Erickson, T., Smith, D.N., Kellogg, W. A., Laff, M. R., Richards, J. T., and Bradner, E. (1999). “Socially translucent systems: Social proxies, persistent conversation, and the design of Babble.” Human Factors in Computing Systems: The Proceedings of CHI ‘99, ACM Press.
4. Goffman, E. Behavior in Public Places: Notes on the Social Organization of Gatherings. New York: The Free Press, 1963.
5. Heath, C. and Luff, P. Technology in Action. Cambridge: Cambridge University Press, 2000.
6. Smith, C. W. Auctions: The Social Construction of Value. New York: Free Press, 1989
7. Whyte, W. H., City: Return to the Center. New York: Doubleday, 1988.
and turn it into this:
1. Erickson, T. & Kellogg, W. A. “Social Translucence: An Approach to Designing Systems that Mesh with Social Processes (Cited by 78).” In Transactions on Computer-Human Interaction. Vol. 7, No. 1, pp 59-83. New York: ACM Press, 2000.
2. Erickson, T. & Kellogg, W. A. “Knowledge Communities: Online Environments for Supporting Knowledge Management and its Social Context (Cited by 52)” Beyond Knowledge Management: Sharing Expertise. (eds. M. Ackerman, V. Pipek, and V. Wulf). Cambridge, MA, MIT Press, in press, 2001.
3. Erickson, T., Smith, D.N. Erickson, T., Smith, D.N., Kellogg, W. A., Laff, M. R., Richards, J. T., and Bradner, E. (1999). “Socially translucent systems: Social proxies, persistent conversation, and the design of Babble (Cited by 284).” Human Factors in Computing Systems: The Proceedings of CHI ‘99, ACM Press.
4. Goffman, E. Behavior in Public Places: Notes on the Social Organization of Gatherings (Cited by 822). New York: The Free Press, 1963.
5. Heath, C. and Luff, P. Technology in Action (Cited by 408). Cambridge: Cambridge University Press, 2000.
6. Smith, C. W. Auctions: The Social Construction of Value (Cited by 210). New York: Free Press, 1989
7. Whyte, W. H., City: Return to the Center (Cited by 14). New York: Doubleday, 1988.

Which I think is pretty damn useful. I'm getting about a 70% hit rate on other lists of references, and I'm sure that can be improved. There are also changes I might make to the color gradation. At the moment I'm just setting the red value from 0 to 255 based on the number of citations, so anything with more than 255 citations doesn't get any redder. I'd like to normalise the color so that the highest citation count in the references corresponds to full red, with the other counts graded in between. Ideally I'd also slide between red and white instead of red and black, and change the background color rather than the text, but that's all icing on the cake really.
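
The two steps described here, splicing the count in after the title and normalising the colour, could be sketched roughly as follows. This is not the code behind the web interface: `annotate` and `background_for` are hypothetical helpers, the title is assumed to have been extracted already, and the counts are simply the ones shown in the annotated list above rather than anything fetched from a service.

```ruby
# Hypothetical helper: put "(Cited by N)" straight after the title.
# A String pattern is matched literally by String#sub, and the block form
# keeps any backslashes in the title from being treated as back-references.
def annotate(citation, title, count)
  citation.sub(title) { "#{title} (Cited by #{count})" }
end

# Hypothetical helper: background colour that slides from white (no
# citations) to pure red (the highest count in this reference list).
def background_for(count, max_count)
  return '#ffffff' if max_count.zero?

  clamped = [count, max_count].min
  # Red stays at ff; green and blue fade from ff down to 00 together.
  fade = 255 - (255.0 * clamped / max_count).round
  format('#ff%02x%02x', fade, fade)
end

# Counts taken from the annotated list in this post.
counts = [78, 52, 284, 822, 408, 210, 14]
max_count = counts.max

citation = '5. Heath, C. and Luff, P. Technology in Action. ' \
           'Cambridge: Cambridge University Press, 2000.'
puts annotate(citation, 'Technology in Action', 408)
puts background_for(408, max_count)
```

Normalising against the largest count in the list means a paper with 822 citations and one with 14 still get visibly different shades, which the fixed 0 to 255 scale cannot do once counts go past 255.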

What I'd most like to see is this as a web service that everyone could use, and an ongoing group effort to improve the regex further and get as many title matches as possible. If interested please add your vote to the Google Scholar feature request.

6 comments:

Robert Brewer said...

FYI, the way Google folks track how many people are interested in an issue is by how many people have "starred" it. Annoyingly, you'll get an email message every time someone leaves another "me too" comment on the issue, but it pays to be counted.

Ray H said...

Hi Sam,

I am doing a software engineering project for school, and I was wondering if I could use your regular expression as a base for developing it.

Thanks
Raymond

isch said...

Ray,

I'm doing a similar project in LIS. If you're interested in exchanging ideas and experiences, please contact me (email, see profile).

kb

Sam Joseph said...

Hi Ray, please do use my regular expressions, and do post back any improvements you manage to make. I have put this project on a back burner since there doesn't seem much chance of Google releasing an official API. There is a Thomson ISI API for similar data that you can use if your school or organization is subscribed. Of course you might just be interested in the regular expressions. For me they were just part of a bigger project to grab citation data from web services and annotate bibliographies.

Would love to hear more about your project.

CHEERS> SAM

myq said...

Hi, just wondering if this project has progressed any further. I've been waiting for a Google Scholar API for years and have seen a few attempts to mine the output and construct reference trees and such. Your script might help push those ideas forward. It looks great.

Sam Joseph said...

Hi Myq,

No further progression I'm afraid. It's been suggested to me that Google pays a lot for their access to the citation services and is not in a position to provide an API; but who knows, maybe one day.

If you are in an academic setting there is a citation API available through somebody like ISI or Thomson or someone that can be used purely within the academic institution, but I didn't look into it any further ...