Tuesday, April 28, 2009

Citation parsing regular expression breakthrough

So after staring at another paper in a new field where I wasn't sure which of the papers it cited should be the ones to read, I said to myself, I am not going to let these regular expressions get the better of me. By adjusting the granularity of my unit tests I cranked out a new regular expression that reliably extracts the titles of the papers from my test set of seven citations. Here's that expression so far:

SURNAME = '[A-Z][a-z]{1,}'
INITIALS = '((\s[A-Z](\.|,|\.,))(\s?[A-Z](\.|,|\.,))*)'
TITLE = '(([A-Za-z:,\r\n]{2,}\s?){3,})'
REGEX = /([^e][^d][^s][^\.]\s|\d+\.?\s|^)(#{SURNAME},?#{INITIALS})(\s?(,|and|&|,\s?and)?\s?(#{SURNAME},?#{INITIALS}))*\s*(\(?\d\d\d\d\)?\.?)?\s*("|“)?(#{TITLE})\.?("|”)?/

Now I am sure that this can be improved upon, but with a little web interface I have cooked up I can take the following:

1. Erickson, T. & Kellogg, W. A. “Social Translucence: An Approach to Designing Systems that Mesh with Social Processes.” In Transactions on Computer-Human Interaction. Vol. 7, No. 1, pp 59-83. New York: ACM Press, 2000.
2. Erickson, T. & Kellogg, W. A. “Knowledge Communities: Online Environments for Supporting Knowledge Management and its Social Context” Beyond Knowledge Management: Sharing Expertise. (eds. M. Ackerman, V. Pipek, and V. Wulf). Cambridge, MA, MIT Press, in press, 2001.
3. Erickson, T., Smith, D.N. Erickson, T., Smith, D.N., Kellogg, W. A., Laff, M. R., Richards, J. T., and Bradner, E. (1999). “Socially translucent systems: Social proxies, persistent conversation, and the design of Babble.” Human Factors in Computing Systems: The Proceedings of CHI ‘99, ACM Press.
4. Goffman, E. Behavior in Public Places: Notes on the Social Organization of Gatherings. New York: The Free Press, 1963.
5. Heath, C. and Luff, P. Technology in Action. Cambridge: Cambridge University Press, 2000.
6. Smith, C. W. Auctions: The Social Construction of Value. New York: Free Press, 1989
7. Whyte, W. H., City: Return to the Center. New York: Doubleday, 1988.

and turn it into this:

1. Erickson, T. & Kellogg, W. A. “Social Translucence: An Approach to Designing Systems that Mesh with Social Processes (Cited by 78).” In Transactions on Computer-Human Interaction. Vol. 7, No. 1, pp 59-83. New York: ACM Press, 2000.
2. Erickson, T. & Kellogg, W. A. “Knowledge Communities: Online Environments for Supporting Knowledge Management and its Social Context (Cited by 52)” Beyond Knowledge Management: Sharing Expertise. (eds. M. Ackerman, V. Pipek, and V. Wulf). Cambridge, MA, MIT Press, in press, 2001.
3. Erickson, T., Smith, D.N. Erickson, T., Smith, D.N., Kellogg, W. A., Laff, M. R., Richards, J. T., and Bradner, E. (1999). “Socially translucent systems: Social proxies, persistent conversation, and the design of Babble (Cited by 284).” Human Factors in Computing Systems: The Proceedings of CHI ‘99, ACM Press.
4. Goffman, E. Behavior in Public Places: Notes on the Social Organization of Gatherings (Cited by 822). New York: The Free Press, 1963.
5. Heath, C. and Luff, P. Technology in Action (Cited by 408). Cambridge: Cambridge University Press, 2000.
6. Smith, C. W. Auctions: The Social Construction of Value (Cited by 210). New York: Free Press, 1989
7. Whyte, W. H., City: Return to the Center (Cited by 14). New York: Doubleday, 1988.
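
For anyone who wants to play along, here is a minimal sketch of the annotation step in Ruby. It is not the code behind the web interface. The `?<title>` name on the title group, the `extract_title` and `annotate` helpers, and the `counts` hash (a stand-in for whatever actually looks up the citation numbers) are all assumptions made for illustration.

```ruby
SURNAME  = '[A-Z][a-z]{1,}'
INITIALS = '((\s[A-Z](\.|,|\.,))(\s?[A-Z](\.|,|\.,))*)'
TITLE    = '(([A-Za-z:,\r\n]{2,}\s?){3,})'

# Same pattern as above, written on one line, with the title group given a
# name (an assumption for readability) so there is no need to count parentheses.
CITATION = /([^e][^d][^s][^\.]\s|\d+\.?\s|^)(#{SURNAME},?#{INITIALS})(\s?(,|and|&|,\s?and)?\s?(#{SURNAME},?#{INITIALS}))*\s*(\(?\d\d\d\d\)?\.?)?\s*("|“)?(?<title>#{TITLE})\.?("|”)?/

# Returns the title found in a citation string, or nil if nothing matches.
def extract_title(citation)
  match = CITATION.match(citation)
  match && match[:title].rstrip
end

# Returns a copy of the citation with " (Cited by N)" placed right after the
# title. `counts` is a hypothetical Hash of title => citation count.
# The citation comes back unchanged when no title or no count is found.
def annotate(citation, counts)
  match = CITATION.match(citation)
  return citation unless match

  title = match[:title].rstrip
  count = counts[title]
  return citation unless count

  insert_at = match.begin(:title) + title.length
  citation.dup.insert(insert_at, " (Cited by #{count})")
end

citation = '5. Heath, C. and Luff, P. Technology in Action. ' \
           'Cambridge: Cambridge University Press, 2000.'
puts annotate(citation, 'Technology in Action' => 408)
```

Naming the title group means the other parentheses stop capturing in Ruby, which is fine here because only the title is needed.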

Having the citation counts right there in the reference list is pretty damn useful, I think. I'm getting about a 70% hit rate on other lists of references, and I'm sure that can be improved. There are also changes I might make to the color gradation. At the moment I'm just setting the red value from 0 to 255 based on the number of citations, so everything with more than 255 citations doesn't get any redder. I'd like to normalise the color so that the highest citation count in the references corresponds to full red and every other count falls somewhere in between. Ideally I'd also slide between red and white instead of red and black, and change the background color rather than the text, but that's all icing on the cake really.
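
The normalised red-to-white scale could be sketched roughly like this in Ruby. Nothing here exists in the web interface yet: `citation_color` and its parameters are hypothetical names, and the result is a CSS hex color intended for the background rather than the text.

```ruby
# Maps a citation count onto a white-to-red background color.
# The highest count in the reference list gets full red (#ff0000),
# a count of zero gets white (#ffffff), and the rest scale linearly.
def citation_color(count, max_count)
  return '#ffffff' if max_count <= 0

  # Clamp the ratio to the range 0.0..1.0 in case of odd input.
  ratio = [[count.to_f / max_count, 0.0].max, 1.0].min

  # Red stays at 255; green and blue fade out as the count rises.
  fade = (255 * (1.0 - ratio)).round
  format('#ff%02x%02x', fade, fade)
end

counts = [78, 52, 284, 822, 408, 210, 14] # counts from the list above
max = counts.max
counts.each { |n| puts "#{n}: #{citation_color(n, max)}" }
```

Because the scale is relative to the largest count in each list, a paper with 822 citations and one with 8,000 would both come out full red in their own lists, so the colors only rank papers within a single set of references.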

What I'd most like to see is this as a web service that everyone could use, and an ongoing group effort to improve the regex further and get as many title matches as possible. If you're interested, please add your vote to the Google Scholar feature request.