Tuesday, April 28, 2009

Got myself blocked from Google as a robot

So in the process of developing my scholar system, which uses Google Scholar to get citation counts and automatically insert them into (and color) lists of scientific references, I managed to trip the Google robot blocker. Of course I'd love to be using their API, but Google has not yet opened up Google Scholar as part of their API. I've been blogging about this for a few days - I think it would be a hugely popular move. The reason I got myself blocked yesterday was that I had finally worked out some regex to semi-reliably strip the titles from random scientific references. As I was debugging the colorization process I was hitting Google Scholar a bit too frequently, and then the queries started failing. Got the above image in my browser, which very kindly allowed me to re-access Google via the web once I'd typed in the captcha.

Of course my Ruby script was still blocked, so I gave up work for the evening, but it was working again this morning - thanks Google - and so I immediately implemented a caching mechanism so that I don't hit Google Scholar each time I go through a debug cycle. Of course the blocking means that releasing what I've created as a service would be problematic, as it wouldn't scale. This is another reason why it would be beneficial for Google to release an API for Google Scholar and allow systems to access it through authenticated keys that distinguish them from robots. Then the services that I and others have built could be made available to lots of other academics. Stay tuned for my next blog post, in which I'll post some images from my new service ... (not really such a cliff-hanger - gotta have lunch first :-)
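For anyone curious what a cache like that might look like, here is a minimal sketch in Ruby. It is not the code from my script: the `QueryCache` class, its `fetch` method and the one-JSON-file-per-query layout are all assumptions made up for illustration. The idea is simply that each query string is looked up on disk first, and the block that actually does the lookup only runs on a miss.

```ruby
require "digest/sha1"
require "fileutils"
require "json"

# Hypothetical disk cache (illustrative only): maps a query string to a
# previously fetched value, stored as one small JSON file per query.
class QueryCache
  def initialize(dir)
    @dir = dir
    FileUtils.mkdir_p(@dir)
  end

  # Returns the cached value for `query` if one exists on disk.
  # Otherwise runs the given block once, stores its result, and returns it.
  def fetch(query)
    path = File.join(@dir, Digest::SHA1.hexdigest(query) + ".json")
    return JSON.parse(File.read(path))["value"] if File.exist?(path)

    value = yield(query)
    File.write(path, JSON.generate("query" => query, "value" => value))
    value
  end
end
```

In use, the block passed to `fetch` would wrap whatever code performs the real lookup, so repeated debug runs over the same reference list cost no extra queries. Values have to survive a round trip through JSON, which is fine for plain citation counts.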

5 comments:

Robert Brewer said...

Procrastination continues, eh? :)

Sam Joseph said...

yes, but at least now I'm making "progress" with my procrastination :-) See my next post ...

Brian said...

Interesting use of Google Scholar. Do you mind sharing what hit rate it took to get banned? I'd like to do some of this, too, but need to be able to figure out how long it would take given the rate of queries.

Sam Joseph said...

Hi Brian,

I'm not sure of the exact rate. I think I must have rerun my test code about 15 times in a row, and each time I was sending 7 or 14 Google Scholar queries.

Had I not reacquired access, I would have started analyzing the HTTP signature I was sending and inserting short breaks between hits, as well as caching, as I assume that is part of what they are measuring.
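The "short breaks between hits" idea could be sketched roughly as follows. This is an untested-in-anger illustration, not something I ran against Google: the `Throttle` class and the interval value are made up, and I have no idea what gap (if any) would actually keep a script from being blocked.

```ruby
# Hypothetical throttle (illustrative only): enforces a minimum gap, in
# seconds, between the end of one call and the start of the next.
class Throttle
  def initialize(min_interval)
    @min_interval = min_interval
    @last_finished = nil
  end

  # Sleeps if the previous call finished too recently, then runs the block
  # and returns its result.
  def call
    if @last_finished
      wait = @min_interval - (now - @last_finished)
      sleep(wait) if wait > 0
    end
    result = yield
    @last_finished = now
    result
  end

  private

  # Monotonic clock, so the gap is unaffected by system clock changes.
  def now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end
```

Each query would then go through something like `throttle.call { ... }`, with the block doing the actual request. Combined with caching, only the queries that miss the cache would ever need to wait.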

CHEERS> SAM

Brahim Hamadicharef said...

Hi, I encounter the same problem. Even with some long pauses it blocks after about 60 queries. I also tried clearing the cache and deleting cookies, with no real effect. How do tools such as PoP get access to such a large number of queries?!