Tuesday, December 8, 2009

Mozilla + Google + Squid + PubMed == pain

One of my users reported the following issue: he goes to Google, types "pubmed" and clicks the first search result, which is in fact PubMed (http://www.ncbi.nlm.nih.gov/pubmed/). He then gets this error:

ERROR: 404 Not Found
NCBI C++ Exception:
Info: CGI(CCgiRequestException::Unexpected or inconsistent HTTP request) "/export/home/miller/PORTAL/2.7/src/cgi/cgiapp.cpp", line 1056: --- Prefetch is not allowed for CGIs
Error: WEB(CCgiException::eInvalid) "/export/home/miller/PORTAL/2.7/src/internal/portal/web/papp.cpp",
line 82: --- OnExceptionURL is not set

The cause turns out to be the confluence of the following:

1. Firefox already implements a soon-to-be-standard HTML feature called
pre-fetching: a page can include hints about which links the user is
likely to follow next, so the browser can fetch those resources in
advance and shorten the next page's load time.

2. Google now provide pre-fetch hints for the top links in their search
results. View the source of a search for pubmed, and you'll see this:

<link rel=prefetch href="http://www.ncbi.nlm.nih.gov/pubmed/">

3. PubMed clearly don't like people pre-fetching their site, and have
taken some fairly heavy-handed tactics to combat it: their server checks
for the X-moz: prefetch request header and returns the error above, with
no Pragma or Cache-Control header to stop a proxy server from caching
that response. So the prefetch poisons the Squid cache, and when you
then click the link you get the cached error page. This is why
shift-reload works: it forces the proxy to fetch the page again, and
since the real click carries no prefetch header, this time it succeeds.
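The interaction of the three pieces can be sketched as a toy model (all names here are illustrative, not Squid's or PubMed's actual code):

```python
# Toy model of the failure mode: a caching proxy in front of an origin
# that rejects prefetch requests but forgets to mark the error uncacheable.

def origin(headers):
    """PubMed-like origin: refuse prefetches, with no caching headers."""
    if headers.get("X-moz") == "prefetch":
        return {"status": 404, "body": "Prefetch is not allowed for CGIs",
                "cacheable": True}   # the bug: the error is left cacheable
    return {"status": 200, "body": "PubMed home page", "cacheable": True}

class Proxy:
    """Minimal caching proxy, standing in for Squid."""
    def __init__(self):
        self.cache = {}

    def get(self, url, headers=None, force_reload=False):
        headers = headers or {}
        if force_reload:                  # shift-reload bypasses the cache
            self.cache.pop(url, None)
        if url in self.cache:
            return self.cache[url]
        response = origin(headers)
        if response["cacheable"]:
            self.cache[url] = response
        return response

proxy = Proxy()
url = "http://www.ncbi.nlm.nih.gov/pubmed/"

# 1. Firefox prefetches the top Google hit, poisoning the cache...
prefetch = proxy.get(url, headers={"X-moz": "prefetch"})
# 2. ...so the user's real click gets the cached error.
click = proxy.get(url)
# 3. Shift-reload forces a fresh fetch, which succeeds.
retry = proxy.get(url, force_reload=True)
print(prefetch["status"], click["status"], retry["status"])  # 404 404 200
```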

There are a few ways to avoid this; in my preferred order:

1. PubMed find a better way to refuse prefetches on their CGIs (e.g.
either explicitly set caching headers (Pragma/Cache-Control) so proxies
won't cache the error, or return an HTTP 503, which proxies don't cache
by default)

2. our users get to PubMed via a bookmark instead of the search result

3. you can disable the Firefox pre-fetch mechanism, but that's per-user,
per-computer - it adds a lot of overhead for IT which, frankly, I could
live without
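For option 1, something like the following response (a sketch of what I'd expect to work, not anything PubMed has published) would stop a shared proxy from caching the rejection:

```
HTTP/1.1 503 Service Unavailable
Cache-Control: no-cache, no-store
Pragma: no-cache
Content-Type: text/plain

Prefetch is not allowed for CGIs
```

For what it's worth, option 3 corresponds to flipping network.prefetch-next to false in each user's about:config - which is exactly the per-user, per-computer overhead I'd rather avoid.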

1 comment:

  1. It's belatedly occurred to me that we could use our proxy to rewrite these headers in the outgoing request.
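In Squid that should just be a header-filtering rule - a sketch, assuming Squid 2.x syntax (Squid 3.1 renames the directive to request_header_access):

```
# squid.conf: strip Firefox's prefetch marker from all outgoing requests,
# so the origin never sees X-Moz: prefetch and never serves the error
header_access X-Moz deny all
```

The downside is that the origin can no longer distinguish prefetches from real clicks at all, so it defeats whatever PubMed's check was trying to achieve - but it keeps the cache clean for our users.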