Tuesday, January 22, 2013

The citation trap

We have all seen this in one form or another.  The web is such a dynamic place that web pages that existed when someone was writing their content no longer exist today.

Students graduate, and their university home pages are taken down.  Web sites are reorganized, and the content is moved or lost.  Companies are bought and sold.  Any number of things can cause content to move or simply disappear.

And then there are proxy servers.

In my last post on EZproxy clustering, I noted that the way EZproxy's clustering support currently works, it exposes individual cluster nodes to the end user, and that some vendors use the referring URL to construct a citation link:
http://www.example.com.ezproxy-1.library.example.edu/path/to/content
The issue is that someday the name ezproxy-1.library.example.edu is going to go away.  Maybe due to growth.  Perhaps due to the college being purchased by another school.  Or maybe a change of regime in IT brings in a new leader who wants to change how things are run.  Any number of things could trigger a change of proxy hostname, and I have not seen anyone discuss how to deal gracefully with this situation.

There is a second potential issue here as well beyond simple citation preservation: LMS content.

As your instructors become more sophisticated users of your school's LMS, there are going to be an increasing number of links to content through your proxy servers.  Those links should look like this (remember the cluster setup with the shared DNS name):
http://ezproxy.library.example.edu/login?url=http://www.example.com/path/to/content
This gives us two scenarios to handle: one where the proxy hostname is embedded in the hostname of the URL and the content path is part of the URL itself, and one where the link points at the proxy's login URL and the content is passed as an argument.
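
To make the two link shapes concrete, here is a minimal Python sketch that recognizes each one and recovers the vendor URL it points at.  The function name and the PROXY_HOSTS list are hypothetical, made up for this illustration.  It also assumes that url= is the only argument on the login link and that the vendor site is plain http.

```python
from urllib.parse import urlsplit

# Hypothetical list of proxy names: the shared DNS name plus the cluster nodes.
PROXY_HOSTS = (
    "ezproxy.library.example.edu",
    "ezproxy-1.library.example.edu",
    "ezproxy-2.library.example.edu",
)


def vendor_url(link):
    """Return the vendor URL a proxied link points at, or None if unrecognized."""
    parts = urlsplit(link)
    host = parts.hostname or ""

    # Scenario: login link, with the vendor URL passed as the url= argument.
    # Assumption: url= is the first and only argument on the link.
    if host in PROXY_HOSTS and parts.path == "/login":
        if parts.query.startswith("url="):
            return parts.query[len("url="):]
        return None

    # Scenario: the proxy hostname is appended to the vendor hostname,
    # and the content path is part of the URL itself.
    for proxy in PROXY_HOSTS:
        suffix = "." + proxy
        if host.endswith(suffix):
            vendor_host = host[: -len(suffix)]
            rest = parts.path or "/"
            if parts.query:
                rest += "?" + parts.query
            # Assumption: the vendor site is served over plain http.
            return "http://" + vendor_host + rest

    return None


print(vendor_url("http://ezproxy.library.example.edu/login?url=http://www.example.com/path/to/content"))
print(vendor_url("http://www.example.com.ezproxy-1.library.example.edu/path/to/content"))
```

Both calls print http://www.example.com/path/to/content, which is the whole point: whatever replaces the old proxy has to get back to that vendor URL from either starting shape.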

There are a few options here:

First, if all you are doing is changing the proxy hostname, you can set up an alias with a DNS CNAME record.  This will keep the old name alive, and the way EZproxy currently behaves, it will issue an HTTP redirect from the old name to the new name, and the user will be sent to the new server transparently:


$ telnet localhost 80
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
GET / HTTP/1.0
Host: ezproxy.library.example.edu

HTTP/1.1 302 Moved temporarily
Date: Fri, 21 Jan 2013 01:42:03 GMT
Server: EZproxy
Expires: Mon, 02 Aug 1999 00:00:00 GMT
Last-Modified: Fri, 21 Jan 2013 01:42:03 GMT
Cache-Control: no-store, no-cache, must-revalidate
Cache-Control: post-check=0, pre-check=0
Pragma: no-cache
Location: http://new-shiny-hostname.library.example.edu/
Connection: close

Here you can see that I requested the proxy's home page, and was sent back a redirect to the new-shiny-hostname proxy server instead.  If that's all you needed, you're done.


Sometimes that approach won't work, though.  Say you moved from a single proxy server to a group of proxy servers, each serving a different subset of your user community.  Now you have to have something send the users to the correct proxy server, and a simple alias cannot accomplish that.

I'm going to leave the "something that sends users to the correct proxy server" as an exercise for the reader, but for this second scenario, let's say some shiny new web portal handles this for you.  The trick is now going to be how to get users from the old hostname to the web portal so that they can be sent to the new proxy server.

One way to do this is with the Apache web server's virtual hosting and URL rewriting functionality.  The idea is that you will set up a virtual host that will answer to the old EZproxy hostname, as well as the proxied vendor hostname, and rewrite those requests into the new system.

This is best illustrated with an example.  Remember that we're talking about an EZproxy cluster with a shared DNS name and two nodes.  We will assume for the sake of example that the /proxy URL on the portal system takes the same "url=<vendor url>" argument that EZproxy does.

<VirtualHost *:80>
ServerName ezproxy.library.example.edu
ServerAlias ezproxy-1.library.example.edu
ServerAlias ezproxy-2.library.example.edu

RewriteEngine on
# Send users logging into the old proxy server into the new portal system.
# The substitution has no query string of its own, so the original
# "url=<vendor url>" argument is passed through unchanged.
RewriteRule ^/login http://portal.library.example.edu/proxy [R=303,L]
</VirtualHost>

Here, the configuration is set up to handle the EZproxy login URL, redirecting it to the portal system where it is dispatched to the correct proxy server in the new environment.

That was fairly straightforward, and if that is all that you need to worry about for your LMS scenario above, you can stop there.

But if you need to worry about poorly formed citations, this is where the fun part starts.  How do you deal with the overly-clever vendor citations?  Well, this is where your Apache skills need to be a few notches above novice to be successful.  Here you need to set up a virtual host to answer for the proxied vendor hostname, and have Apache do the Right Thing(tm) for that vendor's services.

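<VirtualHost *:80>
ServerName www.example.com.ezproxy.library.example.edu
ServerAlias www.example.com.ezproxy-1.library.example.edu
ServerAlias www.example.com.ezproxy-2.library.example.edu

# RewriteEngine is not inherited from other virtual hosts, so turn it on here too
RewriteEngine on
# Match the leading slash outside the capture so it is not doubled in the vendor URL.
# The substitution carries its own query string, so the request's query string is dropped.
RewriteRule ^/(.*) http://portal.library.example.edu/proxy?url=http://www.example.com/$1 [R=303,L]
</VirtualHost>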

In this simple example, we naively take the request URI and tack it on to the end of the portal entry URL.  Whether or not this works depends on the vendor and how they run their services.  Every vendor is going to be slightly different, and this is where your Apache skills come into play in populating the VirtualHost block correctly.
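
One place the naive approach tends to fall over is query strings.  A RewriteRule pattern in a virtual host only sees the URL path, and a substitution that carries its own query string replaces the one on the request, so a citation like /search?q=proxy&page=2 would lose everything after the question mark.  The Python sketch below is not Apache configuration, just an illustration of what the redirect target would need to look like to carry the whole vendor URL.  The function name is hypothetical, and it assumes the portal decodes a percent-encoded url= argument, which your portal may or may not do.

```python
from urllib.parse import parse_qs, quote, urlsplit

# Portal entry point from the example configuration.
PORTAL_ENTRY = "http://portal.library.example.edu/proxy"


def portal_redirect(vendor_host, request_uri):
    """Build the portal URL for a request that arrived on a proxied vendor hostname.

    request_uri is the path plus any query string, for example "/search?q=proxy".
    """
    target = "http://" + vendor_host + request_uri
    # Percent-encode the whole vendor URL so that its own "?" and "&"
    # are not read as arguments to the portal.
    # Assumption: the portal decodes the url= argument before using it.
    return PORTAL_ENTRY + "?url=" + quote(target, safe="")


redirect = portal_redirect("www.example.com", "/search?q=proxy&page=2")
print(redirect)

# Round trip: decoding the url= argument gives back the full vendor URL.
print(parse_qs(urlsplit(redirect).query)["url"][0])
```

If your portal expects the vendor URL unencoded, the way EZproxy's own login URL is usually written, then the vendor URL has to be the last thing on the line and the encoding step goes away.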

Once you start down this path, you'll realize that there are other possibilities, but that's a post for another day.
