Monday, January 21, 2013

EZproxy wish list: better clustering support

If you are outside the library field, you have probably never heard of EZproxy. In short, it's a very simple web proxy server targeted at the library market. It was originally developed by Chris Zagar and later purchased by OCLC.

It achieves several goals very well:

  • It is easy to install (single static binary download)
  • It is easy to configure (a config.txt file and a users.txt file)
  • It is easy to manage (built-in administrative tools)
EZproxy is a piece of software that any electronic services librarian worth their salt should be able to download, install, configure, and run with minimal hand holding.  For that I give Chris a hearty pat on the back, because that is something that cannot be said of many pieces of software.

For larger sites, though, we have different needs that do not necessarily mesh well with the "ease of setup and use" that smaller sites need.  This is the first in a series of posts about ways that EZproxy could be made better for larger sites.

Today I want to spend some time on EZproxy clustering. EZproxy has some very basic support for clustering, but it has quite a few caveats. The way EZproxy clustering works is that you set up a peer relationship between the proxy servers, give them a shared hostname, and point the DNS entry for that shared hostname at each proxy's IP address. When you access the shared hostname, the web browser receives an HTTP redirect to one of the members of the cluster, and you proceed normally from there.
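
To make the redirect step concrete, here is a minimal Python sketch. It is not EZproxy's actual code: the function name `redirect_location`, the random choice of node, and the assumption that the redirect keeps the same path and query string are all hypothetical and used only for illustration.

```python
import random
from urllib.parse import urlsplit, urlunsplit

# Hypothetical cluster members behind the shared hostname.
NODES = ["ezproxy-1.library.example.edu", "ezproxy-2.library.example.edu"]


def redirect_location(request_url, nodes=NODES, choose=random.choice):
    """Return the Location a shared-hostname request might be redirected to.

    Assumption: the node is picked by `choose` and the rest of the URL
    is carried over unchanged. How EZproxy really selects a node is not
    described in this post.
    """
    parts = urlsplit(request_url)
    node = choose(nodes)
    return urlunsplit((parts.scheme, node, parts.path, parts.query, parts.fragment))
```

Whichever node is chosen, the patron's browser ends up talking to that node's own hostname from then on, which is what matters for the rest of this post.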

Again, Chris gets a pat on the back, because this is not hard to set up, and it does work. You can take a proxy out of the cluster, work on it, put it back in, and have minimal disruption for your patrons. But it's not zero disruption. If a patron is on proxy A, and you take proxy A offline, the patron has to log in again on proxy B to continue their research. For zero disruption, the proxy servers would need to share session data, which they do not today.

This also has an unintended side effect that is not directly EZproxy's fault. Some vendors use the hostname through which they were accessed to build citation links. You might have set up ezproxy.library.example.edu as the shared DNS entry, with ezproxy-1.library.example.edu and ezproxy-2.library.example.edu as the cluster members.

In this scenario, you would create links to your proxy server like so: 
http://ezproxy.library.example.edu/login?url=http://vendor.example.com/
Since you have a cluster setup, the HTTP redirect might send you to ezproxy-1.library.example.edu to use, so when you finally access the vendor's web site, the URL will look like this:
http://vendor.example.com.ezproxy-1.library.example.edu/
Where this becomes a problem is when certain vendors use that URL to create citation links.  What you want is a citation that looks like this:
http://ezproxy.library.example.edu/login?url=http://vendor.example.com/path/to/article
What you get, though, is this:
http://vendor.example.com.ezproxy-1.library.example.edu/path/to/article
This happens because not all vendors let you define a proxy prefix that specifies how your proxy server should be addressed.
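
The difference between the two URL forms is mechanical, so it can be sketched in a few lines of Python. This is only an illustration of the mapping, not a recommended fix: the function name `to_shared_login_url` and the constants are hypothetical, and it assumes the vendor site uses the same scheme as the proxied URL.

```python
from urllib.parse import urlsplit, urlunsplit

# Hostnames taken from the example above.
SHARED_HOST = "ezproxy.library.example.edu"
NODE_HOSTS = ("ezproxy-1.library.example.edu", "ezproxy-2.library.example.edu")


def to_shared_login_url(proxied_url):
    """Rewrite a node-specific proxied URL into the shared login?url= form.

    URLs that do not end in a known node hostname are returned unchanged.
    """
    parts = urlsplit(proxied_url)
    host = parts.hostname or ""
    for node in NODE_HOSTS:
        suffix = "." + node
        if host.endswith(suffix):
            vendor_host = host[: -len(suffix)]
            # Assumption: the vendor is reachable over the same scheme.
            target = urlunsplit(
                (parts.scheme, vendor_host, parts.path, parts.query, parts.fragment)
            )
            # The target is left unencoded to match the link style used here.
            return f"http://{SHARED_HOST}/login?url={target}"
    return proxied_url
```

A rewrite like this only helps for links you control. Citations that have already been exported or pasted elsewhere still carry the node name.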

Why is this a problem?  Quite simply put, things change over time.

I am a firm believer that webmasters have a duty to make a best-effort attempt to keep URLs working over time. Not everything can be kept functional, but if you had a link that worked last year, and reorganized your site last month, users should still be able to get as close as possible to the intended content when they follow older links.

In this context, what happens when example.edu opens multiple campuses, grows from a college to a university, and opens multiple libraries? Are you still going to want to be tied to that same namespace that made sense when you started? How are you going to handle all those citations that reference a proxy name that may no longer even physically exist? Are you willing to drop those citations on the floor? That doesn't seem to fit the academic spirit. What about all those links in your school's LMS? Who is going to update all of those now-broken links?

You can see from this example where the EZproxy clustering scheme has weaknesses, both in regular maintenance scenarios and when combined with vendors who are a little too clever for their own good. There is a way out of the vendor citation trap, btw, which I will discuss in another post, but I don't want to rabbit trail from the clustering topic for now.

How could EZproxy be improved to work better in this case?  There are two changes that would make this setup a much stronger solution:

If you look at other proxy systems, when you configure a proxy cluster, there is a communication channel between the proxy servers that allows them to share session state with each other. EZproxy could be extended to share login information with its peer servers so that they all share a common view of the logged-in sessions. That way when a single node goes down, the user would be able to fall back to a different proxy server, and resume their existing session without having to log in again, and would likely never know there was a failure.
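
As a rough sketch of what a common view of sessions buys you, the Python below models two nodes that consult one shared store. Everything here is hypothetical: the class names, the in-memory dictionary standing in for a real replication channel, and the session ID are illustrative and say nothing about how EZproxy stores sessions.

```python
class SharedSessionStore:
    """Stand-in for a session store replicated between cluster peers."""

    def __init__(self):
        self._sessions = {}

    def put(self, session_id, username):
        self._sessions[session_id] = username

    def get(self, session_id):
        return self._sessions.get(session_id)


class ProxyNode:
    """A cluster member that validates sessions against the shared store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store
        self.online = True

    def login(self, session_id, username):
        self.store.put(session_id, username)

    def handle(self, session_id):
        if not self.online:
            raise ConnectionError(f"{self.name} is offline")
        user = self.store.get(session_id)
        if user is None:
            return f"{self.name}: login required"
        return f"{self.name}: session ok for {user}"


store = SharedSessionStore()
node_a = ProxyNode("ezproxy-1", store)
node_b = ProxyNode("ezproxy-2", store)

node_a.login("s1", "patron")  # patron authenticates on node A
node_a.online = False         # node A is taken down for maintenance
result = node_b.handle("s1")  # node B recognizes the existing session
```

Because node B can look up the session that node A created, the patron keeps working without a second login.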

But for shared sessions to reach their full potential, addressing individual proxy servers in a cluster has to stop. When you set up a cluster relationship, the proxy servers should always use the cluster hostname, rather than the individual node name, for communicating with users. The user should never know how many nodes are behind ezproxy.library.example.edu. The only thing that they should ever see in their browser's location bar is that shared DNS name. For administrative purposes, you will need to be able to access the administrative interface individually, but for services you should always use (and see!) the shared name.

Thus, with these two changes, EZproxy's native clustering solution would be a much stronger feature.

2 comments:

  1. I know that this is an old post...but do you know of any advances on this? From what I can tell, session state is just stored in the "cookies" directory. If that directory can be replicated between cluster hosts...will that keep them in sync?

    I am digging into this soon, but I think that some interesting use of DRBD and haproxy in front of the boxes MIGHT make this work.

    1. I have not seen anything in recent EZproxy releases that has updated the session handling.

      I think you're going in the right direction with a shared cookies directory. Not knowing the algorithm that EZproxy uses to choose the session name, I would encourage you to do a lot of testing to see if it is a time-based algorithm that could potentially duplicate sessions across the cluster if you had a high frequency of logins in a short period of time.

      The other risk that I see in a shared cookies directory is session cleanup. I have not studied how EZproxy performs maintenance on that directory for stale sessions. It could be a simple internal timer that reaps expired sessions, and that means that you could get into a small race in the cluster for deleting the old sessions. This situation could result in a silent failure (this is my expectation), or generate a log warning, or could be viewed as a more serious condition by EZproxy. This should be easy to test, though, by generating sessions and removing the session files outside of EZproxy to try to trigger the error condition and see how it responds.
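
The cleanup race described above can be modeled outside of EZproxy with a short Python sketch. The `reap` helper and the file name are hypothetical, and this says nothing about how EZproxy itself reacts; it only shows the benign outcome, where the node that loses the race treats a missing file as already cleaned up.

```python
import os
import tempfile


def reap(path):
    """Delete a stale session file; return True if this caller removed it."""
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        # A peer got there first; treat it as already cleaned up.
        return False


with tempfile.TemporaryDirectory() as cookies_dir:
    stale = os.path.join(cookies_dir, "session-abc")  # hypothetical file name
    open(stale, "w").close()
    first = reap(stale)   # the first node removes the file
    second = reap(stale)  # the second node finds it already gone
```

Here `first` is `True` and `second` is `False`. Whether EZproxy behaves this quietly when a session file vanishes underneath it is exactly what the manual test would need to confirm.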

      Please let me know the results of your work; I think it has a lot of potential.
