Wednesday, January 23, 2013

EZproxy wish list: caching

This is the second in a series of thoughts on how EZproxy could be made better.  For background, please read my original post.

For all of EZproxy's features, one glaring omission is an option to enable caching.

For smaller sites, this probably would not mean much.  Their patrons may be on-site at the library anyway, or geographically close to the library, so they are probably using the same ISP as the library.  In that case, how much benefit a caching solution would offer is debatable.

But for larger sites, and sites that support a geographically diverse user population, caching can be a significant performance enhancement.

Here's a very simple example of why:
A user sitting in California needs to access a resource in Denver from a proxy in Virginia.
In this scenario, the physical path that the packets are likely to take through wires in the ground closely matches the mental line that you might draw between these three locations.  The request has to cross nearly the whole continent twice to get from the desktop to the server, and then the reply has to do it again on the way from the server back to the desktop.
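To put a rough number on that path, here is a back-of-envelope sketch. The distances, the speed of light in fiber, and the count of sequential requests per page are all my own round assumptions, not measurements, and real routes add router and server delays on top of this.

```python
# Back-of-envelope propagation delay for the California -> Virginia -> Denver path.
# Every figure below is an illustrative assumption, not a measurement.
FIBER_KM_PER_MS = 200.0               # light in fiber covers roughly 200 km per millisecond
CALIFORNIA_TO_VIRGINIA_KM = 3700.0    # assumed straight-line distance
VIRGINIA_TO_DENVER_KM = 2400.0        # assumed straight-line distance
SEQUENTIAL_REQUESTS_PER_PAGE = 20     # assumed requests fetched one after another


def round_trip_ms(*legs_km: float) -> float:
    """Return the propagation delay for a request and its reply over the given legs."""
    one_way_ms = sum(legs_km) / FIBER_KM_PER_MS
    return 2 * one_way_ms


request_rtt_ms = round_trip_ms(CALIFORNIA_TO_VIRGINIA_KM, VIRGINIA_TO_DENVER_KM)
page_ms = request_rtt_ms * SEQUENTIAL_REQUESTS_PER_PAGE

print(f"one request:  {request_rtt_ms:.1f} ms")
print(f"one page:     {page_ms:.0f} ms")
```

Even with these optimistic numbers, a page built from a couple dozen sequential requests spends more than a second just in transit.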

So what's the big deal?

One big problem: Users are impatient.  Each trip across the continent is going to cost precious milliseconds and give the perception (warranted or not) of slowness.  Waiting for every single search, every search limiter, every single page render, every single article retrieval is going to make users irritated.  The more they wait, the more frustrated they become, especially if they are new to research and do not have much patience to start with.

Another problem: Someone pays for the bandwidth.  Somewhere in your IT department's budget is a line item or three for your internet access.  The more capacity (bandwidth) you have to access the internet, the more it costs.  Web caching has been used for almost two decades now to address this.  Businesses use web caches for performance, cost savings, and security; countries use them for performance and in some cases censorship; chances are the ISP you are using right now uses a transparent web cache to save on their own bandwidth costs.

If EZproxy were to support a web caching feature, 80-90% of the California user's web requests could be answered by the proxy server in Virginia, with no onward trip to Denver.  Vendors re-use many of the same CSS stylesheets, graphics, and JavaScript files across their web pages, so the only content that is unique is the search results and the article retrievals.
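As a rough illustration of what a hit rate in that range would mean, the sketch below averages the delay per request. The two round-trip times and the 85% hit ratio are assumed figures picked for the example, and the function name is my own.

```python
def average_delay_ms(hit_ratio: float, proxy_rtt_ms: float, vendor_rtt_ms: float) -> float:
    """Average per-request delay when cache hits stop at the proxy.

    Every request pays the user <-> proxy round trip; only misses also pay
    the proxy <-> vendor round trip.
    """
    miss_ratio = 1.0 - hit_ratio
    return proxy_rtt_ms + miss_ratio * vendor_rtt_ms


# Assumed round-trip times in milliseconds (illustrative only).
PROXY_RTT_MS = 37.0    # California <-> Virginia
VENDOR_RTT_MS = 24.0   # Virginia <-> Denver

no_cache_ms = average_delay_ms(0.0, PROXY_RTT_MS, VENDOR_RTT_MS)
with_cache_ms = average_delay_ms(0.85, PROXY_RTT_MS, VENDOR_RTT_MS)

print(f"no cache:   {no_cache_ms:.1f} ms per request")
print(f"85% hits:   {with_cache_ms:.1f} ms per request")
```

The same miss ratio also applies to bandwidth: only the misses consume capacity on the link between the proxy and the vendor.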

Depending on the vendor, even the articles might be cacheable, so if you were teaching a lab, only the first student to retrieve the article would have to wait for the full round trip, while the second and subsequent students may get the article from the cache.
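The lab scenario can be sketched with a toy in-memory cache. This is only an illustration of the idea, not how EZproxy works or would work; `TinyCache`, `fake_vendor`, and the example URL are all made up, and a real cache would also honor the vendor's cache headers and expiry times.

```python
from typing import Callable, Dict


class TinyCache:
    """A toy cache keyed by URL (hypothetical, for illustration only)."""

    def __init__(self, fetch_upstream: Callable[[str], bytes]) -> None:
        self._fetch_upstream = fetch_upstream
        self._store: Dict[str, bytes] = {}
        self.upstream_requests = 0

    def get(self, url: str, cacheable: bool = True) -> bytes:
        if cacheable and url in self._store:
            return self._store[url]          # hit: no trip to the vendor
        self.upstream_requests += 1          # miss: full round trip
        body = self._fetch_upstream(url)
        if cacheable:
            self._store[url] = body
        return body


def fake_vendor(url: str) -> bytes:
    """Stand-in for the vendor's server."""
    return f"contents of {url}".encode()


cache = TinyCache(fake_vendor)
article_url = "https://vendor.example/article/123"

# Twenty-five students in a lab all open the same article.
for _student in range(25):
    cache.get(article_url)

print(cache.upstream_requests)
```

Only the first student's request reaches the vendor; if the vendor marks the article as not cacheable, every request goes upstream instead.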

If you know that you have a caching server on your campus, or at your ISP, you could set up EZproxy to participate in a web cache hierarchy.  Depending on your network, you might be able to piggy-back on other users' activities and your cache would not have to work as hard.  You can generally set up cache hierarchies such that your cache will not store that content locally, since retrieving from that upstream cache is not considered as expensive.  This frees up your cache to focus on the content that is not shared with the upstream server.
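Here is a minimal sketch of that "don't store what the upstream cache already has" behavior, similar in spirit to the proxy-only option on Squid's cache_peer directive. The class names and the `via_parent` routing rule are assumptions of mine; nothing here reflects an actual EZproxy setting.

```python
from typing import Callable, Dict, List


class UpstreamCache:
    """Stand-in for a campus or ISP cache (hypothetical)."""

    def __init__(self, fetch_origin: Callable[[str], bytes]) -> None:
        self._fetch_origin = fetch_origin
        self._store: Dict[str, bytes] = {}

    def get(self, url: str) -> bytes:
        if url not in self._store:
            self._store[url] = self._fetch_origin(url)
        return self._store[url]


class LocalCache:
    """Local cache that does not keep copies of content served by its parent."""

    def __init__(self, parent: UpstreamCache, fetch_origin: Callable[[str], bytes],
                 via_parent: Callable[[str], bool]) -> None:
        self._parent = parent
        self._fetch_origin = fetch_origin
        self._via_parent = via_parent        # assumed rule: which URLs go to the parent
        self._store: Dict[str, bytes] = {}

    def get(self, url: str) -> bytes:
        if url in self._store:
            return self._store[url]
        if self._via_parent(url):
            return self._parent.get(url)     # cheap to re-fetch, so not stored locally
        body = self._fetch_origin(url)
        self._store[url] = body              # content the parent does not share
        return body

    def stored_urls(self) -> List[str]:
        return sorted(self._store)


def fake_origin(url: str) -> bytes:
    return f"contents of {url}".encode()


parent = UpstreamCache(fake_origin)
local = LocalCache(parent, fake_origin, via_parent=lambda u: "cdn.example" in u)

local.get("https://cdn.example/jquery.js")           # served through the parent
local.get("https://vendor.example/search?q=ezproxy") # fetched and stored locally
print(local.stored_urls())
```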

If you have multiple EZproxy servers in a clustered proxy setup, you could also set up a cache confederation, amplifying the benefits of having a cluster.  Instead of each cache keeping duplicate content, you could have each proxy query its peers before requesting content upstream.  This would require EZproxy to support the Internet Cache Protocol (ICP), which might make for interesting architecture possibilities as well.
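The peer lookup might look something like the sketch below. It models the "do you have this?" question as a plain method call inside one process; it does not implement the ICP wire format, and `ProxyNode` and its methods are hypothetical names.

```python
from typing import Callable, Dict, List


class ProxyNode:
    """One proxy in a hypothetical cluster, with its own cache and a list of peers."""

    def __init__(self, name: str, fetch_origin: Callable[[str], bytes]) -> None:
        self.name = name
        self.peers: List["ProxyNode"] = []
        self.origin_fetches = 0
        self._fetch_origin = fetch_origin
        self._store: Dict[str, bytes] = {}

    def has(self, url: str) -> bool:
        """Answer a peer's query (stands in for an ICP-style hit or miss reply)."""
        return url in self._store

    def serve_from_cache(self, url: str) -> bytes:
        return self._store[url]

    def get(self, url: str) -> bytes:
        if url in self._store:
            return self._store[url]
        for peer in self.peers:
            if peer.has(url):
                # Peer hit: use its copy and do not duplicate it locally.
                return peer.serve_from_cache(url)
        self.origin_fetches += 1
        body = self._fetch_origin(url)
        self._store[url] = body
        return body


def fake_vendor(url: str) -> bytes:
    return f"contents of {url}".encode()


node_a = ProxyNode("a", fake_vendor)
node_b = ProxyNode("b", fake_vendor)
node_a.peers.append(node_b)
node_b.peers.append(node_a)

url = "https://vendor.example/article/123"
node_a.get(url)   # miss everywhere, so node_a fetches from the vendor
node_b.get(url)   # node_b asks node_a first and gets a peer hit
print(node_a.origin_fetches, node_b.origin_fetches)
```

Not copying peer hits keeps the combined cache space free of duplicates, at the cost of one extra hop inside the cluster for each such request.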

Even without the advanced caching support, though, just a basic web cache implementation would be useful to many sites, both directly in performance improvements, and indirectly via bandwidth savings.
