Thursday, February 21, 2013

EZproxy + Squid: Bolting on a caching layer

In an earlier wish list post for native caching support in EZproxy, I stated that users could easily save 10-20% of their requests to vendor databases if EZproxy natively supported web caching.

I was wrong.

The actual number is closer to double that estimate.

I recently set up a Squid cache confederation upstream from EZproxy, did some testing against Gale and ProQuest databases, and found that the real-world number is between 30-40% savings from adding a caching layer.
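For context, a minimal memory-only Squid configuration for this kind of test looks roughly like the following. This is an illustrative sketch, not my exact production settings; the port and object-size values are placeholders, and on older Squid releases you may also need a `cache_dir null` line to fully disable the disk store:

```
# squid.conf — memory-only cache sketch
http_port 3128                        # listening port for EZproxy to chain to
cache_mem 256 MB                      # in-memory object store
maximum_object_size_in_memory 512 KB  # keep only smaller objects in RAM
# No cache_dir directive: recent Squid versions then cache in memory only
access_log /var/log/squid/access.log squid
```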

This suggests that studies on HTTP caching done in the late 1990s still hold true today:
  • "Performance Analysis of WWW Cache Proxy Hierarchies," Journal of the Brazilian Computer Society, vol. 5, no. 2, Campinas, Nov. 1998. http://dx.doi.org/10.1590/S0104-65001998000300003
  • "A Performance Study of the Squid Proxy on HTTP/1.0," Alex Rousskov (National Laboratory for Applied Network Research) and Valery Soloviev (Inktomi Corporation).
  • "Enhancement and Validation of Squid's Cache Replacement Policy," John Dilley, Martin Arlitt, and Stéphane Perret, Internet Systems and Applications Laboratory, HP Laboratories Palo Alto.
It was very interesting that, in my limited testing, my results were largely in line with those studies from over a decade ago:
  • 30-40% cache hit rates with a Squid memory-only cache configuration
  • 5-10% improvement in cache hit ratio by just adding one peer cache
This is despite all of the Web 2.0 technology that has become commonplace since those studies were originally conducted.
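Adding that one peer is a small configuration change. On each node, it comes down to something along these lines (hostnames and ports here are placeholders, and the proxy-only option is one reasonable choice rather than the only one):

```
# squid.conf — sibling peering sketch (mirror the entry on the other node)
icp_port 3130
cache_peer squid-peer.example.edu sibling 3128 3130 proxy-only
# proxy-only: serve peer hits without storing a second local copy,
# so each object tends to live on whichever peer fetched it first
```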

I opted not to configure disk-based storage for the cache for this test, but I may revisit that at some point in the future, given that Rousskov and Soloviev reported nearly 70% hit ratios in their study.

Disk-based storage for the cache deserves a look, but my initial expectation is that in an academic library search setting, one is unlikely to achieve a hit ratio greater than 40%, simply due to the nature of the web sites being used.  Some things that will prevent a higher ratio include:
  • Search term auto completion using AJAX calls
  • The search results themselves
  • Search filtering and refinement
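The reason traffic like this resists caching is that the responses themselves forbid it. The sketch below (a simplified illustration covering common cases, not the full rule set a real cache like Squid implements) shows the kind of header check a shared cache makes before storing a response:

```python
def is_cacheable(method: str, status: int, headers: dict) -> bool:
    """Rough cacheability check for a shared cache (simplified from RFC 7234)."""
    # Only successful, safe retrievals are candidates for storage
    if method != "GET" or status not in (200, 203, 300, 301, 410):
        return False
    cc = headers.get("Cache-Control", "").lower()
    # Explicit opt-outs used by search results and personalized AJAX responses
    if "no-store" in cc or "private" in cc:
        return False
    # Without a freshness lifetime or a validator, a shared cache
    # generally cannot reuse the response with confidence
    return ("max-age" in cc or "Expires" in headers
            or "Last-Modified" in headers or "ETag" in headers)

# Static assets (images, CSS, JS) typically carry freshness info and cache well:
print(is_cacheable("GET", 200, {"Cache-Control": "max-age=86400"}))      # True
# Search results are usually marked private/no-store and never produce hits:
print(is_cacheable("GET", 200, {"Cache-Control": "private, no-store"}))  # False
```

The interactive pieces of a vendor platform fall almost entirely into the second category, which is what caps the achievable hit ratio.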
In a general purpose library setting, a proxy may be able to achieve higher ratios as patrons go to the same sets of web sites for news, job postings, social networks, etc.  In an academic setting, though, with patrons executing individual searches, I am not convinced that achieving the higher cache hit ratios is a reasonable expectation.

The working set of cached objects between Gale and ProQuest was approximately 90MB, so it fit well within the 256MB memory cache size Squid uses by default.  With that workload, the only thing a disk cache could be expected to do is re-populate the in-memory copy when the server restarts.  The cache is quickly primed after only a few requests anyway, so it is not the same situation as a busy cache with gigabytes of data stored on disk.

Another interesting behavior I observed: even though the working set could be fully held in either cache's memory, over time one of the peer caches would hold a subset of objects until they expired, then the other cache would pick up the baton, refresh those objects, and serve the newly refreshed copies to the cache cluster.  Wash, rinse, repeat, and a pendulum pattern emerges as the fresh content moves between the cache peers, with ICP lookups satisfying requests from the peer before making the long haul to the origin server.

Even a 30-40% cache hit rate is nothing to downplay, though.  That is a significant bandwidth (and to a certain extent time) savings, and given that EZproxy does not support HTTP compression, this may be the best that can be hoped for in the short term.
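For anyone wanting to measure this on their own setup, the hit ratio can be pulled straight out of Squid's access log. Here is a short sketch against the default "squid" log format, where the fourth field is the result code such as `TCP_HIT/200` or `TCP_MISS/200` (a quick illustration, not a script from my production environment):

```python
def hit_ratio(log_lines):
    """Fraction of logged requests served from cache (any *_HIT result code)."""
    hits = total = 0
    for line in log_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        result = fields[3].split("/")[0]  # e.g. "TCP_HIT", "TCP_MISS"
        total += 1
        if "HIT" in result:  # covers TCP_HIT, TCP_MEM_HIT, TCP_IMS_HIT, UDP_HIT
            hits += 1
    return hits / total if total else 0.0

sample = [
    "1361400000.123 45 10.0.0.5 TCP_HIT/200 4512 GET http://example.com/a.css - NONE/- text/css",
    "1361400001.456 120 10.0.0.5 TCP_MISS/200 9021 GET http://example.com/search - DIRECT/1.2.3.4 text/html",
]
print(hit_ratio(sample))  # 0.5
```

Running something like this over a day's log gives a quick sanity check on whether the 30-40% figure holds for your own traffic mix.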
