Friday, January 30, 2015

Basic kickstart file for EZproxy instances

A discussion thread about EZproxy server sizing for VMs has been underway on the EZproxy mailing list this week, and some have asked for details on the setup that we run for our hosted proxy servers.

We use VMs with 512MB RAM, 1 processor, and 8GB of disk space, with log files stored on a network share.  This provides adequate resources for both EZproxy and Squid to run side-by-side:

             total       used       free     shared    buffers     cached
Mem:        502112     272564     229548         16       4500      78088
-/+ buffers/cache:     189976     312136

This is achieved by running a minimal installation with all unnecessary daemon processes disabled.

Here is the kickstart that we use for our proxy servers:

lang en_US.UTF-8
selinux --enforcing
keyboard us
authconfig --enableshadow --enablemkhomedir --enablecache --passalgo=sha512
timezone --utc America/New_York
firewall --enabled --ssh --port=53:tcp,53:udp,80:tcp,443:tcp,3128:tcp,3130:udp
rootpw --iscrypted <hashed password>
firstboot --disabled
services --disabled anacron,atd,autofs,avahi-daemon,bluetooth,cups,firstboot,gpm,hidd,mdmonitor,netfs,pcscd,readahead_early,rpcgssd,rpcidmapd,yum-updatesd,microcode_ctl
text
skipx
reboot
install
bootloader --location=mbr --driveorder=sda
network --bootproto=static --device=eth0 --ipv6=auto --ip=<ipaddr> --netmask=255.255.255.0 --gateway <gwipaddr> --nameserver=<dns1ip>,<dns2ip> --hostname <proxy host name>
url --url=http://<install server>/centos/6.6/os/x86_64
repo --name=epel --baseurl=http://<install server>/epel/6/x86_64
zerombr yes
clearpart --all --drives=sda
part swap  --fstype=swap --ondisk sda --size=2048
part /boot --fstype=ext4 --ondisk sda --size=256
part /     --fstype=ext4 --ondisk sda --size=1 --grow 
# Packages
%packages --nobase
epel-release
yum
yum-utils
sudo
strace
telnet
tcpdump
rpcbind
nfs-utils
autofs
openssh-server
openssh-clients
puppet
ipa-client
squid
calamaris
awstats
%end
%post --interpreter /bin/sh --log /root/post_install.log
chvt 3
exec < /dev/tty3 > /dev/tty3
echo "Running %post script"
echo "Running puppet agent"
puppet agent --test --waitforcert 60 --logdest /root/puppet_install.log
echo "Removing 32-bit runtime"
# We do not need 32-bit compatibility by default
yum -y erase glibc.i686
echo "Performing update"
# Update to latest
yum -y update
echo "Fixing plymouth"
# Turn off the pretty end-user boot screen, and show the useful boot messages
plymouth-set-default-theme details
/usr/libexec/plymouth/plymouth-update-initrd
exec < /dev/tty1 > /dev/tty1
%end
After the install, puppet re-installs only the base 32-bit runtime libraries needed for EZproxy, copies the EZproxy binary, configures it, and starts it up.  The only step that currently still needs to be done manually is the EZproxy SSL setup, which certmonger may be able to help address.
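As a rough illustration, the Puppet side looks something like the class below.  This is a hypothetical minimal sketch rather than our actual production manifest; the class name, package list, file paths, and module layout are all assumptions.

# Hypothetical sketch of the post-kickstart EZproxy setup; names and paths
# below are illustrative only.
class ezproxy {
  # Re-install just the 32-bit runtime the EZproxy binary links against
  package { 'glibc.i686':
    ensure => installed,
  }

  # Copy the EZproxy binary into place
  file { '/usr/local/ezproxy/ezproxy':
    ensure => file,
    mode   => '0755',
    source => 'puppet:///modules/ezproxy/ezproxy',
  }

  # Manage the configuration file from a template
  file { '/usr/local/ezproxy/config.txt':
    ensure  => file,
    content => template('ezproxy/config.txt.erb'),
  }

  # Keep the service enabled and running once the files are in place
  service { 'ezproxy':
    ensure  => running,
    enable  => true,
    require => [ File['/usr/local/ezproxy/ezproxy'], File['/usr/local/ezproxy/config.txt'] ],
  }
}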

Wednesday, January 28, 2015

EZproxy + Squid: Bolting on a caching layer (revisited)

In a previous writeup, I detailed early results using a caching layer with EZproxy.  Now that we have quite a bit of experience with the configuration, it's time to update with a long-term view of the results and some analysis of how effective it is overall.

To understand the benefits, a quick description of the architecture is necessary first.  We run a clustered configuration using the HAPeer support built into EZproxy, with each proxy server running its own local Squid instance configured as a cache sibling of its partner's Squid instance.


    Vendor                       Vendor
      /\                           /\
      ||                           ||
+============+               +============+
|   Squid    | <- sibling -> |   Squid    |
+------------+               +------------+
      /\                           /\
      ||                           ||
Proxy/ProxySSL                Proxy/ProxySSL
      ||                           ||
      ||                           ||
+------------+               +------------+
|  EZproxy   | <-  HAPeer -> |  EZproxy   |
+============+       /\      +============+
                     ||
                     ||
                   Patron


The patron accesses the proxy cluster by the HAPeer name, which is DNS mapped to each of the proxy servers.  EZproxy makes an internal decision either to service the request itself or to hand it to a peer machine, and issues a redirect to the patron's browser accordingly.  From that point forward, the request follows the same flow through that EZproxy server as it would in a stand-alone proxy installation.
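For reference, the proxy-chaining and peering pieces of this layout come down to a few lines in each server's EZproxy config.txt.  The fragment below is illustrative only: the hostname is a placeholder, and the exact directive syntax should be checked against the EZproxy documentation for your version.

# Illustrative fragment of config.txt on one node
HAPeer http://ezproxy-02.example.edu/
# Chain HTTP and HTTPS requests through the local Squid instance
Proxy localhost:3128
ProxySSL localhost:3128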

In our implementation, we put Squid on the local proxy machine with mostly default settings for RHEL/CentOS: no disk caching (i.e. no cache_dir setting) and the defaults for cache_mem (currently 256MB).  The defaults have worked well in our workloads, as our maximum memory usage is around 128MB after the cache is fully primed.
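Spelled out explicitly, the memory-only caching comes down to the following; these lines simply restate the distribution defaults described above rather than anything we changed.

# All cached objects live in RAM: cache_mem bounds the memory cache,
# and omitting cache_dir means nothing is ever written to disk
cache_mem 256 MB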

The biggest change needed was to enable the sibling support between the Squid instances:
digest_generation on
icp_port 3130
icp_access allow localnet
icp_access deny all
cache_peer ezproxy-01.example.edu sibling   3128  3130  proxy-only
This allows each proxy server to ask its peer, "Hey, I don't have anything for this URL, do you?"  If the sibling has the content, it can be fetched from the partner cache, slightly increasing the overall effectiveness of the clustered proxy setup.  In larger setups, multicast support would make sense, but ours is small enough that it was not worth the extra configuration to get that working.
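The partner server carries the mirror-image line pointing the other way (hostname again a placeholder):

cache_peer ezproxy-02.example.edu sibling   3128  3130  proxy-only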

One tool that we use to measure the effectiveness of the cache setup is a reporting program called Calamaris.  This gives us insight into what kind of requests are being made, how many of those requests are serviced locally or from a sibling, and how many of them are pulled from vendor content directly.
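A minimal run looks something like the following.  The log path is the stock RHEL/CentOS location and is an assumption for any particular install, and Calamaris takes many report and output options beyond this bare invocation.

# Summarize the Squid access log into a plain-text report
calamaris < /var/log/squid/access.log > /tmp/squid-report.txt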

In a recent sample, the report showed that almost 48% of the requests were found in the cache, while 52% of the requests had to go all the way to the vendor to be serviced.  The sibling setup was consulted for 3.5% of the requests, and those were successfully served from the sibling's copy about 33% of the time (roughly 1% of all requests).  I suspect that the more diverse the vendor resources, the higher this sibling number may go, though studies suggest that the number of successful sibling requests will probably never exceed 15% overall.

Looking at the report by network usage shows a slightly different story, where 30% of the traffic going through Squid was served locally (and at maximum network speed since in our configuration all objects are stored in RAM), while the remaining 70% of the network traffic was not found in cache and had to go all the way to the vendor.

What were some of the web assets that were successfully cached?
  • CSS files (90% hit ratio)
  • PNG images (90% hit ratio)
  • JavaScript files (89% hit ratio)
  • GIF images (87% hit ratio)
  • ICO images (70% hit ratio)
  • JPEG images (51% hit ratio)
I suspect the lower hit ratio for the JPEG files is because those files are more likely to be licensed photography content rather than the user interface elements where GIF and PNG files are more commonly used.  The net result is that the files that typically block web page layout and rendering are served at the highest speed possible, which makes for a better overall user experience.  This was validated by actual inquiries we received from our members asking "What changed? Everything feels faster now!" after this was rolled out.

What requests typically bypass the caches?
  • JSON requests for AJAX calls (3% hit ratio)
  • PDF files (3.5% hit ratio)
  • Dynamic content (6.5% hit ratio)
This matches expectations.  The JSON requests are used for functions like autocomplete of search terms, search limiters on some platforms, pagination of search results, and analytics tracking.  The PDF files are mostly licensed content, and the "Dynamic content" category largely catches searches and search results screens.

Thankfully, we have found that the developers at the vendors make correct use of cache-control headers to keep non-cacheable content from being cached and re-served, and we have had ZERO reported instances of cache-related issues over the past 2 years that this configuration has been in production use.  I credit this largely to the practice of ISPs implementing transparent cache servers on their home user networks to manage bandwidth usage, which means vendor platforms have long had to behave correctly behind caches they do not control.
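As an illustration (the header values here are generic examples, not taken from any specific vendor), the difference usually comes down to response headers along these lines:

# Dynamic content such as search results and JSON responses: never stored
Cache-Control: no-store, no-cache, must-revalidate
# Static UI assets such as CSS, JavaScript, and logos: safe to cache and re-serve
Cache-Control: public, max-age=86400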

Prior to implementing caching support on our proxy servers, our network usage in:out ratio was very close to 1:1.  After implementing this architecture, it is not uncommon to see a 1:2 ratio, where half of our proxy-to-vendor traffic has been eliminated thanks to the shared caching configuration.  If a version of EZproxy is released that supports compression as well, this ratio may go as high as 1:10 with the combination of compression and caching.