Thursday, November 18, 2021

What iCloud Private Relay means to libraries and universities

This topic has generated quite a bit of discussion on one of the proxy mailing lists, and it occurred to me that it may need a bit wider distribution since not all of the potential impact cases are exclusive to proxy services.

What is "iCloud Private Relay"?

It is a new privacy-preserving feature in iOS 15, iPadOS 15, and macOS Monterey that prevents IP addresses from being mapped directly back to end users.

Conceptually, Private Relay shares many attributes with a VPN service.  One major difference from traditional academic rewriting proxy services is that there is an explicit separation between incoming and outgoing traffic: the traffic arriving from the Apple device and the traffic leaving the service for web sites are handled by separate parties, configured so that neither side's network equipment can see the other:

When Private Relay is enabled, your requests are sent through two separate, secure internet relays. Your IP address is visible to your network provider and to the first relay, which is operated by Apple. Your DNS records are encrypted, so neither party can see the address of the website you’re trying to visit. The second relay, which is operated by a third-party content provider, generates a temporary IP address, decrypts the name of the website you requested and connects you to the site. (source)

What is required to enable the "Private Relay" feature?

This is currently part of the paid iCloud+ monthly subscription service, and enablement is managed by individual users on their own devices.

Why should I care about Private Relay?

If users enable this service, their devices will use IP addresses that are mapped to broad geographic regions (Geolocation IP mapping data is available from Apple).  

This means that users on campus would no longer present campus IP addresses when accessing vendor services, so any vendor service that is IP authenticated to your campus will no longer see those devices as coming from a campus IP.

Depending on how the campus network is configured, it is also possible that the Private Relay IP addresses could be used to access on-campus services, making the user appear as though they were off-campus.  This is probably an unusual situation these days, but it is possible.

Will this impact authentication services?

Probably not, unless there are IP-based usage restrictions in place for your authentication service.  The other possibility that may come into play is if your IdP implements location-based or distance-based threat calculations to detect potential bad actors.  

One example I have seen blocks users who would have had to travel faster than 600 miles per hour to get from location A to location B.  The way Apple has set up this service makes it unlikely those rules would be triggered, but until we see it in action, I would not feel comfortable saying it absolutely cannot happen.

Will this impact proxy services?

Again, probably not.  Most proxy servers use cookies to keep track of user sessions, and those cookies should not be tied to the IP address of the authenticated user.

This service is not set up to randomize IP addresses, but rather to obscure them.  As currently documented, users will be mapped to a country, state/province, or major city as their location, but there is no indication that they will bounce between geographies unless the user is actually mobile.  This is not much different from how a user on a phone or hotspot appears while traveling down a highway today.

Unless there is an intermediate system involved in authentication that does take IP addresses into account, the current thinking is that this will not impact proxy users any more than mobile networks or VPNs do today.  I have seen some token- or HMAC-based authentication services that can use IP addresses as part of the authentication, but again, this is not an IP randomization service, so the end user's apparent IP address may actually wind up changing less often than it does today.

Another factor is that the service is designed to use the newer UDP-based QUIC protocol. I am not aware of any rewriting proxy software today that supports QUIC, and if the destination site does not support QUIC, then the connections may not be permitted to use Private Relay at all:

iCloud Private Relay uses QUIC, a new standard transport protocol based on UDP. QUIC connections in Private Relay are set up using port 443 and TLS 1.3, so make sure your network and server are ready to handle these connections. (source)
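
A quick way to spot-check whether a destination site advertises HTTP/3 (and is therefore a QUIC candidate) is to look for an Alt-Svc header in its responses.  As a rough sketch (the hostname here is just a placeholder):
curl -sI https://vendor.example.com/ | grep -i alt-svc
An "alt-svc: h3=..." line in the output indicates that the server is advertising HTTP/3 support.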

Will this impact usage metrics?

If you rely on IP location mapping to approximate usage by campus location, for example, you can wave goodbye to that if your campuses are in close proximity to each other.

Many vendors are leveraging Cloudflare to front-end their platforms today, and if HTTP/3 (with QUIC) is enabled on the Cloudflare zone, then that zone will be a candidate for Private Relay.

Cloudflare dashboard

You will instead get report data down to the "region", which could be a country, state/province, or closest major metropolitan area depending on local population density.  

If you have multiple campuses (or other group/entities that you wish to measure by geography) that fall within that defined "region", the metrics are going to all be lumped together for those users, and a different method for categorizing users will need to be employed.

What if we need to use campus IP authentication for services?

There is a provision for local IT to take steps to disable the use of Private Relay for the local site:

Some enterprise or school networks might be required to audit all network traffic by policy, and your network can block access to Private Relay in these cases. The user will be alerted that they need to either disable Private Relay for your network or choose another network.

The fastest and most reliable way to alert users is to return a negative answer from your network’s DNS resolver, preventing DNS resolution for the following hostnames used by Private Relay traffic. Avoid causing DNS resolution timeouts or silently dropping IP packets sent to the Private Relay server, as this can lead to delays on client devices.

mask.icloud.com
mask-h2.icloud.com 
(source)
This can be accomplished by using DNS filtering, Response Policy Zones (RPZ), or similar DNS server techniques.
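
As a concrete sketch, on a resolver running unbound this could look like the following (these are standard unbound local-zone directives; adapt the approach to whatever DNS software your site actually runs):
server:
    local-zone: "mask.icloud.com" always_nxdomain
    local-zone: "mask-h2.icloud.com" always_nxdomain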

Thursday, May 18, 2017

A Boring Company

Love him or hate him, you cannot deny Elon Musk's sense of humor.  With his announcement of "The Boring Company", he has set up the perfect backdrop for a modern revival of the classic "Who's on First?" comedy routine.

Scene: Two men are sharing a ride in a shuttle from the airport to their hotel, making small talk to pass time.

[A] What are you in town for?
[B] A boring convention.
[A] Well, they can't all be fun.
[B] Oh, I'm really excited about this one.
[A] <skeptically> Really, why's that?
[B] I get to see the latest in boring equipment.
[A] <flatly> Ah-ha.

[A] What company do you work for?
[B] I work for The Boring Company.
[A] Don't we all!  But seriously, which company?
[B] The Boring Company.
[A] I see.  What kind of work do you do there?
[B] I bore.
[A] <deadpan stare> You don't say...

[A] So when you go to work each day, what do you do?
[B] I run The Boring Machine.
[A] Gets tedious fast, does it?
[B] Oh no, every day is a new challenge!
[A] Really?  How is that?
[B] Well, I get to bore things that have never been bored before.
[A] I know how it feels...  Do you have any coworkers?
[B] Yes, there are two of us in my department.
[A] What does your colleague do?
[B] He runs The Other Boring Machine.
[A] <aside> How did I not see that coming?

[A] Do you have to submit any reports on your work?
[B] Oh, every day!
[A] What kind of reports do you submit?
[B] Boring reports.
[A] You mean nobody reads them?
[B] Oh no, everyone reads them, some of them even get posted on the wall!
[A] The wall?
[B] Yes, we have a wall where the best reports are posted.
[A] What do you call this wall?
[B] The Boring Wall.

[A] Do you know your company's leadership?
[B] Oh sure!
[A] So when your leader makes an introduction, what does he say?
[B] He says, "Hi, I'm The Boring Company's CEO."
[A] Of course he does.

[A] Is your company publicly traded?
[B] Oh yes!
[A] What is the ticker?
[BOTH] BOR

[A] Do you ever read your company's financial reports?
[B] Every quarter.
[A] So at the top of the financial reports, what does it say?
[B] "The Boring Company's balance sheet".
[A] Right.

[A] So let me get this straight:  You work at the boring company, doing a boring job where you bore all day long, run a boring machine along with a coworker that runs the other boring machine, write boring reports that get posted on the boring wall, work for a boring CEO who submits boring financial reports?
[B] That's right!
[A] Sounds like a boring life.
[B] <enthusiastically> You said it!



Wednesday, March 8, 2017

Never Give Up, Never Surrender: How to connect to modern SSL websites from EZproxy 5.x using stunnel

Prior to OCLC's acquisition of EZproxy, there were permanent licensing options available to institutions, and many sites chose to exercise them.  Unfortunately, because EZproxy does not embrace the platform and link dynamically against the operating system's installation of OpenSSL, sites still running version 5.x of the software are now starting to run into SSL issues due to PCI-DSS mandated SSL changes being adopted by various hosting providers and CDN platforms.

In a nutshell, EZproxy is no longer tall enough to ride the modern encrypted Internet.  

Or for you meme lovers, EZproxy no can haz interwebz.

Fear not, there is a solution that will enable sites with permanent licenses to continue using the software for a while longer:

We need to give EZproxy some platform shoes.




The root of the problem is that PCI-DSS compliance now requires SSL web servers to use increasingly sophisticated encryption settings, some of which were simply not available in the version of OpenSSL that EZproxy 5.x uses.  The work-around is to add a helper tool that will still talk the older version of SSL to EZproxy while talking the newer versions of SSL to remote web servers.

Now this is not quite as easy as configuring a proxy server and using ProxySSL to solve the issue.  Why?  Because while an SSL proxy uses the HTTP CONNECT verb to establish an on-demand tunnel to the remote endpoint, it does not actively participate in the conversation beyond the initial "CONNECT search.example.com:443 HTTP/1.1" request, which initiates the tunnel.  After that, the client and the remote system interact directly, which means that the SSL protocol negotiation occurs directly between EZproxy and the remote system, and that is no better than connecting directly to the remote server.

What needs to happen is for EZproxy to talk to something using the encryption that it supports, and for that something to talk to the remote system using encryption that the remote system supports.

Enter stunnel.

The work-around is actually pretty simple once you understand the problem.  What needs to be achieved looks like this:

EZproxy => stunnel (server) => stunnel (client) => remote server

Let's tackle this one piece at a time, using https://search.example.com/ for the demonstration destination website.

First we need to intercept all calls to the remote server.  Normally when EZproxy contacts a remote server, it performs a DNS lookup to find the IP address of the remote system and connects.  This needs to be short-circuited so that we can intercept the HTTP traffic and route it through stunnel instead.  

The easiest way to do this is to leverage your system's hosts file (/etc/hosts on *NIX systems, c:\Windows\System32\Drivers\etc\hosts on MS Windows systems) and add an override for the hostname:
127.0.0.2 search.example.com
Now your EZproxy server will use the locally defined address instead of the DNS entry:

Let's try to ping it:
# ping search.example.com
PING search.example.com (127.0.0.2) 56(84) bytes of data.
From 127.0.0.2 icmp_seq=1 Destination Net Unreachable
Well, the name override worked, but the IP address is not answering.  That's easy to fix, actually, by using interface aliases on the loopback interface:
ifconfig lo:0 127.0.0.2
Now the ping works:
# ping search.example.com
PING search.example.com (127.0.0.2) 56(84) bytes of data.
64 bytes from 127.0.0.2: icmp_seq=1 ttl=64 time=0.040 ms
So far so good.  The server now thinks that search.example.com is 127.0.0.2, and that IP address is also answering to basic networking functions.
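On newer distributions that no longer ship the legacy ifconfig tool, an equivalent iproute2 command should look something like this (just a sketch; adjust for your environment):
ip addr add 127.0.0.2/32 dev lo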

Next we set up stunnel itself (in this example, all of the following will go into /etc/stunnel/ezproxy.conf).

First some global settings:
#foreground = yes
#debug = 7
syslog = yes
The foreground option and debug level are very useful in debugging, but be sure to comment them out when you're done (as seen above).  Recording access in syslog is handy for troubleshooting, but not strictly necessary.
Next we set up the local stunnel server.  This is the part that is going to accept SSL connections from EZproxy itself.
[search.example.com-server]
client = no
accept = 127.0.0.2:443
connect = 127.0.0.2:80
protocolHost = search.example.com:443
cert = /opt/ezproxy/ssl/00000001.crt
key = /opt/ezproxy/ssl/00000001.key
There are a few things to note here:

  1. "client = no", so this defines a service that is listening for connections.
  2. This server is set up to accept HTTPS connections on port 443 and turn back around and connect to the same IP address over an HTTP connection via port 80.
  3. This configuration is re-using the EZproxy SSL certificate and key pair, so the filenames used may be different on your system.

Next, add the stunnel client:
[search.example.com-client]
client = yes
accept = 127.0.0.2:80
connect = 192.0.2.1:443
protocolHost = 192.0.2.1:443
sni = search.example.com
sslVersion = TLSv1.2
Again a few notes to walk you through this:

  1. "client = yes", so this is the part that is going to be talking to the remote server.
  2. This accepts the un-encrypted traffic from the server block above.
  3. Stunnel supports SNI, which is being more widely adopted now that XP is no longer a concern.  If you have to enact this work-around, you should anticipate needing to use SNI as well.
  4. The last line is what allows EZproxy 5.x to use modern SSL with PCI-DSS compliant platforms.
Now that your hosts entry is in place, your IP alias is answering, and you have configured stunnel, simply point stunnel at the configuration file, and fire it up:
# stunnel /etc/stunnel/ezproxy.conf
If you enabled the debug and foreground options in the configuration file, you should see something like this emitted:
Service [search.example.com-server] accepted connection from 127.0.0.2:60181
connect_blocking: connected 127.0.0.2:80
Service [search.example.com-client] accepted connection from 127.0.0.2:51642
connect_blocking: connected 192.0.2.1:443
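Before testing through EZproxy, you can also sanity-check the stunnel server leg by hand with openssl s_client (a quick sketch; it assumes the OpenSSL command line tools are installed):
# openssl s_client -connect 127.0.0.2:443 -servername search.example.com
A completed handshake and certificate dump confirms that the listener on 127.0.0.2:443 is up and presenting the expected certificate.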
At this point, unless you are using a caching Squid proxy (and you ARE, aren't you?), you should be able to access https://search.example.com via your EZproxy server and load the content.  If you are running Squid, you will need to restart or reconfigure it so that it picks up the new /etc/hosts entries before this will work.

So there you have it, another tool for your EZproxy toolbox.

Monday, September 19, 2016

Space based data centers?

For those of you who were in the IT space about a decade ago, you may remember that Sun Microsystems promoted a "data center in a box": a standard shipping container converted into a mobile server room that could be deployed to major events, provide disaster relief support, etc.  But taking a bunch of these, slapping a nose cone around them, and sending them into space is not what this post is about.

A recent article in the Winnipeg Free Press about the modern space race between billionaires Elon Musk, Jeff Bezos, Paul Allen, and Richard Branson reminded me about SpaceX's plans to develop a global satellite internet service, and got me thinking about where their initiative could lead.  But first, a little background:

While traditional communication satellites orbit geosynchronously at 22,236 miles, the satellites in SpaceX's planned constellation would orbit at only a 750 mile altitude -- much lower than traditional communications satellites.  This would place the SpaceX constellation not quite twice as high as the Iridium constellation today.

Using a lower orbit will require more satellites to provide the same coverage, but solves one of the fundamental issues with satellite internet service -- latency.  A one-way trip from the earth's surface to a geostationary satellite takes approximately 120ms (22,236 miles at roughly 186,000 miles per second is about 0.12 seconds).  Then it has to be re-transmitted to a ground station, adding another 120ms.  Only then can the request be sent to the servers, incurring what most of us consider "normal" latency, and then the response has to make 2 more trips -- one up and another down -- to return to the user, adding another 120ms each way.  Not accounting for any terrestrial delays, this means that every single packet going through traditional satellite internet systems today takes about 1/2 second simply making the four trips up to the geostationary satellite and back down.

How satellite internet works today (right hand side shows the data path)


This means that every web site, even those using AJAX calls for user interface responsiveness, will instantly feel "slow" on existing satellite services.  Research by Jakob Nielsen has shown that for a system to feel like it is responding instantaneously, it needs to update in 0.1 seconds.  For a user's train of thought to not be interrupted, it needs to update in 1.0 seconds.  With 0.5 seconds of guaranteed latency, "instantaneous" is already out the window with traditional satellite internet, and "normal" internet transmission and processing delays make meeting the 1.0 second deadline a challenge for anything but the most basic of actions.

Choosing a significantly lower orbit changes this.  Instead of 120ms, it only takes signals about 4ms to reach a satellite 750 miles up.  The 4-way trip only adds 16ms to the transmission time, which is on par with terrestrial network performance.  This means that latency should give way to bandwidth as the primary concern for users of this system.

So far, so good, right?  Now, assuming you have all 4,000 satellites circling the Earth at 750 miles and providing internet service, what else could you do with them?

Having been operating in the library world for several years now, and using caching proxies in a commercial setting for many years before that, I am very familiar with the challenges posed by limited bandwidth.  Just ask any IT department what one of their biggest problems is, and bandwidth is going to be on the short list.  To combat this, caching proxy servers can be used, and in fact they are already deployed in various ways by existing internet service providers, especially the satellite ISPs.  Some provide customer equipment that includes a proxy server, while others deploy proxy servers at the ground stations.  So far, however, I have been unable to find an instance where a satellite operator deployed a proxy server on the satellite itself.  This is where things start to get interesting.

One of the cost containment measures that SpaceX employs on the Falcon 9 is to prefer off-the-shelf components over specialized hardware at several multiples of the cost.  CubeSats today use a similar approach, leveraging off-the-shelf components crammed into a cube (thus the name) with roughly 10 cm of useful volume on a side.  [While I'm not sure that SpaceX could quite cram enough into a CubeSat to build their network, the thought of a single Falcon 9 deploying a swarm of 4,000 CubeSats at once amuses me, and would certainly set multiple records.]

Given SpaceX's existing experience with Falcon 9 hardware, as well as the data from CubeSat experiments, it is possible that SpaceX may eschew traditional radiation hardened CPUs, and test off-the-shelf components for their satellite constellation, adding radiation shielding and designing redundancy into the circuits to mitigate the radiation effects instead.  This could mean anything from ARM processors to x86 based designs, but for sake of imagination, let's assume that an x86 design was chosen, and that the processor selected had all of the virtualization bells and whistles enabled, allowing for satellite control operations to be cleanly segmented away from satellite service operations.  What might that mean for their platform?

Well, for one, SpaceX could deploy caching proxy servers on the satellites themselves.  For static assets (graphics, javascript, css, etc.) this would save the satellite-to-ground-station leg, avoid the internet service times, and reduce the latency to about 8ms for the trip up to the satellite, servicing by the local proxy, and the return to the ground.  If the satellites were also mesh networked, they could operate as a proxy cluster, sharing assets and querying their local peer group for content as well.

The concept of a local peer group in a meshed satellite constellation is a very interesting one to me.  Without doing all of the detailed math, allow me to do some hand waving to move the discussion along.  A sphere with a 4,700 mile radius (average Earth radius of 3,950 miles + 750 miles of orbital altitude) has a surface area of about 277,591,000 square miles.  Assume that the satellites are evenly distributed across that sphere (here's where the hand waving is), and each satellite will cover approximately 70,000 square miles of that sphere -- a coverage circle with a radius on the order of 150 miles.  Applying basic geometry, that should allow each satellite to reasonably communicate with 5 or 6 of its nearest peers without adding noticeable latency to the system, assuming they fly in a formation similar to a geodesic dome configuration.

Why is this significant?  It means that in addition to the proxy server running on each individual satellite, each proxy could query its peers for assets as well, with the satellite-to-satellite communication still being faster than communication with the ground station.  Having worked with clustered proxy configurations before, I can say this serves to amplify the effective capacity of the cache cluster.  Depending on the nature of the cached requests and the exact configuration of the satellite constellation, it might make sense to define the local cache cluster as not just the immediate peer satellites, but also their peers, further amplifying the overall benefits of the peering relationships.

"OK, that's great,", you may be thinking, "but what does this have to do with space-based servers?"  Well, remember how I asserted that SpaceX may be able to use off-the-shelf hardware for their satellites, and then laid out one application as a specific example of how an internet service (caching proxy) that could take advantage of running directly on the platform?

What about the remaining cores that are sitting idle on the satellite CPUs?

Recent Intel Xeon Phi processors have 72 cores (yes, I realize that product is targeted at the HPC market, but even traditional virtualization-targeted CPUs have 24-32 cores these days, so the point still stands), so if this were the processor of choice for the satellites, control operations could have a core dedicated to them, proxy services could take a second core, leaving 70 cores twiddling their thumbs.  On 4,000 satellites.  With reasonable latency not only to the ground, but between each other.

What would you do with over a quarter of a million CPU cores sitting idle on a low-latency space-based network?  If I were SpaceX, I would look at renting them out.  "The Cloud" is widely used today to talk about hosted servers on the internet, but this would be a true cloud platform, one circling the Earth like an electron around an atom.  And there is no reason to assume that one application has to be statically mapped to one core, either.  Applications could be deployed as docker containers instead of fully virtualized servers, raising the effective capacity of the entire swarm.

Traditional CDN providers would seem to be a natural fit for this platform, but what would major internet services do with access to a platform like this?  It would not be large enough to displace their terrestrial operations, but with a small collection of smart edge nodes to boost their services, what functionality would that open?

Add migration capabilities to the platform, and a single application could move between satellites as they orbit, maintaining coverage over a specific terrestrial geography 24x7.

SpaceX could also choose to expand the scope of the network a bit by throwing a few extra sensors on the satellites and sell time to scientists for research.  Add a couple of cameras for real-time earth imaging, and they could open up not only earth observation science, but also real-time image feeds from space for various commercial applications.  Could the platform also function as an alternative to TDRSS for other science missions?

Take this same orbital cloud, put it on a Falcon Heavy, and deploy it to Mars with the same (or better) capabilities to establish a global network there before any human sets foot on the planet, and you can start exploration, colonization, and research on Mars with full planetary monitoring, voice and electronic communications, file storage and sharing, and other IoT conveniences available from day one.

Add larger transmission relay nodes at the Lagrange points, and you could interface each planetary orbital cloud with a high-powered transmitter to enable high-bandwidth store-and-forward communications at the interplanetary scale.  Mail from bob@domain.com.earth to alice@domain.org.mars, anyone?

Thursday, March 31, 2016

Thing-ifying your Internet

Librarians in general were called out recently for not having widespread knowledge of the Internet of Things (IoT).  The irony of this event was that it happened mere days after being asked for examples of people unintentionally being made to look stupid in public.  QED, I suppose.

Remember that everyone has a domain in which they are knowledgeable, and we should not only respect that, but also recognize that when any of us stray into an area in which we are not (yet) knowledgeable, we are going to feel like kindergartners in a calculus class.  It is far better to find a common frame of reference and work forward from that point.

I suspect that many may already be aware of the concept of IoT without being aware of that term.  Since it is still a nascent technology, I would assert that it is in the early adopter phase and has not yet fully come into the common vernacular, so it is unrealistic at this point to expect everyone to be intimately familiar with IoT concepts.  Depending on the subject area you work with, you might be more familiar with the term "Wearable", "Implantable devices", "Ingestible devices", "Medical Monitor", "Smart Home/House/Car/Street/Appliance", "Connected Device", etc.

No one says "I’m IoT-ifying my house" — of course not, they say "I’m making my home a smart house" because that’s the terminology that is used in that corner of the IoT market.  Do you hear people boasting of the IoT on their wrist?  No, they show off their "Smart Watch".  Do they cook meals in an IoT?  No they use a "Smart crock pot".  (The very concept of the internet connected crock pot still makes me chuckle. I can’t wait until I can finally realize my dream of connecting to my toaster via telnet.  I wonder if it will have a camera to watch the toast brown?)

I would expect that librarians working at sites with maker spaces would be much more likely to be aware of the IoT concept, because that demographic is going to be at or near the leading edge of technical adoption, and patrons using 3D printers are the ones I would expect to say "Hey, can I put a computer in this thing I just printed?".  Other sites, especially those light in physical holdings and heavy in online resources, would not have an immediate need to care about IoT yet, other than helping researchers know where to look to find resources on that topic.

This meshes nicely with a talk that I heard Tim O'Reilly give once: when an industry is disrupted, new value is found by moving either up or down the technology stack.  The example he cited was when computers became commodities, new value was found by moving up (Microsoft into operating systems) or down (Intel into chips).  Applying this concept to librarianship, the choice is to broaden into areas like maker spaces, or to specialize even deeper into specialized subject areas to still provide value.  Both approaches require compromises: wide and shallow, or narrow and deep?

Even looking a few years into the future when IoT technologies are ubiquitous, I doubt IoT will ever be the dominant term used.  We don't Short Message Service our friends, we text them.  We don't apply Huffman coding to a string of binary data, we compress a file.  

You don’t Thing-ify your Internet, you buy smart stuff.

Friday, January 30, 2015

Basic kickstart file for EZproxy instances

A discussion thread about EZproxy server sizing for VMs has been underway on the EZproxy mailing list this week, and some have asked for details on the setup that we run for our hosted proxy servers.

We use VMs with 512MB RAM, 1 processor, and 8GB of disk space, with log files stored on a network share.  This provides adequate resources for both EZproxy and Squid to run side-by-side:

             total       used       free     shared    buffers     cached
Mem:        502112     272564     229548         16       4500      78088
-/+ buffers/cache:     189976     312136

This is achieved by running a minimal installation with all unnecessary daemon processes disabled.

Here is the kickstart that we use for our proxy servers:

lang en_US.UTF-8
selinux --enforcing
keyboard us
authconfig --enableshadow --enablemkhomedir --enablecache --passalgo=sha512
timezone --utc America/New_York
firewall --enabled --ssh --port=53:tcp,53:udp,80:tcp,443:tcp,3128:tcp,3130:udp
rootpw --iscrypted <hashed password>
firstboot --disabled
services --disabled anacron,atd,autofs,avahi-daemon,bluetooth,cups,firstboot,gpm,hidd,mdmonitor,netfs,pcscd,readahead_early,rpcgssd,rpcidmapd,yum-updatesd,microcode_ctl
text
skipx
reboot
install
bootloader --location=mbr --driveorder=sda
network --bootproto=static --device=eth0 --ipv6=auto --ip=<ipaddr> --netmask=255.255.255.0 --gateway <gwipaddr> --nameserver=<dns1ip>,<dns2ip> --hostname <proxy host name>
url --url=http://<install server>/centos/6.6/os/x86_64
repo --name=epel --baseurl=http://<install server>/epel/6/x86_64
zerombr yes
clearpart --all --drives=sda
part swap  --fstype=swap --ondisk sda --size=2048
part /boot --fstype=ext4 --ondisk sda --size=256
part /     --fstype=ext4 --ondisk sda --size=1 --grow 
# Packages
%packages --nobase
epel-release
yum
yum-utils
sudo
strace
telnet
tcpdump
rpcbind
nfs-utils
autofs
openssh-server
openssh-clients
puppet
ipa-client
squid
calamaris
awstats
 %post --interpreter /bin/sh --log /root/post_install.log
chvt 3
exec < /dev/tty3 > /dev/tty3
echo "Running %post script"
echo "Running puppet agent"
puppet agent --test --waitforcert 60 --logdest /root/puppet_install.log
echo "Removing 32-bit runtime"
# We do not need 32-bit compatibility by default
yum -y erase glibc.i686
echo "Performing update"
# Update to latest
yum -y update
echo "Fixing plymouth"
# Turn off the pretty end-user boot screen, and show the useful boot messages
plymouth-set-default-theme details
/usr/libexec/plymouth/plymouth-update-initrd
exec < /dev/tty1 > /dev/tty1
%end
After the install, puppet re-installs only the base 32-bit runtime libraries needed for EZproxy, copies the EZproxy binary, configures it, and starts it up.  The only step that currently still needs to be done manually is the EZproxy SSL setup, which certmonger may be able to help address.

Wednesday, January 28, 2015

EZproxy + Squid: Bolting on a caching layer (revisited)

In a previous writeup, I detailed early results using a caching layer with EZproxy.  Now that we have quite a bit of experience with the configuration, it's time to update with a long-term view of the results and some analysis of how effective it is overall.

To understand the benefits, first a discussion of the architecture in place is necessary.  We run in a clustered configuration using the HAPeer support built into EZproxy, with each proxy server running its own local Squid installation with a sibling configuration as a cache peer of its partner's Squid instance.


    Vendor                       Vendor
      /\                           /\
      ||                           ||
+============+               +============+
|   Squid    | <- sibling -> |   Squid    |
+------------+               +------------+
      /\                           /\
      ||                           ||
Proxy/ProxySSL                Proxy/ProxySSL
      ||                           ||
      ||                           ||
+------------+               +------------+
|  EZproxy   | <-  HAPeer -> |  EZproxy   |
+============+       /\      +============+
                     ||
                     ||
                   Patron


The patron accesses the proxy cluster by the HAPeer name, which is DNS mapped to each of the proxy servers.  EZproxy makes an internal decision to either service the request or to send it to a peer machine, and issues a redirect to the patron's browser.  From that point forward, the request follows the same flow as it would through a stand-alone EZproxy installation.

In our implementation, we put Squid on the local proxy machine with mostly default settings for RHEL/CentOS: no disk caching (i.e. no cache_dir setting) and the defaults for cache_mem (currently 256MB).  The defaults have worked well in our workloads, as our maximum memory usage is around 128MB after the cache is fully primed.
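
For reference, the memory-only caching described above boils down to just a couple of lines in squid.conf; this is a sketch of the idea rather than our exact file:
# memory-only cache: no cache_dir directive at all
cache_mem 256 MB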

The biggest change needed was to enable the sibling support between the Squid instances:
digest_generation on
icp_port 3130
icp_access allow localnet
icp_access deny all
cache_peer ezproxy-01.example.edu sibling   3128  3130  proxy-only
This allows the proxy server to ask its peer:  "Hey, I don't have anything for this URL, do you?" and if the sibling has the content, it can request it from the partner cache, slightly increasing the overall effectiveness of the clustered proxy setup.  In larger setups, multicast support would make sense, but ours is small enough that it was not worth the extra configuration to get that working.

One tool that we use to measure the effectiveness of the cache setup is a reporting program called Calamaris.  This gives us insight into what kind of requests are being made, how many of those requests are serviced locally or from a sibling, and how many of them are pulled from vendor content directly.
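The reports come from feeding Squid's access log through calamaris on the command line; as a minimal sketch (add reporting options to taste, since the available flags vary between versions, and adjust the log path for your system):
calamaris < /var/log/squid/access.log > /tmp/proxy-report.txt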

In a recent sample, the report showed that almost 48% of the requests were found in the cache, while 52% of the requests had to go all the way to the vendor to be serviced.  The sibling setup was consulted for 3.5% of the requests, and those were successfully served from the sibling's copy about 33% of the time.  I suspect that the more diverse the vendor resources, the higher this sibling number may go, though studies suggest that the number of successful sibling requests will probably never exceed 15% overall.

Looking at the report by network usage shows a slightly different story, where 30% of the traffic going through Squid was served locally (and at maximum network speed since in our configuration all objects are stored in RAM), while the remaining 70% of the network traffic was not found in cache and had to go all the way to the vendor.

What were some of the web assets that were successfully cached?
  • CSS files (90% hit ratio)
  • PNG images (90% hit ratio)
  • JavaScript files (89% hit ratio)
  • GIF images (87% hit ratio)
  • ICO images (70% hit ratio)
  • JPEG images (51% hit ratio)
I suspect the lower hit ratio of the JPEG files is due to the fact that those files are more likely to be licensed photography content rather than user interface elements on the web pages, where GIF and PNG files are more commonly used.  The net result of this is that the files that typically block web page layout and rendering are served at the highest speed possible, which makes for a better overall user experience.  This was validated by actual inquiries we received from our members asking "What changed? Everything feels faster now!" after this was rolled out.

What requests typically bypass the caches?
  • JSON requests for AJAX calls (3% hit ratio)
  • PDF files (3.5% hit ratio)
  • Dynamic content (6.5% hit ratio)
All of which falls in line with expectations.  The JSON requests are used for functions like autocomplete of search terms, search limiters on some platforms, pagination of search results, and analytics tracking.  The PDF files are mostly licensed content, and the "Dynamic content" category largely catches searches and search results screens.

Thankfully, we have found that the developers at the vendors make correct use of cache-control headers to keep non-cacheable content from being served from cache, and we have had ZERO cache-related issues reported over the past 2 years that this configuration has been in production use.  I credit this largely to the practice of ISPs implementing transparent cache servers on their home user networks to manage bandwidth usage.
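A quick way to spot-check how a vendor marks a given asset is to inspect its response headers; as a rough example (the URL is a placeholder):
curl -sI https://vendor.example.com/assets/site.css | grep -i -E 'cache-control|expires|etag'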

Prior to implementing caching support on our proxy servers, our network usage in:out ratio was very close to 1:1.  After implementing this architecture, it is not uncommon to see a 1:2 ratio where half of our proxy-to-vendor traffic has been eliminated thanks to a shared caching configuration.  If a version of EZproxy is released that supports compression as well, this ratio may go as high as 1:10 between the combination of compression and caching.