Monday, February 25, 2013

I go, you go, we all go for SPNEGO

While working through a web SSO Kerberos authentication issue (SPNEGO), I tried testing Safari and Chrome as well as Firefox to make sure that what I was running into was not a bug in Firefox.

The experience left a lot to be desired.

To be fair, I have been working with FreeIPA, so Firefox was already mostly configured for SPNEGO, since it already had network.negotiate-auth.delegation-uris and network.negotiate-auth.trusted-uris set for my domain.  But that is about the only trick to getting Firefox to work with SPNEGO, and when I pointed it at another server in the same REALM, it appeared to send the correct authentication negotiation headers to that server as well.
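For anyone starting from a stock Firefox profile, those are the two prefs to set in about:config, or to drop into a user.js in the profile directory.  A minimal sketch, assuming a domain of example.edu:

user_pref("network.negotiate-auth.trusted-uris", ".example.edu");
user_pref("network.negotiate-auth.delegation-uris", ".example.edu");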

Safari has no such settings, since it relies on the Kerberos setup at the OS level.  I have used it with Apache's mod_auth_kerb module on other servers in the same REALM, so I know it basically just works.  For some reason, though, the server was not sending back a 401 authentication challenge, so Safari may simply not be supported by this application.  Que sera sera.
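(Before blaming a browser that leans on the OS for Kerberos, it is worth confirming that the OS actually holds a ticket.  A quick sanity check from a terminal, with the principal and realm as placeholders:

$ kinit user@EXAMPLE.EDU
$ klist

If klist does not show a ticket-granting ticket, no amount of browser fiddling will get SPNEGO working.)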

On to Chrome.

Oh my!  Chrome requires command line arguments to enable SPNEGO support.  There are no preferences in the UI that you can set.  There is no .plist or .ini or any other kind of file you can edit to cleanly enable it in a persistent manner.  You have to type in this abomination of a command line in a terminal window to run Chrome on a Mac with SPNEGO support:

open '/Applications/Google Chrome.app' --args --auth-server-whitelist="<server>" --auth-negotiate-delegate-whitelist="<server>" --auth-schemes="digest,ntlm,negotiate" https://<server>/


I don't object to using a terminal window; in fact I spend most of my time working in one.  But one would think that Google could come up with a more graceful way to handle that.  And that's not the only time I've had to resort to that for Chrome -- certain developer options require command line switches to enable as well, but I can forgive them -- a little -- in that case.
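In the meantime, a small shell function in ~/.bash_profile at least spares the retyping.  This is just a convenience wrapper around the same command shown above, with the server name passed as an argument:

chrome_spnego() {
  open '/Applications/Google Chrome.app' --args \
    --auth-server-whitelist="$1" \
    --auth-negotiate-delegate-whitelist="$1" \
    --auth-schemes="digest,ntlm,negotiate" \
    "https://$1/"
}

Then "chrome_spnego server.example.edu" launches Chrome with the flags in place -- assuming Chrome is not already running, since open only passes the arguments when it actually starts the application.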

(This also implies that it will be a cold day in the Valley before Android tablets will have reasonable SPNEGO support.  You can't exactly pass command line options to browsers on tablets without jumping through hoops.  After I get the desktop browsers sorted out, I'll have to see just how bad the situation is on the tablet front.)

Moral of this story: of the three major browsers for the Mac, Firefox has the most widely supported and least troublesome Kerberos/SPNEGO implementation.

Friday, February 22, 2013

Crontab and percent signs

It's funny how long you can work with a piece of software, and never run into certain features.  Even the most basic software is not immune to this.

I was recently setting up a cron job to do some processing on the previous day's log file:
/path/to/command --logfile=/path/to/logfile-$(date +'%Y%m%d' -d 'yesterday').log
For the non-UNIX-literate readers, the $() construct says to run the command inside the parentheses and substitute its output.  In this case, I wanted the date for yesterday formatted as YYYYMMDD.

It tested and worked just fine from the command line, but when I created a cron job for it, I found this in my inbox the next day:
/bin/sh: -c: line 0: unexpected EOF while looking for matching ``'
/bin/sh: -c: line 1: syntax error: unexpected end of file
The first thing I thought was that I had missed a "'" character somewhere, but I hadn't.  How odd.

What does cron have to say for itself?
Feb 21 02:00:01 server CROND[17834]: (root) CMD (/path/to/command --logfile=/path/to/logfile-$(date +')
Hmm.  Truncated at the first "%" sign -- now why would that happen?  Well, according to the manual page for crontab, the "%" character has special meaning:
Percent-signs (%) in the command, unless escaped with backslash (\), will be changed into newline characters, and all data after the first % will be sent to the command as standard input.
I shudder to think how many years I've been using cron, and have managed to side-step this particular feature.  I guess I've always put date functions like that into scripts, and had cron call the script, so I never had to escape the "%" in the actual crontab before.
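For the record, that wrapper-script approach sidesteps the issue entirely, since the "%" never appears in the crontab.  A sketch, with paths as placeholders:

#!/bin/sh
# /usr/local/bin/process-yesterdays-log -- keeps the % signs out of the crontab
/path/to/command --logfile=/path/to/logfile-$(date +'%Y%m%d' -d 'yesterday').log

and the crontab entry just calls the script:

0 2 * * * /usr/local/bin/process-yesterdays-log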

So now the cron command looks like this:
/path/to/command --logfile=/path/to/logfile-$(date +'\%Y\%m\%d' -d 'yesterday').log
and problem solved.  I had to chuckle to myself, though, because that feature has been around for at least 20 years, and somehow this is the first time I've run into it.

Thursday, February 21, 2013

EZproxy + Squid: Bolting on a caching layer

In an earlier wish list post for native caching support in EZproxy, I stated that the user could easily save 10-20% of their requests to vendor databases if EZproxy natively supported web caching.

I was wrong.

The actual number is closer to double that estimate.

I recently set up a Squid cache confederation upstream from EZproxy, did some testing against Gale and ProQuest databases, and found that the real-world number is a 30-40% savings from adding a caching layer.
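The Squid side of this is not much configuration at all.  A minimal, memory-only sketch with a single sibling peer -- hostnames, ports, and the subnet here are placeholders, not a drop-in config:

# squid.conf (abridged)
http_port 3128
icp_port 3130
cache_mem 256 MB
# no cache_dir -- rely on the in-memory cache only for this test
cache_peer squid2.example.edu sibling 3128 3130 proxy-only
acl localnet src 192.0.2.0/24
http_access allow localnet
http_access deny all

EZproxy then sends its outbound requests through the local Squid instance rather than straight to the vendors.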

This re-validates that the studies done on HTTP caching in the late 90s appear to still hold true today:
  • "Performance Analysis of WWW Cache Proxy Hierarchies."  Journal of the Brazilian Computer Society (J. Braz. Comp. Soc.), vol. 5, n. 2, Campinas, Nov. 1998.  Print version ISSN 0104-6500.  http://dx.doi.org/10.1590/S0104-65001998000300003
  • Alex Rousskov (National Laboratory for Applied Network Research) and Valery Soloviev (Inktomi Corporation).  "A Performance Study of the Squid Proxy on HTTP/1.0."
  • John Dilley, Martin Arlitt, and Stéphane Perret (Internet Systems and Applications Laboratory, HP Laboratories Palo Alto).  "Enhancement and Validation of Squid's Cache Replacement Policy."
It was very interesting that, in my limited testing, my results were largely in line with those studies from over a decade ago:
  • 30-40% cache hit rates with a Squid memory-only cache configuration
  • 5-10% improvement in cache hit ratio by just adding one peer cache
This, despite all of the technology changes that have become commonplace thanks to Web 2.0 and that did not exist back when these studies were originally done.

I opted to not configure disk-based storage for the cache for this test, but I may re-visit that at some point in the future, given that Rousskov and Soloviev were reporting nearly 70% hit ratios in their study.

Disk-based storage for the cache deserves a look, but my initial expectation is that in an academic library search setting, one is unlikely to achieve a hit ratio greater than 40%, simply due to the nature of the web sites being used.  Some of the things that will prevent a higher ratio include:
  • Search term auto completion using AJAX calls
  • The search results themselves
  • Search filtering and refinement
In a general purpose library setting, a proxy may be able to achieve higher ratios as patrons go to the same sets of web sites for news, job postings, social networks, etc.  In an academic setting, though, with patrons executing individual searches, I am not convinced that achieving the higher cache hit ratios is a reasonable expectation.

The working set of cached objects between Gale and ProQuest was approximately 90MB, so it fit comfortably within the 256MB memory cache size that Squid uses by default.  With that workload, the only thing a disk cache could be expected to do is re-populate the in-memory copy when the server is restarted.  The cache is quickly primed after only a few requests anyway, so it is not the same situation as a busy cache that may have gigabytes of data stored on disk.

Another interesting behavior I observed: even though the working set could be held entirely in either cache's memory, over time one of the peer caches would hold a subset of objects until they expired, and then the other cache would pick up the baton, refresh the objects, and serve the newly refreshed copies to the cache cluster.  Wash, rinse, repeat, and you start to see a pendulum pattern as the fresh content moves between the cache peers, with ICP requests being satisfied by the peer before doing the long haul to the origin server.

Even a 30-40% cache hit rate is nothing to downplay, though.  That is a significant bandwidth (and to a certain extent time) savings, and given that EZproxy does not support HTTP compression, this may be the best that can be hoped for in the short term.

Wednesday, February 20, 2013

EZproxy Wish List: HTTP Compression Support

While looking at ways to make our EZproxy servers more efficient, I re-discovered something that I already knew, but had been ignoring:

EZproxy strips out the Accept-Encoding header from requests, and requests uncompressed content from the upstream servers and sends uncompressed content to the downstream clients.

One might think that simply adding

HTTPHeader Accept-Encoding

to the proxy configuration would be enough to handle this, and it does fix part of the problem.  This allows the browser's Accept-Encoding header to be passed through to the upstream server, but it is not a complete solution (and can break in certain corner cases):

Client => EZproxy

GET / HTTP/1.1
Host: www.example.com
Accept-Encoding: gzip,deflate,sdch

EZproxy => Server

GET / HTTP/1.1
Host: www.example.com
Accept-Encoding: gzip,deflate,sdch

Server => EZproxy

HTTP/1.1 200 OK
Content-Encoding: gzip
Content-Length: 6202

EZproxy => Client

HTTP/1.1 200 OK
Content-Encoding: none

When EZproxy receives the reply from the upstream server, it decompresses the content so that it can rewrite the content as necessary to keep users from breaking out of the proxy.  The missing step is that EZproxy does not then re-compress the content before sending it back to the user's browser.

Just how big of a deal is this?  Well, on just that one request, the uncompressed content was 26.5KiB vs. 6KiB, so the proxy transferred 4.4 times as much data from the server and to the client.  For fun, ask your IT department what they would do with ~75% more bandwidth...
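If you want to put numbers on this for any given resource, curl makes the comparison easy.  The second request only stays compressed if the origin server honors gzip, and the URL here is just a stand-in:

$ curl -s http://www.example.com/ | wc -c
$ curl -s -H 'Accept-Encoding: gzip' http://www.example.com/ | wc -c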

So why not just add the HTTPHeader line globally, and at least benefit from the Server => EZproxy compression?  Well, some vendors have tried to be smart and dynamically compress or minify JavaScript on the fly, depending on the client browser's capabilities.  In the cited example, the minify handling was broken, and served out corrupted JavaScript files.

It is not a stretch to think that there may be other issues lurking out there when the server is told that the client can handle something that it will not be given.  Look closely at that Accept-Encoding line from Chrome.  Notice "sdch"?  Yeah, I had to look it up too:  Shared Dictionary Compression over HTTP.  There are a few posts that give an overview of what SDCH is about, but in short, it's a technique for sending a delta between a web page that you have and the web page that the server is getting ready to send.  Think of it like a diff function for HTTP content.

Now, what if the upstream Server supports SDCH and sends back a reply that EZproxy has no idea how to cope with properly?  You're going to get sporadic reports of problems, and it may take a while to narrow down that it's isolated to Chrome users, and maybe even longer to figure out it's SDCH at play.

That's just one example of how blindly passing through Accept-Encoding can go wrong, so I'm not opposed to EZproxy manipulating that header.  All of the mainstream browsers handle gzip encoding, and it's easy enough to support.

There is no good reason that I can think of that EZproxy could not simply filter the Accept-Encoding header down to gzip (and maybe deflate), then decompress the server reply on the fly, apply any content changes needed to keep users on the proxy, re-compress the content, and send it on to the client.  Once upon a time, someone might have piped up "CPU cycles!", but that argument is pretty much dead these days thanks to Moore's Law.
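Conceptually, the processing chain being asked for is nothing exotic.  At the shell level it amounts to something like the following -- a sketch with made-up hostnames to illustrate the idea, not a claim about how EZproxy is implemented internally:

$ curl -s -H 'Accept-Encoding: gzip' http://vendor.example.com/page.html \
    | gunzip \
    | sed 's|http://vendor\.example\.com|http://vendor.example.com.ezproxy.example.edu|g' \
    | gzip > page.rewritten.gz

Fetch compressed, expand, rewrite the links to keep users on the proxy, re-compress, and send it on.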

With compression support, an 80% decrease in the size of non-graphics content (HTML, JavaScript, CSS, JSON, XML, etc.) is not an unreasonable expectation.  Add in caching support to handle the graphics, and EZproxy could be significantly more bandwidth friendly.

Tuesday, February 19, 2013

Using IPA with EZproxy: Where the Wildcards Aren't

Fedora's FreeIPA project is an interesting piece of software that has pulled together several pieces of open source software and given them a point-and-click interface as something of an answer to Active Directory.

The major pieces are the BIND DNS server from ISC, the 389 Directory Server (which started its life as the Netscape Directory Server so many moons ago), the DogTag Certificate Server (also of Netscape lineage), and MIT Kerberos.  Each of those packages can be daunting to set up on its own, but the IPA project has done an admirable job of integrating them and making their setup and use simple.

I have been tracking the evolution of IPA for some time now and have finally decided to take the plunge.  So far things have been fairly smooth, with one exception:

IPA does not currently support wildcard DNS.

For a lot of people this would not matter, but when combined with EZproxy in a proxy-by-hostname configuration, it becomes a major problem, since wildcard DNS is key to making that configuration work.

The root of the problem is that the software that ties the DNS server to the LDAP storage engine -- the creatively named bind-dyndb-ldap -- does not support wildcard DNS entries yet.  Like others, when the UI did not allow me to create the wildcard entries, I opened up my favorite LDAP editor (Apache Directory Studio) and created an entry manually.  Alas, it was not a simple UI issue, but rather a lack of support in the back-end software.  There are bugs filed against both IPA (3148) and bind-dyndb-ldap (95) to track the issue.

Until that is addressed, sites that adopt IPA and use EZproxy need a work-around for this issue.  All both of us.

Now, in a traditional BIND setup, you could of course use wildcard entries directly, or you could just point an NS record at the EZproxy server as documented by OCLC on the DNS configuration page, enabling the DNS functionality built into EZproxy:

ezproxy.example.edu IN A 192.0.2.1
ezproxy.example.edu IN NS ezproxy.example.edu.

Unfortunately, adding that NS record does not seem to work in IPA.  I have not yet taken the time to peel back the layers of the onion to figure out exactly why it does not work and where it fails, but adding an NS record to the EZproxy host entry in IPA caused lookups for that name to fail completely.

I tried a few other approaches to get this to work in IPA -- setting up a dummy zone for the proxy server and tweaking the zone forwarder settings, putting the host and the service names in different zones -- to no avail.  Some approaches looked more promising than others, but none ultimately worked.

Clearly I was not going to be able to address this within the 2.2.0 release of IPA, and would need to go outside the system until wildcards are natively supported.

My first instinct was to just set up a traditional BIND zone file for each proxy server.  This certainly worked, but required both a named.conf entry and a separate file for each proxy server "zone".  I wanted a solution that would involve less configuration litter to clean up later.

What I finally settled on was setting up simple static-stub zones in BIND with forwarders set to EZproxy:

zone "ezproxy.example.edu" IN {
  type static-stub;
  server-named { "ezproxy.example.edu"; };
  forwarders { 192.0.2.1; };
};
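Once the static-stub zone is in place, it is easy to confirm that queries for arbitrary names under the proxy hostname are being answered by EZproxy's built-in DNS (the name and address here match the example above):

$ dig +short somedb.ezproxy.example.edu @localhost

If everything is wired up correctly, the answer should be the proxy's own 192.0.2.1 no matter what hostname prefix is queried.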

It feels a little dirty having to resort to that, and I'm reminded of the scene from Star Trek: First Contact where Dr. Crusher mutters "I swore I'd never use one of these things" as she activates the Emergency Medical Hologram to create a diversion as she escapes the Borg, but it does work and will buy time until the bind-dyndb-ldap developers can figure out how they want to support wildcard DNS entries.

Monday, February 18, 2013

Communication is a lost art

I recently reported an issue to one of my vendors regarding one of their web sites that they use as a vanity entry point to their service platform.

The initial report was:
When sending users to <website> via an HTML form with a GET method, WebKit browsers (Chrome, Safari, and some Android browsers) append a "?" character (see WebKit bug 30103 (https://bugs.webkit.org/show_bug.cgi?id=30103) and Chrome bugs 108690 and 121380 (http://code.google.com/p/chromium/issues/detail?id=108690, http://code.google.com/p/chromium/issues/detail?id=121380)).
This causes the browser to access "http://<website>/?", which redirects to "http://<vendor website>/?".  Note the trailing "?" on the <vendor> URL.  The "?" is preserved and appended to the password field, rendering the URL invalid and presenting the user with a login screen.
Could you please update the redirect handling on <website> to not preserve the "?" character that WebKit is sending?  Side note: any value sent after <website> is also preserved, triggering the same behavior.  E.g.: <website>/foo redirects to <vendor website>/foo; while one can take the stance "don't do that", it would be a better user experience to not preserve any path data if it is going to cause errors like this.
I reported the issue, the cause of the issue, the symptoms, the URLs involved, and a resolution path.
Could you provide the screenshots of the issue you reported until  the "?" is preserved, and appended to the password field, rending the URL invalid, and presenting the user with a login screen. This will help in forwarding this issue to the concerned department for further investigation. 
Screenshots?

Really?

To address an issue with how a vanity entry web site mishandles any extra data passed in the URL you want screenshots?

Really?!?

OK, fine, I'll take some screen captures of exactly what I stated and send them.


[Screenshot: The website URL]

[Screenshot: The vendor URL]
See?  Start at http://<website>/? and you get redirected to http://<vendorsite>/login..../?

The "?" abides.
Thank you for your email and also for the screenshots. I have forwarded this issue to the appropriate department for further investigation. I will contact you as soon as there is any information regarding this.
I have run the gauntlet; there is hope that this issue will be fixed!
I received an update from the concerned department regarding the issue with  sending users to <website> via HTML form with a GET method, WebKit browsers (Chrome, Safari, and some Android browsers) append a "?" character. The concerned department has requested for the exact sequence of steps to duplicate this issue. Could you provide the same.
Umm.  HTML form....GET method...this is not looking good.
Load this basic HTML form in a Chrome browser:
<form method="get" action="<website>" >
<input type="submit"/>
</form>
Click the submit button.  Chrome will append a "?" character to the URL, and the <vendor> login error page will be generated. 
I can appreciate the need for a good test case, but this one seemed pretty straightforward...
Thank you for your email and for the additional information too. I have forwarded this to the appropriate department for further review. I will keep you updated as I receive any information in this direction.
My enthusiasm has been diminished, but we'll see if that's the missing part the vendor needed to resolve this.
I received the following update from the concerned department regarding " WebKit browsers (Chrome, Safari, and some Android browsers) appending a "?" character. The update is that 'this appears to be an issue with webkit itself and will have to be fixed by google or apple in the webkit engine. Unfortunately we do not think there is anything we can do on the <vendor> side since this is not specific to <vendor service>. This would happen with any URL.
Sigh. I am not asking them to fix WebKit; I am asking them to fix the way their vanity entry website handles redirecting users into their service platform website to address a very specific browser issue.

I can't help but think of old school burlesque/vaudeville comedy routines (Who's on First, etc.), and the memorable scene from Pulp Fiction between Jules Winnfield (Samuel L. Jackson) and Brett (Frank Whaley).

Apparently the "concerned department" does not grasp the concept of the Robustness Principle.  Funny thing is, the other vanity entry web sites for this vendor work just fine; it's only this one entry point that is broken.

Get the popcorn, kids! This one could drag out for a while.

Friday, February 15, 2013

Collector's Cards: rpm2cpio

Once upon a time, at a job far, far away, we used to refer to bugs as "collector's cards".  Here's an example of why...

It all started innocently enough.  I wanted to crack open an RPM to inspect its contents without actually installing it on a system.  The way I normally do this is with rpm2cpio:
rpm2cpio <RPM> | cpio -id
This takes the RPM payload -- which is in CPIO format -- and dumps it to standard output for the cpio utility to extract.  Then you can go spelunking through the extracted files to see whatever you might be looking for.  (This is also a great rabbit to have in your sysadmin hat for recovering from any number of systems failure scenarios, BTW.)

This simple command normally works great.  That is, until I tried it on a CentOS 6 RPM on the CentOS 5 system that still manages our internal mirrored content.

When I tried it this time, I consistently got:
cpio: premature end of archive
It didn't matter if I was working on the streamed output (thinking a read error may have caused a failure that was silently eaten by the act of streaming the output into a pipe) or on a file that I piped the output to.  The rpm2cpio extraction seemed to run fine; it's just that what was supposed to be cpio content was not decipherable:

$ file output.cpio
output.cpio: data

Taking advantage of a bit of knowledge of the mechanisms behind RPM's payload handling, I deduced that the archive was compressed by something that rpm2cpio was not handling correctly.  I tried the usual suspects -- gzip, bzip2, uncompress, zip -- with no success.  The file was not identifiable by the file command, either, but had this in its header:

$ od -a output.cpio | head -1
0000000 } 7 z X Z nul nul nl a { ff ! stx nul ! soh
Hmmm.  "7z" "XZ".  I've heard of the 7-zip compression algorithm, and I remember something about "xz" compression being more rsync friendly, and talk of RPM adopting that compression format to make Fedora content more efficient to mirror.

That got me on the right path, and sure enough, there is a bug (602423) in Red Hat's bugzilla on this very issue, along with a pointer to the unxz command that I had not had a need to use before:
$ cat output.cpio | unxz > output
$ file output
output: ASCII cpio archive (SVR4 with no CRC)
Ahh, there we are, finally the output I was after.
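Putting it all together, the workaround on the older box is just one extra stage in the pipeline, assuming -- as in my case -- that rpm2cpio passes the xz-compressed payload through untouched:

$ rpm2cpio package.rpm | unxz | cpio -id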

So there are multiple failures at play here:
  1. The file command does not understand how to identify the data compressed with the xz format.
  2. The rpm2cpio command only understands how to handle gzip and bzip2 compressed content.
Both of these are understandable for newly developed code; the ecosystem needs time to catch up with new development work.  The file lag is even more understandable since it is a separate package altogether, and the database that it works from needs to be updated.

The reason this is a "collector's card":  This issue was first reported in the middle of 2010.  It is now 2013, almost 2 1/2 years later.  Support for xz compressed payloads in RPM was added during the Fedora 12 release cycle, which is what served as the basis for RHEL 6.  You're honestly telling me that at no point in the past 2 1/2 years could Red Hat have released an updated version of RPM for RHEL 5 that understood xz compressed payloads?

Here is my prediction of how this bug is going to play out:

This bug will be ignored until RHEL 5 reaches the point in its Production lifecycle at which updates of this kind will no longer be shipped.

If this is deemed an "enhancement" rather than a "bug fix", then that milestone already passed on Jan. 8, 2013.  I highly doubt this will be classified as an "Urgent Priority Bug Fix" worthy of an errata, so the window has likely already closed.

Why does this rub me the wrong way?  Mainly because this has become the modus operandi for how far too many RHEL bugs are "resolved":  Let them fester in bugzilla for a few years, until the time window for dealing with them has passed, and then close them as "too late to fix it now".

"But you can't really expect Red Hat to ship support for new features on old systems!" you say.  

This is an interesting point to address.  Red Hat did change RPM mid-release several years ago, during RHL 6 (no "E") when they updated from RPM 3 to RPM 4.  This created all kinds of challenges when building software during the second half of that product's lifetime.  You had to update any newly installed systems to the RPM 4 binaries before you could install any custom-built software from your own repositories.  I think even released errata had this issue -- you had to update to RPM 4 before you could fully update the system.  Nasty stuff!

This is not quite the same situation, as it is an update that accommodates a new payload compression format, rather than a new RPM header structure.  But is it really that unreasonable to ask that RHEL N-1 be able to understand RHEL N's RPM package format?  Or that support tools like rpm2cpio be able to, if for no other justification than it makes mirror management easier and keeps system recovery options open.