Thursday, January 31, 2013

EZproxy wish list: Don't write to the license file

EZproxy is a licensed product, and has a configuration file, ezproxy.key, that holds the license key.

In most other products, only administrators touch the license key file; if the software writes it at all, it does so only in response to a human action, such as entering or updating the license.

EZproxy takes a different approach.  It reads the file, validates the key (which appears to be a local validation rather than a remote call to OCLC servers), writes the file back out with a new timestamp value, re-reads the license file, and continues the startup processing.

This happens each and every time the server is started or restarted as part of its initialization.

Here's where the problem comes in.  Before this change (which happened 3 or so years ago, I forget exactly which version introduced this "feature"), I used to be able to make this file read-only, and not writable by the ezproxy RunAS user.  (You are using RunAS, right?)  After this change, I had to make the file not just readable, but writable by the RunAS user.

Sorry, but this is a BROKEN DESIGN.

I'm sure there are other pieces of software that behave this badly, but I am hard pressed to name any.  Perhaps it's the advanced repression techniques kicking in.

Look at almost any other piece of software and you'll find the same pervasive concept: the administrative user owns the files, and the non-privileged user just reads the configuration and runs the software.

Why do I feel so strongly about this?
  1. This leads to service outages.
  2. This negates part of the benefits of RunAS.
  3. This can introduce unintended consequences.
Let's explore each of these:

1) Service Outages.

Be honest: EZproxy is such a low-maintenance piece of software that it is very easy to set it up and forget about it until there is a problem.  Sometimes you can automate your way out of those problems, but the truth is that the squeaky wheel gets the grease, and EZproxy generally doesn't squeak.

One of the scenarios leading to a service outage is a disk full situation due to log files.  Even with filtering, rotation, and compression, given enough time, disks will fill up, especially on a busy proxy server.  Even with disk space monitoring, you may not appreciate the seriousness of the alert until it's too late, or you might *gasp* be on vacation when it happens.

In a normal scenario, when the disk fills up, EZproxy will happily keep running.   You just lose your ability to record log data.   Not optimal, but not catastrophic either.

That is, until you restart the software.

What happens?  EZproxy reads the license file, validates the license, writes the license file ... oops, no disk space ... *BOOM* bye-bye proxy:
open("ezproxy.key", O_RDONLY) = 5
...
open("ezproxy.key", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 5
See that O_TRUNC flag in the open() function call?

       O_TRUNC
              If the file already exists and is a regular file and the open mode allows writing
              (i.e., is O_RDWR or O_WRONLY) it will be truncated to length 0.  If the file is a
              FIFO or terminal device file, the O_TRUNC flag is ignored.  Otherwise the effect of
              O_TRUNC is unspecified.

The file is truncated as a result.  With no disk space left to re-write it, the subsequent re-read finds an empty license file, the server is now unlicensed, and it will not start.
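For contrast, here is roughly what a safer write path looks like: write to a temporary file in the same directory and only rename it over the real file once the write has succeeded.  The rename is atomic on POSIX filesystems, so a full disk leaves the old file untouched.  (This is a minimal Python sketch of the general pattern, not anything EZproxy actually does; the function and filenames are illustrative.)

import os
import tempfile

def write_file_safely(path, data):
    """Replace path with data without ever truncating the existing file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # fails here if the disk is full ...
        os.rename(tmp_path, path)    # ... and the original file survives intact
    except OSError:
        os.unlink(tmp_path)          # clean up the partial temp file
        raise

Better still, of course, would be not rewriting the license file at startup at all.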

Reason #1 why having the software re-write its license file on the fly is a BAD IDEA.

2) Weakening the security model of RunAS

Running software as a non-administrative user is a very good thing.  Running as a unique user, separate from any other system tasks, is a very good thing.  Partitioning run-time processing from configure-time processing is a very good thing.

Except that by writing to the license file, EZproxy breaks the partition between run time and configure time.  Look at the model that most other software uses (a minimal sketch follows the list):

  • The root (administrative) user starts the daemon process
  • The process opens any ports that require root permissions
  • The process opens/reads any files that require root permissions
  • (Some software will chroot() to an empty directory to raise the "you must be this tall" bar for compromising the system at this stage.)
  • The process drops root permissions and runs as (RunAS, get it?) a non-administrative user
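Here is that sequence as a minimal sketch, using Python for brevity (the port, account name, and chroot directory are placeholders, and a real daemon would add more error handling).  The point is the ordering: do the privileged work first, then give up root for good.

import os
import pwd
import socket

def start_and_drop(run_as="ezproxy", chroot_dir="/var/empty"):
    # 1. Do the privileged work while still root: bind the low port and
    #    read any root-only configuration files.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("", 80))
    listener.listen(128)

    # Look up the target account before chroot() hides /etc/passwd.
    pw = pwd.getpwnam(run_as)

    # 2. Optionally chroot() to an empty directory to raise the bar.
    os.chroot(chroot_dir)
    os.chdir("/")

    # 3. Drop privileges: supplementary groups, then group, then user.
    #    Once setuid() succeeds, there is no way back to root.
    os.setgroups([])
    os.setgid(pw.pw_gid)
    os.setuid(pw.pw_uid)

    return listener

After that last setuid() call, the process simply cannot write to a root-owned, read-only ezproxy.key -- which is exactly the property the old behavior gave us.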

By making the license key file writable by the RunAS user, a security weakness is introduced: an attacker who finds a way into the RunAS user account can set up a denial-of-service attack via the license file by deleting it, corrupting it, filling up the disk space that holds the file (and there are several nasty ways to do that under the radar), etc.


Reason #2 why having the software re-write its license file on the fly is a BAD IDEA.

This also sets up the next issue....

3) Introducing unintended consequences

There are probably as many different ways to manage an EZproxy server as there are EZproxy servers.

Some of these may involve giving out access to the RunAS user for various reasons.  Your site might have administrators install the software, then hand it over to an electronic services librarian who configures and maintains it.  Or manages user authentication files.  Or updates the database definitions.  Or maintains the files in the public/loggedin/limited directories.  The point is that it is not hard to imagine a scenario where users share access to the RunAS user account, or are put into the same group as EZproxy and have write access to the license file, either intentionally or by oversight.

Now combine this with the fact that the license file has to be writable by the RunAS user, and the overall system is less secure.  On the innocent side, users make mistakes and accidents happen.  Ever do a "rm -rf . /*"?  You'll (hopefully) only do that once, and learn a painful enough lesson that you won't ever do it again.

On the nefarious side, ever have a staff member leave under less than optimal circumstances?  One simple change to the license file, and your proxy is now a ticking logic bomb.

Either way, an action that is normally benign -- a proxy software or server restart -- will now turn into a major problem.  How long will it take you to figure out what the problem is, find the license code, and fix it?  Murphy says this will happen after support hours before an extended holiday, all of your backup tapes were stored "on top of the new cabinet" (which turns out to be a transformer), and the only person who knows the license code will be on a pilgrimage to Motuo County.


Reason #3 why having the software re-write its license file on the fly is a BAD IDEA.

In short, all of these real and potential problems are introduced just so the server can log this message:
2013-01-31 09:20:15 Thank you for your purchase of this licensed copy of EZproxy.  EZproxy was last able to validate the license on 2013-01-31 09:20:15.
Is that feel-good message really worth it?  Can we please drop the useless timestamp in that message, go back to just validating the license, and leave the license file alone?

Wednesday, January 30, 2013

EZproxy wish list: SSL enhancements

This is another easy one, like the 64-bit binaries.

NIST 800-57 stated that 1024-bit SSL keys should only be used through 2008 (so that a 2-year key expiring in 2010 would still be considered secure), yet EZproxy still allows generating 512- and 1024-bit SSL certificates.

I don't know of any CAs that will sign keys shorter than 2048 bits these days, and anything less is a false sense of security given modern computing speeds.  Removing the lower-bit options would help users avoid generating a CSR that is going to be rejected by the CA, saving time and aggravation.

The 512 bit key is a thing of the past.  Even Microsoft has effectively patched them out of existence.

Now, I personally have no problem with letting someone shoot themselves in the foot, so I'm OK with OCLC leaving the 1024 bit options there if someone just wants a self-signed SSL certificate for SSL's sake, but go ahead and give us options for 3072 and 4096 bits as well.

Why 3072 and 4096 bits?  Well, if you're running your own CA, it is not unheard of for it to generate SSL certificates with a lifetime of 10 years or more.  The 3072-bit keys are projected to be secure until sometime around 2030.  The 4096-bit keys will be secure for quite a bit longer.  If your internal CA is generating 15-year keys, you should already be planning the move up to 4096 bits.

And some of us just like big bits, and I cannot lie.

Oh, and while we're on the topic of SSL, it would be a good thing for OCLC to make a pass through section 4.2 of NIST 800-57 and enable/disable cipher suites as appropriate.  After all, if we're going to use SSL, we should at least be getting the best bang for the buck from it.

Bonus points if a directive akin to Apache's SSLCipherSuite is added to tune the SSL ciphers in play.  There is always the possibility that a weakness will be found in a cipher, and being able to disable ciphers individually would be a valuable ability to have.
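In the meantime, it is worth at least knowing what your proxy negotiates today.  Here is a quick spot check from any machine with a reasonably recent Python 3 (the hostname is a placeholder, and this only reports the single suite your client and the server agree on, so it is a sanity check rather than a full scan):

import socket
import ssl

host = "ezproxy.library.example.edu"

context = ssl.create_default_context()
context.check_hostname = False        # we are inspecting, not trusting
context.verify_mode = ssl.CERT_NONE

with socket.create_connection((host, 443)) as sock:
    with context.wrap_socket(sock, server_hostname=host) as tls:
        print("Protocol:", tls.version())
        print("Cipher:  ", tls.cipher())   # (name, protocol version, secret bits)

A full audit means repeating the handshake with individual suites disabled, which is exactly why a server-side SSLCipherSuite-style directive is the right place to solve this.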

Tuesday, January 29, 2013

EZproxy wish list: Collaborative stanza maintenance

EZproxy is configured via little chunks of configuration snippets commonly called "stanzas".  OCLC maintains a large list of these stanzas, the EZproxy Wiki has a few more stanzas, and better vendors will generally not look at you like you're from Mars if you ask for an EZproxy stanza for their service.

Sometimes the stanzas even match.

There are regularly requests on the EZproxy mailing list for updated stanzas for vendors, comments on how a vendor stopped using a particular version of a stanza X years ago, etc.

I've been working with one vendor for a while now to get a fix into their stanza so that I can tell OCLC to update the ones on their web site.  When another vendor added a new service, I had to tweak their stanza to allow linking to that service directly in order to interact with it in a specific way.  I don't think anyone else in the world has this tweak in their stanza.

There needs to be a better way.

While thinking about this, it occurs to me that one of the key things that is missing is a feedback cycle.  There is plenty of collaboration on the EZproxy mailing list, but I have yet to see a vendor pay attention to that list.  OCLC's hands are tied, because they don't know the services the way the vendor is supposed to.

In the spirit of The Cathedral and the Bazaar, I offer this idea for a way forward.

First let's define the problem:
  • OCLC publishes the stanzas on their web site, but they defer to the vendors for the content.  
  • Vendors (mostly) give out stanzas, but changing one seems to be a Herculean effort.  
  • Wiki pages are collaborative, but may not be the best tool for the job.
  • All of this requires human interaction to copy the stanza into config.txt, and it relies on the proxy maintainer to keep it current.
  • There is no mechanism to notify proxy maintainers when stanza changes are made.  OCLC puts the date the stanza was last updated on their web site, but you still have to be actively looking to see it.
Now let's make some requirements:
  • There needs to be a collaborative environment
  • There needs to be an authoritative source
  • There needs to be a better way to find out when stanzas are updated
  • There should be a way to create and use your own versions of the stanzas
  • There should be a way to automatically update stanzas
If you take a step back, these requirements are a lot like what a software developer would want, too.  Except stanzas would be source code.  And "create and use your own versions" would be called forking or branching.

So here's the radical idea:

Create a source code repository (like GitHub) to hold the EZproxy stanza files.  This satisfies the collaboration requirement.  

Have the owner of this master repository be OCLC.  There's your authoritative source.

Use the SCM features (activity streams, RSS feeds, commit mails, etc) to satisfy the update notification requirement.

Now, here's where it gets interesting...

Extend the EZproxy admin interface to be able to point to a Git repository, and have EZproxy pull a copy of the repository locally.  This could be any Git repository, either the official OCLC repository or one that you forked with your own stanza versions.  It could be hosted at GitHub, or running locally.

Once the repository is downloaded, present the EZproxy admin with the services maintained within that repository to enable/disable.  This means that there is no reason not to have a canonical source with stanzas for everything, since you do not have to use them all, only the ones you want.

And finally, have EZproxy run a periodic update from the repository either manually or automatically.  If you're using the OCLC authoritative repository, you could receive updates as they are released, or just update between terms.  If you have forked the OCLC repository, you can pull updates from the master repository into yours to stay current.
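None of this exists today, but nothing stops an administrator from approximating that last step with a cron job.  Here is a rough Python sketch, assuming a hypothetical checkout with one stanza file per vendor under stanzas/ and a local enabled.txt listing the stanzas you want (the paths and repository layout are invented for illustration):

import os
import subprocess

REPO = "/usr/local/ezproxy/stanza-repo"       # your clone or fork
ENABLED = "/usr/local/ezproxy/enabled.txt"    # one stanza name per line
OUTPUT = "/usr/local/ezproxy/stanzas.txt"     # pulled in from config.txt

# Pull the latest stanza updates (run by hand, or from cron between terms).
subprocess.check_call(["git", "pull", "--ff-only"], cwd=REPO)

with open(ENABLED) as f:
    wanted = [line.strip() for line in f
              if line.strip() and not line.startswith("#")]

with open(OUTPUT, "w") as out:
    for name in wanted:
        out.write("# --- %s ---\n" % name)
        with open(os.path.join(REPO, "stanzas", name + ".txt")) as stanza:
            out.write(stanza.read())
        out.write("\n")

Point config.txt at the generated file with an IncludeFile directive, restart EZproxy, and you have crude but workable automatic updates; what I am proposing is simply that EZproxy do this work itself.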

Now you have all the moving parts for collaborative stanza maintenance:
  • OCLC establishes a master repository and imports the existing stanzas
  • Vendors fork the repository and maintain their own stanzas, allowing OCLC to pull their changes back to the master
  • EZproxy administrators subscribe to the repository of their choice (defaulting to OCLC, of course)
  • EZproxy administrators can choose to fork either OCLC or vendor repositories for their own use, and suggest changes back
  • EZproxy administrators can subscribe to other administrators' repositories
  • Changes can be pulled between repositories, maintaining version history
So there you have it.  A way to embrace chaos while bringing a modicum of order.  A way to publish the stanzas so that they can be maintained, updated, published, and imported in a sane way.  A way to enable EZproxy administrators to work together in an environment that is more geared toward collaboration than the existing channels.

Monday, January 28, 2013

EZproxy wish list: 64-bit binaries

This one is an easy one: I'd like to see a 64-bit build of EZproxy.

Why?  So I don't have to have 32-bit and 64-bit runtimes on my servers.  It is not uncommon to have bugs and security issues that show up when running 32-bit software on a 64-bit system.

Having a 64-bit build of EZproxy would reduce the size of my installed virtual machine disk images, wipe out a whole category of operational issues that arise from file-size limits in 32-bit binaries (log files > 2GB, 4GB file downloads, etc.), and close off any possibility of 32-bit-only security issues.

Now, about those 64-bit dynamic binaries....

Friday, January 25, 2013

EZproxy wish list: dynamically linked binary

In order to make EZproxy easier to support, it has historically been shipped as a statically linked binary.

This means that there are far fewer support issues caused by the libraries installed on the server that EZproxy is being run on.  These can be caused by differences between Linux distributions, new releases of a Linux build with different library versions, 32-bit vs 64-bit versions, etc.

This reduced support load comes at a cost, however.  As end users, we are at the mercy of OCLC to release updates for bugs/issues in underlying support libraries.

With a dynamically linked version of EZproxy, the software would use the library files on the server, rather than the ones built in when the software was compiled for release.

For example, any issue in the OpenSSL libraries may go unresolved for months between EZproxy releases, while the OS vendor may have an updated library released within days.

For reference, other third-party libraries that EZproxy builds into the static binary include the GNU C Library, MaxMind's GeoIP, OpenLDAP, an XMLSec library, an XPath/XSLT library, what looks like an HTML parsing library, most likely a SOAP library, a SAX parser, a Kerberos library, and maybe a few more that I missed.

That list may be daunting, but in reality I suspect that all of those libraries are available in any modern Linux distribution out of the box.  And with a properly packaged software build, OCLC could declare them as dependencies, and all of them would be installed automatically by the system's package management tools.

As a side note, even if they do not release a dynamically linked binary, OCLC should at least change the way they communicate releases:

  • Be more open about what libraries and versions of libraries a given release incorporates
  • When a supporting library is updated, note any CVE references fixed in the newer library version
  • Differentiate between when a security fix was due to a supporting library vs. when a fix was in EZproxy itself.


Thursday, January 24, 2013

EZproxy wish list: SELinux support

SELinux is the software that I love to hate:  I love it when it works, but I hate it when I have to beat it into submission.

And I wouldn't have it any other way.

SELinux compartmentalizes each piece of software and defines a "software firewall" -- if you will -- for what a given program can do, what files it can read, what directories it can write to, what level of network access it has, etc.

Where many people get frustrated with SELinux is when they stray afield of where the OS vendor's policies expected them to go.  Running a web server on port 80 is normal.  Running one on port 6000, not so much.


For EZproxy, there was no existing policy, so I had to write my own from scratch.  I even found some interesting things along the way about EZproxy's memory handling, and had to put special exceptions into the policy to accommodate them.  I don't know if OCLC fixed the special-exception case or not, but I gave them a heads-up about it; I suppose I should remove the exception and see if the server still blows up spectacularly without it.

Basic support was fairly direct, but with each EZproxy upgrade I have to revalidate the policies to ensure that they are still working.  

Occasionally I have to tweak things due to OS vendor changes as well.  Those generally only show up when I build a new policy; so far the existing policies have continued to work even when I can't build a new one because I did not update for the new vendor changes.

Lately I've been doing more advanced work, allowing librarians to manage their content via FTP into the EZproxy documentation directories.

It is worth it, though, because I know that should someone find a remote compromise for EZproxy, the damage they can do as the ezproxy user is limited not only by the system's file permissions, but also by their SELinux context.

For OCLC to be successful with adding a SELinux policy to EZproxy, though, they need to move away from the statically linked binary installer that you can install anywhere.  They need to produce packages for each supported Linux variant so that the files will be installed into known places that the SELinux policies can reference.

Wednesday, January 23, 2013

EZproxy wish list: caching

This is the second in a series of thoughts on how EZproxy could be made better.  For a background, please read my original post.

For all of EZproxy's features, one glaring omission is a caching option.

For smaller sites, this probably would not mean much.  Their patrons may be on-site at the library anyway, or geographically close to the library, so they are probably using the same ISP as the library.  In this case, how much benefit a caching solution might offer is debatable.

But for larger sites, and sites that support a geographically diverse user population, caching can be a significant performance enhancement.

Here's a very simple example of why:
A user sitting in California needs to access a resource in Denver from a proxy in Virginia.
In this scenario, the physical path that the packets are likely to take through wires in the ground very nearly matches the mental line that you might draw between these three locations.  You'll have to very nearly cross the continent twice to get the request from the desktop to the server, and then turn around and do it again for the reply from the server to the desktop.

So what's the big deal?

One big problem: users are impatient.  Each trip across the continent is going to cost precious milliseconds and give the perception (warranted or not) of slowness.  Waiting for every single search, every search limiter, every single page render, every single article retrieval is going to make users irritated.  The more they wait, the more frustrated they become, especially if they are new to research and do not have much patience to start with.

Another problem: Someone pays for the bandwidth.  Somewhere in your IT department's budget is a line item or three for your internet access.  The more capacity (bandwidth) you have to access the internet, the more it costs.  Web caching has been used for almost two decades now to address this.  Businesses use web caches for performance, cost savings, and security; countries use them for performance and in some cases censorship; chances are the ISP you are using right now uses a transparent web cache to save on their own bandwidth costs.

If EZproxy were to support a web caching feature, the user in California would only need to contact the proxy server in Virginia for 80-90% of their web requests.  Vendors re-use many of the same CSS stylesheets, graphics, and JavaScript files on their web pages, so the only content that is truly unique is the search results and the article retrievals.

Depending on the vendor, even the articles might be cacheable, so if you were teaching a lab, only the first student to retrieve the article would have to wait for the full round trip, while the second and subsequent students may get the article from the cache.
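You can get a feel for how cacheable a particular vendor is just by looking at the response headers they send back.  A quick Python sketch (the URLs are placeholders; substitute real asset and article URLs from the vendors you care about):

from urllib.request import Request, urlopen

# A stylesheet, a script, and an article from the vendor in question.
urls = [
    "http://vendor.example.com/css/site.css",
    "http://vendor.example.com/js/search.js",
    "http://vendor.example.com/article/12345.pdf",
]

for url in urls:
    with urlopen(Request(url, method="HEAD")) as response:
        print(url)
        for header in ("Cache-Control", "Expires", "ETag", "Last-Modified"):
            print("  %s: %s" % (header, response.headers.get(header, "-")))

Long lifetimes on the static assets are where that 80-90% figure comes from; whether the articles themselves are marked cacheable is where the vendor-to-vendor variation shows up.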

If you know that you have a caching server on your campus, or at your ISP, you could set up EZproxy to participate in a web cache hierarchy.  Depending on your network, you might be able to piggy-back on other users' activities, and your cache would not have to work as hard.  You can generally set up hierarchical caches such that your cache does not store content it received from the upstream cache, since retrieving it again from upstream is not considered expensive.  This frees up your cache to focus on the content that is not shared with the upstream server.

If you have multiple EZproxy servers, you could set up a cache confederation when this is combined with a clustered proxy setup, thus amplifying the benefits of having a cluster.  Instead of each cache keeping duplicate content, each proxy could query its peer before requesting content upstream.  This would require EZproxy to support the ICP protocol, which might make for interesting architecture possibilities as well.

Even without the advanced caching support, though, just a basic web cache implementation would be useful to many sites, both directly in performance improvements, and indirectly via bandwidth savings.

Tuesday, January 22, 2013

The citation trap

We have all seen this in one form or another.  The web is such a dynamic place that web pages that existed when someone was writing their content no longer exist today.

Students graduate, and their university home pages are taken down.  Web sites are reorganized, and the content is moved or lost.  Companies are bought and sold.  Any number of things can cause content to move or simply disappear.

And then there are proxy servers.

In my last post on EZproxy clustering, I noted that the way EZproxy's clustering support currently works, it exposes individual cluster nodes to the end user, and that some vendors use the referring URL to construct a citation link:
http://www.example.com.ezproxy-1.library.example.edu/path/to/content
The issue being that someday the name ezproxy-1.library.example.edu is going to go away.  Maybe due to growth.  Perhaps due to the college being purchased by another school.  Or maybe a change of regime in IT brings in a new leader who wants to change how things are run.  Any number of things could trigger changing a proxy hostname, and I have not seen anyone discuss how to gracefully deal with this situation.

There is a second potential issue here as well beyond simple citation preservation: LMS content.

As your instructors become more sophisticated users of your school's LMS, there are going to be an increasing number of links to content through your proxy servers.  Those links should look like this (remember the cluster setup with the shared DNS name):
http://ezproxy.library.example.edu/login?url=http://www.example.com/path/to/content
This gives us two scenarios to handle.  One where the proxy hostname is already embedded into the URL, and one where it is not.  One where the content is part of the URL, and the other where the content is passed as an argument.

There are a few options here:

First, if all you are doing is changing the proxy hostname, you can set up an alias with a DNS CNAME record.  This will keep the old name alive, and the way EZproxy currently behaves, it will issue an HTTP redirect from the old name to the new name, and the user will be sent to the new server transparently:


$ telnet localhost 80
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
GET / HTTP/1.0
Host: ezproxy.library.example.edu

HTTP/1.1 302 Moved temporarily
Date: Fri, 21 Jan 2013 01:42:03 GMT
Server: EZproxy
Expires: Mon, 02 Aug 1999 00:00:00 GMT
Last-Modified: Fri, 21 Jan 2013 01:42:03 GMT
Cache-Control: no-store, no-cache, must-revalidate
Cache-Control: post-check=0, pre-check=0
Pragma: no-cache
Location: http://new-shiny-hostname.library.example.edu/
Connection: close

Here you can see that I requested the proxy's home page, and was sent back a redirect to the new-shiny-hostname proxy server instead.  If that's all you needed, you're done.


Sometimes that approach won't work, though.  Say you moved from a single proxy server to a group of proxy servers, each serving a different subset of your user community.  Now you have to have something send the users to the correct proxy server, and a simple alias cannot accomplish that.

I'm going to leave the "something that sends users to the correct proxy server" as an exercise for the reader, but for this second scenario, let's say some shiny new web portal handles this for you.  The trick is now going to be how to get users from the old hostname to the web portal so that they can be sent to the new proxy server.

One way to do this is with the Apache web server's virtual hosting and URL rewriting functionality.  The idea is that you will setup a virtual host that will answer to the old EZproxy hostname, as well as the proxied vendor hostname, and rewrite those requests into the new system.

This is best illustrated with an example.  Remember that we're talking about an EZproxy cluster with a shared DNS name and two nodes.  We will assume for the sake of example that the /proxy URL on the portal system takes the same "url=<vendor url>" argument that EZproxy does.

<VirtualHost *:80>
ServerName ezproxy.library.example.edu
ServerAlias ezproxy-1.library.example.edu
ServerAlias ezproxy-2.library.example.edu

RewriteEngine on
# Send users logging into the old proxy server into the new portal system
RewriteRule ^/login http://portal.library.example.edu/proxy [R=303,L]
</VirtualHost>

Here, the configuration is setup to handle the EZproxy login URL, redirecting it to the portal system where it is dispatched to the correct proxy server in the new environment.

That was fairly straightforward, and if that is all that you need to worry about for your LMS scenario above, you can stop there.

But if you need to worry about poorly formed citations, this is where the fun starts.  How do you deal with the overly-clever vendor citations?  Well, this is where your Apache skills need to be a few notches above novice to be successful.  Here you need to set up a virtual host to answer for the proxied vendor hostname, and have Apache do the Right Thing(tm) for that vendor's services.

<VirtualHost *:80>
ServerName www.example.com.ezproxy.library.example.edu
ServerAlias www.example.com.ezproxy-1.library.example.edu
ServerAlias www.example.com.ezproxy-2.library.example.edu

RewriteEngine on
RewriteRule ^/(.*) http://portal.library.example.edu/proxy?url=http://www.example.com/$1 [R=303,L]
</VirtualHost>

In this simple example, we naively take the request URI and tack it onto the end of the portal entry URL.  Whether or not this works depends on the vendor and how they run their services.  Every vendor is going to be slightly different, and this is where your Apache skills come into play in populating the VirtualHost block correctly.

Once you start down this path, you'll realize that there are other possibilities, but that's a post for another day.

Monday, January 21, 2013

EZproxy wish list: better clustering support

For anyone outside of the library field, you have probably never heard of EZproxy. In short, it's a very simple web proxy server targeted at the library market.  It was originally developed by Chris Zagar, and later purchased by OCLC.

It achieves several goals very well:

  • It is easy to install (single static binary download)
  • It is easy to configure (a config.txt file and a users.txt file)
  • It is easy to manage (built-in administrative tools)
EZproxy is a piece of software that any electronic services librarian worth their salt should be able to download, install, configure, and run with minimal hand holding.  For that I give Chris a hearty pat on the back, because that is something that cannot be said of many pieces of software.

For larger sites, though, we have different needs that do not necessarily mesh well with the "ease of setup and use" that smaller sites need.  This is the first in a series of posts about ways that EZproxy could be made better for larger sites.

Today I want to spend some time on EZproxy clustering.  EZproxy has some very basic support for clustering, but it comes with quite a few caveats.  The way EZproxy clustering works is that you set up a peer relationship between the proxy servers, give them a shared hostname, and point the DNS entry for that shared hostname at each proxy's IP address.  When you access the shared hostname, the web browser receives an HTTP redirect to one of the members of the cluster, and you proceed normally from there.

Again, Chris gets a pat on the back, because this is not hard to setup, and it does work.  You can take a proxy out of the cluster, work on it, put it back in, and have minimal disruption for your patrons.  But it's not zero disruption.   If a patron is on proxy A, and you take proxy A offline, the patron has to re-login to proxy B to continue their research.  For zero disruption, the proxy servers would need to share session data, which they do not today.

This also has an unintended side effect that is not directly EZproxy's fault.  Some vendors use the hostname that they were accessed via to build citation links.  You might have set up ezproxy.library.example.edu as the shared DNS entry, with ezproxy-1.library.example.edu and ezproxy-2.library.example.edu as the cluster members.

In this scenario, you would create links to your proxy server like so: 
http://ezproxy.library.example.edu/login?url=http://vendor.example.com/
Since you have a cluster setup, the HTTP redirect might send you to ezproxy-1.library.example.edu, so when you finally access the vendor's web site, the URL will look like this:
http://vendor.example.com.ezproxy-1.library.example.edu/
Where this becomes a problem is when certain vendors use that URL to create citation links.  What you want is a citation that looks like this:
http://ezproxy.library.example.edu/login?url=http://vendor.example.com/path/to/article
What you get, though, is this:
http://vendor.example.com.ezproxy-1.library.example.edu/path/to/article
That happens because not all vendors allow you to define a proxy prefix that specifies how your proxy server should be addressed.

Why is this a problem?  Quite simply put, things change over time.

I am a firm believer that webmasters have a duty to make a best-effort attempt to keep URLs working over time.  Not everything can be kept functional, but if you had a link that worked last year and reorganized your site last month, users should still be able to get as close as possible to the intended content when they follow older links.

In this context, what happens when example.edu opens multiple campuses, grows from a college to a university, and opens multiple libraries?  Are you still going to want to be tied to that same namespace that made sense when you started?  How are you going to handle all those citations that reference a proxy name that may no longer even physically exist?  Are you willing to drop those citations on the floor?  That doesn't seem to fit the academic spirit.  What about all those links in your schools' LMS?  Who is going to update all of those now-broken links?

You can see from this example where the EZproxy clustering scheme has weaknesses, both in regular maintenance scenarios and when combined with vendors who are a little too clever for their own good.  There is a way out of the vendor citation trap, btw, which I will discuss in another post, but I don't want to rabbit-trail away from the clustering topic for now.

How could EZproxy be improved to work better in this case?  There are two changes that would make this setup a much stronger solution:

If you look at other proxy systems, when you configure a proxy cluster, there is a communication channel between the proxy servers that allows them to share session state to each other.  EZproxy could be extended to share login information with its peer servers so that they all share a common view of the logged-in sessions.  That way when a single node goes down, the user would be able to fall back to a different proxy server, and resume their existing session without having to login again, and would likely never know there was a failure.
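To be clear, nothing like this exists in EZproxy today.  The sketch below is only meant to illustrate the idea: every node records sessions somewhere all of its peers can reach, so any node can honor a session cookie minted by another node.  The class, names, and values are all invented stand-ins for whatever replicated store or peer protocol would actually be used.

import time

class SharedSessionStore:
    """Stand-in for a store replicated between cluster peers."""

    def __init__(self):
        self.sessions = {}

    def login(self, token, username, ttl=7200):
        # Record the session with an expiration time all peers agree on.
        self.sessions[token] = (username, time.time() + ttl)

    def is_valid(self, token):
        entry = self.sessions.get(token)
        return entry is not None and entry[1] > time.time()

# Node A records the login ...
store = SharedSessionStore()
store.login("s:abc123", "patron17")

# ... and node B, picking up the next request after a failover, accepts it.
assert store.is_valid("s:abc123")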

But for shared sessions to reach their full potential, addressing individual proxy servers in a cluster has to stop.  When you set up a cluster relationship, the proxy servers should always use the cluster hostname, rather than the individual node name, when communicating with users.  The user should never know how many nodes are behind ezproxy.library.example.edu.  The only thing that they should ever see in their browser's location bar is that shared DNS name.  For administrative purposes, you will need to be able to access each node's administrative interface individually, but for services you should always use (and see!) the shared name.

Thus, with these two changes, EZproxy's native clustering solution would be a much stronger feature.

Friday, January 18, 2013

Logfile spelunking: reckless robots

I really like using scheme-less URLs.  They sidestep a whole class of problems when running a website that has mixed SSL/non-SSL web pages, as they allow you to share a common path to assets without resorting to ugly hacks that test whether the page was loaded securely or insecurely and adjust the http vs. https scheme accordingly.

That said, it appears that several authors of web crawlers have never actually read RFC 3986 section 4.2 where relative URLs are defined.  They incorrectly assume that all relative URLs are relative to the host, and did not actually read Section 5.3 where the authors helpfully laid out pseudocode for how to properly construct a relative URL.

And while we're talking about web crawlers, what's up with robots not honoring robots.txt?  It's only been a de-facto standard since 1994 and a draft RFC since 1997.

Just about every library and tool that performs crawling functions honors those standards; you either have to go out of your way to explicitly turn off that support, or you have to suffer from Not Invented Here syndrome and write your own crawling library.  You have to suffer an even worse affliction to write your own crawler and not include support for robots.txt.  May I politely suggest that those who do run, not walk, to their doctor and request an emergency cranial rectal extraction?
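Getting both of these right takes almost no code.  A minimal sketch using Python's standard library (the URLs and user-agent string are placeholders):

from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Relative references resolve against the full base URL, not just the host.
base = "http://www.example.com/a/b/page.html"
print(urljoin(base, "style.css"))    # http://www.example.com/a/b/style.css
print(urljoin(base, "/style.css"))   # http://www.example.com/style.css
print(urljoin(base, "../c/d.html"))  # http://www.example.com/a/c/d.html

# And honoring robots.txt is a whopping four lines.
robots = RobotFileParser()
robots.set_url("http://www.example.com/robots.txt")
robots.read()
if robots.can_fetch("ExampleCrawler/1.0", base):
    pass  # safe to go fetch the page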


Thursday, January 17, 2013

User hostile interface design

While testing Internet Explorer 9's handling of the X-UA-Compatible header, I was able to get the Compatibility Mode icon to display, which I needed to validate something I was testing.  Of course when the testing was completed, I wanted to revert this change.

In past versions of IE, you would just un-check the icon that looks like a broken page.  I always had to do it a few times to remember which icon state was standards mode vs. compatibility mode, which probably explains the new behavior in IE9, where the icon disappears completely once you click it.

To revert the change, you just go to Tools -> Compatibility View Settings and remove the site that you just manually added to the Compatibility View list.

Except it's not the "Tools" menu in the upper right of the window, the one that looks like a gear icon and is activated by the Alt-X key; it's the other "Tools" menu.  The super secret hidden "Tools" menu that does not show up unless you hit the "Alt" key and let another menu bar appear.  You can also hit Alt-T, but you don't know this until after you hit the "Alt" key and the menu appears.  Then you can see "Tools" and know that "T" is the shortcut to use.

And people wonder why I tell them I feel like a monkey staring at an algebra problem when I use a Windows system...

Wednesday, January 16, 2013

Musings on customer service

I've noticed a trend over the past few years among IT professionals to lose track of the fact that IT is a customer service industry.

Think about that for a minute:  IT is not the end game, it is a service organization that enables other groups to accomplish goals.  I've kept many bosses and customers alike happy over the years by keeping that simple fact in view.

Unfortunately, this trend is moving beyond IT organizations and into customer service groups.  Once upon a time, when you contacted a customer service department and reported an issue, there was a clear process to get the issue resolved.  That process might involve escalating the issue to the engineering department and having a developer fix the issue in the next release of the product or service.

I have noticed that I'm getting more and more "We don't do that." answers back from vendors than I used to.  Customer support is becoming less and less customer focused and seems to be morphing into an organization focused exclusively on reducing company costs by pigeonholing enhancement requests, denying bugs exist (until they are fixed, of course), and being generally unhelpful.

I wonder if customer service managers understand the damage that they are doing to their companies by taking an uncaring attitude towards their customers.  Perception is reality, and once customers decide that a support organization is useless, that perception is difficult to change.  Eventually the poor-quality support is going to impact the company's bottom line.  I have already stopped purchasing products and services from one vendor because I got tired of them ignoring issues until the product was old enough that the problem could be classified differently and closed with a "WONTFIX" status.

There are exceptions to this, of course, and I will gladly reward companies with quality support with more business.  So what do I look for?

  • First and foremost, a product that "just works".  Something that you set up once and can completely forget about except when you need to make an operational change: manage capacity, adjust settings, monitor performance.  With a high enough quality product, I may never need to contact support.  Sadly, these are rare.
  • Excellent documentation.  If the documentation is good enough, I can generally find examples of what I'm trying to setup, or answer some obscure question by ample notes and clear explanations of how things work and what the design philosophy was.
  • A good self-help portal or knowledge base.  It is so much faster to search a quality knowledge base and find well-written articles than it is to go through a support process.  But producing that content and keeping it current is not an easy task, so I appreciate the companies that excel here that much more.  Generally I'll find 2 or 3 topics while looking that I didn't even know I was interested in, so that's the down side -- a good knowledge base creates more work for me, because I found more things that I need or want to do with the product.
  • An efficient ticketing and support process.  There are several things that really annoy me when it comes to having to actually open a ticket and get support:
  1. Taking the time to collect data, analyze what is going on, present the data in a logical fashion, and lay out the cause-effect relationships in the ticket.  Then I have to spend just as much time talking to the support person on the other end to re-explain everything that I put into the ticket.  Why did I spend the time writing up the ticket explanation if I just had to verbally repeat everything I put into the ticket?
  2. Sending the detailed explanation, and then being asked for screen shots that show exactly what I documented, but in an inferior pictorial form.  While reporting a web site issue to a vendor recently, I explained the mechanism by how I discovered the issue, the symptoms of the issue, the causes of the issue, and how to reproduce the issue.  I was asked for screen shots of the problem.  I literally took screen captures of the exact same entry URL and resulting destination URL that caused the problems and replied back, even though both were clearly documented in the problem report.  I do not understand why the textual description was insufficient.  I fear that this is the result of an entire generation of students who were taught visually vs. textually hitting the workforce.
  3. Having a screen sharing session be the instant go-to when I have already captured log files showing the beginning state, the command histories showing the actions taken and the resulting state.  I can fully appreciate the ability to virtually watch over someone's shoulder, but when you ask me to login to a device and run the exact same commands as before, see the exact same output that I documented, and offer no alternative approaches, what exactly did that accomplish other than wasting my time?
The key here is that when I contact a support organization, I'm trying to get something done.  The more that support organizations do to:

  • give a path to engineering to address product issues
  • provide high-quality documentation
  • maintain a richly populated and easy to search knowledge base
  • create a support process that does not have me doing double work to document and then verbally, visually, or interactively explain the problem
the more credibility they are going to give their company in my eyes, and the more likely I am going to be to support that company financially.

Tuesday, January 15, 2013

Help me, help you

Like any other developer, I receive problem reports when things don't work.  Some of these are excellent, others ... not so much.

What falls into the "not so much" category?  Here are some examples, almost verbatim:

  • It doesn't work
  • I get an error message
  • It doesn't look right
  • The web site doesn't load [irony alert: this one was submitted via web form]
Each of these was a valid problem, but it took several days of back-and-forth with the person who reported the problem to get enough information to even know where to start looking for a problem.  Add a spherical earth into the mix, and even simple email exchanges can take days to clarify the issue enough to start looking into the cause.

So what differentiates these from good problem reports?  Let's take a stab at the above list to make it better:
  • I started at this page, I performed this action, I expected this result, but I received this other result instead.
  • I was trying to do this specific action, and I received an error message that said "..."
  • I was looking at this page, and it did not look right.  Here is a screen shot of what I'm talking about.
  • I was trying to load such-and-such page about 7:30PM on Febtober the 32nd from IP address 192.0.2.1 and I received a timeout error.
And please, when you leave an email address, PLEASE leave one that works.  I've lost track of how many times I have written up a detailed email message with a fix, a workaround, or a request for more specific information just to have the message bounce back as undeliverable.

One other thing:  some people will take screen shots and then paste the graphic into a Word document and mail the Word document as part of their problem report.  Depending on the issue that you're trying to document, this may not be the best tool for the job.  Word will decrease the quality of the graphic, so if I need to look at the location bar in your Internet Explorer window, Word will have made it practically unreadable.  It's generally better to just send the screen shot image as an attachment instead of pasting it into a Word document.

Monday, January 14, 2013

Logfile spelunking: /cache/ trash

I regularly make time to run some ad-hoc reports on the log files from our web servers to look for strange and unusual things.  Thanks to the Interwebs, there is never a shortage of either to be found.

Each trip through the log files, I try to isolate at least one error or error message and find the root cause.  Once I have that, I then look for ways to remedy or work around the issue.  Sometimes I get lucky and it's something that I can actually put a fix in for, but most times I'm left with more questions and fewer answers.

The symptom of today's journey of exploration was the addition of "/cache/<hex string>" to random URLs on the site, something that has been going on for some time now.  It has been affecting Chrome users across several Chrome versions.

I had initially been scouring the JavaScript on the site to see if one of the 3rd party libraries had gone off the deep end with a bug that only Chrome users hit, with no success.  A few months ago, I went down the path of bad caching/acceleration software that was using a scheme like this to ... well, your guess is as good as mine.

Today, though, I got lucky: I found an answer.

What I was able to track down today is a report of similar problems on other sites that led me to a Chrome bug report, as well as one of many self-help malware remedy pages.  It seems that people have installed an extension called "Browser Companion Helper" that sounds like it is anything but what the name would suggest.

Since knowing is half the battle, the next trick is to figure out what to do with this new information.

There does not appear to be an existing way to detect the plugin and alert the user directly.  Without infecting one of my systems with that malware, I can't figure out if there is a way to redirect the user to a removal page to clean up their system, since sometimes the erroneous URL is requested before a legitimate one, other times after.

It would be nice if Google could blacklist plugins like this and prevent the extension from running in a future release of Chrome.  It looks like there was some work done in mid-2012 on the extension blacklist corner of Chrome, but it is unclear what Google's policy is for shipping default blacklists in cases like this.

So I'm not sure just how much I might be able to do to remedy the real root cause and enable the user to get the malware off their system.  In the meantime, I now know what causes these requests, and I can safely ignore them as a client-side issue, not something that the JavaScript on our pages is doing wrong.
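Since these requests are pure client-side noise, the practical step for now is to keep them out of the reports.  A small filter sketch; the pattern is just an assumption based on what shows up in my own logs, so adjust it to match yours:

import re
import sys

# Requests polluted by the extension look like an otherwise-normal path
# with /cache/<long hex string> spliced into it.
CACHE_TRASH = re.compile(r"/cache/[0-9a-f]{16,}", re.IGNORECASE)

for line in sys.stdin:
    if CACHE_TRASH.search(line):
        continue                 # client-side junk, drop it
    sys.stdout.write(line)

Run the access logs through that before any ad-hoc reporting and the noise disappears without touching the original log files.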

Friday, January 11, 2013

Tarnished Chrome

I used to think very highly of Google's Chrome browser.  It developed a reputation for being a lightning fast, standards compliant browser, and it was one of the first -- if not the first -- browsers to auto-upgrade users to the latest release.

But recently, I have been finding myself more frustrated with Chrome than Internet Exploder Explorer.

Among the issues I've run into recently:
  • If you have a HTML form with a GET action, but no input elements, the browser appends a "?" to the action URL.  There are a few different bugs logged with Chrome and with WebKit on this issue.  Many web sites will accept this extra character without complaint, but some sites with naive redirect rules get rather cranky when this happens.
  • Chrome's handling of popup windows makes my brain hurt.  I thought the purpose of popup blockers was to not load content in the first place.  Because of how they handle popups, JavaScript code cannot determine when a popup succeeded or failed, because it always succeeds, even when it fails.  Ouch!
  • Even Chrome Frame is not immune.  There appears to be a food fight between Chrome Frame and Internet Explorer over cookie handling in general, and there may be some corner cases in AJAX handling where cookies are not preserved.
  • There is some utterly bizarre issue on the Chrome 11 build that ships with the B&N Nook that causes AJAX-driven actions to pause for 20 seconds before running on the 2nd execution.  Across 34 variations of desktop browser and OS and 10 different mobile flavors, no other browser exhibits that behavior.  That's going to be a fun one to track down.
Don't get me wrong, Chrome is still a decent browser.  So far it gives me much less trouble than Internet Explorer, but its stock definitely fell a few points after having to work around these issues.

Thursday, January 10, 2013

Fun With Appliance SSL Certificates

I recently had the "pleasure" of installing SSL certificates onto a VMware vCenter Server Appliance.  And by pleasure, I mean I followed the 81-step process for replacing the self-signed SSL certificates that were created when the appliance was initially installed.  Yes, you read that right, 81 steps.

In reality, it was over 100 steps by the time you completed all of the preparation to generate the certificate signing requests (CSRs) and send them off to the CA to be signed.  The process was made that much more enjoyable by the fact that I had to perform it twice, since the certificates needed to be set up for both client and server authentication usage, something that the CSR specified but my CA software did not honor the first time through.

As I was going through the process, I could not help but wonder what drove the Java camp to adopt PKCS12 formatting as their preferred SSL container vs PEM formatting that the C world uses.  Add to that the special sauce for the Java keystore file in step 65, and I had to keep reminding myself of the old adage: The nice thing about standards is that there are so many to choose from.

Please, appliance vendors, take note:  Any time you have to write a process for your users that involves the use of ssh to perform what should be basic management tasks, you have violated the implied appliance contract. An appliance needs to be something that I install and forget about; a tool that I use to accomplish a task; one less operating system installation that I have to worry about.  Not something that I have to ssh into and copy files by hand in a manual, repetitive, error-prone process.  I use systems management tools to avoid this kind of mundane work for a reason.

Hopefully VMware will improve the appliance management functions and provide an interface that will allow the certificates to be managed by the appliance.  Zimbra has this ability already, maybe the VCSA developers should buy them a cup-o-joe and pick their brains.  If not, maybe projects like vCert Manager will bear fruit and make this task less painful in the future.