I really like using scheme-less URLs. They sidestep a whole class of problems when running a website with a mix of SSL and non-SSL pages: you can reference assets by a common path without resorting to ugly hacks that test whether the page was loaded securely or insecurely and adjust the http vs. https scheme accordingly.
That said, it appears that several authors of web crawlers have never actually read RFC 3986 section 4.2, where relative URLs are defined. They incorrectly assume that all relative URLs are relative to the host, and they apparently never made it to Section 5.3, where the authors helpfully lay out pseudocode for properly resolving a relative URL against a base URI.
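To see what the RFC's resolution rules actually produce, here is a minimal sketch using Python's urllib.parse.urljoin, which follows RFC 3986 resolution; the base page and CDN host below are made-up examples, not real sites:

```python
from urllib.parse import urljoin

# Base URL of the page containing the references (hypothetical).
base = "https://example.com/articles/post.html"

# A scheme-relative ("scheme-less") reference keeps the base scheme
# but supplies its own host and path.
print(urljoin(base, "//cdn.example.net/js/app.js"))
# -> https://cdn.example.net/js/app.js

# A host-relative reference keeps the scheme and host of the base.
print(urljoin(base, "/js/app.js"))
# -> https://example.com/js/app.js

# A path-relative reference is resolved against the base path.
print(urljoin(base, "js/app.js"))
# -> https://example.com/articles/js/app.js
```

A crawler that treats the first reference as a path on example.com ends up requesting a URL that was never there.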
And while we're talking about web crawlers, what's up with robots not honoring robots.txt? It's only been a de facto standard since 1994 and a draft RFC since 1997.
Just about every library and tool that performs crawling honors those standards, so you either have to go out of your way to explicitly turn that support off, or be suffering from Not Invented Here syndrome badly enough to write your own crawling library. It takes an even worse affliction to write your own crawler and leave out robots.txt support entirely. May I politely suggest that those who do run, not walk, to their doctor and request an emergency cranial rectal extraction?
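For the record, honoring robots.txt from Python takes only a few lines with the standard library's urllib.robotparser; the bot name and site below are placeholders, not a real crawler or host:

```python
from urllib import robotparser

# Fetch and parse the site's robots.txt (site and bot name are placeholders).
rp = robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()

url = "https://example.com/private/report.html"
if rp.can_fetch("MyCrawler", url):
    print("allowed:", url)
else:
    print("disallowed by robots.txt:", url)
```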