Monday, January 30, 2012

if it ain't documented, it ain't permanent

On the matter of README files... (caution: it's a long one)


Back around the turn of the century, during the dotcom bubble, I found myself working for a certain hosting company. I started work on a gloomy day in mid-December 1999, and from the moment I showed up I had no idea what exactly I was supposed to be doing there. My previous job had been building a Linux distribution, so I spent a week scrambling to find a niche I could fill with my skill of wedging device drivers into an installation disk and rebuilding RPMs. After my first week, I was thinking to myself "Oh God... they're going to fire me because I can't do anything useful!"

As luck would have it, one of the founders handed me a floppy disk that they had tried to use to do automated installs for RedHat 5.2 systems, and asked me to try and set something up for the upcoming 6.1 release. I got that done, made up some new floppies, and handed a stack of them out to our datacenter crew to use to build new servers. Then two important things happened: 1) the hosting company decided to buy a new SCSI RAID controller that the stock RedHat installer didn't support, and 2) I overheard some of the support staff complaining about how they were having to recompile PHP and kernels to enable some high inode deal that some e-commerce package was recommending. Pretty soon I had rebuilt the kernel RPMs to add in support for the new driver, and rebuilt PHP so that our servers were optimized out of the box. Then I wedged them into a new install pool I'd set up on an NFS server. While I was at it, I fixed a bug (actually a misconfig) in the RH installer that brought the install time down from around 20 minutes to 7 minutes.

As time went on, I started creating more and more custom RPMs based on customer demand that filtered in to me from the support folks, and I added in support for more and more funky hardware. I eventually heard about PXE booting from a new guy who was all hardcore about server hardware, and moved the system over to it so I wouldn't have to worry about someone accidentally using an old floppy image. Pretty soon, all our datacenter folks needed to do was plug a server into the network and boot it up, and they would be given a prompt for a customer identification number. After adding in some mojo to pull down configuration information from our customer information system, I was able to create an install system that could handle any combination of RedHat release versions and optional 3rd party packages (such as our custom tape backup scripts, shopping carts, control panels, etc.) based solely on the customer identification number.
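For the curious, the lookup step can be sketched in a few lines of Python. This is not the original code; it's a minimal illustration, and every name in it (`CUSTOMERS`, `BASE_PACKAGES`, `build_install_plan`, the customer IDs and package names) is made up. The dictionary stands in for the customer information system.

```python
# Hypothetical sketch: turn a customer ID into an install plan.
# CUSTOMERS stands in for the customer information system; all
# IDs, releases and package names below are invented.
BASE_PACKAGES = {
    "redhat-6.1": ["kernel", "php"],
    "redhat-6.2": ["kernel", "php"],
}

CUSTOMERS = {
    "1001": {"release": "redhat-6.1", "extras": ["tape-backup"]},
    "1002": {"release": "redhat-6.2", "extras": ["shopping-cart", "control-panel"]},
}


def build_install_plan(customer_id):
    """Return the release and the full package list for one customer."""
    record = CUSTOMERS.get(customer_id)
    if record is None:
        raise KeyError(f"unknown customer id: {customer_id}")
    release = record["release"]
    # Base packages for the release first, then the customer's optional extras.
    packages = BASE_PACKAGES[release] + record["extras"]
    return {"release": release, "packages": packages}
```

The point of the design is that the tech at the console only types one number; everything else is data that can be changed in one place.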

The benefit of the system was pretty huge. We were able to insert security and driver updates into our provisioning process instantly. All of our installs were centralized on a single server, so migrating the system to new datacenters on remote continents became a breeze. Integrating 3rd party software at install time became simple. Our techs could assemble an enterprise-class server in under 20 minutes... combined with a sub-10-minute install, we could provision a server in under 30 minutes. Servers were going online on an unprotected public network with the most recent security patches, and we had a searchable record of the software that was installed, so that when a vulnerability was discovered we knew exactly who was at risk.
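That "who is at risk" question is just a filter over the install records. Here is a rough Python sketch of the idea, assuming a hypothetical log of (customer ID, package, version) tuples; the record layout, the function name and the version numbers are all invented for illustration and are not what the real system stored.

```python
# Hypothetical install records: (customer_id, package, version).
# The IDs and version strings are made up for the example.
INSTALL_LOG = [
    ("1001", "kernel", "2.2.12"),
    ("1002", "kernel", "2.2.14"),
    ("1003", "php", "3.0.12"),
]


def customers_at_risk(package, bad_versions):
    """Return sorted IDs of customers running a vulnerable version of package."""
    return sorted(
        {cid for cid, pkg, ver in INSTALL_LOG
         if pkg == package and ver in bad_versions}
    )
```

Calling `customers_at_risk("kernel", {"2.2.12"})` on the sample data picks out customer 1001 and nobody else.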

We then started trying to expand the system. I added in support for FreeBSD, and while we weren't quite able to add in support for Windows 2K, we were able to use PXE to bootstrap the install image and have the installer write important versioning information back to the install server.

While setting up and maintaining an automated provisioning system wasn't particularly difficult or challenging, it did become tedious after 4 years. Each new operating system we added put another permutation into the possible configurations and drove up complexity. It quickly got to a point where I was overwhelmed by having to backport patches and rebuild RPMs: a single kernel vulnerability meant having to rebuild at least 2 (usually 3) kernel packages, and each of those meant having to patch in specific driver versions for the hardware we were using. We grew 3 more datacenters, which meant I now had 3 sets of hardware distributors to cover. It got kind of rough, and I started getting crabby as I had to spend more and more time nursing the system and couldn't spend time talking to the support folks who were on the front lines or the datacenter folks whose lives I was supposed to be making easier. I started whining about getting some help and after a brief 18 months, reinforcements finally arrived.

The upside is that during those 18 months, I was able to create some very simple documentation on how the whole install system worked. The original intent for documenting it was to transfer knowledge to other people in case something happened to me: Here's what The Machine does, why I built it, and what I hope to do with it in the future. In the fine tradition of Dr. Moreau and Dr. Frankenstein, I had left blueprints for my monster along with a chronicle of my madness. It wasn't exactly complete, but it was sufficient to train my first replacement well enough that he was able to jump ship and pick up a gig at RedHat after a few months. My second replacement picked it up well enough to add in support for Debian's FAI stuff... and that was around the time someone up above me got the bright idea to rip me out of a development role and stick me in the IT department. Around a year later, I'd leave the company and go try out some other stuff.

It made me proud to hear that even a couple years after I'd moved on, the provisioning system I'd set up was still in place and had been passed on to a 3rd and then 4th generation of techs. I usually keep an eye on job ads just as a way to get hints on what various people are doing, and it was kind of cool to see job postings for a role I had created. A role that didn't exist before 2000.

And unfortunately no longer exists.  Which is the whole point of this post.

It seems that one of the groups that came after me decided to rewrite the scripts that glued the installer together. I'll admit to some wounded pride, but hey... progress happens. What that group failed to do, however, was document what they did, how it worked, and why they did what they did. As key developers of the group moved on to other jobs (both inside the company and out), institutional knowledge not only of the solution but of the problem itself was lost. My old documentation and the notes that had been added to it were chucked because they were obsolete and could only provide a brief glimpse of what we were trying to do 10 years ago. None of it applied to the new stuff. Eventually, the hosting company was left with a system that had stopped making it nimble and able to accommodate customer needs and had turned into a ticking time bomb that no one understood or could understand without some major league code archaeology expeditions. To make it worse, the people with the skills to run those archaeology expeditions were busy doing other things that were vastly more exciting and interesting; the people who could defuse the bomb had better, safer stuff to do than defusing bombs.

In short, an asset soured into a liability.  Eventually the company figured out that the old way wasn't going to work any more.  They had grown large enough to demand that their vendors make the necessary changes to the operating system before buying the hardware.  The responsibility of the job was moved from the company I used to work for up to its vendors.  The job role I had created was outsourced because no one left anything behind to explain WTF they were thinking or why they were doing it.  I'll leave it up to the biz guys to argue whether or not that was a Good Thing(tm) as far as the P&L sheets are concerned, but I think it's Bad(tm) because a tactical ability was lost that provided a strategic advantage.

As a not-quite-greybeard, I'm finding that the idea behind the code is much, much more valuable than the code itself, especially if you are trying to do something strange/unique/innovative. A quick README file explaining what you're trying to do can be more informative than a 40 page essay on the clever use of language idioms or circular documentation comments on all the methods in your classes (i.e., def Defrobnify( self ):   # defrobnifies the object). If our jobs as developers are to solve problems, then we ought to be making some notes to help us remember how and why we got the answers.
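To show the difference, here is a small Python sketch with the same toy method commented both ways. The `Widget` class and the "reason" in the second comment are invented purely for illustration; the only thing that matters is that the second comment tells you something the method name doesn't.

```python
class Widget:
    """Toy object used only to contrast two commenting styles."""

    def __init__(self):
        self.frobnified = True

    def defrobnify_circular(self):
        # defrobnifies the object
        self.frobnified = False

    def defrobnify(self):
        # Why (hypothetical reason): the installer refuses to image a
        # widget that is still frobnified, so clear the flag before
        # handing the widget off.
        self.frobnified = False
```

Both methods do exactly the same thing. Only one of them will still make sense to the 4th generation of techs.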

Show your work, or be forgotten.
