The other day I was browsing around on a professional networking site and came across a post in one of the forums from a guy asking for help on solving a monitoring problem. The problem was that when the servers got under a heavy enough load they would become unresponsive to his monitoring systems and the page would light up like a Christmas tree. He was looking for suggestions to relax the monitoring thresholds so the alerts wouldn't bug him.
The problem is not that the monitoring system is doing its job. The problem is that the servers are under such high enough load that they can't respond to simple "Are you still alive?" queries from a monitoring system. The correct solution is to either add hardware to the cluster and distribute the load or find some way to refactor the code so that it's more efficient.
More alarming was the fact that a bunch of people weighed in before I got there with various suggestions on how to increase timeouts, drop SNMP monitoring, etc. In a week, none of them said, "Hey, maybe the server being in distress is like, bad."
In the tech world, we fall prey to tunnel vision a lot. We become willing to push an incorrect solution to a problem so far that we will layer bad idea after bad idea on to a system, which in turn just keeps bringing in more and more points of failure. Pretty soon, you end up with a shaky, complex Akira style thing that is impossible for anyone else to understand or modify that does nothing but create unnecessary work for you.
The more I look around at my colleagues and talk with them about the battles they're fighting daily, the more I start to wonder if anyone else understands why the bearded Unixy elders held elegance in such high regard.