Take your time, document, test, implement

If you do this job long enough, you have to replace hardware.  Either hardware you installed or something somebody else engineered and installed.  For branch or access hardware it’s simple: you plan an outage for that segment and swap out the gear.  Datacenter hardware can be much trickier.  Some challenges you’re likely to face are proprietary protocols, improperly engineered designs, and the multitude of MacGyver-type solutions that were put in place because there wasn’t budget or time to do it right.

The first thing you need to do is understand what you have.  Sounds simple, right?  Well, if you get dumped into a new environment you may just look at the documentation that was left for you.  DON’T TRUST THAT!  Old or just flat-out wrong documentation will kill you.  Take the time to go in and fully investigate every piece of hardware you plan on changing.  It’s a pain, but it’s worth it; you’ll probably still have a job when you’re done.  And I mean really get in there: check every layer and document EVERYTHING.  Interfaces, cables, layer 2, layer 3.
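If it helps to picture what that baseline capture can look like, here’s a rough Python sketch using the netmiko library to pull a handful of show commands from each device and dump the raw output to text files.  The hostnames, credentials, and command list are all placeholders, and the platform type is assumed to be IOS-style; adjust for whatever you actually have.

```python
# Baseline capture sketch: grab a few "show" commands from each switch and
# save the raw output as a point-in-time record of the environment.
# Assumes the netmiko library and Cisco-IOS-style devices; hostnames,
# credentials, and the command list below are placeholders.
from netmiko import ConnectHandler

DEVICES = ["dc-access-sw1", "dc-access-sw2", "dc-dist-sw1"]  # hypothetical names
COMMANDS = [
    "show interfaces status",        # physical / interface layer
    "show cdp neighbors detail",     # cabling and adjacencies
    "show spanning-tree",            # layer 2 topology
    "show ip route summary",         # layer 3 overview
]

for host in DEVICES:
    conn = ConnectHandler(
        device_type="cisco_ios",
        host=host,
        username="netops",       # placeholder credentials
        password="changeme",
    )
    with open(f"{host}-baseline.txt", "w") as f:
        for cmd in COMMANDS:
            f.write(f"### {cmd}\n")
            f.write(conn.send_command(cmd) + "\n\n")
    conn.disconnect()
```

Even if you collect the output by hand instead, the point is the same: capture everything now, so you have something trustworthy to compare against later.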

Next, build yourself a lab if you can.  Make sure you can reproduce your current environment, then practice getting things to how you want them to look when you’re done, noting critical steps along the way.  If you do this, you’ll hopefully find the problems before you start the real deal and can plan accordingly.  Here’s an example from a job I’m working on now:

Old firewalls need to become new firewalls, and we’re cleaning up the switching topology a bit at the same time.  Our challenges: the datacenter access switching is old, and we’re changing the distribution layer to a different manufacturer.  The access switching is going to be replaced later too, but this project is big enough already, so that will get its own window.

We pre-configured all of the distribution switching before putting it into production, but we really wanted to test it out first.  So I grabbed some switches smaller than our current access layer, but running the same old software version, so all the protocols should run the same.  I knew from my documentation how things were supposed to look when we were done, but as I was checking the final lab I noticed that while the root bridge was correct, the port blocking wasn’t what I had expected.  Not a huge problem, really.  There were no loops, but something didn’t look right, and it’s not the sort of thing you want to try to figure out at 3am.
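One way to make that “does the lab match the documentation” check less eyeball-driven is to write down which ports you expect spanning tree to block on each switch, then diff that against what the lab actually reports.  A minimal sketch, assuming you’ve already pulled the blocking ports out of the show output by hand or with a parser, and with entirely made-up switch and port names:

```python
# Compare the ports you *expect* spanning tree to block (from the design
# docs) against what the lab switches actually report.  Both dictionaries
# are hypothetical; fill them in from your documentation and lab output.
expected_blocking = {
    "lab-access-1": {"Gi0/2"},
    "lab-access-2": {"Gi0/2"},
}
observed_blocking = {
    "lab-access-1": {"Gi0/2"},
    "lab-access-2": {"Gi0/1"},   # not what the design called for
}

for switch, expected in expected_blocking.items():
    observed = observed_blocking.get(switch, set())
    if observed != expected:
        print(f"{switch}: expected {sorted(expected)}, observed {sorted(observed)}")
```

In my case the mismatch was exactly that kind of “right root, wrong blocked port” surprise, which sent me digging for the cause.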

It turns out the old switches were so old that they were still using the old-school “short” values to calculate spanning-tree path costs, so a 1 Gb/s link between two old switches was getting a cost of 4, while the new hardware was reporting a 1 Gb/s link as 20,000.
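For reference, the legacy 16-bit “short” method uses a fixed lookup table, while the newer “long” method (802.1t) derives the cost from bandwidth, which is exactly where the 4 vs. 20,000 gap comes from.  A quick sketch of both:

```python
# Spanning-tree path cost: legacy "short" method vs. the 802.1t "long" method.
# Short method: fixed 16-bit lookup table.  Long method: 20,000,000,000 / speed in kb/s.

SHORT_COST = {10: 100, 100: 19, 1_000: 4, 10_000: 2}   # keyed by link speed in Mb/s

def long_cost(speed_mbps: int) -> int:
    """802.1t path cost: 20,000,000,000 divided by link speed in kb/s."""
    return 20_000_000_000 // (speed_mbps * 1_000)

for speed in (10, 100, 1_000, 10_000):
    print(f"{speed:>6} Mb/s  short={SHORT_COST[speed]:>4}  long={long_cost(speed):>9}")
# A 1 Gb/s link: short cost 4, long cost 20,000 -- exactly the mismatch above.
```

When half the environment is adding costs in single digits and the other half in tens of thousands, the resulting topology can still be loop-free while blocking ports you never intended to block.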

This little discovery, while it probably wouldn’t have caused major issues, lets us fix our plan, add some changes to keep link costs consistent throughout the environment, and skip a “clean-up” project later.  Because I think we all know that “clean-up” projects never get funded or prioritized high enough to ever get completed, so you’re left with stuff that isn’t right, and that you know isn’t right.  Which makes just about every engineer I know a little crazy.

Obviously, if something breaks and you need to get a new solution in, you don’t have time to do all of this.  But if this is a planned project, you’ll feel way better in the end knowing that your documentation is correct and the design that’s in place is one you’re fully happy with.