|The LINUX.COM Article Archive|
|Originally Published: Tuesday, 9 January 2001||Author: Paul Summers|
|Published to: enhance_articles_sysadmin/Sysadmin|
Personal Side of Being a Sysadmin
Have you got what it takes to be a sysadmin? Can you deal with an annoying user without telling them off? How about that dreaded boss with an idea? In this article, the Personal Side of Being a Sysadmin, we will look at methods of dealing with the day-to-day aspects of keeping all the kids happy in the sandbox.
So you think you've got what it takes to administer a server farm or office network with hundreds, if not thousands, of users on it? The unfortunate reality of being a Systems Administrator is that sometime during your career, you will most likely run into a user (or lus3r if you prefer) who has the IQ of a diced carrot and demands that you drop everything to fix their system/email/whatever.
This article focuses on how to deal with these kinds of issues. We will look at how to deal with day-to-day issues that spring up, and more importantly how not to deal with them. We will go over a number of tips and tricks that will pull your bacon from the coals when stuff happens. Finally, we will look at how to deal with priorities in a triage situation when it all goes wrong.
First, let's look at a number of practical examples of situations that may arise in everyday work as a Sysadmin.
Example number one. You work for a small dot com. You are the only sysadmin in the company, and are responsible for keeping a local network of 50 workstations of mixed origin, and a 10 machine server farm running.
It's Monday morning, you just got your coffee the right color, and someone from marketing runs up to your cube and begins a bitch-a-thon about "e-mail and network" being down.
So, what should you do? The first thing should be to get rid of the miffed co-worker. As much fun as it would be to tell them to go fly a kite off the roof of the nearest skyscraper, a more diplomatic approach must be used to avoid the "blue room" (also known as the boss' office, where old sysadmins go before they die).
Step One: Remove the annoying co-worker. This is generally done by assuring them you will give it top priority and sort the issue out as soon as possible. They will usually ask for an exact time, down to the second, of when you will "have it fixed." If you can't get out of giving them an exact number, compute the time it would take you to rebuild the entire network from scratch and reinstall their workstation. This keeps the user from coming back to complain if your repairs take longer than they usually would. It also gives you a claim to the "Scotty the miracle worker" title when you fix the problem much sooner than anticipated. Contrary to what some people may think, this is not a bad thing. It's called problem management, and it results in never having to give anyone bad news. Plan for the worst, and all your surprises are happy ones.
Step Two: By now, the annoying co-worker has hopefully buggered off to get a coffee, as they now have an excuse not to immediately get to work, because "the network is down". The first thing you should do is verify connectivity of the network and the hosts on it. This can be done via ping, or some other network tool as we'll get into later in this article. 99% of the time, your network will be working just fine. Let's assume that inbound and outbound pings work fine to and from any host on the network, as well as to outside hosts. This means the problem is at the user's machine.
Pick up your handy cable/connectivity tester (You DO have one, right?) and spork it into the port the user's machine is connected to. Work through the possible causes in logical order until you find the culprit. Nine times out of ten, it's a Windows user who has either borked their patch cable or "updated" their network card drivers.
By now, you should have connectivity restored to their box ("the network is back up") and be able to do further troubleshooting. In most cases, the other services mentioned (e-mail) will magically start working again.
Result: The user is happy that their machine is now able to access the network, and they can get their e-mail. They're also happy that they were given top priority and that repairs were done in about 30 minutes when the original estimate was three hours.
The first example was an easy one. You only had one user to deal with, with an isolated problem. The solution, however, or at least the personal side of it, is the same as in our next few examples.
Our next example involves you, the senior systems administrator, in a large data warehousing/resale environment (let's say amazon.com just hired you) with five other junior systems administrators working on your team. It's Friday morning, and everyone is looking forward to the weekend when suddenly, it hits the fan. You and your fellow sysadmins get no fewer than 10 calls each in as many minutes, reporting that the network is down. In as many more minutes, you have a dozen people milling about your desk wanting answers.
Remain calm. You should not dive headfirst into your wire closet or server room, armed with your trusty tool kit, locking the door behind you. First, you have got to deal with the humans, who all think the world has just come to an end. Why? Because in all likelihood you are going to be faced with a fairly simple problem such as a switch or router rebooting, or a simple piece of gateway hardware eating itself. You assume everything will be fine in a few minutes, but what happens if it's not something simple? What happens if a router has eaten itself and you don't have a replacement? What happens if there was a fiber cut and you're at the mercy of the local telco?
You have got to prepare the people who are depending on this network for an outage that may last for minutes, or all day. Then, go fix the problem.
Step 1: Deal with the people who know you are the person to deal with this problem, who are milling about your desk and peering over your shoulder for answers. The sooner you reassure them, the sooner you can get to work solving the problem. The best approach is usually a general address to all concerned. Something like, "We are aware the network is down and are working on the problem. We will inform you as to any changes in status or estimates for restoration of service. Please return to your jobs so we can do ours." Hopefully, the majority of them will get the idea and bugger off to play IR Pong on their PalmPilots. If a few remain and demand answers, reassure them that you are giving the problem top priority. (This affects everyone, so you can be a bit more sincere than with the last example.) Do what is necessary to give a confident answer, and if anyone demands an ETA on repairs, give them the longest possible time that you would need to fix the situation based on what you already know about it. Don't be afraid to admit that you don't know. Just say, "I don't know, but if you'll let me get back to work, I'll find out and let you know."
Now, before you go running, pick up your phone and set a do-not-disturb message that says something like what you told everyone a few minutes ago. This will (hopefully) prevent a huge backlog of voice mail and other annoying things.
Step 2: Fix it. We won't deal with the troubleshooting aspects of fixing the problem, as this article is focusing on the personal side of the job, not the technical. Your number one responsibility as a Sysadmin should always be to the people using the system you are running. Because, quite simply, without them you don't have a job.
For our third example, we will focus on a situation you will most certainly have to deal with as a Sysadmin. The BWI situation. BWI stands for Boss With Idea, and it's never a good thing. Upper management throughout time has been a marketing pawn and is rarely as practically educated as you are when it comes to making things work. However, when they see an ad for xyz bloated tech application on TV, or read some glowing report of the performance of some closed-source proprietary system in a magazine-- rest assured they will be suggesting it to you, and soon.
Step 1: Choose your ground. Usually a BWI arrives as a departmental e-mail in varying degrees of insanity (from "take a look at this" to "we're using this next week"). Or, perhaps it's in the form of a voice mail. Regardless of how you are tipped off about a BWI, always go directly to the source. Schedule a meeting with the BWI as soon as possible to go over the "idea". It is MUCH better to discuss things face to face. If you banter via e-mail, you will lose, even if all the facts and data agree with you. This should be a private meeting, in the interest of saving face. Remember, you have to work with this person in the future, and it's generally not a good idea to publicly make him/her look like an idiot.
Step 2: Prepare the facts. You have a lot of weapons at your disposal, and one of the most powerful is hard data. Say the BWI is to drop the current GNU/Linux Apache/PHP solution you spent a year building, and go with an out-of-the-box NT IIS/ASP solution. Hit Google and suck up all the benchmarks and reliability data you can find to support your opinion. Stop off at Kinko's and have some graphs and flowcharts made, or a full-out presentation if it's important enough. The bigger the impact your presentation makes, the more weight your position will carry in the decision. Practical demonstrations are always a good idea as well. Set up a small Linux server, then a small NT server. Set up a few small tasks and ask your boss to do them. Say, browsing the company Web site on a P100 NT box, and on a P100 Linux box running X. The performance difference should be fairly obvious even to the untrained.
Step 3: Follow through. The best presentation only goes so far if directly after it your boss goes home, flips on the tube, and is bombarded with yet more advertising. The next morning, schedule another meeting and present your redesign plan for whatever application the BWI was directed at. Of course, this basically boils down to adding a few new cool features to your existing application, but this will impress the BWI as it reinforces the point that your existing solution is the way to go. The BWI gets something new to play with, and you get a platform that works. Everybody wins.
Now on to a different aspect of the personal side of being a Sysadmin. Tips and tricks that will save your butt, and a lot of time. These are things that will help you keep people informed and off your back when something goes wrong. Some are simple, some are not. All will make your life much easier as a Sysadmin.
With our first example above, we had a user complaining that the network was down. How about coding a nice little Perl script which uses wget or ping to hit every host on the network, and report the status of each in a nice little table? This is something you can show to the user, who will take up position directly over your shoulder, and point out that it is not a network outage, but only a problem with their machine. Of course, you'll "Get right on it."
What about all those voice mails? How about making a few pre-recorded messages for various problems that may show up and storing them on your workstation? Then, all you have to do is play them back when you set your message, and not worry about what to say, how to say it, or filtering out all the background blabber. This is something best done late at night when few other people are around your workspace.
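As a rough illustration of the host-status idea, here is a minimal sketch in Python rather than Perl. It assumes the system `ping` command is available; the addresses in `HOSTS` and the names `ping`, `status_table`, and `probe` are made up for this example, so substitute your own.

```python
import platform
import subprocess

# Hypothetical host list; replace with the machines on your own network.
HOSTS = ["192.168.1.1", "192.168.1.10", "192.168.1.20"]


def ping(host, timeout_s=2):
    """Return True if one echo request to `host` gets a reply."""
    # The system ping takes -n for the count on Windows and -c elsewhere.
    count_flag = "-n" if platform.system().lower() == "windows" else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # No reply in time, or no ping binary: treat the host as down.
        return False
    return result.returncode == 0


def status_table(hosts, probe=ping):
    """Build a plain-text table with one row per host: name and UP/DOWN."""
    width = max([len(h) for h in hosts] + [len("HOST")])
    lines = [f"{'HOST'.ljust(width)}  STATUS", f"{'-' * width}  ------"]
    for host in hosts:
        state = "UP" if probe(host) else "DOWN"
        lines.append(f"{host.ljust(width)}  {state}")
    return "\n".join(lines)
```

Calling `print(status_table(HOSTS))` prints the table. The `probe` parameter is there so the check can be swapped out, for instance for an HTTP request when a host does not answer pings.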
Remember, if your network or mail server goes down, you're not going to get much use out of those department-wide email alert aliases you set up. Same goes with that nifty support incident ticket system you set up while on a coffee break.
Here's another good one. You know that guy/gal who is always hanging around when you're in the middle of ripping a server apart? The one who is asking endless questions and always seems to want static IPs on the network for some reason? Enter support lackey, stage left. A person like this, who is obviously interested in what you are doing but knows next to nothing about it, can be really handy when it hits the fan. Say the entire network explodes like in our second example. Wouldn't it be nice to have this person run around (at least to the executives) and inform them that you are already hard at work on the problem? This kind of personal touch keeps people happy. Keeping people happy is a good thing, as it will be remembered when it comes time to sign off on those expenses you racked up in Vegas.
My personal favorite trick is the recovery CD. Using programs like slate and ghost, or their OSS alternatives, you can create CDs of minimal operating systems (650 megs worth or so, or the whole shebang on DVD-R if you want to get fancy) as they were installed in production. Why do this? Well, if a machine gets fried and you have to rebuild it, it's generally easier to restore hard drive partitions than to reinstall a box from scratch.
Now on to the last situation in our article. Doomsday. Yep, just like in all those James Bond flicks you've watched, sometimes it can all go very wrong, and be up to you to fix.
It's a peaceful Saturday afternoon. You're sitting in the park, reading a good book (or in your local pub tossing a few back after the nightmare that happened on Friday when the network went down) and suddenly it happens. The page of doom. Sure, it could be your friend wanting to go catch a Yankees game, or it could be your girlfriend wanting to go shopping. But, as you just got comfortable and got into the interesting part of your book, Murphy's law dictates that it must be a page of doom. Most sysadmins know this POD alert, and even have it programmed into their pager or phone as a special tone. It means, quite simply, all hell just broke loose back on the ranch.
You run back to the office to survey the disaster, and are met with various problems. First, it seems someone kicked over a water dispenser one floor up, and two or three of your servers are sitting in a puddle of water. Needless to say, they aren't happy. What's more, water-soaked ceiling tile fell on a switch, breaking its fiber uplink lines. As well, that puddle of water is inching towards the UPS responsible for keeping all this stuff running.
Step 1: Don't Panic. Do push the panic button. This is a total interruption of service, boys and girls. This is bad. If you have Jr. Sysadmins, call them. If not, call your kernel hacking buddies and promise them a beer or 12 after the mess is cleaned up. Just do something to get as many pairs of hands helping you as possible.
Step 2: Triage. While you're waiting for your co-workers and/or friends to show up and madly tossing towels into the water while trying to find that wet/dry vacuum you saw kicking around a few weeks ago, take a moment to sit back and survey the damage.
Your first priority should be the safety of yourself and others. So, if the puddle of water is surrounding a 440V mains breaker panel, you probably don't want to be standing in it throwing switches. Unless you want a punk hairstyle and a free ride in an ambulance. Assuming, that is, that you don't turn yourself into a pizza pop on the spot.
If everything is safe to play with, your next priority should be to prevent any further damage from occurring. If that puddle of water is creeping toward other servers and a huge UPS, shut 'em down. A powered down and offline server is better than a cooked server. This should be a last resort, but an option you should not be afraid to use if the situation warrants it. After all, your boss will be much happier that you saved thousands of dollars of equipment than that you allowed those last two ecom transactions to go through.
Step 3: Fight bigger battles first. So, you've got the water about mopped up and your buddies finally show up. After surveying the damage, you find that two servers are cooked, along with some broken fiber patch cables. Things could be worse. Your first priority should be restoring connectivity to the machines that didn't go for a swim. So, replace the damaged fiber patch cables and log into the switch to make sure everything is happy. Hopefully, everything will be and you won't have to go running around trying to find a replacement switch on a Saturday evening.
Now that connectivity is restored, we look at the servers. Turns out that the primary mail server and backup server got cooked. Upon closer inspection, the power supplies got wet and shorted, and those mainboards don't look at all healthy. Sure, other parts in the system could be fine, but time is money. Slap together a couple of new boxes and restore them from your handy-dandy recovery CDs. (You did make them, didn't you?) Now restore the data, get the boxes back online, and go buy those friends of yours a few beers.
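The triage order walked through above (safety first, then stopping further damage, then connectivity, then rebuilding servers) can be written down as a simple sorted task list. This is only a sketch of the idea; the `Priority` levels, the `Task` type, and the sample tasks are illustrative names invented here, not part of any real tool.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    """Lower number means deal with it sooner."""
    SAFETY = 1
    CONTAIN_DAMAGE = 2
    RESTORE_CONNECTIVITY = 3
    REBUILD_SERVERS = 4


@dataclass
class Task:
    description: str
    priority: Priority


def triage(tasks):
    """Return tasks ordered by priority; ties keep their original order."""
    return sorted(tasks, key=lambda task: task.priority)


# Sample tasks taken from the flooded server room scenario.
tasks = [
    Task("Rebuild mail server from recovery CD", Priority.REBUILD_SERVERS),
    Task("Replace broken fiber patch cable", Priority.RESTORE_CONNECTIVITY),
    Task("Stay out of water near the breaker panel", Priority.SAFETY),
    Task("Shut down servers and UPS in the water's path", Priority.CONTAIN_DAMAGE),
]

for task in triage(tasks):
    print(f"{task.priority.value}. {task.description}")
```

Python's `sorted` is stable, so two tasks at the same level stay in the order you wrote them down.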
This is one tip I can't stress enough. Set up another meeting with your boss, and have them allocate a disaster preparation fund. The most common setup is a corporate credit card, of which you have unrestricted use. So, when something like the above happens, and you don't have enough spare parts in stock, you can run out and get them without having to wait for accounting to show up on Monday. Be careful, though: you will have to account for everything you buy. So, it might be a bit hard explaining why the network suddenly needed a shiny new Ferrari.
In closing, a Sysadmin must deal with many situations regardless of the size of the company. Some of them are fun, most of them are not. That's the nature of the job-- fixing things that nature or silly people have decided to break. This can be accomplished more effectively with good interpersonal skills as described above. If people know you are honestly doing everything you can for them, even if their problem seems trivial and meaningless to you, it will earn you their respect. If you approach your boss with a well-designed, thoughtful argument for why your network or server setup is better than the one he saw on TV, it will get you a lot farther than "Ugh.. that thing is a piece of crap." Always remember that one should never panic, and always prioritize problems. Have a backup plan, and a backup plan for your backup plan. Remember, it could always be worse than it is.