System problems fall into several categories. The first category is difficult to describe and even more difficult to track down. For lack of a better word, I am going to use the word “glitch.” Glitches are problems that occur infrequently and under circumstances that are not easily repeated. They can be caused by anything from users with fat fingers to power fluctuations that change the contents of memory.
Next are special circumstances in software that are detected by the
Kernel Panics
What if the
When the system panics, using its last dying breath, the
If the power goes out on the system, it is not really a system problem, in
the sense that it was caused by an outside influence, similar to someone
pulling the plug or flipping the circuit breaker (which my father-in-law did to
me once). Although this kind of problem can be remedied with a
Another annoying situation is when the
system just “hangs.” That is, it stops completely and does not react to any
input. This could be the result of a bad hard disk controller, bad
Because a system panic
is really the only time you can easily track down the problem, I will start
there. The first thing to think about is that as the system goes down, it does
two things: writes the registers to the console screen and writes a memory image
to the
The first thing to look at is
the instruction pointer. This is actually composed of two registers: the CS
(code segment) and EIP (instruction pointer) registers. This is the instruction
that the
On the
other hand, if the
The problem with this approach is that the
Keep in mind that this technique may not tell you which
SIMM is bad, but only indicate that you may have a bad
Many live Linux CDs as well as installed versions provide a memory test that you can start when the system is booted.
Getting to the Heart of the Problem
Okay, so we know what types of problems can occur. How do we correct them? If you have a contract with a consultant, this might be part of that contract. Take a look at it and read it. Sometimes the consultant is not even aware of what is in his or her own contract. I have talked to customers who have had consultants charge them for maintenance or repair of hardware, insisting that it was an extra service. However, the customer could whip out the contract and show the consultant that these services were included.
If you are not fortunate enough to have such an expensive support contract, you will obviously have to do the detective work yourself. If the printer catches fire, it is pretty obvious where the problem is. However, if the printer just stops working, figuring out what is wrong is often difficult. Well, I like to think of problem solving the way Sherlock Holmes described it in The Seven Percent Solution (and maybe other places):
“Eliminate the impossible and whatever is left over, no matter how improbable, must be the truth.”
Although this sounds like a basic enough statement, it is often difficult to know where to begin to eliminate things. In simple cases, you can begin by eliminating almost everything. For example, suppose your system was hanging every time you used the tape drive. It would be safe at this point to eliminate everything but the tape drive. So, the next big question is whether it is a hardware problem or not.
Potentially, that portion of the
kernel containing the tape driver was corrupt. In this case, simply rebuilding
the
Although much less common, some older tape drives required their own controller card. If this tape drive requires its own controller and you have access to another controller or tape drive, you can swap components to see whether the behavior changes. However, just as you don’t want to install multiple pieces of hardware at the same time, you don’t want to swap multiple pieces. If you do and the problem goes away, how do you know whether it was the controller or the tape drive? If you swap out the tape drive and the problem goes away, that would indicate that the problem was in the tape drive. However, does the first controller work with a different tape drive? You may have two problems at once.
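The swap-one-piece-at-a-time reasoning can be written down as a small bookkeeping exercise. The following is only a sketch in Python: the function name `classify_components` and the component labels are made up for illustration, and the test results are entered by hand after each swap. The rule it applies is simple: if a controller and a drive work together, both are good; anything never seen in a working pair remains suspect.

```python
from typing import Dict, Set, Tuple


def classify_components(
    results: Dict[Tuple[str, str], bool]
) -> Tuple[Set[str], Set[str]]:
    """Return (known_good, suspect) from hand-recorded pairwise tests.

    results maps (controller, drive) to True if that pair worked.
    """
    everything: Set[str] = set()
    known_good: Set[str] = set()
    for (controller, drive), works in results.items():
        everything.update((controller, drive))
        if works:
            # A working pair proves both of its members are good.
            known_good.update((controller, drive))
    # Anything never seen working stays suspect (it is not proven bad).
    return known_good, everything - known_good


# Hypothetical results: only the drive was swapped between the two tests.
results = {
    ("controller A", "drive 1"): False,  # original, failing setup
    ("controller A", "drive 2"): True,   # same controller, other drive
}
known_good, suspect = classify_components(results)
print(sorted(known_good))  # ['controller A', 'drive 2']
print(sorted(suspect))     # ['drive 1']
```

Note that "suspect" only means "never seen working." A component that has only ever been paired with a bad partner lands in that set too, which is exactly the two-problems-at-once case.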
If you don’t have access to other equipment
that you can swap, there is little that you can do other than verify that it is
not a software problem. I have had at least one case while in tech support in
which a customer called in, insisting that our driver was broken because he
couldn’t access the tape drive. Because the tape drive worked under
Well, we
had been testing it using tar the whole time because tar is quick and easy when
you are trying to do tests. When we ran a quick test using
If the software behaves correctly, there is potential for conflicts. This
only occurs when you add something to the system. If you have been running for
some time and suddenly the tape drive stops working, then it is unlikely that
there are conflicts; unless, of course, you just added some other piece of
hardware. If problems arise after you add hardware, remove it from the
Another issue that people often forget is cabling. A number of times I have added a new piece of hardware and, after relinking and rebooting, found that something else didn’t work. After removing the new hardware again, the other piece still didn’t work. What happened? When I added the hardware, I loosened the cable on the other piece. Needless to say, pushing the cable back in fixed my problem.
I have also seen cases in which the cable itself is bad. One support engineer reported a case to me in which just pin 8 on a serial cable was bad. Depending on what was being done, the cable might work. Needless to say, this problem was not easy to track down.
Potentially, the connector on the
cable is bad. If you have something like
If you do have a hardware problem, it is often the result of a conflict. If your system has been running for a while and you just added something, it is fairly obvious what is causing the conflict. If you have trouble installing, it is not always as clear. In such cases, the best thing is to remove everything from your system that is not needed for the install. In other words, strip your machine to the “bare bones” and see how far you get. Then add one piece at a time so that once the problem re-occurs, you know you have the right piece.
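The “bare bones, then one piece at a time” procedure is essentially a loop. Here is a minimal sketch in Python, assuming you can express “does the system still work?” as a yes/no test. The function `find_conflicting_component`, the `system_works` callback, and the component names are all hypothetical; in real life the “test” is you booting the machine and trying it out.

```python
from typing import Callable, Iterable, List, Optional


def find_conflicting_component(
    base: Iterable[str],
    extras: Iterable[str],
    system_works: Callable[[List[str]], bool],
) -> Optional[str]:
    """Add extras one at a time; return the first one that breaks things."""
    installed = list(base)
    if not system_works(installed):
        raise ValueError("the bare-bones configuration already fails")
    for component in extras:
        installed.append(component)
        if not system_works(installed):
            return component  # the problem re-occurred with this piece
    return None  # everything installed and still working


def fake_system_works(installed: List[str]) -> bool:
    # Made-up rule for the demo: sound card and network card clash.
    return not {"sound card", "network card"} <= set(installed)


culprit = find_conflicting_component(
    base=["motherboard", "video card", "hard disk"],
    extras=["network card", "tape drive", "sound card"],
    system_works=fake_system_works,
)
print(culprit)  # sound card
```

The piece returned is the one whose addition triggered the problem. As in the example, it may be conflicting with something that was added earlier, so both deserve a look.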
As you try to track down
the problem yourself, examine the problem carefully. Can you tell whether there
is a pattern to when and/or where the problem occurs? Is the problem related to
a particular piece of hardware? Is it related to a particular software package?
Is it related to the load that is on the system? Is it related to the length of
time the system has been up? Even if you can’t tell what the pattern means, the
support representative probably has one or more pieces of information to help
track down the problem. Did you just add a new piece of hardware or software? Does
removing it correct the problem? Did you check to see whether there are any
hardware conflicts such as base
I have talked to customers who were having trouble with one particular command. They insisted that it did not work correctly and that there was therefore a bug in either the software or the doc. Because they were reporting a bug, we allowed them to speak with a support engineer even though they did not have a valid support contract. They kept saying that the documentation was bad because the software did not work the way it was described in the manual. After pulling some teeth, I discovered that the doc the customers used was for a product that was several years old. In fact, there had been three releases since then. They were using the latest software, but the doc was from the older release. No wonder the doc didn’t match the software.
- Collect information
- Instead of a simple list, I suggest you create a mind map. Your brain works in a non-linear fashion, and unlike a simple list, a mind map helps you gather and analyse information the way your brain actually works.
- Work methodically and stay on track
- Unless you have a very specific reason, don’t jump to some other area before you complete the one you are working on. It is often a waste of time, not because that other area is not where the problem is, but rather because “finding yourself” again in the original test area almost always requires a little bit of extra time (“Now where was I?”). Let your test results in one area guide you to other areas, even if that means jumping somewhere else before you are done. But make sure you have a reason.
- Split the problem into pieces
- Think of a chain that has a broken link. You can tie the end onto something, but when you pull, nothing happens. Each link needs to be examined individually. Also, the larger the pieces, the easier it is to overlook something.
- Keep track of where you have been
- “Been there, done that.” Keep a record of what you have done/tested and what the results were. This can save a lot of time with complex problems that have many different components.
- Listen to the facts
- One key concept I think you need to keep in mind is that appearances can be deceiving. The way the problem presents itself on the surface may not be the real problem at all. Especially when dealing with complex systems like Linux or networking, the problem may be buried under several different layers of “noise”. Therefore, you should try not to make too many assumptions and, if you do, verify those assumptions before you go wandering off on the wrong path. Generally, if you can figure out the true nature of the problem, then finding the cause is usually very easy.
- Be aware of all limitations and restrictions
- Maybe what you are trying to do is not possible given the current configuration or hardware. For example, maybe there is a firewall rule which prevents two machines from communicating. Maybe you are not authorized to use resources on a specific machine. You might be able to see a machine using some tools (e.g. ping) but not with others (e.g. traceroute).
- Read what is in front of you
- Pay particular attention to error messages. I have had “experienced” system administrators report problems to me and say that there was “some error message” on the screen. It’s true that many errors are vague or come from the last link in the chain, but more often than not they provide valuable information. This also applies to the output of commands. Does the command report the information you expect it to?
- Keep calm
- Getting upset or angry will not help you solve the problem. In fact, just the opposite is true. You begin to be more concerned with your frustration or anger and forget about the true problem. Keep in mind that if the hardware or software were as buggy as you now think it is, the company would be out of business. (Obviously that statement does not apply to Microsoft products.) It’s probably one small point in the doc that you skipped over (if you even read the doc) or something else in the system is conflicting. Speaking from experience, getting upset can also cause you to miss some of the details you’re looking for.
- Recreate the problem
- As in many branches of science, you cause something to happen and then examine both the cause and results. This not only verifies your understanding of the situation, it also helps prevent wild goose chases. Users with little or no technical experience tend to overdramatize problems. This often results in comments like “I didn’t do anything. It just stopped working.” By recreating the problem yourself, you have ensured that the problem does not exist between the chair and the keyboard.
- Stick with known tools
- There are dozens (if not hundreds) of network tools available. The time to learn about their features is not necessarily when you are trying to solve a business-critical problem. Find out what tools are already available and learn how to use them. I would also recommend using the tools that are available on all machines (or at least as many as possible). That way you don’t need to spend time learning the specifics of each tool.
- Don’t forget the obvious
- Cables can accidentally get kicked out or damaged. I have seen cases where the cleaning crew turned off a monitor and the next day the user reported the computer didn’t work because the screen was blank.
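Several of the points above (split the problem into pieces, work methodically, keep track of where you have been) can be combined into one habit: walk the chain link by link and write down each result. The following Python sketch shows the idea. The function `examine_chain` is invented for this illustration, and the three checks are hypothetical stand-ins hard-coded to True or False; in practice each one is something you look at or test yourself.

```python
from typing import Callable, List, Optional, Tuple

Check = Tuple[str, Callable[[], bool]]


def examine_chain(
    links: List[Check],
) -> Tuple[List[Tuple[str, bool]], Optional[str]]:
    """Test each link in order, logging results; stop at the first break."""
    log: List[Tuple[str, bool]] = []
    for name, check in links:
        ok = check()
        log.append((name, ok))  # the record of where you have been
        if not ok:
            return log, name  # later links cannot be judged until this is fixed
    return log, None


# Hypothetical checks for the "blank screen" report, obvious things first.
links: List[Check] = [
    ("power cable plugged in", lambda: True),
    ("monitor switched on", lambda: False),
    ("video signal present", lambda: True),
]
log, broken = examine_chain(links)
for name, ok in log:
    print(f"{'ok  ' if ok else 'FAIL'}  {name}")
print("first broken link:", broken)
```

The sketch stops at the first broken link on purpose: anything further down the chain cannot be tested meaningfully until that link is repaired, and the log shows exactly what has already been ruled out.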