{"id":337,"date":"2020-08-18T19:23:47","date_gmt":"2020-08-18T20:23:47","guid":{"rendered":"http:\/\/www.linux-tutorial.info\/?page_id=77"},"modified":"2020-08-22T19:26:18","modified_gmt":"2020-08-22T20:26:18","slug":"this-is-the-page-title-toplevel-170","status":"publish","type":"page","link":"http:\/\/www.linux-tutorial.info\/?page_id=337","title":{"rendered":"Problem Solving"},"content":{"rendered":"\n<title>Problem Solving<\/title>\n<p>\nSystem problems fall into several categories. The first category is difficult\nto describe and even more difficult to track down. For lack of a better word, I\nam going to use the word &#8220;glitch.&#8221; Glitches are problems that occur infrequently\nand under circumstances that are not easily repeated. They can be caused by\nanything from users with fat fingers to power fluctuations that change the\ncontents of memory.\n<\/p>\n<p>\nNext are special circumstances in software that are detected by the <glossary>CPU<\/glossary>\nwhile it is in the process of executing a command. I discussed these briefly in the section on\n <glossary>kernel<\/glossary> internals. These problems are traps, faults, and exceptions,\nincluding such things as page faults. Many of these events are normal parts of\nsystem operation and are therefore expected. Other events, like following an\ninvalid pointer, are unexpected and will usually cause the process to\nterminate.\n<\/p>\n<p class=\"subtitle\">Kernel Panics<\/p>\n<p>\nWhat if the <glossary>kernel<\/glossary>\ncauses a\ntrap, <glossary>fault<\/glossary>,\nor exception? As I mentioned in the section on <tutorial id=86>kernel internals<\/tutorial>,\nthere are only a few\ncases when the <glossary>kernel<\/glossary>\nis allowed to do this. 
If this is not one of those cases, the situation is\ndeemed so serious that the <glossary>kernel<\/glossary>\nmust stop the system immediately to prevent any further damage.\nThis is a <glossary>panic<\/glossary>.\n<\/p>\n<p>\nWhen the system panics, using its last dying breath, the <glossary>kernel<\/glossary>\nruns a\nspecial routine that prints the contents of the internal registers onto the\nconsole. Despite the way it sounds, if your system is going to go down, this is\nthe best way to do it. The rationale behind that statement is that when the\nsystem panics in this manner, at least there is a record of what happened.\n<\/p>\n<p>\nIf the power goes out on the system, it is not really a system problem, in\nthe sense that it was caused by an outside influence, similar to someone\npulling the plug or flipping the circuit breaker (which my father-in-law did to\nme once). Although this kind of problem can be remedied with a\n<glossary>UPS<\/glossary>,  the first time the system goes down before the UPS is\ninstalled can make you question the stability of your system. There is no record\nof what happened and unless you know the cause was a power outage, it could have\nbeen anything.\n<\/p>\n<p>\nAnother annoying situation is when the\nsystem just &#8220;hangs.&#8221; That is, it stops completely and does not react to any\ninput. This could be the result of a bad hard disk controller, bad\n<glossary>RAM<\/glossary>,  or an improperly written or corrupt\ndevice driver. Because there is no record of what was happening, trying to\nfigure out what went wrong is extremely difficult, especially if this happens\nsporadically.\n<\/p>\n<p>\nBecause a system panic\nis really the only time you can easily track down the problem, I will start\nthere. The first thing to think about is that as the system goes down, it does\ntwo things: writes the registers to the console screen and writes a memory image\nto the <glossary>dump device<\/glossary>.  
The fact that it does this as it&#8217;s\ndying makes me think that this is something important, which it is.\n<\/p>\n<p>\nThe first thing to look at is\nthe instruction pointer. This is actually composed of two registers: the CS\n(code segment) and EIP (instruction pointer) registers. Together they give the address of the instruction\nthat the <glossary>kernel<\/glossary> was executing at the time\nof the <glossary>panic<\/glossary>.\n By comparing the <glossary>EIP<\/glossary>\nof several different panics, you can make some assumptions about\nthe problem. For example, if the <glossary>EIP<\/glossary>\nis consistent across several different panics, the system was\nexecuting the same piece of code every time it panicked. This <i>usually<\/i>\nindicates a software problem.\n<\/p>\n<p>\nOn the\nother hand, if the <glossary>EIP<\/glossary>\nchanges from panic to panic, then probably no one piece of code\nis the problem and the cause is more likely hardware. This could be bad\n<glossary>RAM<\/glossary>\nor something else. Keep\nin mind, however, that a hardware problem could cause repeated <glossary>EIP<\/glossary>\nvalues, so this is not a hard-and-fast rule.\n<\/p>\n<p>\nThe problem with this approach is that the <glossary>kernel<\/glossary>\nis generally loaded the same\nway all the time. That is, unless you change something, it will occupy the same\narea of memory. Therefore, it&#8217;s possible that bad <glossary>RAM<\/glossary>\nmakes it look as though there is a bad driver. The way to\nverify this is to change where the <glossary>kernel<\/glossary>\nis physically loaded. You can do this by rearranging the\norder of your memory chips.\n<\/p>\n<p>\nKeep in mind that this technique may not tell you which\nSIMM is bad, but only that you may have a bad <glossary>SIMM<\/glossary>.\n The only sure-fire test is to swap out\nthe memory. 
If the problem goes away with new <glossary>RAM<\/glossary>\nand returns with the old RAM, you have a bad\nSIMM.\n<\/p>\n<p>\nMany live Linux CDs, as well as installed versions, provide a memory test that you can start when the system is booted.\n<\/p>\n<p class=\"subtitle\">Getting to the Heart of the Problem<\/p>\n<p>\nOkay, so we know what types of problems\ncan occur. How do we correct them? If you have a contract with a consultant,\ncorrecting them might be part of that contract. Take the time to read it. Sometimes\nthe consultant is not even aware of what is in his or her own contract. I have\ntalked to customers who had a consultant charge them for maintenance or\nrepair of hardware, insisting that it was an extra service. However, the\ncustomer could whip out the contract and show the contractor that these services\nwere included.\n<\/p>\n<p>\nIf you\nare not fortunate enough to have such an expensive support contract, you will obviously have to\ndo the detective work yourself. If the printer catches fire, it is pretty\nobvious where the problem is. However, if the printer just stops working,\nfiguring out what is wrong is often difficult. Well, I like to think of problem\nsolving the way Sherlock Holmes described it in <i>The Seven Percent\nSolution<\/i> (and maybe other places):\n<\/p>\n<p>&#8220;Eliminate the impossible and whatever is left over, no matter how\nimprobable, must be the truth.&#8221;<\/p>\n<p>\nAlthough this sounds like a basic enough statement, it is\noften difficult to know where to begin to eliminate things. In simple cases, you\ncan begin by eliminating almost everything. For example, suppose your system\nwas hanging every time you used the tape drive. It would be safe at this point\nto eliminate everything but the tape drive. So, the next big question is whether\nit is a hardware problem or not.\n<\/p>\n<p>\nPotentially, that portion of the\nkernel containing the tape driver was corrupt. 
In this case, simply rebuilding\nthe <glossary>kernel<\/glossary> may be enough\nto correct the problem, because when you <glossary>relink<\/glossary>,\n you link in a new copy of the driver. If that is not\nsufficient, then restoring the driver from the distribution media is the next\nstep. However, depending on your access to the media, checking the hardware first\nmight be easier.\n<\/p>\n<p>\nAlthough it is much less common now, some older tape drives required their own controller card.\nIf this tape drive requires its own controller and you have access to another\ncontroller or tape drive, you can swap components to see whether the behavior\nchanges. However, just as you don&#8217;t want to install multiple pieces of hardware\nat the same time, you don&#8217;t want to swap multiple pieces. If you do and the\nproblem goes away, how do you know whether it was the controller or the tape\ndrive? If you swap out the tape drive and the problem goes away, that would\nindicate that the problem was in the tape drive. However, does the first\ncontroller work with a different tape drive? You may have two problems at\nonce.\n<\/p>\n<p>\nIf you don&#8217;t have access to other equipment\nthat you can swap, there is little that you can do other than verify that it is\nnot a software problem. I have had at least one case while in tech support in\nwhich a customer called in, insisting that our driver was broken because he\ncouldn&#8217;t access the tape drive. Because the tape drive worked under\n<glossary>DOS<\/glossary> and the tape drive was listed as supported, either the\ndocumentation was wrong or something else was. Relinking the\n<glossary>kernel<\/glossary> and replacing the driver had no effect. We checked\nthe hardware settings to make sure there were no conflicts, but everything\nlooked fine.\n<\/p>\n<p>\nWell, we\nhad been testing it using tar the whole time because tar is quick and easy when\nyou are trying to do tests. 
When we ran a quick test using <command>cpio<\/command>, the tape drive\nworked like a champ. When we tried outputting <command>tar<\/command> to a file, it failed. Once we\nreplaced the <command>tar<\/command> <glossary>binary<\/glossary>, everything worked correctly.\n<\/p>\n<p>\nIf the software behaves correctly, the next thing to consider is a conflict. Conflicts\nonly arise when you add something to the system. If you have been running for\nsome time and suddenly the tape drive stops working, then it is unlikely that\nthere are conflicts, unless, of course, you just added some other piece of\nhardware. If problems arise after you add hardware, remove it from the\n<glossary>kernel<\/glossary> and see whether the problem goes away. If it doesn&#8217;t\ngo away, remove the hardware physically from the system.\n<\/p>\n<concept id=\"\" description=\"Bad cabling can cause problems with hardware just as bad hardware can.\" \/>\n<p>\nAnother issue that people often forget is cabling. More than once\nI have added a new piece of hardware and, after relinking and rebooting, found that something else\ndidn&#8217;t work. After removing the new hardware again, the other piece still didn&#8217;t work. What\nhappened? When I added the hardware, I loosened the cable on the other piece.\nNeedless to say, pushing the cable back in fixed my problem.\n<\/p>\n<p>\nI have also seen cases in which the cable itself is bad. One support engineer\nreported a case to me in which just pin 8 on a serial cable was bad. Depending on\nwhat was being done, the cable might work. Needless to say, this problem was not easy to track down.\n<\/p>\n<p>\nPotentially, the connector on the\ncable is bad. If you have something like <glossary>SCSI<\/glossary>,\nwhere you can change the order of devices on the cable\nwithout much hassle, doing so is a good test. 
If you switch hardware and the\nproblem moves from one device to the other, this could indicate one of two\nthings: either the termination or the connector is bad.\n<\/p>\n<concept id=\"\" description=\"Hardware problems can be caused by setting conflicts.\" \/>\n<p>\nIf you do have a hardware problem, it is often the result of a\nconflict. If your system has been running for a while and you just added\nsomething, it is fairly obvious what is causing the conflict. If you have\ntrouble installing, it is not always as clear. In such cases, the best thing is\nto remove everything from your system that is not needed for the install. In\nother words, strip your machine to the &#8220;bare bones&#8221; and see how far you get.\nThen add one piece at a time so that once the problem recurs, you know which\npiece is responsible.\n<\/p>\n<p>\nAs you try to track down\nthe problem yourself, examine the problem carefully. Can you tell whether there\nis a pattern to when and\/or where the problem occurs? Is the problem related to\na particular piece of hardware? Is it related to a particular software package?\nIs it related to the load that is on the system? Is it related to the length of\ntime the system has been up? Even if you can&#8217;t tell what the pattern means, the\nanswers give the support representative more information to help\ntrack down the problem. Did you just add a new piece of hardware or software? Does\nremoving it correct the problem? Did you check to see whether there are any\nhardware conflicts such as base <glossary>address<\/glossary>,\n<glossary>interrupt<\/glossary> vectors, and <glossary>DMA<\/glossary>\nchannels?\n<\/p>\n<p>\nI have talked to customers who were having trouble with one particular\ncommand. They insisted that it did not work correctly and therefore there was a\nbug in either the software or the doc. 
Because they were reporting a bug, we\nallowed them to speak with a support engineer even though they did not have a\nvalid support contract. They kept saying that the documentation was bad because\nthe software did not work the way it was described in the manual. After pulling\nsome teeth, I discovered that the doc the customers used was for a product that\nwas several years old. In fact, there had been three releases since then. They\nwere using the latest software, but the doc was from the older release. No\nwonder the doc didn&#8217;t match the software.\n<\/p>\n<dl>\n<dt>Collect information<\/dt>\n<dd>Instead of a simple\nlist, I suggest you create a <a href=\"http:\/\/www.peterussell.com\/Mindmaps\/HowTo.html\">\nmind map<\/a>. Your brain works in a non-linear fashion, and unlike a simple list,\na mind map helps you gather and analyse information the way your brain actually works.\n<\/dd>\n<dt>Work methodically and stay on track<\/dt>\n<dd>Unless you have a very specific reason, don&#8217;t jump to some other area before you\ncomplete the one you are working on. It is often a waste of time, not because that\nother area is not where the problem is, but rather because &#8220;finding yourself&#8221; again in the\noriginal test area almost always requires a little bit of extra time (&#8220;Now where was\nI?&#8221;). Let your test results in one area guide you to other areas <i>even if<\/i> that\nmeans jumping somewhere else before you are done. But make sure you have a reason.\n<\/dd>\n<dt>Split the problem into pieces<\/dt>\n<dd>Think of a chain that has a broken link. You can tie the end onto something, but\nwhen you pull nothing happens. Each link needs to be examined individually. Also, the\nlarger the pieces, the easier it is to overlook something.\n<\/dd>\n<dt>Keep track of where you have been<\/dt>\n<dd>&#8220;Been there done that.&#8221; Keep a record of what you have done\/tested and what the\nresults were. 
This can save a lot of time with complex problems with many different\ncomponents.\n<\/dd>\n<dt>Listen to the facts<\/dt>\n<dd>One key concept I think you need to keep in mind is that appearances can be\ndeceiving. The way the problem presents itself on the surface may not be the real\nproblem at all. Especially when dealing with complex systems like Linux or\nnetworking, the problem may be buried under several different layers of &#8220;noise&#8221;.\nTherefore, you should try not to make too many assumptions and if you do, verify those\nassumptions before you go wandering off on the wrong path. If you can\nfigure out the true nature of the problem, then finding the cause is usually very\neasy.\n<\/dd>\n<dt>Be aware of all limitations and restrictions<\/dt>\n<dd>\nMaybe what you are trying to do is not possible given the current configuration or\nhardware. For example, maybe there is a firewall rule which prevents two machines\nfrom communicating. Maybe you are not authorized to use resources on a specific\nmachine. You might be able to see a machine using some tools (e.g. ping) but not\nwith others (e.g. traceroute).\n<\/dd>\n<dt>Read what is in front of you<\/dt>\n<dd>Pay particular attention to error messages. I have had &#8220;experienced&#8221; system\nadministrators report problems to me and say that there was &#8220;some error message&#8221; on\nthe screen. It&#8217;s true that many errors are vague or come from the last link in\nthe chain, but more often than not they provide valuable information. This also\napplies to the output of commands. Does the command report the information\nyou expect it to?\n<\/dd>\n<dt>Keep calm<\/dt>\n<dd>Getting upset or angry will not help you solve the problem. In fact, just the\nopposite is true. You begin to be more concerned with your frustration or anger\nand forget about the true problem. Keep in mind that if the hardware\nor software were as buggy as you now think it is, the company would be out of\nbusiness. 
(Obviously that statement does not apply to Microsoft products.)\nIt&#8217;s probably one small point in the doc that you skipped over (if you\neven read the doc) or something else in the system is conflicting. Getting upset\ndoes nothing for you. In fact (speaking from experience), getting upset can\ncause you to miss some of the details for which you&#8217;re looking.\n<\/dd>\n<dt>Recreate the problem<\/dt>\n<dd>As in many branches of science, you cause something to happen and then examine\nboth the cause and results. This not only verifies your understanding of the situation,\nit also helps prevent wild goose chases. Users with little or no technical\nexperience tend to overdramatize problems. This often results in comments like\n&#8220;I didn&#8217;t do anything. It just stopped working.&#8221; By recreating the problem yourself,\nyou have ensured that the problem does not exist between the chair and the keyboard.\n<\/dd>\n<dt>Stick with known tools<\/dt>\n<dd>There are dozens (if not hundreds) of network tools available. The time to learn\nabout their features is not necessarily when you are trying to solve a business-critical\nproblem. Find out what tools are already available and learn how to use\nthem. I would also recommend using the tools that are available on all machines (or\nat least as many as possible). That way you don&#8217;t need to spend time learning the\nspecifics of each tool.\n<\/dd>\n<dt>Don&#8217;t forget the obvious<\/dt>\n<dd>Cables can accidentally get kicked out or damaged. I have seen cases where\nthe cleaning crew turned off a monitor and the next day the user reported\nthe computer didn&#8217;t work because the screen was blank.<\/dd>\n<\/dl>\n","protected":false},"excerpt":{"rendered":"<p>Problem Solving System problems fall into several categories. The first category is difficult to describe and even more difficult to track down. 
For lack of a better word, I am going to use the word &#8220;glitch.&#8221; Glitches are problems that &hellip; <a href=\"http:\/\/www.linux-tutorial.info\/?page_id=337\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-337","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"http:\/\/www.linux-tutorial.info\/index.php?rest_route=\/wp\/v2\/pages\/337","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.linux-tutorial.info\/index.php?rest_route=\/wp\/v2\/pages"}],"about":[{"href":"http:\/\/www.linux-tutorial.info\/index.php?rest_route=\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"http:\/\/www.linux-tutorial.info\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.linux-tutorial.info\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=337"}],"version-history":[{"count":1,"href":"http:\/\/www.linux-tutorial.info\/index.php?rest_route=\/wp\/v2\/pages\/337\/revisions"}],"predecessor-version":[{"id":667,"href":"http:\/\/www.linux-tutorial.info\/index.php?rest_route=\/wp\/v2\/pages\/337\/revisions\/667"}],"wp:attachment":[{"href":"http:\/\/www.linux-tutorial.info\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=337"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}