Problem-solving starts before you have even installed your system. Because a detailed knowledge of your system is important to figuring out what’s causing problems, you need to keep track of your system from the very beginning. One of the most effective problem-solving tools costs less than $2 and can be found in grocery stores, gas stations, and office supply stores. Interestingly enough, I can’t remember ever seeing it in a store that specialized in either computer hardware or software. I am talking about a notebook. Although a bound notebook will do the job, I find a loose-leaf notebook to be more effective because I can add pages more easily as my system develops.
With today’s technology, you don’t need to do this with pen and paper. Any number of software products let you record this kind of information, and the records can then be read by anyone, anywhere they have access to a computer. However, you can carry a paper notebook into every corner of your company, even where no computer access is available.
In the notebook I include all the configuration information from my system, the make and model of all my hardware, and every change that I make to my system. This is a running record of my system, so the information should include the date and time of the entry, as well as the person making the entry. Every change I make, from adding new software to changing kernel parameters, should be recorded in my log book.
As we discussed, there are a number of different programs that will show you information about your system. It would be fairly straightforward to write a script that collects this information, as well as the contents of various configuration files and then
In putting together your notebook, don’t be terse with comments like, “Added SCSI patch and relinked.” Entries should be detailed, like, “Added patch for Adaptec AIC-7xxx. Rebuild and reboot successful.” Although it seems like busy work, I also believe things like adding users and making backups should be logged. If messages appear on your system, these, too, should be recorded with details of the circumstances. The installation guide should contain an “installation checklist.” I recommend that you complete this before you install and keep a copy of it in the log book.
Something else that’s very important to include in the notebook is a record of the problems you have encountered and the steps that were necessary to correct them. One support engineer with whom I worked told me he calls this his “solutions notebook.”
As you assemble your system, write down everything you can about the hardware components. If you have access to the invoice, a copy of this can be useful for keeping track of the components. If you have any control over it, have your reseller include details about the make and model of all the components. I have seen enough cases in which the invoice or delivery slip contains generic terms like Intel 2400MHz CPU, cartridge tape drive, and 500GB hard disk. Often this doesn’t even tell you whether the hard disk is SCSI, IDE, SATA, or what.
Next, write down all the settings of all the cards and other hardware in your machine. The jumpers or switches on hardware are almost universally labeled. This may be something as simple as J3 but as detailed as IRQ. Linux installs at the defaults on a wide range of cards, and generally there are few conflicts unless you have multiple cards of the same type. However, the world is not perfect and you may have a combination of hardware that neither I nor the Linux developers have ever seen. Therefore, knowing what all the settings are can become an important issue.
One suggestion is to write this information on gummed labels or cards that you can attach to the machine. This way you have the information right in front of you every time you work on the machine.
Although becoming less common with the rise of the Internet, many companies have a “fax back” service in which you can call a number and have them fax you documentation of their products. For most hardware, this is rarely more than a page or two. For something like the settings on a hard disk, however, this is enough. Requesting faxed documentation has a couple of benefits. First, you have the phone number for the manufacturer of each of your hardware components. The time to go hunting for it is not when your system has crashed. Next, you have (fairly) complete documentation of your hardware. Last, by collecting the information on your hardware, you know what you have. I can’t count the number of times I have talked with customers who don’t even know what kind of hard disk they have, let alone what the settings are.
Another great place to get technical information is the World Wide Web. I recently bought a SCSI hard disk that did not have any documentation. A couple of years ago, that might have bothered me. However, when I got home, I quickly connected to the Web site of the drive manufacturer and got the full drive specs, as well as a diagram of where the jumpers are. If you are not sure of the company’s name, take a guess, as I did. I tried www.seagate.com, and it worked the first time. The worst case is that you need to google for it.
When it comes time to install the operating system, the first step is to read the release notes and installation HOWTO and any documentation that comes with your distribution. I am not suggesting reading them cover to cover, but look through the table of contents completely to ensure that there is no mention of potential conflicts with your host adapter or the particular way your video card needs to be configured. The extra hour you spend doing that will save you several hours later, when you can’t figure out why your system doesn’t reboot when you finish the install.
As you are actually doing the installation, the process of documenting your system continues. Depending on what type of installation you choose, you may or may not have the opportunity to see many of the programs in action. If you choose an automatic installation, many of the programs run without your interaction, so you never have a chance to see and therefore document the information.
The information you need to document is the same kind of thing I talked about in the section on finding out how your system was configured. It includes the hard disk geometry and partitions (fdisk), file systems (mount and /etc/fstab), the hardware settings (/var/log/messages), and every patch you have ever installed. You can send the output of all of these commands to a file that you can print out and stick in the notebook.
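As a rough sketch of how this collection might be automated, the shell function below appends the output of the commands just mentioned to a single file. The name collect_sysinfo, the section headings, and the 200-line limit on the log excerpt are my own choices for illustration. Note that fdisk usually needs root privileges, and that some distributions no longer keep /var/log/messages, so each step falls back to a short note rather than failing.

```shell
#!/bin/sh
# collect_sysinfo OUTFILE
# Appends a dated snapshot of the system configuration to OUTFILE.
# (Hypothetical helper: the name and layout are illustrative only.)
collect_sysinfo() {
    out=$1
    {
        echo "=== System snapshot taken $(date) by $(id -un) ==="
        echo "--- Disk geometry and partitions (fdisk -l) ---"
        fdisk -l 2>&1 || echo "(fdisk missing or not permitted; try as root)"
        echo "--- Mounted file systems (mount) ---"
        mount 2>&1 || echo "(mount not available)"
        echo "--- /etc/fstab ---"
        cat /etc/fstab 2>&1 || echo "(no readable /etc/fstab)"
        echo "--- Hardware messages (/var/log/messages) ---"
        tail -n 200 /var/log/messages 2>&1 || echo "(log not readable)"
    } >> "$out"
}
```

Because the function appends rather than overwrites, running it after every change (for example, collect_sysinfo /root/sysinfo.txt, where the path is only a placeholder) gives you a running record that you can print and file in the notebook.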
I don’t know how many times I have said it, or in how many articles (both mine and others’) it has appeared, but some people just don’t want to listen. They often treat their computer systems like a new toy at Christmas. They first want to get everything installed that is visible to the outside world, such as terminals and printers. In this age of “Net-in-a-box,” often that extends to getting their system on the Internet as soon as possible.
Although being able to download the synopsis of your favorite Deep Space Nine episode is an honorable goal for some, Chief O’Brien is not going to come to your rescue when your system crashes. (I think even he would have trouble with the antiquated computer systems of today.)
Once you have finished installing the operating system, the very first device you need to install and configure correctly is your tape drive. If you don’t have a tape drive, buy one! Stop reading right now and go out and buy one. It has been estimated that a “down” computer system costs a company, on the average, $5,000 an hour. You can certainly convince your boss that a tape drive that costs one-tenth as much is a good investment.
If all of your data fits on a DVD (roughly 4GB on a single-layer disc or 8GB on a dual-layer disc), you could buy yourself a DVD writer instead.
One of the first crash calls I received while I was in tech support was from the system administrator at a major airline. After about 20 minutes, it became clear that the situation was hopeless. I had discussed the issue with one of the more senior engineers who determined that the best course of action was to reinstall the OS and restore the data from backups.
I can still remember their system administrator saying, “What backups? There are no backups.”
“Why not?” I asked.
“We don’t have a tape drive.”
“My boss said it was too expensive.”
At that point the only solution was a data recovery service.
“You don’t understand,” he said. “There is more than $1,000,000 worth of flight information on that machine.”
“Not any more.”
What is that lost data worth to you? Even before I started writing my first book, I bought a tape drive for my home machine. For me, it’s not really a question of data but rather, time. I don’t have that much data on my system. Most of it can fit on a half-dozen floppies. This includes all the configuration files that I have changed since my system was installed. However, if my system were to crash, the time I save restoring everything from tape compared to reinstalling from floppies is worth the money I spent.
As technology progressed, CD/DVD writers became cheaper than tape drives. Currently, I make backups of the most important data onto DVD, so I can get to it quickly, but I use my tape drive to back up all of the data and system files, as they won’t fit on a DVD.
The first thing to do once the tape drive is installed is to test it. The fact that it appears at boot says nothing about its functionality. It happens often enough that a drive appears to work fine: all the commands behave correctly, and it even looks as though it is writing to the tape. However, it is not until the system goes down and the data is needed that you realize you cannot read the tape.
I suggest first trying the tape drive by backing up a small subdirectory, such as /etc. There are enough files to give the tape drive a quick workout, but you don’t have to wait for hours for it to finish. Once you have verified that the basic utilities work (like tar or cpio), then try backing up the entire system. If you don’t have some third-party backup software, I recommend that you use cpio. Although tar can back up most of your system, some older versions of tar cannot back up device nodes (the GNU version that comes with Linux can).
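A minimal sketch of such a test, assuming GNU tar (whose d option compares an archive against the files on disk): the function writes a directory to the given destination and then reads the archive back for comparison. The name verify_backup is hypothetical, and the destination can be a tape device or an ordinary file.

```shell
#!/bin/sh
# verify_backup SRC DEST
# Archives directory SRC to DEST (a tape device or a plain file), then
# reads the archive back and compares it with what is on disk.
# (Hypothetical helper; relies on the d option of GNU tar.)
verify_backup() {
    src=$1
    dest=$2
    parent=$(dirname "$src")
    name=$(basename "$src")
    # c = create an archive, f = write it to the file or device that follows
    tar -cf "$dest" -C "$parent" "$name" || return 1
    # d = compare archive members with the files below $parent
    tar -df "$dest" -C "$parent" || return 1
    echo "backup of $src verified"
}
```

On a real system you would run something like verify_backup /etc /dev/st0 as root, where /dev/st0 is only an assumed name for a rewinding tape device. If the comparison step reports differences or read errors, you have found the problem now rather than on the day you need the tape.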
If the Linux commands are too cumbersome (and they are for many newcomers), a couple of commercial products are available. One such product is Lone-Tar from Cactus International. I have used Lone-Tar for years on a few systems and have found it very easy to use. The front end is mostly shell scripts that you can modify to fit your needs.
In general, Lone-Tar takes a differential approach to making backups. You create one Master Backup and all subsequent backups contain those files that have changed since the master was created. I find this the best approach if your master backup takes more than one tape. However, if it all fits on one tape, you can configure Lone-Tar always to do masters.
Cactus also produces several other products for Linux, including Kermit, and some excellent DOS tools. I suggest you check them out. Demo versions are available from the Cactus Web site.
Like religion, the choice of backup software is a matter of personal preference. I use Lone-Tar for Linux along with their DOS Tar product because I have a good relationship with the company president, Jeff Hyman. Lone-Tar makes backups easy to make and easy to restore. There is even a Linux demo on the Lone-Tar Web site. The Craftworks distribution has a demo version of the BRU backup software.
After you are sure that the tape drive works correctly, you should create a boot/root floppy. A boot/root floppy is a pair of floppies that you use to boot your system. The first floppy contains the necessary files to boot and the root floppy contains the root file system.
Now that you are sure that your tape drive and your boot/root floppy set work, you can begin to install the rest of your software and hardware. My preference is to completely install the rest of the software first, before moving on to the hardware. There is less to go wrong with the software (at least, little that keeps the system from booting) and you can, therefore, install several products in succession. When installing hardware, you should install and test each component before you go on to the next one.
I think it is a good idea to make a copy of your kernel source (/usr/src/linux) before you make any changes to your hardware configuration or add any patches. That way, you can quickly restore the entire directory and don’t have to worry about restoring from tape or the distribution CD-ROM.
I suggest that you use a name that is clearer than /usr/src.BAK. Six months after you create it, you’ll have no idea how old it is or whether the contents are still valid. If you name it something like /usr/src.06AUG95, it is obvious when it was created.
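As an illustration, the copy and the dated name can be produced in one step. The sketch below uses a numeric YYYYMMDD suffix rather than the 06AUG95 style shown above, simply because it sorts correctly in a directory listing. The name snapshot_dir is hypothetical, and cp -a (available in GNU cp) is used so that permissions, ownership, and symbolic links are preserved.

```shell
#!/bin/sh
# snapshot_dir DIR
# Copies DIR to DIR.YYYYMMDD and prints the name of the copy.
# (Hypothetical helper; refuses to overwrite a copy made earlier the same day.)
snapshot_dir() {
    dir=${1%/}                      # strip a trailing slash, if any
    copy="$dir.$(date +%Y%m%d)"
    if [ -e "$copy" ]; then
        echo "refusing to overwrite existing $copy" >&2
        return 1
    fi
    cp -a "$dir" "$copy" && echo "$copy"
}
```

For the kernel source this would be called as snapshot_dir /usr/src/linux, which leaves a copy with a name such as /usr/src/linux.20060915.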
Now, make the changes and test the new kernel. After you are sure that the new kernel works correctly, make a new copy of the kernel source and make more changes. Although this is a slow process, it does limit the potential for problems, plus if you do run into problems, you can easily back out by restoring the copy of the kernel source.
As you make the changes, remember to record all the hardware and software settings for anything you install. Although you can quickly restore the previous copy of the kernel source if something goes wrong, writing down the changes can be helpful if you need to call tech support or post a message to the Internet.
Once the system is configured the way you want, make a backup of the entire installed system on a different tape than just the base operating system. I like to have the base operating system on a separate tape in case I want to make some major revisions to my software and hardware configuration. That way, if something major goes wrong, I don’t have to pull out pieces, hoping that I didn’t forget something. I have a known starting point from which I can build.
At this point, you should come up with a backup schedule. One of the first things to consider is that you should back up as often as necessary. If you can only afford to lose one day’s worth of work, then backing up every night is fine. Some people back up once during lunch and once at the end of the day. More often than twice a day may be too great a load on the system. If you feel that you have to do it more often, you might want to consider disk mirroring or some other level of RAID.
The Linux kernel supports RAID 0 (disk striping), which improves performance but provides no redundancy. The kernel’s software RAID (md) driver also supports redundant levels such as RAID 1 (mirroring) and RAID 5, and many hardware RAID controllers work with Linux as well.
The type of backup you do depends on several factors. If it takes 10 tapes to do a backup, then doing a full backup of the system (that is, backing up everything) every night is difficult to swallow. You might consider getting a larger tape drive. In a case where a full backup every night is not possible, you have a few alternatives.
First, you can make a list of the directories that change, such as /home and /etc. You can then use tar to back up just those directories. This has the disadvantage that you must manually find the directories that change, and you might miss something or back up too much.
Next, there are incremental backups. These start with a master, which is a backup of the entire system. The next backup only records the things that have changed since the last incremental. This can be expanded to several levels. Each level backs up everything that has changed since the last backup of that or the next lower level.
Note that often the term “master” is used with a specific set of data. For example, you might have a master database backup or a master webserver backup, neither of which is a complete system backup. As long as you are consistent in your terminology, this does not really matter. For example, you can create an incremental webserver backup to go along with the master webserver backup, just as you can create an incremental system backup to go along with the master system backup.
For example, level 2 backs up everything since the last level 1 or the last level 0 (whichever is more recent). You might do a level 0 backup once a month (which is a full backup of everything), then a level 1 backup every Wednesday and Friday and a level 2 backup every other day of the week. Therefore, on Monday, the level 2 will back up everything that has changed since the level 1 backup on Friday. The level 2 backup on Tuesday will back up everything since the level 2 backup on Monday. Then on Wednesday, the level 1 backup backs up everything since the level 1 backup on the previous Friday.
At the end of the month, you do a level 0 backup that backs up everything. Let’s assume this falls on a Tuesday, which would normally be a level 2 day. The level 1 backup on Wednesday then backs up everything that has changed since the level 0 backup (the day before) and not since the level 1 backup on the previous Friday.
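One way to implement such levels is with the --listed-incremental option of GNU tar, which records the state of each backup in a snapshot file. The sketch below keeps one snapshot file per level and, for any level above 0, starts from the most recent snapshot of the same or a lower level, which matches the rule described above. The function name, the snapshot.N file names, and the argument order are all my own assumptions for illustration, and "most recent" is judged by the modification time of the snapshot files.

```shell
#!/bin/sh
# level_backup LEVEL SRC SNAPDIR ARCHIVE
# Writes a level-LEVEL backup of directory SRC to ARCHIVE, keeping
# GNU tar snapshot files in SNAPDIR. (Hypothetical helper.)
level_backup() {
    level=$1 src=$2 snapdir=$3 archive=$4
    snap="$snapdir/snapshot.$level"
    if [ "$level" -eq 0 ]; then
        rm -f "$snap"               # no snapshot file: tar dumps everything
    else
        # Find the newest snapshot among levels 0..LEVEL.
        base=""
        i=0
        while [ "$i" -le "$level" ]; do
            cand="$snapdir/snapshot.$i"
            if [ -f "$cand" ] && { [ -z "$base" ] || [ "$cand" -nt "$base" ]; }; then
                base=$cand
            fi
            i=$((i + 1))
        done
        if [ -z "$base" ]; then
            echo "no earlier backup found; run a level 0 first" >&2
            return 1
        fi
        # Work on a copy so the lower level's snapshot stays untouched.
        [ "$base" = "$snap" ] || cp "$base" "$snap"
    fi
    tar --listed-incremental="$snap" -cf "$archive" \
        -C "$(dirname "$src")" "$(basename "$src")"
}
```

With this in place, the monthly full backup would be level_backup 0, the Wednesday and Friday backups level_backup 1, and the remaining days level_backup 2, each with its own archive name or tape. To restore, you extract the level 0 archive first and then each later archive in the order it was made.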
A somewhat simpler scheme uses differential backups. Here, there is also a master. However, subsequent backups will record everything that has changed (is different) from the master. If you do a master once a week and differentials once a day, then something that is changed on the day after the master is recorded on every subsequent backup.
A modified version of the differential backup does a complete, level 0 backup on Friday. Then on each of the other days, a level 1 backup is done. Therefore, the backups from Monday through Thursday will each back up everything since the day before. This is easier to maintain, but you may have to go through five tapes.
The third type, the simplest method, is where you do a master backup every day and forget about increments and differences. This is the method I prefer if the whole system fits on one tape because you save time when you have to restore your system. With either of the other methods, you will probably need to go through at least two tapes to recover your data, unless the crash occurs on the day after the last master. If you do a full backup every night, then there is only one backup to load. If the backup fits on a single tape (or at most, two), then I highly recommend doing a full backup every night. Remember that the key issue is getting your people back to work as soon as possible. The average $5,000 per hour you stand to lose is much more than the cost of a larger tape drive (ca. 80GB, Sep. 2006).
This brings up another issue, and that is rotating tapes. If you are making either incremental or differential backups, then you must have multiple tapes. It makes no sense to make a master and then make an incremental on the same tape: the incremental overwrites the master, and there is no way to get the information from the master back.
If you make a master backup on the same tape every night, you can run into serious problems as well. What if the system crashes in the middle of the backup and trashes the tape? Your system is gone and so is the data. Also, if you discover after a couple of days that the information in a particular file is garbage and the master is only one day old, then it is worthless for getting the data back. Therefore, if you do full backups every night, use at least five tapes, one for each day of the week. (If you run seven days a week, then seven tapes is likewise a good idea.)
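The same rotation idea carries over if your nightly full backup goes to files on another disk instead of to tapes. As a sketch under that assumption, the function below keeps one archive per weekday and writes each new archive under a temporary name first, so that a crash in the middle of the backup does not destroy last week’s copy for that day. The name nightly_full and the full-N.tar naming scheme are hypothetical.

```shell
#!/bin/sh
# nightly_full SRC DESTDIR
# Writes a full backup of directory SRC into the slot for today's weekday,
# so each archive is overwritten only after a week. (Hypothetical helper.)
nightly_full() {
    src=$1
    destdir=$2
    slot=$(date +%u)                # 1 = Monday ... 7 = Sunday
    target="$destdir/full-$slot.tar"
    # Write to a temporary name and rename only once tar has succeeded.
    tar -cf "$target.tmp" -C "$(dirname "$src")" "$(basename "$src")" &&
        mv "$target.tmp" "$target"
}
```

With physical tapes, the equivalent is simply labeling the tapes Monday through Friday and loading the right one each evening.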
You don’t necessarily always have to back up to tape. If the amount of data that changes is fairly small, you could back up to floppies. This is probably only valid if your system is acting as a Web server and the data changes at irregular intervals. As with any backup, you need to weigh the time to recreate the data against the time to make the backup. If your data on the Web server is also stored elsewhere (like on the development machine), it may be easier to back up the Web server once after you get your configuration right, and then skip the backups. However, it’s your call.
Other choices for backup media include WORM (Write Once/Read Many) drives and CD-Recordable. These are only effective if the data isn’t going to change much. You could back up your Web server to one of these media and then quickly recover it if your machine crashes. Another option is to copy the data to another machine on the network where a backup is done. (You could also mount the file system you want to back up via NFS.)
Although most people get this far in thinking about the backup media, many forget about the physical safety of the media. If your computer room catches fire and the tapes or DVDs melt, then the most efficient backup scheme is worthless. Some companies have fireproof safes in which they keep the tapes. In smaller operations, the system administrator can take the tape from the night before home. This is normally only effective when you do masters every night. If you have a lot of tapes, you might consider companies that provide off-site storage facilities.
I worked for one datacenter provider that had a customer who paid for a courier to pick up “clones” of the master backup tapes once a week. There was a full system backup to tape on the weekend, and these tapes were then copied (cloned). On Tuesday, the courier would return the tapes from the previous week and pick up the new ones.
Although some commercial products are available (which I will get into in a moment), you can use the tools on your system. For example, you can use tar or cpio. Although tar is a bit easier to use, cpio does have a little more functionality. The tar command has the following basic format:
tar [options] [files]
An example might be
tar cvf /dev/fd0 /home /etc
This example would back up /home and /etc and write them to the floppy device /dev/fd0. The c option says to create an archive, v is verbose mode in which the names of all the files are output to stdout, and f says that tar should write the archive to the file that follows. In this case, you are writing to the device file /dev/fd0.
If you have a lot of directories, you can use the T option to specify a file containing the directories to back up. For example, if you had a file called file_list that contained the list of directories, the command might look like this:
tar cvfT /dev/fd0 file_list
To extract files, the syntax is essentially the same, except that you use the x option to extract. However, you can still use both the f and T options.
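To make the round trip concrete, here is a small sketch that wraps both directions in shell functions. The names backup_from_list and restore_to are hypothetical. The restore function uses the C option so that the files land under a scratch directory instead of on top of the originals, which is a safe way to check an archive. Note that GNU tar strips the leading / from absolute path names when it creates the archive and prints a notice to that effect.

```shell
#!/bin/sh
# backup_from_list LIST ARCHIVE
# Archives every file or directory named in LIST (one name per line).
backup_from_list() {
    # c = create, v = verbose, f = archive name, T = read names from LIST
    tar -cvf "$2" -T "$1"
}

# restore_to ARCHIVE TARGET
# Extracts ARCHIVE underneath directory TARGET rather than in place.
restore_to() {
    # x = extract, C = change to TARGET before extracting
    mkdir -p "$2" && tar -xvf "$1" -C "$2"
}
```

For example, backup_from_list file_list /dev/fd0 corresponds to the command described above, and restore_to /dev/fd0 /tmp/restore-test (a placeholder directory) unpacks the same archive for inspection.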
The GNU version of tar (which comes with most versions of Linux) has a very large number of options. One option I use is z, which I use to either compress or uncompress the archive (depending on which direction I am going). Because the archive is being filtered through gzip, you need to have gzip on your system. Although gzip is part of every Linux distribution, it may not be on your system. Also, if you want to copy the archive to another UNIX system, that system may not have gzip. Therefore, you can either skip the compression or use the Z (notice the uppercase) to use compress and uncompress.
Although I can imagine situations in which the other options might be useful, I have only used a few of them. The best place to look is the tar man-page.
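As a brief illustration of the z option, the sketch below creates a gzip-compressed archive and then asks gzip to test its integrity. The function name compressed_backup is hypothetical. Replacing z with Z would use compress instead, provided the compress program is installed.

```shell
#!/bin/sh
# compressed_backup ARCHIVE FILE...
# Creates a gzip-compressed tar archive and checks that it is readable.
# (Hypothetical helper.)
compressed_backup() {
    archive=$1
    shift
    # z = filter the archive through gzip; gzip -t tests the result
    tar -czf "$archive" "$@" && gzip -t "$archive"
}
```

The same z option works in the other direction: tar -tzf lists a compressed archive and tar -xzf extracts it.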
If your backup media can handle more than one set of backups, you can use the mt command to manage your tape drive. Among the functions that mt can perform is writing a “file mark,” which is simply a marker on the tape to indicate the end of an archive. To use this function, you must first write the backup to the no-rewind tape device (for example, /dev/nrft0). When the drive has written all of the archive to the tape, write the file mark to indicate where the end is.
Normally, when tar is complete and the tape device is closed, the tape rewinds. When you use the no-rewind device, the tar process finishes, but the tape does not rewind. You can then use the mt command to write the file mark at the tape’s current location, which is at the end of the tar archive. Even if there are multiple archives on a single tape, mt can position the tape at the start of a specific archive by counting file marks. Therefore, whenever you need to restore, you can access any of the archives. See the mt man-page for more detail.
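The following sketch shows how several archives on one tape might be written and then found again. It assumes a no-rewind device named /dev/nst0 (a common name for the first SCSI tape drive; yours may differ) and the rewind and fsf operations of mt. The function names are hypothetical. One caveat: many Linux tape drivers write a file mark automatically when the device is closed after writing, in which case an explicit mt weof is unnecessary, so check the documentation for your driver before adding one.

```shell
#!/bin/sh
# Assumed no-rewind tape device; override by setting TAPE beforehand.
TAPE=${TAPE:-/dev/nst0}

# append_archive FILE...
# Writes one more archive at the tape's current position.
append_archive() {
    tar -cf "$TAPE" "$@"
    # If your driver does not write a file mark on close, add one here:
    #   mt -f "$TAPE" weof
}

# list_archive N
# Rewinds, skips forward over N file marks, and lists the archive found
# there (N = 0 is the first archive on the tape).
list_archive() {
    mt -f "$TAPE" rewind || return 1
    if [ "$1" -gt 0 ]; then
        mt -f "$TAPE" fsf "$1" || return 1
    fi
    tar -tvf "$TAPE"
}
```

After running append_archive /etc and then append_archive /home, list_archive 1 would show the contents of the second archive. Replacing tar -tvf with tar -xvf in the same sequence restores from it.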