Looking Through Files

In the section on looking for files, we talked about various methods for finding a particular file on your system. Let’s assume for a moment that we were looking for a particular file, so we used the find command to search for a specific file name, but none of the commands we issued came up with a matching file. There was not a single match of any kind. This might mean that we removed the file. On the other hand, we might have named it yacht.txt or something similar. What can we do to find it?

We could jump through the same hoops, trying various spelling and letter combinations, as we did for yacht and boat. However, what if the customer had a canoe or a junk? Are we stuck trying every possible word for boat? Yes, unless we know something about the file, even if that something is inside the file.

The nice thing is that grep doesn’t have to be at the end of a pipe. Its arguments can include the names of files; grep takes its first argument as the pattern it should look for and treats the rest as the files to search, so you can give it several files at once. If we were to enter

grep '[Bb]oat' ./letters/taxes/*

we would search the contents of all the files in the directory ./letters/taxes looking for the word Boat or boat.
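As a runnable sketch of this, using throwaway files in a temporary directory (the sample contents are made up, standing in for the ./letters/taxes example):

```shell
# Hypothetical sample files standing in for ./letters/taxes
dir=$(mktemp -d)
printf 'We sold the boat last week.\n' > "$dir/file1"
printf 'Taxes are due in April.\n' > "$dir/file2"
printf 'The Boat needs repairs.\n' > "$dir/file3"

# Quote the pattern so the shell does not try to expand the brackets itself
grep '[Bb]oat' "$dir"/*
```

Because grep was given more than one file, each matching line comes out prefixed with the name of the file it was found in.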

If the file we were looking for happened to be in the directory ./letters/taxes, then all we would need to do is run more on the file. If things are like the examples above, where we have dozens of directories to look through, this is impractical. So, we turn back to find.

One useful option to find is -exec, which runs a command on each file that is found. We can therefore use find to locate the files, then use -exec to run grep on them. Still, you might be asking yourself what good this is to you. Because you probably don’t have dozens of files on your system related to taxes, let’s use an example with files that you almost certainly do have.

Let’s find all the files in the /etc directory containing /bin/sh. This would be run as

find /etc -exec grep /bin/sh {} \;

The curly braces ({}) are replaced by the name of each file found by the search, so the actual grep command would be something like

grep /bin/sh /etc/filename

The “\;” is a flag saying that this is the end of the command.
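Here is a self-contained sketch of the -exec form, using a scratch directory instead of /etc (the files in /etc vary from system to system, so the sample files here are made up):

```shell
# Scratch directory with one file that contains /bin/sh and one that does not
dir=$(mktemp -d)
printf '#!/bin/sh\necho hello\n' > "$dir/script1"
printf 'no shell here\n' > "$dir/notes"

# For each file found, {} is replaced by the file name and \; ends the command
# (-type f keeps find from handing the directory itself to grep)
find "$dir" -type f -exec grep /bin/sh {} \;
```

This prints the matching line, #!/bin/sh, with no file name in front of it, because each invocation of grep sees only a single file.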

Back in our taxes example, the equivalent command would be find ./letters/taxes -exec grep '[Bb]oat' {} \;. What the find command does is search for all the files that match the specified criteria (in this case there were no criteria, so it finds them all) and then run grep on each file it found, searching for the pattern [Bb]oat.

Do you know what this tells us? It says that there is a file somewhere under the directory ./letters/taxes that contains either “boat” or “Boat.” It doesn’t tell us what the file name is, because of the way -exec is handled. Each file name is handed off one at a time, replacing the {}. It is as though we had entered individual lines for

grep '[Bb]oat' ./letters/taxes/file1

grep '[Bb]oat' ./letters/taxes/file2

grep '[Bb]oat' ./letters/taxes/file3

If we had entered

grep '[Bb]oat' ./letters/taxes/*

grep would have output the name of the file in front of each matching line it found. However, because each file is handed to grep separately when using find with -exec, we don’t see the file names. We could use the -l option to grep, but that would give us only the file name. That might be okay if there were just one or two files. However, if a line in a file mentioned a “boat trip” or a “boat trailer,” it might not be what we were looking for, and with the -l option we wouldn’t see the actual line to judge. It’s a catch-22.

To get what we need, we must introduce a new command: xargs. By using it as one end of a pipe, you can repeat the same command on different files without actually having to input the command multiple times.

In this case, we would get what we wanted by typing

find ./letters/taxes -print | xargs grep '[Bb]oat'

The first part is the same as we talked about earlier. The find command simply prints all the names it finds (all of them, in this case, because there were no search criteria) and passes them to xargs. Then xargs builds grep commands from those names. Unlike the -exec option to find, which runs grep on one file at a time, xargs hands grep several file names at once, so grep outputs the name of the file before each matching line.
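A quick sketch of the difference, again with made-up scratch files rather than real tax letters:

```shell
# Hypothetical sample files
dir=$(mktemp -d)
printf 'boat for sale\n' > "$dir/ad"
printf 'tax return\n' > "$dir/irs"

# xargs passes grep several file names at once, so grep prints
# the file name in front of each matching line
find "$dir" -type f -print | xargs grep '[Bb]oat'
```

The output line begins with the path of the matching file, which is exactly the information the -exec form withheld.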

Obviously, this example does not find those instances where the file we were looking for contained words like “yacht” or “canoe” instead of “boat.” Unfortunately, the only way to catch all possibilities is to actually specify each one. So, that’s what we might do. Rather than listing every possible synonym for boat, let’s just take three: boat, yacht, and canoe.

To do this, we need to run the find | xargs command three times. However, rather than typing in the command each time, we are going to take advantage of a useful aspect of the shell. In some instances, the shell knows when you want to continue with a command and gives you a secondary prompt. If you are running sh or ksh, then this is probably denoted as “>.”

For example, if we typed

find ./letters/taxes -print |

the shell knows that the pipe (|) cannot be at the end of the line. It then gives us a > or ? prompt where we can continue typing

> xargs grep -i boat

The shell interprets these two lines as if we had typed them all on the same line. We can use this with a shell construct that lets us do loops. This is the for/in construct for sh and ksh, and the foreach construct in csh. It would look like this:

for j in boat yacht canoe
> do
> find ./letters/taxes -print | xargs grep -i $j
> done

In this case, we are using the variable j, although we could have called it anything we wanted. When we put together quick little commands, we save ourselves a little typing by using single letter variables.

In the bash/sh/ksh example, we need to enclose the body of the loop inside the do-done pair. In the csh example, we would need to close the loop with end. In both cases, this little command we have written will loop through three times. Each time, the variable $j is replaced with one of the three words that we used. If we had thought up another dozen or so synonyms for boat, then we could have included them all. Remember also that the shell knows that the pipe (|) cannot end a command, so this would work as well:

for j in boat yacht canoe
> do
> find ./letters/taxes -print |
> xargs grep -i $j
> done
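Written out as a runnable script, with a scratch directory standing in for ./letters/taxes (the sample file contents are invented for the sketch):

```shell
# Made-up sample files, one hit per search word
dir=$(mktemp -d)
printf 'my yacht is fast\n' > "$dir/a"
printf 'the canoe leaks\n' > "$dir/b"
printf 'boat for sale\n' > "$dir/c"
printf 'nothing relevant\n' > "$dir/d"

# Each pass through the loop searches every file for one synonym
for j in boat yacht canoe
do
    find "$dir" -type f -print | xargs grep -i "$j"
done
```

Quoting "$j" is a small safety habit: it keeps the shell from splitting or glob-expanding the word before grep sees it.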

Doing this from the command line has a drawback. If we want to use the same command again, we need to retype everything. However, using another trick, we can save the command. Remember that both ksh and csh have history mechanisms that allow you to repeat and edit commands that you recently entered. However, what happens tomorrow when you want to run the command again? Granted, ksh has the .sh_history file, but what about sh and csh?

Why not save commands that we use often in a file that we have all the time? To do this, you would create a basic shell script, and we have a whole section just on that topic.

When looking through files, I am often confronted with the situation where I am not just looking for a single string, but for several possible matches. Imagine a data file that contains a list of machines and their various characteristics, with each characteristic on a separate line that starts with its name. For example:

Name: lin-db-01
IP: 192.168.22.10
Make: HP
CPU: 2300
RAM: 512
Location: Room 3

All I want is the computer name, the IP address, and the location, but not the others. I could do three individual greps, each with a different pattern. However, it would be difficult to make the association between the separate entries. That is, the first time I would have a list of machine names, the second time a list of IP addresses, and the third time a list of locations. I have written scripts before that handle this kind of situation, but in this case it would be easier to use a standard Linux command: egrep.

The egrep command is an extension of the basic grep command (the “e” stands for “extended”). Older versions of grep did not have the ability to use things like [:alpha:] to represent alphabetic characters, so extended grep was born. For details on representing characters like this, check out the section on regular expressions and metacharacters.

One extension is the ability to have multiple search patterns that are checked simultaneously. That is, if any of the patterns are found, the line is displayed. So in the problem above we might have a command like this:

egrep 'Name:|IP:|Location:' FILENAME

This would then list all of the respective lines in order, making the association between the name and the other values a piece of cake.
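Putting the pieces together as a self-contained sketch (the machine data is the made-up sample from above, written to a temporary file):

```shell
# Sample data file in the format described above (values are invented)
file=$(mktemp)
cat > "$file" <<'EOF'
Name: lin-db-01
IP: 192.168.22.10
Make: HP
CPU: 2300
RAM: 512
Location: Room 3
EOF

# A line is printed if it matches any one of the alternatives
egrep 'Name:|IP:|Location:' "$file"
```

On modern systems, grep -E 'Name:|IP:|Location:' is the equivalent spelling.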

Another variant of grep is fgrep, which interprets the search pattern as a list of fixed strings, separated by newlines, any of which is to be matched. On some systems, grep, egrep and fgrep will all be a hard link to the same file.

I am often confronted with files where I want to filter out the “noise”. That is, there is a lot of stuff in the files that I don’t want to see. A common example is looking through large shell scripts or configuration files when I am not sure exactly what I am looking for. I know it when I see it, but simply grepping for a term is impossible, because I am not sure what the term is. Therefore, it would be nice to ignore things like comments and empty lines.

Once again we could use egrep, as there are two expressions we want to match. However, this time we also use the -v option, which simply flips, or inverts, the meaning of the match. Let’s say there was a start-up script that contained a variable you were looking for. You might have something like this:

egrep -v '^$|^#' /etc/rc.d/* | more

The first part of the expression says to match the beginning of the line (^) followed immediately by the end of the line ($), which turns out to match all empty lines. The second part of the expression says to match all lines that start with the pound sign (a comment). Filtering both out ends up giving me all of the “interesting” lines in the file. The long option is easier to remember: --invert-match.
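The same filter applied to a small, made-up configuration file (so the example is self-contained rather than depending on /etc/rc.d):

```shell
# Invented start-up file containing comments, blank lines, and settings
file=$(mktemp)
cat > "$file" <<'EOF'
# sample start-up script

PATH=/usr/local/bin:$PATH

# set the default editor
EDITOR=vi
EOF

# -v inverts the match: drop empty lines (^$) and comment lines (^#)
egrep -v '^$|^#' "$file"
```

Only the two assignment lines survive the filter; the comments and blank lines are the “noise” that gets dropped.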

You may also run into a case where all you are interested in is which files contain a particular expression. This is where the -l option comes in (long version: --files-with-matches). For example, when I made some style changes to my web site, I wanted to find all of the files that contained a table. This means the file had to contain the <TABLE> tag. Since this tag could contain some options, I was interested in all of the files which contained “<TABLE”. This could be done like this:

grep -l '<TABLE' FILENAME

There is an important thing to note here. In the section on interpreting the command, we learn that the shell sets up file redirection before it tries to execute the command. If we don’t include the less-than symbol in the single quotes, the shell will try to redirect the input from a file named “TABLE”. See the section on quotes for details on this.

The -l option (long version: --files-with-matches) says to simply list the file names. The -L option (long version: --files-without-match) does the opposite, listing only those files that contain no match at all. (Note that this is not quite the same as combining -v and -l, which lists any file containing at least one non-matching line.) In both cases, the lines containing the matches are not displayed, just the file names.
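A small sketch of -l and -L side by side, using two invented HTML files in a scratch directory:

```shell
# One file with a <TABLE tag, one without (contents are made up)
dir=$(mktemp -d)
printf '<TABLE border=1>\n' > "$dir/tables.html"
printf '<P>plain text</P>\n' > "$dir/plain.html"

grep -l '<TABLE' "$dir"/*   # lists only the file that matches
grep -L '<TABLE' "$dir"/*   # lists only the file that does not
```

Neither command shows the matching lines themselves; each prints file names only.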

Another common option is -q (long: --quiet or --silent). This does not display anything at all. So, what’s the use in that? Well, often you simply want to know whether a particular value exists in a file. Regardless of the options you use, grep will return an exit status of 0 if any matches were found and 1 if no matches were found. If you check the $? variable after running grep -q and it is 0, you found a match. Check out the section on basic shell scripting for details on $? and other variables.
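For instance, a script can branch on the exit status without ever showing the output (the file contents and search word here are invented for the sketch):

```shell
# Made-up sample file
file=$(mktemp)
echo 'the backup daemon is running' > "$file"

# -q prints nothing; the exit status in $? carries the answer
if grep -q daemon "$file"
then
    echo "found it"
else
    echo "no match"
fi
```

This is exactly the shape of the process-checking scripts mentioned below: test quietly, then act on the result.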

Keep in mind that you do not need to use grep to read through files. Instead, it can be one end of a pipe. For example, I have a number of scripts that look through the process list to see if a particular process is running. If so, then I know all is well. However, if the process is not running, a message is sent to the administrators.