Regular Expressions and Metacharacters

Often, the arguments that you pass to commands are file names. For example, if you wanted to edit a file called letter, you could enter the command vi letter. In many cases, typing the entire name is not necessary. Built into the shell are special characters that it will use to expand the name. These are called metacharacters.

The most common metacharacter
is *. The * is used to represent any number of
characters, including zero. For example, if we have a file in our current
directory called letter and we input

vi let*

the shell would expand this to


vi letter

Or, if we had a file simply called let, this would match as well.

What if, instead, we had several files called letter.chris, letter.daniel, and letter.david? If we typed vi letter*, the shell would expand the pattern to all three names and give us the command

vi letter.chris letter.daniel letter.david

We could also type in vi letter.da*, which would be expanded to

vi letter.daniel letter.david

If we only wanted to edit the letter to chris, we could type it in as vi *chris. However, if there were two files, letter.chris and note.chris, the command vi *chris would have the same results as if we typed in:

vi letter.chris note.chris

In other words, no matter where the asterisk appears, the shell expands it to match every name it finds. If my current directory contained files with matching names, the shell would expand them properly. However, if there were no matching names, file name expansion couldn’t take place and the file name would be taken literally.


For example, if there were no file name in our current directory that began
with letter, the command


vi letter*

could not be expanded and we would end up editing a new file called (literally) letter*, including the asterisk. This would not be what we wanted.
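To see what the shell does with an asterisk before the command ever runs, we can hand the pattern to echo instead of vi. This is only a sketch: it assumes a Bourne-style shell (sh, or bash with its default settings), and the scratch directory and the name memo* are made up for the illustration.

```shell
# Work in a scratch directory so the example cannot touch real files.
dir=$(mktemp -d)
cd "$dir" || exit 1
touch letter.chris letter.daniel letter.david

# echo simply prints the arguments that the shell hands to it.
all=$(echo letter*)       # matches all three files
some=$(echo letter.da*)   # matches only daniel and david
none=$(echo memo*)        # nothing matches, so the pattern is passed on as typed

echo "$all"
echo "$some"
echo "$none"

# Clean up the scratch directory.
cd / && rm -rf "$dir"
```

The last line printed is the literal string memo*, which is exactly what vi would receive as a file name in the situation described above.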


What if we had a subdirectory called letters? If it contained the three files letter.chris, letter.daniel, and letter.david, we could get to them by typing

vi letters/letter*

This would expand to:

vi letters/letter.chris letters/letter.daniel letters/letter.david


The same rules for path names with commands also apply to file names. The
command


vi letters/letter.chris


is the same as


vi ./letters/letter.chris


which is the same as


vi /home/jimmo/letters/letter.chris


This is because the shell does the expansion before the arguments are passed
to the command. Therefore, even directory names are expanded, and the command


vi le*/letter.*

could be expanded to both letters/letter.chris and lease/letter.joe, or any similar combination.

The next wildcard is ?. This is expanded by the shell as one, and only one, character. For example, the command vi letter.chri? is the same as vi letter.chris. However, if we were to type in vi letter.chris? (note that the “?” comes after the “s” in chris), the result would be that we would begin editing a new file called (literally) letter.chris?. Again, not what we wanted. This wildcard could be used if, for example, there were two files named letter.chris1 and letter.chris2. The command vi letter.chris? would be the same as

vi letter.chris1 letter.chris2
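A quick way to convince yourself of the "one, and only one" rule is to let echo show the expansion. This is a minimal sketch that assumes a Bourne-style shell and uses a throwaway directory containing three of the file names from the text.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1
touch letter.chris letter.chris1 letter.chris2

# The ? stands for exactly one character, here the final "s".
one=$(echo letter.chri?)
# Here the ? must match one character AFTER the "s", so letter.chris is left out.
two=$(echo letter.chris?)

echo "$one"
echo "$two"

cd / && rm -rf "$dir"
```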

Another commonly used metacharacter is actually a pair of characters: [ ]. The square brackets are used to represent a list of possible characters. For example, if we were not sure whether our file was called letter.chris or letter.Chris, we could type in the command as: vi letter.[Cc]hris. So, no matter if the file was called letter.chris or letter.Chris, we would find it. What happens if both files exist? Just as with the other metacharacters, both are expanded and passed to vi. Note that in this example, vi letter.[Cc]hris appears to be the same as vi letter.?hris, but it is not always so.

The list that appears inside the square brackets does not have to be an upper- and lowercase combination of the same letter. The list can be made up of any letter, number, or even punctuation. (Note that some punctuation marks have special meaning, such as *, ?, and [ ], which we will cover shortly.) For example, if we had five files, letter.chris1-letter.chris5, we could edit all of them with vi letter.chris[12435].


A nice thing about this list is that if it is consecutive, we don’t need to
list all possibilities. Instead, we can use a dash (-) inside the brackets to
indicate that we mean a range. So, the command


vi letter.chris[12345]


could be shortened to


vi letter.chris[1-5]


What if we only wanted the first three and
the last one? No problem. We could specify it as


vi letter.chris[1-35]

This does not mean that we want files letter.chris1 through letter.chris35! Rather, we want letter.chris1, letter.chris2, letter.chris3, and letter.chris5. All entries in the list are seen as individual characters.

Inside the brackets, we are not limited to just numbers or just letters. We can use both. The command vi letter.chris[abc123] has the potential for editing six files: letter.chrisa, letter.chrisb, letter.chrisc, letter.chris1, letter.chris2, and letter.chris3.
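The bracket lists and ranges can be tried out safely with echo. The sketch below assumes a Bourne-style shell; the scratch directory and the particular set of files created in it are only there for the illustration.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1
touch letter.chris1 letter.chris2 letter.chris3 letter.chris4 letter.chris5 letter.chrisa

# [1-35] means "1 through 3, or 5", not "1 through 35".
ranged=$(echo letter.chris[1-35])
# Letters and digits can share one list; only names that exist are returned.
mixed=$(echo letter.chris[abc123])

echo "$ranged"
echo "$mixed"

cd / && rm -rf "$dir"
```

Note that letter.chrisb and letter.chrisc do not show up in the second result. The list only says which names are allowed to match, not which names exist.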


If we are so inclined, we can mix and match any of these metacharacters any
way we want. We can even use them multiple times in the same command. Let’s take
as an example the command


vi *.?hris[a-f1-5]

Should they exist in our current directory, this command would match all of the following:

letter.chrisa  note.chrisa  letter.chrisb  note.chrisb  letter.chrisc
note.chrisc    letter.chrisd  note.chrisd  letter.chrise  note.chrise
letter.chris1  note.chris1  letter.chris2  note.chris2  letter.chris3
note.chris3    letter.chris4  note.chris4  letter.chris5  note.chris5
letter.Chrisa  note.Chrisa  letter.Chrisb  note.Chrisb  letter.Chrisc
note.Chrisc    letter.Chrisd  note.Chrisd  letter.Chrise  note.Chrise
letter.Chris1  note.Chris1  letter.Chris2  note.Chris2  letter.Chris3
note.Chris3    letter.Chris4  note.Chris4  letter.Chris5  note.Chris5


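Any other name that ends the same way, such as memo.chris1, would match as
well. (A name with nothing in front of the period, such as .chrisa, would not
match, because a wildcard never matches a leading period.) Or, if we issued
the command:


vi *.d*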

these would match

letter.daniel note.daniel letter.david note.david

Remember, I said that the shell expands the metacharacters only with respect to the name specified. This obviously works for file names as I described above. However, it works for command names as well.

If we were to type dat* and there was nothing in our current directory that started with dat, we would get a message like

dat*: not found

However, if we were to type /bin/dat*, the shell could successfully expand this to be /bin/date, which it would then execute. The same applies to relative paths. If we were in / and entered ./bin/dat* or bin/dat*, both would be expanded properly and the right command would be executed. If we entered the command /bin/dat[abcdef], we would get the right response as well because the shell tries all six letters listed and finds a match with /bin/date.


An important thing to note is that the shell expands as much as it can
before it attempts to interpret a command. I was reminded of this fact by
accident when I input /bin/l*. If you do an

ls -l /bin/l*

you should get output like this:

-rwxr-xr-x 1 root root 22340 Sep 20 06:24 /bin/ln
-r-xr-xr-x 1 root root 25020 Sep 20 06:17 /bin/login
-rwxr-xr-x 1 root root 47584 Sep 20 06:24 /bin/ls

At first, I expected each one of the files in /bin that began with an “l” (ell) to be executed. Then I remembered that expansion takes place before the command is interpreted. Therefore, the command that I input, /bin/l*, was expanded to be

/bin/ln /bin/login /bin/ls


Because /bin/ln was the first command in the list, the system expected that
I wanted to link the two files together (what /bin/ln is used for). I ended up
with the error message:

/bin/ln: /bin/ls: File exists

This is because the system thought I was trying to link the file /bin/login to /bin/ls, which already existed. Hence the message.


The same thing happens when I input /bin/l? because the /bin/ln is expanded
first. If I issue the command /bin/l[abcd], I get the message that there is no
such file. If I type in


/bin/l[a-n]


I get:

/bin/ln: missing file argument

because the /bin/ln command expects two file names as arguments and the only thing that matched is /bin/ln.
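The whole episode can be replayed without running anything from /bin. In the sketch below, the three empty files are stand-ins that merely carry the same names, and show_args is a hypothetical helper whose only job is to report what it was handed. A Bourne-style shell is assumed.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1
touch ln login ls          # empty stand-ins named like the files in /bin

# Hypothetical helper: prints how many arguments arrived and what they were.
show_args() {
    echo "$# arguments: $*"
}

# The function never sees "l*"; it sees the three expanded names.
received=$(show_args l*)
echo "$received"

# When the pattern is typed as the command itself, the first word of the
# expansion becomes the command name and the rest become its arguments.
set -- l*
cmd=$1
shift
rest="$*"
echo "command: $cmd"
echo "arguments: $rest"

cd / && rm -rf "$dir"
```

This is the same split that turned /bin/l* into "run ln with login and ls as its arguments".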

I first learned about this aspect of shell expansion after a couple of hours of trying to extract a specific subdirectory from a tape that I had made with the cpio command. Because I made the tape using absolute paths, I attempted to restore the files as /home/jimmo/data/*. Rather than restoring the entire directory as I expected, it did nothing. It worked its way through the tape until it got to the end and then rewound itself without extracting any files.

At first I assumed I made a typing error, so I started all over. The next time, I checked the command before I sent it on its way. After half an hour or so of whirring, the tape was back at the beginning. Still no files. Then it dawned on me that I hadn’t told cpio to overwrite existing files unconditionally. So I started it all over again.

Now, those of you who know cpio realize that this wasn’t the issue either. At least not entirely. When the tape got to the right spot, it started overwriting everything in the directory (as I told it to). However, the files that were missing (the ones that I really wanted to get back) were still not copied from the backup tape.

The next time, I decided to just get a listing of all the files on the tape. Maybe the files I wanted were not on this tape. After a while it reached the right directory and lo and behold, there were the files that I wanted. I could see them on the tape, I just couldn’t extract them.

Well, the first idea that popped into my mind was to restore everything. That’s sort of like fixing a flat tire by buying a new car. Then I thought about restoring the entire tape into a temporary directory where I could then get the files I wanted. Even if I had the space, this still seemed like the wrong way of doing things.

Then it hit me. I was going about it the wrong way. The solution was to go ask someone what I was doing wrong. I asked one of the more senior engineers (I had only been there less than a year at the time). When I mentioned that I was using wildcards, it was immediately obvious what I was doing wrong (obvious to him, not to me).

Let’s think about it for a minute. It is the shell that does the expansion, not the command itself (like when I ran /bin/l*). The shell interprets the command as starting with /bin/l. Therefore, I get a listing of all the files in /bin that start with “l”. With cpio, the situation is similar.

When I first ran it, the shell interpreted the files (/home/jimmo/data/*) before passing them to cpio. Because I hadn’t told cpio to overwrite the files, it did nothing. When I told cpio to overwrite the files, it only did so for the files that it was told to. That is, only the files that the shell saw when it expanded /home/jimmo/data/*. In other words, cpio did what it was told. I just told it to do something that I hadn’t expected.

The solution is to find a way to pass the wildcards to cpio. That is, the shell must ignore the special significance of the asterisk. Fortunately, there is a way to do this. By placing a back-slash (\) before the metacharacter, you remove its special significance. This is referred to as “escaping” that character.

So, in my situation with cpio, when I referred to the files I wanted as /home/jimmo/data/\*, the shell passed the arguments to cpio as /home/jimmo/data/*. It was then cpio that expanded the * to mean all the files in that directory. Once I did that, I got the files I wanted.


You can also protect the metacharacters from being expanded by enclosing the
entire expression in single quotes. This is necessary because it is the shell
that first expands wildcards before passing them to the program. Note also
that if the wildcard cannot be expanded, the entire expression (including the
metacharacters) is passed as an argument to the program. Some programs are
capable of expanding the metacharacters themselves.
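The difference between an expanded, an escaped, and a quoted pattern is easy to see with echo. This is a small sketch assuming a Bourne-style shell; the directory and the two .txt files are invented for the example.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1
touch a.txt b.txt

expanded=$(echo *.txt)     # the shell replaces the pattern with file names
escaped=$(echo \*.txt)     # the backslash makes the asterisk an ordinary character
quoted=$(echo '*.txt')     # single quotes protect everything inside them

echo "$expanded"
echo "$escaped"
echo "$quoted"

cd / && rm -rf "$dir"
```

In the last two cases the program receives the pattern itself, which is what a program that does its own matching needs.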

As in other places, the exclamation mark (!) has a special meaning. (That is, it is also a metacharacter.) When it is the first character inside square brackets, the exclamation mark negates the set of characters. For example, if we wanted to list all files that did not have a number at the end, we could do something like this

ls *[!0-9]

This is certainly faster than typing this

ls *[a-zA-Z]

However, this second example does not mean the same thing. In the first case, we are saying we do not want numbers. In the second case, we are saying we only want letters. There is a key difference because in the second case we do not include the punctuation marks and other symbols.
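A file whose name ends in a punctuation mark shows the difference. The sketch below assumes a Bourne-style shell, uses three invented file names, and sets LC_ALL=C so that the range a-z means the plain ASCII letters. In other locales a range can cover more characters than you might expect.

```shell
LC_ALL=C
export LC_ALL

dir=$(mktemp -d)
cd "$dir" || exit 1
touch report1 notes draft_

# "Anything except a digit" lets the trailing underscore through.
not_digit=$(echo *[!0-9])
# "Only a letter" does not.
letters_only=$(echo *[a-zA-Z])

echo "$not_digit"
echo "$letters_only"

cd / && rm -rf "$dir"
```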

Another symbol with special meaning is the dollar sign ($). This is used as a marker to indicate that something is a variable. I mentioned earlier in this section that you could get access to your login name environment variable by typing:

echo $LOGNAME

The system stores your login name in the environment variable LOGNAME (note no “$”). The system needs some way of knowing that when you input this on the command line, you are talking about the variable LOGNAME and not the literal string LOGNAME. This is done with the “$”. Several variables are set by the system. You can also set variables yourself and use them later on. I’ll get into more detail about shell variables later.

So far, we have been talking about metacharacters used for searching the names of files. However, metacharacters can often be used in the arguments to certain commands. One example is the grep command, which is used to search for strings within files. The name grep comes from the ed editor command g/re/p, which globally searches for a regular expression and prints the matching lines. As its name implies, it has something to do with regular expressions. Let’s assume we have a text file called documents, and we wish to see if the string “letter” exists in that text. The command might be

grep letter documents

This will search for and print out every line containing the string “letter.” This includes such things as “letterbox,” “lettercarrier,” and even “love-letter.” However, it will not find “Letterman,” because we did not tell grep to ignore upper- and lowercase (using the -i option). To do so using regular expressions, the command might look like this

grep [Ll]etter documents

Now, because we specified to look for either “L” or “l” followed by “etter,” we get both “letter” and “Letterman.” We can also specify that we want to look for this string only when it appears at the beginning of a line using the caret (^) symbol. For example

grep ^[Ll]etter documents

This searches for all lines that start with the “beginning-of-line,” followed by either “L” or “l,” followed by “etter.” Or, if we want to search for the same string at the end of the line, we would use the dollar sign to indicate the end of the line. Note that at the beginning of a string, the dollar sign tells the shell that a variable name follows, whereas at the end of a string, it indicates the end of the line. Confused? Let’s look at an example. Let’s define a string like this:

VAR=^[Ll]etter

If we echo that string, we simply get ^[Ll]etter. Note that this includes the caret at the beginning of the string. When we do a search like this

grep $VAR documents

it is equivalent to

grep ^[Ll]etter documents

Now, if we write the same command like this

grep $VAR$ documents

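This says to find the string defined by the VAR variable (^[Ll]etter), but only if it is at the end of the line. Because VAR itself begins with a caret, the expanded pattern ^[Ll]etter$ is also tied to the beginning of the line. Here we have an example where the dollar sign has both meanings. If we then take it one step further: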

grep ^$VAR$ documents

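This says to find the string defined by the VAR variable, but only if it takes up the entire line. In other words, the line consists only of the beginning of the line (^), the string defined by VAR, and the end of the line ($). Note that this form assumes VAR holds just [Ll]etter. With the definition used earlier, the expanded pattern would contain two carets.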
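The anchors are easier to follow with a concrete file. The following is a sketch only: the four lines written to documents are invented sample data, VAR is assumed to hold the pattern without a caret, and the patterns are quoted so that the shell hands them to grep unchanged. The -c option makes grep print the number of matching lines.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1

# Invented sample data, one entry per line.
printf '%s\n' 'letter' 'Letterman was here' 'a love-letter' 'the last letter' > documents

VAR='[Ll]etter'

anywhere=$(grep -c "${VAR}" documents)      # the string appears somewhere in the line
at_start=$(grep -c "^${VAR}" documents)     # the line begins with it
at_end=$(grep -c "${VAR}$" documents)       # the line ends with it
whole=$(grep -c "^${VAR}$" documents)       # the line contains nothing else

echo "$anywhere $at_start $at_end $whole"

cd / && rm -rf "$dir"
```

Inside double quotes the shell still expands ${VAR}, but a dollar sign that is not followed by a name is left alone and reaches grep as the end-of-line marker.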

Here I want to sidestep a little. An expression like $VAR$ might be confusing to some people. Further, if you combine a variable with other characters, you may end up with something you do not expect because the shell treats those characters as part of the variable name. To prevent this, it is a good idea to enclose the variable name within curly braces, like this:

${VAR}$

The curly braces tell the shell exactly what belongs to the variable name. I try to always enclose the variable name within curly braces to ensure that there is no confusion. The braces also keep things readable when combining variables like this:

${VAR1}${VAR2}
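Here is a small sketch of what can go wrong without the braces. All of the variable names and values are made up for the illustration, and VARbox exists only to show that the shell reads the longest name it can.

```shell
VAR=letter
VARbox="(some other variable)"
VAR1=love-
VAR2=letter

# Without braces the shell reads the name as VARbox, not VAR followed by "box".
plain="$VARbox"
# With braces the name ends where we say it ends.
braced="${VAR}box"
# Two variables placed back to back.
joined="${VAR1}${VAR2}"

echo "$plain"
echo "$braced"
echo "$joined"
```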

Often you need to match a series of repeated characters, such as spaces, dashes and so forth. Although you could simply use the asterisk to specify any number of that particular character, you can run into problems on both ends. First, maybe you want to match a minimum number of that character. This could easily be solved by first repeating that character a certain number of times before you use the wildcard. For example, the expression


====*


would match at least three equal signs. Why three? Well, we have
explicitly put in three equal signs and the wildcard follows the fourth. Since
the asterisk can be zero or more, it could mean zero and therefore the
expression would only match three.

The next problem occurs when we want to limit the maximum number of characters that are matched. If you know exactly how many to match, you could simply use that many characters. What do you do if you have a minimum and a maximum? For this, you enclose the range within curly braces: {min,max}. For example, to specify at least 5 and at most 10, it would look like this: {5,10}. Keep in mind that in the basic regular expressions that grep uses by default, a curly brace only has this special meaning when it is preceded by a backslash. (With grep -E, the braces are written without backslashes.) So, let’s say we wanted to search a file for all sequences of digits between 5 and 10 digits long. We might have something like this:


grep "[0-9]\{5,10\}" FILENAME


This might seem a little complicated, but it would be far more complicated to
write a regular expression that searches for each combination individually.


As we mentioned above, to define a specific number of a particular character
you could simply input that character the desired number of times. However, try
counting 17 periods on a line or 17 lower-case letters ([a-z]). Imagine trying
to type in this combination 17 times! You could specify a range with a maximum
of 17 and a minimum of 17, like this: {17,17}. Although this would work,
you could save yourself a little typing by simply including just the single
value. Therefore, to match exactly 17 lower-case letters, you might have
something like this:


grep "[a-z]\{17\}" FILENAME


If we want to specify a minimum number of times, without a maximum, we simply
leave off the maximum, like this:


grep "[a-z]\{17,\}" FILENAME


This would match a pattern of at least 17 lower-case letters.
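The repetition counts can be checked against a few sample lines. This is a sketch under some assumptions: the file numbers and its three lines are invented, and grep is taken to use basic regular expressions by default, which is why the braces carry backslashes (grep -E takes them without). The -c option counts matching lines.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1

# Invented sample data: 4 digits, 5 digits, and 13 digits inside other text.
printf '%s\n' '1234' '12345' 'abc 1234567890123 xyz' > numbers

between=$(grep -c '[0-9]\{5,10\}' numbers)       # a run of 5 to 10 digits somewhere in the line
whole=$(grep -c '^[0-9]\{5,10\}$' numbers)       # the entire line is 5 to 10 digits
exact=$(grep -c '^[0-9]\{4\}$' numbers)          # the entire line is exactly 4 digits
at_least=$(grep -c '[0-9]\{11,\}' numbers)       # 11 or more digits in a row
extended=$(grep -E -c '[0-9]{5,10}' numbers)     # same as "between", extended syntax

echo "$between $whole $exact $at_least $extended"

cd / && rm -rf "$dir"
```

The first count includes the line with 13 digits, because a run of 13 digits contains a run of 10. The maximum only limits the line as a whole when the pattern is anchored at both ends.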


Another problem occurs when you are trying to parse data that is not in English.
If you were looking for all letters in an English text, you could use something
like this: [a-zA-Z]. However, this would not include German letters like ä, ö, ü
and so forth. To do so, you would use the expressions [:lower:], [:upper:] or
[:alpha:] for the lower-case letters, upper-case letters or all letters,
respectively, regardless of the language. (Note this assumes that
national language support (NLS) is configured on your system, which it normally
is for newer Linux distributions.)

Other expressions include:


[:alnum:] – Alpha-numeric characters.
[:cntrl:] – Control characters.
[:digit:] – Digits.
[:graph:] – Graphics characters.
[:print:] – Printable characters.
[:punct:] – Punctuation.
[:space:] – White spaces.


One very important thing to note is that the brackets are part of the expression.
Therefore, if you want to include more in a bracket expression you need to make
sure you have the correct number of brackets. For example, if you wanted to
match any number of alpha-numeric or punctuation characters, you might have an
expression like this: [[:alnum:][:punct:]]*.
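The doubled brackets are easy to get wrong, so here is a sketch that can be run as is. The file sample and its three lines are invented, and only ASCII text is used so that the result does not depend on the locale. Each pattern is anchored at both ends so that the whole line has to consist of the listed classes.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1

# Invented sample data.
printf '%s\n' 'abc123' 'hello,world!' 'two words' > sample

alnum_only=$(grep -c '^[[:alnum:]]*$' sample)            # letters and digits only
alnum_punct=$(grep -c '^[[:alnum:][:punct:]]*$' sample)  # punctuation is allowed too
alpha_space=$(grep -c '^[[:alpha:][:space:]]*$' sample)  # letters and white space only

echo "$alnum_only $alnum_punct $alpha_space"

cd / && rm -rf "$dir"
```

The outer pair of brackets opens and closes the list. The inner pairs belong to the class names themselves.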


Another thing to note is that in most
cases, a regular expression matches as much as possible. For example, let’s
assume I was parsing an HTML file and wanted to match the
first tag on the line. You might think to try an expression like this:
“<.*>”. This says to match any number of characters between the angle brackets.
This works if there is only one tag on the line. However, if you have more than
one tag, this expression would match everything from the first opening
angle bracket to the last closing angle bracket with everything
in between.
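One common way around this is to say "any character except a closing angle bracket" instead of "any character". The sketch below assumes a grep that supports the -o option (print only the matched text), which GNU and BSD grep provide but which is not required of every grep. The sample line is invented.

```shell
# Invented sample line with more than one tag.
line='<b>bold</b> and <i>italic</i>'

# .* runs all the way to the LAST closing angle bracket on the line.
greedy=$(printf '%s\n' "$line" | grep -o '<.*>')

# [^>]* cannot cross a closing angle bracket, so each tag is matched by itself;
# head keeps only the first one.
first_tag=$(printf '%s\n' "$line" | grep -o '<[^>]*>' | head -n 1)

echo "$greedy"
echo "$first_tag"
```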

There are a number of rules defined for regular expressions, and understanding them helps avoid confusion:

  1. A non-special character is equivalent to that character
  2. When preceded by a backslash (\), every special character is equivalent to itself
  3. A period specifies any single character
  4. An asterisk specifies zero or more copies of the preceding character
  5. When used by itself, an asterisk specifies everything or nothing
  6. A range of characters is specified within square brackets ([ ])
  7. The beginning of the line is specified with a caret (^) and the end of the line with a dollar sign ($)
  8. If included within square brackets, a caret (^) negates the set of characters