Regular Expressions and Metacharacters
Often, the arguments that you pass to commands are file names. For example,
if you wanted to edit a file called letter, you could enter the command vi
letter. In many cases, typing the entire name is not necessary. Built into the
shell are special characters that it will use to expand the name. These are
called metacharacters.
The most common metacharacter is *. The * is used to represent any number of
characters, including zero. For example, if we have a file in our current
directory called letter and we input
vi let*
the shell would expand this to
vi letter
Or, if we had a file simply called let, this would match as well.
Instead, what if we had several files called letter.chris, letter.daniel,
and letter.david? The shell would expand them all out to give us the command
vi letter.chris letter.daniel letter.david
We could also type in vi letter.da*, which would be expanded to
vi letter.daniel letter.david
If we only wanted to edit the letter to chris, we could type it in as vi
*chris. However, if there were two files, letter.chris and note.chris, the
command vi *chris would have the same results as if we typed in:
vi letter.chris note.chris
In other words, no matter where the asterisk appears, the shell expands it
to match every name it finds. If our current directory contained files
with matching names, the shell would expand them properly. However, if there
were no matching names, file name expansion couldn't take place and the file
name would be taken literally.
For example, if there were no file name in our current directory that began
with letter, the command
vi letter*
could not be expanded and we would end up
editing a new file called (literally) letter*, including the asterisk. This
would not be what we wanted.
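The expansion rules above can be tried out safely with echo, which simply prints whatever arguments the shell hands it. This is only a minimal sketch: the scratch directory, the variable names, and the memo* pattern are made up for the demonstration, and it assumes a Bourne-style shell such as sh or bash with its default settings.

```shell
# Work in a throwaway directory (hypothetical setup, not from the text).
demo=$(mktemp -d)
cd "$demo" || exit 1
touch letter.chris letter.daniel letter.david note.chris

# echo prints its arguments exactly as the shell passed them.
all_letters=$(echo letter*)
da_letters=$(echo letter.da*)
chris_files=$(echo *chris)
# No name starts with "memo", so the pattern is handed over unchanged.
no_match=$(echo memo*)

echo "$all_letters"   # letter.chris letter.daniel letter.david
echo "$da_letters"    # letter.daniel letter.david
echo "$chris_files"   # letter.chris note.chris
echo "$no_match"      # memo*

# Clean up the scratch directory.
cd / && rm -rf "$demo"
```

Replacing vi with echo is a cheap way to preview what a wildcard will turn into before running the real command.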
What if we had a subdirectory called letters? If it contained the three
files letter.chris, letter.daniel, and letter.david, we could get to them by typing
vi letters/letter*
This would expand to be:
vi letters/letter.chris letters/letter.daniel letters/letter.david
The same rules for path names with commands also apply to file names. The
command
is the same as
which is the same as
This is because the shell
is doing the expansion before it is passed to the command. Therefore, even
directories are expanded. And the command
could be expanded as both letters/letter.chris and lease/letter.joe, or any
similar combination.
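To see that the directory part of a path is expanded just like the file part, we can again let echo show us the result. This is a sketch under assumptions: the scratch directory and the pattern le*/letter.* are chosen here for illustration and are not necessarily the command the text had in mind.

```shell
# Throwaway directory tree (hypothetical setup).
demo=$(mktemp -d)
cd "$demo" || exit 1
mkdir letters lease
touch letters/letter.chris letters/letter.daniel lease/letter.joe

# A wildcard in the file part of the path.
in_letters=$(echo letters/letter*)
# A wildcard in the directory part as well (assumed example pattern).
across_dirs=$(echo le*/letter.*)

echo "$in_letters"    # letters/letter.chris letters/letter.daniel
echo "$across_dirs"   # lease/letter.joe letters/letter.chris letters/letter.daniel

cd / && rm -rf "$demo"
```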
The next wildcard is ?. This is expanded by the shell as one, and only one,
character. For example, the command vi letter.chri? is the same as vi
letter.chris. However, if we were to type in vi letter.chris? (note that the "?"
comes after the "s" in chris), the result would be that we would begin editing a
new file called (literally) letter.chris?. Again, not what we wanted.
This wildcard could be used if, for example, there were two files named
letter.chris1 and letter.chris2. The command vi letter.chris? would be the same
as
vi letter.chris1 letter.chris2
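A short sketch of the question mark in action, once more using echo in a scratch directory. The directory, the variable names, and the note.chris? pattern are assumptions made for the example.

```shell
# Throwaway directory (hypothetical setup).
demo=$(mktemp -d)
cd "$demo" || exit 1
touch letter.chris letter.chris1 letter.chris2

# ? stands for exactly one character.
one_char=$(echo letter.chri?)
numbered=$(echo letter.chris?)
# Nothing matches, so the pattern stays as typed, question mark included.
no_match=$(echo note.chris?)

echo "$one_char"   # letter.chris
echo "$numbered"   # letter.chris1 letter.chris2
echo "$no_match"   # note.chris?

cd / && rm -rf "$demo"
```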
Another commonly used metacharacter is actually a pair of characters: [ ].
The square brackets are used to represent a list of possible characters. For
example, if we were not sure whether our file was called letter.chris or
letter.Chris, we could type in the command as: vi letter.[Cc]hris. So, no matter
if the file was called letter.chris or letter.Chris, we would find it. What
happens if both files exist? Just as with the other metacharacters, both are
expanded and passed to vi. Note that in this example, vi letter.[Cc]hris appears
to be the same as vi letter.?hris, but it is not always so.
The list that appears inside the square brackets does not have to be an
upper- and lowercase combination of the same letter. The list can be made up of
any letter, number, or even punctuation. (Note that some punctuation marks have
special meaning, such as *, ?, and [ ], which we will cover shortly.) For
example, if we had five files, letter.chris1-letter.chris5, we could edit all of
them with vi letter.chris[12435].
A nice thing about this list is that if it is consecutive, we don't need to
list all possibilities. Instead, we can use a dash (-) inside the brackets to
indicate that we mean a range. So, the command
vi letter.chris[12345]
could be shortened to
vi letter.chris[1-5]
What if we only wanted the first three and the last one? No problem. We could
specify it as
vi letter.chris[1-35]
This does not mean that we want files letter.chris1
through letter.chris35! Rather, we want letter.chris1, letter.chris2,
letter.chris3, and letter.chris5. All entries in the list are seen as individual
characters.
Inside the brackets, we are not limited to just numbers or just letters. We
can use both. The command vi letter.chris[abc123] has the potential for editing
six files: letter.chrisa, letter.chrisb, letter.chrisc, letter.chris1,
letter.chris2, and letter.chris3.
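The bracket lists and ranges described above can be checked the same way. This is a minimal sketch: the scratch directory and the particular set of files are assumptions, and LC_ALL=C is set only so that the sort order and the ranges behave the same on every system.

```shell
# Fixed sort order and plain ASCII ranges for a repeatable demo.
LC_ALL=C
export LC_ALL

# Throwaway directory (hypothetical setup).
demo=$(mktemp -d)
cd "$demo" || exit 1
touch letter.Chris letter.chris1 letter.chris2 letter.chris3 \
      letter.chris4 letter.chris5 letter.chrisa

either_case=$(echo letter.[Cc]hris)    # upper- or lowercase "c"
range=$(echo letter.chris[1-5])        # every digit from 1 to 5
picked=$(echo letter.chris[1-35])      # 1 through 3, plus 5
mixed=$(echo letter.chris[abc123])     # letters and digits in one list

echo "$either_case"   # letter.Chris
echo "$range"         # letter.chris1 letter.chris2 letter.chris3 letter.chris4 letter.chris5
echo "$picked"        # letter.chris1 letter.chris2 letter.chris3 letter.chris5
echo "$mixed"         # letter.chris1 letter.chris2 letter.chris3 letter.chrisa

cd / && rm -rf "$demo"
```

Only the names that actually exist are returned, which is why the last pattern yields four names even though it has the potential for six.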
If we are so inclined, we can mix and match any of these metacharacters any
way we want. We can even use them multiple times in the same command. Let's take
as an example the command
Should they exist in our current directory, this command would match
all of the following:
letter.chrisa    note.chrisa
letter.chrisb    note.chrisb
letter.chrisc    note.chrisc
letter.chrisd    note.chrisd
letter.chrise    note.chrise
letter.chris1    note.chris1
letter.chris2    note.chris2
letter.chris3    note.chris3
letter.chris4    note.chris4
letter.chris5    note.chris5
letter.Chrisa    note.Chrisa
letter.Chrisb    note.Chrisb
letter.Chrisc    note.Chrisc
letter.Chrisd    note.Chrisd
letter.Chrise    note.Chrise
letter.Chris1    note.Chris1
letter.Chris2    note.Chris2
letter.Chris3    note.Chris3
letter.Chris4    note.Chris4
letter.Chris5    note.Chris5
Also, any of these names without the leading letter or note would match. Or,
if we issued the command:
these would match
letter.daniel note.daniel letter.david note.david
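As a rough illustration of mixing several metacharacters in one pattern, the following sketch uses two patterns that would produce matches like the ones listed. Both patterns, *.[Cc]hris[a-e1-5] and *.da*, are assumptions chosen for the example, as are the scratch directory and the handful of test files.

```shell
# Fixed sort order and plain ASCII ranges for a repeatable demo.
LC_ALL=C
export LC_ALL

# Throwaway directory with a few sample names (hypothetical setup).
demo=$(mktemp -d)
cd "$demo" || exit 1
touch letter.chrisa note.Chris5 letter.chrisf letter.daniel note.david note.eric

# set -- stores the expanded names as $1, $2, ... so we can count them.
set -- *.[Cc]hris[a-e1-5]
chris_count=$#
chris_names=$*

set -- *.da*
da_names=$*

echo "$chris_count"   # 2
echo "$chris_names"   # letter.chrisa note.Chris5
echo "$da_names"      # letter.daniel note.david

cd / && rm -rf "$demo"
```

Note that letter.chrisf is left out because "f" is not in the list a-e1-5.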
Remember, I said that the shell expands the metacharacters only with respect
to the name specified. This obviously works for file names as I described
above. However, it works for command names as well.
If we were to type dat* and there was nothing in our current directory that
started with dat, we would get a message like
dat*: not found
However, if we were to type /bin/dat*, the shell could successfully expand this
to be /bin/date, which it would then execute. The same applies to relative
paths. If we were in / and entered ./bin/dat* or bin/dat*, both would be
expanded properly and the right command would be executed. If we entered the
command /bin/dat[abcdef], we would get the right response as well because the
shell tries all six letters listed and finds a match with /bin/date.
An important thing to note is that the shell expands as long as it can
before it attempts to interpret a command. I was reminded of this fact by
accident when I input /bin/l*. If you do an
ls -l /bin/l*
you should get the output:
-rwxr-xr-x 1 root root 22340 Sep 20 06:24 /bin/ln
-r-xr-xr-x 1 root root 25020 Sep 20 06:17 /bin/login
-rwxr-xr-x 1 root root 47584 Sep 20 06:24 /bin/ls
At first, I expected each one of the files in /bin that began with an "l"
(ell) to be executed. Then I remembered that expansion takes place before
the command is interpreted. Therefore, the command that I input, /bin/l*, was
expanded to be
/bin/ln /bin/login /bin/ls
Because /bin/ln was the first command in the list, the system expected that
I wanted to link the two files together (what /bin/ln is used for). I ended up
with the error message:
/bin/ln: /bin/ls: File exists
This is because the system thought I was trying to link the file /bin/login
to /bin/ls, which already existed. Hence the message.
The same thing happens when I input /bin/l? because the /bin/ln is expanded
first. If I issue the command /bin/l[abcd], I get the message that there is no
such file. If I type in
/bin/l[a-n]
I get:
/bin/ln: missing file argument
because the /bin/ln command expects two file names as arguments and the only
thing that matched is /bin/ln.
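We can look at this effect without actually running anything. The sketch below builds a stand-in bin directory inside a scratch directory (both are assumptions for the demonstration, and the files are empty placeholders that are never executed) and uses set -- to capture the words the shell produces.

```shell
# Plain ASCII ranges and a fixed sort order for a repeatable demo.
LC_ALL=C
export LC_ALL

# Stand-in for /bin with three empty placeholder files (hypothetical setup).
demo=$(mktemp -d)
mkdir "$demo/bin"
touch "$demo/bin/ln" "$demo/bin/login" "$demo/bin/ls"
cd "$demo" || exit 1

# Capture the expanded words instead of executing them.
set -- bin/l*
command_word=$1    # the first word is what the shell would try to run
shift
arguments=$*       # everything else would become its arguments

set -- bin/l[a-n]
single=$*

echo "$command_word"   # bin/ln
echo "$arguments"      # bin/login bin/ls
echo "$single"         # bin/ln

cd / && rm -rf "$demo"
```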
I first learned about this aspect of shell expansion after a couple of hours
of trying to extract a specific subdirectory from a tape that I had made with
the cpio command. Because I made the tape using absolute paths, I attempted to
restore the files as /home/jimmo/data/*. Rather than restoring the entire
directory as I expected, it did nothing. It worked its way through the tape
until it got to the end and then rewound itself without extracting any files.
At first I assumed I made a typing error, so I started all over. The next
time, I checked the command before I sent it on its way. After half an hour or
so of whirring, the tape was back at the beginning. Still no files. Then it
dawned on me that I hadn't told cpio to overwrite existing files
unconditionally. So I started it all over again.
Now, those of you who know cpio realize that this wasn't the issue either.
At least not entirely. When the tape got to the right spot, it started
overwriting everything in the directory (as I told it to). However, the files
that were missing (the ones that I really wanted to get back) were still not
copied from the backup tape.
The next time, I decided to just get a listing of all the files on the tape.
Maybe the files I wanted were not on this tape. After a while it reached the
right directory and lo and behold, there were the files that I wanted. I could
see them on the tape, I just couldn't extract them.
Well, the first idea that popped into my mind was to restore
everything. That's sort of like fixing a flat tire by buying a new car. Then
I thought about restoring the entire tape into a temporary directory where I
could then get the files I wanted. Even if I had the space, this still seemed
like the wrong way of doing things.
Then it hit me. I was going about it the wrong way. The solution was to go
ask someone what I was doing wrong. I asked one of the more senior engineers (I
had only been there less than a year at the time). When I mentioned that I was
using wildcards, it was immediately obvious what I was doing wrong (obvious to
him, not to me).
Let's think about it for a minute. It is the shell that does the
expansion, not the command itself (like when I ran /bin/l*). The shell
interprets the command as starting with /bin/l. Therefore, I get a listing of
all the files in /bin that start with "l". With cpio, the situation is similar.
When I first ran it, the shell expanded the file names (/home/jimmo/data/*)
before passing them to cpio. Because I hadn't told cpio to overwrite the files,
it did nothing. When I told cpio to overwrite the files, it only did so for the
files that it was told to. That is, only the files that the shell saw when it
expanded /home/jimmo/data/*. In other words, cpio did what it was told. I just
told it to do something that I hadn't expected.
The solution is to find a way to pass the wildcards to cpio. That is, the
shell must ignore the special significance of the asterisk. Fortunately, there
is a way to do this. By placing a backslash (\) before the metacharacter,
you remove its special significance. This is referred to as "escaping" that
character.
So, in my situation with cpio, when I referred to the files I wanted as
/home/jimmo/data/\*, the shell passed the arguments to cpio as
/home/jimmo/data/*. It was then cpio that expanded the * to mean all the files
in that directory. Once I did that, I got the files I wanted.
You can also protect the metacharacters from being expanded by enclosing the
entire expression in single quotes. This is necessary because it is the shell
that first expands wildcards before passing them to the program. Note also that
if the wildcard cannot be expanded, the entire expression (including the
metacharacters) is passed as an argument to the program. Some programs are
capable of expanding the metacharacters themselves.
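The difference between an expanded, an escaped, and a quoted wildcard is easy to see with echo. This is a small sketch in a scratch directory; the directory and file names are assumptions for the example.

```shell
# Throwaway directory (hypothetical setup).
demo=$(mktemp -d)
cd "$demo" || exit 1
touch letter.chris letter.daniel

expanded=$(echo letter.*)     # the shell expands the asterisk
escaped=$(echo letter.\*)     # the backslash removes its special meaning
quoted=$(echo 'letter.*')     # single quotes protect the whole expression

echo "$expanded"   # letter.chris letter.daniel
echo "$escaped"    # letter.*
echo "$quoted"     # letter.*

cd / && rm -rf "$demo"
```

In the last two cases the program receives the asterisk itself and is free to interpret it in its own way, as cpio did.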
As in other places, the exclamation mark (!) has a special meaning.
(That is, it is also a metacharacter.) When it is the first character inside
square brackets, the exclamation mark is used to negate a set of characters.
For example, if we wanted to list all files
that did not have a number at the end, we could do something like this
ls *[!0-9]
This is certainly faster than typing this
ls *[a-zA-Z]
However, this second example does not mean the same thing. In the first case, we are saying
we do not want numbers. In the second case, we are saying we only want letters. There is a
key difference because in the second case we do not include the punctuation marks and other
symbols.
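The following sketch shows the two patterns side by side. The scratch directory and the three file names are assumptions picked so that one name ends in a digit, one in a letter, and one in a punctuation mark. LC_ALL=C is set so the ranges cover plain ASCII only.

```shell
# Plain ASCII ranges and a fixed sort order for a repeatable demo.
LC_ALL=C
export LC_ALL

# Throwaway directory (hypothetical setup).
demo=$(mktemp -d)
cd "$demo" || exit 1
touch report1 notes draft_

# Names that do NOT end in a digit.
not_digit=$(echo *[!0-9])
# Names that end in a letter.
letters_only=$(echo *[a-zA-Z])

echo "$not_digit"      # draft_ notes
echo "$letters_only"   # notes

cd / && rm -rf "$demo"
```

The file draft_ shows the difference: it does not end in a digit, but it does not end in a letter either.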
Another symbol with special meaning is the dollar sign ($). This is used as
a marker to indicate that something is a variable. I mentioned earlier in this
section that you could get access to your login name environment variable by
typing:
echo $LOGNAME
The system stores your login name in the environment variable LOGNAME (note
no "$"). The system needs some way of knowing that when you input this on the
command line, you are talking about the variable LOGNAME and not the literal
string LOGNAME. This is done with the "$". Several variables are set by the
system. You can also set variables yourself and use them later on. I'll get into
more detail about shell variables later.
So far, we have been talking about metacharacters used for searching the
names of files. However, metacharacters can often be used in the arguments to
certain commands. One example is the grep command, which is used to search for
strings within files. The name grep comes from Global Regular Expression Print
(or Parser). As its name implies, it has something to do with regular
expressions. Let's assume we have a text file called documents, and we wish to
see if the string "letter" exists in that text.
The command might be
grep letter documents
This will search for and print out every line containing the string
"letter." This includes such things as "letterbox," "lettercarrier," and even
"love-letter." However, it will not find "Letterman," because we did not tell
grep to ignore upper- and lowercase (using the -i option). To do so using
regular expressions, the command might look like this
grep "[Ll]etter" documents
Now, because we specified to look for either "L" or "l" followed by "etter,"
we get both "letter" and "Letterman." We can also specify that we want to look
for this string only when it appears at the beginning of a line using the caret
(^) symbol. For example
grep "^[Ll]etter" documents
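To make these searches concrete, here is a small sketch that builds a sample file and counts the matching lines with grep -c. The file contents and variable names are assumptions made up for the example. Only the patterns follow the text.

```shell
# Temporary stand-in for the "documents" file (hypothetical contents).
documents=$(mktemp)
cat > "$documents" <<'EOF'
a letter for chris
Letterman was on
letterbox is full
send the love-letter
nothing here
EOF

# grep -c prints the number of matching lines.
plain=$(grep -c 'letter' "$documents")          # lowercase "letter" only
either=$(grep -c '[Ll]etter' "$documents")      # "letter" or "Letter"
ignore_case=$(grep -ci 'letter' "$documents")   # same result via -i
at_start=$(grep -c '^[Ll]etter' "$documents")   # only at the start of a line

echo "$plain"         # 3
echo "$either"        # 4
echo "$ignore_case"   # 4
echo "$at_start"      # 2

rm -f "$documents"
```

The patterns are quoted so that the shell does not try to expand the square brackets as a file name pattern before grep sees them.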
This searches for all strings that start with the "beginning-of-line,"
followed by either "L" or "l," followed by "etter." Or, if we want to search for
the same string at the end of the line, we would use the dollar sign to indicate
the end of the line. Note that at the beginning of a string, the dollar sign
tells the shell that the string is the name of a variable, whereas at the end
of a string, it indicates the end of the line. Confused? Let's look at an
example. Let's define a string like this:
VAR=^[Ll]etter
If we echo that string, we simply get ^[Ll]etter. Note that this includes
the caret at the beginning of the string. When we do a search like this
grep "$VAR" documents
it is equivalent to
grep "^[Ll]etter" documents
Now, if we write the same command like this
grep "$VAR$" documents
this says to find the string defined by the VAR variable (^[Ll]etter), but
only if it is at the end of the line. Here we have an example where the dollar
sign has both meanings. If we then take it one step further:
This says to find the string defined by the VAR variable, but only if it
takes up the entire line. In other words, the line consists only of the beginning
of the line (^), the string defined by VAR, and the end of the line ($).
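The sketch below runs through the anchoring cases with a pattern stored in a variable. To keep the cases apart, it assumes a variant of VAR that does not contain the caret. The sample file and the other variable names are likewise made up for the example.

```shell
# Temporary sample file (hypothetical contents).
documents=$(mktemp)
cat > "$documents" <<'EOF'
letter
Letter
a letter
letter box
EOF

# Assumption: the pattern is stored WITHOUT a leading caret here.
VAR='[Ll]etter'

anywhere=$(grep -c "${VAR}" "$documents")     # pattern anywhere in the line
at_start=$(grep -c "^${VAR}" "$documents")    # pattern at the start of the line
at_end=$(grep -c "${VAR}$" "$documents")      # pattern at the end of the line
whole=$(grep -c "^${VAR}$" "$documents")      # pattern is the entire line

echo "$anywhere"   # 4
echo "$at_start"   # 3
echo "$at_end"     # 3
echo "$whole"      # 2

rm -f "$documents"
```

Inside double quotes the shell still replaces the variable, while the final dollar sign is passed through to grep, where it marks the end of the line.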
Here I want to sidestep a little. When you look at the expression $VAR$
it might be confusing to some people. Further, if you were to combine this variable
with other characters, you may end up with something you do not expect because the shell decides
to include them as part of the variable name. To prevent this, it is a good idea to include the
variable name within curly braces, like this:
${VAR}$
The curly braces tell the shell what exactly belongs to the variable name.
I try to always include the variable name within curly braces to ensure that there
is no confusion. The curly braces are also useful when combining variables like this:
${VAR1}${VAR2}
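A quick sketch of what can go wrong without the curly braces. The variable names and values are assumptions for the example.

```shell
# Sample values (hypothetical).
VAR=letter
VAR1=love
VAR2=letter
unset VARs   # make sure no variable named VARs exists

# Without braces the shell looks for a variable called VARs, which is empty.
without_braces="$VARs"
# With braces the name ends at the closing brace and the "s" is plain text.
with_braces="${VAR}s"
# Two variables placed directly next to each other.
combined="${VAR1}${VAR2}"

echo "[$without_braces]"   # []
echo "$with_braces"        # letters
echo "$combined"           # loveletter
```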
Often you need to match a series of repeated characters, such as spaces, dashes
and so forth. Although you could simply use the asterisk to specify any number
of that particular character, you can run into problems on both ends. First,
maybe you want to match a minimum number of that character. This could
easily be solved by first repeating that character a certain number of times before
you use the wildcard. For example, the expression
====*
would match at least three equal signs. Why three? Well, we have
explicitly put in three equal signs and the wildcard follows the fourth. Since
the asterisk means zero or more of the preceding character, the fourth equal
sign may be missing altogether, and therefore the expression matches as few as
three.
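We can check the "at least three" claim with grep. The sample lines below are assumptions for the demonstration.

```shell
# Temporary sample file (hypothetical contents).
lines=$(mktemp)
printf '%s\n' '==' '===' '======' 'a = b' > "$lines"

# Three literal equal signs, then zero or more further ones.
three_or_more=$(grep -c '====*' "$lines")
# The lines that matched, for inspection.
matched=$(grep '====*' "$lines")

echo "$three_or_more"   # 2
echo "$matched"         # prints === and ====== on separate lines

rm -f "$lines"
```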
The next problem occurs when we want to limit the maximum number of characters
that are matched. If you know exactly how many to match, you could simply use
that many characters. What do you do if you have a minimum and a maximum? For
this, you enclose the range with curly braces: {min,max}. For example, to
specify at least 5 and at most 10, it would look like this: {5,10}. Keep in mind
that in the basic regular expressions that grep uses by default, each curly
brace must be preceded by a backslash (grep -E accepts them without one), and
that the curly braces also have a special meaning for the shell, so it is best
to quote the whole expression on the command line. So, let's say
we wanted to search a file for all number combinations between 5 and 10 digits
long. We might have something like this:
[0-9]\{5,10\}
This might seem a little complicated, but it would be far more complicated to
write a regular expression that searches for each combination individually.
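Here is a minimal sketch of a minimum and maximum count with grep. The sample file is an assumption for the example. The -E and -x options are standard grep options (extended expressions and whole-line matching).

```shell
# Temporary sample file (hypothetical contents).
numbers=$(mktemp)
printf '%s\n' '1234' '12345' '1234567890' '123456789012' 'abc' > "$numbers"

# Basic regular expression: the braces are written with backslashes.
basic=$(grep -c '[0-9]\{5,10\}' "$numbers")
# Extended regular expression (-E): the braces are written plainly.
extended=$(grep -Ec '[0-9]{5,10}' "$numbers")
# -x requires the expression to match the whole line.
whole_line=$(grep -xc '[0-9]\{5,10\}' "$numbers")

echo "$basic"        # 3
echo "$extended"     # 3
echo "$whole_line"   # 2

rm -f "$numbers"
```

The 12-digit line is counted in the first two searches because it contains a run of 5 to 10 digits. Only when the whole line has to match does the maximum really act as a limit.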
As we mentioned above, to define a specific number of a particular character
you could simply input that character the desired number of times. However, try
counting 17 periods on a line or 17 lower-case letters ([a-z]). Imagine trying
to type in this combination 17 times! You could specify a range with a maximum
of 17 and a minimum of 17, like this: {17,17}. Although this would work,
you could save yourself a little typing by simply including just the single
value. Therefore, to match exactly 17 lower-case letters, you might have
something like this:
[a-z]\{17\}
If we want to specify a minimum number of times, without a maximum, we simply
leave off the maximum, like this:
[a-z]\{17,\}
This would match a pattern of at least 17 lower-case letters.
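The exact and open-ended counts can be tried out as follows. The three sample words (16, 17, and 18 letters long) are assumptions for the demonstration, and grep -x is used so that the whole line has to match. LC_ALL=C keeps [a-z] to plain ASCII letters.

```shell
# Plain ASCII ranges for a repeatable demo.
LC_ALL=C
export LC_ALL

# Words of 16, 17, and 18 lower-case letters (hypothetical contents).
words=$(mktemp)
printf '%s\n' 'abcdefghijklmnop' 'abcdefghijklmnopq' 'abcdefghijklmnopqr' > "$words"

exactly=$(grep -xc '[a-z]\{17\}' "$words")        # exactly 17
same_range=$(grep -xc '[a-z]\{17,17\}' "$words")  # same thing, written as a range
at_least=$(grep -xc '[a-z]\{17,\}' "$words")      # 17 or more

echo "$exactly"      # 1
echo "$same_range"   # 1
echo "$at_least"     # 2

rm -f "$words"
```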
Another problem occurs when you are trying to parse data that is not in English.
If you were looking for all letters in an English text, you could use something
like this: [a-zA-Z]. However, this would not include German letters, like ä, Ö, ß
and so forth. To do so, you would use the expressions [:lower:], [:upper:] or
[:alpha:] for the lower-case letters, upper-case letters or all letters,
respectively, regardless of the language. (Note this assumes that
national language support (NLS) is configured on your system, which it normally
is for newer Linux distributions.)
Other expressions include:
[:alnum:] - Alpha-numeric characters.
[:cntrl:] - Control characters.
[:digit:] - Digits.
[:graph:] - Graphics characters.
[:print:] - Printable characters.
[:punct:] - Punctuation.
[:space:] - White spaces.
One very important thing to note is that the brackets are part of the expression.
Therefore, if you want to include more in a bracket expression you need to make
sure you have the correct number of brackets. For example, if you wanted to
match any number of alpha-numeric or punctuation characters, you might have an
expression like this: [[:alnum:][:punct:]]*.
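A short sketch of these character classes with grep. Note the double brackets: the inner pair belongs to the class name, the outer pair forms the list. The sample lines are assumptions and are kept to plain ASCII so that the result does not depend on the language settings.

```shell
# Temporary sample file (hypothetical contents).
samples=$(mktemp)
printf '%s\n' 'abc' '123' '!?' 'a b' > "$samples"

has_digit=$(grep -c '[[:digit:]]' "$samples")   # lines containing a digit
has_alpha=$(grep -c '[[:alpha:]]' "$samples")   # lines containing a letter
# Lines made up entirely of alphanumeric or punctuation characters.
alnum_punct=$(grep -c '^[[:alnum:][:punct:]]*$' "$samples")

echo "$has_digit"     # 1
echo "$has_alpha"     # 2
echo "$alnum_punct"   # 3

rm -f "$samples"
```

The line "a b" is left out of the last count because a space is neither alphanumeric nor punctuation.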
Another thing to note is that in most cases, regular expressions match as much
as possible. For example, let's assume we were parsing an HTML file and wanted
to match the first tag on the line. We might think to try an expression like this:
"<.*>". This says to match any number of characters between the angle brackets.
This works if there is only one tag on the line. However, if there is more than
one tag, this expression would match everything from the first opening
angle bracket to the last closing angle bracket, with everything
in between.
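The effect is easy to see with sed, which replaces whatever the expression matched. This is a sketch with an assumed sample line. The second expression, <[^>]*>, is one common way around the problem and is offered here as a suggestion, not as part of the original discussion.

```shell
# Sample line with two tags (hypothetical input).
line='<b>bold</b> text'

# ".*" grabs as much as it can: from the first "<" to the last ">".
greedy=$(printf '%s\n' "$line" | sed 's/<.*>/X/')
# "[^>]*" cannot run past the first ">", so only the first tag matches.
first_tag=$(printf '%s\n' "$line" | sed 's/<[^>]*>/X/')

echo "$greedy"      # X text
echo "$first_tag"   # Xbold</b> text
```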
There are a number of rules defined for regular expressions, the understanding
of which helps avoid confusion:
- A non-special character is equivalent to that character.
- When preceded by a backslash (\), every special character is equivalent to itself.
- A period specifies any single character.
- An asterisk specifies zero or more copies of the preceding character.
- When used by itself, an asterisk specifies everything or nothing.
- A range of characters is specified within square brackets ([ ]).
- The beginning of the line is specified with a caret (^) and the end of the line with
a dollar sign ($).
- If included as the first character within square brackets, a caret (^) negates the set of characters.
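A few of these rules can be verified with a handful of sample lines. As before, this is only a sketch, and the file contents and variable names are assumptions for the example.

```shell
# Temporary sample file (hypothetical contents).
samples=$(mktemp)
printf '%s\n' 'a.b' 'axb' 'ab' '42' > "$samples"

any_char=$(grep -c 'a.b' "$samples")             # period: any single character
literal_dot=$(grep -c 'a\.b' "$samples")         # escaped period: a real period
star=$(grep -c 'ax*b' "$samples")                # zero or more "x" between a and b
no_digit_start=$(grep -c '^[^0-9]' "$samples")   # line starts with a non-digit

echo "$any_char"         # 2
echo "$literal_dot"      # 1
echo "$star"             # 2
echo "$no_digit_start"   # 3

rm -f "$samples"
```

The last expression uses the caret twice: outside the brackets it marks the beginning of the line, inside them it negates the set.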