UNIX Shell Programming : grep,sed and awk

Unix Shell Programming Part 3 grep, sed and awk

Contents :-

grep:-Operation, grep Family, Searching for File Content.
sed:-Scripts, Operation, Addresses, commands, Applications, grep and sed.
awk:-Execution, Fields and Records, Scripts, Operations, Patterns, Actions, Associative Arrays, String Functions, Mathematical Functions, User Defined Functions, Using System commands in awk, Applications of awk, grep and sed

grep

grep stands for global regular expression print. It is a family of programs that is used to search the input file for all lines that match a specified regular expression and write them to the standard output file (monitor).

Operation

To write scripts that operate correctly, we must understand how the grep utilities work. For each line in the standard input (input file or keyboard), grep performs the following operations:
1.     Copies the next input line into the pattern space. The pattern space is a buffer that can hold only one text line.
2.     Applies the regular expression to the pattern space.
3.     If there is a match, the line is copied from the pattern space to the standard output. 
The grep utilities repeat these three operations on each line in the input.

grep Flowchart

Two points are to be noted about the flowchart for the grep utility. First, the flowchart assumes that no options were specified. Selecting one or more options will change the flowchart. Second, although grep keeps a current line counter so that it always knows which line is being processed, the current line number is not reflected in the flowchart.

grep Operation Example

Let’s take a simple example. We have a file with four lines, and our aim is to display any line in the file that contains UNIX. Three of the lines match the grep expression. The example illustrates the following points about grep:
1.     grep is a search utility; it can search only for the existence of a line that matches a regular expression.
2.     The only action that grep can perform on a line is to send it to standard output. If the line does not match the regular expression, it is not printed.
3.     The line selection is based only on the regular expression. The line number or other criteria cannot be used to select the line.
4.     grep is a filter. It can be used at the left- or right-hand side of a pipe.
5.     grep cannot be used to add, delete or change a line.
6.     grep cannot be used to print only part of a line.
7.     grep cannot read only part of a file.
8.     grep cannot select a line based on the contents of the previous or the next line. There is only one buffer, and it holds only the current line.
The file contents are:
         Only one UNIX
         DOS only here
         Mac OS X is UNIX
         Linux is UNIX
$ grep 'UNIX' file1
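The example can be reproduced with a short sketch. The scratch directory created with mktemp is an assumption for illustration; the file name file1 and its contents come from the example above.

```shell
# Recreate the four-line sample file in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 'Only one UNIX' 'DOS only here' \
    'Mac OS X is UNIX' 'Linux is UNIX' > "$tmp/file1"

# Print every line that contains the string UNIX.
out=$(grep 'UNIX' "$tmp/file1")
printf '%s\n' "$out"
```

Only the line "DOS only here" is left out, because it is the only one that does not contain UNIX.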

grep Family

There are three utilities in the grep family: grep, egrep and fgrep. All three search one or more files and output lines that contain text that matches criteria specified as a regular expression. The whole line does not have to match the criteria; any matching text in the line is sufficient for it to be output. It examines each line in the file, one by one. When a line contains a matching pattern, the line is output. Although this is a powerful capability that quickly reduces a large amount of data to a meaningful set of information, it cannot be used to process only a portion of the data.
fgrep (Fast grep): supports only string patterns, no regular expressions.
grep: supports only a limited number of regular expressions.
egrep (Extended grep): supports most regular expressions but not all of them.

grep Family Options

There are several options available to the grep family. A summary is given below:
Option     Explanation
-b         Precedes each line by the file block number in which it is found.
-c         Prints only a count of the number of lines matching the pattern.
-i         Ignores upper- / lowercase in matching text.
-l         Prints a list of files that contain at least one line matching the pattern.
-n         Shows line number of each line before the line.
-s         Silent mode. Executes utility but suppresses all output.
-v         Inverse output. Prints lines that do not match pattern.
-x         Prints only lines that entirely match pattern.
-f file    List of strings to be matched are in file.
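A few of these options can be tried on the four-line sample file used earlier. This is a minimal sketch; the scratch directory and the variable names are assumptions made for illustration.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'Only one UNIX' 'DOS only here' \
    'Mac OS X is UNIX' 'Linux is UNIX' > "$tmp/file1"

count=$(grep -c 'UNIX' "$tmp/file1")           # count of matching lines
numbered=$(grep -n 'DOS' "$tmp/file1")         # matching line preceded by its line number
inverse=$(grep -v 'UNIX' "$tmp/file1")         # lines that do not match
nocase=$(grep -ci 'unix' "$tmp/file1")         # case-insensitive count
whole=$(grep -x 'Linux is UNIX' "$tmp/file1")  # the entire line must match

printf '%s\n' "$count" "$numbered" "$inverse" "$nocase" "$whole"
```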

grep Family Expressions

As we have seen before, fast grep (fgrep) uses only sequence operators in a pattern; it does not support any of the other regular expression operators. Basic grep and extended grep (egrep) both accept regular expressions as shown in table below:
Atoms (columns: grep, fgrep, egrep): Character, Dot, Class, Anchors (^ $), Back Reference
Operators (columns: grep, fgrep, egrep): Sequence, Repetition (grep: all but ?; egrep: * ? +), Alternation, Group, Save
Expressions in the grep utilities can become quite complex, often combining several atoms and/or operators into one large expression. When operators and atoms are combined, they are generally enclosed in either single quotes or double quotes. Technically, the quotes are needed only when there is a blank or other character that has a special meaning to the shell. As a good technique, we should always use them.

grep

The original of the file-matching utilities, grep handles most of the regular expressions. The middle road between the other two members of the family, grep allows regular expressions but is generally slower than egrep. It is the only member of the grep family that allows saving the results of a match for later use. In the example below, we use grep to find all the lines that end in the word the and then pipe the results to head to print the first five.
students@firewall:~/test$ man bash > bash.txt
students@firewall:~/test$ grep -n "the$" bash.txt | head -5
24:       In  addition  to  the  single-character shell options documented in the
46:                 shopt_option is one of the  shell  options  accepted  by  the
51:                 the  standard  output.   If  the invocation option is +O, the
115:       If arguments remain after option processing, and neither the -c nor the
116:       -s option has been supplied, the first argument is assumed  to  be  the

Fast grep

If your search criteria require only sequence expressions, fast grep (fgrep) is the best utility. Because its expressions consist of only sequence operators, it is also easiest to use if you are searching for text characters that are the same as regular expression operators, such as the escape, parentheses or quotes. For example, to extract all lines of bash.txt that contain an apostrophe, we could use fgrep as shown below:
students@firewall:~/test$ fgrep -n "'" bash.txt | tail -5
5235:              job spec is given, all processes  in  that  job's  pipeline  are
5334:       to  mail that as well!  Suggestions and `philosophical' bug reports may
5344:       A short script or `recipe' which exercises the bug
5353:       It's too big and too slow.
5362:       Compound commands and command sequences of the form `a ; b ; c' are not
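The difference between a regular expression and a plain string shows up as soon as the search text contains an operator character. The sketch below uses grep -F, the POSIX spelling of fgrep; the two-line sample file is hypothetical.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'a.b' 'axb' > "$tmp/dots"

# As a regular expression, the dot matches any single character.
regex=$(grep -c 'a.b' "$tmp/dots")

# As a fixed string, the dot matches only a literal dot.
fixed=$(grep -Fc 'a.b' "$tmp/dots")

printf 'regex: %s  fixed: %s\n' "$regex" "$fixed"
```

On current systems grep -F and grep -E are the portable equivalents of fgrep and egrep.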

Extended grep

Extended grep (egrep) is the most powerful of the three grep utilities. While it doesn’t have the save option, it does allow more complex patterns. Consider the case where we want to extract all lines that start with a capital letter and end in letter ‘N’ (uppercase n). Our first attempt at this command is shown below:
students@firewall:~/test$ egrep -n '^[A-Z].*N$' bash.txt | head -5
14:DESCRIPTION
126:INVOCATION
1288:EXPANSION
1748:REDIRECTION
2060:ARITHMETIC EVALUATION
This is a relatively complex expression with three parts. The first part looks for any line that starts with an uppercase letter. The second part says it can be followed by any character zero or more times. The third part says that a matching line must end with an uppercase ‘N’.
While the above expression is fine, we want to extend it so that it also finds all lines that start with a space and end with either a comma ‘,’ or a period ‘.’. To do this, we can use the alternation operator. We design a pattern for each requirement and then combine the three patterns with the alternation operator. Because the period is itself a regular expression atom, it is escaped with a backslash in the third pattern. The result is given below:
students@firewall:~/test$ egrep -n '(^[A-Z].*N$)|(^ .*,$)|(^ .*\.$)' bash.txt | head -406 | tail -11
2058:       recursive calls.
2060:ARITHMETIC EVALUATION
2064:       for overflow, though division by 0 is trapped and flagged as an  error.
2068:       order of decreasing precedence.
2099:       0 when referenced by name without using the parameter expansion syntax.
2104:       to be used in an expression.
2114:       35.
2118:       above.
2126:       the file argument to  one  of  the  primaries  is  one  of  /dev/stdin,
2127:       /dev/stdout,  or /dev/stderr, file descriptor 0, 1, or 2, respectively,
2128:       is checked.
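To see why the period in the third alternative has to be escaped, the pattern can be run against a small hand-made file. This is a sketch only; the five sample lines are invented for illustration, and grep -E is used as the POSIX spelling of egrep.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'DESCRIPTION' \
    '       recursive calls.' \
    '       one of /dev/stdin,' \
    '       no punctuation here' \
    'lowercase ends in N' > "$tmp/sample"

# Escaped period: only lines ending in a literal "." satisfy the third alternative.
escaped=$(grep -Ec '(^[A-Z].*N$)|(^ .*,$)|(^ .*\.$)' "$tmp/sample")

# Unescaped period: any line that starts with a space satisfies it.
unescaped=$(grep -Ec '(^[A-Z].*N$)|(^ .*,$)|(^ .*.$)' "$tmp/sample")

printf 'escaped: %s  unescaped: %s\n' "$escaped" "$unescaped"
```

With the escape, three lines match; without it, the line with no closing punctuation is selected as well.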

Examples

1.     Select the lines from the file that have exactly three characters.
egrep '^...$' testFile
2.     Select the lines from the file that have at least three characters.
egrep '...' testFile
3.     Select the lines from the file that have three or fewer characters.
egrep -vn '....' testFile
4.     Count the number of blank lines in the file.
egrep -c '^$' testFile
5.     Count the number of nonblank lines in the file.
egrep -c '.' testFile
6.     Select the lines from the file that have the string UNIX.
fgrep 'UNIX' testFile
7.     Select the lines from the file that have only the string UNIX.
egrep '^UNIX$' testFile
8.     Select the lines from the file that have the pattern UNIX at least two times.
egrep 'UNIX.*UNIX' testFile
9.     Copy the file to the monitor but delete the blank lines.
egrep -v '^$' testFile
10.    Select the lines from the file that have at least two digits without any other characters in between.
egrep '[0-9][0-9]' testFile
11.    Select the lines from the file whose first nonblank character is A.
egrep '^ *A' testFile
12.    Select the lines from the file that do not start with A to G.
egrep -n '^[^A-G]' testFile
13.    Find out if John is currently logged into the system.
who | grep 'John'
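Several of these patterns can be checked against a small test file. The sketch below is only an illustration: the contents of testFile are invented, and grep -E stands in for egrep.

```shell
tmp=$(mktemp -d)
# Six lines: 'ab', 'abc', 'abcd', an empty line, 'UNIX', 'UNIX and UNIX'.
printf '%s\n' 'ab' 'abc' 'abcd' '' 'UNIX' 'UNIX and UNIX' > "$tmp/testFile"

exactly3=$(grep -Ec '^...$' "$tmp/testFile")      # exactly three characters
atmost3=$(grep -Evc '....' "$tmp/testFile")       # three or fewer characters
blank=$(grep -Ec '^$' "$tmp/testFile")            # blank lines
nonblank=$(grep -Ec '.' "$tmp/testFile")          # nonblank lines
twice=$(grep -Ec 'UNIX.*UNIX' "$tmp/testFile")    # UNIX at least twice

printf '%s %s %s %s %s\n' "$exactly3" "$atmost3" "$blank" "$nonblank" "$twice"
```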

Searching for File Content

Some modern operating systems allow us to search for a file based on a phrase contained in it. This is especially handy when we have forgotten the filename but know that it contains a specific expression or set of words. Although UNIX doesn’t have this capability, we can use the grep family to accomplish the same thing.

Search a Specific Directory

When we know the directory that contains the file, we can simply use grep by itself. For example, to find a list of all files in the current directory that contain “bash”, we should use the search as below. The -l option prints out the filename of any file that has at least one line that matches the grep expression.
students@firewall:~/test$ ls
bash.txt     cmpFile1       fgLoop.scr  file3         result.txt
biju.txt     cmpFile2       file1       goodStudents
censusFixed  dastardly.txt  file2       mylist
students@firewall:~/test$ grep -l 'bash' *
bash.txt
biju.txt
fgLoop.scr

Search All Directories in a Path

When we don’t know where the file is located, we must use the find command with the execute criterion. The find command begins by executing the specified command, in this case a grep search, using each file in the starting directory (here the home directory, ~). It then moves through the subdirectories of that directory, applying the grep command. After each directory, it processes its subdirectories until all directories have been processed. The semicolon that ends the -exec command is escaped with a backslash so that the shell passes it on to find.
students@firewall:~/test$ find ~ -type f -exec grep -l "bash" {} \;
/home/students/.bash_logout
/home/students/passwd
/home/students/test/fgLoop.scr
/home/students/test/bash.txt
/home/students/test/biju.txt
/home/students/.profile
/home/students/.bash_history
/home/students/assg.txt
/home/students/pd2.txt
/home/students/sort.txt
/home/students/.bashrc
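The same search can be tried safely on a throwaway directory tree. The sketch below is an illustration under stated assumptions: the directory layout and file names are invented, and the command is ended with + instead of \; so that find passes many file names to each grep invocation rather than starting one grep per file.

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/test/sub"
echo 'uses bash' > "$tmp/test/a.txt"
echo 'nothing here' > "$tmp/test/b.txt"
echo '#!/bin/bash' > "$tmp/test/sub/c.scr"

# List every regular file below the scratch directory that contains "bash".
found=$(cd "$tmp" && find . -type f -exec grep -l 'bash' {} + | LC_ALL=C sort)
printf '%s\n' "$found"
```

Both forms list the same files; the + form usually runs faster on large trees.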

sed

sed is an acronym for stream editor. Although the name implies editing, it is not a true editor; it does not change anything in the original file. Rather, sed scans the input file, line by line, and applies a list of instructions (called a sed script) to each line in the input file. The script, which is usually a separate file, can be included in the sed command line if it is a one-line command. The sed utility has three useful options. Option -n suppresses the automatic output. It allows us to write scripts in which we control the printing. Option -f indicates that there is a script file, which immediately follows on the command line. The third option, -e, is the default. It indicates that the script is on the command line, not in a file.

Scripts

The sed utility is called like any other utility. In addition to input data, sed also requires one or more instructions that provide editing criteria. When there is only one command, it may be entered from the keyboard. Most of the time, instructions are placed in a file known as a sed script (program). Each instruction in a sed script contains an address and a command.

Script Formats

When the script fits in a few lines, its instructions can be included in the command line. The script must be enclosed in quotes. For longer scripts, or for scripts that are going to be executed repeatedly over time, a separate script file is preferred. The file is created with a text editor and saved. We may give an extension .sed to indicate that it is a sed script. Examples of both are given below:
            $ sed -e 'address command' input_file
            $ sed -f script.sed input_file

Instruction Format

Each instruction consists of an address and a command.
            address[!]command
The address selects the line to be processed (or not processed) by the command. The exclamation point (!) is an optional address complement. When it is not present, a line must match the address to be selected. When the complement operator is present, any line that does not match the address is selected; lines that match the address are skipped. The command indicates the action that sed is to apply to each input line that matches the address.

Comments

A comment is a script line that documents or explains one or more instructions in a script. It is provided to assist the reader and is ignored by sed. Comment lines begin with a comment token, which is the pound sign (#). If the comment requires more than one line, each line must start with the comment token.
            # This line is a comment
            2,14s/A/B/
            30d
            42d

Operation

Each line in the input file is given a line number by sed. This number can be used to address lines in the text. For each line, sed performs the following operations:
1.     Copies an input line to the pattern space. The pattern space is a special buffer capable of holding one or more text lines for processing.
2.     Applies all the instructions in the script, one by one, to all pattern space lines that match the specified addresses in the instruction.
3.     Copies the contents of the pattern space to the output file unless directed not to by the -n option flag.
sed does not change the input file. All modified output is written to standard output; to be saved, it must be redirected to a file.
When all of the commands have been processed, sed repeats the cycle starting with 1. When you examine this process carefully, you will note that there are two loops in this processing cycle. One loop processes all of the instructions against the current line. The second loop processes all lines.
A second buffer, the hold space, is available to temporarily store one or more lines as directed by the sed instructions.
To fully understand how sed interacts with the input file, let’s look at the example. The data file hello.dat contains the following text:
Hello friends
Hello guests
Hello students
Welcome
And the scripts file hello.sed contains the following lines:
1,3s/Hello/Greetings/
2,3s/friends/buddies/
Now executing the command:
$ sed -f hello.sed hello.dat
Let’s go through the different steps. Each line of the input file is copied over to the pattern space, and then all instructions are applied and then output.
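The cycle can be traced by running the example. In this sketch the scratch directory is an assumption; hello.dat and hello.sed hold exactly the text given above.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'Hello friends' 'Hello guests' 'Hello students' 'Welcome' > "$tmp/hello.dat"
printf '%s\n' '1,3s/Hello/Greetings/' '2,3s/friends/buddies/' > "$tmp/hello.sed"

# Every input line passes through both instructions before it is printed.
out=$(sed -f "$tmp/hello.sed" "$tmp/hello.dat")
printf '%s\n' "$out"
```

The first instruction changes lines 1 to 3. The second instruction is applied to lines 2 and 3 but changes nothing, because neither line contains friends; line 1 does contain it but lies outside the 2,3 range. Line 4 matches neither address and is copied unchanged.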

Addresses

The address in an instruction determines which lines in the input file are to be processed by the commands in the instruction. Addresses in sed can be one of four types: single line, set of lines, range of lines and nested addresses.

Single-Line Addresses

A single-line address specifies one and only one line in the input file. There are two single-line formats: a line number or a dollar sign ($), which specifies the last line in the input file. Examples are:
4command
16command
$command

Set-of-Line Addresses

A set-of-line address is a regular expression that may match zero or more lines, not necessarily consecutive, in the input file. The regular expression is written between two slashes. Any line in the input file that matches the regular expression is processed by the instruction command. Two important points need to be noted: First, the regular expression may match several lines that may or may not be consecutive. Second, even if a line matches, the instruction may not find the data to be replaced. Examples are:
/^A/command
/B$/command
A special case of a set-of-line address is the every-line address. When the regular expression is missing, every line is selected. In other words, when there is no address, every line matches.

Range Addresses

An address range defines a set of consecutive lines. Its format is start address, comma with no space, and end address:
            start-address,end-address
The start and end address can be a sed line number or a regular expression as in the next example:
line-number,line-number
line-number,/regexp/
/regexp/,line-number
/regexp/,/regexp/
When a line that is in the pattern space matches a start range, it is selected for processing. At this point, sed notes that the instruction is in a range. Each input line is processed by the instruction’s command until the stop address matches a line. The line that matches the stop address is also processed by the command, but at that point, the range is no longer active. If at some future line the start range again matches, the range is again active until a stop address is found. Two important points need to be noted: First, while a range is active, all other instructions are also checked to determine if any of them also match an address. Second, more than one range may be active at a time. Examples are given below:
Input lines:
A-----
B-----
C-----
D-----
B-----
C-----
A-----
A-----
B-----
C-----
A-----
C-----
Range addresses:
3,/^A/
/^A/,/^B/
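The two range addresses can be traced with a sketch that prints only the selected lines. The twelve input lines begin with the same letters as the sample above (A, B, C, D, B, C, A, A, B, C, A, C); the line number appended to each letter and the file name are assumptions added so the selected lines are easy to identify.

```shell
tmp=$(mktemp -d)
printf '%s\n' A1 B2 C3 D4 B5 C6 A7 A8 B9 C10 A11 C12 > "$tmp/lines"

# From line 3 up to and including the next line that starts with A.
r1=$(sed -n '3,/^A/p' "$tmp/lines")

# From each line that starts with A up to and including the next line that starts with B.
r2=$(sed -n '/^A/,/^B/p' "$tmp/lines")

printf '%s\n' "$r1" '--' "$r2"
```

The first address selects lines 3 through 7 once. The second becomes active three times: lines 1 to 2, lines 7 to 9, and from line 11 to the end of the input, because no later line starting with B closes the range.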
A special case of range address is 1,$, which defines every line from the first line (1) to the last line ($). However, this special case address is not the same as the set-of-lines special case address, which is no address. Given the following two addresses:
            1. command            2. 1,$command
sed interprets the first as a set-of-line address and the second as a range address. Some commands, such as insert (i) and append (a), can be used only with a set-of-line address. These commands accept no address but do not accept 1,$ addresses.

Nested Addresses

A nested address is an address that is contained within another address. While the outer (first) address range, by definition, must be either a set of lines or an address range, the nested address may be either a single line, a set of lines or another range.
Let’s look at two examples. In the first example, we want to delete all blank lines between lines 20 and 30. The first command specifies the line range: it is the outer command. The second command, which is enclosed in braces, contains the regular expression for a blank line. It contains the nested address.
            20,30{
                        /^$/d
                        }
In the second example, we want to delete all lines that contain the word Raven, but only if the line also contains the word Quoth. In this case, the outer address searches for lines containing Raven, while the inner address looks for lines containing Quoth. What is especially interesting about this example is that the outer address is not a block of lines but a set of lines spread throughout the file.
            /Raven/{
                        /Quoth/d
                        }
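The second nested address can be tried on a few lines of text. This is a minimal sketch; the three sample lines and the file name are invented for illustration.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'Quoth the Raven, Nevermore.' \
    'The Raven sat still' \
    'Quoth the narrator' > "$tmp/raven.txt"

# Delete a line only when it contains both Raven (outer) and Quoth (inner).
out=$(sed '/Raven/{
/Quoth/d
}' "$tmp/raven.txt")
printf '%s\n' "$out"
```

Only the first line satisfies both addresses, so it is the only one deleted.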

Commands

There are 25 commands that can be used in an instruction. They may be grouped into nine categories based on how they perform their task. They are Line Number commands, Modify commands, Substitute commands, Transform commands, Input/Output commands, Files commands, Branch commands, Hold space commands and Quit commands.

 

Line Number Command

The line number command (=) writes the current line number to the output before the line itself is written, without affecting the pattern space. It is similar to the grep -n option. The only difference is that the line number is written on a separate line. The following example shows the usage. Note that this example uses the special case of the set-of-line address: there is no address, so the command applies to every line.
$ sed '=' TheRavenV1
1
Once upon a midnight dreary, while I pondered, weak and weary
2
Over many a quaint and curious volume of forgotten lore
3
While I nodded, nearly napping, suddenly there came a tapping
4
As of someone gently rapping, rapping at my chamber door.
5
“’Tis some visitor,” I muttered, “tapping at my chamber door
6
Only this and nothing more.”
In the next example, we print only the line numbers of lines beginning with an uppercase O. To do this we must use the -n option.
$ sed -n '/^O/=' TheRavenV1
2
6

Modify Commands

Modify commands are used to insert, append, change or delete one or more whole lines. The modify commands require that any text associated with them be placed on the next line in the script. Therefore, the script must be in a file; it cannot be coded on the shell command line. Also, the modify commands operate on the whole line. In other words, they are line replacement commands. This means that we can’t use these sed commands to insert text into the middle of a line. With the change command, whatever text you supply will completely replace any lines that match the address.

Insert Command (i)

Insert adds one or more lines directly to the output before the address. This command can only be used with the single line and a set of lines; it cannot be used with a range. In the next example, we insert a title at the beginning of Poe’s “The Raven”.
$ sed -f insertTitle.sed TheRavenV1 | cat -n
Script:
# Script Name: insertTitle.sed
# Adds a title to file
1i\
                        The Raven\
                                    By\
                        Edgar Allan Poe
Output:
1:                    The Raven
2:                                By
3:                    Edgar Allan Poe
4: Once upon a midnight dreary, …
If you use the insert command with the all lines address, the lines are inserted before every line in the file. This is an easy way to quickly double space a file.
$ sed -f insertBlankLines.sed TheRavenV1
# Script Name: insertBlankLines.sed
# This script inserts a blank line before all lines in a file
i\

# End of script

Append Command (a)

Append is similar to the insert command except that it writes the text directly to the output after the specified line. Like insert, append cannot be used with a range address. Inserted and appended text never appear in sed’s pattern space. They are written to the output before the specified line (insert) or after the specified line (append), even if the pattern space is not itself written. Because they are not inserted into the pattern space, they cannot match a regular expression, nor do they affect sed’s internal line counter. The following example demonstrates the append command by appending a dashed line separator after every line and “The End” after the last line of “The Raven”.
$ sed -f appendLineSep.sed TheRavenV1
# Script Name: appendLineSep.sed
# This script appends dashed dividers after each line
a\
---------------------------------
$a\
                                    The End

Change Command (c)

Change replaces a matched line with new text. Unlike insert and append, it accepts all four address types. In the next example, we replace the second line of Poe’s classic with a common thought expressed by many a weary calculus student.
$ sed -f change.sed TheRavenV1
# Script Name: change.sed
# Replace second line of The Raven
2c\
Over many an obscure and meaningless problem of calculus bore
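The three whole-line commands can be seen side by side in one short script. This is a sketch under stated assumptions: the input file, the script name modify.sed and the inserted text are all invented for illustration. Each command ends with a backslash and its text follows on the next line.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'line one' 'line two' 'line three' > "$tmp/in.txt"

# Quoting the heredoc delimiter keeps the backslashes and the $ literal.
cat > "$tmp/modify.sed" <<'EOF'
1i\
TITLE
2c\
line two, replaced
$a\
THE END
EOF

out=$(sed -f "$tmp/modify.sed" "$tmp/in.txt")
printf '%s\n' "$out"
```

The title is written before line 1, line 2 is replaced as a whole, and the closing text is written after the last line.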

Delete Pattern Space Command (d)

The delete command comes in two versions. When a lowercase delete command (d) is used, it deletes the entire pattern space. Any script commands following the delete command that also pertain to the deleted text are ignored because the text is no longer in the pattern space.
$ sed '/^O/d' TheRavenV1

Delete Only First Line Command (D)

When an uppercase delete command (D) is used, only the first line of the pattern space is deleted. Of course, if there is only one line in the pattern space, the effect is the same as the lowercase delete.

Substitute Command (s)

Pattern substitution is one of the most powerful commands in sed. In general, substitute replaces text that is selected by a regular expression with a replacement string. Thus, it is similar to the search and replace found in text editors. With it, we can add, delete or change text in one or more lines.
            address s/pattern/replacement string/flag(s)

 

Search Pattern

The sed search pattern uses only a subset of the regular expression atoms and patterns. The allowable atoms and operators are shown below.
Atoms: Character, Dot, Class, Anchors (^ $), Back Reference
Operators: Sequence, Repetition (* ? {…}), Alternation, Group, Save
When a text line is selected, its text is matched to the pattern. If matching text is found, it is replaced by the replacement string. The pattern and replacement strings are separated by a triplet of identical delimiters, slashes (/) in the example given before. Any character can be used as the delimiters, although the slash is the most common.

Pattern Matches Address

If the address contains a regular expression that is the same as the pattern we want to match, that is a special case. Here, we don’t need to repeat the regular expression in the substitute command. We do need to show that it is omitted, however, by coding two slashes at the beginning of the pattern. An example is given below.
$ sed '/love/s//adore/' browning.txt
Input:
How do I love thee? Let me count the ways.
I love thee to the depth and breadth and height
My soul can reach, when feeling out of sight
For the ends of being and ideal grace.
I love thee to the level of everyday’s
Most quiet need, by sun and candle-light.
Output:
How do I adore thee? Let me count the ways.
I adore thee to the depth and breadth and height
My soul can reach, when feeling out of sight
For the ends of being and ideal grace.
I adore thee to the level of everyday’s
Most quiet need, by sun and candle-light.

Replace String

The replacement text is a string. Only one atom and two meta-characters can be used in the replacement string. The allowed replacement atom is the back reference. The two meta-character tokens are the ampersand (&) and the backslash (\). The ampersand is used to place the matched text in the replacement string; the backslash is used to escape an ampersand when it needs to be included in the substitute text (if it is not escaped, it will be replaced by the matched text). The following example shows how the meta-characters are used. In the first example, the replacement string becomes *** UNIX ***. In the second example, the replacement string is now & forever.
$ sed 's/UNIX/*** & ***/' file1
$ sed '/now/s//now \& forever/' file1
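The effect of escaping the ampersand is easiest to see by running both forms. In this sketch the input text is supplied on standard input instead of from file1; the sample strings are assumptions for illustration.

```shell
# & in the replacement stands for the text that the pattern matched.
plain=$(printf '%s\n' 'I use UNIX daily' | sed 's/UNIX/*** & ***/')

# \& produces a literal ampersand.
escaped=$(printf '%s\n' 'now' | sed '/now/s//now \& forever/')

# Without the backslash, & is again replaced by the matched text.
unescaped=$(printf '%s\n' 'now' | sed '/now/s//now & forever/')

printf '%s\n' "$plain" "$escaped" "$unescaped"
```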

Substitute Operation

As we have seen before, when the pattern matches the text, sed first deletes the text and then inserts the replacement text. This means that we can use the substitute command to add, delete or replace part of a line.
Delete Part of a Line: To delete part of a line, we leave the replacement text empty. In other words, partial line deletes are a special substitution case in which the replacement is null. The following example deletes all digits in the input from standard input.
$ sed 's/[0-9]//g'
Usually the substitute command operates only on the first occurrence of a pattern in a line. In the above example, we wanted to delete all digits. Therefore, we used the global flag (g) at the end of the command. If we did not use it, only the first digit on each line would be deleted.
Change Part of a Line: To change only part of a line, we create a pattern that matches the part to be changed and then place the new text in the replacement expression. In the following example, we change every space in the file to a tab. The replacement between the second and third slashes is a single tab character.
$ sed 's/ /     /g'
Input:
Now is the time
For all good students
To come to the aid
of their college.
Output:
Now     is      the     time
For     all     good    students
To      come    to      the     aid
of      their   college.
Add to Part of a Line: To add text to a line requires both a pattern to locate the text and the text that is to be added. Because the pattern deletes the text, we must include it in the new text.
The next example adds two spaces at the beginning of each line and two dashes at the end of each line.
$ sed -f addPart.sed
#!/bin/ksh
# Script Name: addPart.sed
# Adds two spaces at the beginning of line and -- to end of line
s/^/  /
s/$/--/

Back References

The examples in the previous section were all very simple and straightforward. More often, we find that we must restore the data that we deleted in the search. This problem is solved with the regular expression tools as demonstrated. The sed utility uses two different back references in the substitution replacement string: whole pattern (&) and numbered buffer (\d). The whole pattern substitutes the deleted text into the replacement string. In numbered buffer replacement, whenever a parenthesized part of a regular expression matches text, the text is placed sequentially in one of the nine buffers. Numbered buffer replacement (\d), in which the d is a number between 1 and 9, substitutes the numbered buffer contents in the replacement string.
            s/-----/-----&-----/
            s/\(-----\)...\(-----\)/------\1------\2/
Whole Pattern Substitution: When a pattern substitution command matches text in the pattern space, the matched text is automatically saved in a buffer (&). We can then retrieve its contents and insert it anywhere, and as many times as needed, into the replacement string. Using the & buffer therefore allows us to match text, which automatically deletes it, and then restore it so that it is not lost. As an example, we can rewrite the previous example of adding two spaces at the beginning of line and two dashes at the end of the line with a single command.
$ sed 's/^.*$/  &--/'
Another example is to modify the price list of a restaurant menu so that the $ symbol is added before the prices.
$ sed 's/[0-9]/$&/' priceFile
Numbered Buffer Substitution: Numbered buffer substitution uses one or more of the regular expression numbered buffers. We use it when the pattern matches part of the input text but not all of it. As an example, let’s write a script that reformats a social security number with dashes. We assume that all nine-digit numbers are social security numbers. There are three parts to a social security number: three digits-two digits-four digits. This problem requires that we find them and reformat them. Our script uses a search pattern that uses the numbered buffers to save three consecutive digits followed by two digits and then four digits. Once a complete match is found, the numbered buffers are used to reformat the numbers.
$ sed 's/\([0-9]\{3\}\)\([0-9]\{2\}\)\([0-9]\{4\}\)/\1-\2-\3/' empFile
Input:
George Washington       001010001
John Adams              002020002
Thomas Jefferson        003030003
James Madison           123456789
Output:
George Washington       001-01-0001
John Adams              002-02-0002
Thomas Jefferson        003-03-0003
James Madison           123-45-6789
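The numbered buffers can be checked on a single record. This is a minimal sketch that feeds one line on standard input instead of reading empFile; the escaped parentheses \( \) save each group and the escaped braces \{ \} give the repetition counts, as basic regular expressions require.

```shell
line='James Madison           123456789'

# \1, \2 and \3 hold the three-, two- and four-digit groups.
out=$(printf '%s\n' "$line" |
    sed 's/\([0-9]\{3\}\)\([0-9]\{2\}\)\([0-9]\{4\}\)/\1-\2-\3/')
printf '%s\n' "$out"
```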

Substitute Flags

There are four flags that can be added at the end of the substitute command to modify its behaviour: global substitution (g), specific occurrence substitution (digit), print (p) and write file (w file-name).

Global Flag

By default, the substitute command replaces only the first occurrence of the pattern on each line; any later occurrences on that line are left unchanged. The global flag (g) replaces every occurrence.
root@firewall:/var/log# sed 's/cat/dog/'
Mary had a black cat and a white cat.
Mary had a black dog and a white cat.
root@firewall:/var/log# sed 's/cat/dog/g'
Mary had a black cat and a white cat.
Mary had a black dog and a white dog.

Specific Occurrence Flag

We now know how to change the first occurrence and all of the occurrences of a text pattern. Specific occurrence substitution (digit) changes any single occurrence of text that matches the pattern. The digit indicates which one to change. To change the second occurrence of a pattern, we use 2; to change the fifth, we use 5.
root@firewall:/var/log# sed 's/cat/dog/2'
Mary had a black cat, a yellow cat and a white cat.
Mary had a black cat, a yellow dog and a white cat.
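To compare the flags side by side, the sketch below applies three variants of the same substitution to one sample line (the line itself is an assumption chosen so that the difference is easy to see).

```shell
line='cat cat cat cat'
first=$(printf '%s\n' "$line" | sed 's/cat/dog/')    # no flag: first occurrence only
third=$(printf '%s\n' "$line" | sed 's/cat/dog/3')   # digit flag: third occurrence only
every=$(printf '%s\n' "$line" | sed 's/cat/dog/g')   # g flag: every occurrence
printf '%s\n' "$first" "$third" "$every"
```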

 

Print Flag

There are occasions when we do not want to print all of the output. For example, when developing a script, it helps to view only the lines that have been changed. To control the printing from within a script, we must first turn off the automatic printing. This is done with the -n option. Once the automatic printing has been turned off, we can add a print flag to the substitution command.
A line of ls -l output has the following form:
-rw-r--r-- 1 root root       170 Feb 13 10:12 Feb2013
The command below keeps only the permissions and the file name of each regular file:
root@firewall:~# ls -l | sed -n "/^-/s/\(-[^ ]*\).*:..\(.*\)/\1\2/p"
-rw-r--r-- Feb2013
-rw-r--r-- fileList
-rw-r--r-- ifconfig.txt
-rw-r--r-- iptables_1.lst
-rw-r--r-- iptables_2.lst
-rw-r--r-- iptables.lst
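A smaller sketch of the same idea, using three assumed sample lines instead of a directory listing: with -n and the p flag, only the line on which the substitution succeeded is printed.

```shell
# -n turns off automatic printing; the p flag prints a line only when
# the substitution on that line was made.
changed=$(printf 'apple\nbanana\ncherry\n' | sed -n 's/an/AN/p')
printf '%s\n' "$changed"
```

Only the banana line contains the pattern, so it is the only line in the output.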

Write File Flag

The write file flag is similar to the print flag. The only difference is that the changed lines are written to a file rather than to standard output. One caution: there can be only one space between the flag and the file name. To write the lines selected in the previous example to a file, we replace the p flag with w followed by the file name.
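As a hedged sketch of the write file flag, the commands below send the changed lines to a file instead of the screen. The sample input, the scratch directory created with mktemp and the file name changed.txt are all assumptions made for this illustration.

```shell
workdir=$(mktemp -d)    # scratch directory used only for this example

# Same substitution as with the p flag, but the changed line goes to a file.
# Exactly one space separates the w flag from the file name.
printf 'apple\nbanana\ncherry\n' |
  sed -n "s/an/AN/w $workdir/changed.txt"

saved=$(cat "$workdir/changed.txt")
printf '%s\n' "$saved"
rm -r "$workdir"
```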

Transform Command (y)

It is sometimes necessary to transform one set of characters to another. For example, IBM mainframe text files are written in a coding system known as EBCDIC. In EBCDIC, the binary codes for characters are different from ASCII. To read an EBCDIC file, therefore, all characters must be transformed to their ASCII equivalents as the file is read. The transform command (y) requires two parallel sets of characters of equal length. Each character in the first string represents a value to be changed to its corresponding character in the second string. Another example is to translate the lowercase vowels to the uppercase vowels, as shown below:
root@firewall:~# sed 'y/aeiou/AEIOU/'
A good time was had by all Under the Harvest Moon last September.
A gOOd tImE wAs hAd by All UndEr thE HArvEst MOOn lAst SEptEmbEr.
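The same command can convert a whole line to uppercase when the two strings list the complete alphabet. This is a minimal sketch with an assumed sample line.

```shell
# Each letter in the first string is replaced by the letter at the same
# position in the second string; all other characters pass through unchanged.
upper=$(printf 'hello world\n' |
  sed 'y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/')
printf '%s\n' "$upper"
```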

Input and Output Commands

The sed utility automatically reads text from the input file and writes data to the output file, usually standard output. In this section, we discuss commands that allow us to control the input and output more fully. There are five input/output commands: next (n), append next (N), print (p), print first line (P) and list (l).

Next Command (n)

The next command (n) forces sed to read the next text line from the input file. Before reading the next line, however, it copies the current contents of the pattern space to the output (unless automatic printing has been turned off), deletes the current text in the pattern space, and then refills it with the next input line. After reading the input line, it continues processing through the script. In the next example, we use the next command to force data to be read. Whenever a line that starts with a digit is immediately followed by a blank line, we delete the blank line.
students@firewall:~/test$ cat deleteBlankLines.sed
# Script Name: deleteBlankLines.sed
# This script deletes blank lines only if the preceding line starts with a number
/^[0-9]/{
        n
        /^$/d
        }
students@firewall:~/test$ sed -f deleteBlankLines.sed deleteBlankLines.dat
Second Line: Line 1 & Line 3 blank
4th line followed by non-blank line
This is line 5
6th line followed by blank line
Last line (#8)
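Because the input file is not listed above, the sketch below runs the same script on inline sample data (the seven input lines are assumptions). The blank lines after "1 first" and "3 third" are removed, while the blank line after "second" survives.

```shell
result=$(printf '1 first\n\nsecond\n\n3 third\n\nlast\n' |
  sed '/^[0-9]/{
n
/^$/d
}')
printf '%s\n' "$result"
```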

Append Next Command (N)

Whereas the next command clears the pattern space before inputting the next line, the append next command (N) does not. Rather, it adds the next input line to the current contents of the pattern space, separated from them by a newline. This is especially useful when we need to apply patterns to two or more lines at the same time.
To demonstrate the append next command, we create a script that appends the second line to the first, the fourth to the third and so on until the end of the file. Note however that if we simply append the lines, when they are printed they will revert to two separate lines because there is a newline at the end of the first line. After we append the lines, therefore, we search for the newline and replace it with a space. The file consists of lines filled with the line number.
$ sed -f appendLines.sed appendLines.dat
# Script Name: appendLines.sed
# This script joins every two lines so that the output is Line1 Line2,
# Line3 Line4, etc.
N
s/\n/ /
Input:
11111one1111111111
22222two2222222222
33333three33333333
44444four444444444
55555five555555555
Output:
11111one1111111111 22222two2222222222
33333three33333333 44444four444444444
55555five555555555
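A compact sketch of the same technique on assumed sample data follows. It uses $!N rather than a bare N so that sed does not try to append a line after the last one; what a bare N does on the final line differs between sed implementations.

```shell
# $!N appends the next line unless we are already on the last line;
# s/\n/ / then turns the embedded newline into a space.
joined=$(printf 'one\ntwo\nthree\nfour\nfive\n' | sed '$!N;s/\n/ /')
printf '%s\n' "$joined"
```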
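Another interesting and very useful example replaces multiple blank lines with only one.
$ sed -f appendBlkLines.sed appendBlkLines.dat
/^$/{
            $!N
            /^\n$/D
            }
The $!N command is interpreted as "if the current line is not the last line, append the next line". When the pattern space then holds two blank lines (a lone newline), the D command deletes the first of them and restarts the script without reading new input.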
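The script can be tried on inline data as sketched below. The sample input is an assumption: it has three blank lines between a and b and one blank line between b and c.

```shell
squeezed=$(printf 'a\n\n\n\nb\n\nc\n' | sed '/^$/{
$!N
/^\n$/D
}')
printf '%s\n' "$squeezed"
```

Each run of blank lines in the input is reduced to a single blank line in the output.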

Print Command (p)

The print command (p) copies the current contents of the pattern space to the standard output file. If there are multiple lines in the pattern space, they are all copied. The contents of the pattern space are not deleted by the print command.
$ sed 'p' linesOfNums.dat
Input:
1111111111
2222222222
3333333333
Output:
1111111111
1111111111
2222222222
2222222222
3333333333
3333333333
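Every line appears twice above because the print command adds a copy to the automatic output. The sketch below, on assumed sample data, contrasts an addressed p with and without the -n option.

```shell
# With -n, only the explicitly printed line appears.
second=$(printf '1111\n2222\n3333\n' | sed -n '2p')
# Without -n, line 2 is printed by p and again by the automatic print.
twice=$(printf '1111\n2222\n3333\n' | sed '2p')
printf '%s\n' "$second" "$twice"
```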

Print First Line Command (P)

Whereas the print command prints the entire contents of the pattern space, the print first line command (P) prints only the first line. That is, it prints the contents of the pattern space up to and including a newline character. Any text following the first newline is not printed.
To demonstrate the print first line, let’s write a script that prints a line only if it is followed by a line that begins with a tab. This problem requires that we first append two lines in the pattern space. We then search the pattern space for a newline immediately followed by a tab. If we find this combination, we print only the first line. We then delete the first line only.
$ sed -nf printFirstLine.sed printFirstLine.dat
# Script Name: printFirstLine.sed
# \t matches a tab in GNU sed; other versions may need a literal tab character
$!N
/\n\t/P
D
Input:
This is line1.
This is line2.
        Line 3 starts with a tab.
        Line 4 starts with a tab.
This is line 5. It's the last line.
Output:
This is line2.
        Line 3 starts with a tab.
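The difference between p and P is easiest to see when the pattern space holds exactly two lines. In this sketch (the four input lines are assumptions), $!N pairs the lines; p prints both lines of each pair, whereas P prints only the first.

```shell
pair_all=$(printf 'a\nb\nc\nd\n' | sed -n '$!N;p')     # whole pattern space
pair_first=$(printf 'a\nb\nc\nd\n' | sed -n '$!N;P')   # up to the first newline
printf '%s\n' "$pair_all" "$pair_first"
```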

List Command (l)

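The list command (l) prints the contents of the pattern space in an unambiguous form: unprintable characters are shown as escape sequences or as their octal codes, and the end of each line is marked with a $.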
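A minimal sketch of the list command on an assumed sample line that contains a tab and a control character (octal 001):

```shell
# -n suppresses the normal output so that only the listing is shown.
listed=$(printf 'a\tb\001c\n' | sed -n 'l')
printf '%s\n' "$listed"
```

The tab is shown as \t, the control character as \001, and the end of the line as $.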

File Commands

There are two file commands that can be used to read and write files. The basic format of the read and write commands is shown below; exactly one space separates the command from the file name.
address r file-name
address w file-name

Read File Command (r)

The read file command (r) reads a file and places its contents in the output before moving to the next command. It is useful when you need to insert one or more common lines after text in a file. The contents of the file appear after the current line (pattern space) in the output. An example is to prepare a letter with a standard letterhead and signature.
$ sed -f readFile.sed readFile.dat
# Script Name: readFile.sed
1 r letterhead.dat
$ r signature.dat
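The sketch below recreates the example with small files in a scratch directory. The file contents and the name letter.dat are assumptions made for illustration. Note that the letterhead is placed after line 1 of the letter, because the read command always writes the file after the current line.

```shell
workdir=$(mktemp -d)    # scratch directory used only for this example
printf 'ACME Corp.\n'      > "$workdir/letterhead.dat"
printf 'Yours sincerely\n' > "$workdir/signature.dat"
printf 'Dear customer,\nThank you.\n' > "$workdir/letter.dat"

# 1 r ... inserts the letterhead after line 1; $ r ... adds the signature
# after the last line.
letter=$(cd "$workdir" &&
  sed -e '1 r letterhead.dat' -e '$ r signature.dat' letter.dat)
printf '%s\n' "$letter"
rm -r "$workdir"
```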

Write File Command (w)

The write file command (w) writes the contents of the pattern space to a file; each write made during a run is appended to the file rather than replacing what was written earlier. It is useful for saving selected data to a file. For example, let's create an activity log in which entries are grouped by days of the week. The end of each day is identified by a blank line. The first group of entries represents Monday's activity, the second group represents Tuesday and so forth. The first word in each activity line is the day of the week.
$ sed -nf writeFile.sed aptFile.dat
# Script Name: writeFile.sed
/Monday/,/^$/w Monday.dat
/Tuesday/,/^$/w Tuesday.dat
/Wednesday/,/^$/w Wednesday.dat
/Thursday/,/^$/w Thursday.dat
/Friday/,/^$/w Friday.dat
/Saturday/,/^$/w Saturday.dat
/Sunday/,/^$/w Sunday.dat
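A reduced sketch of the same idea with two days and three assumed log lines is shown below. The scratch directory and the sample entries are assumptions; the point is that both Monday entries end up in the same file.

```shell
workdir=$(mktemp -d)    # scratch directory used only for this example

# Each matching line is appended to the file named after its day.
printf 'Monday gym\nTuesday dentist\nMonday shopping\n' |
  sed -n -e "/^Monday/w $workdir/Monday.dat" \
         -e "/^Tuesday/w $workdir/Tuesday.dat"

monday=$(cat "$workdir/Monday.dat")
tuesday=$(cat "$workdir/Tuesday.dat")
printf '%s\n' "$monday" "$tuesday"
rm -r "$workdir"
```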

Branch Commands

The branch commands change the regular flow of the commands in the script file. Recall that for every line in the file, sed runs through the script file applying commands that match the current pattern space text. At the end of the script file, the text in the pattern space is copied to the output file, and the next text line is read into the pattern space, replacing the old text. Occasionally we want to skip the application of the commands. The branch commands allow us to do just that: skip one or more commands in the script file.

Branch Label

Each branch command must have a target, which is either a label or the last instruction in the script (a blank label). A label consists of a line that begins with a colon (:) and is followed by up to seven characters that constitute the label name. There can be no other commands or text on the script-label line other than the colon and the label name. The label name must immediately follow the colon; there can be no space between the colon and the name, and the name cannot have embedded spaces. An example of the label is:
:comHere

Branch Command

The branch command (b) follows the normal instruction format consisting of an address, the command (b), and an attribute (target) that can be used to branch to the end of the script or to a specific location within the script. The target must be blank or match a script label in the script. If no label is provided, the branch is to the end of the script (after the last line), at which point the current contents of the pattern space are copied to the output file and the script is repeated for the next input line.
The next example demonstrates the basic branch command. It prints lines in a file once, twice or three times depending on a print control at the beginning of each line.
$ sed -f branch.sed branch.dat
# Script Name: branch.sed
# This script prints a line multiple times
/(1)/ b
/(2)/ b print2
/(3)/ b print3
# Branch to end of script
b
# print three
:print3
p
p
b
# print two
:print2
p
Input:
Print me once.
(2)Print me twice.
(3)Print me thrice.
(4)Print me once.
Output:
Print me once.
(2)Print me twice.
(2)Print me twice.
(3)Print me thrice.
(3)Print me thrice.
(3)Print me thrice.
(4)Print me once.
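A shorter sketch of an unconditional branch with a blank target follows (the two input lines are assumptions). Lines that start with # branch straight to the end of the script, so the substitution is skipped for them.

```shell
# /^#/b jumps to the end of the script for comment lines;
# every other line falls through to the substitution.
skipped=$(printf '# foo stays\nfoo changes\n' |
  sed -e '/^#/b' -e 's/foo/bar/')
printf '%s\n' "$skipped"
```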

Branch on Substitution Command

Rather than branch unconditionally, we may need to branch only if a substitution has been made. In this case, we use the branch on substitution command or, as it is also known, the test command (t). Its format is the same as that of the basic branch command.
students@firewall:~/test$ sed -f branchSub.sed branchSub.dat
# Script Name: branchSub.sed
# This script prints a line multiple times
s/(1)//
t
s/(2)//
t print2
s/(3)//
t print3
# Branch to end of script
b
# print three
:print3
p
p
b
# print two
:print2
p
Input:
(1)Print me once.
(2)Print me twice.
(3)Print me thrice.
Default: print me once.
Output:
Print me once.
Print me twice.
Print me twice.
Print me thrice.
Print me thrice.
Print me thrice.
Default: print me once.
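The following sketch shows the test command with a blank target on assumed sample data. If the first substitution succeeds, t branches to the end of the script and the second substitution is skipped; otherwise the line falls through and receives the "ok" prefix.

```shell
tested=$(printf 'ERROR: disk full\nboot complete\n' |
  sed -e 's/^ERROR: /FAILED: /' -e 't' -e 's/^/ok /')
printf '%s\n' "$tested"
```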

Hold Space Commands

In addition to the pattern space, sed has a second buffer, the hold space, which can be used for saving the contents of the pattern space. There are five commands that are used to move text back and forth between the pattern space and hold space: hold and destroy (h), hold and append (H), get and destroy (g), get and append (G) and exchange (x).

Hold and Destroy Command

The hold and destroy command (h) copies the current contents of the pattern space to the hold space and destroys any text currently in the hold space.

Hold and Append Command

The hold and append command (H) appends the current contents of the pattern space to the hold space.

Get and Destroy Command

The get and destroy command (g) copies the text in the hold space to the pattern space and destroys any text currently in the pattern space.

Get and Append Command

The get and append command (G) appends the current contents of the hold space to the pattern space.
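Two short sketches of the hold and get commands on assumed three-line input follow. The first reverses the order of the lines with G and h; the second moves the first line to the end with h and g.

```shell
# Reverse the lines: on every line but the first, append what has been held
# so far (G), then hold the combined text (h); print it at the last line.
reversed=$(printf 'a\nb\nc\n' | sed -n '1!G;h;$p')

# Move line 1 to the end: hold it and delete it, print the other lines,
# then fetch the held line back (g) after the last line and print it.
rotated=$(printf 'a\nb\nc\n' | sed -n -e '1{h;d;}' -e 'p' -e '${g;p;}')
printf '%s\n' "$reversed" "$rotated"
```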

Exchange Command

The exchange command (x) swaps the text in the pattern and hold spaces. That is, the text in the pattern space is moved to the hold space and the text that was in the hold space is moved to the pattern space.
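As a sketch of the exchange command, the script below prints the line that comes immediately before a matching line (the input lines and the pattern c are assumptions). Every line is saved in the hold space with h; when a line matches, x brings the saved previous line into the pattern space for printing, and a second x swaps the two spaces back.

```shell
# For a matching line: swap in the previous line, print it, swap back.
# Then, for every line, remember the current line for the next cycle.
before=$(printf 'a\nb\nc\nd\n' | sed -n -e '/c/{x;p;x;}' -e 'h')
printf '%s\n' "$before"
```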
Afsal is an experienced full stack developer, specialized in desktop application development. He also regularly publishes quality tutorials on his YouTube channel named 'Genuine Coder'. He likes to contribute to open source projects and is always enthusiastic about open source initiatives.