cis52-File Manipulation Utilities

CIS52 – File Manipulation
File Manipulation Utilities
Regular Expressions
sed, awk
© 2001 John Urrutia. All rights reserved.
Overview
comm – comparison of sorted files
cut – output sections of lines in a file
find – find files that match a pattern
paste – merges records in files
pr – paginate files into pages
tr – translate or delete characters
Overview
regular expressions
sed – Stream Editor (batch file editor)
awk – Aho, Weinberger, Kernighan (pattern matching)
The comm before the storm
Compares 2 sorted files
Results reported in 3 columns
1st – records found only in file 1
2nd – records found only in file 2
3rd – records that match in both files
Options suppress the corresponding
columns
 -1, -2, -3 (can be combined, e.g. -12)
comm – cont.
Either file name can be substituted
with standard input
Example:
File1
aa
dd
ee
gg
hh
File2
bb
cc
dd
ee
ff
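comm results
comm -12 File1 File2 (records in both files)
dd
ee
comm -1 File1 File2 (only in File2; records in both are tab-indented)
bb
cc
	dd
	ee
ff
comm -2 File1 File2 (only in File1; records in both are tab-indented)
aa
	dd
	ee
gg
hh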
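A minimal sketch of the comm options, assuming the File1 and File2 contents from the example slide; the scratch directory and the variable names are illustrative only.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the two sorted example files.
printf '%s\n' aa dd ee gg hh > File1
printf '%s\n' bb cc dd ee ff > File2

# -12 suppresses columns 1 and 2, leaving the records found in both files.
both=$(comm -12 File1 File2)
# -23 leaves only column 1: records found only in File1.
only1=$(comm -23 File1 File2)
# "-" substitutes standard input for the second file name.
only2=$(sort File2 | comm -13 File1 -)

printf 'both:\n%s\nonly in File1:\n%s\nonly in File2:\n%s\n' "$both" "$only1" "$only2"
```

Both inputs must already be sorted; comm does not sort them itself.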
cut to the chase
Allows you to extract portions of
each record in a file.
Delimits data in the file into fields or
columns.
Default delimiter is the tab character
Can be changed with the -d option
cut cont.
cut -b list | -c list | -f list [-d char] [-s]
[--output-delimiter=string]
-b – bytes
-c – characters (same as bytes for single-byte characters)
-f – fields
-d – delimiter character
-s – display only records that contain the
delimiter
cut ! print
char – single byte used to delimit
fields in a record
list – list of characters, bytes, or fields to
display
Entries and ranges are comma separated.
1-7 first 7 characters in record
1,7 first and seventh characters
cut ! print again
string – characters written in place of
the delimiter on output (--output-delimiter).
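The -d, -s and --output-delimiter options can be tried together on a small sample. This is a sketch only: the colon-delimited data is made up for illustration, and --output-delimiter is assumed to be available (it is a GNU cut extension).

```shell
# Hypothetical colon-delimited sample data (not from the slides).
data='root:x:0:0:root:/root:/bin/sh
no delimiter on this line
uid:x:500:500:Student:/home/uid:/bin/bash'

# -d: sets the delimiter and -f1,7 selects fields 1 and 7.
# Records without a delimiter are passed through unchanged by default.
with_plain=$(printf '%s\n' "$data" | cut -d: -f1,7)

# -s drops records that contain no delimiter.
only_delimited=$(printf '%s\n' "$data" | cut -d: -f1,7 -s)

# GNU extension: write a different string between the selected fields.
arrowed=$(printf '%s\n' "$data" | cut -d: -f1,7 -s --output-delimiter=' -> ')

printf '%s\n---\n%s\n---\n%s\n' "$with_plain" "$only_delimited" "$arrowed"
```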
cut - Example
[/@linux2 uid]$ cat file1
The quick brown fox eyed the jactitating dog
[/@linux2 uid]$ cut -f1,3,5,8 -d' ' file1
The brown eyed dog
[/@linux2 uid]$ cut -f1,4-6,8 -d' ' file1
The fox eyed the dog
find that pot of gold
find – selects all files that meet the
selection criteria in the expression
No action is taken unless it is specified
Sub-directories are scanned
automatically
The expression can be simple or
complex
find me something
The criteria expression:
 ANDs operands separated by a
space
ORs operands separated by -o
Processes left to right sequentially
find criteria continued
Actions
-print prints the path of every file that
meets the selection criteria
-exec cmd \; executes the
command that precedes the \; for each selected file
-ok same as -exec, but prompts first and runs the
command only on a y response from stdin.
find criteria continued again
Evaluations
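-type c selects files of type c (e.g. d for directory)
-atime ±n accessed more than (+n) or less than (-n) n days ago.
-mtime ±n modified more than (+n) or less than (-n) n days ago.
-user name file is owned by user name (or numeric uid)
-nouser the file's uid is not known to the
system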
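The criteria and actions above can be combined as in the following sketch. The directory layout is invented for the example, and -name (a standard find test that matches the file name against a shell pattern) is not listed on the slides.

```shell
# Build a small throwaway tree to search (hypothetical names).
workdir=$(mktemp -d)
mkdir -p "$workdir/docs/old"
touch "$workdir/notes.txt" "$workdir/docs/report.txt" "$workdir/docs/old/draft.bak"

# Operands separated by a space are ANDed: regular files AND names ending in .txt.
txt_files=$(find "$workdir" -type f -name '*.txt' -print | sort)

# -o ORs two tests; the escaped parentheses group them before the AND with -type f.
txt_or_bak=$(find "$workdir" -type f \( -name '*.txt' -o -name '*.bak' \) -print | sort)

# -mtime -1 selects files modified less than one day ago.
# -exec runs the command once per file; {} is replaced by the path, \; ends the command.
recent_count=$(find "$workdir" -type f -mtime -1 -exec echo {} \; | wc -l)

printf '%s\n---\n%s\n---\n%s recent files\n' "$txt_files" "$txt_or_bak" "$recent_count"
```

Sub-directories are scanned without any extra option, which is why draft.bak two levels down is found.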
paste tastes good
paste [options] [filelist]
corresponding records from each file are merged into 1
output record
-s process filelist serially. All
records of one file are joined into one line before going to
the next file
-d [delimiter list] each character in turn
delimits the merged records (default is tab).
paste continued
[/@linux2 uid]$ cat file1
A
B
C
[/@linux2 uid]$ cat file2
1
2
3
[/@linux2 uid]$ cat file3
x
y
z
paste continued
[/@linux2 uid]$ paste file1 file2 file3
A	1	x
B	2	y
C	3	z
[/@linux2 uid]$ paste -s file1 file2 file3
A	B	C
1	2	3
x	y	z
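A runnable sketch of the same three files, assuming the file1, file2 and file3 contents shown earlier; it also tries a -d delimiter list, which the example slides do not show.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' A B C > file1
printf '%s\n' 1 2 3 > file2
printf '%s\n' x y z > file3

# Default: line N of every file is merged into output line N, tab separated.
parallel=$(paste file1 file2 file3)

# -s: each file is flattened into one output line, one file after another.
serial=$(paste -s file1 file2 file3)

# -d ':,' uses ':' between file1 and file2, then ',' between file2 and file3.
delims=$(paste -d ':,' file1 file2 file3)

printf '%s\n---\n%s\n---\n%s\n' "$parallel" "$serial" "$delims"
```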
pr – public relations--NOT
pr paginates file(s) for printing
Can specify page attributes
Page length in lines is changed with the -l option
With multiple files, each file starts a new
page
pr – continued
pr paginates a file for printing
Creates a header and trailer on each page
Header text is changed with the -h option
Header and trailer are suppressed with the -t option
Can create columns of data
-nbr Number of columns per page (nbr is an integer, e.g. -3)
-Sx Separator x placed between
columns
pr – continued
Can number each line
-nck
c – character separating the number from the data
(default is the tab character)
k – number of digits in the line number
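The pr options from the last three slides can be sketched as follows. The sample data and the header text are made up, and the column layout described in the comments assumes GNU pr behaviour.

```shell
# -t suppresses the header and trailer; -n numbers each line.
numbered=$(printf '%s\n' alpha beta | pr -t -n)

# -n:2 uses ':' as the separator and 2 digits for the line number.
custom=$(printf '%s\n' alpha beta | pr -t -n:2)

# -3 arranges the input in three columns, filled top to bottom.
columns=$(printf '%s\n' 1 2 3 4 5 6 | pr -t -3)

# -l 20 sets a 20-line page; -h replaces the file name in the header.
header_hits=$(printf '%s\n' alpha beta | pr -l 20 -h 'Inventory' | grep -c 'Inventory')

printf '%s\n---\n%s\n---\n%s\n---\nheader lines: %s\n' \
    "$numbered" "$custom" "$columns" "$header_hits"
```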
Regular Expressions
A set of characters that define the
criteria used to identify a string
within a record.
Used by vi, grep, sed, awk, and
others.
tr – Translate this
tr [-c] [-d] [-s] [-t] set1 [set2]
Translates each character in set1 to the corresponding character in set2
-c – complement of set1
-d – delete characters found in set1
-s – squeeze runs of a repeated character into one
-t – truncate set1 to length of set2
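A short sketch of the translate, delete, squeeze and complement options; the sample strings are invented for illustration. The -t option is left out because it is not available in every tr implementation.

```shell
# Translate: each lowercase letter becomes the matching uppercase letter.
upper=$(printf 'hello world\n' | tr 'a-z' 'A-Z')

# -d deletes every character that appears in set1.
no_vowels=$(printf 'hello world\n' | tr -d 'aeiou')

# -s squeezes each run of a repeated character from set1 into one.
squeezed=$(printf 'too    many   spaces\n' | tr -s ' ')

# -c complements set1: every non-letter becomes a newline,
# and -s squeezes the resulting runs, giving one word per line.
words=$(printf 'one, two; three\n' | tr -cs 'A-Za-z' '\n')

printf '%s\n%s\n%s\n%s\n' "$upper" "$no_vowels" "$squeezed" "$words"
```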
Regular Expressions
Simple strings
Bound by / … /
Interpreted literally
e.g. /e D/ – matches exactly e D
Taste Dee – OK
Taste don’t – not OK
Regular Expressions
The . (period) special character
Matches any single character
e.g. /.eny/ matches Aeny Beny Ceny
The [ char-range ] defines a character
class
The [^ char-range ] defines the not-in character class
Regular Expressions
The * (asterisk)
Matches 0 or more of the preceding character.
What’s this?
/.*/
/[a-zA-Z]*/
/([^)]*)/
Regular Expressions
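The ^ (caret, "for the rabbit") character
In the beginning … /^abc/ matches abc only at the start of a line
The $ (dollar, "for the teacher") character
At the end … /abc$/ matches abc only at the end of a line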
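The regular expression pieces covered so far (literal strings, the period, character classes, the asterisk and the two anchors) can be tried with grep. This is a sketch; the sample lines are partly taken from the slides and partly invented, and grep -o (print only the matched part) is assumed to be available.

```shell
# Sample records, one per line.
text=$(printf '%s\n' 'Taste Dee' "Taste don't" 'Beny' '(a note) here' 'a1b')

# Simple string: matched literally, including the space and the capital D.
literal=$(printf '%s\n' "$text" | grep 'e D')

# Period: any single character followed by "eny".
dot=$(printf '%s\n' "$text" | grep '.eny')

# Class plus asterisk, anchored at both ends: lines made only of letters.
letters_only=$(printf '%s\n' "$text" | grep '^[a-zA-Z]*$')

# "(" then zero or more characters that are not ")" then ")".
paren=$(printf '%s\n' "$text" | grep -o '([^)]*)')

# Anchors: count lines that start with Taste, and lines that end with here.
starts=$(printf '%s\n' "$text" | grep -c '^Taste')
ends=$(printf '%s\n' "$text" | grep -c 'here$')

printf '%s\n%s\n%s\n%s\n%s %s\n' "$literal" "$dot" "$letters_only" "$paren" "$starts" "$ends"
```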
Regular Expressions
Quote the raven – backslash
\.
This yields •
\\
This yields \
\*
This yields *
\[
This yields [
\]
This yields ]
\ /
This yields /
sed – the old Stream EDitor
sed [-n] [-f script-file | script] [file-list]
Copies input to standard output, applying the edits
Edits file(s) in a non-interactive mode
Gets its instructions from a script
-f filename – file that contains the sed instructions
Without -f, the 1st command argument is used as the script
-n suppresses output unless an instruction (e.g. p) requests it
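The three ways of invoking sed described above can be compared side by side. A minimal sketch; the file names input.txt and edit.sed are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'line one' 'line two' > input.txt

# Script file: one instruction per line.
printf '%s\n' 's/line/LINE/' '2d' > edit.sed

# No -f: the first argument is the script.
from_cmdline=$(sed 's/line/LINE/' input.txt)

# -f: instructions are read from the script file.
from_script=$(sed -f edit.sed input.txt)

# -n: nothing is written unless an instruction asks for it (here the p flag).
quiet=$(sed -n 's/one/1/p' input.txt)

printf '%s\n---\n%s\n---\n%s\n' "$from_cmdline" "$from_script" "$quiet"
```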
sed – the old mill stream
 Record processing
1. Read a record from the file list
2. Read an instruction from the script (or cmd line)
3. Apply selection criteria
4. If selected, perform the instruction
and repeat 2 through 4 until no more script
5. Repeat 1 through 5 until no more input in the file list.
He sed what!!??
Instruction format
[addr1 [,addr2]] inst [arg-list]
Address
A line number
Regular expression
Addr1 – start
Addr2 – stop
Address line numbers
$ Designates the last line of the last file
1st address line number
Starts selecting records based on their
position in the input file list relative to 1.
2nd address line number
Stops selecting records when the position in
the input file list is greater than the line number.
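Line-number and regular-expression addresses can be sketched with the p instruction and -n, so that only the selected records appear. The five input lines are invented for the example.

```shell
input=$(printf '%s\n' one two three four five)

# Two line numbers: start at record 1, stop after record 2.
first_two=$(printf '%s\n' "$input" | sed -n '1,2p')

# $ designates the last line of the input.
last=$(printf '%s\n' "$input" | sed -n '$p')

# Regular expressions as addresses: from the first match of /two/
# through the next match of /four/.
range=$(printf '%s\n' "$input" | sed -n '/two/,/four/p')

printf '%s\n---\n%s\n---\n%s\n' "$first_two" "$last" "$range"
```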
He sed some more
Instructions
! – Not: negates the address selection (written after the address)
 sed -n '/line/!p' file.list
{…} – Groups the instructions for the
address selection
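A sketch of negation and grouping, using three made-up input lines. The grouped instructions are written one per line, which is the most portable form.

```shell
input=$(printf '%s\n' 'line 1' 'other' 'line 2')

# ! after the address: print every record that does NOT match /line/.
not_matching=$(printf '%s\n' "$input" | sed -n '/line/!p')

# { } groups two instructions under one address:
# substitute, then print, only for records that match /line/.
grouped=$(printf '%s\n' "$input" | sed -n '/line/{
s/line/LINE/
p
}')

printf '%s\n---\n%s\n' "$not_matching" "$grouped"
```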
sed Instructions
p – Print now and continue
d – Delete and get the next record
q – Quit processing; Stop; Go Away
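The p, d and q instructions behave as follows on a three-line input (sample data invented for the sketch). Note that without -n every record is also written once by default.

```shell
input=$(printf '%s\n' one two three)

# p without -n: record 2 is printed by p and again by the default output.
doubled=$(printf '%s\n' "$input" | sed '2p')

# d: record 2 is deleted and sed moves on to the next record.
deleted=$(printf '%s\n' "$input" | sed '2d')

# q: record 2 is written, then processing stops.
quit_early=$(printf '%s\n' "$input" | sed '2q')

printf '%s\n---\n%s\n---\n%s\n' "$doubled" "$deleted" "$quit_early"
```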
sed Instructions
c – Change
[addr1 [,addr2]] c\ yada yada yada
all selected records are replaced as a
group by the change value
a – Append
[addr1] a\ …
adds the text after each selected
record
sed Instructions
i – Insert
[addr1] i\ …
adds the text before each
selected record
n – Next
[addr1] n
writes the current record, reads the next, and
continues the script
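A sketch of the change, append, insert and next instructions. The input lines and the inserted text are invented; the text for c, a and i is written on the line after the backslash, which is the traditional form.

```shell
input=$(printf '%s\n' one two three)

# i\ writes the text before the selected record.
inserted=$(printf '%s\n' "$input" | sed '2i\
before two')

# a\ writes the text after the selected record.
appended=$(printf '%s\n' "$input" | sed '2a\
after two')

# c\ with a range replaces the selected records as a group.
changed=$(printf '%s\n' "$input" | sed '2,3c\
replaced')

# n reads the next record; with -n and p this prints every second record.
every_other=$(printf '%s\n' one two three four | sed -n 'n;p')

printf '%s\n---\n%s\n---\n%s\n---\n%s\n' "$inserted" "$appended" "$changed" "$every_other"
```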
sed Instructions
w – Write
[addr1 [,addr2]] w filename
writes the selected records to a file
r – Read
[addr1] r filename
reads records from the filename and
appends them to the selected record
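The w and r instructions can be sketched with a few small files. The file names data.txt, footer.txt and picked.txt are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' alpha beta gamma > data.txt
printf '%s\n' '-- from footer.txt --' > footer.txt

# w: records that start with a or b are written to picked.txt.
sed -n '/^[ab]/w picked.txt' data.txt
picked=$(cat picked.txt)

# r: the contents of footer.txt are appended after record 2.
with_footer=$(sed '2r footer.txt' data.txt)

printf '%s\n---\n%s\n' "$picked" "$with_footer"
```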
sed Instructions
s – Substitute
[addr1 [,addr2]] s/ptrn/repl/[g][p][w file]
for each selected record, match the
pattern and replace the first occurrence
g – Replace all non-overlapping
occurrences
 p – Print the record if a substitution was made
w – Write the record to the file if a substitution was made
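A short sketch of the substitute instruction and its g and p flags; the sample strings are made up.

```shell
# No flag: only the first match in each record is replaced.
first=$(printf 'aaa bbb aaa\n' | sed 's/aaa/X/')

# g: every non-overlapping match is replaced.
all=$(printf 'aaa bbb aaa\n' | sed 's/aaa/X/g')

# p with -n: only records where a substitution happened are printed.
printed=$(printf '%s\n' 'keep me' 'change me' | sed -n 's/change/CHANGE/p')

printf '%s\n%s\n%s\n' "$first" "$all" "$printed"
```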
Hawk – Squawk – awk
The programmable utility that does everything.
Aho – Weinberger – Kernighan
Provides:
Conditional execution
Looping
Handles:
Numeric & string variables
Regular expressions
C-style printf formatting
awk
awk [-Fc] [-f program-file | 'program'] [file-list]
-F – field delimiter character c
-f – name of the awk program file
'program'
in-line instructions, used when -f is not given
file-list – list of files to process
awk – program lines
[ pattern ] [ { action } ]
Like sed, the pattern selects records
Record processing is the same as sed
awk – pattern
Patterns follow regular expression format.
~ Tests for match to regular expression
!~ Tests for NO match to regular expression
, – Establishes a pattern range; all records from the first
pattern through the second are processed (inclusive)
BEGIN
executes before the first record is processed
END
executes after the last record is processed
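The pattern forms above can be sketched on a small two-field data set. The names and ages are invented for the example.

```shell
data=$(printf '%s\n' 'alice 30' 'bob 25' 'carol 41' 'dave 19')

# BEGIN runs before the first record, END after the last; the middle
# pattern selects records whose second field is at least 30.
report=$(printf '%s\n' "$data" |
    awk 'BEGIN {print "name age"} $2 >= 30 {print $1, $2} END {print NR, "records"}')

# ~ tests a field against a regular expression.
starts_ab=$(printf '%s\n' "$data" | awk '$1 ~ /^[ab]/ {print $1}')

# !~ selects records where the field does NOT match; no action means print.
no_a=$(printf '%s\n' "$data" | awk '$1 !~ /a/')

# pattern1,pattern2 selects an inclusive range of records.
ranged=$(printf '%s\n' "$data" | awk '/bob/,/carol/')

printf '%s\n---\n%s\n---\n%s\n---\n%s\n' "$report" "$starts_ab" "$no_a" "$ranged"
```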
awk – relational operators
< – less than
<= – less than or equal to
== – equal to
!= – not equal to
>= – greater than or equal to
> – greater than
awk – operators
Arithmetic
+ – addition
- – subtraction
* – multiplication
/ – division
Assignment
= – assigns the value on the right to the variable on the left
+= – adds the value on the right to the variable on the left
awk – boolean operators
&&
– and
||
– or
!
– not
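Relational, boolean, arithmetic and assignment operators usually appear together. A minimal sketch with invented data: it averages the second field of the records whose value lies strictly between 20 and 40.

```shell
data=$(printf '%s\n' 'alice 30' 'bob 25' 'carol 41' 'dave 19')

# && combines two relational tests; += accumulates; / divides in END.
average=$(printf '%s\n' "$data" |
    awk '$2 > 20 && $2 < 40 {total += $2; n = n + 1} END {print total / n}')

# || selects a record when either test is true; ! negates a test.
extremes=$(printf '%s\n' "$data" | awk '$2 < 20 || !($2 <= 40) {print $1}')

printf '%s\n%s\n' "$average" "$extremes"
```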
awk – actions
# - Comment to the right on any line
Default action is print to stdout
Multiple actions can be taken
Use {…} to enclose multiple actions
Separate actions with ;
awk – actions
print variable …
Var , Var2 , Var3
Prints variables separated by the output field separator (OFS)
Var Var2 Var3
NO separators (the values are concatenated)
"literal value"
Prints exactly everything between the " "
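The difference between comma-separated and space-separated print arguments is easy to see on one record. A sketch with a made-up input line:

```shell
# Commas: values are joined with the output field separator (a space by default).
with_commas=$(printf 'a b c\n' | awk '{print $1, $2}')

# No commas: values are concatenated with nothing between them.
concatenated=$(printf 'a b c\n' | awk '{print $1 $2}')

# A literal in double quotes is printed exactly; OFS changes the separator.
with_ofs=$(printf 'a b c\n' | awk 'BEGIN {OFS = "-"} {print $1, $2, "end"}')

printf '%s\n%s\n%s\n' "$with_commas" "$concatenated" "$with_ofs"
```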
awk – actions
printf "control string", variable …
Control String
\n – new line
\t – tab
%[-][n][.d] conv char
- left justification
 n minimum field width in characters
.d decimal positions
awk – actions
conv char – conversion character
d - decimal, e - exponent, f - floating-point
o - octal, x - hexadecimal
s - string
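A sketch of the control string with each conversion character; the values are arbitrary. The vertical bars are only there to make the field widths visible.

```shell
# %-6s  string, left justified in 6 characters
# %5d   decimal, right justified in 5 characters
# %8.2f floating-point, 8 characters wide, 2 decimal positions
aligned=$(awk 'BEGIN { printf "%-6s|%5d|%8.2f\n", "abc", 42, 3.14159 }')

# %x hexadecimal, %o octal, %e exponent notation
bases=$(awk 'BEGIN { printf "%x %o %e\n", 255, 8, 1234.5 }')

printf '%s\n%s\n' "$aligned" "$bases"
```

Unlike print, printf adds no newline of its own, so the control string ends with \n.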
awk – variables
awk provided variables
NF – total number of fields
$1…$n – each field in the current record
FS – input field separator
(default space or tab )
OFS – output field separator
(default space )
awk – variables
awk provided variables
NR – current record number
$0 – entire current record
RS – record separator
(default newline )
ORS – output record separator
(default newline )
FILENAME – name of current input file
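The built-in variables can be inspected directly. A minimal sketch using two invented colon-delimited records; RS and FILENAME are not exercised here.

```shell
data=$(printf '%s\n' 'root:x:0' 'uid:x:500')

# NR is the record number, NF the field count, $NF the last field.
counts=$(printf '%s\n' "$data" | awk -F: '{print NR, NF, $1, $NF}')

# FS and OFS set in BEGIN: split on ":" and join the output with a tab.
retabbed=$(printf '%s\n' "$data" | awk 'BEGIN {FS = ":"; OFS = "\t"} {print $1, $3}')

# ORS replaces the newline that print normally writes after each record.
joined=$(printf '%s\n' "$data" | awk -F: 'BEGIN {ORS = ","} {print $1}')

printf '%s\n---\n%s\n---\n%s\n' "$counts" "$retabbed" "$joined"
```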
awk - variables
Associative Arrays
array_name [ string ]
The array name should be meaningful
The index of the array is a string
Elements are automatically created
for ( element in array ) actions
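A sketch of an associative array used as a running total per key; the fruit data and the array name count are invented. The order of a for ( element in array ) loop is not defined, so the output is piped through sort.

```shell
data=$(printf '%s\n' 'apple 3' 'pear 2' 'apple 4')

# count[$1] is created the first time a key is seen, then accumulated.
totals=$(printf '%s\n' "$data" |
    awk '{count[$1] += $2} END {for (key in count) print key, count[key]}' |
    sort)

printf '%s\n' "$totals"
```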
awk - functions
length(string) – returns the number of
characters in string
int(num) – returns the integer portion
index(str1,str2) – returns the index of
str2 found in str1 or 0 if not present
split(str,arr,del) – populates arr[ ] from
fields in str delimited by del – returns
count of elements.
awk - functions
sprintf(fmt , args) – formats args using
the fmt and returns the formatted string.
substr(str , pos , len) – returns a
substring of str starting with position
pos for a length of len.
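The string and numeric functions from the last two slides can be exercised in a BEGIN block, which needs no input file. A sketch with an arbitrary sample string:

```shell
results=$(awk 'BEGIN {
    s = "file manipulation"
    print length(s)                  # number of characters in s
    print int(7.9)                   # integer portion
    print index(s, "man")            # position of "man" in s, 0 if absent
    n = split("a:b:c", parts, ":")   # fills parts[1..n], returns n
    print n, parts[2]
    print substr(s, 6, 3)            # 3 characters starting at position 6
    print sprintf("%03d", 7)         # formatted string, returned not printed
}')

printf '%s\n' "$results"
```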