rha030-workbook08-student-3.0-0

Workbook 8. String Processing Tools

Red Hat, Inc.

Workbook 8. String Processing Tools
by Red Hat, Inc.
Copyright © 2003-2005 Red Hat, Inc.

Revision History
Revision rha030-2.0-2003_11_12-en 2003-11-12 First Revision
Revision rha030-3.0-0-en-2005-08-17T07:23:17-0400 2005-08-17 First Revision

Red Hat, Red Hat Network, the Red Hat "Shadow Man" logo, RPM, the RPM logo, PowerTools, and all Red Hat-based trademarks and logos are trademarks or registered trademarks of Red Hat, Inc. in the United States and other countries. Linux is a registered trademark of Linus Torvalds. Motif and UNIX are registered trademarks of The Open Group. Windows is a registered trademark of Microsoft Corporation. Intel and Pentium are registered trademarks of Intel Corporation. Itanium and Celeron are trademarks of Intel Corporation. SSH and Secure Shell are trademarks of SSH Communications Security, Inc. All other trademarks and copyrights referred to are the property of their respective owners.

Published 2005-08-17

Table of Contents

1. Text Encoding and Word Counting
    Discussion
        Files
        Text Encoding
        Internationalization (i18n)
        Revisiting cat, head, and tail
        The wc (Word Count) Command
    Examples
        Example 1. Counting Characters
        Example 2. Invisible Characters Are Important, Too
        Example 3. What’s My Line?
        Example 4. I Want It All
        Example 5. Linux, Dos, and Macintosh Files
        Example 6. Counting Users
        Example 7. Counting Processes
    Online Exercises
        Specification
        Deliverables
        Hints
    Questions

2. Finding Text: grep
    Discussion
        Searching Text File Contents using grep
        Show All Occurrences of a String in a File
        Searching in Several Files at Once
        Searching Directories Recursively
        Inverting grep
        Getting Line Numbers
        Limiting Matching to Whole Words
        Ignoring Case
    Examples
        Example 1. Finding Simple Character Strings
        Example 2. In That Case
        Example 3. Matching Whole Words
        Example 4. Combining grep and xargs
    Online Exercises
        Specification
        Deliverables
    Questions

3. Introduction to Regular Expressions
    Discussion
        Introducing Regular Expressions
        Regular Expressions, Extended Regular Expressions, and the grep Command
        Anatomy of a Regular Expression
        Taking Literals Literally
        Wildcards
        Common Modifier Characters
        Anchored Searches
        Coming to Terms with Regex Grouping
        Escaping Meta-Characters
        Summary of Linux Regular Expression Syntax
        Regular Expressions are NOT File Globbing
        Where to Find More Information About Regular Expressions
    Examples
        Example 1. Literal Searches
        Example 2. Range Expressions
        Example 3. REGEX Modifiers
        Example 4. Anchored Searches
        Example 5. REGEX Term Grouping
        Example 6. Is elvis in the House?
        Example 7. Searching for Telephone Numbers
    Questions

4. Everything Sorting: sort and uniq
    Discussion
        The sort Command
        The uniq Command
    Examples
        Example 1. Sorting the Output of ps aux
        Example 2. Using sort and uniq to Collect Information on Running Processes
    Questions

5. Extracting and Assembling Text: cut and paste
    Discussion
        The cut Command
        The paste Command
    Examples
        Example 1. Handling Free-Format Records
        Example 2. Living With Fixed-Format Records
        Example 3. Using (and Misusing) a Space as a Delimiter
        Example 4. Examples of Pasting
    Questions

rha030-3.0-0-en-2005-08-17T07:23:17-0400
Copyright (c) 2003-2005 Red Hat, Inc. All rights reserved. For use only by a student enrolled in a Red Hat Academy course taught at a Red Hat Academy. Any other use is a violation of U.S. and international copyrights. No part of this publication may be photocopied, duplicated, stored in a retrieval system, or otherwise duplicated whether in electronic or print format without prior written consent of Red Hat, Inc. If you believe Red Hat course materials are being used, copied, or otherwise improperly distributed please email [email protected] or phone toll-free (USA) +1 866 626 2994 or +1 (919) 754 3700.


6. Tracking differences: diff
    Discussion
        The diff Command
        Output Formats for the diff Command
        How diff Interprets Arguments
        Customizing diff to be Less Picky
        Recursive diff’s
    Examples
        Example 1. Using diff to Examine New Configuration Files
        Example 2. Using diff to Examine Recent Changes to /etc/passwd
        Example 3. Creating a Patch
    Online Exercises
        Specification
        Deliverables
    Questions

7. Translating Text: tr
    Discussion
        The tr Command
        Character Specification
        Using tr to Translate Characters
        Using tr to Delete Characters
        Using tr to Squeeze Characters
        Complementing Sets
        One Final Caution: Avoid File Globbing!
    Examples
        Example 1. Using tr to Clean Up the df Command
        Example 2. Using tr to Convert Dos Text Files to Unix
        Example 3. Using tr to Count Word Frequencies 2
        Example 4. Rot13
    Questions

8. Spell Checking: aspell
    Discussion
        Using aspell
        Performing an Interactive Spell Check
        Performing a Non-interactive Spell Check
        Managing the Personal Dictionary
        Getting Help
    Examples
        Example 1. Adding Service Names to aspell’s Personal Dictionary
    Online Exercises
        Setup
        Specification
        Deliverables
    Questions


9. Formatting Text (fmt) and Splitting Files (split)
    Discussion
        The fmt Command
        The split Command
    Examples
        Example 1. Using fmt to Clean Email
        Example 2. Using "String Processing" Tools to Manipulate Binary Data
    Questions


Chapter 1. Text Encoding and Word Counting

Key Concepts

• When storing text, computers transform characters into a numeric representation. This process is referred to as encoding the text.

• In order to accommodate the demands of a variety of languages, several different encoding techniques have been developed. These techniques are represented by a variety of character sets.

• The oldest and most prevalent encoding technique is known as the ASCII character set, which still serves as a least common denominator among other techniques.

• The wc command counts the number of characters, words, and lines in a file. When applied to structured data, the wc command can become a versatile counting tool.

• The cat command has options that allow representation of nonprinting characters such as NEWLINE.

• The head and tail commands have options that allow you to print only a certain number of lines or a certain number of bytes (one byte usually correlates to one character) from a file.

Discussion

In this Workbook, we begin looking at various tools for searching, sorting, extracting, and manipulating text. Because Linux, and Unix before it, has a strong tradition of storing data in human-readable text formats, these tools should be thought of as aids not only to composition but to data manipulation in general.

Files

What are Files?

Linux, like most operating systems, stores information that needs to be preserved outside of the context of any individual process in files. (In this context, and for most of this Workbook, the term file is meant in the sense of regular file.) Linux (and Unix) files store information using a simple model: information is stored as a single, ordered array of bytes, starting at the first and ending at the last. The number of bytes in the array is the length of the file. 1
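The "ordered array of bytes" model can be observed directly from a program. The following is a minimal Python sketch, not part of the original workbook; the file name and its contents are made up for illustration, and a temporary directory is used so nothing on your system is touched.

```python
import os
import tempfile

# Hypothetical file contents, chosen only for illustration.
contents = b"book report\n"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "bookreport.txt")

    # Store the bytes in a regular file.
    with open(path, "wb") as f:
        f.write(contents)

    # Read the file back as one ordered array of bytes.
    with open(path, "rb") as f:
        data = f.read()

    # The length of the file is the number of bytes in that array.
    file_length = os.path.getsize(path)

first_byte = data[0]   # integer value of the first byte
last_byte = data[-1]   # integer value of the last byte

print(file_length, first_byte, last_byte)
```

Indexing into `data` returns plain integers, which is a reminder that the file itself holds only numbers; any meaning (text, picture, program) comes from how those numbers are interpreted.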

What type of information is stored in files? Here are but a few examples.

• The characters that compose the book report you want to store until you can come back and finish it tomorrow are stored in a file called (say) ~/bookreport.txt.

• The individual colors that make up the picture you took with your digital camera are stored in the file (say) /mnt/camera/dcim/100nikon/dscn1203.jpg.

• The characters which define the usernames of users on a Linux system (and their home directories, etc.) are stored in the file /etc/passwd.

• The specific instructions which tell an x86 compatible CPU how to use the Linux kernel to list the files in a given directory are stored in the file /bin/ls.

What is a Byte?

At the lowest level, computers can only answer one type of question: is it on or off? What is it? When dealing with disks, it is a magnetic domain which is oriented up or down. When dealing with memory chips, it is a transistor which either has current or doesn’t. Both of these are too difficult to mentally picture, so we will speak in terms of light switches that can either be on or off. To your computer, the contents of your file are reduced to what can be thought of as an array of (perhaps millions of) light switches. Each light switch can be used to store one bit of information (is it on, or is it off?).

Using a single light switch, you cannot store much information. To be more useful, an early convention was established: group the light switches into bunches of 8. Each series of 8 light switches (or magnetic domains, or transistors, ...) is a byte. More formally, a byte consists of 8 bits. Each permutation of ons and offs for a group of 8 switches can be assigned a number. All switches off, we’ll assign 0. Only the first switch on, we’ll assign 1; only the second switch on, 2; the first and second switch on, 3; and so on. How many numbers will it take to label each possible permutation for 8 light switches? A mathematician will quickly tell you the answer is 2^8, or 256. After grouping the light switches into groups of eight, your computer views the contents of your file as an array of bytes, each with a value ranging from 0 to 255.
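The labelling scheme for the eight switches can be written out as a short Python sketch. The function name `byte_value` is an invention for this illustration; it simply treats the first switch as worth 1, the second as worth 2, the third as worth 4, and so on.

```python
def byte_value(switches):
    """Return the number labelling one pattern of 8 on/off switches.

    switches[0] is the "first" switch from the text (worth 1),
    switches[1] is the second (worth 2), and so on up to 128.
    """
    if len(switches) != 8:
        raise ValueError("a byte has exactly 8 bits")
    return sum(1 << position for position, on in enumerate(switches) if on)


all_off = byte_value([False] * 8)
first_on = byte_value([True] + [False] * 7)
second_on = byte_value([False, True] + [False] * 6)
first_and_second_on = byte_value([True, True] + [False] * 6)
all_on = byte_value([True] * 8)

# Number of distinct on/off patterns for 8 switches.
patterns = 2 ** 8

print(all_off, first_on, second_on, first_and_second_on, all_on, patterns)
```

The largest label is one less than the number of patterns because counting starts at 0.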

Data Encoding

In order to store information as a series of bytes, the information must be somehow converted into a series of values ranging from 0 to 255. Converting information into such a format is called data encoding. What’s the best way to do it? There is no single best way that works for all situations. Developing the right technique to encode data, which balances the goals of simplicity, efficiency (in terms of CPU performance and on disk storage), resilience to corruption, etc., is much of the art of computer science.

As one example, consider the picture taken by a digital camera mentioned above. One encoding technique would divide the picture into pixels (dots), and for each pixel, record three bytes of information: the pixel’s "redness", "greenness", and "blueness", each on a scale of 0 to 255. The first three bytes of the file would record the information for the first pixel, the second three bytes the second pixel, and so on. A picture format known as "PNM" does just this (plus some header information, such as how many pixels are in a row). Many other encoding techniques for images exist, some just as simple, many much more complex.
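The three-bytes-per-pixel idea can be sketched in a few lines of Python. This is only an illustration of the technique described above: the function names and the four sample pixels are made up, and the header that a real picture format would add is deliberately omitted.

```python
def encode_pixels(pixels):
    """Encode (red, green, blue) tuples as three bytes per pixel."""
    encoded = bytearray()
    for red, green, blue in pixels:
        # bytes() rejects anything outside the range 0..255.
        encoded += bytes([red, green, blue])
    return bytes(encoded)


def decode_pixels(data):
    """Recover the (red, green, blue) tuples from the raw bytes."""
    return [tuple(data[i:i + 3]) for i in range(0, len(data), 3)]


# Hypothetical picture: one red, one green, one blue, one white pixel.
picture = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 255)]
raw = encode_pixels(picture)

print(len(raw), list(raw[:3]))
```

Four pixels occupy twelve bytes, and decoding works only because reader and writer agree on the convention; the bytes themselves carry no hint that they represent colors.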

Text Encoding

Perhaps the most common type of data which computers are asked to store is text. As computers have developed, a variety of techniques for encoding text have been developed, from the simple in concept (which could encode only the Latin alphabet used in Western languages) to complicated but powerful techniques that attempt to encode all forms of human written communication, even attempting to include historical languages such as Egyptian hieroglyphics. The following sections discuss many of the encoding techniques commonly used in Red Hat Enterprise Linux.

ASCII

One of the oldest, and still most commonly used, techniques for encoding text is called ASCII encoding. ASCII encoding simply takes the 26 lowercase and 26 uppercase letters which compose the Latin alphabet, the 10 digits, and common English punctuation characters (those found on a keyboard), and maps each of them to an integer between 0 and 127, as outlined in the following table.

Table 1-1. ASCII Encoding of Printable Characters

Integer Range   Character
33-47           Punctuation: !"#$%&'()*+,-./
48-57           The digits 0 through 9
58-64           Punctuation: :;<=>?@
65-90           Capital letters A through Z
91-96           Punctuation: [\]^_`
97-122          Lowercase letters a through z
123-126         Punctuation: {|}~
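The ranges in Table 1-1 can be checked from Python, whose built-in `ord()` and `chr()` convert between a character and its integer. The sketch below is illustrative only; the helper name `describe` and the sample string are assumptions made for this example.

```python
def describe(code):
    """Name the Table 1-1 row that an integer falls into."""
    if 48 <= code <= 57:
        return "digit"
    if 65 <= code <= 90:
        return "capital letter"
    if 97 <= code <= 122:
        return "lowercase letter"
    if 33 <= code <= 126:
        return "punctuation"
    return "not a printable character in Table 1-1"


text = "Hi, 9!"
encoded = list(text.encode("ascii"))     # one integer per character
rows = [describe(code) for code in encoded]

for character, code, row in zip(text, encoded, rows):
    print(repr(character), code, row)
```

The space character (32) falls outside the table's ranges, which is why it is reported separately; it is covered with the other whitespace characters.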

What about the integers 0 - 32? These integers are mapped to special keys on early teletypes, many of which have to do with manipulating the spacing on the page being typed on. The following characters are commonly called "whitespace" characters.

Table 1-2. ASCII Encoding of Whitespace Characters

Integer   Character   Common Name       Common Representation
8         BS          Backspace         '\b'
9         HT          Tab               '\t'
10        LF          Line Feed         '\n'
12        FF          Form Feed         '\f'
13        CR          Carriage Return   '\r'
32        SPACE       Space Bar
127       DEL         Delete
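Python happens to use the same backslash representations as the last column of Table 1-2, so the mapping can be confirmed with a short sketch. The sample line is made up; the point is that characters you cannot see still occupy bytes.

```python
# The escapes from Table 1-2, plus a plain space.
escapes = ["\b", "\t", "\n", "\f", "\r", " "]
codes = [ord(character) for character in escapes]

# DEL has no backslash shorthand, so it is written by its hex value.
delete_code = ord("\x7f")

# A hypothetical line of text containing a tab and a trailing newline.
line = "one\ttwo\n"
encoded = line.encode("ascii")

print(codes, delete_code)
print(len(line), list(encoded))
```

The line shows only six visible letters, yet it encodes to eight bytes: the tab and the line feed each take one.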

Others of the first 32 integers are mapped to keys which did not directly influence the "printed page", but instead sent "out of band" control signals between two teletypes. Many of these control signals have special interpretations within Linux (and Unix).

Table 1-3. ASCII Encoding of Control Signals


Integer   Character   Common Name             Common Representation
4         EOT         End of Transmission
7         BEL         Audible Terminal Bell   '\a'
27        ESC         Escape

Generating Control Characters from the Keyboard: Control and whitespace characters can be generated from the terminal keyboard directly using the CTRL key. For example, an audible bell can be generated using CTRL-G, while a backspace can be sent using CTRL-H, and we have already mentioned that CTRL-D is used to generate an "End of File" (or "End of Transmission"). Can you determine how the whitespace and control characters are mapped to the various CTRL key combinations? For example, what CTRL key combination generates a tab? What does CTRL-J generate? As you explore various control sequences, remember that the reset command will restore your terminal to sane behavior, if necessary.

What about the values 128-255? ASCII encoding does not use them. The ASCII standard only defines the first 128 values of a byte, leaving the remaining 128 values to be defined by other schemes.

ISO 8859 and Other Character Sets

Other standard encoding schemes have been developed, which map various glyphs (such as the symbols for the Yen and Euro), diacritical marks found in many European languages, and non-Latin alphabets to the upper 128 values of a byte which the ASCII standard leaves undefined. These standard encoding schemes are referred to as character sets. The following table lists some character sets which are supported in Linux, including their informal name, formal name, and a brief description.

Table 1-4. Some ISO 8859 Character Sets supported in Linux

Informal Name   Formal Name   Description
Latin-1         ISO 8859-1    West European languages
Latin-2         ISO 8859-2    Central and East European languages
Arabic          ISO 8859-6    Latin/Arabic
Greek           ISO 8859-7    Latin/Greek
Latin-9         ISO 8859-15   West European languages

All of these character encoding schemes use a common technique. They preserve the first 128 values of a byte to encode traditional ASCII, and use the remaining 128 values to encode glyphs unique to the particular encoding. For example, ISO 8859-1 (Latin-1) uses the value 196 to encode a Latin capital A with an umlaut (Ä), while ISO 8859-7 (Greek) uses the value 196 to encode the Greek capital letter Delta (Δ), but both use the value 101 to encode a Latin lowercase e.
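The two byte values mentioned above can be decoded under both character sets to see the effect. This is a small Python sketch; `iso8859_1` and `iso8859_7` are the names Python's standard codecs use for Latin-1 and Greek.

```python
shared = bytes([101])      # a value in the ASCII half of the byte range
extended = bytes([196])    # a value in the upper half

latin1_101 = shared.decode("iso8859_1")
greek_101 = shared.decode("iso8859_7")

latin1_196 = extended.decode("iso8859_1")
greek_196 = extended.decode("iso8859_7")

print(latin1_101, greek_101)   # the same letter under both character sets
print(latin1_196, greek_196)   # two different glyphs for the same byte
```

The byte on disk is identical in both cases; only the character set chosen by the reader determines which glyph appears.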

Notice a couple of implications about ISO 8859 encoding.



1. Each of the alternate encodings maps a single glyph to a single byte, so that the number of letters encoded in a file equals the number of bytes which are required to encode them.

2. Choosing a particular character set extends the range of characters that can be encoded, but you cannot encode characters from different character sets simultaneously. For example, you could not encode both a Latin capital A with a grave accent and a Greek letter Delta in the same file.

Unicode (UCS)

In order to overcome the limitations of ASCII and ISO 8859 based encoding techniques, a Universal Character Set has been developed, commonly referred to as UCS, or Unicode. The Unicode standard acknowledges the fact that one byte of information, with its ability to encode 256 different values, is simply not enough to encode the variety of glyphs found in human communication. Instead, in its simplest form (UCS-4), the standard uses 4 bytes to encode each character. Think of 4 bytes as 32 light switches. If we were to again label each permutation of on and off for 32 switches with integers, the mathematician would tell you that you would need 4,294,967,296 (over 4 billion) integers. Thus, 4 bytes offer room for over 4 billion glyphs (nearly enough for every person on the earth to have their own unique glyph; the user prince would approve).

What are some of the features and drawbacks of Unicode encoding?

Scale

The Unicode standard will easily be able to encode the variety of glyphs used in human communication for a long time to come.

Simplicity

The Unicode standard does have the simplicity of a sledgehammer. The number of bytes required to encode a set of characters is simply the number of characters multiplied by 4.

Waste

While the Unicode standard is simple in concept, it is also very wasteful. The ability to encode 4 billion glyphs is nice, but in reality, much of the communication that occurs today uses less than a few hundred glyphs. Of the 32 bits (light switches) used to encode each character, the first 20 or so would always be "off".

ASCII Non-compatibility

For better or for worse, a huge amount of existing data is already ASCII encoded. In order to convert fully to Unicode, that data, and the programs that expect to read it, would have to be converted.
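The "multiply by 4" rule and the wasted leading bytes can both be seen by converting a little ASCII text with iconv. This is only a sketch: it assumes that your iconv offers the UTF-32BE encoding name (a four-byte, big-endian form with no byte order mark) as a stand-in for the fixed-width encoding discussed here.

```shell
# Five ASCII characters become 5 x 4 = 20 bytes.
word_bytes=$(printf 'hello' | iconv -f ASCII -t UTF-32BE | wc -c | tr -d ' ')
echo "5 characters occupy $word_bytes bytes"

# Dump the bytes of a single letter in hexadecimal.
# Three zero bytes precede the familiar ASCII value 41 (hex) for "A".
letter_hex=$(printf 'A' | iconv -f ASCII -t UTF-32BE | od -An -tx1 | tr -d ' \n')
echo "The letter A is stored as: $letter_hex"
```

The zero bytes also illustrate the compatibility problem: a program expecting plain ASCII would not recognize this output as the letter A.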

The Unicode standard is an effective standard in principle, but in many respects it is ahead of its time, and perhaps forever will be. In practice, other techniques have been developed which attempt to preserve the scale and versatility of Unicode, while minimizing waste and maintaining ASCII compatibility. What must be sacrificed? Simplicity.



Unicode Transformation Format (UTF-8)

UTF-8 encoding attempts to balance the flexibility of Unicode, and the practicality and pervasiveness of ASCII, with a significant sacrifice: variable length encoding. With variable length encoding, each character is no longer encoded using simply 1 byte, or simply 4 bytes. Instead, the traditional 128 ASCII characters are encoded using 1 byte (and, in fact, are identical to the existing ASCII standard). The next most commonly used 2000 or so characters are encoded using two bytes. The next 63000 or so characters are encoded using three bytes, and the more esoteric characters are encoded using four bytes (the original design allowed for sequences of up to six bytes). Details of the encoding technique can be found in the utf-8(7) man page. With full backwards compatibility to ASCII, and the same functional range of pure Unicode, what is there to lose? ISO 8859 (and similar) character set compatibility.

UTF-8 attempts to bridge the gap between ASCII, which can be viewed as the primitive days of text encoding, and Unicode, which can be viewed as the utopia to aspire toward. Unfortunately, the "intermediate" methods, the ISO 8859 and other alternate character sets, are as incompatible with UTF-8 as they are with each other.

Additionally, the simple relationship between the number of characters that are being stored and the amount of space (measured in bytes) it takes to store them is lost. How much space will it take to store 879 printed characters? If they are pure ASCII, the answer is 879. If they are Greek or Cyrillic, the answer is closer to twice that much.
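The lost relationship between characters and bytes is easy to measure with wc -c, which counts bytes. The sketch below assumes that it is saved and run as UTF-8 text, so that the Greek word is stored in UTF-8; the sample words are arbitrary.

```shell
# Five ASCII letters: one byte each.
ascii_bytes=$(printf '%s' 'delta' | wc -c | tr -d ' ')

# Five Greek letters: two bytes each when stored as UTF-8.
greek_bytes=$(printf '%s' 'δέλτα' | wc -c | tr -d ' ')

echo "delta: 5 characters, $ascii_bytes bytes"
echo "δέλτα: 5 characters, $greek_bytes bytes"
```

Both words contain five characters, but only the ASCII word has a byte count equal to its character count.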

Text Encoding and the Open Source Community

In the traditional development of operating systems, decisions such as what type of character encoding to use can be made centrally, with the possible disadvantage that the decision is wrong for some community of the operating system’s users. In contrast, in the open source development model, these types of decisions are generally made by individuals and small groups of contributors. The advantage of the open source model is a flexible system which can accommodate a wide variety of encoding formats. The disadvantage is that users must often be educated and made aware of the issues involved with character encoding, because some parts of the assembled system use one technique while other parts use another. The library of man pages is an excellent example.

When contributors to the open source community are faced with decisions involving potentially incompatible formats, they generally balance local needs with an appreciation for adhering to widely accepted standards where appropriate. The UTF-8 encoding format seems to be evolving as an accepted standard, and in recent releases has become the default for Red Hat Enterprise Linux.

The following paragraph, extracted from the utf-8(7) man page, says it well:

It can be hoped that in the foreseeable future, UTF-8 will replace ASCII and ISO 8859 at all levels as the common character encoding on POSIX systems, leading to a significantly richer environment for handling plain text.

Internationalization (i18n)

As this Workbook continues to discuss many tools and techniques for searching, sorting, and manipulating text, the topic of internationalization cannot be avoided. In the open source community, internationalization is often abbreviated as i18n, a shorthand for saying "i-n with 18 letters in between". Applications which have been internationalized take into account different languages. In the Linux (and Unix) community, most applications look for the LANG environment variable to determine which language to use.

At the simplest, this implies that programs will emit messages in the user’s native language.

[elvis@station elvis]$ echo $LANG
en_US.UTF-8
[elvis@station elvis]$ chmod 666 /etc/passwd
chmod: changing permissions of ‘/etc/passwd’: Operation not permitted
[elvis@station elvis]$ export LANG=de_DE.utf8
[elvis@station elvis]$ chmod 666 /etc/passwd
chmod: Beim Setzen der Zugriffsrechte für »/etc/passwd«: Die Operation ist nicht erlaubt

More subtly, the choice of a particular language has implications for sorting orders, numeric formats, text encoding, and other issues.
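Sorting order is the easiest of these effects to observe. The following sketch uses the C (POSIX) locale, which should be present on any system, and sets it through the LC_ALL variable, which takes precedence over LANG; the word list is made up for the example.

```shell
# In the C locale, sort compares raw byte values, so every uppercase
# ASCII letter sorts ahead of every lowercase letter.
c_sorted=$(printf 'banana\nApple\ncherry\nBlueberry\n' | LC_ALL=C sort)
echo "$c_sorted"
```

Running the same pipeline with a language locale such as en_US.UTF-8 (if it is installed on your system) will typically interleave upper and lower case instead, which is why scripts that depend on a particular order often set the locale explicitly.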

The LANG environment variable

The LANG environment variable is used to define a user’s language, and possibly the default encoding technique as well. The variable is expected to be set to a string using the following syntax:

LL_CC.enc

The variable's value consists of the following three components.

Table 1-5. Components of LANG environment variable

Component   Role
LL          Two letter ISO 639 Language Code
CC          (Optional) Two letter ISO 3166 Country Code
enc         (Optional) Character Encoding Code Set

The locale command can be used to examine your current configuration (as can echo $LANG), while locale -a will list all settings currently supported by your system. The extent of the support for any given language will vary.
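To make the LL_CC.enc syntax concrete, the following sketch splits a LANG-style value into its three components using only shell parameter expansion. The function name parse_lang is hypothetical (it is not a standard command), and the sketch ignores any further modifiers that a real locale name might carry.

```shell
# parse_lang is an illustrative helper, not a standard tool.
parse_lang() {
    value=$1
    # Language: everything before the first "_" or ".".
    language=${value%%[_.]*}
    country=
    encoding=
    # Country: the text between "_" and "." (if a "_" is present).
    case $value in
        *_*) country=${value#*_}; country=${country%%.*} ;;
    esac
    # Encoding: everything after the "." (if a "." is present).
    case $value in
        *.*) encoding=${value#*.} ;;
    esac
    printf 'language=%s country=%s encoding=%s\n' "$language" "$country" "$encoding"
}

parse_lang en_US.UTF-8
parse_lang de_DE.utf8
parse_lang fr
```

Calling parse_lang "$LANG" applies the same split to your current setting; the last call shows that both optional components may be absent.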

The following tables list some selected language codes, country codes, and code set specifications.

Table 1-6. Selected ISO 639 Language Codes

Code   Language
de     German
el     Greek
en     English
es     Spanish
fr     French
ja     Japanese
zh     Chinese

Table 1-7. Selected ISO 3166 Country Codes

Code   Country
CA     Canada
CN     China
DE     Germany
ES     Spain
FR     France
GB     Britain (UK)
GR     Greece
JP     Japan
NG     Nigeria
US     United States

Table 1-8. Selected Character Encoding Code Sets

Code        Character Set
utf8        UTF-8
iso88591    ISO 8859-1 (Latin 1)
iso885915   ISO 8859-15 (Latin 9)
iso88596    ISO 8859-6 (Arabic)
iso88592    ISO 8859-2 (Latin 2)

See the gettext info pages (info gettext, or pinfo gettext) for a complete listing.

Do I Really Have to Know All of This?

We have tried to introduce the major concepts and components which affect how text is encoded and stored within Linux. After reading about character sets and language codes, one might be led to wonder, do I really need to know about all of this? If you are using simple text, restricted to the Latin alphabet of 26 characters, the answer is no. If you are asking the question 10 years from now, the answer will hopefully be no. If you do not fit into one of these two categories, however, you should have at least an acquaintance with the concept of internationalization, character sets, and the role of the LANG environment variable.

Hopefully, as the open source community converges on a single encoding technique (currently UTF-8 seems the most likely), most of these issues will disappear. Until then, these are some key points to remember.

1. An ASCII file is already valid in any of the ISO 8859 character sets.

2. An ASCII file is already valid in UTF-8.

3. A file encoded in one of the ISO 8859 character sets is not valid in UTF-8 (unless it happens to be pure ASCII), and must be converted.

4. Using UTF-8, there is a one to one mapping between characters and bytes if and only if all of the characters are pure ASCII characters.

If you are interested in more information, several man pages provide a more detailed introduction to the concepts outlined above. Start with charsets(7), and then follow with ascii(7), iso_8859_1(7), unicode(7) and utf-8(7). Additionally, the iconv command can be used to convert text files from one form of encoding to another.
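As a brief illustration of the key points and of iconv, the sketch below builds a tiny ISO 8859-1 file and converts it to UTF-8. The file names and the temporary directory are arbitrary, and the sketch assumes that your iconv knows the ISO-8859-1 and UTF-8 encoding names.

```shell
workdir=$(mktemp -d)

# "café" in ISO 8859-1: the é is the single byte 233 (octal 351).
printf 'caf\351\n' > "$workdir/latin1.txt"

# Convert from ISO 8859-1 (-f) to UTF-8 (-t).
iconv -f ISO-8859-1 -t UTF-8 "$workdir/latin1.txt" > "$workdir/utf8.txt"

latin1_size=$(wc -c < "$workdir/latin1.txt" | tr -d ' ')
utf8_size=$(wc -c < "$workdir/utf8.txt" | tr -d ' ')
echo "ISO 8859-1: $latin1_size bytes, UTF-8: $utf8_size bytes"

# A pure ASCII file passes through the same conversion unchanged.
printf 'cafe\n' > "$workdir/ascii.txt"
ascii_unchanged=no
if iconv -f ISO-8859-1 -t UTF-8 "$workdir/ascii.txt" | cmp -s - "$workdir/ascii.txt"; then
    ascii_unchanged=yes
fi
echo "ASCII file unchanged by conversion: $ascii_unchanged"

rm -rf "$workdir"
```

The converted file grows by one byte because the single accented letter now needs two bytes, while the four ASCII letters and the line feed are untouched.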

Revisiting cat, head, and tail

Revisiting cat

We have been using the cat command to simply display the contents of files. Usually, the cat command generates a faithful copy of its input, without performing any edits or conversions. When called with one of the following command line switches, however, the cat command will indicate the presence of tabs, line feeds, and other control sequences, using the following conventions.

Table 1-9. Command Line Switches for the cat Command

Switch   Effect
-E       display line feeds (ASCII 10) as $
-T       display tabs (ASCII 9) as ^I
-v       display control characters (other than tabs and line feeds) as ^n, with n indicating the CTRL sequence for the nonprinting character.
-A       Show "all", same as -vET
-t       Show "all" except line feeds, same as -vT
-e       Show "all" except tabs, same as -vE

As an example, in the following, the cat command is used to display the contents of the /etc/hosts configuration file.

[student@station student]$ cat /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost station.example.com
127.0.0.1 rha-server
192.168.0.1 station1 station1.example.com www1 www1.example.com
192.168.0.51 station51 station51.example.com
192.168.129.201 z

Using the -A command line switch, the whitespace structure of the file becomes evident, as tabs are replaced with ^I, and line feeds are decorated with $.

[student@station student]$ cat -A /etc/hosts
# Do not remove the following line, or various programs$
# that require network functionality will fail.$
127.0.0.1^Ilocalhost.localdomain^Ilocalhost station.example.com $
127.0.0.1^Irha-server$
192.168.0.1^Istation1 station1.example.com www1 www1.example.com$
192.168.0.51^Istation51 station51.example.com$
192.168.129.201^Iz$
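If you would rather experiment on a file of your own than on /etc/hosts, the following sketch builds a one-line sample containing a tab and a carriage return and displays it with cat -A. The temporary file is arbitrary, and the sketch assumes the GNU version of cat, which provides the -A switch.

```shell
sample=$(mktemp)

# One tab between the two fields, and a carriage return before the line feed.
printf 'name\tvalue\r\n' > "$sample"

# -A marks the tab as ^I, the carriage return as ^M, and the line feed as $.
visible=$(cat -A "$sample")
echo "$visible"

rm -f "$sample"
```

The single output line should read name^Ivalue^M$, with each marker standing in for one otherwise invisible byte.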

Revisiting head and tail

The head and tail commands have been used to display the first or last few lines of a file, respectively. But what makes a line? Imagine yourself working at a typewriter: click! clack! click! clack! clack! ziiing! Instead of the ziiing! of the typewriter carriage at the end of each line, the line feed character (ASCII 10) is chosen to mark the end of lines.

Unfortunately, a common convention for how to mark the end of a line is not shared among the dominant operating systems in use today. Linux (and Unix) uses the line feed character (ASCII 10, often represented \n), while classic Macintosh operating systems use the carriage return character (ASCII 13, often represented \r or ^M), and Microsoft operating systems use a carriage return/line feed pair (ASCII 13, ASCII 10).

For example, the following file contains a list of four musicians.

[student@station student]$ cat -A musicians
elvis$
blondie$
prince$
madonna$

Had this file been created on a Microsoft or Macintosh operating system, and copied into Linux, the files would look like the following.

[student@station student]$ cat -A musicians.dos
elvis^M$
blondie^M$
prince^M$
madonna^M$
[student@station student]$ cat -A musicians.mac
elvis^Mblondie^Mprince^Mmadonna^M[student@station student]$

Linux (and Unix) text files generally adhere to a convention that the last character of the file must be a line feed for the last line of text. Following the cat of the file musicians.mac, which does not contain any conventional Linux line feed characters, the bash prompt is not displayed in its usual location.
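Files that arrive with foreign line endings can be brought into the Linux convention with the tr command. This is one possible approach rather than the only one; the sketch recreates the two musician files in a temporary directory (the converted file names are arbitrary) and converts each.

```shell
workdir=$(mktemp -d)
printf 'elvis\r\nblondie\r\nprince\r\nmadonna\r\n' > "$workdir/musicians.dos"
printf 'elvis\rblondie\rprince\rmadonna\r' > "$workdir/musicians.mac"

# DOS files: delete each carriage return, leaving the line feed behind.
tr -d '\r' < "$workdir/musicians.dos" > "$workdir/from_dos"

# Classic Macintosh files: translate each carriage return into a line feed.
tr '\r' '\n' < "$workdir/musicians.mac" > "$workdir/from_mac"

dos_lines=$(wc -l < "$workdir/from_dos" | tr -d ' ')
mac_lines=$(wc -l < "$workdir/from_mac" | tr -d ' ')
echo "Lines after conversion: dos=$dos_lines mac=$mac_lines"

# After conversion the two files should be byte-for-byte identical.
identical=no
cmp -s "$workdir/from_dos" "$workdir/from_mac" && identical=yes
echo "Converted files identical: $identical"

rm -rf "$workdir"
```

Note that the two conversions differ: deleting carriage returns from the Macintosh file would join every name onto one line, so it pays to check a file with cat -A before choosing.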

Table 1-10. Command Line Switches for the head Command

Switch     Effect
-N, -nN    Display the first N lines of the file.
-cN        Display the first N bytes of the file.







Table 1-11. Command Line Switches for the tail Command

Switch     Effect
-N, -nN    Display the last N lines of the file. If N is prepended by a +, display the remainder of the file, starting at the Nth line.
-cN        Display the last N bytes of the file.
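The switches in the two tables above can be tried on a small file. The following sketch uses a throwaway five-line file and the -n form of the line switches; it assumes versions of head and tail that accept -c, as the GNU versions do.

```shell
listing=$(mktemp)
printf 'one\ntwo\nthree\nfour\nfive\n' > "$listing"

first_two=$(head -n 2 "$listing")      # lines 1 and 2
last_two=$(tail -n 2 "$listing")       # lines 4 and 5
from_third=$(tail -n +3 "$listing")    # line 3 through the end of the file
first_bytes=$(head -c 7 "$listing")    # the 7 bytes "one", line feed, "two"
last_bytes=$(tail -c 5 "$listing")     # the 5 bytes "five" plus its line feed

echo "$from_third"
rm -f "$listing"
```

The byte-oriented switches count line feeds like any other byte, which is why seven bytes are needed to reach the end of the word "two".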

The wc (Word Count) Command

Counting Made Easy

Have you ever tried to answer a “25 words or less” quiz? Did you ever have to write a 1500-word essay?

With the wc command, you can easily verify that your contribution meets the criteria.

The wc command counts the number of characters, words, and lines. It will take its input either from files named on its command line or from its standard input. Below is the command line form for the wc program:

Figure 1-1. Using the wc command

Switch     Results
-c         Compute character count (measured in bytes).
-l         Compute line count.
-w         Compute word count.
filename   Filename to be counted. If no filename is specified, then the text will be read from the standard input. For clarity, the filename will be written at the end of each line of the counting report, even if only one filename is used.

When used without any command line switches, wc will report the number of lines, words, and characters, in that order. Command line switches can be combined to return any combination of character count, line count or word count.

How To Recognize A Real Character

Text files are composed using an alphabet of characters. Some characters are visible, such as numbers and letters. Some characters are used for horizontal distance, such as spaces and TAB characters. Some characters are used for vertical movement, such as carriage returns and line feeds.

A line in a text file is a series of characters other than the NEWLINE (line feed) character, followed by a NEWLINE character. Additional lines in the file immediately follow the first line.




While a computer represents characters as numbers, the exact value used for each symbol varies depending on which alphabet has been chosen. The most common alphabet for English speakers is ASCII, which character sets such as ISO 8859-1 (“Latin-1”) extend. Different human languages are represented by different computer encoding rules, so the exact numeric value for a given character depends on the human language being recorded.

So, What Is A Word?

A word is a group of printing characters, such as letters and digits, surrounded by white space, such as space characters or horizontal TAB characters.

Notice that our definition of a word does not include any notion of “meaning”. Only the form of the word is important, not its semantics. As far as Linux is concerned, a line such as:

Now is the time for all good men to foogle.

contains 10 perfectly good words: groups of printing characters surrounded by whitespace. Punctuation does not separate words; the final period simply belongs to the word “foogle.”.
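The definition can be checked directly with wc -w. The sketch below counts the words in the sentence above and in two additional made-up strings, to show that only whitespace separates one word from the next.

```shell
# The sample sentence: ten groups of printing characters.
sentence_words=$(echo 'Now is the time for all good men to foogle.' | wc -w | tr -d ' ')

# A comma with no surrounding whitespace does not split a word.
joined_words=$(echo 'good,men' | wc -w | tr -d ' ')

# Any run of spaces and tabs counts as a single separator.
spaced_words=$(printf 'good \t men\n' | wc -w | tr -d ' ')

echo "sentence=$sentence_words joined=$joined_words spaced=$spaced_words"
```

The trailing tr -d ' ' only strips the padding that some versions of wc place in front of the number.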

Examples

Example 1. Counting Characters

To count the characters in a file, just run wc -c:

[student@station student]$ echo hello | wc -c
6

In addition to the five letters in the word, the line also has a NL character at the end.

Example 2. Invisible Characters Are Important, Too

Characters that you cannot see still occupy space in a file.

[student@station student]$ echo Hello, World! > greetings
[student@station student]$ wc -c greetings
14 greetings

Keep in mind that spaces and TABs count as characters, too. Remember our typewriter analogy? Both the spacebar and the TAB key require keystrokes; each character in a text file corresponds to a press of a typewriter key.

Example 3. What’s My Line?

Run the command wc -l to count the lines in a file:

[student@station student]$ echo First line > foo
[student@station student]$ echo Second line >> foo
[student@station student]$ echo Third line >> foo
[student@station student]$ wc -l foo
3 foo

Example 4. I Want It All

Using wc without any switches counts everything: lines, words, and characters:

[student@station student]$ echo one > x
[student@station student]$ echo two words >> x
[student@station student]$ echo three more words >> x
[student@station student]$ wc x
3 6 31 x

Example 5. Linux, Dos, and Macintosh Files

How would the wc command handle the three musician files mentioned above (one composed on a Linux machine, one on a Microsoft machine, and one on a Macintosh)?

[student@station student]$ wc musicians*
4 4 29 musicians
4 4 33 musicians.dos
0 4 29 musicians.mac (1)
8 12 91 total

(1) For the file musicians.mac, which did not contain any conventional Linux line feed characters, the number of lines is reported as 0.

In the above output, why does the file musicians.dos have 33 characters, while musicians and musicians.mac have only 29?

Example 6. Counting Users

The wc command is often used to count the number of things, not just lines, words, and characters. For example, the users command generates a list of users who are currently logged onto the machine. The following line would create an alias called nusers, which would report the number of currently logged on users.

[student@station student]$ alias nusers='users | wc -w'
[student@station student]$ users
student student student student root
[student@station student]$ nusers
5



Example 7. Counting Processes

By examining the output of a command such as ps aux, which prints information about one process per line, the wc -l command can be used to count the number of processes currently running on a machine. Examining the output of the ps aux command, however, the initial line, which contains the column titles, must be removed from the count.

[student@station student]$ ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 1384 76 ? S Sep28 0:04 init [
root 2 0.0 0.0 0 0 ? SW Sep28 0:00 [keventd]
root 3 0.0 0.0 0 0 ? SW Sep28 0:00 [kapmd]
root 4 0.0 0.0 0 0 ? SWN Sep28 0:00 [ksoftirqd_CPU0]
...

The tail command, with its ability to print the remainder of a file starting from a specified line, can be used to remove the header line. (The -n +2 form is used here; the older tail +2 spelling is no longer accepted by current versions of tail.)

[student@station student]$ ps aux | tail -n +2
root 1 0.0 0.0 1384 76 ? S Sep28 0:04 init [
root 2 0.0 0.0 0 0 ? SW Sep28 0:00 [keventd]
root 3 0.0 0.0 0 0 ? SW Sep28 0:00 [kapmd]
root 4 0.0 0.0 0 0 ? SWN Sep28 0:00 [ksoftirqd_CPU0]
root 9 0.0 0.0 0 0 ? SW Sep28 0:00 [bdflush]
...

The following short script combines ps aux, tail -n +2, and wc, to create a new command called nprocs. When made executable, and placed in the ~/bin directory (which is part of the standard executable search PATH), the script becomes available from the command line.

[student@station student]$ cat nprocs
#!/bin/bash

ps aux | tail -n +2 | wc -l
[student@station student]$ mkdir bin
[student@station student]$ mv nprocs bin
[student@station student]$ chmod a+x bin/nprocs
[student@station student]$ nprocs
86
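The same "skip the header, then count" idea applies to any command that prints a title line followed by one record per line. The sketch below wraps it in a shell function; the name count_records is hypothetical, and the three-line sample stands in for real ps output so that the result is predictable.

```shell
# count_records is an illustrative helper: it reads standard input,
# discards the first (header) line, and counts the lines that remain.
count_records() {
    tail -n +2 | wc -l | tr -d ' '
}

# A stand-in for command output: one header line and two records.
sample_count=$(printf 'USER PID COMMAND\nroot 1 init\nroot 2 [keventd]\n' | count_records)
echo "Records found: $sample_count"
```

With the function defined, ps aux | count_records should report the same kind of number as the nprocs script, and the function can be reused for other commands that print a single header line.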

Online Exercises

Lab Exercise

Objective: Use the wc command as a counting tool.

Estimated Time: 10 mins.



Specification

1. Create the file ~/gplwords.txt, which contains the number of words (as reported by the wc command) in the file /usr/share/doc/redhat-release-4ES/GPL as its only word.

2. Create the file ~/localusers.txt, which contains the number of locally defined users as its only word.

3. Statically compiled libraries conventionally live in the /usr/lib directory, and have names that start with lib and end with a .a extension. Create the file ~/usrlibs.txt, which contains the number of files whose name follows this convention in the /usr/lib directory as its only word. (Do not include subdirectories.)

4. Create an executable script called ~/bin/nrecent. The script should expect a single argument, which is the name of a directory. Upon execution, the script should return a single number, which is the number of files in the directory which have been modified in the last 24 hours. The script should generate no error messages about inaccessible directories on the standard error stream.

If you have implemented the exercises correctly, you should be able to reproduce output akin to the following. (Do not be concerned if your actual numbers differ from those listed below.)

[student@station student]$ head *.txt
==> gplwords.txt <==
2009

==> localusers.txt <==
89

==> usrlibs.txt <==
216
[student@station student]$ nrecent /var/log
22

Deliverables

1. A file called ~/gplwords.txt, which contains the number of words found in the file /usr/share/doc/redhat-release-4ES/GPL.

2. A file called ~/localusers.txt, which contains the number of locally defined users on the Linux system.

3. A file called ~/usrlibs.txt, which contains the number of files that begin with lib and end with .a found in the /usr/lib directory.

4. An executable script called ~/bin/nrecent, which expects the name of a directory as its single argument. Upon execution, the script should return a single number, which is the number of files under the specified directory that have been modified in the past 24 hours. The script should return no error messages about inaccessible directories on the standard error stream.

Hints

For the file ~/localusers.txt, recall that local users are defined in the /etc/passwd file, one user per line.

For the script ~/bin/nrecent, recall that $1 dereferences to a bash script’s first argument. Consider using the find command to generate a list of files that match the criteria, and then count the number of lines (or words) in the output. You might want to use the /etc or /var/log directories to test your script.

Questions

1. Create an empty file using the touch foo command. How many characters does it contain?

( ) a. 0

( ) b. 2

( ) c. 1

( ) d. The wc command does not work on empty files.

( ) e. None of the above

2. Create a file using the echo > foo command. How many characters does it have?

( ) a. 2

( ) b. 0

( ) c. 1

( ) d. The wc command does not work on empty files.


3. Create a file using echo -e '\n\n\n\n' > foo; how many words does it have?

( ) a. 2

( ) b. 1

( ) c. 4

( ) d. 5

( ) e. 0




4. Which of the following command lines would generate a single word output, which is the sum of the number of words found in the files /etc/services and /etc/hosts?

( ) a. cat /etc/services /etc/hosts | wc -w

( ) b. wc -w < /etc/hosts /etc/services

( ) c. wc -w /etc/hosts /etc/services

( ) d. A and C

( ) e. All of the above

5. Which of the following command lines would generate a single word output, which is the number of users logged into the local machine (as reported by the w command)?

( ) a. w | wc -u

( ) b. w | tail -3 | wc -w

( ) c. w | tail -n +3 | wc -l

( ) d. w | tail +USER | wc -c


Use the following transcript to answer the next two questions.

[student@station student]$ cat /etc/adjtime
-9.359142 1064838378 0.000000
1064838378
UTC

6. What would you expect the command wc -w < /etc/adjtime to return?

( ) a. 5

( ) b. 6

( ) c. 7

( ) d. 8


7. What would you expect the command wc -l < /etc/adjtime to return?

( ) a. 0

( ) b. 5

( ) c. 3

( ) d. 4





Use the following transcript to answer the next two questions.

[student@station student]$ ls -s /etc/group
4 /etc/group
[student@station student]$ ls -l /etc/group
-rw-r--r-- 1 root root 2475 Aug 17 12:34 /etc/group

8. What would the command wc -l < /etc/group return?

( ) a. 4

( ) b. 2475

( ) c. 12

( ) d. An error, because wc requires at least one filename as an argument.

( ) e. Not enough information is provided.

9. What would the command wc -c < /etc/group return?

( ) a. 4

( ) b. 2475

( ) c. 12

( ) d. An error, because wc requires at least one filename as an argument.

( ) e. Not enough information is provided.

10. Which of the following commands can be used to distinguish a tab from a series of spaces in a text file?

( ) a. cat -A

( ) b. cat -t

( ) c. cat -uT

( ) d. A and B


Notes

1. While this may seem an obvious way to do things, some operating systems take more elaborate approaches. The Macintosh operating system, for example, stores files using two arrays of information, a data fork and a resource fork.



Chapter 2. Finding Text: grep

Key Concepts

• grep is a command that prints lines that match a specified text string or pattern.

• grep is commonly used as a filter to reduce output to only desired items.

• grep -r will recursively grep files underneath a given directory.

• grep -v prints lines that do NOT match a specified text string or pattern.

• Many other command line switches allow users to specify grep’s output format.

Discussion

Searching Text File Contents using grep

In an earlier Lesson, we saw how the wc program can be used to count the characters, words and lines in text files. In this Lesson we introduce the grep program, a handy tool for searching text file contents for specific words or character sequences.

The name grep comes from the ed editor command g/re/p: globally search for a regular expression and print the matching lines. What, you may well ask, is a regular expression and why on earth should I want to search for one? We will provide a more formal definition of regular expressions in a later Lesson, but for now it is enough to know that a regular expression is simply a way of describing a pattern, or template, to match some sequence of characters. A simple regular expression would be “Hello”, which matches exactly five characters: “H”, “e”, two consecutive “l” characters, and a final “o”. More powerful search patterns are possible and we shall examine them in the next section.

The figure below gives the general form of the grep command line:

Figure 2-1. Form of the grep commands

There are actually three different names for the grep tool 1:

fgrep
Does a fast search for simple patterns. Use this command to quickly locate patterns without any wildcard characters, which is useful when searching for an ordinary word.

grep
Pattern searches using ordinary regular expressions.

egrep
Pattern searches using more powerful extended regular expressions.

The pattern argument supplies the template characters for which grep is to search. The pattern is expected to be a single argument, so if pattern contains any spaces, or other characters special to the shell, you must enclose the pattern in quotes to prevent the shell from expanding or word splitting it.
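The effect of quoting can be seen with a two-word pattern. The following sketch writes two sample sentences to a temporary file and searches for a pattern that contains a space; the file name is arbitrary.

```shell
notes=$(mktemp)
printf 'This file has some words.\nIt also has even more words.\n' > "$notes"

# Quoted: grep receives the single pattern "has even".
quoted_match=$(grep 'has even' "$notes")
echo "$quoted_match"

# Unquoted (grep has even "$notes"), the shell would hand grep the
# pattern "has" plus two file names: "even" and the temporary file.
# That would match both lines and complain that "even" does not exist.

rm -f "$notes"
```

Single quotes are the safest choice when the pattern also contains characters such as $ or *, because the shell leaves everything between them untouched.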

The following table summarizes some of grep’s more commonly used command line switches. Consult the grep(1) man page (or invoke grep --help) for more.

Table 2-1. Common Command Line Switches for the grep Command

Switch          Effect
-c              Print a count of matching lines only.
-h              Suppress filename prefixes.
-e expression   Use expression as a search pattern. (Helpful for specifying several alternate patterns.)
-i              Ignore case when determining matches.
-l              Print only the names of files that contain a match.
-n              Include line numbers along with matching lines.
-q              "Quiet". Do not write anything to standard out. Instead, exit with a zero exit status if any match is found.
-r              Search all files, recursing through directories.
-w              Only match whole words.
-C N            Include N lines of context before and after the matched line (for example, -C 2 for two lines).
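Several of these switches can be compared side by side on a small file. The sketch below uses four made-up lines in a temporary file; the variable names are only there to label the results.

```shell
sample=$(mktemp)
printf 'Every cat has one more tail than no cat.\nNo cat has nine tails.\nTherefore, every cat has ten tails.\nCategory: fallacy\n' > "$sample"

count_cat=$(grep -c cat "$sample")               # lines containing lowercase "cat"
count_any_case=$(grep -ci cat "$sample")         # -i also matches "Category"
count_whole=$(grep -ciw cat "$sample")           # -w rejects "Category" again
numbered=$(grep -n nine "$sample")               # prefix the match with its line number
inverted=$(grep -v cat "$sample")                # lines that do NOT contain "cat"
either=$(grep -c -e nine -e ten "$sample")       # lines matching either pattern

# -q prints nothing; only the exit status says whether a match was found.
if grep -q nine "$sample"; then found=yes; else found=no; fi

echo "cat=$count_cat any_case=$count_any_case whole=$count_whole either=$either found=$found"
echo "$numbered"
echo "$inverted"
rm -f "$sample"
```

Because -q reports through its exit status alone, it is the natural choice inside an if statement in a script.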

Show All Occurrences of a String in a File

Under Linux, there are often several ways of accomplishing the same task. For example, to see if a file contains the word “even”, you could just visually scan the file:

[student@station student]$ cat file
This file has some words.
It also has even more words.

Reading the file, we see that the file does indeed contain the letters “even”. This method breaks down on a large file, because we could easily miss one word in a file of several thousand, or even several hundred thousand, words. We can use the grep tool to search through the file for us automatically:

[student@station student]$ grep even file
It also has even more words.



Here we searched for a word using its exact spelling. Instead of just a literal string, the pattern argument can also be a general template for matching more complicated character sequences; we shall explore that in a later Lesson.

Searching in Several Files at Once

An easy way to search several files is just to name them on the grep command line:

[student@station student]$ echo Every cat has one more tail than no cat. > general
[student@station student]$ echo No cat has nine tails. > specific
[student@station student]$ echo Therefore, every cat has ten tails. > fallacy
[student@station student]$ grep cat general specific fallacy
general:Every cat has one more tail than no cat.
specific:No cat has nine tails.
fallacy:Therefore, every cat has ten tails.

Perhaps we are more interested in just discovering which file mentions the word “nine” than actually seeing the line itself. Adding the -l switch to the grep line does just that:

[student@station student]$ grep -l nine general specific fallacy
specific

Searching Directories Recursively

Grep can also search all the files in a whole directory tree with a single command. This can be handy when working with a large number of files.

The easiest way to understand this is to see it in action. In the directory /etc/sysconfig are text files that contain much of the configuration information about a Linux system. The Linux name for the first Ethernet network device on a system is “eth0”, so you can find which file contains the configuration for eth0 by letting the grep -r command do the searching for you (see Note 2):

[student@station student]$ grep -r eth0 /etc/sysconfig 2>/dev/null
/etc/sysconfig/network-scripts/ifup-aliases:# Specify multiple ranges using \
    multiple files, such as ifcfg-eth0-range0 and
/etc/sysconfig/network-scripts/ifup-aliases:# ifcfg-eth0-range1, etc. In these \
    files, the following configuration variables
/etc/sysconfig/network-scripts/ifup-aliases:# The above example values create \
    the interfaces eth0:0 through eth0:253 using
/etc/sysconfig/network-scripts/ifup-ipv6:# Example: \
    IPV6TO4_ROUTING="eth0-:f101::0/64 eth1-:f102::0/64"
/etc/sysconfig/network-scripts/ifcfg-eth0:DEVICE='eth0'
/etc/sysconfig/networking/devices/ifcfg-eth0:DEVICE='eth0'
/etc/sysconfig/networking/profiles/default/ifcfg-eth0:DEVICE='eth0'

Every file in /etc/sysconfig that mentions eth0 is shown in the results.

We can further limit the files listed to only those referring to an actual device by filtering the grep -r output through a grep DEVICE:




[student@station student]$ grep -r eth0 /etc/sysconfig 2>/dev/null | grep DEVICE
/etc/sysconfig/network-scripts/ifcfg-eth0:DEVICE='eth0'
/etc/sysconfig/networking/devices/ifcfg-eth0:DEVICE='eth0'
/etc/sysconfig/networking/profiles/default/ifcfg-eth0:DEVICE='eth0'

This shows a common use of grep as a filter to simplify the outputs of other commands.

If only the names of the files are of interest, the output can be simplified with the -l command line switch.

[student@station student]$ grep -rl eth0 /etc/sysconfig 2>/dev/null
/etc/sysconfig/network-scripts/ifup-aliases
/etc/sysconfig/network-scripts/ifup-ipv6
/etc/sysconfig/network-scripts/ifcfg-eth0
/etc/sysconfig/networking/devices/ifcfg-eth0
/etc/sysconfig/networking/profiles/default/ifcfg-eth0
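Because the contents of /etc/sysconfig differ from system to system, the same recursive techniques can be rehearsed on a throwaway tree. This is only a sketch: the directory layout and file contents below are hypothetical stand-ins for the real configuration files.

```shell
# Hypothetical directory tree standing in for /etc/sysconfig.
tree=$(mktemp -d)
mkdir -p "$tree/network-scripts" "$tree/networking/devices"
echo "DEVICE='eth0'"             > "$tree/network-scripts/ifcfg-eth0"
echo "# see ifcfg-eth0 for eth0" > "$tree/network-scripts/ifup-aliases"
echo "DEVICE='eth0'"             > "$tree/networking/devices/ifcfg-eth0"
echo "HOSTNAME=station"          > "$tree/network"

# -r descends into subdirectories; -l prints each matching file name once.
# The order in which grep visits files is not guaranteed, so sort the list.
files=$(grep -rl eth0 "$tree" | sort)
echo "$files"

# Filter the recursive output through a second grep and count the result.
devices=$(grep -r eth0 "$tree" | grep -c DEVICE)
echo "DEVICE lines: $devices"

rm -r "$tree"
```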

Inverting grep

By default, grep shows only the lines matching the search pattern. Usually, this is what you want, but sometimes you are interested in the lines that do not match the pattern. In these instances, the -v command line switch inverts grep’s operation.

[student@station student]$ head -n 4 /etc/passwd
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:
daemon:x:2:2:daemon:/sbin:
adm:x:3:4:adm:/var/adm:
[student@station student]$ grep -v root /etc/passwd | head -n 3
bin:x:1:1:bin:/bin:
daemon:x:2:2:daemon:/sbin:
adm:x:3:4:adm:/var/adm:
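One detail worth checking by hand is that -v drops every line containing the text anywhere, not only lines for the account of that name. The sketch below uses a made-up four-line file in the /etc/passwd format (an assumption for illustration) in which a second account happens to use /root as its home directory.

```shell
# Hypothetical passwd-style sample; not the real /etc/passwd.
sample=$(mktemp)
printf '%s\n' 'root:x:0:0:root:/root:/bin/bash' \
              'bin:x:1:1:bin:/bin:' \
              'daemon:x:2:2:daemon:/sbin:' \
              'operator:x:11:0:operator:/root:' > "$sample"

matching=$(grep -c root "$sample")    # lines that contain "root"
inverted=$(grep -vc root "$sample")   # lines that do not
total=$(wc -l < "$sample")

echo "matching=$matching inverted=$inverted total=$total"
rm "$sample"
```

The matching and inverted counts always add up to the number of lines in the file, and the operator line is excluded by -v because its home directory contains the text root.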

Getting Line Numbers

Often you may be searching a large file that has many occurrences of the pattern. Grep will list each line containing one or more matches, but how is one to locate those lines in the original file? Using the grep -n command will also list the line number of each matching line.

The file /usr/share/dict/words contains a list of common dictionary words. Identify which line contains the word “dictionary”:

[student@station student]$ fgrep -n dictionary /usr/share/dict/words
12526:dictionary

You might also want to combine the -n switch with the -r switch when searching all the files below a directory:

[student@station student]$ fgrep -nr dictionary /usr/share/dict
/usr/share/dict/linux.words:12526:dictionary
/usr/share/dict/words:12526:dictionary

Limiting Matching to Whole Words

Remember the file containing our nursery rhyme earlier?

[student@station student]$ cat rhyme
The cat
sat on
the mat
at home.

Suppose we wanted to retrieve all lines containing the word “at”. If we try the command:

[student@station student]$ fgrep at rhyme
The cat
sat on
the mat
at home.

Do you see what happened? We matched the “at” string, whether it was an isolated word or part of a larger word. The grep command provides the -w switch to specify that the pattern should only match entire words.

[student@station student]$ grep -w at rhyme
at home.

The -w switch considers a sequence of letters, numbers, and underscore characters, surrounded by anything else, to be a word.
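The word definition above can be probed with a slightly extended rhyme. In this sketch the last two lines of the file are additions invented for illustration: one places an underscore next to “at” and the other places a space and a digit around it.

```shell
# The rhyme from the text plus two hypothetical extra lines.
rhyme=$(mktemp)
printf '%s\n' 'The cat' 'sat on' 'the mat' 'at home.' \
              'look at_me' 'meet at 9' > "$rhyme"

# Without -w, every line containing the letters "at" is counted.
plain=$(grep -c at "$rhyme")

# With -w, "at" must be bounded by non-word characters or line edges.
# "at_me" is rejected because the underscore counts as part of a word.
whole=$(grep -w at "$rhyme")

echo "plain matches: $plain"
echo "$whole"
rm "$rhyme"
```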

Ignoring Case

The string “Bob” has a meaning quite different from the string “bob”. However, sometimes we want to find either one, regardless of whether the word is capitalized or not. The grep -i command solves just this problem.

Look again at our nursery rhyme:

[student@station student]$ cat rhyme
The cat
sat on
the mat
at home.

See if the file contains the word “the”, all in lowercase letters:

[student@station student]$ grep the rhyme
the mat


Copyright (c) 2003-2005 Red Hat, Inc. All rights reserved. For use only by a student enrolled in a Red Hat Academy course taught at a Red Hat Academy. Any other use isa violation of U.S. and international copyrights. No part of this publication may be photocopied, duplicated, stored in a retrieval system, or otherwise duplicated whetherin electronic or print format without prior written consent of Red Hat, Inc. If you believe Red Hat course materials are being used, copied, or otherwise improperly distributedplease email [email protected] or phone toll-free (USA) +1 866 626 2994 or +1 (919) 754 3700.



Now see which lines contain the letters “t”, “h”, and “e” in any combination of lower- or upper-case letters:

[student@station student]$ grep -in the rhyme
1:The cat
3:the mat

Notice that we also used the -n switch to add the line numbers to the output.

Examples

Example 1. Finding Simple Character Strings

Verify that your computer has the system account “lp”, used for the line printer tools. Hint: the file /etc/passwd contains one line for each user account on the system.

[student@station student]$ grep lp /etc/passwd
lp:x:4:7:lp:/var/spool/lpd:

Example 2. In That Case

Search for an exact copy of the pattern:

[student@station student]$ grep LP /etc/passwd
[student@station student]$

Nothing was matched because the pattern does not match the case of the account name. Search again and ignore the case:

[student@station student]$ grep -i LP /etc/passwd
lp:x:4:7:lp:/var/spool/lpd:

Example 3. Matching Whole Words

We have seen that grep will match the pattern wherever the pattern is located, even in the middle of words. Search for the pattern “honey” in the system word dictionary /usr/share/dict/words:

[student@station student]$ grep honey /usr/share/dict/words
honey
honeybee
honeycomb
honeycombed
honeydew
honeymoon
honeymooned
honeymooner
honeymooners
honeymooning
honeymoons
honeysuckle
Mahoney

Evidently, the dictionary contains several words using the string “honey” as a root word. We can limit the matching to whole words by using the grep -w command. The grep command considers a word to be a group of letters, digits, or underscores surrounded by anything else. The beginning and end of a line also qualify as “anything else”, so the first or last word on a line is recognized correctly. Try to look up “honey” in the dictionary again:

[student@station student]$ grep -w honey /usr/share/dict/words
honey

For lack of a better word: perfect.

Example 4. Combining grep and xargs

Suppose that you have been placed in charge of maintaining the help file documentation for the vim editor. As you browse through the already existing files, you notice that in some places, the help files use the two words command line, and in other places the single word commandline. You would like the help files to be consistent, and decide the former is correct.

You would now like to find every occurrence of the text commandline, and change each one to command line. You start by identifying which files contain the text commandline.

[student@station student]$ grep -ril commandline /usr/share/doc/vim*
/usr/share/doc/vim-common-6.1/docs/message.txt
/usr/share/doc/vim-common-6.1/docs/options.txt
/usr/share/doc/vim-common-6.1/docs/os_risc.txt
/usr/share/doc/vim-common-6.1/docs/tags
/usr/share/doc/vim-common-6.1/docs/todo.txt
/usr/share/doc/vim-common-6.1/docs/various.txt
/usr/share/doc/vim-common-6.1/docs/version5.txt

You would now like to open each of these files in the gedit text editor, so that you can make your edits. You pipe the results of your search into the gedit command.

[student@station student]$ grep -ril commandline /usr/share/doc/vim* | gedit

The gedit editor opens, but with an empty buffer titled "untitled". This is not what you had meant! You had wanted gedit to open the filenames that the grep command supplied on stdin, not stdin itself. Unfortunately, that’s not how gedit works. gedit (like most text editors) expects filenames to be supplied as arguments on the command line, not using stdin.

Fortunately, there is a standard Linux (and Unix) utility that helps in just such situations: xargs. The xargs command will read stdin, and append any words found there to the supplied command line, as additional arguments. Hopefully, the following example will clarify. With your knowledge of the xargs command, you modify your previous approach.




[student@station student]$ grep -ril commandline /usr/share/doc/vim* | xargs gedit

Now, the gedit editor opens up with multiple buffers, one for each file output by the grep command.

Figure 2-2. Using xargs to Convert Standard In into Arguments for gedit

Notice that you never had to type in the individual file names. The words supplied on stdin were exchanged for arguments on the command line, thus the name xargs. Nice.
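Because xargs splits its input into words, a file name that contains a space would be handed to the command as two separate arguments. One common remedy, sketched here under the assumption that GNU grep and GNU xargs are in use, is to separate the names with NUL bytes using grep -Z and xargs -0. The directory, the file names, and the use of ls as a harmless stand-in for gedit are all hypothetical.

```shell
# Hypothetical documentation directory; one file name contains spaces.
docs=$(mktemp -d)
echo 'use the commandline here' > "$docs/options.txt"
echo 'the command line is fine' > "$docs/tags"
echo 'Commandline again'        > "$docs/file with spaces.txt"

# -Z ends each file name with a NUL byte, and xargs -0 splits on NUL,
# so names containing spaces reach the command as single arguments.
# ls -1 stands in for an editor such as gedit.
listing=$(grep -rilZ commandline "$docs" | xargs -0 ls -1 | sort)
echo "$listing"

rm -r "$docs"
```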

Online Exercises

Lab Exercise

Objective: Use the grep command to find occurrences of specified text.

Specification

1. Create the file ~/bashusers.txt, which contains lines from the /etc/passwd file which contain the text /bin/bash.

2. Create the file ~/nostdhome.txt, which contains only lines from the /etc/passwd file which do not contain the text home (implying that the associated user has a nonstandard home directory).

3. Create the file ~/ansiterms.txt, which contains every line from the /etc/termcap file which contains the text ansi, using a case insensitive search. (In other words, ansi, ANSI, Ansi, and AnSi would all count as matches).

4. Create the file ~/mayhemnum.txt, which contains the line number of the word mayhem from the file /usr/share/dict/words as its only word.



5. Create the file ~/firstredhat.txt, which contains an alphabetically sorted list of all files underneath the /usr/share/firstboot directory (and its subdirectories) that contain the text redhat, using a case insensitive search. The files should be listed one per line using absolute references.

Deliverables

1. The file ~/bashusers.txt, which contains lines from the /etc/passwd file which contain the text /bin/bash.

2. The file ~/nostdhome.txt, which contains lines from the /etc/passwd file which do not contain the text home.

3. The file ~/ansiterms.txt, which contains every line from the /etc/termcap file which contains the text ansi, using a case insensitive search.

4. The file ~/mayhemnum.txt, which contains the line number of the word mayhem from the file /usr/share/dict/words as its only word.

5. The file ~/firstredhat.txt, which contains an alphabetically sorted list of all files underneath the /usr/share/firstboot directory that contain the text redhat, using a case insensitive comparison. The files should be listed one per line using absolute references.

Questions

1. Which of the following command lines would list lines from the file /etc/group which contain the text elvis?

( ) a. grep /etc/group elvis

( ) b. echo elvis | grep /etc/group

( ) c. echo /etc/group | grep elvis

( ) d. A and C


2. To allow the search pattern HELLO to match both hello and HELLO, you would use the grep command with which command line switch?

( ) a. -i

( ) b. -r

( ) c. -w

( ) d. -k




3. To search an entire directory hierarchy, you would run grep using which command line switch?

( ) a. -i

( ) b. -f

( ) c. -n

( ) d. -r


4. Which of the following command lines would list all lines from /usr/share/dict/words which contain the text sun, but only those lines, preceded by their line number?

( ) a. grep -n sun /usr/share/dict/words

( ) b. grep -N /usr/share/dict/words sun

( ) c. grep -r /usr/share/dict/words sun

( ) d. grep -r sun /usr/share/dict/words


5. Which of the following command lines would return the number of lines which contain the text freedom found in the file /usr/share/doc/redhat-release-3ES/GPL?

( ) a. grep freedom /usr/share/doc/redhat-release-3ES/GPL | wc -w

( ) b. grep freedom /usr/share/doc/redhat-release-3ES/GPL | wc -l

( ) c. grep freedom /usr/share/doc/redhat-release-3ES/GPL | wc -c

( ) d. grep freedom /usr/share/doc/redhat-release-3ES/GPL | wc -n


6. Which of the following command lines would list the names of files (and only the names of files) found underneath the /etc directory which contain the word network (i.e., the word networking would not count)?

( ) a. grep -rwl network /etc

( ) b. grep -wl network /etc

( ) c. grep -rl network /etc

( ) d. grep -ilw network /etc





7. Which of the following command lines would reduce the output of the ps aux command to only those lines which do not contain the text root?

( ) a. grep root < ps aux

( ) b. ps aux | grep -v root

( ) c. grep -x root | ps aux

( ) d. ps aux >> grep -k root


8. Which of the following would list lines from the file /etc/nsswitch.conf which contain the text nisplus?

( ) a. grep nisplus /etc/nsswitch.conf

( ) b. grep nisplus < /etc/nsswitch.conf

( ) c. grep -q nisplus /etc/nsswitch.conf

( ) d. All of the above

( ) e. A and B only

9. Which of the following would list every file under the /usr/share/gnome directory which contains the text Free Software Foundation on a single line?

( ) a. grep -ril Free Software Foundation /usr/share/gnome

( ) b. ls /usr/share/gnome | grep -i Free Software Foundation

( ) c. grep -rc Free Software Foundation /usr/share/gnome

( ) d. grep -rl "Free Software Foundation" /usr/share/gnome


10. Which of the following command lines would list every line which contains the text cdrom from the file README, along with two lines of context before and after the matching line?

( ) a. grep cdrom README | head -2 | tail -2

( ) b. grep -n2 cdrom README

( ) c. grep -n2 README cdrom

( ) d. grep -k2 cdrom README




Notes

1. When the original grep program was written, computers did not have much memory, so having small programs was very desirable. Having a single program that could efficiently implement the three different types of searching was not possible unless the program were to be extraordinarily large, so the functions were separated into three programs. When the GNU project was started, computers could easily handle larger programs, so all three searching techniques were built into a single program, but all three program names were kept for compatibility with other UNIX-like systems.

2. There are some files under /etc/sysconfig that ordinary users cannot read. We have used the “2>/dev/null” idiom to have all error messages be ignored.



Chapter 3. Introduction to Regular Expressions

Key Concepts

• Regular expressions are a standard Unix syntax for specifying text patterns.

• Regular expressions are understood by many commands, including grep, sed, vi, and many scripting languages.

• Within regular expressions, . and [] are used to match characters.

• Within regular expressions, +, *, and ? specify a number of consecutive occurrences.

• Within regular expressions, ^ and $ specify the beginning and end of a line.

• Within regular expressions, (, ), and | specify alternative groups.

• The regex(7) man page provides complete details.

Discussion

Introducing Regular Expressions

In the previous chapter you saw grep used to match either a whole word or part of a word. This by itself is very powerful, especially in conjunction with switches like -i and -v, but it is not appropriate for all search scenarios. Here are some examples of searches that the grep usage you’ve learned so far would not be able to do:

First, suppose you had a file that looked like this:

[biafra@station]$ cat people_and_pets.txt
==========================
Name: Joe Green
Age: 36
Pets:
    Name: Aida
    Age: 5
    Species: Cat
    ------------
    Name: Hawn
    Age: 1
    Species: Goldfish

==========================
Name: Sarah Jane
Age: 29
Pets:
    Name: Orfeus
    Age: 7
    Species: Dog
    -------------
    Name: Euridice
    Age: 8
    Species: Dog

What if you wanted to pull out just the names of the people in people_and_pets.txt? A command like grep -w Name: would match the ’Name:’ line for each person, but also the ’Name:’ line for each person’s pet. How could we match only the ’Name:’ lines for people? Well, notice that the lines for pets’ names are all indented, meaning that those lines begin with whitespace characters instead of text. Thus, we could achieve our goal if we had a way to say "Show me all lines that begin with ’Name:’".

Another example: Suppose you and a friend both witnessed a hit-and-run car accident. You both got a look at the fleeing car’s license plate and yet each of you recalls a slightly different number. You read the license number as "4I35VBB" but your friend read it as "413SV88". It seems that what you read as an ’I’ in the second character, your friend read as a ’1’. Similar differences appear in your interpretations of other parts of the license like ’5’ vs ’S’ and ’BB’ vs ’88’. The police, having taken both of your statements, now need to narrow down the suspects by querying their database of license plates for plates that might match what you saw.

One solution might be to do separate queries for "4I35VBB" and "413SV88" but doing so assumes that one of you is exactly right. What if the perpetrator’s license number was actually "4135VB8"? In other words, what if you were right about some of the characters in question but your friend was right about others? It would be more effective if the police could query for a pattern that effectively said: "Show me all license numbers that begin with a ’4’, followed by an ’I’ or a ’1’, followed by a ’3’, followed by a ’5’ or an ’S’, followed by a ’V’, followed by two characters that are each either a ’B’ or an ’8’".

Query scenarios like these can be solved using regular expressions. While computer scientists sometimes use the term "regular expression" (or "regex" for short) to describe any method of describing complex patterns, in Linux and many programming languages the term refers to a very specific set of special characters used for solving problems like the above. Regular expressions are supported by a large number of tools including grep, vi, find and sed.

To introduce the usage of regular expressions, let’s look at some solutions to the two problems introduced earlier. Don’t worry if these seem a bit complicated; the remainder of the unit will start from scratch and cover regular expressions in great detail.

A regex that could solve the first problem, where we wanted to say "Show me all lines that begin with ’Name:’" might look like this:

[biafra@station]$ grep '^Name:' people_and_pets.txt
Name: Joe Green
Name: Sarah Jane

...that’s it! Regular expressions are all about the use of special characters, called metacharacters, to represent advanced query parameters. The caret ("^"), as shown here, means "Lines that begin with...". Note, by the way, that the regular expression was put in single quotes. This is a good habit to get into early on, as it prevents bash from interpreting special characters that were meant for grep.
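The effect of the caret is easy to confirm on a cut-down copy of the file. The sketch below is an illustration only: the shortened file contents and the four-space indentation of the pet entries are assumptions.

```shell
# Abbreviated, hypothetical version of people_and_pets.txt.
pets=$(mktemp)
cat > "$pets" <<'EOF'
Name: Joe Green
Age: 36
Pets:
    Name: Aida
    Species: Cat
Name: Sarah Jane
Pets:
    Name: Orfeus
EOF

# Without the anchor, people and pets both match.
all=$(grep -c 'Name:' "$pets")

# With ^, only lines that start with "Name:" match; the indented
# pet lines begin with spaces and are skipped.
people=$(grep '^Name:' "$pets")

echo "unanchored matches: $all"
echo "$people"
rm "$pets"
```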

Ok, so what about the second problem? That one involved a much more complicated query: "Show me all license numbers that begin with a ’4’, followed by an ’I’ or a ’1’, followed by a ’3’, followed by a ’5’ or an ’S’, followed by a ’V’, followed by two characters that are each either a ’B’ or an ’8’". This could be represented by a regular expression that looks like this:

4[I1]3[5S]V[B8]{2}

Wow, that’s pretty short considering how long it took to write out what we were looking for! There are only two types of regex metacharacters used here: square brackets (’[]’) and curly braces (’{}’). When two or more characters are shown within square brackets it means "any one of these". So ’[B8]’ near the end of the expression means "’B’ or ’8’". When a number is shown within curly braces it means "this many of the preceding item". Thus, ’[B8]{2}’ means "two characters that are each either a ’B’ or an ’8’". Pretty powerful stuff!
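To see the expression at work, it can be run against a short list of plates. This is a sketch under a few assumptions: the plate list is invented, grep -E is used as the equivalent of egrep, and the anchors ^ and $ are added so that the whole line must fit the pattern.

```shell
# Hypothetical database of license plates, one per line.
plates=$(mktemp)
printf '%s\n' '4I35VBB' '413SV88' '4135VB8' \
              '4135VB' '4235VB8' 'X4135VB8' > "$plates"

# grep -E selects extended regular expressions, as egrep does.
# ^ and $ force the entire plate to fit, so the too-short plate,
# the plate with a 2, and the plate with a leading X are rejected.
matches=$(grep -E '^4[I1]3[5S]V[B8]{2}$' "$plates")

echo "$matches"
rm "$plates"
```

Both witness statements match, and so does the mixed plate 4135VB8 that neither witness reported exactly.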

Now that you’ve gotten a taste of what regular expressions are and how they can be used, let’s start from scratch and cover them in depth.

Regular Expressions, Extended Regular Expressions, and the grep Command

As the Unix implementation of regular expression syntax has evolved, new metacharacters have been introduced. In order to preserve backward compatibility, commands usually choose to implement either basic regular expressions or extended regular expressions. In order not to become bogged down with the differences, this Lesson will introduce the extended syntax, summarizing differences at the end of the discussion.

One of the most common uses for regular expressions is specifying search patterns for the grep command. As was mentioned in the previous Lesson, there are three versions of the grep command. To reiterate, the three differ in how they interpret regular expressions.

fgrep

The fgrep command is designed to be a "fast" grep. The fgrep command does not support regular expressions, but instead interprets every character in the specified search pattern literally.

grep

The grep command interprets each pattern using the original, basic regular expression syntax.

egrep

The egrep command interprets each pattern using extended regular expression syntax.

Because we are not yet making a distinction between the basic and extended regular expression syntax, the egrep command should be used whenever the search pattern contains regular expressions.
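The difference between the three interpretations shows up as soon as a pattern contains a metacharacter. The following sketch uses the -F and -E switches of grep, which select the same behavior as fgrep and egrep; the three-line sample file is an assumption made for illustration.

```shell
# Hypothetical sample: one line contains a literal plus sign.
f=$(mktemp)
printf '%s\n' 'a+b' 'aab' 'ab' > "$f"

# Fixed strings (fgrep): every character is literal.
fixed=$(grep -F -c 'a+b' "$f")      # only the line "a+b"
dotfixed=$(grep -F -c 'a.b' "$f")   # no line contains a literal "a.b"

# Basic syntax (grep): "." is a wildcard, but "+" is still literal.
basic=$(grep -c 'a+b' "$f")         # only the line "a+b"
dotbasic=$(grep -c 'a.b' "$f")      # "a+b" and "aab"

# Extended syntax (egrep): "+" means one or more of the preceding "a".
extended=$(grep -E -c 'a+b' "$f")   # "aab" and "ab", but not "a+b"

echo "fixed=$fixed dotfixed=$dotfixed basic=$basic dotbasic=$dotbasic extended=$extended"
rm "$f"
```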

Anatomy of a Regular Expression

In our discussion of the grep program family, we were introduced to the idea of using a pattern to identify the file content of interest. Our examples were carefully constructed so that the pattern contained exactly the text for which we were searching. We were careful to use only literal characters in our regular expressions; a literal character matches only itself. So when we used “hello” as the regular expression, we were using a five-character regular expression composed only of literal characters. While this let us concentrate on learning how to operate the grep program, it didn’t allow us to get a full appreciation of the power of regular expressions. Before we see regular expressions in use, we shall first see how they are constructed.

A regular expression is a sequence of:

Literal Characters

Literal characters match only themselves. Examples of literals are letters, digits and most special characters (see below for the exceptions).

Wildcards

Wildcard characters match any character. Within a regular expression, a period (“.”) matches any character, be it a space, a letter, a digit, punctuation, anything.

Modifiers

A modifier alters the meaning of the immediately preceding pattern character. For example, the expression “ab*c” matches the strings “ac”, “abc”, “abbc”, “abbbc”, and so on, because the asterisk (“*”) is a modifier that means “any number of (including zero)”. Thus, our pattern means to match any sequence of characters consisting of one “a”, a (possibly empty) series of “b” characters, and a final “c” character.

Anchors

Anchors establish the context for the pattern, such as "the beginning of a line", or "the end of a word". For example, the expression “cat” would match any occurrence of the three letters, while “^cat” would only match lines that begin with “cat”.

Each of these is discussed in more detail in the sections below.

Taking Literals Literally

Literals are straightforward because each literal character in a regular expression matches one, and only one, copy of itself in the searched text. Uppercase characters are distinct from lowercase characters, so that “A” does not match “a”.

Wildcards

The "dot" wildcard

The character “.” is used as a placeholder, to match exactly one occurrence of any character. In the following example, the pattern matches any occurrence of the literal characters “x” and “s”, separated by exactly two other characters.

[student@station student]$ grep "x..s" /usr/share/dict/words | head -5
antitoxins
axers
axles
axons
boxers

Bracket Expressions: Ranges of Literal Characters

Normally a literal character in a regex pattern matches exactly one occurrence of itself in the searched text. Suppose we want to search for the string “hello” regardless of how it is capitalized: we want to match “Hello” and “HeLLo” as well. How might we do that?

A regex feature called a bracket expression solves this problem neatly. A bracket expression is a range of literals enclosed in square brackets (“[” and “]”). For example, the regex pattern “[Hh]” is a character range that matches exactly one character: either an uppercase “H” or a lowercase “h” letter. Notice that it doesn’t matter how large the set of characters within the range is; the set matches exactly one character, if it matches any at all. A bracket expression that matches the set of lowercase vowels could be written “[aeiou]” and would match exactly one vowel.

In the following example, bracket expressions are used to find words from the file /usr/share/dict/words. In the first case, the first five words that contain three consecutive (lowercase) vowels are printed. In the second case, the first five words that contain lowercase letters in the pattern vowel-consonant-vowel-consonant-vowel-consonant are printed.

If the first character of a bracket expression is a “^”, the interpretation is inverted, and the bracket expression will match any single occurrence of a character not included in the range. For example, the expression “[^aeiou]” would match any character that is not a lowercase vowel. The second command in the following example uses this inverted form to match the consonant positions.

[student@station student]$ egrep '[aeiou][aeiou][aeiou]' /usr/share/dict/words | head -5
absenteeism
Achaean
Achaeans
acquaint
acquaintance
[student@station student]$ egrep '[aeiou][^aeiou][aeiou][^aeiou][aeiou][^aeiou]' /usr/share/dict/words | head -5
abased
abasement
abasements
abases
abasing
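The “hello” question posed at the start of this section can be answered with one bracket expression per letter. The sketch below works on a small made-up word list (an assumption, since the contents of /usr/share/dict/words vary between systems) and also exercises the plain and inverted vowel sets.

```shell
# Hypothetical word list.
words=$(mktemp)
printf '%s\n' 'hello' 'Hello' 'HeLLo' 'help' 'queue' 'rhythm' > "$words"

# One bracket expression per letter matches any capitalization of "hello".
hellos=$(grep -c '[Hh][Ee][Ll][Ll][Oo]' "$words")

# Three consecutive lowercase vowels.
vowels3=$(grep '[aeiou][aeiou][aeiou]' "$words")

# Four consecutive characters that are NOT lowercase vowels.
consonants4=$(grep '[^aeiou][^aeiou][^aeiou][^aeiou]' "$words")

echo "hello variants: $hellos"
echo "three vowels:   $vowels3"
echo "four non-vowels: $consonants4"
rm "$words"
```

Note that “[^aeiou]” matches any character outside the set, including capital letters, digits and punctuation, not only consonants.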

Range Expressions vs. Character Classes: Old School and New School

Another way to express a character range is by giving the start- and end-letters of the sequence. For example, “[a-d]” would match any character from the set a, b, c or d. A typical usage of this form would be “[0-9]” to represent any single digit, or “[A-Z]” to represent all capital letters.

How are the characters ordered? For example, does uppercase “C” come before or after lowercase “b”? Recall the discussion about character encoding from the first Lesson. The encoded value of the letter is used to determine if one character is "lesser" or "greater" than another. As long as the character set which defines the encoding is ordered correctly, as is the case with ASCII, all is well. But what about the Latin-1 (ISO-8859-1) character set? Does “ö” really come after “z”?

As an alternative to such quandaries, modern regular expressions make use of character classes. Character classes match any single character, using language specific conventions to decide if a given character is uppercase or lowercase, or if it should be considered part of the alphabet or punctuation. The following table lists some supported character classes, and the ASCII equivalent range expression, where appropriate.

Table 3-1. Regular Expression Character Classes

Expression   Character Class                                            ASCII equivalent range
[:alnum:]    alphanumeric                                               A-Za-z0-9
[:alpha:]    alphabet character                                         A-Za-z
[:blank:]    space or tab
[:digit:]    numeric digit                                              0-9
[:lower:]    lowercase letters                                          a-z
[:punct:]    printable characters, excluding spaces and alphanumerics
[:space:]    whitespace character
[:upper:]    uppercase letter                                           A-Z

Character classes avoid problems you may run into when using regular expressions on systems that use different character encoding schemes where letters are ordered differently. For example, suppose you were to run the command:

[elvis@station]$ grep '[A-Z]' /usr/share/dict/words

On a Red Hat Enterprise Linux system, this would match every word in the file, not just those that contain capital letters as one might assume. This is because in the Unicode (UTF-8) locale that RHEL uses by default, characters are alphabetized case-insensitively, so that [A-Z] is roughly equivalent to [AaBbCc...etc]. On older systems, though, a different character encoding scheme is used where alphabetization is done case-sensitively. On such systems [A-Z] would be equivalent to [ABC...etc]. Character classes avoid this pitfall. You can run:

[elvis@station]$ grep '[[:upper:]]' /usr/share/dict/words

on any system regardless of the encoding scheme being used and it will only match lines that contain capital letters.
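A short experiment makes the class syntax concrete. In this sketch the sample file is an assumption, and the LC_ALL=C setting is used only to show one way of forcing range expressions to follow ASCII order; whether an unqualified [A-Z] misbehaves depends on the system's locale and grep version.

```shell
# Hypothetical sample file.
f=$(mktemp)
printf '%s\n' 'aardvark' 'Aaron' 'abc123' 'NASA' > "$f"

# A character class goes inside a bracket expression, hence the
# doubled brackets: [[:upper:]] is "one uppercase letter".
upper=$(grep '[[:upper:]]' "$f")
digits=$(grep -c '[[:digit:]]' "$f")

# In the C locale, ranges follow ASCII order, so [A-Z] means
# exactly the 26 uppercase letters.
c_range=$(LC_ALL=C grep -c '[A-Z]' "$f")

echo "$upper"
echo "lines with digits: $digits, C-locale [A-Z] lines: $c_range"
rm "$f"
```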

For more details about the predefined range expressions, consult the grep manual page. For more information on character encoding schemes under Linux, refer back to chapter 8.3. To learn about how character encoding schemes are used to support other languages in Red Hat Enterprise Linux, begin with the locale manual page.



Common Modifier Characters

We saw a common usage of a regex modifier in our earlier example “ab*c” to match an a and c character with some number of b letters in between. The “*” character changed the interpretation of the literal b character from matching exactly one letter to matching any number of b’s.

Here is a list of some common modifier characters:

b?

The question mark (“?”) means “either one or none”: the literal character is considered to be optional in the searched text. For example, the regex pattern “ab?c” matches the strings “ac” and “abc”, but not “abbc”.

b*

The asterisk (“*”) modifier means “any number (including zero)” of the preceding literal character. The regex pattern “ab*c” matches the strings “ac”, “abc”, “abbc”, and so on.

b+

The plus (“+”) modifier means “one or more”, so the regex pattern “b+” matches a non-empty sequence of b’s. The regex pattern “ab+c” matches the strings “abc” and “abbc”, but does not match “ac”.

b{m,n}

The brace modifier is used to specify a range of between m and n occurrences of the preceding character. The regex pattern “ab{2,4}c” would match “abbc”, “abbbc”, and “abbbbc”, but not “abc” or “abbbbbc”.

b{n}

With only one integer, the brace modifier is used to specify exactly n occurrences for the preceding character.
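The brace modifiers are easy to check against a short list of strings. This is a minimal sketch with invented sample strings. It uses grep -E, which selects the same extended syntax as egrep.

```shell
# Sample strings with zero to five b's between the a and the c.
sample=$(printf '%s\n' ac abc abbc abbbc abbbbc abbbbbc)

# Between two and four b's.
two_to_four=$(printf '%s\n' "$sample" | grep -E 'ab{2,4}c')

# Exactly three b's.
exactly_three=$(printf '%s\n' "$sample" | grep -E 'ab{3}c')

printf '%s\n' "$two_to_four"
echo ---
printf '%s\n' "$exactly_three"
```

The a and c on either side matter here. A bare “b{2,4}” would also be found inside “abbbbbc”, since any four of its b’s satisfy the pattern.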

In the following example, egrep prints lines from /usr/share/dict/words that contain patterns which start with a (capital or lowercase) “a”, might or might not next have a (lowercase) “b”, but then definitely follow with a (lowercase) “a”.

[student@station student]$ egrep '[Aa]b?a' /usr/share/dict/words | head -5
Aarhus
Aaron
Ababa
aback
abaft

The following example prints lines which contain patterns which start “al”, then use the “.*” expression to specify 0 or more occurrences of any character, followed by the pattern “bra”.

[student@station student]$ egrep 'al.*bra' /usr/share/dict/words | head
algebra
algebraic
algebraically
algebras
calibrate
calibrated
calibrates
calibrating
calibration
calibrations

Notice we found variations on the words algebra and calibrate. For the former, the .* expression matched “ge”, while for the latter, it matched the letter “i”.

The expression “.*”, which is interpreted as "0 or more of any character", shows up often in regex patterns, acting as the "stretchable glue" between two patterns of significance.

As a subtlety, we should note that the modifier characters are greedy: they always match the longest possible input string. For example, given the regex pattern:

t.*e

and the input stream:

now is the time

our pattern matches:

the time

instead of just “the”. When used in simple searches, such as grep, the difference is usually insignificant. When regular expressions are used in find and replace operations, however, as is done with many text editors, the difference becomes significant.
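To see the greedy behavior at work, the sketch below runs the same pattern through a search and through a replacement. It assumes a grep that supports the -o switch (GNU and BSD grep both do), which prints only the matched text. The last command shows one common way to rein in the match, by replacing “.” with a range that excludes spaces.

```shell
line='now is the time'

# grep -o prints only the part of the line that the pattern matched.
matched=$(printf '%s\n' "$line" | grep -oE 't.*e')

# In a replacement, the greedy match swallows everything from the
# first t through the last e.
replaced=$(printf '%s\n' "$line" | sed 's/t.*e/X/')

# Restricting the glue to non-space characters keeps the match inside one word.
one_word=$(printf '%s\n' "$line" | sed 's/t[^ ]*e/X/')

echo "$matched"
echo "$replaced"
echo "$one_word"
```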

Anchored Searches
Four additional search modifier characters are available:

^foo

A caret (“^”) matches the beginning of a line. Our example “^foo” matches the string “foo” only when it is at the beginning of a line.

foo$

A dollar sign (“$”) matches the end of a line. Our example “foo$” matches the string “foo” only at the end of a line, immediately before the newline character.

\<foo\>

By themselves, the less than sign (“<”) and the greater than sign (“>”) are literals. Using the backslash character to escape them transforms them into meaning “beginning of a word” and “end of a word”, respectively. Thus the pattern “\<cat\>” matches the word “cat” but not the word “catalog”.

You will frequently see both ^ and $ used together. The regex pattern “^foo$” matches a whole line that contains only “foo” and would not match that line if it contained any spaces.




The \< and \> are also usually used as pairs.
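The two kinds of anchors can be compared side by side on a handful of invented lines. This is a sketch only, and it assumes GNU grep, since the \< and \> word anchors are a GNU extension.

```shell
# Sample lines (made up for illustration); the last one has a leading space.
sample=$(printf '%s\n' 'cat' 'catalog' 'a cat nap' 'concat' ' cat')

# \< and \> anchor to the start and end of a word.
whole_word=$(printf '%s\n' "$sample" | grep '\<cat\>')

# ^ and $ together require the line to contain nothing but "cat".
whole_line=$(printf '%s\n' "$sample" | grep '^cat$')

printf 'word anchors:\n%s\n' "$whole_word"
printf 'line anchors:\n%s\n' "$whole_line"
```

The word anchors accept “cat” wherever it stands alone, including the line with the leading space. The line anchors accept only the first line.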

In the following example, the first search lists all lines that contain the letters “ion” anywhere on the line. The second search only lists lines which end in “ion”.

[student@station student]$ egrep ion /usr/share/dict/words | head -5
abbreviation
abbreviations
abduction
abductions
aberration
[student@station student]$ egrep 'ion$' /usr/share/dict/words | head -5
abbreviation
abduction
aberration
abjection
ablation

Coming to Terms with Regex Grouping
The same way that you can use parentheses to group terms within a mathematical expression, you also use parentheses to collect regular expression pattern specifiers into groups. This lets the modifier characters “?”, “*” and “+” apply to groups of regex specifiers instead of only the immediately preceding specifier.

Suppose we need a regular expression to match either “foo” or “foobar”. We could write the regex as “foo(bar)?” and get the desired results. This lets the “?” modifier apply to the whole string “bar” instead of only the preceding “r” character.

Grouping regex specifiers using parentheses becomes even more flexible when the pipe symbol (“|”) is used to separate alternative patterns. Using alternatives, we could rewrite our previous example as “(foo|foobar)”. Writing this as “foo|foobar” is simpler and works just as well, because just like mathematics, regex specifiers have precedence. While you are learning, always enclose your groups in parentheses.
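Precedence starts to matter as soon as anything else is added to the pattern. As a small sketch on invented input, the two searches below differ only in their parentheses. Without them, the “|” splits the whole pattern, so each anchor belongs to just one of the alternatives.

```shell
sample=$(printf '%s\n' 'foo' 'foobar' 'food' 'xfoobar')

# Without parentheses: "starts with foo" OR "ends with foobar".
ungrouped=$(printf '%s\n' "$sample" | grep -E '^foo|foobar$')

# With parentheses: the whole line is either foo or foobar.
grouped=$(printf '%s\n' "$sample" | grep -E '^(foo|foobar)$')

printf 'ungrouped:\n%s\n' "$ungrouped"
printf 'grouped:\n%s\n' "$grouped"
```

The ungrouped pattern accepts all four lines, while the grouped pattern accepts only “foo” and “foobar”.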

In the following example, the first search prints all lines from the file /usr/share/dict/words which contain four consecutive vowels (compare the syntax to that used when first introducing range expressions, above). The second search finds words that contain a double “o” or a double “e”, followed (somewhere) by a double “e”.

[student@station student]$ egrep '[aeiou]{4}' /usr/share/dict/words | head -5
aqueous
dequeue
dequeued
dequeues
dequeuing
[student@station student]$ egrep '(o|e){2}.*ee' /usr/share/dict/words
bookkeeper
bookkeepers
bookkeeping
Chattahoochee
doorkeeper
freewheel
Greentree

Escaping Meta-Characters
Sometimes you need to match a character that would ordinarily be interpreted as a regular expression wildcard or modifier character. To temporarily disable the special meaning of these characters, simply escape them using the backslash (“\”) character. For example, the regex pattern “cat.” would match the letters “cat” followed by any character: “cats” or “catchup”. To match only the letters “cat.” at the end of a sentence, use the regex pattern “cat\.” to disable interpreting the period as a wildcard character.
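The effect of the backslash can be checked on a few invented lines. In this sketch the unescaped period acts as a wildcard, and the escaped period matches only a literal “.”.

```shell
sample=$(printf '%s\n' 'I like cats' 'I fed the cat.' 'catchup')

# Unescaped: the period matches any single character after "cat".
wildcard=$(printf '%s\n' "$sample" | grep -E 'cat.')

# Escaped: the period must be a literal period.
literal=$(printf '%s\n' "$sample" | grep -E 'cat\.')

printf 'cat.  matches:\n%s\n' "$wildcard"
printf 'cat\\. matches:\n%s\n' "$literal"
```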

Note one distracting exception to this rule. When the backslash character precedes a “<” or “>” character, it enables the special interpretation (anchoring the beginning or ending of a word) instead of disabling the special interpretation. Shudder. It even gets worse - see the footnote at the bottom of the following table.

Summary of Linux Regular Expression Syntax
The following table summarizes regular expression syntax, and identifies which components are found in basic regular expression syntax, and which are found only in the extended regular expression syntax.

Table 3-2. Summary of Linux Regular Expression Syntax

Character        Role              Regex Syntax       Interpretation
.                wildcard          basic              match one of any character
[abc], [a-z]     inclusion range   basic              match one of any character included in range
[^abc], [^a-z]   exclusion range   basic              match one of any character not included in range
?                modifier          extended           match 0 or 1 of preceding term
*                modifier          basic              match 0 or more of preceding term
+                modifier          extended           match 1 or more of preceding term
{m,n}            modifier          extended           match between m and n (inclusively) occurrences of the preceding term
{n}              modifier          extended           match exactly n occurrences of the preceding term
^                anchor            basic              mark beginning of a line
$                anchor            basic              mark end of a line
\<               anchor            basic              mark beginning of a word
\>               anchor            basic              mark end of a word
(...)            grouping          basic              allow modifiers to act on a group of characters
(... | ...)      grouping          extended           allow alternate patterns to be specified
\                escape (a)        extended (basic)   escape (or enable) special interpretation of the following character.

Notes:
a. When using extended regular expressions, the backslash (usually) strips special interpretation from the following character. Red Hat Enterprise Linux uses GNU extensions when parsing basic regular expressions, however, which use the backslash to enable extendedish interpretation of the following character. For example, the expression “e\{3\}” would match “eee” when using basic regular expressions. shudder-shudder.
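The footnote's point can be verified by running the same two patterns through both syntaxes. This is only a sketch on invented input, and it assumes GNU grep: plain grep uses basic syntax, and grep -E uses extended syntax, the same as egrep.

```shell
# One line with a run of e's, one line with literal braces, one with too few e's.
sample=$(printf '%s\n' 'beeeep' 'e{3}' 'bee')

# Basic syntax: bare braces are ordinary characters ...
basic_plain=$(printf '%s\n' "$sample" | grep 'e{3}')
# ... and backslashes turn them into the repetition modifier.
basic_escaped=$(printf '%s\n' "$sample" | grep 'e\{3\}')

# Extended syntax: bare braces are the repetition modifier ...
extended_plain=$(printf '%s\n' "$sample" | grep -E 'e{3}')
# ... and backslashes turn them back into ordinary characters.
extended_escaped=$(printf '%s\n' "$sample" | grep -E 'e\{3\}')

echo "basic    e{3}    : $basic_plain"
echo "basic    e\\{3\\}  : $basic_escaped"
echo "extended e{3}    : $extended_plain"
echo "extended e\\{3\\}  : $extended_escaped"
```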

Regular Expressions are NOT File Globbing
When first encountering regular expressions, students understandably confuse regular expressions with pathname expansion (file globbing). Both are used to match patterns in text. Both share similar metacharacters (“*”, “?”, “[...]”, etc.). However, they are distinctly different. The following table compares and contrasts regular expressions and file globbing.

Table 3-3. Comparing and Contrasting Regular Expressions and File Globbing

Regular Expressions                              File Globbing
Implemented within search or search and          Implemented by the bash shell for the purpose
replace utilities, such as grep, vi, sed, and    of matching filenames, and to a lesser extent
many scripting languages such as perl,           is found in some applications and scripting
python, etc.                                     languages.

Uses the expression “.*” for stretchable glue.   Uses the expression “*” for stretchable glue.

Uses the expression “.” to match exactly one     Uses the expression “?” to match exactly one
of any character.                                of any character.

In the following example, the first argument is a regular expression, specifying text which starts with an “l” and ends “.conf”, while the second argument is a file glob which specifies all files in the /etc directory whose filename starts with “l” and ends “.conf”.

[student@station student]$ egrep 'l.*\.conf' /etc/l*.conf
/etc/ldap.conf:# @(#)$Id: 087_warning.dbk,v 1.2 2004/01/07 16:39:53 bowe Exp $
/etc/libuser.conf:# Set this only if it differs from the default in /etc/krb5.conf.
/etc/ltrace.conf:; ltrace.conf

Take a close look at the second line of output. Why was it matched by the specified regular expression?

In a similar vein, when specifying regular expressions on the bash command line, care must be taken to quote or escape the regex meta-characters, lest they be expanded away by the bash shell with unexpected results. In all of the examples found in this discussion, the first argument to the egrep command is protected with single quotes for just this reason.
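What the shell does to an unquoted pattern can be seen with nothing more than echo. The sketch below works in a scratch directory created with mktemp, and the two filenames are made up for illustration.

```shell
# Work in a scratch directory so the demonstration touches no real files.
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch lab.conf lib.conf

# Unquoted, the shell replaces the pattern with the matching filenames
# before the command ever sees it.
unquoted=$(echo l*.conf)

# Quoted, the pattern reaches the command unchanged.
quoted=$(echo 'l*.conf')

echo "unquoted: $unquoted"
echo "quoted:   $quoted"

# Clean up.
cd / && rm -r "$scratch"
```

Had egrep been given the unquoted form, it would have received lab.conf as its pattern and lib.conf as the file to search.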

Where to Find More Information About Regular Expressions
We have barely scratched the surface of the usefulness of regular expressions. The explanation we have provided will be adequate for your daily needs, but even so, regular expressions offer much more power, making even complicated text searches simple to perform.

For more online information about regular expressions, you should check:

• The regex(7) manual page.

• The grep(1) manual page.

Examples

Example 1. Literal Searches
Now that we understand regular expressions in more detail, let us revisit some earlier examples and see them in a new light.

Given the file rhyme that contains the text:

[student@station student]$ cat rhyme
The cat sat on the mat at home.

The regular expression “at” matches:

• the at in “cat”; and

• the at in “sat”; and

• the at in “mat”; and

• the at in “at”.

The regular expression “\<at\>” matches only the individual word “at”.



Example 2. Range Expressions
A range expression matches exactly one instance of any one of the characters listed by the range expression.

[student@station student]$ echo bar > file
[student@station student]$ echo car >> file
[student@station student]$ echo far >> file
[student@station student]$ echo are >> file
[student@station student]$ grep '[cf]ar' file
car
far

The range expression “[cf]ar” matches either a c or an f followed by “ar”.

Example 3. REGEX Modifiers
Modifiers control how many occurrences of the preceding regex specifier are matched:

[student@station student]$ echo ac > file
[student@station student]$ echo abc >> file
[student@station student]$ echo abbc >> file
[student@station student]$ echo abbbc >> file

The question mark (?) matches zero or one occurrence of the preceding specifier:

[student@station student]$ egrep 'ab?c' file
ac
abc

The plus sign (+) matches one or more of the preceding specifier:

[student@station student]$ egrep 'ab+c' file
abc
abbc
abbbc

The asterisk (*) matches any number of occurrences, including zero, of the preceding specifier:

[student@station student]$ egrep 'ab*c' file
ac
abc
abbc
abbbc

Example 4. Anchored Searches
Anchored searches are used to match strings only at the beginning or ending of the input line.

[student@station student]$ echo "i am sam" > file
[student@station student]$ echo "sam i am" >> file
[student@station student]$ echo "am i sam" >> file
[student@station student]$ echo "sam" >> file
[student@station student]$ cat file
i am sam
sam i am
am i sam
sam
[student@station student]$ egrep '^sam' file
sam i am
sam
[student@station student]$ egrep 'sam$' file
i am sam
am i sam
sam
[student@station student]$ egrep '^sam$' file
sam

Where ^ and $ anchor to lines, the anchors \< and \> match the beginning and ends of words:

[student@station student]$ egrep '\<am\>' file
i am sam
sam i am
am i sam

Example 5. REGEX Term Grouping
Use parentheses to group several regex specifiers into a single unit. Use the pipe symbol (“|”) to indicate alternatives.

Suppose we are writing a letter. We could write a regular expression to match the greeting line like this:

^Dear (Dr|Mr|Ms)\.

This would match the lines:

Dear Dr. Smith
Dear Mr. Smith
Dear Ms. Smith

but not match a greeting such as:

Dear Miss Smith

We would also fail to match a greeting in which the writer forgot to add the period after the abbreviation, such as “Dear Mr Smith”. This regex pattern would match either way:

^Dear (Dr|Mr|Ms)\.?

whether or not the period was present.



Example 6. Is elvis in the House?
The user blondie would like to create a script which checks to see if someone is defined as a local user on a Linux system. The script takes one argument, which is expected to be a username. She could use the id command to confirm if a user named username existed, but the id command would include users who might be defined by an NIS server, or some other type of network accessible database, instead of on the local machine. She instead decides to examine the local user database (the /etc/passwd file) directly.

She creates the following script.

[blondie@station blondie]$ cat inhouse
#!/bin/bash

if [ "$#" -ne 1 ]; then                  # (1)
    echo "usage: inhouse USERNAME"
    exit 1
fi

if grep -q "^$1:" /etc/passwd; then      # (2)
    echo "$1 is in the house."
else
    echo "$1 is not in the house."
fi

(1) In this stanza, the script ensures it was passed exactly one argument.

(2) This line contains the interesting regular expression. The grep command will look for a line which begins with the argument, trailed by a “:”. Recalling the structure of the /etc/passwd file, usernames satisfy these conditions.

Saving the file, and making it executable, blondie tries out the script on the (existing) user elvis and (non-existing) user barney.

[blondie@station blondie]$ mv inhouse bin/
[blondie@station blondie]$ chmod a+x bin/inhouse
[blondie@station blondie]$ inhouse elvis
elvis is in the house.
[blondie@station blondie]$ inhouse barney
barney is not in the house.
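Why the script bothers with both the “^” and the “:” becomes clearer when a similar username is present. The following sketch does not touch the real /etc/passwd. Instead it builds a three-line file in the same format, with an invented user melvis, and wraps the same grep test in a hypothetical helper function named is_listed. Like the script, the helper assumes the username contains no regex meta-characters.

```shell
# A miniature user database in /etc/passwd format (hypothetical entries).
passwd_sample=$(mktemp)
trap 'rm -f "$passwd_sample"' EXIT
cat > "$passwd_sample" <<'EOF'
root:x:0:0:root:/root:/bin/bash
elvis:x:501:501:Elvis:/home/elvis:/bin/bash
melvis:x:502:502:Melvis:/home/melvis:/bin/bash
EOF

# is_listed NAME: succeed if NAME is the complete first field of some line.
is_listed() {
    grep -q "^$1:" "$passwd_sample"
}

# Count lines that mention "elvis" anywhere, and lines where it is the username.
anywhere=$(grep -c 'elvis' "$passwd_sample")
anchored=$(grep -c '^elvis:' "$passwd_sample")
echo "unanchored: $anywhere line(s), anchored: $anchored line(s)"

is_listed elvis && echo "elvis is in the house."
is_listed elv || echo "elv is not in the house."
```

The unanchored search also counts the melvis line, and without the trailing “:” the partial name elv would be reported as a user.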

Example 7. Searching for Telephone Numbers
The combination of regular expressions and the grep command creates a powerful tool for extracting desired nuggets from large amounts of information. In the following, elvis recalls noticing a phone number somewhere within the /usr/share/doc directory, but he has no recollection where. He begins a process of searching for all phone numbers within the /usr/share/doc directory (which in this case contains nearly 12000 files).

He begins by noting the fact that all United States phone numbers have at least 7 digits, conventionally written with the first three digits separated from the last four with either a “-” or a space, such as 555-1212 or 555 1212. He starts by recursively searching through all files in the /usr/share/doc directory for such a pattern.

[elvis@station doc]$ egrep -r '[[:digit:]]{3}(-| )[[:digit:]]{4}' .
./hwdata-0.75/COPYING: 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
./hwdata-0.75/COPYING: Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
./libart_lgpl-2.3.11/COPYING: 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
...

After observing the first few lines, elvis realizes that his regex pattern is too general. He is matching zip codes as well as phone numbers. He refines his pattern, specifying that any preceding character or trailing character must not be a number.

[elvis@station doc]$ egrep -r '[^[:digit:]][[:digit:]]{3}(-| )[[:digit:]]{4}[^[:digit:]]' .
./libjpeg-6b/README:642-4900, or from Global Engineering Documents at (800) 854-7179. (ANSI
./libjpeg-6b/README: phone (408) 944-6300, fax (408) 944-6314
./bash-2.05b/article.ms:\f([email protected]\fP or call \f(CR+1-617-876-3296\fP
./esound-0.2.28/esound.ps:7 w(5)p Black 0 TeXcolorgray 795 1077 a Fm(esd)p Black
./esound-0.2.28/esound.ps:7 w(5)p Black 0 TeXcolorgray 795 1168 a(esdctl)p Black
./esound-0.2.28/esound.ps:0 TeXcolorgray 596 1554 a Fk(4.)20 b(Miscellaneous)d(Information)p
./esound-0.2.28/esound.ps:7 w(9)p Black 0 TeXcolorgray 795 1665 a Fm(New)k(Featur)o(es)p
...

This time, elvis’s search proceeds much better, until he hits the file esound.ps. This file contains PostScript, which routinely uses numbers written in ASCII text to specify coordinates. Knowing that he was not examining a PostScript file, elvis devises a way to exclude all files that end with the .ps extension from his search. He first uses the find command to list every file in the directory. He next greps the output down to all files that do not end .ps. He then uses the xargs command to feed these filenames into his original grep command as arguments. Because his files are now being specified individually as command line arguments, he no longer needs to grep recursively.

[elvis@station doc]$ find . | egrep -v '\.ps$' | xargs egrep '[^[:digit:]][[:digit:]]{3}(-| )[[:digit:]]{4}[^[:digit:]]'
./libjpeg-6b/README:642-4900, or from Global Engineering Documents at (800) 854-7179. (ANSI
./libjpeg-6b/README: phone (408) 944-6300, fax (408) 944-6314
./bash-2.05b/article.ms:\f([email protected]\fP or call \f(CR+1-617-876-3296\fP
./gawk-3.1.1/README_d/README.solaris:# P.O. Box 354 Home Phone: +972 8 979-0381 Fax: +1 603 761-6761
./gawk-3.1.1/README_d/README.solaris:Columbus, Ohio 43210-1174 1-614-292-5310 (Office/Answering Device)
...

After observing a few more lines of output, elvis realizes he should also exclude files that end .fig and .pdf from his search, as they also contain many ASCII numbers and are cluttering his output. Modifying his regular expression in his first grep command, he repeats his search.

[elvis@station doc]$ find . | egrep -v '\.(ps|fig|pdf)$' | xargs egrep '[^[:digit:]][[:digit:]]{3}(-| )[[:digit:]]{4}[^[:digit:]]'
...

Now that the search seems to be going well, elvis revises the output formatting, asking grep to not display filenames, and give 2 lines of context around each phone number.

[elvis@station doc]$ find . | egrep -v '\.(ps|pdf|fig)$' | xargs egrep -h -C2 '[^[:digit:]][[:digit:]]{3}(-| )[[:digit:]]{4}[^[:digit:]]'
it's much cheaper and includes a great deal of useful explanatory material.)
In the USA, copies of the standard may be ordered from ANSI Sales at (212)
642-4900, or from Global Engineering Documents at (800) 854-7179. (ANSI
doesn't take credit card orders, but Global does.) It's not cheap: as of
1992, ANSI was charging $95 for Part 1 and $47 for Part 2, plus 7%
--
1778 McCarthy Blvd.
Milpitas, CA 95035
phone (408) 944-6300, fax (408) 944-6314
A PostScript version of this document is available by FTP at
ftp://ftp.uu.net/graphics/jpeg/jfif.ps.gz. There is also a plain text
--
The Free Software Foundation sells tapes and CD-ROMs
containing Bash; send electronic mail to
\f([email protected]\fP or call \f(CR+1-617-876-3296\fP
for more information.
.PP
--
# --
# Aharon (Arnold) Jones [email protected] [ <<=== NOTE: NEW ADDRESS!! ]
# P.O. Box 354 Home Phone: +972 8 989-0381 Fax: +1 603 761-6761
# Nof Ayalon Cell Phone: +972 51 227-545 (See www.efax.com)
# D.N. Shimshon 97784 Laundry increases exponentially in the
--
The Ohio State University http://www.math.ohio-state.edu/~nevai/
231 West Eighteenth Avenue http://www.math.ohio-state.edu/~jat/
Columbus, Ohio 43210-1174 1-614-292-5310 (Office/Answering Device)
The United States of America 1-614-292-1479 (Math Dept Fax)
--
,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,
Joe Farwell | phone 610-843-6020 | Platinum technology
Systems Administrator | vmail 800-123-9096 x7512 | 620 W. Germantown [email protected] | fax 610-872-6021 | Plymouth Meeting,Pa,19462
'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'`^`'~*-,._.,-*~'
delay needs to be calibrated using outside sources.
...

Note that names and numbers have been altered in this output.

All told, elvis ends up with 289 "hits", which he can skim in a reasonable amount of time.
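One weakness of piping find into xargs is that filenames containing spaces are split into several arguments. The following is a sketch of one common workaround, not part of elvis's session: it assumes versions of find and xargs that support the -print0 and -0 switches (the GNU and BSD versions do), and it lets find itself skip the .ps files. The scratch directory and its three files are made up for illustration.

```shell
# Build a scratch directory with three small sample files.
scratch=$(mktemp -d)
trap 'rm -rf "$scratch"' EXIT
printf 'call 555-1212 today\n'   > "$scratch/read me.txt"
printf '795 1077 a 555 1212 b\n' > "$scratch/figure.ps"
printf 'zip 02111-1307 USA\n'    > "$scratch/COPYING"

pattern='[^[:digit:]][[:digit:]]{3}(-| )[[:digit:]]{4}[^[:digit:]]'

# -type f skips directories, ! -name '*.ps' skips PostScript files, and
# -print0 with xargs -0 passes each filename intact, spaces and all.
# grep -l lists only the names of the files that contain a match.
hits=$(find "$scratch" -type f ! -name '*.ps' -print0 |
       xargs -0 grep -lE "$pattern")

echo "$hits"
```

Only the file named “read me.txt” should be listed: the PostScript file is never searched, and the zip code fails the surrounding non-digit tests.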



Online Exercises

Lab Exercise
Objective: Use regular expressions to search for patterns of text.


Specification

1. Create a short executable bash script named ~/bin/ispython, which expects a single argument, which is a filename. If the supplied filename’s first line is exactly “#!/usr/bin/python” (nothing more, nothing less), the script should print the number 1. Otherwise, the script should print the number 0.

2. You are looking for files in the /etc directory (but not subdirectories) that contain a standard United States long distance phone number, written using the pattern of 1-###-###-####, where each # is replaced with a numeric digit. Collect the filenames of every file in the /etc directory which contains such a pattern of numbers, and place them in the file ~/etcphone.txt, one file name per line, sorted alphabetically, using absolute references.

3. The file /usr/share/doc/bash-*/NEWS contains many itemized lists, with list items marked by lines whose first characters are a series of one or more letters, followed by a period and space, as in the following:

y. New prompting expansions: \a, \e, \H, \T, \@, \v, \V.

z. Variable expansion in prompt strings is now controllable via a shell
option (shopt prompt vars).

aa. Bash now defaults to using command-oriented history.

bb. The history file ($HISTFILE) is now truncated to $HISTFILESIZE after
being written.

Create the following files, each of which contains the number which answers the specified question as its single word.

filename          question
newsitems.txt     How many lines begin with a series of one or more letters, followed by a period?
newsitems23.txt   How many lines begin with a series of two or three letters, followed by a period?
newsitems2.txt    How many lines begin with a series of exactly two letters, followed by a period?
newsitems3.txt    How many lines begin with a series of exactly three letters, followed by a period?

4. The file /usr/share/dict/words contains a collection of common dictionary words, stored one per line. Both common words and proper names are included, each appropriately capitalized.

Using only the egrep command, determine which words start with a capital letter followed only by vowels. Do not include single letter words. (For the purposes of this exercise, consider vowels as only the letters A, E, I, O, or U, both uppercase and lowercase.)

List these words, one per line and sorted alphabetically, in the file ~/vowel2.txt.

Deliverables

1. A script called ~/bin/ispython, which, when executed with a single filename as an argument, will print 1 if the specified file’s first line is exactly #!/usr/bin/python. Otherwise, the script should print 0 (hint: This can be accomplished by combining the head and grep commands).

2. The file ~/etcphone.txt, which contains a list of all files in the /etc directory (but not subdirectories) which contain the pattern 1-###-###-####, where each # is replaced by a numeric digit. The files should be listed as absolute references, one per line, alphabetized.

3. The files ~/newsitems.txt, ~/newsitems23.txt, ~/newsitems2.txt, and ~/newsitems3.txt, each of which contains a single number as its only word. The number should be the answer to the respective question about the file /usr/share/doc/bash-*/NEWS in the table above.

4. The file ~/vowel2.txt, which contains an alphabetically sorted list of all words from /usr/share/dict/words which start with a capital letter followed only by vowels. (Exclude single letter words.)

Questions
In all of the following questions, the term regular expression implies extended regular expression syntax.

1. Which of the following characters is a regular expression literal character?

( ) a. ?

( ) b. _

( ) c. *

( ) d. $

( ) e. None of the above

2. Which of the following regular expressions would match the word bookkeeper?

( ) a. o+ke+

( ) b. o{2}[ke]{4}

( ) c. ^o+k+e+

( ) d. o+.*e+$


3. Which of the following regular expressions would match United States 5 digit or 9 digit zip codes, which have the pattern of ##### or #####-#### respectively, with each # replaced with a numeric digit?

( ) a. [[:digit:]]-{5|4}

( ) b. [[:digit:]]{5}(-[[:digit:]]{4})?

( ) c. [[:digit:]]-{5}[[:digit:]]{4}?

( ) d. [[:digit:]]{9}[-[[:digit:]]]{4}?


4. Which of the following regular expressions would only match a line that contains entirely capital letters, spaces, and tabs, regardless of the current character set?

( ) a. ^[A-Z]*$

( ) b. ^[A-Z[:blank:]]*$

( ) c. ^[[:upper:][:blank:]]*$


( ) e. B and C

Use the following transcript to answer the next 4 questions.

[student@station student]$ cat /etc/crontab
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
HOME=/

# run-parts
01 * * * * root run-parts /etc/cron.hourly
02 4 * * * root run-parts /etc/cron.daily
22 4 * * 0 root run-parts /etc/cron.weekly
42 4 1 * * root run-parts /etc/cron.monthly


5. Which of the following commands would print the first 4 lines only from the file /etc/crontab?

( ) a. egrep '$[[:upper:]]' /etc/crontab

( ) b. egrep '^[[:upper:]]' /etc/crontab

( ) c. egrep '[[:upper:]]$' /etc/crontab

( ) d. egrep '[^[:upper:]]' /etc/crontab


6. Which of the following commands would print the last 4 lines only from the file /etc/crontab?

( ) a. egrep '.*' /etc/crontab

( ) b. egrep '\*' /etc/crontab

( ) c. egrep '*' /etc/crontab

( ) d. egrep '*{1}' /etc/crontab


7. Which of the following commands would print the last 2 lines only from the file /etc/crontab?

( ) a. egrep 'cron.[weekly|monthly]' /etc/crontab

( ) b. egrep 'cron.(weekly|monthly)' /etc/crontab

( ) c. egrep 'cron.{weekly|monthly}' /etc/crontab

( ) d. egrep 'cron.(weekly|monthly)?' /etc/crontab


8. Which of the following would print only the line that contains the filename /etc/cron.hourly from the file /etc/crontab?

( ) a. egrep '(\* ){4}' /etc/crontab

( ) b. egrep '^0[13579]' /etc/crontab

( ) c. egrep '^[01[:punct:][:blank:]]{10}' /etc/crontab


( ) e. A and B only




9. Which of the following regular expressions would match 3.14159?

( ) a. 3.14159

( ) b. 3\.14[:digit:]+

( ) c. [[:digit:]\.]{7}


( ) e. A and C only

The following is extracted from the procmailrc(5) man page. Ignore the line break between the words X-Envelope and Apparently.

(^((Original-)?(Resent-)?(To|Cc|Bcc)|(X-Envelope|Apparently(-Resent)?)-To):(.*[^-a-zA-Z0-9_.])?)

10. Which of the following lines would match the regular expression?

( ) a. Resent-Cc: [email protected]

( ) b. Original-Resent-Bcc: [email protected]

( ) c. To: [email protected]


( ) e. A and C only

(It could have been worse... the following regular expression is also found in the procmailrc(5) man page.)

(^(Mailing-List:|Precedence:.*(junk|bulk|list)|To: Multiple recipients of |(((Resent-)?(From|Sender)|X-Envelope-From):|>?From)([^>]*[^(.%@a-z0-9])?(Post(ma?(st(e?r)?|n)|office)|(send)?Mail(er)?daemon|m(mdf|ajordomo)|n?uucp|LIST(SERV|proc)|NETSERV|o(wner|ps)|r(e(quest|sponse)|oot)|b(ounce|bs\.smtp)|echo|mirror|s(erv(ices?|er)|mtp(error)?|ystem)|A(dmin(istrator)?|MMGR|utoanswer))(([^).!:a-z0-9][-_a-z0-9]*)?[%@>\t ][^<)]*($.*$.*)?)?$([^>]|$)))


Chapter 4. Everything Sorting: sort and uniq

Key Concepts
• The sort command sorts data alphabetically.

• sort -n sorts numerically.

• sort -u sorts and removes duplicates.

• sort -k and -t sort on a specific field in patterned data.

Discussion
In previous Workbooks, we have introduced the sort command in its simplest form: a tool for arranging the lines of a file or output from a command alphabetically. This Lesson will present the sort command in more detail.

The sort Command

Basic Sorting
Sorting is the process of arranging records into a specified sequence. Examples of sorting would be arranging a list of usernames into alphabetical order, or a set of file sizes into numeric order.

In its simplest form, the sort command will alphabetically sort lines (including any whitespace or control characters which are encountered). The sort command uses the local locale (language definition) to determine the order of the characters (referred to as the collating order). In the following example, madonna first displays the contents of the file /etc/sysconfig/mouse as is, and then sorts the contents of the file alphabetically.

[madonna@station madonna]$ cat /etc/sysconfig/mouse
FULLNAME="Generic - 2 Button Mouse (PS/2)"
MOUSETYPE="ps/2"
XEMU3="yes"
XMOUSETYPE="PS/2"
DEVICE=/dev/psaux
[madonna@station madonna]$ sort /etc/sysconfig/mouse
DEVICE=/dev/psaux
FULLNAME="Generic - 2 Button Mouse (PS/2)"
MOUSETYPE="ps/2"
XEMU3="yes"
XMOUSETYPE="PS/2"

If called with arguments, the arguments are interpreted as (possibly multiple) filenames to be sorted. If called without arguments, the sort command will sort whatever it reads from standard in.
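Both ways of feeding the sort command can be tried with a few invented words. In this sketch the scratch directory and the files a and b are made up for illustration. Note that when several filenames are given, their lines are sorted together as one list.

```shell
# With no filename argument, sort reads standard in.
from_stdin=$(printf '%s\n' pear apple mango | sort)

# With several filenames, the lines of all the files are sorted together.
scratch=$(mktemp -d)
trap 'rm -rf "$scratch"' EXIT
printf '%s\n' pear apple > "$scratch/a"
printf '%s\n' mango fig  > "$scratch/b"
from_files=$(sort "$scratch/a" "$scratch/b")

printf 'from standard in:\n%s\n' "$from_stdin"
printf 'from two files:\n%s\n' "$from_files"
```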



Modifying the Sort Order
By default, the sort command sorts lines alphabetically. The following table lists command line switches which can be used to modify this default sort order.

Table 4-1. Command Line Switches for Specifying Sort Order

Switch                        Effect
-b, --ignore-leading-blanks   Ignore spaces and tabs at the beginning of a line.
-d, --dictionary-order        Consider only blanks and alphanumeric characters.
-f, --ignore-case             Treat all characters as uppercase.
-g, --general-numeric-sort    Compare words as floating point numbers.
-n, --numeric-sort            Compare words as integers.
-r, --reverse                 Sort in descending rather than ascending order.

As an example, madonna is examining the file sizes of all files that start with an m in the /var/log directory.

[madonna@station madonna]$ ls -s1 /var/log/m*
  20 /var/log/maillog
3104 /var/log/maillog.1
1552 /var/log/maillog.2
1952 /var/log/maillog.3
1236 /var/log/maillog.4
   4 /var/log/messages
 384 /var/log/messages.1
 636 /var/log/messages.2
 216 /var/log/messages.3
 560 /var/log/messages.4

She next sorts the output with the sort command.

[madonna@station madonna]$ ls -s /var/log/m* | sort
1236 /var/log/maillog.4
1552 /var/log/maillog.2
1952 /var/log/maillog.3
  20 /var/log/maillog
 216 /var/log/messages.3
3104 /var/log/maillog.1
 384 /var/log/messages.1
   4 /var/log/messages
 560 /var/log/messages.4
 636 /var/log/messages.2

Without being told otherwise, the sort command sorted the lines alphabetically (with 1952 coming before 20). Realizing this is not what she intended, madonna adds the -n command line switch.

[madonna@station madonna]$ ls -s /var/log/m* | sort -n
   4 /var/log/messages
  20 /var/log/maillog
 216 /var/log/messages.3
 384 /var/log/messages.1
 560 /var/log/messages.4
 636 /var/log/messages.2
1236 /var/log/maillog.4
1552 /var/log/maillog.2
1952 /var/log/maillog.3
3104 /var/log/maillog.1

Better, but madonna would prefer to reverse the sort order, so that the largest files come first. She adds the -r command line switch.

[madonna@station madonna]$ ls -s /var/log/m* | sort -nr
3104 /var/log/maillog.1
1952 /var/log/maillog.3
1552 /var/log/maillog.2
1236 /var/log/maillog.4
 636 /var/log/messages.2
 560 /var/log/messages.4
 384 /var/log/messages.1
 216 /var/log/messages.3
  20 /var/log/maillog
   4 /var/log/messages

Why ls -1?: Why was the -1 command line switch given to the ls command in the first example, but not the others? By default, when the ls command is using a terminal for standard out, it will group the filenames in multiple columns for easy readability. When the ls command is using a pipe or file for standard out, however, it will print the files one file per line. The -1 command line switch forces this behavior for terminal output as well.

Specifying Sort Keys

In the previous examples, the sort command performed its sort based on the first characters found on a line. Often, formatted data is not arranged so conveniently. Fortunately, the sort command allows users to specify which column of tabular data to use for determining the sort order, or, more formally, which column should be used as the sort key.

The following table of command line switches can be used to determine the sort key.

Table 4-2. Command Line Switches for Specifying Sort Keys

Switch                       Effect
-k, --key=POS                Use the key at POS to determine sort order.
-t, --field-separator=SEP    Use the character(s) SEP to separate fields (instead of simply whitespace).

Sorting Output by a Particular Column

As an example, suppose madonna wanted to reexamine her log files, using the long format of the ls command. She tries simply sorting her output numerically.

[madonna@station madonna]$ ls -l /var/log/m* | sort -n
-rw------- 1 root root 1260041 Sep 14 04:05 /var/log/maillog.4
-rw------- 1 root root 1581750 Sep 28 06:15 /var/log/maillog.2
-rw------- 1 root root 1993522 Sep 22 10:16 /var/log/maillog.3
-rw------- 1 root root  216885 Sep 22 10:22 /var/log/messages.3
-rw------- 1 root root   31187 Oct  5 06:05 /var/log/maillog
-rw------- 1 root root 3172217 Oct  5 04:05 /var/log/maillog.1
-rw------- 1 root root  387345 Oct  5 04:07 /var/log/messages.1
-rw------- 1 root root  567049 Sep 14 04:08 /var/log/messages.4
-rw------- 1 root root  644859 Sep 28 06:22 /var/log/messages.2
-rw------- 1 root root     651 Oct  5 05:40 /var/log/messages

Now that the sizes are no longer reported at the beginning of the line, she has difficulty. Instead, she repeats her sort using the -k command line switch to sort her output by the 5th column, producing the desired output.

[madonna@station madonna]$ ls -l /var/log/m* | sort -n -k5
-rw------- 1 root root     651 Oct  5 05:40 /var/log/messages
-rw------- 1 root root   31187 Oct  5 06:05 /var/log/maillog
-rw------- 1 root root  216885 Sep 22 10:22 /var/log/messages.3
-rw------- 1 root root  387345 Oct  5 04:07 /var/log/messages.1
-rw------- 1 root root  567049 Sep 14 04:08 /var/log/messages.4
-rw------- 1 root root  644859 Sep 28 06:22 /var/log/messages.2
-rw------- 1 root root 1260041 Sep 14 04:05 /var/log/maillog.4
-rw------- 1 root root 1581750 Sep 28 06:15 /var/log/maillog.2
-rw------- 1 root root 1993522 Sep 22 10:16 /var/log/maillog.3
-rw------- 1 root root 3172217 Oct  5 04:05 /var/log/maillog.1

Specifying Multiple Sort Keys

Next, madonna is examining the file /etc/fdprm, which tables low level formatting parameters for floppy drives. She uses the grep command to extract the data from the file, stripping away comments and blank lines.

[madonna@station madonna]$ grep "^[[:alnum:]]" /etc/fdprm
360/360 720 9 2 40 0 0x2A 0x02 0xDF 0x50
1200/1200 2400 15 2 80 0 0x1B 0x00 0xDF 0x54
360/720 720 9 2 40 1 0x2A 0x02 0xDF 0x50
720/720 1440 9 2 80 0 0x2A 0x02 0xDF 0x50
720/1440 1440 9 2 80 0 0x2A 0x02 0xDF 0x50
360/1200 720 9 2 40 1 0x23 0x01 0xDF 0x50
720/1200 1440 9 2 80 0 0x23 0x01 0xDF 0x50
1440/1440 2880 18 2 80 0 0x1B 0x00 0xCF 0x6C
1440/1200 2880 18 2 80 0 ???? ???? ???? ???? # ?????
1680/1440 3360 21 2 80 0 0x0C 0x00 0xCF 0x6C # ?????
cbm1581 1600 10 2 80 2 0x2A 0x02 0xDF 0x2E
800/720 1600 10 2 80 0 0x2A 0x02 0xDF 0x2E

She next sorts the data numerically, using the 5th column as her key.


[madonna@station madonna]$ grep "^[[:alnum:]]" /etc/fdprm | sort -n -k5
360/1200 720 9 2 40 1 0x23 0x01 0xDF 0x50
360/360 720 9 2 40 0 0x2A 0x02 0xDF 0x50
360/720 720 9 2 40 1 0x2A 0x02 0xDF 0x50
1200/1200 2400 15 2 80 0 0x1B 0x00 0xDF 0x54
1440/1200 2880 18 2 80 0 ???? ???? ???? ???? # ?????
1440/1440 2880 18 2 80 0 0x1B 0x00 0xCF 0x6C
1680/1440 3360 21 2 80 0 0x0C 0x00 0xCF 0x6C # ?????
720/1200 1440 9 2 80 0 0x23 0x01 0xDF 0x50
720/1440 1440 9 2 80 0 0x2A 0x02 0xDF 0x50
720/720 1440 9 2 80 0 0x2A 0x02 0xDF 0x50
800/720 1600 10 2 80 0 0x2A 0x02 0xDF 0x2E
cbm1581 1600 10 2 80 2 0x2A 0x02 0xDF 0x2E

Her data is successfully sorted using the 5th column, with the formats specifying 40 tracks grouped at the top, and 80 tracks grouped at the bottom. Within these groups, however, she would like to sort the data by the 3rd column. She adds an additional -k command line switch to the sort command, specifying the third column as her secondary key.

[madonna@station madonna]$ grep "^[[:alnum:]]" /etc/fdprm | sort -n -k5 -k3
360/1200 720 9 2 40 1 0x23 0x01 0xDF 0x50
360/360 720 9 2 40 0 0x2A 0x02 0xDF 0x50
360/720 720 9 2 40 1 0x2A 0x02 0xDF 0x50
720/1200 1440 9 2 80 0 0x23 0x01 0xDF 0x50
720/1440 1440 9 2 80 0 0x2A 0x02 0xDF 0x50
720/720 1440 9 2 80 0 0x2A 0x02 0xDF 0x50
800/720 1600 10 2 80 0 0x2A 0x02 0xDF 0x2E
cbm1581 1600 10 2 80 2 0x2A 0x02 0xDF 0x2E
1200/1200 2400 15 2 80 0 0x1B 0x00 0xDF 0x54
1440/1200 2880 18 2 80 0 ???? ???? ???? ???? # ?????
1440/1440 2880 18 2 80 0 0x1B 0x00 0xCF 0x6C
1680/1440 3360 21 2 80 0 0x0C 0x00 0xCF 0x6C # ?????

Now the data has been sorted primarily by the fifth column. For rows with identical fifth columns, the third column has been used to determine the final order. An arbitrary number of keys can be specified by adding more -k command line switches.
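A -k switch written as a single number, such as -k5, names a key that starts at the fifth field and runs to the end of the line. The sort(1) man page also describes a start,stop form that can carry its own options. The sketch below compares the two spellings on made-up five-column rows; the row labels a through e are hypothetical, and LC_ALL=C is again only an assumption for reproducibility.

```shell
# Hypothetical rows: label, then four numeric columns.
rows='a 720 9 2 40
b 2400 15 2 80
c 1440 9 2 80
d 720 9 2 40
e 1600 10 2 80'

# Primary key: 5th column. Secondary key: 3rd column. Both numeric.
by_two_keys=$(printf '%s\n' "$rows" | LC_ALL=C sort -n -k5 -k3)

# The same request with each key limited to a single field (start,stop)
# and the numeric option attached to the key itself.
explicit=$(printf '%s\n' "$rows" | LC_ALL=C sort -k5,5n -k3,3n)

printf '%s\n' "$by_two_keys"
```

For data like this the two forms agree. The start,stop form mainly matters when the columns that follow the key should not influence the comparison.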

Specifying the Field Separator

The above examples have demonstrated how to sort data using a specified field as the sort key. In all of the examples, fields were separated by whitespace (i.e., a series of spaces and/or tabs). Often in Linux (and Unix), some other method is used to separate fields. Consider, for example, the /etc/passwd file.

[madonna@station madonna]$ head /etc/passwd
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
news:x:9:13:news:/etc/news:

The lines are structured into seven fields each, but the fields are separated using a “:” instead of whitespace. With the -t command line switch, the sort command can be instructed to use some specified character (such as a “:”) to separate fields.

In the following, madonna uses the sort command with the -t command line switch to sort the first 10 lines of the /etc/passwd file by home directory (the 6th field).

[madonna@station madonna]$ head /etc/passwd | sort -t: -k6
bin:x:1:1:bin:/bin:/sbin/nologin
news:x:9:13:news:/etc/news:
root:x:0:0:root:/root:/bin/bash
sync:x:5:0:sync:/sbin:/bin/sync
halt:x:7:0:halt:/sbin:/sbin/halt
daemon:x:2:2:daemon:/sbin:/sbin/nologin
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin

The user bin, with a home directory of /bin, is now at the top, and the user mail, with a home directory of /var/spool/mail, is at the bottom.
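The same technique can be tried without touching the real /etc/passwd file. The following sketch assumes four lines copied from the listing above into a shell variable, and uses the start,stop key form from the sort(1) man page so that only the named field is compared; LC_ALL=C is an assumption for reproducible ordering.

```shell
# Four sample lines in /etc/passwd format, taken from the listing above.
passwd_sample='root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin'

# Sort alphabetically by home directory (6th ":"-separated field only).
by_home=$(printf '%s\n' "$passwd_sample" | LC_ALL=C sort -t: -k6,6)

# Sort numerically by user ID (3rd ":"-separated field only).
by_uid=$(printf '%s\n' "$passwd_sample" | LC_ALL=C sort -t: -k3,3n)

printf '%s\n' "$by_home" "--" "$by_uid"
```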

Summary

In summary, we have seen that the sort command can be used to sort structured data, using the -k command line switch to specify the sort field (perhaps more than once), and the -t command line switch to specify the field delimiter.

The -k command line switch can receive more sophisticated arguments, which serve to specify character positions within a field, or customize sort options for individual fields. See the sort(1) man page for details.

The uniq Command

The uniq program is used to identify, count, or remove duplicate records in sorted data. If given command line arguments, they are interpreted as filenames for files on which to operate. If no arguments are provided, the uniq command operates on standard in. Because the uniq command only detects duplicates on adjacent lines, it works as intended only on already sorted data, and is almost always used in conjunction with the sort command.

The uniq command uses the following command line switches to qualify its behavior.

Table 4-3. Command Line Arguments For uniq

-c, --count            Prefix line with the number of its occurrences; this is the length of the “run”.
-d, --repeated         Print only duplicated lines.
-f, --skip-fields=n    Avoid comparing the first n fields; fields are delimited by whitespace.
-i, --ignore-case      Ignore case.
-s, --skip-chars=n     Skip the first n characters.
-u, --unique           Print only unique lines.
-w, --check-chars=n    Compare no more than n characters in each line.

In order to understand the uniq command’s behavior, we need repetitive data on which to operate. The following python script simulates the rolling of three six sided dice 100 times, writing the sum of each roll on its own line. The user madonna makes the script executable, and then records the output in a file called trial1.

[madonna@station madonna]$ cat three_dice.py
#!/usr/bin/python

from random import randint
for i in range(100):
    print randint(1,6)+randint(1,6)+randint(1,6)
[madonna@station madonna]$ chmod 755 three_dice.py
[madonna@station madonna]$ ./three_dice.py > trial1
[madonna@station madonna]$ wc trial1
    100     100     260 trial1
[madonna@station madonna]$ head trial1
10
10
10
13
8
8
10
10
8
6

Reducing Data to Unique Entries

Now, madonna would like to analyze the data. She begins by sorting the data and piping the output through the uniq command.

[madonna@station madonna]$ sort -n trial1 | uniq
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18

Without any command line switches, the uniq command has removed duplicate entries, reducing the data from 100 lines to only 15. Easily, madonna sees that the data looks reasonable: every possible sum for three six sided dice is represented, with the exception of 3. Because only one combination of the dice would yield a sum of 3 (all ones), she expects it to be a relatively rare occurrence.
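Why the data is sorted first can be seen on a handful of values. This is a small sketch with made-up sums, not madonna's trial1 file: uniq collapses only runs of identical adjacent lines, so unsorted input keeps duplicates that are separated by other values.

```shell
# Hypothetical dice sums, deliberately unsorted.
rolls='10
8
10
10
8'

# Without sorting, only the adjacent pair of 10s is collapsed.
unsorted=$(printf '%s\n' "$rolls" | uniq)

# After sorting, equal values are adjacent, so each appears once.
sorted=$(printf '%s\n' "$rolls" | sort -n | uniq)

printf '%s\n' "$unsorted" "--" "$sorted"
```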

Counting Instances of Data

A particularly convenient command line switch for the uniq command is -c, or --count. This causes the uniq command to count the number of occurrences of a particular record, prepending the result to the record on output.

In the following example, madonna uses the uniq command to reproduce its previous output, this time prepending the number of occurrences of each entry in the file.

[madonna@station madonna]$ sort -n trial1 | uniq -c
      1 4
      4 5
      6 6
     10 7
     10 8
     13 9
     13 10
      9 11
     13 12
      4 13
      8 14
      4 15
      1 16
      2 17
      2 18

As would be expected (by a statistician, at least), the largest and smallest numbers have relatively few occurrences, while the intermediate numbers occur more often. The first column can be summed to 100 to confirm that the uniq command identified every occurrence.
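The check that the counts add up can itself be done on the command line. The sketch below is one possible way, under two assumptions: the input is a short made-up list rather than trial1, and the awk command (not otherwise covered in this lesson) is used to add up the first column.

```shell
# Hypothetical dice sums, far fewer than the 100 lines in trial1.
sums='7
9
7
12
9
7'

# Count each distinct value; uniq -c puts the count in the first column.
counts=$(printf '%s\n' "$sums" | sort -n | uniq -c)

# Add up the first column; the total should equal the number of input lines.
total=$(printf '%s\n' "$counts" | awk '{ sum += $1 } END { print sum }')

printf '%s\n' "$counts"
echo "total: $total"
```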

Identifying Unique or Repeated Data with uniq

Sometimes, people are just interested in identifying unique or repeated data. The -d and -u command line switches allow the uniq command to do just that. In the first case, madonna identifies the dice sums that occur only once. In the second case, she identifies sums that are repeated at least once.

[madonna@station madonna]$ sort -n trial1 | uniq -u
4
16
[madonna@station madonna]$ sort -n trial1 | uniq -d
5
6
7
8
9
10
11
12
13
14
15
17
18
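The two switches split the sorted data into complementary sets, which is easier to see on a short list. This is a minimal sketch using made-up values, not the trial1 file.

```shell
# Hypothetical dice sums: 4 and 16 occur once, 5 and 6 are repeated.
sums='4
5
5
6
6
6
16'

once=$(printf '%s\n' "$sums" | sort -n | uniq -u)      # values seen exactly once
repeated=$(printf '%s\n' "$sums" | sort -n | uniq -d)  # values seen more than once

printf '%s\n' "$once" "--" "$repeated"
```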

Examples

Example 1. Sorting the Output of ps aux

The user madonna is examining the processes running on her local machine. She is familiar with the ps aux command, which tables information about every running process.

[madonna@station madonna]$ ps aux | head -4
USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0  1380   76 ?        S    02:05   0:04 init [
root         2  0.0  0.0     0    0 ?        SW   02:05   0:00 [keventd]
root         3  0.0  0.0     0    0 ?        SW   02:05   0:00 [kapmd]

The following table identifies the selected columns.

Table 4-4. Selected Columns from the ps aux Command

Column Number   Title   Role
1               USER    The user that owns the process.
2               PID     The process ID of the process.
3               %CPU    The relative CPU utilization of the process.
4               %MEM    The relative memory utilization of the process.
5               VSZ     The "virtual size" of the process, or how much memory the process has requested.
6               RSS     The "resident size" of the process, or how much actual memory it is consuming.

The user madonna would like to order processes in terms of some of these parameters. She first orders the processes by their virtual memory size, sorting numerically, and listing in descending order. Notice the use of the tail +2 command to remove the header from the list of processes. (Newer versions of the tail command expect this to be spelled tail -n +2.)

[madonna@station madonna]$ ps aux | tail +2 | sort -rn -k5 | head
gdm       1074  0.0  5.2 37828 13396 ?        S    02:06   0:01 /usr/bin/gdmgreeter
madonna   1844  0.0  0.2 19436  632 pts/0    S    03:42   0:00 sort -rn -k5
apache     909  0.0  0.5 18320 1464 ?        S    02:06   0:00 /usr/sbin/httpd
apache     908  0.0  0.5 18320 1464 ?        S    02:06   0:00 /usr/sbin/httpd
apache     907  0.0  0.5 18320 1464 ?        S    02:06   0:00 /usr/sbin/httpd
apache     906  0.0  0.5 18320 1464 ?        S    02:06   0:00 /usr/sbin/httpd
apache     905  0.0  0.5 18320 1464 ?        S    02:06   0:00 /usr/sbin/httpd
apache     904  0.0  0.5 18320 1464 ?        S    02:06   0:00 /usr/sbin/httpd
apache     903  0.0  0.5 18320 1464 ?        S    02:06   0:00 /usr/sbin/httpd
apache     902  0.0  0.5 18320 1468 ?        S    02:06   0:00 /usr/sbin/httpd

The gdmgreeter (which manages logins for the X graphical environment) and httpd daemon (which implements the Apache Web Server) are the largest processes on her machine, in terms of the amount of memory they are requesting. (Note also, the sort command made an appearance.)

Next, madonna sorts the output by the sixth column, which tables the resident memory sizes of the processes.

[madonna@station madonna]$ ps aux | tail +2 | sort -rn -k6 | head
gdm       1074  0.0  5.2 37828 13396 ?        S    02:06   0:01 /usr/bin/gdmgreet
root      1066  0.0  2.5 17836 6444 ?        R    02:06   0:00 /usr/X11R6/bin/X
root       914  0.0  1.2  9916 3140 ?        S    02:06   0:00 cupsd
xfs        978  0.0  0.9  4816 2512 ?        S    02:06   0:00 xfs -droppriv
elvis     1664  0.0  0.7  6768 2020 ?        S    03:31   0:00 /usr/sbin/sshd
madonna   1748  0.0  0.7  6768 2008 ?        S    03:31   0:00 /usr/sbin/sshd
root      1662  0.0  0.6  6716 1736 ?        S    03:31   0:00 /usr/sbin/sshd
root      1746  0.0  0.6  6716 1724 ?        S    03:31   0:00 /usr/sbin/sshd
madonna   1752  0.0  0.5  4388 1472 pts/2    S    03:31   0:00 -bash
root       885  0.0  0.5 18248 1468 ?        S    02:06   0:00 /usr/sbin/httpd

Interestingly, a different collection of processes make the top of the list, including the X server, and several instances of the sshd daemon (which implements the Secure Shell service). Presumably, these are the processes that are currently active.

Next, madonna sorts by the third column, relative CPU activity.

[madonna@station madonna]$ ps aux | tail +2 | sort -rn -k3 | head
elvis     1744 33.8  0.1  3408  400 pts/1    R    03:31   6:01 cat /dev/zero
elvis     1745 33.7  0.1  3412  400 pts/1    R    03:31   6:00 cat /dev/zero
blondie   1826 33.3  0.1  3412  400 pts/2    R    03:32   5:45 cat /dev/zero
xfs        978  0.0  0.9  4816 2512 ?        S    02:06   0:00 xfs -droppriv
smmsp      864  0.0  0.0  5732    4 ?        S    02:06   0:00 sendmail: Queue
rpc        586  0.0  0.0  1548    4 ?        S    02:05   0:00 portmap
root       914  0.0  1.2  9916 3140 ?        S    02:06   0:00 cupsd
root         9  0.0  0.0     0    0 ?        SW   02:05   0:00 [bdflush]
root       894  0.0  0.0  1572  172 ?        S    02:06   0:00 crond
root       885  0.0  0.5 18248 1468 ?        S    02:06   0:00 /usr/sbin/httpd

Her machine is not seeing much current activity, with the exception of three different cat processes, which seem to be evenly dividing her CPU.
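The pattern used throughout this example (strip the header, sort on a column, keep the top few lines) can be sketched on a fixed table so that the result does not depend on which processes happen to be running. The table below is made up and has only four columns, so the column numbers differ from those of ps aux. The sketch assumes the tail -n +2 and head -n spellings, and sets LC_ALL=C so that the decimal point in the %CPU column is read consistently.

```shell
# Hypothetical process table with a header line; columns are
# user, process ID, CPU share, and virtual size.
table='USER PID %CPU VSZ
root 1 0.0 1380
gdm 1074 0.0 37828
elvis 1744 33.8 3408'

# Drop the header, sort by the 4th column (largest first), keep two lines.
by_vsz=$(printf '%s\n' "$table" | tail -n +2 | LC_ALL=C sort -rn -k4 | head -n 2)

# Same pattern on the 3rd column, keeping only the busiest process.
by_cpu=$(printf '%s\n' "$table" | tail -n +2 | LC_ALL=C sort -rn -k3 | head -n 1)

printf '%s\n' "$by_vsz" "--" "$by_cpu"
```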

Example 2. Using sort and uniq to Collect Information on Running Processes

Continuing to examine the processes running on her machine, madonna next uses the ps command with the -e switch, which specifies to list every process, and the -o switch, which takes a list of column names as an argument. The -o command line switch allows madonna to list only the information which interests her. She finds the following entries in a table of format specifiers in the ps(1) man page.

Table 4-5. Selected Format Specifiers for the ps Command

Tag      Specifies
cmd      The short name of the command
pid      The process ID
state    The current state of the process (R=running, S=sleeping)
user     The user who owns the process

As some examples of using the -o command line switch, madonna first tables processes with their process ID, the user who owns the process, and the command that is running.

[madonna@station madonna]$ ps -e -o pid,user,cmd | head -5
  PID USER     CMD
    1 root     init [
    2 root     [keventd]
    3 root     [kapmd]
    4 root     [ksoftirqd_CPU0]

Next, she merely tables process id and state.

[madonna@station madonna]$ ps -e -o pid,state| head -5
  PID S
    1 S
    2 S
    3 S
    4 S

Now that she has built up some familiarity with the ps command and the -o command line switch, she is ready to begin asking some questions. She first wants to know who is running processes on the machine, and how many processes they are running. She tables all processes, listing only the username of who owns the process. She then passes the output through sort and uniq -c. Notice again the use of the tail +2 command, to strip the header from the output of the ps command.

[madonna@station madonna]$ ps -e -o user | tail +2 | sort | uniq -c
      8 apache
      2 blondie
      1 daemon
      3 elvis
      1 gdm
      5 madonna
     48 root
      1 rpc
      1 smmsp
      1 xfs

She would prefer the output to be sorted, so she adds one more sort to the end of the pipe.

[madonna@station madonna]$ ps -e -o user | tail +2 | sort | uniq -c | sort -rn
     48 root
      8 apache
      6 madonna
      3 elvis
      2 blondie
      1 xfs
      1 smmsp
      1 rpc
      1 gdm
      1 daemon

Now madonna easily sees that root and apache are running the most processes (presumably daemons in the background), followed by madonna, elvis, and blondie (presumably interactive users). How many of these processes are currently running, and how many are sleeping? Using a similar trick, but this time listing the process state instead of the user owner, she comes up with her answer.

[madonna@station madonna]$ ps -e -o state | tail +2 | sort | uniq -c | sort -rn
     73 S
      5 R

Most of the processes on her machine (73) are sleeping, while relatively few (5) are running (which implies they are actively using the CPU).
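The sort | uniq -c | sort -rn pipeline used twice in this example is a general way to build a frequency table. The sketch below applies it to a made-up list of process owners instead of live ps output, so the counts are known in advance.

```shell
# Hypothetical list of process owners, one per line.
owners='root
apache
root
madonna
root
apache'

# Group identical names, count each group, then rank by count.
tally=$(printf '%s\n' "$owners" | sort | uniq -c | sort -rn)

printf '%s\n' "$tally"
```

The first sort exists only to make identical names adjacent for uniq; the second sort orders the finished table by its count column.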

Online Exercises

Lab Exercise

Objective: Use the sort and uniq commands to manage information efficiently.

Specification

1. The file /etc/fstab is used to predefine mount points on your system. The third column of this file specifies the filesystem type of the device to be mounted.

Sort the contents of this file in alphabetically ascending order, using the third column as your primary key. Store the output in the newly created file ~/fstab.byfs

2. The file /proc/modules lists currently loaded kernel modules, along with the module size (the second column) and a current usage count (the third column).

Sort the contents of this file in numerically descending order, using the usage count (third column) as your primary key, and the module size (second column) as your secondary key. Store the results in the file ~/modules.byuc

3. Sort the /etc/passwd file in alphabetically ascending order, using the user’s login shell as your primary key. Store the results in the newly created file passwd.bylogin

4. The stat command uses the --format command line switch to specify its output format. As seen in the stat(1) man page (or stat --help), the following command line will list the permissions of a file in octal notation.

[student@station student]$ stat --format="%a" /etc/passwd
644

Use this command to list the permissions on all files (and directories, etc.) in the /etc/ directory (but not subdirectories). Use the sort and uniq commands to reduce this information into a simple table, with the first column being the number of times that the octal mode specified in the second column occurs. The table should be sorted in numerically descending order, using the number of occurrences (the first column) as your primary key. Store the table in a newly created file called ~/etcmodes.txt

If completed correctly, your table should have a form similar to the following. (Do not be concerned if the actual values of your table differ.)

[student@station student]$ cat etcmodes.txt
    127 644
     63 755
     16 600
     15 777
      6 640
      5 664
      3 444
      2 400
      1 775
      1 750
      1 440

5. The df command lists currently mounted disk partitions, along with the current disk usage. The fourth column of this command’s output lists the amount of available space in blocks.

Create an executable script called ~/bin/avail. When executed, the script should list available partitions (the output of the df command), sorted in numerically descending order, using the amount of available space (the fourth column) as the primary key. The header line generated from the df command should be stripped from the output.


Deliverables

1. The file fstab.byfs, which contains the contents of the /etc/fstab file sorted in alphabetically ascending order, using the third column as the primary key.

2. The file modules.byuc, which contains the contents of the /proc/modules file sorted in numerically descending order, using the third column as the primary key, and the second column as the secondary key.

3. The file passwd.bylogin, which contains the contents of the /etc/passwd file sorted in alphabetically ascending order, using the user’s login shell as the primary key.

4. The file etcmodes.txt, which tables the octal permissions (modes) of all files in the /etc directory. The second column of the table should be the octal mode, and the first column the number of files to which the mode applies. The table should be sorted in numerically descending order, using the first column as a primary key.

5. The executable script ~/bin/avail, which when executed (without arguments) displays the output of the df command sorted in numerically descending order, using the fourth column as its primary key. The header line produced by the df command should be stripped from the output.

If you have performed the exercises correctly, you should be able to generate output similar to the following. Do not be concerned if your actual values differ.

[student@station student]$ head -5 fstab.byfs modules.byuc passwd.bylogin etcmodes.txt
==> fstab.byfs <==
/dev/sda1               /mnt/camera        auto    noauto,user            0 0
none                    /dev/pts           devpts  gid=5,mode=620         0 0
/home/elvis/case.img    /home/elvis/case   ext2    noauto,loop,user,exec  0 0
LABEL=/                 /                  ext3    defaults               1 1
LABEL=/boot             /boot              ext3    defaults               1 2

==> modules.byuc <==
ip_tables              15096   5 [iptable_filter ipt_MASQUERADE iptable_nat]
ext3                   70784   2
jbd                    51924   2 [ext3]
yenta_socket           13504   2
ds                      8680   2

==> passwd.bylogin <==
news:x:9:13:news:/etc/news:
alice:x:4021:4021::/home/alice:/bin/bash
apache:x:48:48:Apache:/var/www:/bin/bash
arnold:x:4012:4012::/home/arnold:/bin/bash
blondie:x:505:505::/home/blondie:/bin/bash

==> etcmodes.txt <==
    127 644
     63 755
     16 600
     15 777
      6 640

[student@station student]$ avail


/dev/hda3              5131108   4505312    365144  93% /
none                    127616         0    127616   0% /dev/shm
/dev/hda1               124427     26268     91735  23% /boot

Questions

1. Which of the following are legitimate invocations of the sort command?

( ) a. sort -k5 -t: data

( ) b. sort -n data

( ) c. sort -rn -k3 data

( ) d. sort -k2 < data


2. Which of the following would sort the file data in numerically descending order, using the third column as the primary key?

( ) a. sort -rn -k3 data

( ) b. sort -r -3 data

( ) c. sort -n -p3 data

( ) d. All of the Above

( ) e. A or C

3. Which of the following command lines would sort the file /etc/passwd in numerically ascending order, using the third “:” separated field as a primary key?

( ) a. sort --numeric-sort -k3 -t: /etc/passwd

( ) b. sort -nk3 -t: /etc/passwd

( ) c. sort -n k3,: /etc/passwd


( ) e. A and B

Use the output from the following command to answer the next 2 questions.

[student@station hwdata]$ head -20 MonitorsDB
#
# Monitor information for use by Xconfigurator
# Supplies similar information to the data in the /usr/X11R6/lib/X11/Cards
# file about video cards.
#
# Each line has format:
# <Manufacturer>; <Monitor name>; <EISA ID (if any)>; <horiz sync in \
# Khz>; <verc sync in Hz>; DPMS support
#
# Horiz and vert sync can be a range; like 35.2-55.75; or 31.5,35.5
# BUT remember to use ';' to separate fields
#

Aamazing; Aamazing CM-8426; cm-8426; 31.0-60.0; 40.0-80.0; 1
Aamazing; Aamazing MS-8431; ms-8431; 15.0-36.0; 50.0-70.0; 1
Acer; Acer 11D; API440B; 31.0-35.5; 50.0-90.0
Acer; Acer 1455; API5514; 30.0-54.0; 50.0-120.0
Acer; Acer 1555; API5515; 30.0-54.0; 50.0-120.0
Acer; Acer 15P; acer_15p; 15.0-70.0; 45.0-90.0; 1
Acer; Acer 211c; API9708; 30.0-107.0; 50.0-160.0

4. Which of the following bash command lines would sort the listed monitors in alphabetically ascending order by EISA ID?

( ) a. grep -v # MonitorsDB | sort -t; -k3

( ) b. grep -v \# MonitorsDB | sort -t\; -k3

( ) c. grep -v \# MonitorsDB | sort -n -t\; -k2

( ) d. grep -v \# MonitorsDB | sort -r -t\; -k2


5. Which of the following bash command lines would sort the listed monitors in numerically descending order, using the (first value of) the monitor’s horizontal "sync frequency" as the primary key?

( ) a. grep -v \# MonitorsDB | sort -nr -t\; -k4

( ) b. grep -v # MonitorsDB | sort -r -t; -k4

( ) c. grep -v # MonitorsDB | sort -t; -k4

( ) d. grep -v \# MonitorsDB | sort -t\; -k4


Use the output from the following command to answer the next 3 questions.

[student@station log]$ ls -l
-rw------- 1 root     root         2533 Oct  6 02:06 boot.log
drwxr-xr-x 2 servlet  servlet      4096 Jan 20  2003 ccm-core-cms
-rw------- 1 root     root         9795 Oct  6 05:50 cron
drwxr-xr-x 2 lp       sys          4096 Oct  5 04:07 cups
-rw-r--r-- 1 root     root         5761 Oct  6 02:05 dmesg
drwxr-xr-x 2 root     root         4096 Oct  6 02:06 gdm
drwx------ 2 root     root         4096 Oct  5 04:07 httpd
-r-------- 1 root     root     19136220 Oct  6 03:31 lastlog
-rw------- 1 root     root       184228 Oct  6 05:08 maillog
-rw------- 1 root     root        20899 Oct  6 04:51 messages
-rwx------ 1 postgres postgres        0 Apr  1  2003 pgsql
drwxrwsr-x 2 root     rha          4096 Aug 27 10:58 rha
-rw-r--r-- 1 root     root        22166 Oct  6 04:10 rpmpkgs
drwxr-xr-x 2 root     root         4096 Oct  6 02:10 sa
drwx------ 2 root     root         4096 Apr  5  2003 samba
-rw-r--r-- 1 root     root        41382 Aug 21 15:47 scrollkeeper.log
-rw------- 1 root     root         1161 Oct  6 03:31 secure
-rw------- 1 root     root            0 Oct  5 04:07 spooler
drwxr-x--- 2 squid    squid        4096 Aug 18 07:05 squid
-rw-rw-r-- 1 root     root            0 Oct  5 04:08 up2date
drwxr-xr-x 2 root     root         4096 Feb  3  2003 vbox
-rw------- 1 root     root            0 Oct  5 04:08 vsftpd.log
-rw-rw-r-- 1 root     utmp       122880 Oct  6 03:31 wtmp
-rw-r--r-- 1 root     root        39015 Oct  6 05:56 xorg.0.log
total 12532

6. Which of the following command lines would reorder this output in numerically descending order, using the file size (the fifth column) as the primary key?

( ) a. ls -l | grep -v ^t | sort -rn -k5

( ) b. ls -l | grep -v ^t | sort -n -k4

( ) c. ls -l | grep -v ^t | sort -r -k5

( ) d. ls -l | grep -v ^t | sort -t: -k5


7. Which of the following command lines would reorder this output in numerically ascending order, using the link count (the second column) as a primary key, and the file size (the fifth column) as the secondary key?

( ) a. ls -l | grep -v ^t | sort -rn -k5,2

( ) b. ls -l | grep -v ^t | sort -n -k2 -k5

( ) c. ls -l | grep -v ^t | sort -n -k5 -k2

( ) d. ls -l | grep -v ^t | sort -t: -k5 -k2


8. Which of the following command lines would reorder this output in alphabetically ascending order, using the group owner (the fourth column) as the primary key, and the filename (the ninth column) as the secondary key?

( ) a. ls -l | grep -v ^t | sort -r -k9 -k4

( ) b. ls -l | grep -v ^t | sort -t- -k4,9

( ) c. ls -l | grep -v ^t | sort -k4 -k9

( ) d. ls -l | grep -v ^t | sort -rn -k4 -k9





9. Which of the following would print the number of occurrences of each record in the file data?

( ) a. sort -c data

( ) b. sort data | uniq

( ) c. sort data | uniq -c

( ) d. uniq data | sort -c


10. Which of the following would print repeated records from the file data?

( ) a. sort -r data

( ) b. uniq data | sort -r

( ) c. sort data | uniq -d

( ) d. sort data | uniq



Chapter 5. Extracting and Assembling Text: cut and paste

Key Concepts

• The cut command extracts text from text files, based on columns specified by bytes, characters, or fields.

• The paste command merges two text files line by line.

Discussion

In this Lesson, we explore two commands that are used to extract columns from a stream of text, or assemble columns into a wider stream: cut and paste.

The cut Command

Extracting Text with cut

The cut command extracts columns of text from a text file or stream. Imagine taking a sheet of paper that lists rows of names, email addresses, and phone numbers. Rip the page vertically twice so that each column is on a separate piece. Hold onto the middle piece which contains email addresses, and throw the other two away. This is the mentality behind the cut command.

The cut command interprets any command line arguments as filenames of files on which to operate, or operates on the standard in stream if none are provided. In order to specify which bytes, characters, or fields are to be cut, the cut command must be called with one of the following command line switches.

Table 5-1. "Mandatory" Command Line Switches for the cut Command.

Switch     Effect
-b list    Extract bytes specified in list
-c list    Extract characters specified in list
-f list    Extract fields specified in list

The list arguments are actually a comma-separated list of ranges. Each range can take one of the following forms.

Table 5-2. Range Specifications

N      Only item number N.
N-     Items N through the end of the line.
N-M    Items N through M (inclusive).
-M     From the beginning of the line through item number M (inclusive).
-      All items from the beginning of the line through the end of the line.
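The range forms in the table can be tried on a single line of text. The following is a small sketch using the -c switch on a made-up eight-character string, so that each character's position is easy to count.

```shell
# A hypothetical line whose characters sit at positions 1 through 8.
line='abcdefgh'

middle=$(printf '%s\n' "$line" | cut -c2-4)     # N-M: characters 2 through 4
tail_end=$(printf '%s\n' "$line" | cut -c6-)    # N-:  character 6 to end of line
head_end=$(printf '%s\n' "$line" | cut -c-3)    # -M:  start of line through 3
picked=$(printf '%s\n' "$line" | cut -c1,3,5-6) # a comma-separated list of ranges

printf '%s\n' "$middle" "$tail_end" "$head_end" "$picked"
```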

Extracting text by Character Position with cut -c

With the -c command line switch, the list specifies a character’s position in a line of text, where the first character is character number 1. As an example, the file /proc/interrupts lists device drivers, the interrupt request (IRQ) line to which they attach, and the number of interrupts which have occurred on that IRQ line. (Do not be concerned if you are not yet familiar with the concepts of a device driver or IRQ line. Focus instead on how cut is used to manipulate the data).

[student@rosemont student]$ cat /proc/interrupts
           CPU0
  0:    4477340          XT-PIC  timer
  1:      25250          XT-PIC  keyboard
  2:          0          XT-PIC  cascade
  3:       7344          XT-PIC  ehci-hcd
  5:     310187          XT-PIC  usb-uhci, ohci1394
  8:          1          XT-PIC  rtc
 10:        166          XT-PIC  usb-uhci, eth1
 11:    6575295          XT-PIC  usb-uhci, eth0, Audigy
 12:     544632          XT-PIC  PS/2 Mouse
 14:      80379          XT-PIC  ide0
 15:     341407          XT-PIC  ide1
NMI:          0
ERR:          0

Because the characters in the file are formatted into columns, the cut command can extract particular regions of interest. If just the IRQ line and the number of interrupts were of interest, the rest of the file could be cut away, as in the following example. (Note the use of the grep command to first reduce the file to just the lines pertaining to interrupt lines.)

[student@rosemont student]$ grep '[[:digit:]]:' /proc/interrupts | cut -c1-15
  0:    4512997
  1:      27954
  2:          0
  3:       7344
  5:     312095
  8:          1
 10:        166
 11:    6629756
 12:     545523
 14:      81025
 15:     344239

Alternately, if only the device drivers bound to particular IRQ lines were of interest, multiple ranges of characters could be specified.


[student@rosemont student]$ grep '[[:digit:]]:' /proc/interrupts | cut -c1-5,34-
  0: timer
  1: keyboard
  2: cascade
  3: ehci-hcd
  5: usb-uhci, ohci1394
  8: rtc
 10: usb-uhci, eth1
 11: usb-uhci, eth0, Audigy
 12: PS/2 Mouse
 14: ide0
 15: ide1

If the character specifications were reversed, can the cut command be used to rearrange the ordering of the data?

[student@rosemont student]$ grep '[[:digit:]]:' /proc/interrupts | cut -c34-,1-5
  0: timer
  1: keyboard
  2: cascade
...

The answer is no. Text will appear only once, in the same order it appears in the source, even if the range specifications are overlapping or rearranged.
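The same behavior can be confirmed without /proc/interrupts. The sketch below uses a made-up sample string and illustrative variable names; it shows that reversed and overlapping ranges produce the characters once each, in input order.

```shell
#!/bin/bash
# Hypothetical sample line, used only to illustrate the point above.
sample="abcdefghij"

in_order=$(echo "$sample" | cut -c1-3,8-)       # abchij
reversed=$(echo "$sample" | cut -c8-,1-3)       # abchij (still in input order)
overlapping=$(echo "$sample" | cut -c1-3,2-4)   # abcd (each character once)

echo "$in_order $reversed $overlapping"
```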

Extracting Fields of Text with cut -f

The cut command can also be used to extract text that is structured not by character position, but by some delimiter character, such as a TAB or “:”. The following command line switches can be used to further qualify what is meant by a field, or to select source lines more selectively.

Table 5-3. Command Line Switches for cut -f

Switch                      Effect
-d DELIM                    Use DELIM to separate fields on input, instead of the default TAB character.
-s                          Do not include lines that do not contain the delimiter character (useful for stripping comments and headers).
--output-delimiter=STRING   On output, use the text specified by STRING instead of the input field delimiter.
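The three switches in the table can be seen working together on a few lines of made-up data. This is a minimal sketch: the records and variable names are hypothetical, and --output-delimiter is assumed to be available (it is provided by the GNU version of cut).

```shell
#!/bin/bash
# Hypothetical colon-delimited records. The first line contains no ":",
# much like a comment or header line.
records=$(printf '%s\n' '# sample header' 'elvis:501:/bin/bash' 'blondie:502:/bin/tcsh')

# Without -s, the line lacking a delimiter is passed through whole.
with_header=$(echo "$records" | cut -d: -f1,3)

# With -s, lines lacking the delimiter are dropped.
without_header=$(echo "$records" | cut -s -d: -f1,3)

# --output-delimiter replaces the ":" between the selected fields on output.
respaced=$(echo "$records" | cut -s -d: -f1,3 --output-delimiter=' ')

echo "$respaced"
```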

For example, the file /usr/share/hwdata/pcitable lists over 3000 vendor IDs and device IDs (which can be probed from PCI devices), and the kernel modules and text strings which should be associated with them, separated by tabs.

[student@rosemont hwdata]$ head -15 pcitable
# This file is automatically generated from isys/pci. Edit
# it by hand to change a driver mapping. Other changes will
# be lost at the next merge - you have been warned.
# Edit by hand to change a driver mapping. Changes to descriptions
# will be lost at the next merge - you have been warned.
# If you run makeids, please make sure no entries are lost.
# The format is ("%d\t%d\t%s\t"%s"\n", vendid, devid, moduleName, cardDescription)
# or ("%d\t%d\t%d\t%d\t%s\t"%s"\n", vendid, devid, subvendid, subdevid, moduleName, cardDescription)

0x0675  0x1700  "unknown"       "Dynalink|IS64PH ISDN Adapter"
0x0675  0x1702  "hisax"         "Dynalink|IS64PH ISDN Adapter"
0x09c1  0x0704  "unknown"       "Arris|CM 200E Cable Modem"
0x0e11  0x0001  "ignore"        "Compaq|PCI to EISA Bridge"
0x0e11  0x0002  "ignore"        "Compaq|PCI to ISA Bridge"
0x0e11  0x0046  "cciss"         "Compaq|Smart Array 64xx"

The following example extracts the third and fourth columns, using the default TAB character to separate fields. Note the use of the -s command line switch, which effectively strips the header lines (which do not contain any TABs).

[student@rosemont hwdata]$ cut -s -f3,4 pcitable | head
"unknown"       "Dynalink|IS64PH ISDN Adapter"
"hisax"         "Dynalink|IS64PH ISDN Adapter"
"unknown"       "Arris|CM 200E Cable Modem"
"ignore"        "Compaq|PCI to EISA Bridge"
"ignore"        "Compaq|PCI to ISA Bridge"
"cciss"         "Compaq|Smart Array 64xx"
"unknown"       "Compaq|NC7132 Gigabit Upgrade Module"
"unknown"       "Compaq|NC6136 Gigabit Server Adapter"
"tmspci"        "Compaq|Netelligent 4/16 Token Ring"
"ignore"        "Compaq|Triflex/Pentium Bridge, Model 1000"

As another example, suppose we wanted to obtain a list of the most commonly referenced kernel modules in the file. We could use a similar cut command, along with tricks learned in the last Lesson, to obtain a quick listing of the number of times each kernel module appears.

[student@rosemont hwdata]$ cut -s -f3 pcitable | sort | uniq -c | sort -rn | head
   1988 "unknown"
    148 "ignore"
     83 "aic7xxx"
     70 "gdth"
     37 "e100"
     37 "Card:ATI Rage 128"
     36 "3c59x"
     24 "Card:ATI Mach64"
     21 "tulip"
     20 "agpgart"

Many of the entries are obviously unknown, or intentionally ignored, but we do see that the aic7xxx SCSI driver, and the e100 Ethernet card driver, are commonly used.

Extracting Text by Byte Position with cut -b

The -b command line switch is used to specify which text to extract by bytes. Extracting text using the -b command line switch is very similar in spirit to using -c. In fact, when dealing with text encoded using ASCII or one of the ISO 8859 character sets (such as Latin-1), the two are identical. The -b switch differs from -c, however, when using character sets with variable length encoding, such as UTF-8 (a standard character set on which many people are converging, and the default in Red Hat Enterprise Linux).

As a quick example, consider the following three characters of German text: für. When using UTF-8 encoding, the two characters which are part of the ASCII character set, “f” and “r”, are encoded using a single byte each. The “ü”, however, is encoded using two bytes, as is evidenced by the wc command.

[elvis@station elvis]$ echo für | wc -c
5

Accounting for all five bytes, we have one byte each for the letters “f” and “r”, one byte for the newline which was appended to the output, leaving two bytes for the “ü”.

When using cut -c, the “ü” would be considered a single character, but when using cut -b, “ü” would be considered two bytes, as in the following example.

[elvis@station elvis]$ echo fü | cut -c 1-2
fü
[elvis@station elvis]$ echo fü | cut -b 1-2
f?

The first time, the cut command counted the two bytes used to encode the “ü” as a single character, but the second time, it was considered two bytes. As a result, the character was "cut in half" by the cut command, and the terminal was not able to display it correctly.

Usually, cut -c is the proper way to use the cut command, and cut -b will only be necessary for technical situations.

Note: Notice the inconsistent nomenclature between wc and cut. With wc -c, the wc command really returns the number of bytes contained in a string, while cut -c measures text in characters. Unfortunately, wc -c makes no equivalent distinction between characters and bytes; to count characters, wc provides a separate -m switch.
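The byte and character counts discussed above can be compared directly with wc. This is a small sketch under two assumptions: that the script itself is saved as UTF-8, and that the shell is running in a UTF-8 locale (otherwise wc -m may report the same number as wc -c). The variable names are illustrative only.

```shell
#!/bin/bash
# The string below is assumed to be stored as UTF-8 in this script.
word="für"

# printf '%s' writes the word without a trailing newline.
byte_count=$(printf '%s' "$word" | wc -c)   # bytes: 4 when encoded as UTF-8
char_count=$(printf '%s' "$word" | wc -m)   # characters: depends on the locale

echo "bytes: $byte_count  characters: $char_count"
```

In a UTF-8 locale the second count is expected to be 3, one less than the byte count, because the “ü” occupies two bytes.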

The paste Command

The paste command is used to combine multiple files into a single output. Recall the fictional piece of paper which listed rows of names, email addresses, and phone numbers. After tearing the paper into three columns, what if we had glued the first back to the third, leaving a piece of paper listing only names and phone numbers? This is the concept behind the paste command.

The paste command expects a series of filenames as arguments. The paste command will read the first line from each file, join the contents of each line inserting a TAB character in between, and write the resulting single line to standard out. It then continues with the second line from each file.

Consider the following two files as an example.

[student@station student]$ cat file-1
File-1 Line 1
File-1 Line 2
File-1 Line 3
[student@station student]$ cat file-2
File-2 Line 1
File-2 Line 2
File-2 Line 3

The paste command would output this:

[student@station student]$ paste file-1 file-2
File-1 Line 1   File-2 Line 1
File-1 Line 2   File-2 Line 2
File-1 Line 3   File-2 Line 3

If we had more than two files, the first line of each file would become the first line of the output. The second output line would contain the second lines of each input file, obtained in the order we gave them on the command line. As a convenience, the filename - can be supplied on the command line. For this "file", the paste command would read from standard in.

Table 5-4. Command Line Switches for paste

Option         Description
-d list        Reuse characters from list for delimiters (instead of the default TAB character).
-s, --serial   Transpose the result, so that all lines of the first file are pasted into a single line, all lines of the second file are pasted into the next single line, etc.
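The switches in the table, and the special - filename, can be tried on two small scratch files. The following is a sketch only: the file names, their contents, and the variable names are hypothetical, and the files are created in a temporary directory so that nothing existing is touched.

```shell
#!/bin/bash
# Work in a scratch directory so no existing files are touched.
workdir=$(mktemp -d)
printf '%s\n' alice bob carol > "$workdir/names"
printf '%s\n' 555-0100 555-0101 555-0102 > "$workdir/phones"

# Default: corresponding lines are joined by a TAB.
side_by_side=$(paste "$workdir/names" "$workdir/phones")

# -d: use ":" instead of TAB.
colon_joined=$(paste -d: "$workdir/names" "$workdir/phones")

# -s: each input file becomes one output line.
serial=$(paste -s -d, "$workdir/names" "$workdir/phones")

# "-" reads standard in; naming it three times deals the input into three columns.
three_columns=$(seq 6 | paste - - -)

echo "$colon_joined"
rm -r "$workdir"
```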

Examples

Example 1. Handling Free-Format Records

In a free-format record layout, input record items are identified by their order among the delimited fields on the line, not by their character position. By default, input fields are expected to be separated by exactly one TAB character, but any character that does not appear in the data items themselves may be used. Each occurrence of the delimiter separates a field.

Our favorite example file /etc/passwd has fields separated by exactly one colon (“:”) character. Field 1 is the account name and field 7 gives the shell program used. Using the cut command, we could output a new file with just the account name and the shell name:

[student@station student]$ cut -d: -f1,7 /etc/passwd
root:/bin/bash
bin:/sbin/nologin
daemon:/sbin/nologin
adm:/sbin/nologin
...


Notice that the output lines use the same field delimiters as do the input records. We can change that with the --output-delimiter switch:

[student@station student]$ cut -d: -f7,1 --output-delimiter=, /etc/passwd
root,/bin/bash
bin,/sbin/nologin
daemon,/sbin/nologin
adm,/sbin/nologin
lp,/sbin/nologin
...

Example 2. Living With Fixed-Format Records

In a fixed-format record, data are assigned specific character positions, or columns, that are the same in each input line. Use the -c switch to identify the input character positions copied to each output line.

[student@station student]$ cat fixed-data
abc123
def456
hij789
lkm012

We can clip out just characters 3 and 4 like this:

[student@station student]$ cut -c3-4 fixed-data
c1
f4
j7
m0

Example 3. Using (and Misusing) a Space as a Delimiter

The mount command, without arguments, returns a list of which devices are mounted to which mount points, along with the filesystem type and relevant mount options.

[student@station student]$ mount
/dev/hda3 on / type ext3 (rw)
none on /proc type proc (rw)
usbdevfs on /proc/bus/usb type usbdevfs (rw)
/dev/hda1 on /boot type ext3 (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
none on /dev/shm type tmpfs (rw)
automount(pid780) on /misc type autofs (rw,fd=5,pgrp=780,minproto=2,maxproto=3)

Noticing that the words are separated by single spaces, the cut command can be used to easily extract the third and fifth words (which contain the mount point, and filesystem type, respectively). The command must be supplied with the -d " " command line switch, which instructs it to treat spaces as field delimiters.




[student@station student]$ mount | cut -d" " -f3,5
/ ext3
/proc proc
/proc/bus/usb usbdevfs
/boot ext3
/dev/pts devpts
/dev/shm tmpfs
/misc autofs

Will the same technique work for the df command?

[student@station student]$ df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/hda3              5131108   4502000    368456  93% /
/dev/hda1               124427     26268     91735  23% /boot
none                    127616         0    127616   0% /dev/shm
[student@station student]$ df | cut -d" " -f1,5
Filesystem
/dev/hda3
/dev/hda1
none

Apparently not. The cut command is using a space for a field delimiter. The catch is that the cut command does not collapse multiple spaces into a single space, but treats them individually. Where does the fifth "field" occur in the df command’s output? Somewhere about halfway between the first two columns. The cut command dutifully prints the first and (empty) fifth field.

Unfortunately, this is a commonly encountered limitation of the cut command. Fortunately, we will find techniques in a later Lesson that can be used to overcome it.
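As a preview, one common workaround (not necessarily the one presented in the later Lesson) is to squeeze each run of spaces down to a single space with tr -s before handing the text to cut. The sketch below uses two made-up lines in the style of df output; the data and variable names are assumptions for illustration.

```shell
#!/bin/bash
# Hypothetical df-style lines with runs of spaces between columns.
report=$(printf '%s\n' \
    '/dev/hda3   5131108 4502000  368456  93% /' \
    '/dev/hda1    124427   26268   91735  23% /boot')

# cut alone counts an empty field between every pair of adjacent spaces,
# so "field 5" lands on a different column in each line.
naive=$(echo "$report" | cut -d" " -f1,5)

# Squeezing each run of spaces to one space first makes the fields line up.
squeezed=$(echo "$report" | tr -s ' ' | cut -d" " -f1,5)

echo "$squeezed"
```

With the squeezed input, field 5 is consistently the Use% column.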

Example 4. Examples of Pasting

Our initial example showed the most common usage of paste, where the first lines from all the input files are concatenated together and separated by a delimiter character; the process then repeats for the subsequent lines. The -s option pastes all the lines from the first input file into the first output line, then pastes all the lines from the second input file into the second output line, and so on:

[student@station student]$ paste -d: -s file-1 file-2
File-1 Line 1:File-1 Line 2:File-1 Line 3
File-2 Line 1:File-2 Line 2:File-2 Line 3

Recall that the -d switch list argument can take more than one character. This can be used to provide a different delimiter between each pair of portions written to the output. The list characters are recycled if necessary:

[student@station student]$ paste -d+-/ file-1 file-2 file-1 file-2 file-1
File-1 Line 1+File-2 Line 1-File-1 Line 1/File-2 Line 1+File-1 Line 1
File-1 Line 2+File-2 Line 2-File-1 Line 2/File-2 Line 2+File-1 Line 2
File-1 Line 3+File-2 Line 3-File-1 Line 3/File-2 Line 3+File-1 Line 3


Online Exercises

Lab Exercise

Objective: Use cut and paste to manage text.


Specification

1. Use the cut command to extract a list of usernames and login shells from the /etc/passwd file, where the resulting usernames and login shells are separated by a single space. Sort the resulting list in ascending alphabetical order, using the login shell as the primary key, and the username as a secondary key. Store the result in the newly created file ~/usershells.txt.

2. The file /proc/cpuinfo contains information about your system’s detected CPU. Use the cut command to extract only the values, not the names or the “:” that is used to separate the names from the values. Store the resulting list of values in the newly created file ~/cpuvalues.txt.

3. The file /etc/sysconfig/init is used to define parameters which configure your machine’s startup method. Parameters are defined using the same syntax used by the bash shell, i.e., NAME=value.

Use some combination of the grep and cut commands to generate a list of the parameter names found in this file, one name per line. Do not include the parameter values or the “=” which is used to separate them, or any of the comment or empty lines found in the original file. Sort the names in alphabetically ascending order, and store them in the newly created file ~/initparams.txt.

4. The following script can be used to print a series of 10 random numbers.

#!/bin/bash

for i in $(seq 10); do
    echo $RANDOM
done

Create the script in a file of your choosing, and make the file executable. Execute the script 5 separate times, each time recording the output in a file named ~/trial1, ~/trial2, ~/trial3, etc.

Create a file called titles, which contains the words run1, run2, ... run10, one word per line, for a total of ten lines.

Use the paste command to combine the files named titles, trial1, trial2, trial3, trial4, and trial5, in that order, into a file called trials. Use the default TAB character to separate the columns.

If you have completed the lab correctly, you should be able to generate output similar to the following. Do not be concerned if some of the values differ.

[student@station student]$ head -4 usershells.txt cpuvalues.txt initparams.txt titles trial[15] trials
==> usershells.txt <==
news
alice /bin/bash
apache /bin/bash
arnold /bin/bash

==> cpuvalues.txt <==
0
GenuineIntel
6
8

==> initparams.txt <==
BOOTUP
LOGLEVEL
MOVE_TO_COL
PROMPT

==> titles <==
run1
run2
run3
run4

==> trial1 <==
9739
9089
6712
25993

==> trial5 <==
32029
1474
8347
31709

==> trials <==
run1    9739    27486   12465   14282   32029
run2    9089    27496   13136   15835   1474
run3    6712    7089    7467    7969    8347
run4    25993   12188   31152   12746   31709

Deliverables

1. The file usershells.txt, which contains a list of all users and login shells defined in the /etc/passwd file, separated by a space. The lines should be sorted in alphabetically ascending order, using login shells as the primary key, and usernames as the secondary key.

2. The file cpuvalues.txt, which lists the values for parameters found in the file /proc/cpuinfo, one per line. The values should appear in the same order as they appear in the file /proc/cpuinfo.

3. The file initparams.txt, which contains an alphabetically ascending list of parameter names found in the file /etc/sysconfig/init.

4. Five files named ~/trial1, ~/trial2, ... ~/trial5, which list 10 random integers, one integer per line.

5. The file ~/titles, which contains the 10 words run1, run2, ... run10, one word per line.

6. The file trials, which contains the contents of the six files titles, trial1, ... trial5 pasted into a single file, using the paste command.

Questions

1. Which of the following command lines would extract characters 10-20 from each line of the file /etc/xpdfrc?

( ) a. cut -c10-20 /etc/xpdfrc

( ) b. cut -c10,20 /etc/xpdfrc

( ) c. cut -f10-20 /etc/xpdfrc

( ) d. cut -f10,20 /etc/xpdfrc


2. Which of the following command lines would extract the second and fourth fields from the /etc/group file? Recall that the /etc/group file uses a “:” to separate fields.

( ) a. cut -d: -f2,4 /etc/group

( ) b. cut -f:2,4 /etc/group

( ) c. cut -t: -f2-4 /etc/group

( ) d. cut -t: -f2,4 /etc/group


The file Web defines a palette of colors by listing RGB (Red, Green, and Blue) values for each color, one triplet per line. Use the following transcript to answer the next 2 questions.

[student@station student]$ cat Web
GIMP Palette
# Netscape -- GIMP Palette file
255 255 255
255 255 204
255 255 153
255 255 102
255 255 051
255 255 000
255 204 255
255 204 204

3. Which of the following command lines would extract the green values (i.e., the second column), omitting the two header lines?

( ) a. tail +3 Web | cut -d" " -f2

( ) b. cut -s -f2 Web

( ) c. grep "[[:digit:]]" Web | cut -c5-7


( ) e. A and C only

4. Which of the following command lines would extract the red and blue values (i.e., the first and third columns), separating them with a “:” instead of a space on output? The two header lines should again be omitted.

( ) a. tail +3 Web | cut -d: -f1,3

( ) b. tail +3 Web | cut -c1-3,9-11 -o:

( ) c. tail +3 Web | cut -d" " -f1,3 --output-delimiter=:


( ) e. B and C only

The file /usr/share/gimp/1.2/gradients/Abstract_1 defines gradient parameters by listing 13 numbers, separated by spaces. Use the following transcript to answer the following 2 questions.

[student@station gradients]$ cat Abstract_1
GIMP Gradient
6
0.000000 0.286311 0.572621 0.269543 0.259267 1.000000 1.000000 0.215635 0.407414 0.984953 1.000000 0 0
0.572621 0.657763 0.716194 0.215635 0.407414 0.984953 1.000000 0.040368 0.833333 0.619375 1.000000 0 0
0.716194 0.734558 0.749583 0.040368 0.833333 0.619375 1.000000 0.680490 0.355264 0.977430 1.000000 0 0
0.749583 0.784641 0.824708 0.680490 0.355264 0.977430 1.000000 0.553909 0.351853 0.977430 1.000000 0 0
0.824708 0.853088 0.876461 0.553909 0.351853 0.977430 1.000000 1.000000 0.000000 1.000000 1.000000 0 0
0.876461 0.943172 1.000000 1.000000 0.000000 1.000000 1.000000 1.000000 1.000000 0.000000 1.000000 0 0

5. Which of the following command lines would extract the even numbered columns, omitting the first two header lines?

( ) a. tail +3 Abstract_1 | cut -d" " -c2,4,6,8,10,12

( ) b. tail +3 Abstract_1 | cut -d" " -f2-12:2

( ) c. tail +3 Abstract_1 | cut -d" " -f2,4,6,8,10,12





( ) e. B and C

6. Which of the following command lines would extract the first two columns, omitting the first two header lines?

( ) a. tail +3 Abstract_1 | cut -c1-18

( ) b. tail +3 Abstract_1 | cut -f1,2

( ) c. tail +3 Abstract_1 | cut -c1,18


( ) e. A and C

The file /proc/iomem lists physical memory ranges, each separated by a “:” from the device which is using it.

[student@station student]$ tail /proc/iomem
e8000000-ebffffff : ATI Technologies Inc Rage Mobility M3 AGP 2x
  e8000000-e87fffff : vesafb
fbffd000-fbffd07f : 3Com Corporation Mini PCI 56k Winmodem
fbffd400-fbffd4ff : 3Com Corporation Mini PCI 56k Winmodem
fbffd800-fbffd87f : 3Com Corporation 3c556 Hurricane CardBus
fbffdc00-fbffdc7f : 3Com Corporation 3c556 Hurricane CardBus
fbffe000-fbffffff : ESS Technology ES1983S Maestro-3i PCI Audio Accelerator
fd000000-feffffff : PCI Bus #01
  fdffc000-fdffffff : ATI Technologies Inc Rage Mobility M3 AGP 2x
ffe00000-ffffffff : reserved

7. Which of the following command lines would reliably display only the starting memory address from each range?

( ) a. cut -d- -f1 /proc/iomem

( ) b. cut -c 1-8 /proc/iomem

( ) c. cut -d: -f1 /proc/iomem | cut -d- -f1

( ) d. A and C

( ) e. None of the Above

The file /proc/mounts lists all currently mounted devices, along with their mount points, file systems, and mount options, each separated by spaces.

[student@station student]$ cat /proc/mounts
rootfs / rootfs rw 0 0
/dev/root / ext3 rw 0 0
/proc /proc proc rw 0 0
usbdevfs /proc/bus/usb usbdevfs rw 0 0
/dev/hda1 /boot ext3 rw 0 0
none /dev/pts devpts rw 0 0
none /dev/shm tmpfs rw 0 0
automount(pid780) /misc autofs rw 0 0


8. Which of the following command lines would generate a list of currently used filesystems (listed in the third column), and the number of times they are being used?

( ) a. cut -d" " -f3 /proc/mounts | sort | uniq -c

( ) b. cut -c3 /proc/mounts | sort -c

( ) c. cut -f3 /proc/mounts | sort -u | uniq -c

( ) d. cut -d" " -f3 /proc/mounts | sort -uc


9. Which of the following command lines would combine the files a.txt, b.txt, and c.txt?

( ) a. paste -a a.txt b.txt c.txt

( ) b. paste -j a.txt b.txt c.txt

( ) c. paste -m a.txt b.txt c.txt

( ) d. paste a.txt b.txt c.txt


10. Which of the following command lines would combine the files a.txt, b.txt, and c.txt, using a “:” to separate contents of each?

( ) a. paste -d: a.txt b.txt c.txt

( ) b. paste -m -d: a.txt b.txt c.txt

( ) c. paste -t: a.txt b.txt c.txt

( ) d. paste -t: -j a.txt b.txt c.txt



Chapter 6. Tracking differences: diff

Key Concepts

• The diff command summarizes the differences between two files.

• The diff command supports a wide variety of output formats, which can be chosen using various command line switches. The most commonly used of these is the unified format.

• The diff command can be told to ignore certain types of differences, such as changes in white space or capitalization.

• diff -r recursively summarizes the differences between two directories.

• When comparing directories, the diff command can be told to ignore files whose filenames match specified patterns.

Discussion

The diff Command

The diff command is designed to compare two files that are similar, but not identical, and generate output that describes exactly how they differ. The diff command is commonly used to track changes to text files, such as reports, web pages, shell scripts, or C source code. Companion utilities also exist so that, given one version of a file and the output of the diff command comparing it to some other version, the file can be brought up to date automatically. The most notable of these is the patch command.

We first introduce the diff command by way of example. In the open source community, documentation generally sacrifices correctness of spelling or grammar for timeliness, as demonstrated in the following README.pam_ftp file.

[blondie@station blondie]$ cat README.pam_ftp
This is the README for pam_ftp
------------------------------

This module is an authentication module that does simple ftp
authentication.

Recognized arguments:

        "debug"         print debug messages
        "users="        comma separated list of users which
                        could login only with email adress
        "ignore"        allow invalid email adresses

Options for:
auth:   for authentication it provides pam_authenticate() and
        pam_setcred() hooks.

James Anderson <[email protected]>, 17. June 1999

Noticing that the words address and addresses are misspelled, blondie sets out to apply changes, first by correcting the misspelled words, and secondly by appending a line recording her revisions. She first makes a copy of the file, appending the .orig extension. She then makes her edits.

[blondie@station blondie]$ cp README.pam_ftp README.pam_ftp.orig
[blondie@station blondie]$ nano README.pam_ftp

She now uses the diff command to compare the two revisions of the file.

[blondie@station blondie]$ diff README.pam_ftp.orig README.pam_ftp
11,12c11,12
<                         could login only with email adress
<         "ignore"        allow invalid email adresses
---
>                         could login only with email address
>         "ignore"        allow invalid email addresses
18a19
> Spelling corrections applied by blondie, 22 Sep 2003

Without yet going into detail about diff’s syntax, we see that the command has identified the differences between the two files, exemplifying the essence of the diff command. The diff command is so commonly used that its output is often referred to as a noun, as in "Here’s the diff between those two files".
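A smaller experiment of the same kind can be run with two scratch files. The sketch below is illustrative only: the file names and contents are made up, and the files live in a temporary directory that is removed afterwards.

```shell
#!/bin/bash
# Create two hypothetical versions of a short file in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'one' 'too' 'three' > "$workdir/old.txt"
printf '%s\n' 'one' 'two' 'three' 'four' > "$workdir/new.txt"

# diff exits with status 1 when the files differ, so "|| true" keeps a
# script run with "set -e" from stopping here.
changes=$(diff "$workdir/old.txt" "$workdir/new.txt" || true)

echo "$changes"
rm -r "$workdir"
```

The output reads 2c2 (line 2 changed), followed by the old and new versions of that line, and then 3a4 (a line added after line 3) followed by the new line.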

Output Formats for the diff Command

The diff command was conceived in the early days of the Unix community. Over time, improvements have been made in how diff annotates changes. To preserve backward compatibility, however, older formats are still available. The following lists commonly used diff formats.

"Standard" diff

Originally, the diff command was used to preserve bandwidth over slow network connections. Rather than transferring a new version of a file, a summary of the revisions would be transferred instead. This summary was in a format that was easily recognized by the ed command line editor, which is seldom used today. Examining the previous output, one can imagine the ed editor being asked to change lines 11 and 12, and append a line after line 18.

Soon, however, room for improvement was found. What if an administrator accidentally applied the changes twice? The ed editor would happily make the changes, corrupting the contents of the file. The solution is a context sensitive diff.

Context diff (diff -c)

The context sensitive diff is generated by specifying the -c or -C N command line switches. (The second form is used to specify that exactly N lines of context should be generated.) Consider the following example.

[blondie@station blondie]$ diff -c README.pam_ftp.orig README.pam_ftp
*** README.pam_ftp.orig 2003-10-07 15:30:05.000000000 -0400
--- README.pam_ftp      2003-10-07 15:30:17.000000000 -0400
***************
*** 8,18 ****
  
          "debug"         print debug messages
          "users="        comma separated list of users which
!                         could login only with email adress
!         "ignore"        allow invalid email adresses
  
  Options for:
  auth:   for authentication it provides pam_authenticate() and
          pam_setcred() hooks.
  
  James Anderson <[email protected]>, 17. June 1999
--- 8,19 ----
  
          "debug"         print debug messages
          "users="        comma separated list of users which
!                         could login only with email address
!         "ignore"        allow invalid email addresses
  
  Options for:
  auth:   for authentication it provides pam_authenticate() and
          pam_setcred() hooks.
  
  James Anderson <[email protected]>, 17. June 1999
+ Spelling corrections applied by blondie, 22 Sep 2003

As the name suggests, the context diff includes several lines of surrounding context around each change. Changes are annotated by using a “!” to mark lines that have changed, “+” to mark lines that have been added, and “-” to mark lines that have been removed. Using a context diff, utilities can automatically detect when an administrator accidentally tries to update a file twice.

Unified diff (diff -u)

The unified diff is generated by specifying the -u or -U N command line switches. (The second form is used to specify that exactly N lines of context should be generated.) Rather than duplicating lines of context, the unified diff attempts to record changes all in one stanza, creating a more compact, and arguably more readable, output.

[blondie@station blondie]$ diff -u README.pam_ftp.orig README.pam_ftp
--- README.pam_ftp.orig 2003-10-07 15:30:05.000000000 -0400
+++ README.pam_ftp      2003-10-07 15:30:17.000000000 -0400
@@ -8,11 +8,12 @@
 
         "debug"         print debug messages
         "users="        comma separated list of users which
-                        could login only with email adress
-        "ignore"        allow invalid email adresses
+                        could login only with email address
+        "ignore"        allow invalid email addresses
 
 Options for:
 auth:   for authentication it provides pam_authenticate() and
         pam_setcred() hooks.
 
 James Anderson <[email protected]>, 17. June 1999
+Spelling corrections applied by blondie, 22 Sep 2003

Rather than identifying a line as "changed", the unified diff annotates that the original version should be deleted, and the new version added.

Side by side diff (diff -y)

The previous three formats were meant to be easy to read by some other utility, such as the ed editor or the patch utility. In contrast, the "side by side" format is intended to be read by humans. As the name implies, the two versions of the file are displayed side by side, with annotations in the middle that help identify changes. The following example requests a side by side diff using the -y command line switch, and further qualifies that the output should be formatted to 80 columns with -W80.

[blondie@station blondie]$ diff -y -W80 README.pam_ftp.orig README.pam_ftp
This is the README for pam_ftp          This is the README for pam_ftp
------------------------------          ------------------------------

This module is an authentication modu   This module is an authentication modu
authentication.                         authentication.

Recognized arguments:                   Recognized arguments:

        "debug"         print debug m           "debug"         print debug m
        "users="        comma separat           "users="        comma separat
                        could login o |                         could login o
        "ignore"        allow invalid |         "ignore"        allow invalid

Options for:                            Options for:
auth:   for authentication it provide   auth:   for authentication it provide
        pam_setcred() hooks.                    pam_setcred() hooks.

James Anderson <[email protected]>, 1   James Anderson <[email protected]>, 1
                                      > Spelling corrections applied by blond

While the output would be more effective using a wide terminal, it does provide an intuitive feel for the differences between the two files.

Quiet diff (diff -q)

The quiet diff merely reports if two files differ, not the nature of the differences.

[blondie@station blondie]$ diff -q README.pam_ftp.orig README.pam_ftp
Files README.pam_ftp.orig and README.pam_ftp differ

if-then-else Macro diff (diff -D tag)

This format generates differences using a syntax recognized by the cpp pre-processor. It allows either the original version or the new version to be included by defining the specified tag. While beyond the scope of this course, it is included for the benefit of those familiar with the cpp C preprocessor.




Other, less commonly used output formats exist as well. Which format is the right one? The answer depends on the preferences of the generator of the "diff", or the expectations of whoever might be receiving the "diff". The diff command is often used in the open source community to communicate suggestions about exact changes to the source code of some program, in order to fix a bug or add a feature. In this context, the unified diff format is almost always preferred.
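To get a feel for the unified format on a small scale, the following sketch generates one from two scratch files. The file names and contents are hypothetical. Because the two header lines carry file names and timestamps that change from run to run, the sketch drops them with tail and keeps only the hunk.

```shell
#!/bin/bash
# Create two hypothetical versions of a short file in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'one' 'too' 'three' > "$workdir/old.txt"
printf '%s\n' 'one' 'two' 'three' 'four' > "$workdir/new.txt"

# Generate the unified diff. diff exits with status 1 when the files
# differ, hence "|| true".
unified=$(diff -u "$workdir/old.txt" "$workdir/new.txt" || true)

# Skip the "---" and "+++" header lines, leaving the "@@" hunk.
hunk=$(echo "$unified" | tail -n +3)

echo "$hunk"
rm -r "$workdir"
```

In the hunk, lines beginning with a space are unchanged context, lines beginning with “-” appear only in the old file, and lines beginning with “+” appear only in the new file.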

The following table summarizes some of the various command line switches which can be used tospecify output format for the diff command.

Table 6-1. Command Line Switches for Specifying diff Output Format

Switch               Effect
-c                   Generate the context sensitive format.
-C, --context[=N]    Generate the context sensitive format, using N lines of context, if specified.
-u                   Generate the unified format.
-U, --unified[=N]    Generate the unified format, using N lines of context, if specified.
-N                   Another format for specifying N lines of context, where N is a number (for example, -5). Only used with -c or -u.
-y, --side-by-side   Generate the side by side format.
-W, --width=N        Use N columns when generating side by side format.
--left-column        Print only the left column of lines common to both files when using the side by side format.
-q, --brief          Only report if files differ, not the details of the difference.

How diff Interprets Arguments

The diff command expects to be called with two arguments, a from-file and a to-file (or, in other words, an oldfile and a newfile). The output of the diff command describes what must be done to the from-file to create the to-file.

If one of the filenames refers to a regular file, and the other a directory, the diff command will look for a file of the same name in the specified directory. If both are directories, the diff command will compare files in both directories, but will not recurse to subdirectories (unless the -r switch is specified, see below). Additionally, the special file name “-” will cause the diff command to read from standard in instead of a regular file.
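The two special cases can be tried out safely in a scratch directory. The following is a small sketch, not taken from the workbook. The names notes.txt and backup are invented for the illustration.

```shell
tmp=$(mktemp -d)
mkdir "$tmp/backup"
printf 'one\ntwo\n' > "$tmp/notes.txt"
printf 'one\n2\n'   > "$tmp/backup/notes.txt"

# File versus directory: diff compares notes.txt with backup/notes.txt.
# diff exits with status 1 when it finds differences, hence "|| true".
file_vs_dir=$(diff "$tmp/notes.txt" "$tmp/backup" || true)
printf '%s\n' "$file_vs_dir"

# "-" as the from-file: the first input is read from standard in.
# The echo only runs if diff exits with status 0 (no differences).
from_stdin=$(printf 'one\ntwo\n' | diff - "$tmp/notes.txt" && echo "identical")
printf '%s\n' "$from_stdin"
```

The exit status is worth noting when diff is used in scripts: 0 means no differences were found, 1 means differences were found, and 2 signals trouble.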

Customizing diff to be Less Picky

If not told otherwise, the diff command will diligently track all differences between two files. Several command line switches can be used to cause the diff command to have a more relaxed behavior. The following table summarizes the relevant command line switches.

Table 6-2. Command Line Switches that Specify diff’s Pickyness







Switch                               Effect
-b, --ignore-space-change            Ignore changes in the amount of white space when comparing lines.
-w, --ignore-all-space               Ignore all white space when comparing lines.
-B, --ignore-blank-lines             Ignore changes that insert or delete blank lines.
-i, --ignore-case                    Ignore changes in case (i.e., consider upper and lower case characters equivalent).
-I, --ignore-matching-lines=regex    Ignore changes that insert or delete lines which match the mandatory argument regex.

As an example, consider the following two files.

[blondie@station blondie]$ cat cal.txt
   September 2003
Su Mo Tu We Th Fr Sa
    1  2  3  4  5  6
 7  8  9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30

[blondie@station blondie]$ cat cal_edited.txt
====================
==== This Month ====
====================

   September 2003
Su Mo Tu We Th Fr Sa
    1  2  3  4  5  6
 7  8  9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30


The file cal_edited.txt differs in two respects. First, a four line header was added to the top. Secondly, an extra (empty) line was added to the bottom. An "ordinary" diff recognizes all of these changes.

[blondie@station blondie]$ diff cal.txt cal_edited.txt
0a1,4
> ====================
> ==== This Month ====
> ====================
>
9a14
>

With the -B command line switch, however, the diff command ignores the new, empty line at the bottom.

[blondie@station blondie]$ diff -B cal.txt cal_edited.txt
0a1,4
> ====================
> ==== This Month ====
> ====================
>

With the -I command line switch, the diff command can be told to also ignore any lines that begin with a “=”.

[blondie@station blondie]$ diff -B -I "^=" cal.txt cal_edited.txt
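The case and white space switches can be exercised in the same way. The sketch below uses two invented files, a.txt and b.txt, that differ only in capitalization and in the number of spaces between words.

```shell
tmp=$(mktemp -d)
printf 'Hello World\nfoo bar\n'    > "$tmp/a.txt"
printf 'hello world\nfoo    bar\n' > "$tmp/b.txt"

# A plain comparison notices both the case change and the extra spaces.
# diff exits with status 1 when the files differ, hence "|| true".
strict=$(diff -q "$tmp/a.txt" "$tmp/b.txt" || true)
printf '%s\n' "$strict"

# Ignoring case (-i) and changes in the amount of white space (-b),
# diff prints nothing and exits with status 0, so the echo runs.
relaxed=$(diff -i -b "$tmp/a.txt" "$tmp/b.txt" && echo "no differences reported")
printf '%s\n' "$relaxed"
```

Dropping either switch brings one of the two differences back, which is a quick way to confirm what each switch is responsible for.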

Recursive diff’s

The diff command can act recursively, descending two similar directory trees and annotating any differences. The following table lists command line switches relevant to diff’s recursive behavior.

Table 6-3. Command Line Switches for Using diff Recursively

Switch                     Effect
-r, --recursive            When comparing directories, recurse through subdirectories as well.
-x, --exclude=pattern      When comparing directories recursively, omit filenames that match pattern.
-X, --exclude-from=file    When comparing directories recursively, omit filenames that match patterns specified in file.

As an example, blondie is examining two versions of a project called vreader. The project involves Python scripts which convert calendaring information from the vcal format to an XML format. She has downloaded two versions of the project, vreader-1.2.tar.gz and vreader-1.3.tar.gz, and expanded each of the archives into her local directory.

[blondie@station blondie]$ ls
vreader-1.2  vreader-1.2.tar.gz  vreader-1.3  vreader-1.3.tar.gz

The directories vreader-1.2 and vreader-1.3 have the following structure.

vreader-1.2/
|-- addressbook.vcard
|-- calendar.ics
|-- conv_db.py
|-- conv_db.pyc
|-- datebook.xml
|-- templates/
|   `-- datebook.xml
`-- vreader.py
vreader-1.3/
|-- addressbook.vcard
|-- calendar.ics
|-- conv_db.py
|-- conv_db.pyc
|-- datebook.out.xml
|-- datebook.xml
|-- templates/
|   `-- datebook.xml
`-- vreader.py

In order to summarize the differences between the two versions, she runs a recursive diff on the two directories.

[blondie@station blondie]$ diff -r vreader-1.[23]
Binary files vreader-1.2/conv_db.pyc and vreader-1.3/conv_db.pyc differ
Only in vreader-1.3: datebook.out.xml
diff -r vreader-1.2/templates/datebook.xml vreader-1.3/templates/datebook.xml
15a16
> <event description="Linux users 331 dabney" categories="" uid="-1010079065" start="873246600" end="873248400" />
diff -r vreader-1.2/vreader.py vreader-1.3/vreader.py
6a7
> time_offset = 0 # in hours
348c349
< return utime
---
> return utime + time_offset*3600

The diff command recurses through the two directories, and notes the following differences.

1. The two binary files vreader-1.2/conv_db.pyc and vreader-1.3/conv_db.pyc differ. Because they are not text files, however, the diff command does not try to annotate the differences.

2. The complementary file to vreader-1.3/datebook.out.xml is not found in the vreader-1.2 directory.

3. The files vreader-1.2/templates/datebook.xml and vreader-1.3/templates/datebook.xml differ, and diff annotates the changes.

4. The files vreader-1.2/vreader.py and vreader-1.3/vreader.py differ, and diff annotates the changes.

Often, when comparing more complicated directory trees, there are files that are expected to change, and files that are not. For example, the file conv_db.pyc is compiled Python code automatically generated from the text Python script file conv_db.py. Because blondie is not interested in differences between the compiled versions of the file, she uses the -x command line switch to exclude the file from her comparisons. Likewise, she is not interested in the files ending .xml, so she specifies them with an additional -x command line switch.

[blondie@station blondie]$ diff -r -x "*.pyc" -x "*.xml" vreader-1.[23]
diff -r -x '*.pyc' -x '*.xml' vreader-1.2/vreader.py vreader-1.3/vreader.py
6a7
> time_offset = 0 # in hours
348c349
< return utime
---
> return utime + time_offset*3600



Now the output of the diff command is limited to only the file vreader-1.2/vreader.py and its complement in vreader-1.3.

As an alternative to listing file patterns to exclude on the command line, they may be collected in a simple text file which is specified instead, using the -X command line switch. In the following, blondie has created and uses such a file.

[blondie@station blondie]$ cat diff_excludes.txt
*.pyc
*.xml
*.py
[blondie@station blondie]$ diff -r -X diff_excludes.txt vreader-1.[23]

Because blondie included *.py in her list of file patterns to exclude, the diff command is left with nothing to say.
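Readers without the vreader archives can reproduce the idea with a pair of throwaway directory trees. The sketch below is only an illustration: the proj-1.0 and proj-1.1 names and their contents are invented, and the -q switch is added to keep the report to one line per file.

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/proj-1.0/docs" "$tmp/proj-1.1/docs"
printf 'print("hi")\n'    > "$tmp/proj-1.0/main.py"
printf 'print("hello")\n' > "$tmp/proj-1.1/main.py"
printf 'cache-a\n'        > "$tmp/proj-1.0/main.pyc"
printf 'cache-b\n'        > "$tmp/proj-1.1/main.pyc"
printf 'notes\n'          > "$tmp/proj-1.0/docs/README"
printf 'notes\n'          > "$tmp/proj-1.1/docs/README"
printf 'new in 1.1\n'     > "$tmp/proj-1.1/docs/CHANGES"

cd "$tmp"
# Recurse (-r), report briefly (-q), and skip compiled files (-x).
# "|| true" absorbs diff's exit status of 1, which only signals
# that differences were found.
summary=$(diff -rq -x '*.pyc' proj-1.0 proj-1.1 || true)
printf '%s\n' "$summary"
```

The summary should mention main.py and the file that exists only in proj-1.1/docs, and say nothing about main.pyc. Note that the pattern is quoted so that the shell hands it to diff untouched.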

Examples

Example 1. Using diff to Examine New Configuration Files

After updating her sendmail RPM package, blondie notices that she has a new configuration file in her /etc/mail directory, sendmail.cf.rpmnew. She would like to see how this file compares to her already existing configuration file, /etc/mail/sendmail.cf. She uses diff to summarize the differences.

[blondie@station blondie]$ diff /etc/mail/sendmail.cf /etc/mail/sendmail.cf.rpmnew
19,21c19,21
< ##### built by [email protected] on Tue Apr 1 15:09:38 EST 2003
< ##### in /etc/mail
< ##### using /usr/share/sendmail-cf/ as configuration include directory
---
> ##### built by [email protected] on Wed Sep 17 14:45:22 EDT 2003
> ##### in /usr/src/build/308253-i386/BUILD/sendmail-8.12.8/cf/cf
> ##### using ../ as configuration include directory
40d39
<
101c100
< DSnimbus.example.com
---
> DS

She is satisfied that the new version of the configuration file differs only by some comment lines, and the lack of a local configuration she had added to her version of the file.



Example 2. Using diff to Examine Recent Changes to /etc/passwd

The system administrator on a machine has noticed that the useradd utility creates a backup of the /etc/passwd file whenever it makes a change to it, named /etc/passwd-. As the user root, she would like to view the most recent change to the /etc/passwd file.

[root@station root]# diff /etc/passwd- /etc/passwd
68a69
> desktop:x:80:80:desktop:/var/lib/menu/kde:/sbin/nologin

Apparently, a new system user has recently been added, probably as a result of adding new software using an RPM package file.

Example 3. Creating a Patch

After downloading the vreader-1.3.tar.gz archive and expanding its contents, blondie decides that she could improve on the project and would like to make changes to the file vreader.py. She first makes a copy of the "pristine" source (the version she unpacked from the distributed archive), and then edits her copy of the file.

[blondie@station blondie]$ tar xzf vreader-1.3.tar.gz
[blondie@station blondie]$ cp -a vreader-1.3 vreader-1.3.local
[blondie@station blondie]$ nano vreader-1.3.local/vreader.py

After editing her copy of the file, she would like to submit her changes to the person who coordinates changes to the vreader project. In the open source community, this person is usually referred to as the maintainer of the project. Rather than sending a full copy of her version, she records the differences between her version and the original in a file called vreader-1.3.blondie.patch.

[blondie@station blondie]$ diff -ru vreader-1.3 vreader-1.3.local> vreader-1.3.blondie.patch

She now emails only the patch file to the project maintainer, who can easily use a command called patch to apply the changes to the pristine version.

[blondie@station blondie]$ mail -s "my changes" [email protected] <vreader-1.3.blondie.patch
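The workbook does not show the maintainer's side of the exchange. The following is a rough sketch of what it might look like, using a stand-in one-file project rather than the real vreader sources. The file contents and the patch file name blondie.patch are invented, and the -p1 switch is one common way of applying a patch created with diff -ru from inside the pristine directory.

```shell
tmp=$(mktemp -d)
cd "$tmp"

# A stand-in for the pristine source and blondie's edited copy.
mkdir vreader-1.3
printf 'def offset():\n    return 0\n' > vreader-1.3/vreader.py
cp -a vreader-1.3 vreader-1.3.local
printf 'def offset():\n    return 3600\n' > vreader-1.3.local/vreader.py

# diff exits with status 1 because the trees differ, hence "|| true".
diff -ru vreader-1.3 vreader-1.3.local > blondie.patch || true

if command -v patch >/dev/null 2>&1; then
    # -p1 strips the leading directory name from the paths in the patch.
    (cd vreader-1.3 && patch -p1 < ../blondie.patch)
    applied=yes
else
    echo "patch is not installed; nothing applied"
    applied=no
fi
```

If the patch applied cleanly, a second diff -r of the two directories reports nothing, because the pristine copy now matches the edited one.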

Online Exercises

Lab Exercise
Objective: Use the diff command to track changes to files.
Estimated Time: 10 mins.



Specification

1. Use the diff command to annotate the differences between the files /usr/share/doc/pinfo-0*/COPYING and /usr/share/doc/mtools-3*/COPYING, using the context sensitive format. Record the output in the newly created file ~/COPYING.diff. When specifying the filenames on the command line, list the pinfo file first, and use an absolute reference for both.

2. Create a local copy of the directory /usr/share/gedit-2, using the following command (in your home directory).

[student@station student]$ cp -a /usr/share/gedit-2 .

To your local copy of the gedit-2 directory, make the following changes.

a. Remove any two files.

b. Create an arbitrarily named file somewhere underneath the gedit-2 directory, with arbitrary content.

c. Using a text editor, delete three lines from any file in the gedit-2/taglist directory.

Once you have finished, generate a recursive "diff" between /usr/share/gedit-2 and your copy, gedit-2. Record the output in the newly created file ~/gedit.diff. When specifying the directories on the command line, specify the original copy first, and use an absolute reference for both. Do not modify the contents of your gedit-2 unless you also reconstruct your file ~/gedit.diff.

Deliverables

1. The file ~/COPYING.diff, which contains a context sensitive "diff" of the files /usr/share/doc/pinfo*/COPYING and /usr/share/doc/mtools*/COPYING, where the pinfo version of the file is used as the original, and each file is specified using an absolute reference.

2. The file ~/gedit.diff, which contains a recursive "diff" of the directories /usr/share/gedit-2 and ~/gedit-2. Both directories should be specified using absolute references, and the system directory should be used as the original.

Questions

1. Which of the following command lines would generate a "diff" of the two files using the context sensitive format?

( ) a. diff -y origfile newfile




( ) b. diff -k origfile newfile

( ) c. diff -c origfile newfile

( ) d. diff --context-sensitive origfile newfile


2. Which of the following command lines would generate a "diff" of the two files using the unified format?

( ) a. diff -u origfile newfile

( ) b. diff -U2 origfile newfile

( ) c. diff --unified origfile newfile


( ) e. A and B only

3. Which of the following command lines would generate a "diff" of the two files using the side by side format?

( ) a. diff -s origfile newfile

( ) b. diff -y origfile newfile

( ) c. diff --side origfile newfile


( ) e. A and C only

Use the following two directory structures to answer the next 2 questions.

vreader-1.2/
|-- addressbook.vcard
|-- calendar.ics
|-- conv_db.py
|-- conv_db.pyc
|-- datebook.xml
|-- templates/
|   `-- datebook.xml
`-- vreader.py

vreader-1.3/
|-- addressbook.vcard
|-- calendar.ics
|-- conv_db.py
|-- conv_db.pyc
|-- datebook.out.xml
|-- datebook.xml
|-- templates/
|   `-- datebook.xml
`-- vreader.py

2 directories, 15 files


Copyright (c) 2003-2005 Red Hat, Inc. All rights reserved. For use only by a student enrolled in a Red Hat Academy course taught at a Red Hat Academy. Any other use is a violationof U.S. and international copyrights. No part of this publication may be photocopied, duplicated, stored in a retrieval system, or otherwise duplicated whether in electronic or printformat without prior written consent of Red Hat, Inc. If you believe Red Hat course materials are being used, copied, or otherwise improperly distributed please [email protected] or phone toll-free (USA) +1 866 626 2994 or +1 (919) 754 3700.



4. Which of the following command lines would compare the files vreader-1.2/datebook.xml and vreader-1.3/datebook.xml?

( ) a. diff -u vreader-1.2/datebook.xml vreader-1.3

( ) b. diff -u vreader-1.2 vreader-1.3/datebook.xml

( ) c. diff -u datebook.xml vreader-1.2 vreader-1.3


( ) e. A and B only

5. Which of the following would include a summary of the differences between the files vreader-1.2/templates/datebook.xml and vreader-1.3/templates/datebook.xml?

( ) a. diff vreader-1.2 vreader-1.3

( ) b. diff -r vreader-1.2 vreader-1.3

( ) c. diff -r vreader-1.2/templates vreader-1.3

( ) d. diff -r -x "*.xml" vreader-1.2 vreader-1.3


Use the output of the following command to answer the next 2 questions.

[student@station student]$ cal > cal.txt
[student@station student]$ cp cal.txt cal2.txt
[student@station student]$ cp cal.txt cal3.txt
[student@station student]$ echo "" >> cal2.txt
[student@station student]$ echo "hello world" >> cal3.txt

6. Which of the following command lines would report no differences between the files cal.txt and cal2.txt?

( ) a. diff --no-white-space cal.txt cal2.txt

( ) b. diff -B cal.txt cal2.txt

( ) c. diff -w cal.txt cal2.txt

( ) d. diff -I cal.txt cal2.txt


7. Which of the following command lines would report no differences between the files cal.txt and cal3.txt?

( ) a. diff -I "^world" cal.txt cal3.txt

( ) b. diff --ignore-regex "world$" cal.txt cal3.txt

( ) c. diff -i "world" cal.txt cal3.txt

( ) d. diff -r cal.txt cal3.txt




Use the output of the following command to answer the next 2 questions.

[student@station student]$ diff -u /etc/sysconfig/tux tux
--- /etc/sysconfig/tux 2003-01-28 20:11:34.000000000 -0500
+++ tux 2003-09-08 05:56:09.000000000 -0400
@@ -3,7 +3,6 @@
 # TUXTHREADS sets the number of kernel threads (and associated daemon
 # threads) that will be used. $TUXTHREADS defaults to the number of
 # CPUs on the system.
-# TUXTHREADS=1
 
 # DOCROOT is the document root; it works the same way as other web
 # servers such as apache. /var/www/html/ is the default.
@@ -22,7 +21,7 @@
 # are opened as user/group root, which means that the _init() function,
 # if it exists, is run as root. This feature is only designed to help
 # protect from programming mistakes; it is NOT really a security mechanism.
-# DAEMON_UID=nobody
+DAEMON_UID=tux
 # DAEMON_GID=nobody
 
 # CGIs can be started in a chroot environment by default.

8. This is an example of which diff output format?

( ) a. The "standard" format

( ) b. The unified format

( ) c. The context sensitive format

( ) d. The side by side format

( ) e. Not enough information is provided

9. Which of the following best describes the differences between the files /etc/sysconfig/tux and tux?

( ) a. Two lines have been added to the file tux.

( ) b. One line has been removed from and one line changed in the file /etc/sysconfig/tux.

( ) c. One line has been removed from and one line changed in the file tux.

( ) d. Two lines have been added to the file /etc/sysconfig/tux.

( ) e. Not enough information has been provided

10. Which of the following command lines would not report differences in capitalization between the two files?

( ) a. diff -i origfile newfile

( ) b. diff --ignore-capitalization origfile newfile

( ) c. diff -I origfile newfile

( ) d. diff -I '[[:upper:]]' origfile newfile



Chapter 7. Translating Text: tr

Key Concepts

• The tr command performs translations on data read from standard in.

• In its most basic form, the tr command performs byte for byte substitutions.

• Using the -d command line switch, the tr command will delete specified characters from a stream.

• Using the -s command line switch, the tr command will squeeze a series of repeated characters in a stream into a single instance of the character.

Discussion

The tr Command

The tr command is a versatile utility that performs character translations on streams. Translating can mean replacing one character for another, deleting characters, or "squeezing" characters (collapsing repeated sequences of a character into one). Each of these uses will be examined in the following sections.

Unlike all of the previous commands in this section, the tr command does not expect filenames as arguments. Instead, the tr command operates exclusively on the standard in stream, reserving command line arguments to specify transformations.

The following table specifies the various ways of invoking the tr command.

Table 7-1. Invocation Syntax for the tr Command

Syntax              Effect
tr SET1 SET2        Substitute the characters specified in SET2 for the corresponding characters specified in SET1.
tr -d SET           Delete all characters specified in SET.
tr -s SET           Squeeze all characters specified in SET.
tr -s SET1 SET2     First substitute all characters found in SET2 for the corresponding characters found in SET1, then squeeze all characters found in SET2.
tr -ds SET1 SET2    First delete all characters found in SET1, then squeeze all characters found in SET2.



Character Specification

As the above table makes clear, the tr command makes extensive use of characters defined in sets. The syntax for defining a range of characters is based upon the range specifier found in regular expressions. The following expressions may be used when specifying characters.

Table 7-2. Specifying Characters for the tr Command

Syntax       Character(s)
literal      Most characters match a literal translation of themselves.
\n           The new line character.
\r           The return character.
\t           The (horizontal) tab character.
\\           The \ character.
[A-Z]        The range of characters bounded by the specified characters. Deprecated, because how the ordering of the range is determined is dependent on the character set used to encode the data.
[:alnum:]    All letters and digits.
[:alpha:]    All letters.
[:blank:]    All horizontal white space.
[:digit:]    All digits.
[:lower:]    All lower case characters.
[:print:]    All printable characters.
[:punct:]    All punctuation characters.
[:space:]    All horizontal or vertical white space.
[:upper:]    All upper case characters.

The table is not meant to be a complete list. Consult the tr(1) man page, or tr --help, for more information.

Using tr to Translate Characters

Unless instructed otherwise (using command line switches), the tr command expects to be called with two arguments, each of which specifies a range of characters. For each of the characters specified in the first set, the tr command will substitute the character found in the same position in the second set. Consider the following trivial example.

[madonna@rosemont madonna]$ echo "abcdefghi" | tr fed xyz
abczyxghi

Notice that in the output, the character “d” is replaced with the character “z”, “e” is replaced with the character “y”, and “f” is replaced with the character “x”. The ordering of the sets is important. The third letter from the first set is replaced with the third letter from the second set.

What happens if the two sets have unequal lengths? The second set is extended to the length of the first set by copying the last character.

[madonna@rosemont madonna]$ echo "abcdefghi" | tr fed xy
abcyyxghi

A classic example of the tr command is to translate text into all upper case or all lower case letters. The "old school" syntax for such a translation would use character ranges.

[madonna@rosemont madonna]$ cat /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost rha-server
192.168.0.254 rosemont.example.com rosemont
192.168.0.51 hedwig.example.com hedwig h
192.168.129.201 z
[madonna@rosemont madonna]$ tr a-z A-Z < /etc/hosts
# DO NOT REMOVE THE FOLLOWING LINE, OR VARIOUS PROGRAMS
# THAT REQUIRE NETWORK FUNCTIONALITY WILL FAIL.
127.0.0.1 LOCALHOST.LOCALDOMAIN LOCALHOST RHA-SERVER
192.168.0.254 ROSEMONT.EXAMPLE.COM ROSEMONT
192.168.0.51 HEDWIG.EXAMPLE.COM HEDWIG H
192.168.129.201 Z

As mentioned in the Lesson on regular expressions, however, range specifications can produce odd results when various character sets are considered. The "new school" approach is to use character classes.

[madonna@rosemont madonna]$ tr '[:lower:]' '[:upper:]' < /etc/hosts
# DO NOT REMOVE THE FOLLOWING LINE, OR VARIOUS PROGRAMS
# THAT REQUIRE NETWORK FUNCTIONALITY WILL FAIL.
127.0.0.1 LOCALHOST.LOCALDOMAIN LOCALHOST RHA-SERVER
192.168.0.254 ROSEMONT.EXAMPLE.COM ROSEMONT
192.168.0.51 HEDWIG.EXAMPLE.COM HEDWIG H
192.168.129.201 Z

Recalling that the ordering of the character ranges is important to the tr command, the character classes would need to generate consistently ordered ranges. Only the [:lower:] and [:upper:] character classes are guaranteed to do so, implying that they are the only classes appropriate for use when using tr for character translation.

Using tr to Delete Characters

When invoked with the -d command line switch, the tr command adopts a radically different behavior. The tr command now expects a single argument (as opposed to two, above), which is again a set of characters. The tr command will now filter the standard in stream, deleting each of the specified characters before writing the stream to standard out.

Consider the following couple of examples.

[madonna@station madonna]$ echo abcdefghi | tr -d def
abcghi
[madonna@station madonna]$ echo 'hark, I hear an elephant!' | tr -d [:upper:][:punct:]
hark  hear an elephant



In the first case, the specified literal characters “d”, “e”, and “f” were deleted. In the second case, all characters that belonged to either the [:punct:] or [:upper:] character classes were deleted.

Using tr to Squeeze Characters

By using the -s command line switch, the tr command can be used to squeeze a continuous series of characters into a single character. If called with one argument, the tr command will simply squeeze the specified set of characters, as in the following example.

[madonna@station madonna]$ echo "aaabbbcccdddeeefffggg" | tr -s bcf
aaabcdddeeefggg

If called with the -s command line switch and two arguments, the tr command will perform substitutions (as if the -s had not been specified), but then squeeze any characters from the second set.

[madonna@station madonna]$ echo "aaabbbcccdddeeefffggg" | tr -s bcf xye
aaaxydddeggg

Notice that this is essentially the same as performing the two operations separately.

[madonna@station madonna]$ echo "aaabbbcccdddeeefffggg" | tr bcf xye
aaaxxxyyydddeeeeeeggg
[madonna@station madonna]$ echo "aaabbbcccdddeeefffggg" | tr bcf xye | tr -s xye
aaaxydddeggg

Lastly, the tr command can be called with both the -s and -d command line switches. In this case, the tr command expects two arguments. The tr command will first delete the first set of characters, and then squeeze the second set.

[madonna@station madonna]$ echo "aaabbbcccaaadddeeefffggg" | tr -ds bcf ae
adddeggg

Note the order of operations carefully. This command is essentially the same as a delete (tr -d) followed by a squeeze (tr -s).

[madonna@station madonna]$ echo "aaabbbcccaaadddeeefffggg" | tr -d bcf
aaaaaadddeeeggg
[madonna@station madonna]$ echo "aaabbbcccaaadddeeefffggg" | tr -d bcf | tr -s ae
adddeggg

Complementing Sets

Other than -s and -d, there are only two command line switches which modify tr’s behavior, listed in the table below.

Table 7-3. Command Line Switches for the tr Command

Switch                 Effect
-c, --complement       Complement SET1 before operating (i.e., use the set of characters excluded by SET1).
-t, --truncate-set1    Truncate the length of SET1 to that of SET2 before operating.

As a quick example of the -c command line switch, the following deletes every character that is not a vowel or a white space character from standard in.

[madonna@station madonna]$ echo aaabbbcccdddeee | tr -cd 'aeiouAEIOU[:space:]'
aaaeee
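The -t switch is not demonstrated in the text. The following short sketch, assuming the GNU version of tr, contrasts it with the default padding behavior using sets of unequal length. The variable names are only for illustration.

```shell
# Default behavior: SET2 is padded with its last character,
# so "d" is translated to "y" as well.
padded=$(echo "abcdefghi" | tr fed xy)

# With -t (GNU tr), SET1 is first cut down to "fe",
# so "d" is left alone.
truncated=$(echo "abcdefghi" | tr -t fed xy)

printf '%s\n%s\n' "$padded" "$truncated"
```

The first line printed is abcyyxghi and the second is abcdyxghi: the only difference is whether the unmatched “d” is translated.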

One Final Caution: Avoid File Globbing!

One final note before we leave our “a”s and “e”s and head for more practical examples.

In some of the previous examples, madonna was careful to protect expressions such as [:punct:] with single quotes, and sometimes she was not. When she didn’t, she got lucky. Consider the following sequence.

[madonna@station madonna]$ echo 'hark, I hear an elephant!' | tr -d [:punct:]
hark I hear an elephant
[madonna@station madonna]$ touch n
[madonna@station madonna]$ echo 'hark, I hear an elephant!' | tr -d [:punct:]
hark, I hear a elephat!

Why did madonna get two very different results from the same command line? If you don’t know the answer, and even if you do, you should protect arguments to the tr command with quotes.
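One way to investigate the puzzle is to ask the shell what it actually hands to tr. To the shell, an unquoted [:punct:] is a file glob that matches any single-character filename made up of one of the characters inside the brackets. The sketch below, run in an empty scratch directory and assuming a POSIX-style shell such as bash, uses echo to reveal the expansion.

```shell
tmp=$(mktemp -d)
cd "$tmp"

# No file matches the glob, so the word is passed through unchanged.
before=$(echo [:punct:])

# Once a file named "n" exists, the glob expands to that filename.
touch n
after=$(echo [:punct:])

printf 'before: %s\nafter:  %s\n' "$before" "$after"

# Quoting keeps the shell out of it, whatever files are present.
quoted=$(echo 'hark, I hear an elephant!' | tr -d '[:punct:]')
printf '%s\n' "$quoted"
```

After the touch, tr never sees the character class at all; it is simply asked to delete the letter “n”.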

Examples

Example 1. Using tr to Clean Up the df Command

Recall a few Lessons ago, when we were discussing the cut command, and its ability to extract fields of text from a stream. We tried to use the cut command to extract the first and fifth fields from the df command’s output, specifying a space as the field delimiter.

[madonna@station madonna]$ df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/hda3              5131108   4499548    370908  93% /
/dev/hda1               124427     26268     91735  23% /boot
none                    127616         0    127616   0% /dev/shm
[madonna@station madonna]$ df | cut -d" " -f1,5
Filesystem
/dev/hda3
/dev/hda1
none



We previously identified the problem with this approach. The cut command does not treat a series of spaces as a single separator between two fields, but as the boundaries of a series of empty fields (one for each space). With her newfound knowledge of the tr command, madonna knows how to solve the problem.

She first uses the tr command to squeeze multiple spaces into a single space.

[madonna@station madonna]$ df | tr -s ' '
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/hda3 5131108 4499556 370900 93% /
/dev/hda1 124427 26268 91735 23% /boot
none 127616 0 127616 0% /dev/shm

Now, she can use the cut command to easily extract the appropriate columns.

[madonna@station madonna]$ df | tr -s ' ' | cut -d" " -f1,5
Filesystem Use%
/dev/hda3 93%
/dev/hda1 23%
none 0%

Example 2. Using tr to Convert DOS Text Files to Unix

The user madonna has recently discovered Project Gutenberg, an online repository for texts which have entered the public domain. 1 She has downloaded one of her favorite texts, A Tale of Two Cities, and stored it in the file 2city12.txt.

Upon examining the file, she realizes that it uses the DOS convention for separating lines (a carriage return/new line pair), as illustrated by the “^M$” combination when using the cat -A command.

[madonna@station madonna]$ head -5 2city12.txt | cat -A
The Project Gutenberg Etext of A Tale of Two Cities, by Dickens^M$
^M$
Please take a look at the important information in this header.^M$
We encourage you to keep this file on your own disk, keeping an^M$
electronic path open for the next readers. Do not remove this.^M$

She would prefer the text to use the Unix convention (a single new line character). She uses the tr command to delete all instances of the carriage return character, storing the result into the file 2city12unix.txt.

[madonna@station madonna]$ tr -d '\r' < 2city12.txt > 2city12unix.txt

In order to confirm that the conversion happened appropriately, she performs a couple of checks. She first examines the file with cat -A, and notes that the “^M” characters have been removed.

[madonna@station madonna]$ head -5 2city12unix.txt | cat -A
The Project Gutenberg Etext of A Tale of Two Cities, by Dickens$
$
Please take a look at the important information in this header.$
We encourage you to keep this file on your own disk, keeping an$
electronic path open for the next readers. Do not remove this.$



Secondly, she performs a word count on both files, using the wc command.

[madonna@station madonna]$ wc 2city12*
  16364  137667  787603 2city12.txt
  16364  137667  771239 2city12unix.txt
  32728  275334 1558842 total

She notes that the difference in the number of characters (bytes) in the two files is the same as the number of lines in the files (787603 - 771239 = 16364). This is appropriate if the tr command deleted one character per line, as expected.

Finally, because she is no longer interested in keeping the DOS formatted version of the file, she renames the file 2city12unix.txt to 2city12.txt.

[madonna@station madonna]$ mv 2city12unix.txt 2city12.txt
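The same byte-count check can be scripted on a small scale. The sketch below does not use the Project Gutenberg text. It builds a three-line DOS-style file, with the invented names dos.txt and unix.txt, and confirms that exactly one byte per line disappears.

```shell
tmp=$(mktemp -d)

# A three-line file using the DOS carriage return/new line convention.
printf 'line one\r\nline two\r\nline three\r\n' > "$tmp/dos.txt"
tr -d '\r' < "$tmp/dos.txt" > "$tmp/unix.txt"

# Reading from standard in keeps the filename out of wc's output.
dos_bytes=$(wc -c < "$tmp/dos.txt")
unix_bytes=$(wc -c < "$tmp/unix.txt")
lines=$(wc -l < "$tmp/unix.txt")

echo "$((dos_bytes - unix_bytes)) bytes removed across $((lines)) lines"
```

Here the files are 32 and 29 bytes long, so the difference of 3 matches the line count, just as 787603 - 771239 matched 16364 above.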

Example 3. Using tr to Count Word Frequencies 2

Good writing often requires that authors avoid overusing certain key words. The user madonna would like to put Charles Dickens to the test. She first uses a text editor to extract the opening paragraph from the text.

[madonna@station madonna]$ cat para1
It was the best of times, it was the worst of times,
it was the age of wisdom, it was the age of foolishness,
it was the epoch of belief, it was the epoch of incredulity,
it was the season of Light, it was the season of Darkness,
it was the spring of hope, it was the winter of despair,
we had everything before us, we had nothing before us,
we were all going direct to Heaven, we were all going direct
the other way--in short, the period was so far like the present
period, that some of its noisiest authorities insisted on its
being received, for good or for evil, in the superlative degree
of comparison only.

She would now like to generate a count of how often particular words are used. In order to use the uniq -c command, she would like to rearrange the text so that the words appear one per line. She outlines the following plan.

1. Delete all punctuation marks.

2. Convert all uppercase characters into lowercase, so that It and it are considered the same word.

3. Convert every space character into a new line character, and squeeze multiple new line characters into one.

She begins implementing her plan one step at a time, so that she can observe the intermediate results.

[madonna@station madonna]$ tr -d '[:punct:]' < para1 | head -5
It was the best of times it was the worst of times
it was the age of wisdom it was the age of foolishness
it was the epoch of belief it was the epoch of incredulity
it was the season of Light it was the season of Darkness
it was the spring of hope it was the winter of despair
[madonna@station madonna]$ tr -d '[:punct:]' < para1 | tr '[:upper:]' '[:lower:]' | head -5
it was the best of times it was the worst of times
it was the age of wisdom it was the age of foolishness
it was the epoch of belief it was the epoch of incredulity
it was the season of light it was the season of darkness
it was the spring of hope it was the winter of despair
[madonna@station madonna]$ tr -d '[:punct:]' < para1 | tr '[:upper:]' '[:lower:]' | tr -s ' ' '\n' | head -5
it
was
the
best
of

At this point, madonna is comfortable enough with sort and uniq to finish off the process.

[madonna@station madonna]$ tr -d '[:punct:]' < para1 | tr '[:upper:]' '[:lower:]' | tr -s ' ' '\n' | sort | uniq -c | sort -rn | head -5
     14 the
     12 of
     11 was
     10 it
      4 we

Inspired by her progress, she next repeats the technique on the entire text. (The process took about 8 seconds on a 700MHz processor.)

[madonna@station madonna]$ tr -d '[:punct:]' < 2city12.txt | tr '[:upper:]' '[:lower:]' | tr -s ' ' '\n' | sort | uniq -c | sort -rn | head -5
   8082 the
   4967 and
   4061 of
   3517 to
   2952 a
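Because the pipeline reads from standard in, it can be wrapped in a shell function and reused on any text. The following is only a sketch of that idea: the name word_freq and its optional argument (how many words to show, defaulting to 5) are illustrative and not part of any standard tool.

```shell
# word_freq: read text on standard in, print the N most frequent words.
# The function name and its argument are illustrative only.
word_freq() {
    tr -d '[:punct:]' |             # 1. delete punctuation
    tr '[:upper:]' '[:lower:]' |    # 2. fold upper case into lower case
    tr -s ' ' '\n' |                # 3. spaces become new lines, repeats squeezed
    sort | uniq -c | sort -rn |     # count identical lines, largest count first
    head -n "${1:-5}"
}

# Try it on a single sentence.
printf 'The cat and the dog and THE bird.\n' | word_freq 2
```

With the function defined, madonna's last command could be written as word_freq 5 < 2city12.txt. Words that tie on count may appear in either order, since sort -rn only compares the numbers.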

Example 4. Rot13

In the early days of Usenet newsgroups, people adopted a convention for obscuring text called rot13. Suppose you were posting a joke, and wanted to include the punch line, but did not want the punch line to be immediately obvious. The punch line could be transformed by rotating each letter by 13 places, so that “a” would become “n”, “b” would become “o”, and “z” would become “m”, as in the following example.

Q: Why did the chicken cross the road?
A: Gb trg gb gur bgure fvqr.

How would someone find the answer? By piping the text through a rot13 translator implemented with tr.

[madonna@station madonna]$ echo "Gb trg gb gur bgure fvqr." | tr A-Za-z N-ZA-Mn-za-m
To get to the other side.
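Because the alphabet has 26 letters, rotating by 13 twice returns every letter to where it started, so the same command both encodes and decodes. A small sketch of this property, using a hypothetical shell function named rot13:

```shell
# rot13: rotate each ASCII letter 13 places; everything else passes through.
# The function name is illustrative; the tr arguments are those used above.
rot13() {
    tr 'A-Za-z' 'N-ZA-Mn-za-m'
}

echo "To get to the other side." | rot13          # encodes the punch line
echo "To get to the other side." | rot13 | rot13  # a second pass restores it
```

The ranges are quoted here as a precaution, so that the shell can never mistake them for file name patterns.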

Online Exercises

Lab Exercise
Objective: Gain familiarity with the tr command.
Estimated Time: 10 mins.

Specification

1. The /etc/passwd file uses colons as a field delimiter. Create the file ~/passwd.tsv, which is a copy of the /etc/passwd file converted to use tabs as field delimiters (i.e., every “:” is converted to a tab).

2. Create the file ~/file_roller.converted, which is a copy of the file /usr/share/file-roller/glade/file_roller.glade, with the following transformations.

a. Convert all tabs to spaces.

b. Convert double quotes (") to single quotes ('). (Do not use backticks (`).)

3. Create a file called ~/openssl.converted, which is a copy of the file /usr/share/ssl/openssl.cnf, with the following transformations.

a. All comment lines (lines whose first non-whitespace character is a #) are removed.

b. All empty lines are removed.

c. All upper case letters are folded into lower case letters.

d. All digits are replaced with the underscore character (“_”).

Deliverables

1. The file ~/passwd.tsv, which is a copy of the /etc/passwd file with tabs substituted for colons.

2. The file ~/file_roller.converted, which is a copy of the file /usr/share/file-roller/glade/file_roller.glade, with all tabs converted to spaces, and all double quotes (") converted to single quotes (').

3. The file ~/openssl.converted, which is a copy of the file /usr/share/ssl/openssl.cnf, with all comment lines (those whose first non-whitespace character is a “#”) removed, all empty lines removed, all upper case letters converted to lower case, and all numeric digits replaced with the underscore character (_).

Questions

1. Which of the following command lines would convert all ASCII carriage return characters in the file text.mac to ASCII new line characters?

( ) a. tr -c CR LF text.mac

( ) b. tr \r \n text.mac

( ) c. tr CR LF < text.mac

( ) d. tr \r \n < text.mac


2. Which of the following command lines would squeeze a series of repeated space characters in the file df.out into a single space?

( ) a. tr --squeeze \s df.out

( ) b. tr -s " " df.out

( ) c. tr -ds " " < df.out

( ) d. tr -s SPC < df.out


3. Which of the following command lines would delete the trailing slash (“/”) from the string etc/?

( ) a. echo etc/ | tr -d '[:punct:]'

( ) b. tr -d / etc/

( ) c. tr -d [:letter:] < /etc/fstab

( ) d. echo etc/ | tr -cd /


In the following transcript, madonna is trying to save the ls(1) man page, and then edit it, only to find that it is full of control characters and other mess.

[madonna@station madonna]$ man ls > ls.man.out
[madonna@station madonna]$ head ls.man.out | cat -A
LS(1) FSF LS(1)$
$
$
$
N^HNA^HAM^HME^HE$
ls - list directory contents$
$
S^HSY^HYN^HNO^HOP^HPS^HSI^HIS^HS$
l^Hls^Hs [_^HO_^HP_^HT_^HI_^HO_^HN]... [_^HF_^HI_^HL_^HE]...$
$

4. Which of the following commands would effectively remove all of the “^H” control sequences from the file ls.man.out?

( ) a. tr -d ^H < ls.man.out

( ) b. tr -cd '[:lower:]' ls.man.out

( ) c. tr -d '[:punct:]' ls.man.out

( ) d. tr -cd '[:print:][:space:]' < ls.man.out


After successfully removing the “^H” control sequences, and storing the results in the file ls.man.noh, madonna is still left with a mess to clean up.

[madonna@station madonna]$ tail +5 ls.man.noh | head
NNAAMMEE
ls - list directory contents

SSYYNNOOPPSSIISS
llss [_O_P_T_I_O_N]... [_F_I_L_E]...

DDEESSCCRRIIPPTTIIOONN
List information about the FILEs (the current directory by default).
Sort entries alphabetically if none of --ccffttuuSSUUXX nor ----ssoorrtt.
Mandatory arguments to long options are mandatory for short options

5. Which of the following command lines would remove all underscores from the file ls.man.noh?

( ) a. tr -d '[:alnum:]' < ls.man.noh

( ) b. tr -d _ < ls.man.noh

( ) c. tr -cd _ ls.man.noh

( ) d. tr -d _ ls.man.noh


After successfully removing the _ characters, and storing the results in ls.man.noh_, madonna is still frustrated by a large number of "doubled" letters and hyphens (“-”).

[madonna@station madonna]$ tail +5 ls.man.noh_ | head
NNAAMMEE
ls - list directory contents

SSYYNNOOPPSSIISS
llss [OPTION]... [FILE]...

DDEESSCCRRIIPPTTIIOONN
List information about the FILEs (the current directory by default).
Sort entries alphabetically if none of --ccffttuuSSUUXX nor ----ssoorrtt.
Mandatory arguments to long options are mandatory for short options

6. Which of the following command lines would convert the doubled letters and hyphens into a single instance?

( ) a. tr -s '[:alpha:]' ls.man.noh_

( ) b. tr -s '[:lower:-]' ls.man.noh_

( ) c. tr -cs '[:alpha:]-' < ls.man.noh_

( ) d. tr -s '[:alpha:-]' < ls.man.noh_

( ) e. tr -s '[:alpha:]-' < ls.man.noh_

She successfully removes the doubled characters, and stores the results in the file ls.man.clean.

[madonna@station madonna]$ tail +5 ls.man.clean | head
NAME
ls - list directory contents

SYNOPSIS
ls [OPTION]... [FILE]...

DESCRIPTION
List information about the FILEs (the current directory by default).
Sort entries alphabetically if none of -cftuSUX nor -sort.

7. After applying the correct answer from the previous question, what potential inaccuracies are in the text?

( ) a. Any line originally containing an underscore has been deleted from the text.

( ) b. Any line containing repeated punctuation characters now has only a single instance of the character.

( ) c. Any word originally containing a sequence of repeated letters now has only a single instance of the letter.

( ) d. Any line originally containing a bracket (“[ ]”) or colon has been deleted from the text.

( ) e. There should be no distortions to the original text.




8. Which of the following command lines could madonna have used to both delete the underscore characters from the file ls.man.noh, and squeeze doubled letters and hyphens, using the tr command only once?

( ) a. tr -sd '[:alpha:-]' _ < ls.man.noh

( ) b. tr -ds _ '[:alpha:]-' < ls.man.noh

( ) c. tr -csd _ '[:alpha:]-' < ls.man.noh

( ) d. tr -ds '[:alpha:]-' '[:punct:]' < ls.man.noh


The user madonna is now having trouble using the tr command to delete punctuation characters from a file. While trying to diagnose her problem, she runs the following commands.

[madonna@station madonna]$ ls
2city12.txt  ls.man.clean  ls.man.noh  ls.man.noh_  ls.man.out  para1  t
[madonna@station madonna]$ echo "test, one, two, three" | tr -d [:punct:]
es, one, wo, hree

9. What is the best advice you can give her?

( ) a. Avoid using character classes. Specify the punctuation characters literally.

( ) b. When using character classes, the tr command must be invoked with the -c command line switch.

( ) c. When specifying character classes on the command line, always protect the bracketed expressions with quotes.

( ) d. She is using the wrong syntax for specifying the character class. She should use [[:punct:]] instead.

( ) e. None of the above adequately explain her problem.

Use the following transcript to answer the next question.

[madonna@station madonna]$ echo aaabbbcccdddeee | ???????
ebbbe

10. Which of the following expressions could replace the expression ????????

( ) a. tr -s acde

( ) b. tr -s acd e

( ) c. tr -d acd e

( ) d. tr -d acd | tr -s e



Notes

1. Project Gutenberg is based at the web site http://gutenberg.net.

2. This example is inspired by a similar example found in the coreutils info page (info coreutils).



Chapter 8. Spell Checking: aspell

Key Concepts

• The aspell -c command performs interactive spell checks on files.

• The aspell -l command performs a non-interactive spell check on the standard in stream.

• The aspell dump command can be used to view the system’s master or a user’s personal dictionary.

• The commands aspell create personal and aspell merge personal can be used to create or append to a user’s personal dictionary from a word list.

Discussion

In the Red Hat Enterprise Linux distribution, the aspell utility is the primary utility for checking the spelling of text files. In this Lesson, we learn how to use aspell to interactively spell check a file and customize the spell checker with a personal dictionary.

Using aspell

When running aspell, the first argument (other than possible command line switches) is interpreted as a command, telling aspell what to do. The following commands are supported by aspell.

Table 8-1. Aspell commands

Command                        Action
-c file, check file            Perform an interactive spell check on the file file.
-l, list                       Print a list of misspelled words found in the standard in stream.
config                         Dump the current aspell configuration to standard out.
dump master|personal|repl      Dump a copy of the master word list, personal word list, or personal replacement list, respectively.
create master|personal|repl    Create the master word list, personal word list, or personal replacement list, respectively, reading entries from standard in.
merge master|personal|repl     Merge entries read from standard in into the master word list, personal word list, or personal replacement list, respectively.

The following table lists some of the more common command line switches that are used with the aspell command.

Table 8-2. Command Line Switches for the aspell Command

Switch                      Effect
-W, --ignore=N              Ignore words less than N characters. (By default, only single letters are ignored.)
--ignore-case               Ignore case when performing word comparisons.
-p, --personal=filename     Use the word list filename for the personal word list.
-x, --dont-backup           Do not create a backup file when performing the spell check.

Performing an Interactive Spell Check

The user prince has composed the following message, which he plans to email to the user elvis.

[prince@station prince]$ cat toelvis
Hey Elvis!

I heard that you were about to take the lab test for the string
procesing workbook in Red Hat Academy. IIRC, its prety
straightforward, if you’ve been keeping up with the exercises.

LOL, Prince

Before sending the message, prince uses aspell -c to perform an interactive spell check.

[prince@station prince]$ aspell -c toelvis

Upon execution, the aspell command opens an interactive session, highlighting the first recognized misspelled word.

Hey Elvis!

I heard that you were about to take the lab test for the string
procesing workbook in Red Hat Academy. IIRC, its prety
straightforward, if you’ve been keeping up with the exercises.

LOL, Prince

=====================================================================
1) processing                    6) preceding
2) precessing                    7) professing
3) precising                     8) promising
4) proceeding                    9) proposing
5) prosing
i) Ignore                        I) Ignore all
r) Replace                       R) Replace all
a) Add                           x) Exit
=====================================================================
?

At this point, prince has a "live" keyboard, meaning that single key presses will take effect without him needing to use the return key. He may choose from the following options.

Use Suggested Replacement

The aspell command will do its best to suggest replacements for the misspelled word from its library. If it has found a correct suggestion (as, in this case, it has), that suggestion can be chosen by simply pressing the numeric key associated with it.

Ignore the Word

Pressing i causes aspell to ignore this instance of the word and move on. Pressing capital I will cause aspell to ignore all instances of the word in the current file.

Replace the Word

If aspell was not able to generate an appropriate suggestion, prince may use r to manually replace the word. When finished, aspell will pick up again, first rechecking the specified replacement. If he uses capital R instead, aspell will remember the replacement and automatically replace other instances of the misspelled word.

Add the Word to the Personal Dictionary

If prince would like aspell to learn a new word, so that it will not be flagged when checking future files, he may press a to add the word to his personal dictionary.

Exit aspell

By pressing x, prince can immediately exit the interactive aspell session. Any spelling corrections already implemented will be saved.

As prince proceeds through the interactive session, aspell flags procesing, prety, IIRC, and LOL as misspelled. For the first two, prince accepts aspell’s suggestions for the correct spelling. The last two "words" are abbreviations that prince commonly uses in his emails, so he adds them to his personal dictionary. Unfortunately, because its is a legitimate word, aspell does not report prince’s misuse of it.

When finished, prince now has two files: the corrected version of toelvis, and an automatically generated backup of the original, toelvis.bak.

[prince@station prince]$ ls
toelvis  toelvis.bak
[prince@station prince]$ diff toelvis.bak toelvis
4c4
< procesing workbook in Red Hat Academy. IIRC, its prety
---
> processing workbook in Red Hat Academy. IIRC, its pretty


Performing a Non-interactive Spell Check

Using the -l command line switch, the aspell command can be used to perform spell checks in a non-interactive batch mode. Used this way, aspell simply reads standard in, and writes to standard out every word it would flag as misspelled.

In the following, suppose prince performed a non-interactive spell check before he had run the aspell session interactively.

[prince@station prince]$ aspell -l < toelvis
procesing
IIRC
prety
LOL

The aspell utility lists the four words it would flag as misspelled. After the interactive spell check, prince performs a non-interactive spell check on his backup of the original file.

[prince@station prince]$ aspell -l < toelvis.bak
procesing
prety

Because the words IIRC and LOL were added to prince’s personal dictionary, they are no longer flagged as misspelled.

Managing the Personal Dictionary

By default, the aspell command uses two dictionaries when performing spell checks: the system wide master dictionary, and a user’s personal dictionary. When prince chooses to add a word, the word gets stored in his personal dictionary. He uses aspell’s dump command to view his personal dictionary.

[prince@station prince]$ aspell dump personal
LOL
IIRC

Likewise, he could dump the system’s master dictionary as well.

[prince@station prince]$ aspell dump master | wc -l
153675

[prince@station prince]$ aspell dump master | grep "^add.*ion$"
addiction
addition
adduction

The aspell command can also automatically create a personal dictionary (if it doesn’t already exist), or merge into it (if it does) using words read from standard in. Suppose prince has a previous email message, in which he used many of his commonly used abbreviations. He would like to add all of the abbreviations found in that email to his personal dictionary. He first uses aspell -l to extract the words from the original message.

[prince@station prince]$ aspell -l < good_email.txt
FWIW
AFK
RSN
TTFN

After observing the results, he decides to add all of these words to his personal dictionary, using aspell merge personal. When he finishes, he again dumps his (expanded) personal dictionary.

[prince@station prince]$ aspell -l < good_email.txt | aspell merge personal
[prince@station prince]$ aspell dump personal
TTFN
AFK
LOL
RSN
IIRC
FWIW

What happens if prince tries to create the personal dictionary instead?

[prince@station prince]$ echo "foo" | aspell create personal
Sorry I won’t overwrite "/home/prince/.aspell.english.pws"

In aspell’s unwillingness to clobber an already existing personal dictionary, we discover where it is stored: ~/.aspell.language.pws.

Getting Help

Where would prince expect to find help for the aspell command?

[prince@station prince]$ man aspell
No manual entry for aspell

A reasonable first guess, but in this case wrong. Like most commands, aspell will generate a usage summary when called with the --help command line switch. Additional documentation can be found in the /usr/share/doc/aspell-0*/man-text/ directory (as simple text files), or /usr/share/doc/aspell-0*/man-html/ in HTML format. The following command, when executed from an X terminal, will start prince off in the HTML based documentation.

[prince@station prince]$ mozilla /usr/share/doc/aspell-0.33.7.1/man-html/index.html

Examples

Example 1. Adding Service Names to aspell’s Personal Dictionary

The user prince is commonly answering questions related to Linux’s networking services in his emails, and aspell consistently flags the conventional service names as misspelled words. He would like to add the service names found in the file /etc/services to his personal dictionary.

He first spell checks the /etc/services file non-interactively, and stores the results in ~/services.maybe.

[prince@station prince]$ aspell -l < /etc/services > services.maybe

Using the less pager to browse the file services.maybe, he finds many duplicate entries. He makes life easier for himself (and eventually aspell) by regenerating the list, removing duplicates.

[prince@station prince]$ aspell -l < /etc/services | sort | uniq > services.maybe

Browsing the file again, prince is satisfied that the list contains words he would rather not have flagged as misspelled. He adds the word list to his personal dictionary.

[prince@station prince]$ aspell merge personal < services.maybe

For confirmation, he again spell checks, non-interactively, the /etc/services file.

[prince@station prince]$ aspell -l < /etc/services

As expected, no words were flagged as misspelled.
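The duplicate-removal step in this example does not depend on aspell itself, so it can be tried in isolation. The following is a minimal sketch, assuming a made-up word list in place of real aspell -l output; the flagged function and the words it prints are hypothetical.

```shell
# flagged: a stand-in for "aspell -l < /etc/services", printing one
# flagged word per line, with repeats. The words are made up.
flagged() {
    printf '%s\n' tcpmux rje tcpmux systat rje tcpmux
}

# sort brings identical words together; uniq then drops adjacent repeats.
flagged | sort | uniq

# sort -u performs both steps in a single command and gives the same list.
flagged | sort -u
```

Sorting first matters: uniq only compares neighboring lines, so duplicates that are not adjacent would otherwise survive.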

Online Exercises

Lab Exercise
Objective: Use the aspell command to perform routine spell checks.
Estimated Time: 10 mins.

Setup

In order to prepare for this Exercise, remove any personal dictionaries (or replacement lists) you have accumulated using the following command.

[student@station student]$ rm .aspell*

Specification

1. Generate a list of all words that aspell flags as misspelled found in all files underneath the /etc/sysconfig directory, and its subdirectories. The list should be sorted in ascending alphabetical order, and duplicate words should be removed. Store the list (one word per line) in the file ~/sysconfig.spell.txt.

With creative use of the find, xargs, cat, sort, and uniq commands, this can be accomplished in one command line.

2. Copy the file /usr/share/doc/which-*/README into your home directory. Perform an interactive spell check on the file, using the aspell spell checker. Use the following policies for the following misspelled words.

a. Add the following words to your personal dictionary: gcc, stdout, texinfo, stdin, usr, csh

b. Use aspell’s suggestions to correct: explicity (should read explicitly)

c. Manually replace the word Litmaath with Smith.

d. Ignore all other flagged words.

If performed correctly, you should be able to reproduce output similar to the following.

[student@station student]$ diff README.bak README
20c20
< Maarten Litmaath called ‘which-v6’, he was using ‘-i’ as option
---
> Maarten Smith called ‘which-v6’, he was using ‘-i’ as option
59c59
< to explicity search for normal binaries, while using
---
> to explicitly search for normal binaries, while using
73c73
< ful to explicity search for normal binaries, while
---
> ful to explicitly search for normal binaries, while
[student@station student]$ aspell dump personal
stdout
usr
csh
texinfo
gcc
stdin

Deliverables

1. The file ~/sysconfig.spell.txt, which contains a list, sorted in ascending alphabetical order, of all words flagged by aspell as misspelled found in all files underneath the /etc/sysconfig directory, and its subdirectories. The file should not contain duplicate words.

2. The file ~/README, which is a copy of the file /usr/share/doc/which-*/README, which has been spell checked with the aspell command. The word explicity should be replaced with explicitly, and the word Litmaath with Smith.

3. An aspell personal dictionary that contains exactly the words gcc, stdout, texinfo, stdin, usr, and csh.



Questions

1. Which of the following command lines would start an interactive aspell spell check on the file report.txt?

( ) a. aspell report.txt

( ) b. aspell -c report.txt

( ) c. aspell -c < report.txt

( ) d. aspell < report.txt


2. Which of the following command lines would start a non-interactive aspell spell check on the file report.txt?

( ) a. aspell -l report.txt

( ) b. aspell < report.txt

( ) c. aspell -b report.txt

( ) d. aspell -l < report.txt


3. Which of the following cannot be performed when aspell flags an unrecognized word during an interactive spell check?

( ) a. The unrecognized word can be added to the system’s master dictionary.

( ) b. The unrecognized word can be added to the user’s personal dictionary.

( ) c. The unrecognized word can be replaced from a list of suggested replacements.

( ) d. The unrecognized word can be manually replaced by the user.

( ) e. All of the above actions can be performed.

4. Assuming the file mywords.txt contains a series of whitespace separated words, which of the following command lines could be used to add the words to a user’s personal dictionary?

( ) a. aspell merge mywords.txt

( ) b. aspell merge personal mywords.txt

( ) c. aspell merge personal < mywords.txt

( ) d. aspell merge < mywords.txt





5. Which of the following command lines would dump the words contained in the master dictionary to standard out?

( ) a. aspell dump master

( ) b. aspell -d master

( ) c. aspell dump

( ) d. aspell -m


6. Which of the following actions can be performed directly by the aspell command when performing a non-interactive spell check?

( ) a. The unrecognized words can be added to the system’s master dictionary.

( ) b. The unrecognized words can be added to the user’s personal dictionary.

( ) c. The unrecognized words can be replaced automatically with aspell’s first suggested replacement.

( ) d. The unrecognized words can be decorated with a +++ character sequence, so they can be easily searched for with a text editor.

( ) e. None of the above can be performed by aspell directly using a non-interactive spell check.

7. Which of the following command lines would effectively add all of the unrecognized words in the file report.txt to a user’s personal dictionary?

( ) a. aspell -l report.txt | sort | uniq | aspell merge personal

( ) b. aspell -l < report.txt | sort | uniq | aspell merge personal

( ) c. aspell < report.txt | sort | uniq | aspell merge personal

( ) d. aspell -l < report.txt | sort | uniq | aspell merge


8. Which of the following command lines would perform an interactive spell check of the file report.txt, but not create a backup file?

( ) a. aspell -x report.txt

( ) b. aspell -x -c report.txt

( ) c. aspell -X < report.txt

( ) d. aspell -b -c < report.txt





9. Which of the following would list all unrecognized words in the file report.txt which are greater than 4 characters in length?

( ) a. aspell -W4 -l < report.txt

( ) b. aspell -W3 report.txt

( ) c. aspell -W5 -c < report.txt

( ) d. aspell -W3 report.txt


10. Which of the following would replace a user’s (already existing) personal dictionary with the words found in the file mywords.txt?

( ) a. aspell create personal < mywords.txt

( ) b. aspell -c merge personal < mywords.txt

( ) c. aspell -r create personal < mywords.txt

( ) d. aspell clobber personal < mywords.txt

( ) e. Once a personal dictionary exists, it cannot be removed by the aspell command directly.


Chapter 9. Formatting Text (fmt) and Splitting Files (split)

Key Concepts

• The fmt command can reformat text to differing widths.

• Using the -p command line switch, the fmt command will only reformat text that begins with the specified prefix, preserving the prefix.

• The split command can be used to split a single file into multiple files based on either a number of lines or a number of bytes.

Discussion

The fmt Command

Motivation for the fmt Command

Hopefully, the Lessons in this Workbook encountered so far have demonstrated the powerful ways that text can be manipulated using basic Linux (and Unix) command line utilities. Because Linux provides such a useful toolkit of text manipulation commands, the data that people handle is often left as simple text. The /etc/passwd file is the classic example. Rather than embedding user definitions in some database that requires a custom utility for access, they are defined in a simple text file that anyone with knowledge of the grep command can search.

The common use of the simple text editor follows as a natural result of the common occurrence of the simple text file. We emphasize again that text editors are not word processors. Elaborate word processing applications, such as OpenOffice or AbiWord, generally store information using elaborate markup or binary formatting to define fonts, colors, and other such details about the text’s appearance. In contrast, simple text editors such as nano, vim, or gedit store just the data: what you see is what you get. As a result, users can edit text files with a text editor with much more control and predictability.

One side effect of the variety of text editors in Linux, and in particular the coexistence of text editors and word processors, is the inconsistencies with which word wrapping is handled. To a word processor, and to many HTML based text entry forms, new line characters are usually considered not worthy of the concern of users. A user begins typing text, without ever using the RETURN key, and the application decides when to wrap a line and where to insert a new line character. While this is not a problem, and perhaps even desirable, for writing a letter to a friend, it can cause significant problems when editing a line based configuration file (such as the /etc/passwd file, the /etc/hosts file, the /etc/fstab file, etc.).

As an example of the inconsistencies of various text editors, the user elvis tries a simple experiment. He types the first sentence from the previous paragraph using four different applications: the nano text editor, the vim text editor, the gedit text editor, and the OpenOffice word processor. In each case, he types the sentence without ever hitting the RETURN key, and saves the document as side_effect.extension using the default settings. The only exception is the OpenOffice word processor, whose default format uses binary encoding. For this application, elvis saved the file twice: once using the "default" settings (the OpenOffice format), and once choosing the simplest "save as text" setting possible.

Figure 9-1. Text Handling Using gedit

Figure 9-2. Text Handling Using gvim

Figure 9-3. Text Handling Using OpenOffice



Figure 9-4. Text Handling Using nano

What result does wc show? The four different applications used four different conventions for displaying and saving the simple text sentence (five, if you include the binary OpenOffice format).

[elvis@station elvis]$ wc side_effect.* 2>/dev/null
      1      31     188 side_effect.gedit
      1      31     188 side_effect.gvim
     16     109    4950 side_effect.ooffice.sxw
      0      31     187 side_effect.ooffice.txt
      3      31     190 side_effect.nano
     21     233    5703 total

The nano text editor was the only application that implemented word wrapping by default. Although elvis never hit the return key, three ASCII new line characters were inserted. The gedit and gvim applications were consistent with Linux (and Unix) convention: they did not insert new line characters in the middle of the text, but they would not let a text file end without a terminating new line character. Although consistent with each other in terms of how the file was stored, they differed in how the text was presented to the user: gedit wrapped the text at word boundaries, while gvim wrapped the text only when it could fit no more on a line. Like gedit, the OpenOffice application wrapped the text while displaying it, but did not add the conventional Linux new line to the end of the file while saving it to disk. We can’t even begin to discuss why the OpenOffice standard format took nearly 5000 bytes of binary data to store about 200 characters.
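The line counts in the first column make sense once one remembers that wc -l counts new line characters rather than visible lines of text. A brief sketch, using short made-up strings in place of the actual side_effect files:

```shell
# wc -l counts new line characters, not lines as a reader would see them.
printf 'no trailing new line' | wc -l    # reports 0, like side_effect.ooffice.txt
printf 'one new line\n' | wc -l          # reports 1, like the gedit and gvim files

# tail -c 1 extracts the last byte; od -c makes it visible (\n if present).
printf 'one new line\n' | tail -c 1 | od -c
```

The single byte of difference between the 187 byte OpenOffice text file and the 188 byte gedit and gvim files is this terminating new line.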

All of this is to say that how an application handles the word wrapping issues is not obvious to the casual user, and often, when reading text with one utility that was written by another, word wrapping issues cause problems.

Rewrapping Text with the fmt Command

The fmt command is used to rewrap text, inserting newlines at word boundaries to create lines of a specified length (75 characters by default). As a quick example, consider how the fmt command reformats the file side_effect.gvim.

[elvis@station elvis]$ cat side_effect.gvim
One side effect of the variety of text editors in Linux, and in particular the coexistence of text editors and word processors, is the inconsistencies with which word wrapping is handled.
[elvis@station elvis]$ fmt side_effect.gvim
One side effect of the variety of text editors in Linux, and in
particular the coexistence of text editors and word processors, is the
inconsistencies with which word wrapping is handled.
[elvis@station elvis]$ fmt side_effect.gvim | wc
      3      31     188

The cat command, true to its nature, performed no formatting on the file when it displayed it. The fact that the lines wrapped at 80 characters is a side effect of the terminal that was displaying it. The fmt command, on the other hand, wrapped the text at word boundaries so that no line was over 75 characters in length.
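The same behavior can be checked on any long line. The following is a minimal sketch, assuming GNU fmt and using a stand-in file (a single line of numbers generated by seq, not one of elvis's files): it rewraps the line and compares the counts before and after.

```shell
# Work in a throwaway directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# One long line with no newlines in the middle, much as gedit or gvim would save it.
seq -s ' ' 1 60 > long_line.txt

lines_before=$(wc -l < long_line.txt)
words_before=$(wc -w < long_line.txt)

# Rewrap at the default width.
fmt long_line.txt > wrapped.txt

lines_after=$(wc -l < wrapped.txt)
words_after=$(wc -w < wrapped.txt)

# Length of the longest line fmt produced.
longest=$(awk '{ if (length($0) > max) max = length($0) } END { print max + 0 }' wrapped.txt)

echo "lines: $lines_before -> $lines_after"
echo "words: $words_before -> $words_after"
echo "longest wrapped line: $longest characters"
```

The word count should be unchanged, because fmt only changes where the lines break.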

fmt Command Syntax

Like most of the text processing commands encountered in this Workbook, the fmt command interprets arguments as filenames on which to operate, or operates on standard in if none are provided. Its output is written to standard out. The following table lists command line switches that can be used to modify fmt’s behavior.

Table 9-1. Command Line Switches for the fmt Command

Switch                     Effect
-w, --width=N, -N          Format text to N columns.
-p, --prefix=STRING        Only format lines beginning with STRING.
-u, --uniform-spacing      Enforce spacing of one space between words, two spaces between sentences.
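The -u switch from the table is not exercised elsewhere in this Lesson. As a small sketch, assuming GNU fmt and a made-up input string, the following shows irregular spacing being normalized.

```shell
# Irregular spacing: three spaces after "one", four after "two",
# and three after the sentence-ending period.
messy='one   two    three.   Four'

# -u collapses runs of spaces between words to one space, and uses
# two spaces after a word that ends a sentence.
tidy=$(printf '%s\n' "$messy" | fmt -u)

printf '%s\n' "$tidy"
```

Note that GNU fmt only treats a period as ending a sentence when it is followed by at least two spaces or by the end of the line.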

Formatting to a Specific Width

The maximum width of the resulting text can be specified with the -w N command line switch, or more simply just -N, where N is the maximum line width measured in characters. In the following example, elvis reapplies the fmt command to the file side_effect.gvim, formatting it first to a width of 60 characters, and then to a width of 40 characters.

[elvis@station elvis]$ fmt -w60 side_effect.gvim
One side effect of the variety of text editors in Linux,
and in particular the coexistence of text editors and
word processors, is the inconsistencies with which word
wrapping is handled.
[elvis@station elvis]$ fmt -40 side_effect.gvim
One side effect of the variety of text
editors in Linux, and in particular the
coexistence of text editors and word
processors, is the inconsistencies with
which word wrapping is handled.
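The three spellings of the width switch should be interchangeable. The sketch below, which assumes GNU fmt and a stand-in input file generated with seq, formats the same text with each spelling and compares the results.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in input: one long line of numbers.
seq -s ' ' 1 60 > long_line.txt

# The same width, written three ways. The bare -N form is only
# recognized by GNU fmt when it is the first switch on the command line.
fmt -w 40 long_line.txt > w40_short.txt
fmt --width=40 long_line.txt > w40_long.txt
fmt -40 long_line.txt > w40_bare.txt

# Report the longest line in the formatted output.
max40=$(awk '{ if (length($0) > max) max = length($0) } END { print max + 0 }' w40_short.txt)
echo "longest line at width 40: $max40"

if cmp -s w40_short.txt w40_long.txt && cmp -s w40_short.txt w40_bare.txt; then
    echo "all three spellings produced identical output"
fi
```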




Formatting Text with a Prefix

Often, text is found with some sort of decoration or prefix. Particularly when commenting source code or scripts, all of the text of the comment needs to be marked with the appropriate comment character. The following snippet of text is found in the /usr/include/db_cxx.h header file for the C++ programming language.

//
// As a rule, each DbFoo object has exactly one underlying DB_FOO struct
// (defined in db.h) associated with it. In some cases, we inherit directly
// from the DB_FOO structure to make this relationship explicit. Often,
// the underlying C layer allocates and deallocates these structures, so
// there is no easy way to add any data to the DbFoo class. When you see
// a comment about whether data is permitted to be added, this is what
// is going on. Of course, if we need to add data to such C++ classes
// in the future, we will arrange to have an indirect pointer to the
// DB_FOO struct (as some of the classes already have).
//

Suppose a programmer edited the comment, adding the following few words on the second line.

[elvis@station elvis]$ cat cxx_comment.txt
//
// As a rule, each DbFoo object has exactly one underlying DB_FOO struct
// (defined in db.h) associated with it. In some cases, but we really don’t expect many of them, we inherit directly
// from the DB_FOO structure to make this relationship explicit. Often,
// the underlying C layer allocates and deallocates these structures, so
// there is no easy way to add any data to the DbFoo class. When you see
// a comment about whether data is permitted to be added, this is what
// is going on. Of course, if we need to add data to such C++ classes
// in the future, we will arrange to have an indirect pointer to the
// DB_FOO struct (as some of the classes already have).
//

Because each line of the text begins with a “//”, and ends with an ASCII new line character, readjusting the line to fit back into 80 characters would involve pushing some words to the next line, which would then also need to be reformatted, and so on.

Fortunately, the fmt command with the -p command line switch makes life much easier.

[elvis@station elvis]$ fmt -70 -p"// " cxx_comment.txt
// As a rule, each DbFoo object has exactly one underlying DB_FOO
// struct (defined in db.h) associated with it. In some cases,
// but we really don’t expect many of them, we inherit directly
// from the DB_FOO structure to make this relationship explicit.
// Often, the underlying C layer allocates and deallocates these
// structures, so there is no easy way to add any data to the
// DbFoo class. When you see a comment about whether data is
// permitted to be added, this is what is going on. Of course,
// if we need to add data to such C++ classes in the future, we
// will arrange to have an indirect pointer to the DB_FOO struct
// (as some of the classes already have).




The fmt command did all of the hard work, and preserved the prefix characters.
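The same technique applies to shell script comments. The following sketch assumes GNU fmt; the file script.sh and its contents are invented for illustration. It rewraps one overlong "# " comment and checks that a long line of code without the prefix passes through untouched.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A made-up script: one overlong comment line, then one long line of code.
{
    printf '# '
    seq -s ' ' 1 40
    printf '%s\n' 'total=$((1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + 11 + 12))'
} > script.sh

# Rewrap only the lines that begin with the comment prefix.
fmt -w 40 -p '# ' script.sh > rewrapped.sh

# Count the lines that carry the prefix, and those that do not.
comment_lines=$(grep -c '^# ' rewrapped.sh)
other_lines=$(grep -vc '^# ' rewrapped.sh)

echo "comment lines after rewrapping: $comment_lines"
echo "lines left alone: $other_lines"
cat rewrapped.sh
```

The code line is longer than 40 characters, but because it does not begin with the prefix, fmt should copy it through as is.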

The split Command

Dividing Files with the split Command

Suppose someone has a file that is too large to handle as a single piece. For that, or for some other reason, the split command will divide the file into smaller files, each a specified number of lines or bytes.

As an example, elvis generates the following pointless 1066 line file.

[elvis@station elvis]$ for i in $(seq 1066); do
> echo "this is line number $i of a pointless file." >> pointless.txt
> done
[elvis@station elvis]$ wc pointless.txt
 1066  9594 47929 pointless.txt
[elvis@station elvis]$ tail -5 pointless.txt
this is line number 1062 of a pointless file.
this is line number 1063 of a pointless file.
this is line number 1064 of a pointless file.
this is line number 1065 of a pointless file.
this is line number 1066 of a pointless file.

Now elvis uses the split command to divide the file into smaller files, each of 200 lines.

[elvis@station elvis]$ split -200 pointless.txt sub_pointless_
[elvis@station elvis]$ wc sub_pointless_a*
  200  1800  8892 sub_pointless_aa
  200  1800  9000 sub_pointless_ab
  200  1800  9000 sub_pointless_ac
  200  1800  9000 sub_pointless_ad
  200  1800  9001 sub_pointless_ae
   66   594  3036 sub_pointless_af
 1066  9594 47929 total
[elvis@station elvis]$ tail -5 sub_pointless_ad
this is line number 796 of a pointless file.
this is line number 797 of a pointless file.
this is line number 798 of a pointless file.
this is line number 799 of a pointless file.
this is line number 800 of a pointless file.
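Because split only cuts the file and never alters its contents, concatenating the pieces in order should give back the original. The following sketch repeats the example in a throwaway directory (the directory and the file rejoined.txt are additions for illustration) and confirms this with cmp.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the 1066 line file from the example above.
for i in $(seq 1066); do
    echo "this is line number $i of a pointless file."
done > pointless.txt

# Split into pieces of 200 lines each.
split -l 200 pointless.txt sub_pointless_

pieces=$(ls sub_pointless_* | wc -l)
last_lines=$(wc -l < sub_pointless_af)
echo "pieces: $pieces, lines in last piece: $last_lines"

# The shell expands the glob in sorted order (aa, ab, ...),
# so cat reassembles the pieces in the order split wrote them.
cat sub_pointless_* > rejoined.txt
if cmp -s pointless.txt rejoined.txt; then
    echo "rejoined file is identical to the original"
fi
```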

split Command Syntax

In addition to any command line switches, the split command expects either zero, one or two arguments.

split [SWITCHES] [FILENAME [PREFIX] ]



If called with one or two arguments, the first argument is the name of the file to split. If called with two arguments, the second argument is used as a prefix for the newly created files. If called with no arguments, or if the first argument is the special filename “-”, the split command will operate on standard in.

The action of the split command is to split FILENAME into smaller files titled PREFIXaa, PREFIXab, etc.

Table 9-2. Command Line Switches for the split Command

Switch                        Effect
-l, --lines=N, -N             Split input into files of N lines.
-b, --bytes=N [a]             Split input into files of N bytes.
-C, --line-bytes=N [a]        Split input into files of at most N bytes, but perform the split at the end of a line.
-a, --suffix-length=N         Use suffixes of N characters (default N=2).

Notes:
a. When specifying N, a single letter suffix can be included which acts as a multiplier: b=512, k=1024, and M=1024*1024.
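The difference between -b and -C is easiest to see side by side. The following is a minimal sketch, assuming GNU split and a stand-in file of the numbers 1 through 100; the prefixes bytes_ and lines_ are arbitrary names chosen for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in data: the numbers 1 to 100, one per line.
seq 1 100 > numbers.txt
total_bytes=$(wc -c < numbers.txt)

# -b cuts after exactly 50 bytes, even in the middle of a line.
split -b 50 numbers.txt bytes_

# -C puts at most 50 bytes in each piece, but only cuts at the end of a line.
split -C 50 numbers.txt lines_

byte_pieces=$(ls bytes_* | wc -l)

# Count the -C pieces that do not end with a new line character.
ragged=0
for f in lines_*; do
    [ "$(tail -c 1 "$f" | wc -l)" -eq 1 ] || ragged=$((ragged + 1))
done

echo "$total_bytes bytes split into $byte_pieces pieces with -b 50"
echo "-C pieces not ending in a newline: $ragged"
```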

Splitting Standard In

In the previous Lesson, we saw that aspell’s master dictionary can be dumped using the following command.

[elvis@station elvis]$ aspell dump master | wc
 153675  153675 1502478

The user elvis would like to store a copy of the dictionary, but he would like to break it down into files of 100 lines each. Realizing that this will create 1537 files, he sees that two letter suffixes will run out (26*26 = 676) unless he bumps the suffix length up to 3. Because he wants to specify the string dict_ as a prefix, he must supply two arguments, so he uses the special filename “-” to cause split to read from standard in.

[elvis@station dict]$ aspell dump master | split -100 -a3 - dict_
[elvis@station dict]$ ls
dict_aaa  dict_ahl  dict_aow  dict_awh  dict_bds  dict_bld  dict_bso  dict_bzz
dict_aab  dict_ahm  dict_aox  dict_awi  dict_bdt  dict_ble  dict_bsp  dict_caa
dict_aac  dict_ahn  dict_aoy  dict_awj  dict_bdu  dict_blf  dict_bsq  dict_cab
...
dict_ahb  dict_aom  dict_avx  dict_bdi  dict_bkt  dict_bse  dict_bzp  dict_cha
dict_ahc  dict_aon  dict_avy  dict_bdj  dict_bku  dict_bsf  dict_bzq  dict_chb
dict_ahd  dict_aoo  dict_avz  dict_bdk  dict_bkv  dict_bsg  dict_bzr  dict_chc
dict_ahe  dict_aop  dict_awa  dict_bdl  dict_bkw  dict_bsh  dict_bzs
dict_ahf  dict_aoq  dict_awb  dict_bdm  dict_bkx  dict_bsi  dict_bzt
dict_ahg  dict_aor  dict_awc  dict_bdn  dict_bky  dict_bsj  dict_bzu
dict_ahh  dict_aos  dict_awd  dict_bdo  dict_bkz  dict_bsk  dict_bzv
dict_ahi  dict_aot  dict_awe  dict_bdp  dict_bla  dict_bsl  dict_bzw
dict_ahj  dict_aou  dict_awf  dict_bdq  dict_blb  dict_bsm  dict_bzx
dict_ahk  dict_aov  dict_awg  dict_bdr  dict_blc  dict_bsn  dict_bzy
[elvis@station dict]$ wc dict_*
    100     100     788 dict_aaa
    100     100     790 dict_aab
    100     100    1008 dict_aac
...
    100     100    1215 dict_cha
    100     100    1206 dict_chb
     75      75     917 dict_chc
 153675  153675 1502478 total
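The suffix length reasoning can be worked out with shell arithmetic. The sketch below is illustrative only: it uses seq as a stand-in for the aspell dump (so that aspell need not be installed), and the variable names are assumptions made for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

total_lines=153675
per_file=100

# Round up: a final, partly filled file still needs a name.
files_needed=$(( (total_lines + per_file - 1) / per_file ))

# Find the smallest suffix length whose 26^N names cover every file.
suffix_len=1
capacity=26
while [ "$capacity" -lt "$files_needed" ]; do
    suffix_len=$((suffix_len + 1))
    capacity=$((capacity * 26))
done
echo "$files_needed files need a suffix of length $suffix_len"

# Stand-in for "aspell dump master": one "word" per line.
seq "$total_lines" | split -l "$per_file" -a "$suffix_len" - dict_

created=$(ls | wc -l)
echo "files created: $created"
```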

Examples

Example 1. Using fmt to Clean Email

While using the mutt terminal based mailer, elvis saves and then views the following email message.

[elvis@station elvis]$ cat email.txt
I believe the phone number of the rental property is
888-555-1212. If not, the phone number of the rental office
is 888-555-1313. I’ll have my cellphone with me, also:
888-555-1414.

On September 24 (15:32 EDT), blondie wrote:
>
> What phone numbers will everyone have in case I get lost?
>
> >> So it turns out that mapquest gives more sane die wreck shuns than would I were I to have to produce them from memory. So here’s how to get to the house assuming you have traveled to the eastern most end of I-92. This route will take you through the heart of downtown Springfield. In my opinion it’s the best way to get there because you spend the most time on the superslb that is I-92. The stretch down Market Street is narrow so drive with care. Once you turn off of Market Street take your time and gawk at the lovely historic homes. If you’ve got time to kill on the way out, Springfield’s a nice riverfront to take a stroll.

The email is composed of different included sections, each of which was presumably written by a different author using a different text editor. The first few lines are fine, but then the last included comment is all one long line.

Before replying, elvis cleans up the message using the fmt command.

[elvis@station elvis]$ fmt -p"> >> " -w60 email.txt
I believe the phone number of the rental property is
888-555-1212. If not, the phone number of the rental office
is 888-555-1313. I’ll have my cellphone with me, also:
888-555-1414.

On September 24 (15:32 EDT), blondie wrote:
>
> What phone numbers will everyone have in case I get lost?
>
> >> So it turns out that mapquest gives more sane die
> >> wreck shuns than would I were I to have to produce
> >> them from memory. So here’s how to get to the house
> >> assuming you have traveled to the eastern most end
> >> of I-92. This route will take you through the heart
> >> of downtown Springfield. In my opinion it’s the best
> >> way to get there because you spend the most time on
> >> the superslb that is I-92. The stretch down Market
> >> Street is narrow so drive with care. Once you turn
> >> off of Market Street take your time and gawk at
> >> the lovely historic homes. If you’ve got time to
> >> kill on the way out, Springfield’s a nice riverfront
> >> to take a stroll.

Notice that the fmt command only operated on lines that began with the “> >>” prefix. (In this case, there was only one.) The rest of the text was left alone.

Example 2. Using "String Processing" Tools to Manipulate Binary Data

Most of this Workbook has focused on developing a toolkit of commands for processing text. Many of the commands work equally well on bytes, without attempting to interpret the bytes into text characters.

The user elvis has created an abstract image using the gimp image manipulation program, and saved the file as clouds.pnm using the PNM format.



Figure 9-5. Elvis’s Abstract Image of Clouds clouds.pnm

In the first Lesson, the PNM format was mentioned as a simple example of encoding images. The picture is first reduced to an array of dots ("pixels"), and then the color of each pixel is encoded into three bytes of raw data, its "redness", "greenness", and "blueness", each as a value from 0 to 255.

A few lines of ASCII text are prepended to the file, to identify the format, the number of pixels in each row, the number of rows, and the "depth" of the image. (The depth is the largest integer value used to encode each color component. Using the scheme described in the previous paragraph, the image would have a depth of 255.)

After a little experimenting with the head command, elvis determines that his image file consists of four lines of ASCII text, followed by binary data.

[elvis@station elvis]$ head -4 clouds.pnm
P6                                             (1)
# CREATOR: The GIMP’s PNM Filter Version 1.0   (2)
256 256                                        (3)
255                                            (4)

Attempting to figure out the header, elvis assumes the following.

(1) The text “P6” probably acts as magic. Magic is the term for specific strings (or bytes) that identify (often binary) file formats. A collection of "magic" identifiers is cataloged in the file /usr/share/magic. (For the curious, try grep P6 /usr/share/magic.)

(2) Apparently, any line in the ASCII header that begins with a “#” is interpreted as a comment.

(3) These two numbers probably identify the number of pixels in a row, and the number of rows in the image. His image is an array of 256x256 pixels.

(4) The last number defines the depth of the image, elvis assumes.




The remainder of the file is raw data. The user elvis would like to split the image into four horizontal slices. Of course, the right way to do this would be to use an image editor, such as gimp. Instead, elvis is going to use command line tools.

He first separates the image into its header, and its raw data.

[elvis@station elvis]$ head -4 clouds.pnm > clouds.hdr
[elvis@station elvis]$ tail -n +5 clouds.pnm > clouds.dat
[elvis@station elvis]$ cat clouds.hdr
P6
# CREATOR: The GIMP’s PNM Filter Version 1.0
256 256
255
[elvis@station elvis]$ wc clouds.dat
wc: clouds.dat:1: Invalid or incomplete multibyte or wide character
      0       8  196608 clouds.dat

While the number of lines and words reported by the wc command are meaningless, the number of "characters" is really the number of bytes in the file. Performing a quick calculation, elvis determines that an image of 256x256 pixels, with each pixel requiring 3 bytes of data, should be 256*256*3=196608 bytes in length. The wc command’s character count agrees.

Next, elvis uses the split command to divide the image’s raw data into four slices, each 196608/4=49152 bytes in size.

[elvis@station elvis]$ split -b49152 clouds.dat clouds_
[elvis@station elvis]$ wc clouds.dat clouds_* 2>/dev/null
      0       8  196608 clouds.dat
      0       1   49152 clouds_aa
      0       1   49152 clouds_ab
      0       1   49152 clouds_ac
      0       8   49152 clouds_ad
      0      19  393216 total
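The byte arithmetic can be double checked in the shell before splitting anything. The sketch below does not use elvis's image: as an assumption for illustration, it fills a stand-in file fake.dat with the right number of zero bytes taken from /dev/zero.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

width=256
height=256
bytes_per_pixel=3     # one byte each for red, green, and blue
slices=4

data_bytes=$((width * height * bytes_per_pixel))
slice_bytes=$((data_bytes / slices))
echo "raw data: $data_bytes bytes, each slice: $slice_bytes bytes"

# Stand-in for clouds.dat: the same number of bytes, all zero.
head -c "$data_bytes" /dev/zero > fake.dat
split -b "$slice_bytes" fake.dat fake_

piece_count=$(ls fake_a? | wc -l)
echo "pieces: $piece_count"
```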

Now elvis has four slices of raw data from the original image, each of which contains one fourth of the original number of rows. He uses a text editor to update the header information to reflect his change, and stores the updated header in the file clouds.newhdr.

[elvis@station elvis]$ diff -u clouds.hdr clouds.newhdr
--- clouds.hdr 2003-10-10 04:40:28.000000000 -0400
+++ clouds.newhdr 2003-10-10 04:40:43.000000000 -0400
@@ -1,4 +1,4 @@
 P6
 # CREATOR: The GIMP’s PNM Filter Version 1.0
-256 256
+256 64
 255

As the diff command reveals, elvis’s only edit was to change the number which defines the number of rows from 256 to 256/4=64. Now elvis creates 4 new PNM image files by prepending the modified header to the split image data. When finished, he views his images with the "Eye of GNOME" viewer, eog.




[elvis@station elvis]$ cat clouds.newhdr clouds_aa > clouds_row1.pnm
[elvis@station elvis]$ cat clouds.newhdr clouds_ab > clouds_row2.pnm
[elvis@station elvis]$ cat clouds.newhdr clouds_ac > clouds_row3.pnm
[elvis@station elvis]$ cat clouds.newhdr clouds_ad > clouds_row4.pnm
[elvis@station elvis]$ eog clouds_row*

Figure 9-6. Row 1 of Elvis’s Split Image (clouds_row1.png)




Why would elvis want to use command line tools? One answer is precision. Most graphical image editors use mouse selections to perform these types of operations, which can lead to frustration when trying to perform exacting edits. The second answer is automation. Suppose elvis had 283 images to which he needed to perform the same operation. The process used above could be easily automated by recording the commands in a bash script. (While the need for this level of precision or automation is hard to imagine when handling abstract images, consider someone who might be handling images routinely created by a medical imaging device.)
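One way such a script might look is sketched below. It is not elvis's script: the function names slice_pnm and make_image, the scan*.pnm filenames, and the tiny synthetic test images are all hypothetical, and the sketch assumes every image has the same four line header layout as clouds.pnm and a number of rows divisible by four.

```shell
# Hypothetical helper: slice one P6 image with a four line header
# (magic, comment, "WIDTH HEIGHT", depth) into four horizontal strips.
slice_pnm() {
    image=$1
    base=${image%.pnm}
    dims=$(head -n 3 "$image" | tail -n 1)      # for example "256 256"
    width=${dims% *}
    height=${dims#* }

    tail -n +5 "$image" > "$base.dat"           # raw pixel data only
    split -b $((width * height * 3 / 4)) "$base.dat" "${base}_"

    # Same header as the original, but with one fourth of the rows.
    {
        head -n 2 "$image"
        echo "$width $((height / 4))"
        head -n 4 "$image" | tail -n 1
    } > "$base.newhdr"

    n=1
    for piece in "${base}_a"[a-d]; do
        cat "$base.newhdr" "$piece" > "${base}_row$n.pnm"
        n=$((n + 1))
    done
    rm -f "$base.dat" "$base.newhdr" "${base}_a"[a-d]
}

# Hypothetical helper: write a synthetic image NAME of WIDTH x HEIGHT
# pixels, with every data byte set to the letter A.
make_image() {
    {
        printf 'P6\n# CREATOR: synthetic test image\n%d %d\n255\n' "$2" "$3"
        head -c $(($2 * $3 * 3)) /dev/zero | tr '\0' 'A'
    } > "$1"
}

workdir=$(mktemp -d)
cd "$workdir" || exit 1
make_image scan1.pnm 8 8
make_image scan2.pnm 4 16

# The automation itself: the same operation applied to every image.
for f in scan*.pnm; do
    slice_pnm "$f"
done
ls
```

With real images, the loop would simply run over the existing files; the make_image lines exist only so that the sketch has something to work on.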



Online Exercises

Lab Exercise

Objective: Effectively use the fmt and split commands.


Specification

1. Use the grep command to print every word in the file /usr/share/dict/words which contains the text “ee”. Use the fmt command to reformat the output into lines with a width of (the default) 75 characters. Store the result in the file ee_lines.txt.

2. The file /usr/share/doc/bash*/loadables/cut.c contains a couple of large sections of comment text, whose lines all begin with the text “ *”. Use the fmt command to reformat only the comment text to a width of 40 characters. Store the result in the file ~/cut40.c.

If performed correctly, you should be able to reproduce results similar to the following.

[student@station student]$ tail -n +62 cut40.c | head
 * NEGLIGENCE OR OTHERWISE) ARISING
 * IN ANY WAY OUT OF THE USE OF THIS
 * SOFTWARE, EVEN IF ADVISED OF THE
 * POSSIBILITY OF SUCH DAMAGE.
 */

#ifndef lint
static const char copyright[] =
"@(#) Copyright (c) 1989, 1993\n\
The Regents of the University of California. All rights reserved.\n";

3. The file /usr/share/zoneinfo/zone.tab lists the locations of cities used to identify time zones and locales. Use the split command to split this file into files of 80 lines each (except, of course, for the last file, which will collect the remainder). The new files should exist in your home directory, and all have the form ~/zone_aa, where the letters aa iterate with each file.

Deliverables

1. The file ee_lines.txt, which contains every word from the file /usr/share/dict/words which contains the text “ee”, reformatted to a width of 75 characters per line.

2. The file ~/cut40.c, which contains the contents of the file /usr/share/doc/bash-*/loadables/cut.c, where all lines beginning with the characters “ *” have been reformatted to a width of 40 characters.

3. The contents of the file /usr/share/zoneinfo/zone.tab, split into files of 80 lines each, with each resulting file named ~/zone_aa, where aa iterates for each file.

Questions

1. Which of the following command lines would reformat the contents of the file email.txt to a width of 40 characters?

( ) a. fmt -w40 email.txt

( ) b. format -w40 email.txt

( ) c. fmt -W40 email.txt

( ) d. format -W40 email.txt


2. Which of the following command lines would reformat all comment lines within the shell script conv.sh (all lines that begin with “#”) to a width of 40 characters?

( ) a. fmt -p# -w40 conv.sh

( ) b. fmt -p\# -w40 conv.sh

( ) c. fmt --prefix=# -w40 conv.sh

( ) d. fmt --pre=# -w40 conv.sh


3. Which of the following command lines would reformat the contents of the file letter.txt to a width of 75 characters per line?

( ) a. fmt -75 letter.txt

( ) b. fmt < letter.txt

( ) c. fmt --width=75 letter.txt


( ) e. A and C only



4. Which of the following commands would split standard in into files of 1000 lines each which all start data_?

( ) a. split -l1000 data_

( ) b. split - data_

( ) c. split --lines=1k data_

( ) d. split -p"data_"


5. Which of the following would split the binary file data.out into files of 1 kilobyte each?

( ) a. split -b1k data.out

( ) b. split -b1024 data.out

( ) c. split --bytes=1024 data.out


( ) e. B and C only

6. Which of the following would split the contents of the file data.txt into files named data_00.txt, where 00 is replaced with a two digit file number?

( ) a. split -f "data_%02d.txt" data.txt

( ) b. split --format="data_%02d.txt" data.txt

( ) c. split --format="data_##.txt" data.txt

( ) d. split -f "data_##.txt" data.txt


7. Which of the following command lines would reformat the contents of the file report.txt to a width of 50 characters, and then split the reformatted content into files of 2000 lines each?

( ) a. fmt -50 report.txt | split -l 2000

( ) b. split -l 2000 report.txt | fmt -50

( ) c. split --lines=2000 report.txt | fmt -w50


( ) e. B and C only




8. Which of the following would split the contents of the file data.txt into files of no larger than 5000 bytes?

( ) a. split --bytes=5k data.txt

( ) b. split -b5k data.txt

( ) c. split -b5000 data.txt


( ) e. A and B only

9. Which of the following would split the contents of the file report.txt into files of 2000 lines each, where each resulting file’s filename starts with chapter_?

( ) a. split -l2k -p"chapter_" < report.txt

( ) b. split -l2k -p"chapter_" report.txt

( ) c. split -l2000 - chapter_ < report.txt


( ) e. A and C only

10. Which of the following would split the file output.dat into files of exactly 2048 bytes?

( ) a. split -b2k output.dat

( ) b. split -2048 output.dat

( ) c. split -b2b output.dat

( ) d. split -b2M output.dat

