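Common Linux Text-Processing Commands

Linux ships with many tools for filtering and processing text. This article briefly summarizes the most commonly used ones.

grep

grep searches each line of a file for a given pattern and prints every line that contains a match.

Basic syntax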

grep [OPTIONS] PATTERN [FILE...] 

grep [OPTIONS] [-e PATTERN | -f FILE] [FILE...]

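Common options

# Specify a pattern; repeat the option to match several patterns
-e PATTERN

# Read patterns from FILE, one per line
-f FILE

# Ignore case
-i, --ignore-case

# Invert the match: show only lines that do not match
-v, --invert-match

# Match whole words only
-w, --word-regexp

# Match whole lines only
-x, --line-regexp

# Print only the number of matching lines, not the lines themselves
-c, --count

# Highlight the matched text
# WHEN is never, always, or auto
--color[=WHEN]

# Do not print file names when searching multiple files
-h, --no-filename

# Show line numbers
-n, --line-number

# Print only the matched part of each line
-o, --only-matching

# Suppress error messages about nonexistent or unreadable files
-s, --no-messages

# Print nothing; only return the exit status (0 means a match was found)
-q, --quiet

# List the names of files that contain a match
-l, --files-with-matches

# List the names of files that contain no match
-L, --files-without-match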

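Examples

# Every line containing the string golf
$ grep 'golf' grep.txt

# golf or Golf
$ grep '[gG]olf' grep.txt

# Any single character between g and lf
$ grep 'g.lf' grep.txt

# A single character from a to z between g and lf
$ grep 'g[a-z]lf' grep.txt

# A single letter, lowercase or uppercase, between g and lf
$ grep 'g[a-zA-Z]lf' grep.txt

# A single character that is not a digit between g and lf
$ grep 'g[^0-9]lf' grep.txt

# o occurs zero or more times between g and lf
$ grep 'go*lf' grep.txt

# o occurs zero or one time (? needs extended regular expressions, hence -E)
$ grep -E 'go?lf' grep.txt

# o occurs one or more times (+ also needs -E)
$ grep -E 'go+lf' grep.txt

# Lines that begin with golf
$ grep '^golf' grep.txt

# Lines that end with golf
$ grep 'golf$' grep.txt

# Lines that end with .go
$ grep '\.go$' grep.txt

# Lines that contain neither foo nor bar
$ grep -v -e 'foo' -e 'bar' grep.txt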
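
The options and patterns above can be combined freely. The following is a minimal sketch rather than a reference: the helper name grep_demo and the sample lines are made up for illustration, and the file lives in a temporary directory.

```shell
# grep_demo: hypothetical helper that builds a small sample file and searches it.
grep_demo() {
    dir=$(mktemp -d)
    printf '%s\n' 'golf club' 'Golf course' 'gulf stream' 'glf' > "$dir/grep.txt"

    grep -c 'golf' "$dir/grep.txt"                    # count lines containing "golf"
    grep -in 'golf' "$dir/grep.txt"                   # ignore case, show line numbers
    grep -o 'g[a-z]lf' "$dir/grep.txt"                # print only the matched part
    grep -v -i -e 'golf' -e 'gulf' "$dir/grep.txt"    # lines matching neither pattern

    rm -rf "$dir"
}
grep_demo
```

Quoting each pattern keeps the shell from expanding characters such as [ ] and * before grep sees them.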

fgrep

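fgrep (equivalent to grep -F) searches files for a string and prints every line that contains it. Unlike grep, fgrep treats the search term as a literal string rather than a regular expression. It can search for several strings at once (separated by newlines), and it is usually faster than grep.

The -f option names a file that lists the search terms, which is handy for terms you look up often.

Example: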

# File:search_items
foo
bar

# Look up every term listed in search_items in myfile and print the matching lines
$ fgrep -f search_items myfile
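
To make the difference between pattern matching and fixed-string matching concrete, here is a small sketch. It uses the grep -F spelling, since recent GNU grep releases warn that fgrep is obsolescent; the helper name fixed_demo and the file contents are assumptions made for this example.

```shell
# fixed_demo: hypothetical helper contrasting regex matching with fixed-string matching.
fixed_demo() {
    dir=$(mktemp -d)
    printf '%s\n' 'a.b' 'axb' 'foo' 'bar baz' > "$dir/myfile"
    printf '%s\n' 'a.b' 'bar' > "$dir/search_items"

    echo 'regex:'
    grep 'a.b' "$dir/myfile"                        # "." matches any character
    echo 'fixed:'
    grep -F 'a.b' "$dir/myfile"                     # "." is a literal dot
    echo 'from file:'
    grep -F -f "$dir/search_items" "$dir/myfile"    # any of the listed strings

    rm -rf "$dir"
}
fixed_demo
```

The regex search reports both a.b and axb, while the fixed-string search reports only a.b.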

egrep

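egrep (equivalent to grep -E) is a more powerful version of grep that uses extended regular expressions. It can search for several alternatives at once, separated either by newlines (as with fgrep) or by the pipe character (|).

$ egrep 'first|second' myfile

Besides alternation, egrep also supports repetition and grouping:

? matches the preceding character zero or one time.
+ matches the preceding character one or more times.
( ) forms a group.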
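
The three operators can be tried out on input supplied through a pipe. This is only an illustrative sketch: the helper name ere_demo and the word lists are invented, and grep -E is used as the equivalent of egrep.

```shell
# ere_demo: hypothetical helper showing alternation, repetition and grouping.
ere_demo() {
    printf '%s\n' 'first' 'second' 'third' |
        grep -E 'first|second'      # alternation: either word
    printf '%s\n' 'color' 'colour' 'colr' |
        grep -E 'colou?r'           # "u" zero or one time
    printf '%s\n' 'ab' 'abab' 'aba' |
        grep -E -x '(ab)+'          # the group "ab" one or more times, whole line
}
ere_demo
```

Here -x forces the pattern to cover the whole line, which is why aba is rejected by the last search.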

cut

The cut tool extracts selected columns (fields) from each line of a file. The default field delimiter is the tab character.

Basic syntax

cut -b list [-n] [file ...]
cut -c list [file ...]
cut -f list [-w | -d delim] [-s] [file ...]

Common options

-f selects the fields to print
-c selects characters instead of fields
-d sets a delimiter other than tab

Examples

# File:sample
one    two    three
four   five   six
seven  eight  nine
ten    eleven twelve

Run:

$ cut -f2 sample

two
five
eight
eleven

Change the command to:

$ cut -f1,3 sample

This returns the first and third fields:

one    three
four   six
seven  nine
ten    twelve

Change it again to:

$ cut -f2- sample

This returns the second through the last field:

two    three
five   six
eight  nine
eleven twelve
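
The examples above only use -f with the default tab delimiter. As a rough sketch of the other two options, the following feeds made-up records to cut through a pipe; the helper name cut_demo and the sample data are assumptions for illustration.

```shell
# cut_demo: hypothetical helper exercising -f, -d and -c on made-up input.
cut_demo() {
    printf 'one\ttwo\tthree\nfour\tfive\tsix\n' | cut -f1,3        # tab-delimited fields 1 and 3
    printf 'root:x:0:0:root:/root:/bin/bash\n' | cut -d: -f1,7     # colon-delimited fields 1 and 7
    printf 'abcdefgh\n' | cut -c2-4                                 # characters 2 through 4
}
cut_demo
```

When several fields are selected, cut writes them separated by the same delimiter it read, so the second command prints root:/bin/bash.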

paste

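The paste tool merges files line by line. It takes one line from each source file and joins it with the corresponding line of the other source files.

Basic syntax

paste [-s] [-d list] file ...

Examples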

# File:fileone
one
two
three

# File:filetwo
four    seven
five    eight
six     nine

Run:

$ paste fileone filetwo

Output:

one     four    seven
two     five    eight
three   six     nine

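Tab is the default delimiter, but the -d option can change it. The option takes a list of delimiter characters that are used in turn; with only two files, just the first character (here the comma) is used.

$ paste -d", " fileone filetwo

Output: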

one,four    seven
two,five    eight
three,six     nine

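The -s option processes the files one after another instead of in parallel: all lines of fileone are joined into a single line that ends with a newline, followed by all lines of filetwo on the next line.

$ paste -s fileone filetwo

Output:

one     two     three
four    seven   five    eight   six     nine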
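
A compact way to see both modes side by side is sketched below. The helper name paste_demo and the file contents are made up for this example and differ from the sample files above.

```shell
# paste_demo: hypothetical helper comparing parallel and serial (-s) merging.
paste_demo() {
    dir=$(mktemp -d)
    printf '%s\n' one two three > "$dir/words"
    printf '%s\n' 1 2 3 > "$dir/numbers"

    paste -d, "$dir/words" "$dir/numbers"    # line N of each file, joined by ","
    paste -s -d, "$dir/words"                # all lines of one file on a single line

    rm -rf "$dir"
}
paste_demo
```

The serial form is a convenient way to turn a column of values into a comma-separated list.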

join

join is an enhanced relative of paste. It only works when the files to be joined share a common field, and by default it expects that field to be the first one.

Basic syntax

join [-a file_number | -v file_number] [-e string] [-o list] [-t char] [-1 field] [-2 field] file1 file2

Examples

# File:fileone
1   one
2   two
3   three

# File:filetwo
1   first
2   second
3   third

Run:

$ join fileone filetwo

Output:

1 one first
2 two second
3 three third

Lines whose first fields are identical are recognized and merged. The match must be exact.

Now suppose a line is added to fileone:

# File:fileone
1   one
2   two
4   four
3   three

Run:

$ join fileone filetwo

Output:

1 one first
2 two second

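The line for key 3 is missing because join expects both files to be sorted on the join field, and the new line 4 puts fileone out of order.

By default, join matches on the first field only and prints every column, but this behavior can be changed:

-1 selects the field of fileone to match on.
-2 selects the field of filetwo to match on.
-o specifies the output format as a list of file.field entries.

For example, to match the second field of fileone against the third field of filetwo:

$ join -1 2 -2 3 fileone filetwo

To print the second field of fileone and the third field of filetwo for each matched line:

$ join -o 1.2 -o 2.3 fileone filetwo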
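
Since join depends on sorted input, a common pattern is to sort before joining. The sketch below rebuilds the unsorted fileone from the example in a temporary directory; the helper name join_demo and the name of the sorted copy are assumptions made for illustration.

```shell
# join_demo: hypothetical helper that sorts the input before joining it.
join_demo() {
    dir=$(mktemp -d)
    printf '%s\n' '1 one' '2 two' '4 four' '3 three' > "$dir/fileone"
    printf '%s\n' '1 first' '2 second' '3 third' > "$dir/filetwo"

    sort "$dir/fileone" > "$dir/fileone.sorted"
    join "$dir/fileone.sorted" "$dir/filetwo"               # key 3 is matched again
    echo '--'
    join -a 1 "$dir/fileone.sorted" "$dir/filetwo"          # also print unpaired lines of file 1
    echo '--'
    join -o 1.2,2.2 "$dir/fileone.sorted" "$dir/filetwo"    # only the two value columns

    rm -rf "$dir"
}
join_demo
```

With -a 1 the unmatched line 4 four is kept at the end instead of being dropped.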

sort

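The sort command sorts the lines of the files given as File arguments and writes the result to standard output. If several files are given, sort concatenates them and sorts them as a single file.

Basic syntax

sort [OPTION]... [FILE]...

Common options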

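-f, --ignore-case
    ignore differences between upper and lower case
-b, --ignore-leading-blanks
    ignore leading blanks
-M, --month-sort
    sort by month: (unknown) < `JAN' < ... < `DEC'
-n, --numeric-sort
    sort by numeric value (the default is to sort as text)
-r, --reverse
    reverse the sort order
-u, --unique
    output only one line of each run of equal lines
-t, --field-separator=SEP
    use SEP as the field separator instead of blanks
-k, --key=POS1[,POS2]
    sort on the given field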

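Examples

Default sort

By default sort compares whole lines as strings, starting from the first character, so the output is in ascending order beginning with the letter a.

$ cat /etc/passwd | sort

Output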

adm:x:3:4:adm:/var/adm:/sbin/nologin
apache:x:48:48:Apache:/var/www:/sbin/nologin
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin

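Sorting on a given field

The fields of /etc/passwd are separated by :. The following sorts on the third field. The comparison is still textual and the key runs from field 3 to the end of the line, which is why 10 and 11 come before 1 in the output.

$ cat /etc/passwd | sort -t ':' -k 3

Output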

root:x:0:0:root:/root:/bin/bash
uucp:x:10:14:uucp:/var/spool/uucp:/sbin/nologin
operator:x:11:0:operator:/root:/sbin/nologin
bin:x:1:1:bin:/bin:/sbin/nologin
games:x:12:100:games:/usr/games:/sbin/nologin

Numeric sort

$ cat /etc/passwd | sort -t ':' -k 3n

This is equivalent to:

$ cat /etc/passwd | sort -t ':' -k 3 -n

Output

root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/bin/sh
bin:x:2:2:bin:/bin:/bin/sh

Reverse order

$ cat /etc/passwd | sort -t ':' -k 3nr

This is equivalent to:

$ cat /etc/passwd | sort -t ':' -k 3 -nr

Output

nobody:x:65534:65534:nobody:/nonexistent:/bin/sh
ntp:x:106:113::/home/ntp:/bin/false
messagebus:x:105:109::/var/run/dbus:/bin/false
sshd:x:104:65534::/var/run/sshd:/usr/sbin/nologin

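Sorting on multiple fields

Sort first in ascending order on the second through fourth characters of the sixth field, then in descending order on the first field.

$ cat /etc/passwd | sort -t':' -k 6.2,6.4 -k 1r

Output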

sync:x:4:65534:sync:/bin:/bin/sync
proxy:x:13:13:proxy:/bin:/bin/sh
bin:x:2:2:bin:/bin:/bin/sh
sys:x:3:3:sys:/dev:/bin/sh

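Removing duplicates

With -u, only one line is kept for each distinct value of the sort key, here the login shell starting at field 7.

$ cat /etc/passwd | sort -t':' -k 7 -u

Output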

root:x:0:0:root:/root:/bin/bash
syslog:x:101:102::/home/syslog:/bin/false
daemon:x:1:1:daemon:/usr/sbin:/bin/sh
sync:x:4:65534:sync:/bin:/bin/sync
sshd:x:104:65534::/var/run/sshd:/usr/sbin/nologin
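
The difference between textual and numeric ordering is easier to see on a few records. The following is a small sketch with invented three-field records, not real /etc/passwd content; the helper name sort_demo is hypothetical. It writes the key as -k3,3 so that the key stops at the end of field 3.

```shell
# sort_demo: hypothetical helper comparing textual, numeric and reversed key sorts.
sort_demo() {
    data=$(printf '%s\n' 'bin:x:2' 'games:x:12' 'root:x:0' 'uucp:x:10')

    echo 'textual:'
    printf '%s\n' "$data" | sort -t: -k3,3      # "10" and "12" sort before "2"
    echo 'numeric:'
    printf '%s\n' "$data" | sort -t: -k3,3n     # 0, 2, 10, 12
    echo 'numeric, reversed:'
    printf '%s\n' "$data" | sort -t: -k3,3nr    # 12, 10, 2, 0
}
sort_demo
```

Limiting the key with POS1,POS2 avoids surprises when later fields would otherwise take part in the comparison.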

uniq

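The uniq command removes repeated lines from a sorted file. It only detects duplicates that are adjacent, which is why uniq is so often combined with sort.

Basic syntax

uniq [-c | -d | -u] [-i] [-f num] [-s chars] [input_file [output_file]]

Common options

-i        ignore differences in case
-c        prefix each line with the number of times it occurs
-d        print only lines that are repeated
-u        print only lines that are not repeated
-f num    skip the first num fields of each line; fields are separated by blanks
-s chars  skip the first chars characters of each line

Examples

The file words contains: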

hello
world
friend
hello
world
hello

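Using uniq directly

Without sorting first, uniq prints the file unchanged, because none of the repeated lines are adjacent.

$ uniq words

Sort first, then remove duplicates

$ cat words | sort | uniq

Output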

friend
hello
world

Counting occurrences

$ sort words | uniq -c

Output

1 friend
3 hello
2 world

Showing only repeated lines

$ sort words | uniq -d

Output

hello
world

Showing only lines that are not repeated

$ sort words | uniq -u

Output

friend
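
A frequent use of sort and uniq together is a frequency table ordered by count. The sketch below reuses the words from the example; the helper name uniq_demo is hypothetical, and the final awk step is only there to trim the padding that uniq -c adds.

```shell
# uniq_demo: hypothetical helper showing adjacency and a sorted frequency table.
uniq_demo() {
    # Only adjacent duplicates collapse, so the trailing "a" survives.
    printf '%s\n' a a b a | uniq
    echo '--'
    # Sort, count, order by count (highest first), then normalize the spacing.
    printf '%s\n' hello world friend hello world hello |
        sort | uniq -c | sort -rn | awk '{print $1, $2}'
}
uniq_demo
```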

awk

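awk is really a programming language in its own right. It can express complex logic, and it also makes extracting pieces of text simple.

Basic syntax

An awk command consists of a pattern and an action made up of one or more statements:

$ awk '/pattern/ {action}' file

Note:

awk tests every record of the given files against the pattern and runs the action whenever a record matches.
awk can act as a filter in a pipeline; if no file is given, it reads from standard input (for example, the keyboard).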

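Extracting data

A very common operation is extracting and printing data. Fields of a record are referenced as follows:

$0: the whole record
$1: the first field of the record
$2: the second field of the record

Several fields can be extracted from one record by separating them with commas.

For example, to extract the first and sixth fields of /etc/passwd:

$ awk -F: '{print $1,$6}' /etc/passwd

-F sets the input field separator, which is otherwise taken from the built-in FS variable.

To print the fields with a hyphen between them, set the output field separator OFS:

$ awk -F: 'BEGIN{OFS="-"}{print $1,$6}' /etc/passwd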

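Filtering records

Records can be filtered with the comparison operators ==, !=, >, <, >=, <=

Several conditions can be combined with the logical operators && and ||

For example, to get the records whose first field equals "root":

$ awk -F: '$1=="root"' /etc/passwd

String matching

A field can be matched against a pattern: ~ is the match operator, and the pattern is written between / /.

For example, to match the fifth field against the pattern "System":

$ awk -F: '$5 ~ /System/' /etc/passwd

A whole line can also be matched, as with grep:

$ awk '/System/' /etc/passwd

Use ! to negate the match:

$ awk -F: '$5 !~ /System/' /etc/passwd
$ awk '!/System/' /etc/passwd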
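
Comparisons, pattern matches and field extraction can be combined in a single awk program. The sketch below runs on three invented passwd-style records rather than the real /etc/passwd, so the result does not depend on the machine; the helper name awk_filter_demo is hypothetical.

```shell
# awk_filter_demo: hypothetical helper combining a comparison with a pattern match.
awk_filter_demo() {
    data=$(printf '%s\n' \
        'root:x:0:0:System Administrator:/root:/bin/sh' \
        'alice:x:1000:1000:Alice:/home/alice:/bin/bash' \
        'daemon:x:1:1:System Services:/:/usr/bin/false')

    # UID of at least 1 and a description containing "System"
    printf '%s\n' "$data" | awk -F: '$3 >= 1 && $5 ~ /System/ {print $1}'
    # Description without "System"; print name and home joined by "-"
    printf '%s\n' "$data" | awk -F: -v OFS=- '$5 !~ /System/ {print $1, $6}'
}
awk_filter_demo
```

Passing OFS with -v sets the output separator before the first record is read, which has the same effect as assigning it in a BEGIN block.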

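Splitting files

The redirection operator > can be used to split the input into several files.

The following command splits the output of ps by the value of the first column, writing each user's lines to a file named after that user:

$ ps aux | awk '$1=="root"||$1=="myusername" {print > $1}'

Statistics

The following command computes the total size of all C, C++ and header files:

$ ls -l *.cpp *.c *.h | awk '{sum+=$5} END {print sum}'

To total the memory used by each user's processes (NR!=1 skips the header line):

$ ps aux | awk 'NR!=1{a[$1]+=$6;} END { for(i in a) print i ", " a[i]"KB";}'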
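
The per-user total relies on an associative array keyed by the first column. Because the output of ps differs from machine to machine, the sketch below applies the same idea to a tiny invented table with the memory figure in column 3; the helper name awk_sum_demo and the data are assumptions for illustration.

```shell
# awk_sum_demo: hypothetical helper that sums a column per key with an awk array.
awk_sum_demo() {
    printf '%s\n' 'USER PID RSS' 'root 1 100' 'alice 7 250' 'root 9 50' |
        awk 'NR != 1 { rss[$1] += $3 } END { for (u in rss) print u, rss[u] "KB" }' |
        sort    # "for (u in rss)" visits keys in no particular order
}
awk_sum_demo
```

The trailing sort is there because awk does not guarantee any iteration order for array keys.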

head

The head tool prints the beginning of each file (the first 10 lines by default). If files are given it reads them; otherwise it reads standard input.

Basic syntax

head [-n count | -c bytes] [file ...]

Examples

# display the first 100 lines of a file
$ head -n 100 myfile

The -c option specifies the number of bytes to display.

# display the first 2 bytes of a file
$ head -c 2 myfile

tail

The tail tool prints the end of each file (the last 10 lines by default). If files are given it reads them; otherwise it reads standard input.

Basic syntax

tail [-F | -f | -r] [-q] [-b number | -c number | -n number] [file ...]

Examples

# display the last 100 lines of a file
$ tail -n 100 myfile

The -c option specifies the number of bytes to display.

# display the last 2 bytes of a file
$ tail -c 2 myfile
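
head and tail can be chained to pick a range of lines out of the middle of a stream. The following is a minimal sketch that uses seq to generate the numbers 1 to 10 as input; the helper name range_demo is hypothetical.

```shell
# range_demo: hypothetical helper selecting line ranges with head and tail.
range_demo() {
    seq 1 10 | head -n 3                  # first three lines
    seq 1 10 | tail -n 2                  # last two lines
    seq 1 10 | head -n 6 | tail -n 2      # lines 5 and 6
    printf 'abcdef\n' | head -c 2; echo   # first two bytes, then a newline
}
range_demo
```

Taking the first 6 lines and then the last 2 of those yields lines 5 and 6.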

wc

wc counts the lines, words, and characters or bytes in a file.

Basic syntax

wc [-clmw] [file ...]

Examples

Run wc on /etc/passwd:

$ wc /etc/passwd

Output:

40 45 1719 /etc/passwd

Here 40 is the number of lines, 45 the number of words, and 1719 the number of bytes.

With individual options:

$ wc -l /etc/passwd  # count lines
40 /etc/passwd       # the system has 40 accounts

$ wc -w /etc/passwd  # count words
45 /etc/passwd

$ wc -m /etc/passwd  # count characters (use -c to count bytes)
1719 /etc/passwd
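
In scripts, usually only the number is wanted, without the file name. One way to get it, sketched here under the assumption of a small made-up file, is to feed the file on standard input; the helper name wc_demo is hypothetical, and tr strips the padding that some wc implementations add.

```shell
# wc_demo: hypothetical helper capturing bare counts from wc.
wc_demo() {
    dir=$(mktemp -d)
    printf 'one two\nthree\n' > "$dir/sample"

    # Reading from standard input makes wc print the number without a file name.
    lines=$(wc -l < "$dir/sample" | tr -d ' ')
    words=$(wc -w < "$dir/sample" | tr -d ' ')
    bytes=$(wc -c < "$dir/sample" | tr -d ' ')
    echo "$lines lines, $words words, $bytes bytes"

    rm -rf "$dir"
}
wc_demo
```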

References

http://man7.org/linux/man-pages/dir_section_1.html

http://www.ibm.com/developerworks/cn/linux/l-textutils.html