Java Regular Expressions

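1. Regular expressions

A regular expression is a powerful and flexible text-processing tool. It is typically used to search for or replace text that matches a pattern (rule), or to extract the matching content according to some rule.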

2. Creating and using a Pattern

Use Pattern.compile(String regex) to compile a regex into a Pattern:

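import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern p = Pattern.compile("he.*ld");
Matcher m = p.matcher("hello world");
boolean b = m.matches(); // true: matches() requires the whole input to match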
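
The snippet above uses matches(), which succeeds only when the entire input matches the pattern. As a minimal sketch of how it differs from find() and lookingAt() (all three are standard Matcher methods; the class name and helper names here are made up for illustration):

```java
import java.util.regex.Pattern;

class MatchModesDemo {
    // matches(): the whole input must match the pattern.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // find(): some substring of the input must match the pattern.
    static boolean partialMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // lookingAt(): a prefix of the input must match the pattern.
    static boolean prefixMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).lookingAt();
    }

    public static void main(String[] args) {
        String input = "hello world";
        System.out.println(wholeMatch("he.*ld", input)); // true
        System.out.println(wholeMatch("wor", input));    // false
        System.out.println(partialMatch("wor", input));  // true
        System.out.println(prefixMatch("hel", input));   // true
        System.out.println(prefixMatch("wor", input));   // false
    }
}
```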

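A pattern can also be created with Pattern.compile(String regex, int flags), which takes flags that adjust the matching behavior.

Common flag	Description
`CASE_INSENSITIVE`	Case-insensitive matching (matching is case-sensitive by default; can also be enabled with the embedded flag expression `(?i)`)
`MULTILINE`	Multiline mode, in which `^` and `$` match at the start and end of every line (can also be enabled with the embedded flag expression `(?m)`)
`COMMENTS`	Permits whitespace and comments in the pattern: whitespace is ignored, and so is everything from `#` to the end of the line (can also be enabled with the embedded flag expression `(?x)`)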

String input = "hello";
Pattern p = Pattern.compile("Hello", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(input);
System.out.println(m.matches());//true
System.out.println(Pattern.matches("(?i)Hello", input));//true
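
The example above only covers CASE_INSENSITIVE. The following sketch shows the other two flags from the table; the FlagDemo class, its count helper, and the sample strings are illustrative assumptions, not part of the original article.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagDemo {
    // Counts how many times the regex is found in the input under the given flags.
    static int count(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "alpha\nbeta\ngamma";
        // Without MULTILINE, ^ matches only at the start of the whole input.
        System.out.println(count("^\\w+", 0, text));                 // 1
        // With MULTILINE, ^ also matches right after each line terminator.
        System.out.println(count("^\\w+", Pattern.MULTILINE, text)); // 3

        // With COMMENTS, whitespace and #-comments inside the pattern are ignored.
        String regex = "\\d{4}  # year\n"
                     + "-       # separator\n"
                     + "\\d{2}  # month";
        System.out.println(count(regex, Pattern.COMMENTS, "2024-05")); // 1
    }
}
```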

3. Regex constructs

Character	Description
`x`	The character x
`\\`	The backslash character
`\t`	Tab
`\n`	Newline (line feed)
`\r`	Carriage return
`\e`	Escape character
`\f`	Form feed
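
When these constructs appear in Java source, the regex is normally written as a string literal, so every backslash intended for the regex engine has to be doubled. A small sketch of that rule follows; the EscapeDemo class and the sample path are invented for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class EscapeDemo {
    // The regex \\ (one literal backslash) is written "\\\\" in Java source.
    static String[] splitOnBackslash(String input) {
        return input.split("\\\\");
    }

    // The regex \t is written "\\t" in Java source; it matches a tab character.
    static boolean containsTab(String input) {
        return Pattern.compile("\\t").matcher(input).find();
    }

    public static void main(String[] args) {
        // The literal below is the 15-character text C:\temp\log.txt
        System.out.println(Arrays.toString(splitOnBackslash("C:\\temp\\log.txt"))); // [C:, temp, log.txt]
        System.out.println(containsTab("a\tb")); // true
        System.out.println(containsTab("a b"));  // false
    }
}
```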

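Character class	Description
`[abc]`	Any one of a, b, or c
`[^abc]`	Any character except a, b, or c
`[a-zA-Z]`	Any character in a through z or A through Z
`[a-d[m-p]]`	Any character in a through d or m through p (union)
`[a-z&&[def]]`	Any one of d, e, or f (intersection)
`[a-z&&[^bc]]`	Any character in a through z except b and c (subtraction)
`[a-z&&[^m-p]]`	Any character in a through l or q through z (subtraction)
`[\u4e00-\u9fa5]`	Chinese characters (common CJK Unified Ideographs)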
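
To see the union, intersection, and subtraction forms in action, the sketch below filters a string one character at a time against a character class. The CharClassDemo class, the matching helper, and the test strings are assumptions made for this example.

```java
class CharClassDemo {
    // Returns the characters of input that match the given character class.
    static String matching(String charClass, String input) {
        StringBuilder sb = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (String.valueOf(c).matches(charClass)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "abcdefmnopxyz";
        System.out.println(matching("[a-d[m-p]]", input));    // abcdmnop
        System.out.println(matching("[a-z&&[def]]", input));  // def
        System.out.println(matching("[a-z&&[^bc]]", input));  // adefmnopxyz
        System.out.println(matching("[a-z&&[^m-p]]", input)); // abcdefxyz
        // "Java" followed by two Chinese characters, written with Unicode escapes.
        System.out.println(matching("[\\u4e00-\\u9fa5]", "Java\u6b63\u5219").length()); // 2
    }
}
```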

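Predefined character class	Description
`.`	Any character (line terminators are matched only when the `DOTALL` flag is set)
`\d`	A digit: [0-9]
`\D`	A non-digit: [^0-9]
`\w`	A word character: [a-zA-Z0-9_]
`\W`	A non-word character: [^\w]
`\s`	A whitespace character
`\S`	A non-whitespace character: [^\s]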

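Boundary matcher	Description
`^`	The beginning of a line (the beginning of the input unless `MULTILINE` is set)
`$`	The end of a line (the end of the input unless `MULTILINE` is set)
`\b`	A word boundary
`\B`	A non-word boundary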
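
A common use of `\b` is replacing a whole word without touching longer words that contain it. The following is a minimal sketch; BoundaryDemo and replaceWord are hypothetical names, and Pattern.quote and Matcher.quoteReplacement are used so that the word and the replacement are treated literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Replaces word only where it stands alone, i.e. with a word boundary on both sides.
    static String replaceWord(String input, String word, String replacement) {
        String regex = "\\b" + Pattern.quote(word) + "\\b";
        return input.replaceAll(regex, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // "concat" is left alone because its "cat" is preceded by a word character.
        System.out.println(replaceWord("cat concat cat.", "cat", "dog")); // dog concat dog.
    }
}
```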

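Greedy quantifier	Description
`X?`	X, once or not at all
`X*`	X, zero or more times
`X+`	X, one or more times
`X{n}`	X, exactly n times
`X{n,}`	X, at least n times (>=n)
`X{n,m}`	X, at least n but not more than m times (>=n && <=m)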

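Reluctant quantifier	Description
`X??`	X, once or not at all
`X*?`	X, zero or more times
`X+?`	X, one or more times
`X{n}?`	X, exactly n times
`X{n,}?`	X, at least n times (>=n)
`X{n,m}?`	X, at least n but not more than m times (>=n && <=m)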

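Possessive quantifier	Description
`X?+`	X, once or not at all
`X*+`	X, zero or more times
`X++`	X, one or more times
`X{n}+`	X, exactly n times
`X{n,}+`	X, at least n times (>=n)
`X{n,m}+`	X, at least n but not more than m times (>=n && <=m)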
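
The counted forms are easiest to understand by checking a few inputs against them. The sketch below does that with String.matches, which requires the whole input to match; QuantifierDemo and the sample patterns are illustrative assumptions.

```java
class QuantifierDemo {
    // True if the whole input matches the regex.
    static boolean isMatch(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(isMatch("\\d{3}", "123"));      // true: exactly three digits
        System.out.println(isMatch("\\d{3}", "1234"));     // false: four digits
        System.out.println(isMatch("\\d{3,}", "12345"));   // true: at least three digits
        System.out.println(isMatch("\\d{3,5}", "123456")); // false: more than five digits
        System.out.println(isMatch("colou?r", "color"));   // true: u is optional
        System.out.println(isMatch("colou?r", "colour"));  // true
    }
}
```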

See Section 5 for a detailed look at greedy, reluctant, and possessive quantifiers.

Logical operator	Description
`XY`	X followed by Y
`X|Y`	Either X or Y
`(X)`	X, as a capturing group

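Special construct	Description
`(?<name>X)`	X, as a named capturing group
`(?:X)`	X, as a non-capturing group
`(?>X)`	X, as an independent (atomic) non-capturing group
`(?=X)`	Matches a position that is followed by X (zero-width positive lookahead)
`(?!X)`	Matches a position that is not followed by X (zero-width negative lookahead)
`(?<=X)`	Matches a position that is preceded by X (zero-width positive lookbehind)
`(?<!X)`	Matches a position that is not preceded by X (zero-width negative lookbehind)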
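
The difference between capturing, non-capturing, and independent groups can be checked directly. In this sketch (GroupKindsDemo and its helpers are hypothetical names), groupCount() shows that `(?:X)` does not create a group, and the `(?>X)` case shows that an independent group does not give back what it has matched.

```java
import java.util.regex.Pattern;

class GroupKindsDemo {
    // Number of capturing groups declared in the regex.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // True if the whole input matches the regex.
    static boolean isMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(groupCount("(ab)+(c)"));   // 2
        System.out.println(groupCount("(?:ab)+(c)")); // 1: (?:ab) is not counted

        // The ordinary group can backtrack from "bc" to "b" so the final c still matches.
        System.out.println(isMatch("a(?:bc|b)c", "abc"));  // true
        // The independent group keeps "bc", so nothing is left for the final c.
        System.out.println(isMatch("a(?>bc|b)c", "abc"));  // false
        System.out.println(isMatch("a(?>bc|b)c", "abcc")); // true
    }
}
```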

See Section 6 for a detailed look at zero-width assertions.

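4. Capturing groups

Capturing groups are numbered by counting their opening parentheses from left to right. For example, the expression ((A)(B(C))) contains four such groups: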

((A)(B(C)))
(A)
(B(C))
(C)

Group 0 always stands for the entire expression.

Pattern p = Pattern.compile("((A)(B(C)))");
Matcher m = p.matcher("ABC");
if(m.find()){
	System.out.println(m.group(0));
	System.out.println(m.group(1));
	System.out.println(m.group(2));
	System.out.println(m.group(3));
	System.out.println(m.group(4));
}

Output:

ABC
ABC
A
BC
C

Each group can also be given a name:

Pattern p = Pattern.compile("(?<all>(?<alpha>A)(?<bravo>B(?<charlie>C)))");
Matcher m = p.matcher("ABC");
if(m.find()){
	System.out.println(m.group("all"));
	System.out.println(m.group("alpha"));
	System.out.println(m.group("bravo"));
	System.out.println(m.group("charlie"));
}

Output:

ABC
A
BC
C
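
Named groups are also handy in replacements, where Java lets the replacement string refer to a group as `${name}`. The following is a minimal sketch under assumed names (NamedGroupDemo, the DATE pattern, and the sample sentence are all made up for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupDemo {
    // Hypothetical pattern for dates written as yyyy-MM-dd.
    private static final Pattern DATE =
            Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})");

    // Rewrites every yyyy-MM-dd date in the text as dd/MM/yyyy.
    static String toDayFirst(String text) {
        return DATE.matcher(text).replaceAll("${day}/${month}/${year}");
    }

    // Returns the year of the first date in the text, or null if there is none.
    static String yearOf(String text) {
        Matcher m = DATE.matcher(text);
        return m.find() ? m.group("year") : null;
    }

    public static void main(String[] args) {
        String text = "Released on 2024-05-17.";
        System.out.println(toDayFirst(text)); // Released on 17/05/2024.
        System.out.println(yearOf(text));     // 2024
    }
}
```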

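5. Greedy, reluctant, and possessive quantifiers

A greedy quantifier matches as many characters as possible.
A reluctant quantifier is the opposite of a greedy one and is also called a non-greedy (lazy) quantifier: it matches as few characters as possible.
A possessive quantifier is similar to a greedy one, but it does not backtrack when the rest of the match fails.

5.1 Examples

Input: String input = "xfooxxxxxxfoo";

Greedy (.*foo)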

Pattern greedyP = Pattern.compile(".*foo");
Matcher greedyM = greedyP.matcher(input);
boolean found = false;
while(greedyM.find()){
	found = true;
	System.out.println(String.format("[Greedy:] found %s, index range: %s~%s", greedyM.group(), greedyM.start(), greedyM.end()));
}
if(!found){
	System.err.println("[Greedy:] not found!");
}

Output:

[Greedy:] found xfooxxxxxxfoo, index range: 0~13

Reluctant (.*?foo)

Pattern reluctantP = Pattern.compile(".*?foo");
Matcher reluctantM = reluctantP.matcher(input);
boolean found = false;
while(reluctantM.find()){
	found = true;
	System.out.println(String.format("[Reluctant:] found %s, index range: %s~%s", reluctantM.group(), reluctantM.start(), reluctantM.end()));
}
if(!found){
	System.err.println("[Reluctant:] not found!");
}

Output:

[Reluctant:] found xfoo, index range: 0~4
[Reluctant:] found xxxxxxfoo, index range: 4~13

Possessive (.*+foo)

Pattern possessiveP = Pattern.compile(".*+foo");
Matcher possessiveM = possessiveP.matcher(input);
boolean found = false;
while(possessiveM.find()){
	found = true;
	System.out.println(String.format("[Possessive:] found %s, index range: %s~%s", possessiveM.group(), possessiveM.start(), possessiveM.end()));
}
if(!found){
	System.out.println("[Possessive:] not found!");
}

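Output:

[Possessive:] not found!

5.2 Analysis

Greedy

A greedy quantifier matches as much as possible on its first attempt, so .* starts by consuming the entire string. The matcher then tries to match f, but no input is left, so it backtracks one character at a time. After giving back three characters (.* now matches 'xfooxxxxxx', leaving 'foo'), f matches, the two o's after it match as well, and the match is complete.

Reluctant

A reluctant quantifier matches as little as possible on its first attempt, so .* starts by consuming nothing and the whole string is still unmatched. The matcher then tries to match f, but the remaining input starts with x, so the attempt fails. The matcher then lets .* consume one character: .* matches 'x', leaving 'fooxxxxxxfoo'. This time f matches, and so do the two o's, which gives the first match at index 0~4. The matcher carries on with the rest of the input in the same way and finds a second match at index 4~13, at which point the input is used up.

Possessive

A possessive quantifier behaves like a greedy one, except that it never backtracks. Here .* consumes the entire string, leaving nothing for f to match, and because nothing is given back, the match fails and no match is found.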
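
A possessive quantifier can still take part in a successful match, as long as what it consumes never has to be given back. The sketch below reuses the article's input; the PossessiveDemo class, the findAll helper, and the x*+foo pattern are additions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PossessiveDemo {
    // Collects every match of the regex found in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // .*+ swallows the trailing foo as well and never gives it back.
        System.out.println(findAll(".*+foo", input)); // []
        // x*+ stops at the first non-x character, so foo is still available.
        System.out.println(findAll("x*+foo", input)); // [xfoo, xxxxxxfoo]
    }
}
```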

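6. Zero-width assertions

(?=X) zero-width positive lookahead

Matches a position that is followed by X.

//zero-width positive lookahead

String input = "query qatar equal QCU question acquire opqrst";
//match words in which a q is followed by u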
Pattern p = Pattern.compile("\\w*q(?=u)\\w*", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(input);
while(m.find()){
	System.out.println(String.format("found %s index range: %s~%s", m.group(), m.start(), m.end()));
}

Output:

found query index range: 0~5
found equal index range: 12~17
found question index range: 22~30
found acquire index range: 31~38

(?!X) zero-width negative lookahead

Matches a position that is not followed by X.

String input = "query qatar equal QCU question acquire opqrst";
//match words in which a q is not followed by u
Pattern p = Pattern.compile("\\w*q(?!u)\\w*", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(input);
while(m.find()){
	System.out.println(String.format("found %s index range: %s~%s", m.group(), m.start(), m.end()));
}

Output:

found qatar index range: 6~11
found QCU index range: 18~21
found opqrst index range: 39~45

(?<=X) zero-width positive lookbehind

Matches a position that is preceded by X.

String input = "cab bed daub debt herb kabob dab lamb flab";
//match words in which a b is preceded by a
Pattern p = Pattern.compile("\\w*(?<=a)b\\w*");
Matcher m = p.matcher(input);
while(m.find()){
	System.out.println(String.format("found %s index range: %s~%s", m.group(), m.start(), m.end()));
}

Output:

found cab index range: 0~3
found kabob index range: 23~28
found dab index range: 29~32
found flab index range: 38~42

(?<!X) zero-width negative lookbehind

Matches a position that is not preceded by X.

String input = "cab bed daub debt herb kabob dab lamb flab";
//match words in which a b is not preceded by a
Pattern p = Pattern.compile("\\w*(?<!a)b\\w*");
Matcher m = p.matcher(input);
while(m.find()){
	System.out.println(String.format("found %s index range: %s~%s", m.group(), m.start(), m.end()));
}

Output:

found bed index range: 4~7
found daub index range: 8~12
found debt index range: 13~17
found herb index range: 18~22
found kabob index range: 23~28
found lamb index range: 33~37
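
Because lookarounds match positions rather than characters, they can be combined to insert text without consuming anything. As one illustration, the sketch below inserts thousands separators into a string of digits; the ThousandsDemo class and the pattern are assumptions made for this example, and it expects the input to consist of digits only.

```java
class ThousandsDemo {
    // Inserts a comma at every position that has a digit before it and
    // a multiple of three digits between it and the end of the input.
    static String group(String digits) {
        return digits.replaceAll("(?<=\\d)(?=(?:\\d{3})+$)", ",");
    }

    public static void main(String[] args) {
        System.out.println(group("1234567")); // 1,234,567
        System.out.println(group("1000"));    // 1,000
        System.out.println(group("123"));     // 123
    }
}
```

The lookbehind keeps a comma from being placed at the very start of the string, and the lookahead picks out the positions that sit a multiple of three digits from the end.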

Appendix:

Java 1.8 API
Regex quantifiers
Regular expressions
Regex for Chinese characters: [\u4E00-\u9FA5]