Regular Expression

Knowledge Learning#

Regular expressions are a powerful text processing tool that can help you quickly search, replace, match, and analyze text. Here is some basic regular expression syntax, with examples:

Basic Character Matching#

a: Matches the character "a".
Example: The regular expression a can match the first 'a' in the string "apple".

Dot (.)#

.: Matches any single character (except for newline).
Example: The regular expression h.t can match "hat", "hot", "hit", etc.

Asterisk (*)#

*: Matches the preceding character zero or more times.
Example: The regular expression ho* can match "h", "ho", "hoo", "hooo", etc.

Plus (+)#

+: Matches the preceding character one or more times.
Example: The regular expression ho+ can match "ho", "hoo", "hooo", etc., but cannot match "h".

Question Mark (?)#

?: Matches the preceding character zero or one time.
Example: The regular expression ho? can match "h" and "ho".
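
The dot and the three quantifiers above can be checked side by side. This is a minimal sketch in Python's built-in re module; the choice of Python and the helper name fits are assumptions made for illustration, and other regex flavors may differ in details.

```python
import re

def fits(pattern: str, text: str) -> bool:
    """Return True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

candidates = ["h", "ho", "hoo", "hooo"]
star = [s for s in candidates if fits(r"ho*", s)]      # zero or more "o"
plus = [s for s in candidates if fits(r"ho+", s)]      # one or more "o"
optional = [s for s in candidates if fits(r"ho?", s)]  # zero or one "o"
dot = [s for s in ["hat", "hot", "ht"] if fits(r"h.t", s)]  # exactly one character between h and t

print(star, plus, optional, dot)
```

re.fullmatch is used so that each candidate must fit the pattern completely; re.search would also accept a match found inside a longer string.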

Character Set ([ ])#

[abc]: Matches any one of the characters in the brackets (a, b, or c).
Example: The regular expression [ch]at can match "cat" and "hat".

Exclusion Character Set ([^ ])#

[^abc]: Matches any character not in the brackets.
Example: The regular expression [^a]n can match "bn", "cn", etc., but not "an".

Range Symbol (-)#

[a-z]: Matches any lowercase letter.
[A-Z]: Matches any uppercase letter.
[0-9]: Matches any digit.
Example: The regular expression [A-Z]at can match "Cat", "Hat", etc., but not "cat".
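
A short Python sketch of character sets, exclusion sets, and ranges, using the examples from the last three sections (the word lists are made up for illustration):

```python
import re

words = ["cat", "hat", "bat", "Cat"]

# [ch]at: first character must be "c" or "h".
c_or_h = [w for w in words if re.fullmatch(r"[ch]at", w)]

# [A-Z]at: first character must be an uppercase letter.
upper_first = [w for w in words if re.fullmatch(r"[A-Z]at", w)]

# [^a]n: any character except "a", followed by "n".
not_a = [w for w in ["an", "bn", "cn"] if re.fullmatch(r"[^a]n", w)]

print(c_or_h, upper_first, not_a)
```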

Digit (\d) and Non-digit (\D)#

\d: Matches any digit, equivalent to [0-9].
\D: Matches any non-digit character.
Example: The regular expression \d\d can match "12", "45", etc.

Word Character (\w) and Non-word Character (\W)#

\w: Matches any word character (including letters, digits, and underscores).
\W: Matches any non-word character.
Example: The regular expression \w\w can match "ab", "12", "a1", etc.

Boundary Matchers (\b and \B)#

\b: Matches a word boundary.
\B: Matches a non-word boundary.
Example: The regular expression \bcat\b can match "The cat sat." but not "caterpillar".
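
The shorthand classes \d and \w and the boundary matcher \b can be tried out as follows. This is a sketch in Python; note as a caveat that for text strings Python's \d and \w also accept non-ASCII digits and letters unless the re.ASCII flag is given.

```python
import re

digit_pairs = re.findall(r"\d\d", "12 apples, 45 pears")  # two digits in a row
word_pairs = re.findall(r"\w\w", "ab 12 a1")              # two word characters in a row

whole_word = re.search(r"\bcat\b", "The cat sat.")   # "cat" stands alone as a word
inside_word = re.search(r"\bcat\b", "caterpillar")   # no word boundary after "cat"

print(digit_pairs, word_pairs, whole_word is not None, inside_word is not None)
```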

Escape Character (\)#

\: Placed before a character that has a special meaning, such as ., *, ?, or +, so that the character is matched literally.
Example: To match a literal ".", use the regular expression \. (a backslash followed by a dot).
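
The effect of escaping can be seen by comparing an unescaped and an escaped dot. The sketch below is in Python; re.escape is a standard-library helper that escapes a whole piece of text, and the sample text is made up.

```python
import re

any_char = re.findall(r".", "a.b")      # unescaped dot: every character matches
literal_dot = re.findall(r"\.", "a.b")  # escaped dot: only the "." itself matches

# re.escape builds a pattern that matches the given text literally.
text = "1+1=2?"
escaped = re.escape(text)

print(any_char, literal_dot)
print(re.fullmatch(escaped, text) is not None)  # escaped pattern matches the text itself
print(re.fullmatch(text, text) is not None)     # unescaped, "+" and "?" act as quantifiers
```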

Anchors (^ and $)#

^: Matches the start of a string.
$: Matches the end of a string.
Example: The regular expression ^cat can match "cat" and "caterpillar", but not "scatter".

Grouping ( )#

( ): Groups multiple characters into a single unit and captures the text it matches; the | operator can be used inside the group to list alternatives.
Example: The regular expression (ab|cd) can match "ab" or "cd".
(?: ): Used to create a non-capturing group that does not capture the matched content.
Example: The regular expression (?:ab|cd) can match "ab" or "cd", but will not capture the matched content.
|: Creates a logical OR between two or more patterns, either inside a group or on its own.
Example: The regular expression a|b can match "a" or "b".
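
The difference between a capturing and a non-capturing group shows up in what the match object stores. A minimal Python sketch:

```python
import re

captured = re.fullmatch(r"(ab|cd)", "cd")      # capturing group
uncaptured = re.fullmatch(r"(?:ab|cd)", "cd")  # non-capturing group
either = re.findall(r"a|b", "cab")             # alternation without any group

print(captured.group(1))    # text stored by group 1
print(uncaptured.group(0))  # the overall match is still available
print(uncaptured.groups())  # empty tuple: nothing was captured
print(either)
```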

Relationship Between Capture Groups#

In regular expressions, when you use multiple capture groups consecutively, their relationship is "sequential AND". This means that for the entire expression to match successfully, each capture group must find a match in the target string in the order they appear, and each capture group's match must immediately follow the previous capture group's match.

For example, consider the regular expression (A)(B):

This expression contains two capture groups: (A) and (B).
For a successful match, the target string must first contain a match for A, followed immediately by a match for B.
In this case, the string AB would successfully match because it first contains A, followed by B.

However, it is important to note that assertions, such as the positive lookahead (?=...) and the negative lookahead (?!...), are not capture groups even though they are also written with parentheses. An assertion does not consume characters (i.e., it does not change the current matching position of the regular expression engine). This means that several assertions written one after another all check their conditions at the same position, without requiring the text they inspect to be adjacent. For example, the regular expression ^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]) checks three conditions at the same starting position of the string: at least one digit, at least one lowercase letter, and at least one uppercase letter.

Thus, while consecutive capture groups follow "sequential AND", parenthesized constructs that are assertions behave differently: they all apply at the position where they are written.
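
Both behaviors can be observed in a small Python sketch: consecutive groups must match adjacent text in order, while consecutive lookaheads all inspect the string from the same position and consume nothing. The sample strings are made up.

```python
import re

# Sequential AND: (A) must be followed immediately by (B).
in_order = re.fullmatch(r"(A)(B)", "AB")
reversed_order = re.fullmatch(r"(A)(B)", "BA")
not_adjacent = re.search(r"(A)(B)", "A B")

# Three lookaheads, all evaluated at position 0.
has_all = re.compile(r"^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])")
ok = has_all.search("z9Q")
missing_digit = has_all.search("zzQ")

print(in_order.groups())  # both groups captured
print(ok.span())          # zero-length match: the lookaheads consumed nothing
```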

How is the Sequential Relationship of Capture Groups Achieved?#

The sequential relationship of capture groups in regular expressions is achieved by the regular expression engine evaluating each part of the expression from left to right. This process generally includes the following steps:

Start Matching: The regular expression engine begins searching from the start of the string or from the end of the last successful match.
Evaluate One by One: The engine reads the first element of the regular expression (which may be a character, character set, capture group, or assertion) and attempts to find a match at the current position. If successful, the engine moves to the position after the matched part, ready to evaluate the next element.
Continue Forward: The engine continues processing the next element of the regular expression in order, moving forward each time a match is successful.
Capture Group Processing: When the engine encounters a capture group, it attempts to match the expression within the capture group. If the capture group matches successfully, the matched content is stored for later reference. The engine then continues moving forward from the position after the capture group, processing the remaining parts of the regular expression.
Sequential Matching: Since the regular expression engine processes each element of the regular expression in order, the sequential relationship between capture groups (and other expression elements) is naturally formed. Each capture group must find a match after the content matched by the previous capture group, thus achieving "sequential AND".
Overall Match Success or Failure: If all parts of the regular expression match successfully in order, the entire expression matches successfully; if any part fails, the entire matching attempt fails, and the engine may restart the whole process from the next position in the string (depending on the specific structure and matching mode of the regular expression).

In this way, the regular expression engine ensures that the capture groups (and other elements) in the expression must find matches in the target string in the order they appear, thus establishing their sequential relationship.
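
The restart behavior described in the steps above can be observed with a search that fails at several starting positions before it succeeds. This is a sketch in Python; the sample text is made up, and the comments describe the conceptual steps rather than the engine's internal implementation.

```python
import re

text = "a-1 12-34"

# Position 0 ("a") and position 1 ("-") are not digits, so (\d+) fails there.
# At position 2, (\d+) matches "1" but the next character is a space, not "-".
# At position 4, (\d+) matches "12", "-" matches, and (\d+) matches "34".
m = re.search(r"(\d+)-(\d+)", text)

print(m.span())                # where the overall match was found
print(m.span(1), m.span(2))    # group 2 starts after group 1 and the "-"
print(m.groups())
```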

Quantifiers ({ })#

{n}: Matches the preceding character exactly n times.
{n,}: Matches the preceding character at least n times.
{n,m}: Matches the preceding character at least n times but not more than m times.
Example: The regular expression a{2,4} can match "aa", "aaa", or "aaaa".
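
A brief Python sketch of the three brace forms. The last line also shows that {n,m} is greedy by default, so a search takes as many repetitions as allowed.

```python
import re

candidates = ["a", "aa", "aaa", "aaaa", "aaaaa"]

two_to_four = [s for s in candidates if re.fullmatch(r"a{2,4}", s)]
exactly_three = [s for s in candidates if re.fullmatch(r"a{3}", s)]
at_least_three = [s for s in candidates if re.fullmatch(r"a{3,}", s)]

# Inside a longer run, a{2,4} takes the maximum of four characters.
greedy = re.search(r"a{2,4}", "aaaaa").group()

print(two_to_four, exactly_three, at_least_three, greedy)
```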

\s, \S, \t, and \n#

\s: Matches any whitespace character, including spaces, tabs, newlines, etc.
\S: Matches any non-whitespace character.
\t: Matches a tab character.
\n: Matches a newline character.
Example: The regular expression \s+ can match one or more whitespace characters.
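
A common use of \s+ is to split or normalize text with irregular whitespace. A minimal Python sketch with a made-up input string:

```python
import re

messy = "one  two\tthree\nfour"

tokens = re.split(r"\s+", messy)         # split on runs of whitespace
normalized = re.sub(r"\s+", " ", messy)  # collapse each run into one space
non_space = re.findall(r"\S+", messy)    # runs of non-whitespace characters

print(tokens)
print(normalized)
```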

Assertions#

In regular expressions, assertions are used to define certain conditions during the matching process without consuming characters (i.e., not moving to the next position in the string). Assertions can be seen as checks performed at specific positions that determine whether a match is successful, but they do not affect the actual text content matched. Common types of assertions include:

Lookahead Assertion: Lookahead assertions are used to check the text following a certain position to determine if specific conditions are met. They are divided into positive lookahead and negative lookahead.
- Positive Lookahead Assertion ((?=pattern)): The assertion succeeds only if pattern can match after the current position. For example, q(?=u) will match the "q" in "quiet" but will not match the "q" in "Iraq".
- Negative Lookahead Assertion ((?!pattern)): The assertion succeeds only if pattern cannot match after the current position. For example, q(?!u) will match the "q" in "Iraq" but will not match the "q" in "quiet".
Lookbehind Assertion: Lookbehind assertions are used to check the text preceding a certain position to determine if specific conditions are met. They are also divided into positive lookbehind and negative lookbehind.
- Positive Lookbehind Assertion ((?<=pattern)): The assertion succeeds only if pattern can match before the current position. For example, (?<=\$)\d+ will match "100" in "$100" but will not match "100 dollars".
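- Negative Lookbehind Assertion ((?<!pattern)): The assertion succeeds only if pattern cannot match before the current position. For example, (?<!\$)\d+ will match "100" in "100 dollars". In "$100" it cannot start at the "1", because that digit is preceded by "$", so it does not match "100"; it can still match "00" starting at the second digit, which is preceded by "1" rather than "$".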
Word Boundary Assertions (\b and \B): Used to determine whether a character is at a word boundary.
- \b: Matches a word boundary, i.e., the position between a word character (\w) and a non-word character, or between a word character and the start or end of the string. For example, \bword\b can match "word" in "word is" but not in "swordfish".
- \B: Matches a non-word boundary.
Start and End of String Assertions (^ and $): These assertions are used to match the start and end of a string, respectively.
- ^: Matches the start of a string. For example, ^Hello will match strings that start with "Hello".
- $: Matches the end of a string. For example, world$ will match strings that end with "world".

Assertions are very powerful tools in regular expressions, allowing for complex conditional matching without actually including the matched text. This is particularly useful in handling complex text patterns, such as in password validation, data validation, and text analysis.
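
The lookahead and lookbehind examples above can be reproduced with a short Python sketch. The spans show that only the "q" or the digits are part of the match, not the text inspected by the assertion.

```python
import re

# Lookahead: check what follows, without including it in the match.
q_before_u = re.search(r"q(?=u)", "quiet")
q_no_u_ahead = re.search(r"q(?=u)", "Iraq")
q_not_before_u = re.search(r"q(?!u)", "Iraq")

# Lookbehind: check what precedes, without including it in the match.
after_dollar = re.findall(r"(?<=\$)\d+", "$100")
no_dollar = re.findall(r"(?<=\$)\d+", "100 dollars")
not_after_dollar = re.findall(r"(?<!\$)\d+", "100 dollars")
partial = re.findall(r"(?<!\$)\d+", "$100")  # starts at the second digit

print(q_before_u.span(), q_not_before_u.span())
print(after_dollar, no_dollar, not_after_dollar, partial)
```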

What Happens When an Assertion Fails?#

When an assertion fails, it means that the text at the current check position does not meet the conditions specified by the assertion. This will cause the entire pattern match to fail at that position, and the regular expression engine may continue to search for new positions in the text to attempt to match again. Here are the failure scenarios for each type of assertion along with examples:

Positive Lookahead Assertion Failure ((?=pattern)): The assertion fails when the text following does not match pattern.
- Example: The regular expression X(?=Y) aims to match X that is immediately followed by Y. In the string "XY", it will match X because X is followed by Y. But in the string "XA", since X is followed by A instead of Y, the assertion fails, and thus it does not match.
Negative Lookahead Assertion Failure ((?!pattern)): The assertion fails when the text following matches pattern.
- Example: The regular expression X(?!Y) aims to match X that is not immediately followed by Y. In the string "XA", it will match X because X is not followed by Y. But in the string "XY", since X is followed by Y, the assertion fails, and thus it does not match.
Positive Lookbehind Assertion Failure ((?<=pattern)): The assertion fails when the text preceding does not match pattern.
- Example: The regular expression (?<=Y)X aims to match X that is preceded by Y. In the string "YX", it will match X because X is preceded by Y. But in the string "AX", since X is preceded by A instead of Y, the assertion fails, and thus it does not match.
Negative Lookbehind Assertion Failure ((?<!pattern)): The assertion fails when the text preceding matches pattern.
- Example: The regular expression (?<!Y)X aims to match X that is not preceded by Y. In the string "AX", it will match X because X is not preceded by Y. But in the string "YX", since X is preceded by Y, the assertion fails, and thus it does not match.
Word Boundary Assertion Failure (\b or \B): The assertion fails when \b is expected at a word boundary but the position is not at a word boundary, or \B is expected at a non-word boundary but the position is at a word boundary.
- Example (failure of \b): The regular expression \bword\b aims to match the complete word "word". In the string "word is", it will match "word" because "word" is surrounded by word boundaries. But in the string "swordfish", since "word" is not surrounded by word boundaries, the \b assertion fails, and thus it does not match.
- Example (failure of \B): The regular expression \Bis\B aims to match "is" surrounded by other letters. In the string "visit", it will match "is" because "is" is not at a word boundary. But in the string "island", "is" is preceded by a word boundary, so the \B assertion fails, and thus it does not match.

Assertion failure means that the regular expression engine cannot find a match that satisfies the entire pattern at the current position. Depending on the other parts of the regular expression and the specific engine implementation, the engine may continue trying to match from other positions in the string.
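
A Python sketch of assertion failure, using the examples above plus one made-up string ("XAXY") that shows the engine moving on to a later position after a failure:

```python
import re

# The assertion fails at the first X (followed by A), so the engine
# tries later positions and succeeds at the second X (followed by Y).
retry = re.search(r"X(?=Y)", "XAXY")
never = re.search(r"X(?=Y)", "XA")

# Word boundary checks.
whole = re.search(r"\bword\b", "word is")
inside = re.search(r"\bword\b", "swordfish")
mid_word = re.search(r"\Bis\B", "visit")
at_start = re.search(r"\Bis\B", "island")

print(retry.span(), mid_word.span())
```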

^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(.{8,})$
This pattern has three lookahead assertions, all evaluated at the start of the string. Each one checks whether some run of characters is followed by a digit, a lowercase letter, or an uppercase letter, respectively. After the checks pass, (.{8,}) does the actual matching: . stands for any character and {8,} requires at least 8 of them.
(?=.*\d) checks for at least one digit, (?=.*[a-z]) checks for at least one lowercase letter, and (?=.*[A-Z]) checks for at least one uppercase letter.

Where Are the Boundaries of Assertions?#

The regular expression ^(?=.*[A-Z])$ seems to require at least one uppercase letter in the string, but in reality, this expression will not match any string containing characters, including "aaSin12112". This is because the construction of this expression leads to a logical contradiction. Here are the components of this expression:

^ and $: These two symbols represent the start and end of the string, respectively. When they are adjacent, it means that there cannot be any characters between them, which corresponds to an empty string.
(?=.*[A-Z]): This is a positive lookahead assertion that requires at least one uppercase letter to exist after the current position. This assertion itself does not consume any characters (i.e., it does not move the matching position), it merely checks if the condition is met.

When this assertion is combined with ^ and $, ^(?=.*[A-Z])$ effectively requires the string to be empty (because ^ and $ are directly adjacent), while also requiring that there exists an uppercase letter somewhere in this empty string, which is clearly impossible to satisfy. Therefore, this regular expression will not match any string containing characters, including those with uppercase letters.

If you want to match a string that contains at least one uppercase letter, you should remove the direct connection between ^ and $, allowing the regular expression to permit the existence of other characters, for example: ^(?=.*[A-Z]).+$. Here, .+ matches at least one character, ensuring that the string is not empty, while (?=.*[A-Z]) ensures that the string contains at least one uppercase letter.
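
The contradiction can be confirmed with a quick Python sketch: the first pattern rejects every sample, while the corrected pattern accepts strings that contain an uppercase letter. The sample strings other than "aaSin12112" are made up.

```python
import re

contradictory = re.compile(r"^(?=.*[A-Z])$")
corrected = re.compile(r"^(?=.*[A-Z]).+$")

samples = ["", "A", "abc", "aaSin12112"]

rejected_all = [s for s in samples if contradictory.search(s)]
accepted = [s for s in samples if corrected.search(s)]

print(rejected_all)  # nothing satisfies both "empty" and "has an uppercase letter"
print(accepted)
```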

The standalone (?=.*[A-Z]) is a positive lookahead assertion that checks whether there is at least one uppercase letter in the string. This assertion itself does not match any characters but sets a condition that must be met before any actual matching occurs.

What About (?=.*[A-Z])?#

In this specific assertion:

(?= ... ) is the start and end of the assertion, indicating this is a lookahead operation that will not consume any characters.
.* indicates any number of any characters (except for newline), meaning the uppercase letter can appear anywhere in the string.
[A-Z] matches any uppercase letter.

When you use (?=.*[A-Z]) alone, you are saying: "After the current position, I expect to find zero or more characters followed by an uppercase letter." However, since this expression is just an assertion and does not specify which characters to match, a successful search yields only a zero-length match: the condition is confirmed, but the matched text is empty.

To make this assertion meaningful as a regular expression that can match actual strings, you need to add some expression parts that can match characters. For example, if you want to match any string containing at least one uppercase letter, you could use the following expression:

(?=.*[A-Z]).*

In this expression, . matches any single character, and * indicates matching zero or more of the preceding elements. Thus, .* will match any length of any character sequence, ensuring that the entire expression can match any string containing at least one uppercase letter.
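
A minimal Python sketch comparing the assertion on its own with the assertion followed by .* (the sample strings are made up):

```python
import re

alone = re.search(r"(?=.*[A-Z])", "abC")
with_body = re.search(r"(?=.*[A-Z]).*", "abC")
no_upper = re.search(r"(?=.*[A-Z]).*", "abc")

print(repr(alone.group()), alone.span())  # the assertion alone matches no text
print(repr(with_body.group()))            # .* supplies the matched text
print(no_upper)                           # no uppercase letter anywhere
```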

How Are Assertions Executed?#

Assertions are a prerequisite condition for matching the corresponding string!
In the regular expression (?=.*[A-Z]).*, the execution process of the assertion (?=.*[A-Z]) follows the way the regular expression engine works. This process can be roughly broken down into the following steps:

Engine Locates the Start of the String: The regular expression engine starts working from the beginning of the string.
Perform Assertion Check: When the engine encounters the positive lookahead assertion (?=.*[A-Z]), it looks ahead (but does not move the matching point) to check whether the character sequence following the current position satisfies the pattern .*[A-Z].
- .* indicates any number of any characters (except for newline), meaning that the uppercase letter [A-Z] can be located anywhere in the string.
- [A-Z] indicates any uppercase letter.
Success or Failure of the Assertion:
- If there exists at least one uppercase letter in the character sequence starting from the current position, even if that uppercase letter is not at the current position, the assertion will succeed. This is because .* allows any number of any characters to appear before the uppercase letter.
- If the assertion succeeds, the engine will continue to execute the other parts of the regular expression, but its matching position remains at the original starting position because lookahead assertions do not consume any characters.
- If there is no uppercase letter in the character sequence starting from the current position, the assertion will fail, and the entire expression matching attempt will fail at this starting position. The engine may attempt to restart the entire matching process from the next position in the string (depending on the specific usage and context of the regular expression).
Processing .*: If the assertion succeeds, the regular expression engine will then encounter the .* part.
- This will match any number of any characters starting from the current position (which is the start of the string) until the end of the line, which is the end of the string when the string contains no newline.
- This means that if the assertion succeeds, the entire expression (?=.*[A-Z]).* will match the entire string because .* ensures that all characters from the current position to the end of the string are included in the match.

In summary, the assertion (?=.*[A-Z]) in the regular expression (?=.*[A-Z]).* serves as a prerequisite condition that ensures the entire string contains at least one uppercase letter, while .* is responsible for actually matching the entire string that meets this condition.
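
The steps above can be traced with a Python sketch. The second sample string is made up and contains a newline, so the assertion fails on the first line (because . does not cross a newline) and the engine retries from later positions.

```python
import re

pattern = re.compile(r"(?=.*[A-Z]).*")

# Assertion succeeds at position 0, then .* consumes the whole string.
single_line = pattern.search("abCde")

# On "ab\nCd" the assertion fails at positions 0, 1 and 2,
# then succeeds at position 3, where .* matches "Cd".
two_lines = pattern.search("ab\nCd")

# No uppercase letter: the assertion fails at every position.
no_match = pattern.search("abcde")

print(single_line.span(), two_lines.span(), no_match)
```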

Test Questions#

Website to practice regular expressions - Regex101
Here are some regular expression practice questions to help deepen your understanding of basic regular expression syntax:

Match Simple Numbers
Write a regular expression to match any simple positive integer (without leading zeros).
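[1-9]\d*
- [1-9]: The first digit cannot be 0, which rules out leading zeros and the number 0 itself.
- \d*: Zero or more further digits. A plain \d+ would also accept "0" and "007".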
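Email Address Validation
Create a regular expression to validate a simple email address, which should contain "@" and ".", and "@" should appear before ".".
First attempt (the unescaped . matches any character, and dots or hyphens in the user or domain part are rejected):
^[\w]+@[A-Za-z0-9]+.[A-Za-z]+$
Improved version:
^[\w.-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$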
Here is an explanation of each part of this regular expression:
- ^: Indicates the start of the string.
- [\w.-]+: Matches one or more alphanumeric characters, underscores, dots, or hyphens. This part matches the user part of the email address.
- @: Ensures that the email address contains "@".
- [A-Za-z0-9.-]+: Matches one or more letters, digits, dots, or hyphens. This part matches the domain part of the email address.
- \.: Ensures that the email address contains a dot.
- [A-Za-z]{2,}: Matches two or more letters. This part is typically used to match the top-level domain.
- $: Indicates the end of the string.
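
A quick way to exercise this pattern is a small Python check. The helper name is_simple_email and the sample addresses are hypothetical, and the pattern is deliberately simple: it does not implement the full rules for real email addresses.

```python
import re

EMAIL = re.compile(r"^[\w.-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

def is_simple_email(text: str) -> bool:
    """Check text against the simplified email pattern."""
    return EMAIL.match(text) is not None

samples = [
    "first.last@mail.example.org",
    "user@example",        # no dot followed by a top-level domain
    "user@@example.com",   # "@" is not allowed in the domain part
    "userexample.com",     # no "@" at all
]
results = [is_simple_email(s) for s in samples]
print(results)
```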
URL Matching
Write a regular expression to match standard HTTP or HTTPS URLs. The URL should start with http:// or https:// and can contain a domain name and path.
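First attempt (http[s]* allows any number of "s", the . is unescaped, {2,}+ is a possessive quantifier that some engines reject, and the domain cannot contain dots or hyphens):
^http[s]*:\/\/[\w]+.[\w]{2,}+(\/[\w]*|)$
Improved version: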
^https?:\/\/[\w.-]+\.[\w]{2,}(\/[\w\/.-]*)?$
- ^: Indicates the start of the string.
- https?: The s after http is optional, allowing for both http and https.
- :\/\/: Matches "://". The slash is not special in regular expressions themselves; it is escaped as \/ because it delimits the pattern in contexts such as JavaScript regular expression literals.
- [\w.-]+: Matches one or more alphanumeric characters, underscores, dots, or hyphens. Used to match part of the domain name.
- \.: The dot character separates parts of the domain name and needs to be escaped as \. so that it is matched literally.
- [\w]{2,}: Matches two or more word characters (letters, digits, or underscores), used for the top-level domain.
- (\/[\w\/.-]*)?: This is a capturing group used to match the path part of the URL. \/ matches the slash (the start of the path), and [\w\/.-]* matches alphanumeric characters, slashes, dots, or hyphens in the path. The entire group is optional, indicated by the final ?.
- $: Indicates the end of the string.
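
The improved URL pattern can be tried in Python as follows. The sample URLs are made up, and the pattern is a simplification: it does not handle ports, query strings, or fragments.

```python
import re

URL = re.compile(r"^https?:\/\/[\w.-]+\.[\w]{2,}(\/[\w\/.-]*)?$")

plain = URL.match("https://example.com")
with_path = URL.match("http://example.com/path/to/page.html")
wrong_scheme = URL.match("ftp://example.com")
no_tld = URL.match("https://example")
with_query = URL.match("http://example.com/a?b=1")  # "?" is not in the path class

print(plain.group(1))      # the optional path group did not participate
print(with_path.group(1))  # the captured path
```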
Date Format Validation
Create a regular expression to match the date format "YYYY-MM-DD", where the year is a four-digit number, and the month and day are two-digit numbers.
^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$
- ^: Indicates the start of the string.
- \d{4}: Matches four digits, representing the year.
- -: Literal character, indicating the separator between date parts.
- (0[1-9]|1[0-2]): This is a capturing group used to match months from 01 to 09 and 10 to 12. 0[1-9] matches 01 to 09, and 1[0-2] matches 10 to 12.
- -: Again matches the separator between date parts.
- (0[1-9]|[12][0-9]|3[01]): This is another capturing group used to match days in the month. 0[1-9] matches 01 to 09, [12][0-9] matches 10 to 29, and 3[01] matches 30 and 31.
- $: Indicates the end of the string.
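
The pattern checks the shape of the date but not the calendar, so a string such as "2023-02-31" still passes. The Python sketch below pairs the pattern with the standard-library datetime.date constructor to catch such cases; the function names are hypothetical.

```python
import re
from datetime import date

DATE = re.compile(r"^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$")

def has_date_shape(text: str) -> bool:
    """True when text looks like YYYY-MM-DD with month 01-12 and day 01-31."""
    return DATE.match(text) is not None

def is_calendar_date(text: str) -> bool:
    """True when text has the right shape and names a real calendar day."""
    if not has_date_shape(text):
        return False
    year, month, day = (int(part) for part in text.split("-"))
    try:
        date(year, month, day)  # raises ValueError for impossible dates
    except ValueError:
        return False
    return True

print(has_date_shape("2023-02-31"), is_calendar_date("2023-02-31"))
```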
Mobile Number Validation
Write a regular expression to validate a simple mobile number, which should start with "1" and have a total of 11 digits.
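^1\d{10}$
- ^ and $: Anchor the match to the whole string, so longer digit sequences that merely contain an 11-digit run are rejected.
- 1: The number must start with "1".
- \d{10}: Exactly ten more digits, giving 11 digits in total.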
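
Whether the pattern is anchored makes a practical difference, as this Python sketch with made-up numbers shows. An unanchored search finds an 11-digit run inside a longer string, while the anchored form accepts only a string that is exactly such a number.

```python
import re

unanchored = re.compile(r"1[\d]{10}")
anchored = re.compile(r"^1\d{10}$")

valid = "13812345678"        # 11 digits, starts with 1
too_long = "9913812345678"   # contains an 11-digit run starting with 1

found_inside = unanchored.search(too_long)

print(anchored.match(valid) is not None)
print(found_inside.group())              # the unanchored pattern matches a substring
print(anchored.match(too_long) is None)  # the anchored pattern rejects it
```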
IP Address Matching
Create a regular expression to match standard IPv4 addresses, where each octet should be a number between 0 and 255, separated by ".".
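^(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}$
- 25[0-5] matches 250 to 255, 2[0-4]\d matches 200 to 249, 1\d\d matches 100 to 199, and [1-9]?\d matches 0 to 99 without a leading zero.
- \.: The dot between octets must be escaped; an unescaped . would match any character.
- ^ and $: Anchor the match so that the whole string must be an address.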
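
A Python sketch of an IPv4 check. It assumes an anchored pattern with an escaped dot and octets of one to three digits without leading zeros; the names OCTET, IPV4, and is_ipv4 are hypothetical.

```python
import re

# One octet: 250-255, 200-249, 100-199, or 0-99 without a leading zero.
OCTET = r"(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
IPV4 = re.compile(rf"^{OCTET}(\.{OCTET}){{3}}$")

def is_ipv4(text: str) -> bool:
    """True when text is four dot-separated octets in the range 0-255."""
    return IPV4.match(text) is not None

samples = ["192.168.0.1", "255.255.255.255", "1.2.3.4", "256.1.1.1", "1.2.3", "01.2.3.4"]
results = [is_ipv4(s) for s in samples]
print(results)
```

Building the pattern from a single OCTET string keeps the four octets identical and makes the alternation easier to review.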
HTML Tag Matching
Write a regular expression to match simple HTML tags, such as <div> or <a href="...">, where the tag name can be any combination of letters.
<([a-zA-Z]+)(\s+[a-zA-Z]+="[^"]*")*\s*>
Consider an HTML tag with multiple attributes and special characters in the attribute values, such as an <a> tag with a data-attribute containing a JSON-like object (the inner quotes are single quotes, because a double quote would end the attribute value):

<a href="/example" data-attribute="{'key': 'value'}" class="link">

In this example:
- Using <[\w\/\s\+\=\"\']+> does not match the tag, because characters such as -, {, }, and : are not included in the defined character set. The match breaks at the first such character.
- Using <([a-zA-Z]+)(\s+[a-zA-Z]+="[^"]*")*\s*> handles the attribute values, because it looks for whitespace-separated name="value" pairs and does not care about the content inside the double quotes, allowing {, }, :, or any other character except a double quote. It still fails on this tag, however, because the attribute name data-attribute contains a hyphen. Widening the name part to [a-zA-Z-]+, giving <([a-zA-Z]+)(\s+[a-zA-Z-]+="[^"]*")*\s*>, matches the entire tag.
Therefore, while the first regular expression looks more general, it breaks on complex attribute values. The second regular expression provides a more stable and accurate match by strictly defining the attribute structure, especially when dealing with attribute values containing special characters.

The difference between the two is that the second one uses [^"]* for the attribute value, allowing it to match special characters within the value. However, in Vue, attribute names can also start with the @ symbol (for example @click), which the name part does not accept, so this regular expression is still not perfect.
Password Strength Validation
Create a regular expression to validate password strength. The password must contain at least one digit, one uppercase letter, one lowercase letter, and have a total length of at least 8 characters.
To create a regular expression to validate password strength, ensuring that the password contains at least one digit, one uppercase letter, one lowercase letter, and a total length of at least 8 characters, we can use the following regular expression:

^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,}$

The components of this expression are explained as follows:
- ^: Indicates the start of the string.
- (?=.*\d): This is a lookahead assertion to ensure that there is at least one digit in the string. . matches any character (except for newline), * indicates zero or more characters, and \d indicates a digit character.
- (?=.*[a-z]): This is another lookahead assertion to ensure that there is at least one lowercase letter in the string. [a-z] matches any lowercase letter.
- (?=.*[A-Z]): This is the third lookahead assertion to ensure that there is at least one uppercase letter in the string. [A-Z] matches any uppercase letter.
- .{8,}: . matches any character (except for newline), and {8,} indicates that the preceding character must be matched at least 8 times.
- $: Indicates the end of the string.
This regular expression uses lookahead assertions to independently check each required character type (digit, lowercase letter, and uppercase letter) and uses .{8,} to ensure that the total length of the password is at least 8 characters. Note that this expression does not limit the maximum length of the password and assumes that the password can contain any characters as long as they meet the conditions of having at least one digit, one uppercase letter, and one lowercase letter.
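
A small Python sketch that applies the pattern to a few made-up passwords. The function name is_strong is hypothetical.

```python
import re

PASSWORD = re.compile(r"^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,}$")

def is_strong(password: str) -> bool:
    """True when password has a digit, a lowercase and an uppercase letter, and 8+ characters."""
    return PASSWORD.match(password) is not None

samples = {
    "Passw0rd": is_strong("Passw0rd"),    # meets every condition
    "password1": is_strong("password1"),  # no uppercase letter
    "PASSWORD1": is_strong("PASSWORD1"),  # no lowercase letter
    "Password": is_strong("Password"),    # no digit
    "Pw0rd": is_strong("Pw0rd"),          # too short
}
print(samples)
```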
Code Comment Matching
Write a regular expression to match JavaScript or C-style single-line comments, which start with "//".
To match JavaScript or C-style single-line comments, which start with "//" and continue until the end of the line, you can use the following regular expression:

^\/\/.*

The components of this expression are explained as follows:
- ^: Indicates the start of the string. If you want to match comments at any position in the text, you can remove this symbol.
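- \/\/: Matches the "//" string in the text. The slash is not a special character in regular expressions themselves; the backslashes are only required where / delimits the pattern, as in a JavaScript regular expression literal. In other contexts a plain // works as well.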
- .*: This indicates matching any number of any characters (except for newline) after "//" until the end of the line.
When using this regular expression, note a few aspects:
- If you use this expression for a global search and do not want it to start matching from the beginning of each line, remove ^.
- This expression assumes that comments do not contain newline characters. If comments can span multiple lines (though this contradicts the definition of single-line comments), then this expression would not be applicable.
- In some regular expression processors, to match any character including newline, you may need to use a different symbol or mode flag, depending on the tool or language you are using.
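
The effect of keeping or removing ^ can be seen on a short, made-up source snippet. In this Python sketch the re.MULTILINE flag makes ^ match at the start of every line rather than only at the start of the string. Note as a caveat that a pattern this simple also matches "//" inside string literals such as URLs.

```python
import re

source = "// header comment\nint x = 1; // trailing comment\nint y = 2;\n"

# With ^ and MULTILINE: only comments that begin a line.
line_start = re.findall(r"^\/\/.*", source, flags=re.MULTILINE)

# Without ^: comments anywhere on a line.
anywhere = re.findall(r"\/\/.*", source)

print(line_start)
print(anywhere)
```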
File Path Matching
Create a regular expression to match Unix-style file paths, which can contain letters, digits, slashes (/), and dots (.).
^[\/\w.]+$
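
A Python sketch that tests the pattern against a few made-up paths. Because \w also accepts underscores, names such as my_file are allowed, while spaces and hyphens are not.

```python
import re

PATH = re.compile(r"^[\/\w.]+$")

samples = [
    "/usr/local/bin/python3.11",
    "./my_file.txt",
    "/home/user/my file.txt",  # contains a space
    "notes-2024.txt",          # contains a hyphen
    "",                        # empty string
]
results = [PATH.match(s) is not None for s in samples]
print(results)
```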
You can use online regular expression testing tools, such as Regex101, to practice these questions and test your solutions.