Lecture 23 Regular Expressions
- Questions
- When do you need to use the backslash, such as in the phone number quantifier example?
- Do raw strings have to use double quotes, or can they use single quotes?
- Regex
- Sequence of characters that specifies a pattern, usually utilized by string searching algorithms
- Not a programming language, just a standard that is implemented
- Using regex to specify a pattern we want to find within a string
- Matching exact strings
- Most characters in RegEx will match exactly the characters as they appear in the expression
- Some characters in regex have reserved meanings
- Have to use an escape sequence (a backslash before the character) to search for those characters literally
- Expression: \{abc\} → matches the literal text {abc}
- The Dot character
- reserved character that will match any single character that is not a new line
- . can basically be anything
- Expression: .a.a.a
- Fully matched by: banana, aaaaaa
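A minimal sketch of escaping and the dot, assuming Python's built-in re module (introduced later in these notes). The test strings other than banana and aaaaaa are made up for illustration.

```python
import re

# Escaping reserved characters: \{ and \} match literal braces.
escaped = re.fullmatch(r"\{abc\}", "{abc}")

# The dot matches any single character except a newline.
dot_pattern = r".a.a.a"
banana = re.fullmatch(dot_pattern, "banana")
all_a = re.fullmatch(dot_pattern, "aaaaaa")
too_short = re.fullmatch(dot_pattern, "banan")        # only 5 characters
has_newline = re.fullmatch(dot_pattern, "\na\na\na")  # the dot will not match \n

print(bool(escaped), bool(banana), bool(all_a))
print(bool(too_short), bool(has_newline))
```

The first line should print three True values and the second two False values, since fullmatch returns None when the whole string does not fit the pattern.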
- Custom Character Classes
- Character classes match any of a set of characters - one instance of a character class will match exactly one character
- Expression: [ab]c[ab]c
- It's basically: either a or b, then c, then either a or b, then c
- Fully matches only strings that are exactly 4 characters long and follow that pattern
- Can use a dash to get entire groups of characters
- Expression: [a-z][0-9]
- A dash covers every character in the range between its two endpoints (all lowercase letters, all digits)
- Fully matched by a0, b8, z0, g5
- Note: Parentheses and Brackets are different
- Note: Whitespace matters
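The character-class examples above can be checked with a short sketch. It assumes Python's re module and uses re.fullmatch so that the whole string has to fit the pattern; the non-matching test strings are invented for illustration.

```python
import re

# [ab]c[ab]c: a or b, then c, then a or b, then c (always 4 characters).
pairs = r"[ab]c[ab]c"
pair_results = {s: bool(re.fullmatch(pairs, s))
                for s in ["acbc", "bcbc", "acdc", "acbcac"]}

# [a-z][0-9]: one lowercase letter followed by one digit.
letter_digit = r"[a-z][0-9]"
ld_results = {s: bool(re.fullmatch(letter_digit, s))
              for s in ["a0", "g5", "A0", "a 0"]}

# Whitespace matters: a space in the pattern must be matched by a space.
spaced = bool(re.fullmatch(r"[a-z] [0-9]", "a 0"))

print(pair_results)
print(ld_results)
print(spaced)
```

Note that "a 0" fails against [a-z][0-9] but succeeds once the pattern itself contains the space.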
- Common Character Classes
- Shorthands for common character classes
- Put on cheatsheet or study guide, don’t need to necessarily memorize
- . matches any non-newline character
- \d matches digits, equivalent to [0-9]
- \w matches “word characters”, equivalent to [A-Za-z0-9_]
- \s matches whitespace characters
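- [^...] matches any single character except the ones listed after the ^
- \D matches any non-digit character (opposite of \d)
- Likewise \S and \W (capital) are the opposites of \s and \w
- The same thing can be written with the [^...] negation, e.g. [^\d] behaves like \D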
- Examples:
- [^ab]c
- Any character that is not an ‘a’ or ‘b’ followed by a ‘c’
- Correct: cc, zc, !c
- Incorrect: ab, bc
- a*b
- b, ab, aaaaaaaab
- a+b
- ab, aaaaaaaaab
- Note: b on its own doesn't match, because + requires at least one a
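A small sketch of the shorthand classes and the negated class, assuming Python's re module and its re.findall function (covered near the end of these notes). The sample string is hypothetical. One caveat worth hedging: in Python 3, \w and \d on ordinary str patterns also accept non-ASCII letters and digits, so the bracket equivalents hold for ASCII text.

```python
import re

sample = "Room 42_b"
digits = re.findall(r"\d", sample)        # digit characters
word_chars = re.findall(r"\w", sample)    # letters, digits, underscore
spaces = re.findall(r"\s", sample)        # whitespace characters
non_digits = re.findall(r"\D", "a1b2")    # everything that is not a digit

# [^ab]c: any character other than a or b, followed by c.
negated = {s: bool(re.fullmatch(r"[^ab]c", s))
           for s in ["cc", "zc", "!c", "ab", "bc"]}

print(digits, spaces, non_digits)
print("".join(word_chars))
print(negated)
```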
- Quantifiers
- allow us to specify multiple occurrences of the same character or character class
- a* zero or more occurrences of a
- a+ one or more occurrences of a
- a? zero or one occurrences of a
- a{2} two occurrences of a
- a{2,4} two, three, or four occurrences of a
- a{2,} at least 2 occurrences of a
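The quantifier table above can be tried out with a sketch like the following. It assumes Python's re module; fully_matches is a hypothetical helper written for these notes, not part of the library.

```python
import re

def fully_matches(pattern, text):
    """Return True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# Each pattern is paired with a few made-up test strings.
checks = {
    r"a*b": ["b", "ab", "aaab"],
    r"a+b": ["b", "ab", "aaab"],
    r"a?b": ["b", "ab", "aab"],
    r"a{2}": ["a", "aa", "aaa"],
    r"a{2,4}": ["a", "aa", "aaaa", "aaaaa"],
    r"a{2,}": ["a", "aa", "aaaaaaa"],
}

for pattern, texts in checks.items():
    print(pattern, {t: fully_matches(pattern, t) for t in texts})
```

Comparing the a*b and a+b rows shows the difference noted earlier: only a*b accepts a bare b.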
- Combining patterns
- The pipe | operator matches either the expression on its left or its right
- Basically an "or" statement
- The match comes from one alternative or the other, not both joined together
- Example:
- \d+|Inf
- Correct: Inf, 8
- Incorrect: 8Inf
- You can use parentheses to group expressions
- (<3)+
- <3, <3<3<3
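A short sketch of the pipe and of grouping, assuming Python's re module. The extra test strings (123, <33, <) are invented to show where the patterns stop matching.

```python
import re

# \d+|Inf: either a run of digits, or the literal text Inf.
either = r"\d+|Inf"
alt_results = {s: bool(re.fullmatch(either, s))
               for s in ["Inf", "8", "123", "8Inf"]}

# (<3)+ repeats the whole group; <3+ would repeat only the 3.
grouped = {s: bool(re.fullmatch(r"(<3)+", s))
           for s in ["<3", "<3<3<3", "<33", "<"]}
ungrouped = bool(re.fullmatch(r"<3+", "<33"))

print(alt_results)
print(grouped)
print(ungrouped)
```

The contrast between (<3)+ and <3+ on the string <33 is the reason the parentheses are needed: a quantifier applies only to the item immediately before it.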
- Anchors
- Another special kind of expression
- They don't match characters; instead, they match positions in a string where an expression could land
- ^ matches the beginning of a string
- $ matches the end of a string
- \b matches a "word boundary": the position between a word character and a non-word character (e.g. whitespace, punctuation) or the start/end of the string
- Examples:
- ^aw+
- Correct: aww, awwwwwwwwwww
- Partial: in "aww aww" only the first aww is matched; the second is not at the start of the string
- Note: ^ has two meanings
- [^a] → inverse (used inside character class)
- ^a → start of string (used outside character class)
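The anchor behaviour described above can be sketched as follows, assuming Python's re module and its re.findall and re.search functions (covered later in these notes). The cat strings are hypothetical examples.

```python
import re

text = "aww aww"
at_start = re.findall(r"^aw+", text)   # only a match at position 0 counts
anywhere = re.findall(r"aw+", text)    # without ^, both occurrences match

at_end = bool(re.search(r"cat$", "black cat"))                # $ pins the end
whole_words = re.findall(r"\bcat\b", "cat concat cats cat.")  # \b on both sides

# Inside brackets, ^ means "not" rather than "start of string".
not_a = re.findall(r"[^a]", "banana")

print(at_start, anywhere)
print(at_end, whole_words)
print(not_a)
```

Here concat and cats are skipped by \bcat\b because a word character sits directly next to cat on one side, so there is no boundary at that position.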
Regular Expressions in Python
- We use the re module in Python
import re
re.search(<pattern>, <string>)  # returns a Match object representing the first occurrence of <pattern> in <string>
bool(re.search(<pattern>, <string>))  # usually wrapped in bool() to get True/False
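As a runnable illustration of the call above, here is a minimal sketch. contains_digit is a hypothetical helper name and the sample strings are made up.

```python
import re

def contains_digit(text):
    """Return True if text contains at least one digit anywhere."""
    return bool(re.search(r"\d", text))

# re.search stops at the first occurrence, reading left to right.
first = re.search(r"\d+", "order 66, then 77")

print(contains_digit("abc1"), contains_digit("abc"))
print(first.group(0))
```

Wrapping the result in bool() works because a Match object is truthy and the None returned for "no match" is falsy.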
Raw Strings
- Python has escape characters built in to string evaluation, such as the newline character \n
- Regex also uses \ for its own escapes, so we write patterns as raw strings (prefix r) to keep Python from interpreting the backslashes first
- print(r"hello\nthere!")
- Output: hello\nthere!
- Use raw strings with regex
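A brief sketch of why raw strings matter, assuming Python's re module. It also touches the quoting question from the top of these notes: the r prefix works with either single or double quotes. The strings are illustrative only.

```python
import re

normal = "hello\nthere!"   # \n is turned into a real newline character
raw = r"hello\nthere!"     # backslash and n are kept as two characters
print(normal)
print(raw)
print(len(normal), len(raw))

# Single and double quotes give the same raw string.
same_pattern = r'\d+' == r"\d+"

# Without r, Python reads \b as a backspace character, so the
# word-boundary pattern never reaches the regex engine intact.
without_raw = re.search("\bcat\b", "a cat")
with_raw = re.search(r"\bcat\b", "a cat")
print(same_pattern, without_raw, with_raw.group(0))
```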
- Match objects
- The re module has functions that attempt to match a pattern to a string; if they find a match, they return a match object
- If they don't find a match, they return None (falsy)
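- re.search(<pattern>, <string>) - Returns a match object for the first place <pattern> matches anywhere in <string>
- re.fullmatch(<pattern>, <string>) - Returns a match object only if <pattern> matches the entire <string>
- re.match(<pattern>, <string>) - Returns a match object only if <string> starts with <pattern>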
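The difference between the three functions can be seen side by side in a sketch like this, assuming Python's re module. The digit pattern and the three test strings are chosen only for illustration.

```python
import re

pattern = r"\d+"
text_results = {}
for text in ["35", "35 apples", "apples 35"]:
    text_results[text] = (
        bool(re.search(pattern, text)),     # match anywhere
        bool(re.match(pattern, text)),      # match at the start
        bool(re.fullmatch(pattern, text)),  # match the whole string
    )

for text, (searched, matched, full) in text_results.items():
    print(repr(text), searched, matched, full)

# A failed match is None, so check before calling methods on it.
missing = re.search(pattern, "no digits here")
print(missing is None)
```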
- Examples:
- The value we get back is the first substring that matches the pattern
- mat = re.search
- mat.group(0)
- '35'
- Capturing groups
- When we use parentheses to group sub-expressions, they define capture groups that we can then access individually
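A minimal sketch of capture groups, assuming Python's re module. The phone number and sentence are made up. It also illustrates the backslash question from the top of these notes: literal parentheses need \( and \), while unescaped parentheses create groups.

```python
import re

# \( and \) match literal parentheses; plain ( ) define capture groups.
phone = re.search(r"\((\d{3})\) (\d{3})-(\d{4})", "Call (555) 867-5309 today")

whole = phone.group(0)      # the entire matched text
area_code = phone.group(1)  # first capture group
parts = phone.groups()      # all capture groups as a tuple

print(whole)
print(area_code)
print(parts)
```

group(0) is always the full match, and groups are numbered from 1 in the order their opening parentheses appear.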
- Other re functions
- re.findall(<pattern>, <string>) - Returns a list of all substrings within <string> that match <pattern>, read from left to right
- re.sub(<pattern>, <repl>, <string>) - Every match of <pattern> in <string> is replaced with <repl>
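A short sketch of re.findall and re.sub, assuming Python's re module. The sentence and the replacement character are hypothetical.

```python
import re

text = "3 cats, 12 dogs, 7 birds"

# findall collects every non-overlapping match, left to right.
numbers = re.findall(r"\d+", text)

# With capture groups in the pattern, findall returns the groups instead.
pairs = re.findall(r"(\d+) (\w+)", text)

# sub returns a new string; the original is left unchanged.
masked = re.sub(r"\d+", "#", text)

print(numbers)
print(pairs)
print(masked)
```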
- Center embedding
- Context-free grammars (CFGs)