Python Regular Expression


A regular expression (regex for short) is a sequence of characters that specifies a pattern of text to search for. In this tutorial, we look at ways to work with regular expressions in Python.

Related Posts:
Python Regular Expression to extract phone number from text
Python Regular Expression to extract email from text

How to use Python Regular Expression

Using a regular expression takes a few steps:

Import the regex module

All Python regex functions are in the re module. Remember to import it at the beginning of your Python code or any time IDLE is restarted.

>>> import re

Create Regex object

We create a Regex object by passing a string that represents the regular expression to re.compile(). A raw string (r'...') keeps backslashes such as \d from being treated as Python string escapes.

For example, to match the phone number pattern:

>>> phoneRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

Get Match object

A Regex object has a search() method that searches a string for the first match of the pattern. It returns:
– None if the regex pattern is not found
– a Match object if the pattern is found

>>> mo = phoneRegex.search('My phone number is 123-555-4242.')

Get matched text

We call the Match object's group() method to get the actual matched text from the searched string.

>>> print('Found: ' + mo.group())

At a glance:

>>> import re
>>> phoneRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
>>> mo = phoneRegex.search('My phone number is 123-555-4242.')
>>> print('Found: ' + mo.group())
Found: 123-555-4242
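
Because search() returns None when nothing matches, calling group() on the result without a check can raise an AttributeError. The following is a minimal sketch of a guarded lookup; the helper name find_phone and the sample sentences are made up for illustration.

```python
import re

phone_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

def find_phone(text):
    """Return the first phone number in text, or None if there is none."""
    mo = phone_regex.search(text)
    # search() returns None when nothing matches, so check before calling group()
    return mo.group() if mo else None

found = find_phone('My phone number is 123-555-4242.')
missing = find_phone('No number here.')
print(found)    # 123-555-4242
print(missing)  # None
```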

Basic Python Regular Expression example

Use parentheses to group

– We add parentheses to create groups in the regex: (group_1)text(group_2)text(group_3).

– Then we use the group() method of the Match object to grab the matching text. The first set of parentheses in a regex string is group(1), the second set is group(2), and so on. group(0) or group() returns the entire matched text.

>>> regex = re.compile(r'(\d\d)(\d)-(\d\d\d-\d\d\d\d)')
>>> mo = regex.search('My phone number is 123-555-4242.')
>>> mo.group(1)
'12'
>>> mo.group(2)
'3'
>>> mo.group(3)
'555-4242'
>>> mo.group(0)
'123-555-4242'
>>> mo.group()
'123-555-4242'

– We can get all groups at once with groups() method that returns a tuple of multiple values:

>>> mo.groups()
('12', '3', '555-4242')
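
Since groups() returns an ordinary tuple, it can be unpacked straight into variables. Here is a small sketch of that idea; the two-group pattern and the names area_code and main_number are chosen for illustration only.

```python
import re

regex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = regex.search('My phone number is 123-555-4242.')

# groups() returns a tuple, so it can be unpacked into separate names
area_code, main_number = mo.groups()
print(area_code)    # 123
print(main_number)  # 555-4242
```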

Match multiple groups

We use the | character (pipe) to match one of several expressions.
For example, the regular expression r'ozenero|JavaSampleApproach' will match either 'ozenero' or 'JavaSampleApproach'.

>>> regex = re.compile(r'ozenero|\w\w\w\wSampleApproach')
>>> mo = regex.search('What does ozenero means?')
>>> mo.group()
'ozenero'

When both expressions occur in the searched string, search() returns the match that starts earliest in the string.

>>> regex = re.compile(r'ozenero|\w\w\w\wSampleApproach')
>>> mo = regex.search('JavaSampleApproach.com was the predecessor website to ozenero.com.')
>>> mo.group()
'JavaSampleApproach'

We can also use the | character inside a group to match one of several alternatives as part of a larger regular expression.

>>> regex = re.compile(r'(ozenero|gkz|grokee).com')

>>> mo = regex.search('JavaSampleApproach.com was the predecessor website to ozenero.com.')
>>> mo.group()
'ozenero.com'

>>> mo = regex.search('gkz.com and ozenero.com are one.')
>>> mo.group()
'gkz.com'
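
One caveat about patterns like the one above: an unescaped dot matches any character, not only a literal '.'. The sketch below compares an unescaped and an escaped dot; the sample text 'gkzXcom' is invented to show the difference.

```python
import re

loose = re.compile(r'(gkz|grokee).com')    # dot matches any character
strict = re.compile(r'(gkz|grokee)\.com')  # \. matches a literal dot only

text = 'gkzXcom is not a domain name'
loose_match = loose.search(text)
strict_match = strict.search(text)
print(loose_match.group())  # gkzXcom
print(strict_match)         # None
```

Escaping the dot is usually the safer choice when the pattern is meant to match a real domain name.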

The findall() method, shown later in this tutorial, finds all matching occurrences.

Match optionally

We can make part of a pattern optional with the ? character, so the regex matches whether or not that text is there.
The group preceding the ? character is the optional part of the pattern.
Remember that the first occurrence of matching text will be returned.

>>> regex = re.compile(r'(gro)?konez.com')

>>> mo = regex.search('gkz.com and grokonez.com are one.')
>>> mo.group()
'grokonez.com'

>>> mo = regex.search('konez.com is the parent site of grokonez.com.')
>>> mo.group()
'konez.com'
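
When an optional group does not take part in the match, group(1) returns None rather than an empty string. A short sketch of that behavior, using an escaped dot and two made-up input strings:

```python
import re

regex = re.compile(r'(gro)?konez\.com')

with_prefix = regex.search('grokonez.com')
without_prefix = regex.search('konez.com')

# The optional group is captured only when it is present in the text
print(with_prefix.group(1))     # gro
print(without_prefix.group(1))  # None
```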

Match zero or more

The group that precedes the star * can occur any number of times (zero or more) in the text.

>>> regex = re.compile(r'gro(ko)*nez.com')

>>> mo = regex.search('gkz.com and grokonez.com are one.')
>>> mo.group()
'grokonez.com'

>>> mo = regex.search('gkz.com and gronez.com.')
>>> mo.group()
'gronez.com'

>>> mo = regex.search('gkz.com and grokokokonez.com are one.')
>>> mo.group()
'grokokokonez.com'

Match one or more

Unlike the star, the plus + character requires the group preceding it to appear at least once.

>>> regex = re.compile(r'gro(ko)+nez.com')

>>> mo = regex.search('gkz.com and gronez.com.')
>>> mo is None
True

>>> mo = regex.search('gkz.com and grokonez.com are one.')
>>> mo.group()
'grokonez.com'

>>> mo = regex.search('gkz.com and grokokokonez.com are one.')
>>> mo.group()
'grokokokonez.com'

Match with specific repetition

We can specify the number of times that a group repeats by using a number in curly brackets.

>>> regex = re.compile(r'gro(ko){3}nez.com')

>>> mo = regex.search('gkz.com and grokokokonez.com are one.')
>>> mo.group()
'grokokokonez.com'

>>> mo = regex.search('gkz.com and grokonez.com are one.')
>>> mo is None
True

We can also give a range by adding a second number in the curly brackets: {3,5} means at least 3 and at most 5 repetitions.

>>> regex = re.compile(r'gro(ko){3,5}nez.com')

# 3 'ko'
>>> mo = regex.search('grokokokonez.com.')
>>> mo.group()
'grokokokonez.com'

# 4 'ko'
>>> mo = regex.search('grokokokokonez.com.')
>>> mo.group()
'grokokokokonez.com'

# 5 'ko'
>>> mo = regex.search('grokokokokokonez.com.')
>>> mo.group()
'grokokokokokonez.com'

# 6 'ko'
>>> mo = regex.search('grokokokokokokonez.com.')
>>> mo is None
True
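
Either number in the curly brackets can be left out to get an open-ended range. The sketch below tries the {n,} and {,n} forms on a made-up list of words; the variable names are illustrative.

```python
import re

at_least_two = re.compile(r'gro(ko){2,}nez')  # two or more 'ko'
at_most_two = re.compile(r'gro(ko){,2}nez')   # zero, one, or two 'ko'

words = ['gronez', 'grokonez', 'grokokonez', 'grokokokonez']
matched_at_least = [w for w in words if at_least_two.search(w)]
matched_at_most = [w for w in words if at_most_two.search(w)]
print(matched_at_least)  # ['grokokonez', 'grokokokonez']
print(matched_at_most)   # ['gronez', 'grokonez', 'grokokonez']
```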

Greedy and Nongreedy matching

(gkz){3,5} can match 3, 4, or 5 instances of 'gkz' in the string 'gkzgkzgkzgkzgkz'.

By default, Python regular expressions are greedy, which means that the longest possible string will be matched.

>>> regex = re.compile(r'(gkz){3,5}')
>>> mo = regex.search('gkzgkzgkzgkzgkz.')
>>> mo.group()
'gkzgkzgkzgkzgkz'
#instead of 'gkzgkzgkz' (3)

To match the shortest string (nongreedy), we use a question mark ? character right after the curly brackets:

>>> regex = re.compile(r'(gkz){3,5}?')
>>> mo = regex.search('gkzgkzgkzgkzgkz.')
>>> mo.group()
'gkzgkzgkz'

Get all matches

A Regex object has a findall() method that returns a list of all matches (each string representing one match) in the searched string.

>>> regex = re.compile(r'gro[ko]*nez.com')
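>>> regex.findall('gkz.com, grokonez.com, grokokonez.com, grokokokonez.com are one.')
['grokonez.com', 'grokokonez.com', 'grokokokonez.com']

Note that the code above uses square brackets [] (a character class), not parentheses, so the regex has no groups and findall() returns a list of strings. If the regular expression contains two or more groups (with ()), findall() returns a list of tuples, one tuple per match. With exactly one group, it returns a list of the strings captured by that group.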

>>> regex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
>>> regex.findall('Cell: 123-678-6789 Work: 123-555-9999')
[('123', '678-6789'), ('123', '555-9999')]
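
When the regex has groups but we still want the full matched text (or its position), finditer() is an alternative to findall(): it yields one Match object per match. A minimal sketch, reusing the pattern and text from the example above:

```python
import re

regex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
text = 'Cell: 123-678-6789 Work: 123-555-9999'

# finditer() yields one Match object per match
full_numbers = [mo.group() for mo in regex.finditer(text)]
positions = [mo.span() for mo in regex.finditer(text)]
print(full_numbers)  # ['123-678-6789', '123-555-9999']
print(positions)     # [(6, 18), (25, 37)]
```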

Python Regex Symbols

Basic character classes

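In the code above, we have used \d for any numeric digit. For ASCII text, \d is shorthand for the regular expression (0|1|2|3|4|5|6|7|8|9). Note that in Python 3, \d in a str pattern also matches other Unicode decimal digits unless the re.A (ASCII) flag is set.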

There are other shorthand character classes:

\d a numeric digit from 0 to 9
\D NOT a numeric digit from 0 to 9
\w letter, numeric digit, underscore character
\W NOT a letter, numeric digit, underscore character
\s space, tab, newline character
\S NOT a space, tab, newline

The regular expression \w+:\s\d+ will match text that:
– has one or more letter/digit/underscore characters (\w+)
– followed by a : character
– followed by a space, tab, or newline character (\s)
– ends with one or more numeric digits (\d+)

>>> regex = re.compile(r'\w+:\s\d+')
>>> text = 'The zoo has cats: 12, dogs: 8, elephants: 6...'
>>> regex.findall(text)
['cats: 12', 'dogs: 8', 'elephants: 6']

>>> regex = re.compile(r'(\w+):\s(\d+)')
>>> text = 'The zoo has cats: 12, dogs: 8, elephants: 6...'
>>> regex.findall(text)
[('cats', '12'), ('dogs', '8'), ('elephants', '6')]
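
The list of tuples returned by findall() converts naturally into a dictionary. The following sketch assumes each name appears only once in the text; the variable name counts is illustrative.

```python
import re

regex = re.compile(r'(\w+):\s(\d+)')
text = 'The zoo has cats: 12, dogs: 8, elephants: 6...'

# Each tuple is (name, number); convert the number from str to int
counts = {name: int(number) for name, number in regex.findall(text)}
print(counts)  # {'cats': 12, 'dogs': 8, 'elephants': 6}
```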

Custom character classes

Define custom character class

We can define our own character class using square brackets.

For example, [aeiouAEIOU] will match any vowel (lowercase and uppercase):

>>> regex = re.compile(r'[aeiouAEIOU]')
>>> regex.findall('grokonez Programming Tutorials')
['o', 'o', 'e', 'o', 'a', 'i', 'u', 'o', 'i', 'a']

Character class in Range

We can include ranges of letters or digits by using a hyphen:
– Character class [1-7]: only the digits from 1 to 7.
– Character class [b-f]: only the letters from b to f (b, c, d, e, f).
– Character class [a-zA-Z0-9]: all lowercase letters, uppercase letters, and digits.

>>> regex = re.compile(r'[a-f]')
>>> regex.findall('grokonez Programming tutorials')
['e', 'a', 'a']

Negative character class

To make a negative character class, we use a caret symbol ^ right after the character class's opening bracket [.

>>> regex = re.compile(r'[^3-7]')
>>> regex.findall('123456789')
['1', '2', '8', '9']

# consonants
>>> regex = re.compile(r'[^aeiouAEIOU]')
>>> regex.findall('grokonez')
['g', 'r', 'k', 'n', 'z']
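
Keep in mind that a negated class matches every character not listed, including spaces, digits, and punctuation. The sketch below contrasts the simple negated class with one possible stricter variant that also excludes non-word characters, digits, and the underscore; the sample text is made up.

```python
import re

not_vowel = re.compile(r'[^aeiouAEIOU]')
# Also exclude non-word characters (\W), digits (\d), and underscore
consonant = re.compile(r'[^aeiouAEIOU\W\d_]')

text = 'go 4 it!'
not_vowels = not_vowel.findall(text)
consonants = consonant.findall(text)
print(not_vowels)  # ['g', ' ', '4', ' ', 't', '!']
print(consonants)  # ['g', 't']
```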

Caret symbol & Dollar sign

Caret symbol

To indicate that a match must occur at the beginning of the searched text, we use the caret symbol ^ as the first character of the regex.

>>> regex = re.compile(r'^grok')

>>> regex.search('grokonez Programming Tutorials')
<re.Match object; span=(0, 4), match='grok'>

>>> regex.search('Learn programming with grokonez') is None
True

Dollar sign

To indicate that a match must occur at the end of the searched text, we use the dollar sign $ as the last character of the regex.

>>> regex = re.compile(r'konez$')

>>> regex.search('Learn programming with grokonez')
<re.Match object; span=(26, 31), match='konez'>

>>> regex.search('grokonez Programming Tutorials') is None
True

Match entire string

We can use ^ and $ together to indicate that the entire string must match the regex:

>>> regex = re.compile(r'^\d+$')

>>> regex.search('The quantity is 3457439') is None
True

>>> regex.search('3457439')
<re.Match object; span=(0, 7), match='3457439'>

>>> regex.search('345 7439') is None
True
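
As an alternative to wrapping a pattern in ^ and $, a Regex object also has a fullmatch() method that succeeds only when the whole string matches. One difference worth knowing: $ also matches just before a trailing newline, while fullmatch() does not allow it. A small sketch (the names digits and anchored are illustrative):

```python
import re

digits = re.compile(r'\d+')
anchored = re.compile(r'^\d+$')

whole = digits.fullmatch('3457439') is not None
with_space = digits.fullmatch('345 7439') is not None
print(whole)       # True
print(with_space)  # False

# $ accepts a trailing newline at the end of the string; fullmatch() does not
dollar_newline = anchored.search('3457439\n') is not None
fullmatch_newline = digits.fullmatch('3457439\n') is not None
print(dollar_newline)     # True
print(fullmatch_newline)  # False
```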

Wildcard character

The wildcard character . (dot) in a regex matches any character (except the newline character).

>>> regex = re.compile(r'.ro')
>>> regex.findall('introduction to grokonez robust programming tutorials')
['tro', 'gro', ' ro', 'pro']

Match everything except newline

Recall that:
– . (dot) means any single character except the newline
– * (star) means zero or more of the preceding item

So dot-star .* matches everything (except the newline character).

>>> regex = re.compile('Name: (.*) - Location: (.*)')
>>> mo = regex.search('Name: grokoneer - Location: US')
>>> mo.groups()
('grokoneer', 'US')

Match everything

We can pass re.DOTALL to compile() method as the second argument to make the dot character match all characters (including newline).

>>> regex = re.compile('.*')
>>> mo = regex.search('ozenero\nProgramming tutorials')
>>> mo.group()
'ozenero'

>>> regex = re.compile('.*', re.DOTALL)
>>> mo = regex.search('ozenero\nProgramming tutorials')
>>> mo.group()
'ozenero\nProgramming tutorials'

Greedy & Nongreedy

By default, dot-star works in greedy mode: it matches as much text as possible.
To match as little text as possible (nongreedy mode), follow it with a question mark: .*?.

# greedy
>>> regex = re.compile(r'<!--.*-->')
>>> mo = regex.search('<!-- note -->regular expression testing code-->')
>>> mo.group()
'<!-- note -->regular expression testing code-->'

# nongreedy
>>> regex = re.compile(r'<!--.*?-->')
>>> mo = regex.search('<!-- note -->regular expression testing code-->')
>>> mo.group()
'<!-- note -->'
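
The difference between greedy and nongreedy dot-star matters most when a delimiter appears several times in the text. The following sketch pulls out quoted strings with findall(); the sample sentence is invented for illustration.

```python
import re

text = 'say "hello" and "goodbye"'

# Greedy: .* runs to the last quote in the string
greedy = re.findall(r'"(.*)"', text)
# Nongreedy: .*? stops at the nearest closing quote
nongreedy = re.findall(r'"(.*?)"', text)
print(greedy)     # ['hello" and "goodbye']
print(nongreedy)  # ['hello', 'goodbye']
```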

Python Regex Symbols Review

?              zero or one
+              one or more (nongreedy: +?)
*              zero or more (nongreedy: *?)
{n}            exactly n times
{n,}           n or more
{,n}           0 to n
{n,m}          at least n & at most m (nongreedy: {n,m}?)
^text          must begin with text
text$          must end with text
.              any character, except newline
\d, \w, \s     digit, word, or space character
\D, \W, \S     anything except digit, word, or space character
[abc]          any character of a, b, c
[^abc]         any character, except a, b, c

Python Regular Expression with Flags

Many functions and methods in the re module accept a flags argument that changes how the pattern is matched:
re.A: ASCII-only matching
re.I: ignore case
re.L: locale dependent
re.M: multi-line
re.S: dot matches all
re.U: Unicode matching
re.X: verbose (allow comment)
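
Two of these flags are not demonstrated elsewhere in this tutorial, so here is a brief sketch of each. re.X lets a pattern span several lines with comments, and re.M makes ^ match at the start of every line. The patterns and sample strings are made up for illustration.

```python
import re

# re.X (verbose): whitespace in the pattern is ignored and # starts a comment
phone = re.compile(r'''
    (\d{3})   # area code
    -
    (\d{3})   # exchange
    -
    (\d{4})   # line number
''', re.X)
phone_parts = phone.search('Call 123-555-4242 now').groups()
print(phone_parts)  # ('123', '555', '4242')

# re.M (multi-line): ^ matches at the start of each line, not only the string
line_start = re.compile(r'^\w+', re.M)
first_words = line_start.findall('first line\nsecond line')
print(first_words)  # ['first', 'second']
```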

Case-insensitive Regex

If we want to match text without caring about uppercase or lowercase, we pass re.I (or re.IGNORECASE) as the second argument to re.compile().

>>> regex = re.compile(r'grokonez', re.I)

>>> regex.search('Grokonez programming tutorials')
<re.Match object; span=(0, 8), match='Grokonez'>

>>> regex.search('GroKonez Python tutorials')
<re.Match object; span=(0, 8), match='GroKonez'>

>>> regex.search('GROKONEZ tutorials')
<re.Match object; span=(0, 8), match='GROKONEZ'>

Combine Flags

If we want to ignore capitalization and also let the dot character match newlines, we combine re.I and re.S (or re.IGNORECASE and re.DOTALL) using the pipe character |:

>>> regex = re.compile('name: j.*location: us', re.I | re.S)
>>> text = '''
... name: Jack
... location: US
... '''
>>> regex.search(text).group()
'name: Jack\nlocation: US'
