September 24, 2014
This is the second in a series of posts I'm writing as I explore some of the cool features of the Python programming language. I've just started learning about regular expressions, and though there is a lot of depth to them, some parts are quite simple. I only seek to present some of the easy stuff here, so this post will be an incomplete introduction.
I'm scripting in Python v3.4.1, which is the current stable version and is what the code in this post is based upon.
Regular expressions are a powerful way to search, replace, parse and maybe even construct complex patterns of text in strings. The re module must be imported at the beginning of the script (or at least before you use it), and then several functions are available for working with patterns in strings. You can read about the available functions for yourself in the Python documentation. Let's look at a simple example and then take it apart to see what it's doing. You can copy and paste this code into a text editor and try to run it.
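A minimal sketch of such an example follows. The sample sentence and the variable names are assumptions, picked so that the output matches the result discussed in this post, and re.findall is assumed to be the function doing the matching.

```python
import re

# Sample sentence (an assumption, chosen to match the results quoted in this post).
text = 'This is just a bit of text with numbers 13579.'

# findall returns a list of every non-overlapping match of the pattern.
digits = re.findall('[0-9]', text)

# Joining on an empty string glues the list elements into one string.
result = ''.join(digits)
print(result)
```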
The result is just the string '13579', as you would probably expect. If I hadn't used the join method on an empty string, then re would have returned a list with five elements, one for each digit that matches the pattern [0-9]. The square brackets define a set of characters, and the pattern matches any single character from that set. In this case I wanted to match the digits 0 through 9. If I wanted to match only digits in the range 2-6 inclusive, then I could simply make my pattern [2-6]. Change the pattern to [a-z] and see what happens:
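Here is a sketch of both variations, again assuming the same reconstructed sample sentence and re.findall. The variable names are only for illustration.

```python
import re

# Same assumed sample sentence as before.
text = 'This is just a bit of text with numbers 13579.'

# A narrower range keeps only the digits 2 through 6.
narrow = ''.join(re.findall('[2-6]', text))
print(narrow)

# A range of lowercase letters keeps only the characters a through z.
lower = ''.join(re.findall('[a-z]', text))
print(lower)
```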
The result 'hisisjustabitoftextwithnumbers' may not be exactly what you expected, but it isn't too hard to figure out what is going on. We are looking for lowercase letters, so the initial 'T' doesn't match the pattern and thus gets left out. The same goes for the numbers and the period at the end. Spaces aren't part of the 26-letter alphabet either, so everything gets mashed together. That's easy enough to understand. Let's say we want to add the capital letters so we pick up the 'T' at the beginning. It seems logical to add [A-Z] to our search, giving the pattern [A-Z][a-z]. Try it for yourself:
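A sketch of that attempt, assuming the pattern is the two bracket sets written side by side and the same reconstructed sample sentence:

```python
import re

# Same assumed sample sentence as before.
text = 'This is just a bit of text with numbers 13579.'

# Two bracket sets in a row describe a two-character sequence:
# one uppercase letter immediately followed by one lowercase letter.
pairs = re.findall('[A-Z][a-z]', text)
result = ''.join(pairs)
print(result)
```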
Now the result 'Th' might be truly surprising, but remember that re is matching patterns, not individual characters. Two sets of brackets side by side describe a sequence, so it's looking for an uppercase letter followed by a lowercase letter. That may not be what you expected or even what you wanted, but it certainly illuminates a bit more of how re works. So how do we get all the lowercase and uppercase letters? Just add more ranges inside the same pair of square brackets, like this: [a-zA-Z]. If you want to also get the numbers, just add the range you want. Do you want to get spaces and even the period at the end? Just add them: [a-zA-Z0-9. ].
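The three patterns can be compared side by side in a short sketch. As before, the sample sentence and the variable names are assumptions made for illustration.

```python
import re

# Same assumed sample sentence as before.
text = 'This is just a bit of text with numbers 13579.'

# Both letter ranges inside one set: any single upper or lowercase letter.
letters = ''.join(re.findall('[a-zA-Z]', text))

# Adding the digit range keeps the numbers as well.
letters_digits = ''.join(re.findall('[a-zA-Z0-9]', text))

# Inside square brackets a period is a literal period, so adding
# '.' and a space lets every character of the sentence through.
everything = ''.join(re.findall('[a-zA-Z0-9. ]', text))

print(letters)
print(letters_digits)
print(everything)
```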
We have barely scratched the surface of re, but that should be enough to get a curious character started down the path. A journey of a thousand miles...