Every programmer should embrace and use regular expressions (INCLUDING Database programmers). There are many places where regular expressions can be used to reduce a 20 line piece of code into a 1 liner. Why write 20 lines of code when you can write 1.
Regular expressions are a domain language just like SQL. Just like SQL they are embedded in many places. You have them in your program editor. You see it in sed, grep, perl, PHP, Python, VB.NET, C#, in ASP.NET validators and javascript for checking correctness of input. You have them in PostgreSQL as well where you can use them in SQL statements, domain definitions and check constraints. You can mix regular expressions with SQL. When you mix the two domain languages, you can do enchanting things with a flip of a wrist that would amaze your less informed friends. Embrace the power of domain languages and mix it up. PostgreSQL makes that much easier than any other DBMS we can think of.
For more details on using regular expressions in PostgreSQL, check out the manual pages Pattern Matching in PostgreSQL
The problem with regular expressions is that they are slightly different depending on what language environment you are running them in. Different enough to be frustrating. We'll just focus on their use in PostgreSQL, though these lessons are applicable to other environments.
We are going to go backwards a bit. We will start with demonstrations of PostgreSQL SQL statements that find and replace things and list things out with regular expressions. For these exercises we will be using a contrived table called notes, which you can create with the following code.
CREATE TABLE notes(note_id serial primary key, description text); INSERT INTO notes(description) VALUES('John''s email address is johnny@johnnydoessql.com. Priscilla manages the http://www.johnnydoessql.com site. She also manages the site http://jilldoessql.com and can be reached at 345.678.9999 She can be reached at (123) 456-7890 and her email address is prissy@johnnydoessql.com or prissy@jilldoessql.com.'); INSERT INTO notes(description) VALUES('I like ` # marks and other stuff that annoys militantdba@johnnydoessql.com. Militant if you have issues, give someone who gives a damn a call at (999) 666-6666.');
PostgreSQL has a rich set of functions and operators for working with regular expressions. The ones we commonly use are ~, regexp_replace, and regexp_matches.
We use the PostgreSQL g flag in our use more often than not. The g flag is the greedy flag that returns, replaces all occurrences of the pattern. If you leave the flag out, only the first occurrence is replaced or returned in the regexp_replace, regexp_matches constructs. The ~ operator is like the LIKE statement, but for regular expressions.
The power of databases is not only do they allow you to store/retrieve information quickly, but they allow you to destroy information just as quickly. Every database programmer should be versed in the art of information destruction.
-- remove email addresses if description has email address
UPDATE notes SET description = regexp_replace(description,
E'[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+[.][A-Za-z]{2,4}',
'---', 'g')
WHERE description
~ E'[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+[.][A-Za-z]{2,4}';
-- remove website urls if description has website urls
-- matches things like http://www.something.com or http://something.com
UPDATE notes SET description = regexp_replace(description,
E'http://[[[:alnum:]]+.]*[[:alnum:]]+[.][[:alnum:]]+',
E'--', 'g')
WHERE description
~ E'http://[[[:alnum:]]+.]*[[:alnum:]]+[.][[:alnum:]]+';
-- remove phone numbers if description
-- has phone numbers e.g. (123) 456-7890 or 456-7890 or 123.456.7890
UPDATE notes SET description = regexp_replace(description,
E'[\(]{0,1}[0-9]{3}[\).-]{0,1}[[:space:]]*[0-9]{3}[.-]{0,1}[0-9]{4}',
'---', 'g')
WHERE description
~ E'[\(]{0,1}[0-9]{3}[\).-]{0,1}[[:space:]]*[0-9]{3}[.-]{0,1}[0-9]{4}';
-- set anything to single space that is not not a \ ( ) & * /,;. > < space or alpha numeric
UPDATE notes set description = regexp_replace(description, E'[^\(\)\&\/,;\*\:.\>\<[:space:]a-zA-Z0-9-]', ' ')
WHERE description ~ E'[^\(\)\&\/,;\*\:.\<\>[:space:]a-zA-Z0-9-]';
-- replace high byte characters with space
-- this is useful if you have your database in utf8 and you often need to use latin1 encoding
-- and you have a table that shouldn't have high byte characters
-- such as junk you get from scraping websites
-- high byte characters don't convert down to latin1
UPDATE notes SET description = regexp_replace(description,E'[^\x01-\x7E]', ' ', 'g')
WHERE description ~ E'[^\x01-\x7E]';
These examples use similar to our destroy but show us in a table, a list of stuff that match. Here we use our favorite PostgreSQL regexp_matches function.
-- list first email address from each note -- SELECT note_id, (regexp_matches(description, E'[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+[.][A-Za-z]+'))[1] As email FROM notes WHERE description ~ E'[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+[.][A-Za-z]+' ORDER By note_id, email; -- result note_id | email ---------+------------------------------- 1 | johnny@johnnydoessql.com 2 | militantdba@johnnydoessql.com
--list all email addresses -- note this uses the PostgreSQL 8.4 unnest construct -- to convert the array returned to a table SELECT note_id, unnest( regexp_matches(description, E'[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+[.][A-Za-z]+', 'g') ) As email FROM notes WHERE description ~ E'[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+[.][A-Za-z]+' ORDER By note_id, email; -- returns note_id | email ---------+------------------------------- 1 | johnny@johnnydoessql.com 1 | prissy@jilldoessql.com 1 | prissy@johnnydoessql.com 2 | militantdba@johnnydoessql.com
Here we will just cover what we consider the parts you need to know if you don't have the patience or memory to remember more. Regular expressions are much richer than our simplified view of things. There is the great backreferencing feature we won't get into which allows you to reference an expression and use it as part of your replacement expression.
Part | Example |
---|---|
Classes [] | Regular expression classes are a set of characters that you can treat as interchangeable. They are formed by enclosing the characters in a bracket. They can also have nested classes. For example [A-Za-z] will match any letter between A-Z and a-z. [A-Za-z[:space:]] will match those plus white spaces in PostgreSQL. If you need to match a regular expression character such as ( then you escape it with \. so [A-Za-z\(\)] will match A thru Z a thru z and ( ). Classes can contain other classes and expressions as members. |
. | The famous . matches any character. So the infamous .* means one or more of anything. |
Quantity {} + * | You denote quantities with {}, +, * + means 1 or more. * means 0 or more and {} to denote allowed quantity ranges. [A-Za-z]{1,5} means you can have between 1 and 5 alpha characters in and expression for it to be a match. |
() | This is how you denote a subexpression. A subexpression can be composed of multiple classes etc and can be backreferenced. They define a specific sequence of characters. |
[^members here] | This is the NOT operator in regular expressions so for example [^A-Za-z] will match any character that is not in the alphabet. |
Special classes | [[:alnum:]] any alphanumeric, [[:space:]] any white space character. There are others, but those are the most commonly used. |