I was reviewing some old code when I stumbled across something I must have completely forgotten
or someone else some time ago knew. That is that it is possible to use the function substring for regular expression work.
Most of the regexp functions in PostgreSQL usually start with regexp or are operators like ~. But I completely forgot
about the substring function that it understands regular expressions too. When I have to parse out some little piece of text using regular expressions, I usually reach
for one of those regexp_* functions like regexp_replace, regexp_match, or regexp_matches and use substring only when I want to extract a range of characters by index numbers from a string. But there was code right there telling me perhaps I or someone else was much smarter back then. All these functions are summarized on Documentation: PostgreSQL string functions.
The common substring function most people use is:
substring(string, start_position, length)
which can also be written as
substring(string FROM start_position FOR length)
With the length
and FOR length
clause being optional and when left out giving
you the string from the start position to the end of string.
How you use it is as follows:
SELECT substring('Jill lugs 5 pails', 11, 7);
returns: 5 pails
These pedestrian forms of substring are even present in databases that completely suck at doing regular expressions.
I shall not name these databases. Just know they exist. PostgreSQL has another form which I find sometimes more understandable and shorter to write
than doing the same with the aforementioned regexp_* functions.
Lets repeat the above example but with the understanding that we need to pick out what exactly Jill is lugging and how many of them she is carrying.
SELECT substring('Jill lugs 5 pails', '[0-9]+\s[A-Za-z]+');
Which returns the same thing, but without having to know postion, but only be concerned with phraseology.
Now sure in many cases you would be better off with the more powerful regexp_match or regexp_matches siblings
, like when you are trying to separate parts of a statement in one go.
Like just maybe I want my count of pails to be separate from the thing pails
then sure I'd do.
SELECT r[1] AS who, r[2] AS action, r[3] AS how_many, r[4] AS what
FROM regexp_match('Jill lugs 5 pails', '([A-Za-z]+)\s+([A-Za-z]+)\s+([0-9]+)\s([A-Za-z]+)') AS r;
who | action | how_many | what
------+--------+----------+-------
Jill | lugs | 5 | pails
(1 row)
And when I need to be a bit greedy and return multiple records cause I'm to grab parts of each sentence that fit my phraseology
in a pool of word soup, I will reach for regexp_matches.
SELECT r[1] AS who, r[2] AS action, r[3] AS how_many, r[4] AS what
FROM regexp_matches('Jill lugs 5 pails. Jack lugs 10 pails.', '([A-Za-z]+)\s+([A-Za-z]+)\s+([0-9]+)\s([A-Za-z]+)', 'g') AS r;
who | action | how_many | what
------+--------+----------+-------
Jill | lugs | 5 | pails
Jack | lugs | 10 | pails
(2 rows)
While both of these are great they both always return arrays or sets of arrays.
regexp_matches on the upside is a set returning function, so if you need multiple answers that match your pattern
it does the trick, but when nothing matches, your query blows up returning no records. This makes it slightly dangerous to
use in a FROM clause unless you are aware of this and don't care or have come up with ways to avoid such as using a LEFT JOIN.
regexp_match while it always returns back cause it's not a set returning, is annoying cause you always get back an array you
must pick out your values. So the original question becomes a slight cacophony of extra ()[]
as follows:
SELECT (regexp_match('Jill lugs 5 pails', '[0-9]+\s[A-Za-z]+'))[1];
As I get older those extra characters annoy me cause it's more typing and more useless characters to look at.
Although I haven't tested, I suspect it's slower too.