regexp_split_to_table and string_to_array unnest performance

Friday, July 01. 2016

Whenever you need to split text into multiple records, breaking on some delimiter, PostgreSQL provides two common options. The first is regexp_split_to_table; the next most popular is the unnest function in combination with string_to_array.

Here is an example using regexp_split_to_table:

SELECT a
FROM regexp_split_to_table('john,smith,jones', ',') AS a;

Which outputs:

   a
-------
 john
 smith
 jones
(3 rows)

You can achieve the same result by using the construct:

SELECT a
FROM unnest(string_to_array('john,smith,jones', ',')) AS a;
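This construct works in two steps, which can be run separately to see what each one contributes. string_to_array first turns the text into a text array on a literal delimiter, and unnest then expands that array into one row per element. A small illustration using the same input:

```sql
-- Step 1: split the text into an array on the literal delimiter ','.
SELECT string_to_array('john,smith,jones', ',') AS arr;
-- arr: {john,smith,jones}

-- Step 2: expand an array into one row per element.
SELECT a
FROM unnest(ARRAY['john', 'smith', 'jones']) AS a;
```

The second query returns the same three rows as the examples above.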

With short text you won't notice much performance difference. But what happens if we pass in a humongous text?

We'll create a table with one row and one text column. The text column contains 500,000 characters, with a line break as every 64th character.

DROP TABLE IF EXISTS sample_data; 
CREATE UNLOGGED TABLE sample_data AS
SELECT string_agg(CASE WHEN mod(i,64) = 0 THEN E'\n' ELSE CHR(64 + mod(i,64)) END,'') AS data
FROM generate_series(1, 500000) i;
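Before timing anything, it can help to confirm that the generated text has the expected shape. The following check is only a sketch and assumes the sample_data table created above; it counts line breaks by comparing the length of the text with and without them.

```sql
-- Total characters, and the number of line breaks in the generated text.
SELECT length(data) AS total_chars,
       length(data) - length(replace(data, E'\n', '')) AS line_breaks
FROM sample_data;
-- total_chars: 500000, line_breaks: 7812
```

With 7812 line breaks, a split on E'\n' yields 7813 pieces, which matches the row counts reported in the results.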

We'll first use regexp_split_to_table to split the single row into multiple rows:

DROP TABLE IF EXISTS sample_data_rows; 
CREATE UNLOGGED TABLE sample_data_rows AS
SELECT regexp_split_to_table(data, E'\n')
FROM sample_data;

Results
Query returned successfully: 7813 rows affected, 22.3 secs execution time.

Now we repeat the same exercise, but using unnest(string_to_array) instead of regexp_split_to_table:

DROP TABLE IF EXISTS sample_data_rows; 
CREATE UNLOGGED TABLE sample_data_rows AS
SELECT unnest(string_to_array(data, E'\n'))
FROM sample_data;

Results
Query returned successfully: 7813 rows affected, 40-61 msec execution time.
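To repeat the comparison without creating a result table each time, one option is to count the split rows under EXPLAIN ANALYZE. This is a sketch, not the method used for the timings above. It assumes the sample_data table from earlier and PostgreSQL 9.3 or later, since calling the functions in the FROM clause relies on implicit LATERAL. Absolute timings will vary by machine and PostgreSQL version.

```sql
-- Time the regex-based split; EXPLAIN ANALYZE runs the query and reports elapsed time.
EXPLAIN ANALYZE
SELECT count(*)
FROM sample_data, regexp_split_to_table(data, E'\n') AS line;

-- Time the array-based split on the same input.
EXPLAIN ANALYZE
SELECT count(*)
FROM sample_data, unnest(string_to_array(data, E'\n')) AS line;
```

The function scan in each plan should show 7813 actual rows, so the two execution times can be compared directly.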

And the winner is unnest(string_to_array).

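Observe that unnest(string_to_array) is orders of magnitude faster than the equivalent regexp_split_to_table: 40-61 msec versus 22.3 secs on this input.

As the text gets bigger, regexp_split_to_table gets disproportionately worse. For example, with a 250,000-character piece of text, regexp_split_to_table takes 5.7 secs, compared to 22.3 secs for the 500,000-character example. Doubling the input roughly quadrupled the run time, so across these two measurements the growth looks closer to quadratic than linear.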
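The two timings quoted above give a rough sense of the growth rate. As a back-of-the-envelope calculation, dividing them gives the slowdown for a doubled input, and the base-2 logarithm of that ratio gives the implied exponent if run time grows as a power of the input size. Two data points are far too few to establish a trend, so treat this as an estimate only.

```sql
-- Ratio of the two measured run times, and the exponent k in time ~ n^k
-- implied by doubling the input from 250,000 to 500,000 characters.
SELECT round(22.3 / 5.7, 2)           AS slowdown_for_2x_input,
       round(log(2.0, 22.3 / 5.7), 2) AS implied_exponent;
-- slowdown_for_2x_input: 3.91, implied_exponent: 1.97
```

An exponent near 2 means the time roughly quadruples each time the text doubles in length.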

There are many cases where unnest(string_to_array) can't substitute for regexp_split_to_table. You need regexp_split_to_table when you have to split on something more complex than a fixed delimiter string, such as any run of characters from a given set. For example, to split wherever there is a sequence of spaces or commas:

SELECT *
FROM regexp_split_to_table('john    smith,jones', E'[\\s,]+') AS a;

   a
-------
 john
 smith
 jones
(3 rows)

string_to_array splits only on a fixed delimiter string, so there is no way to achieve the same result with unnest(string_to_array) alone.
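That holds for unnest(string_to_array) used by itself. One possible workaround, shown here only as a sketch, is to normalize every run of delimiters to a single comma with regexp_replace and then split on that comma. This still invokes the regular expression engine once over the whole text, and its performance on large inputs was not measured in this article.

```sql
-- Collapse each run of whitespace or commas into a single ',' (the 'g' flag
-- replaces all matches), then split on the now-uniform delimiter.
SELECT a
FROM unnest(
       string_to_array(
         regexp_replace('john    smith,jones', E'[\\s,]+', ',', 'g'),
         ','
       )
     ) AS a;
-- Returns three rows: john, smith, jones
```

This only works when the replacement delimiter cannot appear in the data other than as a separator.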

Posted by Leo Hsu and Regina Obe in 8.3, 8.4, 9.0, 9.1, 9.2, 9.3, 9.4, 9.5, 9.6, postgresql versions at 23:13

Postgres OnLine Journal
