What's new and upcoming in PostgreSQL
Using procedures for batch geocoding and other batch processing
One of the features we are looking forward to in the upcoming PostgreSQL 11 is the introduction of procedures via the ANSI-SQL CREATE PROCEDURE construct.
The major benefit that sets procedures apart from functions is that procedures are not wrapped in an outer transaction and can have COMMITs within them.
This means it's not all or nothing like it is with functions. Even if you stop a procedure mid-run, whatever work has been done and committed is saved.
In the case of functions, a stop or failure would roll back all the work. It also means you can see the work in progress of a stored procedure, since that work will already have been committed.
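As a minimal sketch of the difference (using a hypothetical tally table, not part of the original article), a procedure can COMMIT inside a loop, something a function cannot do:

```sql
-- hypothetical example table
CREATE TABLE tally(n integer);

-- a procedure may COMMIT mid-stream; a function may not
CREATE OR REPLACE PROCEDURE tally_in_batches()
LANGUAGE 'plpgsql'
AS $$
BEGIN
    FOR i IN 1..5 LOOP
        INSERT INTO tally(n) VALUES (i);
        COMMIT;  -- each row is durable even if a later iteration fails
    END LOOP;
END;
$$;

CALL tally_in_batches();
```

If you cancel the call after the third iteration, the first three rows remain in tally; the equivalent function body would have rolled all of them back.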
This is a huge benefit for batch processing. Batch processing covers a lot of use cases for PostGIS users, since a good chunk of PostGIS work involves some kind of batch processing of data you get from third parties or machines.
We'll demonstrate with batch geocoding, using a similar approach to the one we discussed in psql watch for batch processing. Folks often resort to doing this stuff outside of the database, for example using Python as a driver to push batches to PostgreSQL. Now PostgreSQL is capable of doing it all.
In addition to the code and data here, you'll need the postgis_tiger_geocoder extension installed and loaded with MA census data.
Why batch process data?
For geocoding of data or any other kind of lengthy processing work, you often do it in batches of say 1, 10, or 50 records, but never the whole 1,000,000 at once. The reasons for that are:
- For each update statement, PostgreSQL needs to be able to roll back the update. If your server goes down for whatever reason, your last uncommitted update is lost. A million records of work is a lot to lose.
- Since PostgreSQL needs to be able to roll back, all the workspace needed to prepare for a rollback is held and eats up memory. In that sense you may not even have the memory available to handle a batch of 1,000,000 in one bite.
You can't put your batch work in a PostgreSQL function, at least not all of it, because each function runs inside a single enclosing transaction. This is one of the reasons why doing it in a function can take 10 or more times longer than embedding the logic in a script that pushes the work in smaller commits.

Now we've got PostgreSQL procedures, where we can put in pretty much the same code as we would have in a function (except now we can have COMMITs) and explicitly commit after each batch. The great thing about this is that if your connection is lost, the server crashes, you lose power, or whatever, the work that has already been committed is not lost. It also means the memory is freed after each commit, rather than being held until all the work in the procedure is done.
Test Data Set
Our test dataset is small, so not very interesting. To simulate more records, we repeat each address 1,000 times.
DROP TABLE IF EXISTS addr_to_geocode;
CREATE TABLE addr_to_geocode(
    addid serial PRIMARY KEY,
    address text,
    pt geography(point, 4326),
    new_address text,
    rating integer
);
INSERT INTO addr_to_geocode(address)
VALUES ('529 Main Street, Boston MA, 02129'),
    ('77 Massachusetts Avenue, Cambridge, MA 02139'),
    ('25 Wizard of Oz, Walaford, KS 99912323'),
    ('26 Capen Street, Medford, MA'),
    ('124 Mount Auburn St, Cambridge, Massachusetts 02138'),
    ('950 Main Street, Worcester, MA 01610');
-- repeat addresses to get more data
INSERT INTO addr_to_geocode(address)
SELECT address
FROM addr_to_geocode, generate_series(1,1000);
The batch processing procedure
To create our procedure, we are also going to take advantage of the FOR UPDATE SKIP LOCKED feature introduced in PostgreSQL 9.5. This allows multiple PostgreSQL connections to call the same procedure, with each allocated a different set of records to work on, so there is no lock contention.
CREATE OR REPLACE PROCEDURE batch_geocode()
LANGUAGE 'plpgsql'
AS $$
BEGIN
    WHILE EXISTS (SELECT 1 FROM addr_to_geocode WHERE rating IS NULL) LOOP
        -- we use FOR UPDATE SKIP LOCKED to prevent multiple concurrent procedure calls
        -- from grabbing the same set of records
        WITH a AS (
            SELECT addid, address
            FROM addr_to_geocode
            WHERE rating IS NULL
            ORDER BY addid
            LIMIT 5
            FOR UPDATE SKIP LOCKED
        )
        UPDATE addr_to_geocode
        SET (rating, new_address, pt) = (
            COALESCE(g.rating, -1),
            COALESCE((g.addy).address::text, '')
                || COALESCE(' ' || (g.addy).predirabbrev, '')
                || COALESCE(' ' || (g.addy).streetname, '')
                || ' '
                || COALESCE(' ' || (g.addy).streettypeabbrev, '')
                || COALESCE(' ' || (g.addy).location || ', ', '')
                || COALESCE(' ' || (g.addy).stateabbrev, '')
                || COALESCE(' ' || (g.addy).zip, ''),
            ST_SetSRID(g.geomout, 4326)::geography
        )
        FROM (
            SELECT addid, (gc).rating, (gc).addy, (gc).geomout
            FROM a
            LEFT JOIN LATERAL geocode(address, 1) AS gc ON (true)
        ) AS g
        WHERE g.addid = addr_to_geocode.addid;

        COMMIT;
    END LOOP;
    RETURN;
END;
$$;
To call the procedure, you use the new CALL statement like so:

CALL batch_geocode();
You'll notice that if you then cancel the procedure (e.g. by clicking the stop icon in pgAdmin), the geocoding that was done and committed before canceling the call is not lost, unlike it would be if this were a function.
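Because each batch is committed as it completes, you can also watch the procedure's progress from a second connection while it runs. A sketch of such a monitoring query:

```sql
-- run in another session while CALL batch_geocode() is in progress
SELECT count(*) FILTER (WHERE rating IS NOT NULL) AS geocoded,
       count(*) FILTER (WHERE rating IS NULL) AS remaining
FROM addr_to_geocode;
```

Rerunning it shows the geocoded count climbing batch by batch, which would not happen with a function since nothing would be visible until the whole transaction committed.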
Unpivoting data using JSON functions
One of the common cases where we use JSON support is what we call UNPIVOTING data. We demonstrated this in our Postgres Vision 2018 presentation in slide 23. This trick won't work in other relational databases that support JSON, because it also relies on a long-standing PostgreSQL feature: the ability to treat a row as a single data value.
Unpivoting data is the process of converting columns to rows. It's the reverse of a cross tabulation (aka PIVOT). So why would you ever need to unpivot data?
Take for example a spreadsheet where your HR person decided to input each employee's hours in columns, where each column represents a separate week.
The spreadsheet data might look something like this:
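The original table didn't survive in this excerpt, but judging from the query that follows, hours_raw has an Employee column plus one column per week-ending date. A hedged sketch, with hypothetical dates and hour values:

```sql
-- hypothetical reconstruction of the spreadsheet-style source table
CREATE TABLE hours_raw(
    "Employee" varchar(100),
    "2018-06-08" numeric,
    "2018-06-15" numeric
);
INSERT INTO hours_raw VALUES
    ('Regina', 40, 36),
    ('Leo', 35, 40);
```

Each new pay period would force a new column onto this table, which is exactly the problem unpivoting solves.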
If you are using a spreadsheet as your data entry and presentation tool, this is a great setup. For storing the data in a relational database and doing arbitrary date summary aggregations, though, it's easier to work with the data in a format such as below: it's more flexible in how you roll up date ranges, and you don't have to add a new column to your table every time a new week comes along.
Note you can accomplish this trick using the hstore extension as well, as we covered a while back in Unpivoting data in PostgreSQL. Since JSON/JSONB functions are built in, no CREATE EXTENSION is required to use them.
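The target table for the unpivoted data isn't shown in this excerpt; given the column list in the INSERT that follows, it presumably looks something like this sketch:

```sql
-- hypothetical target table for the unpivoted data
CREATE TABLE staff_hours(
    employee varchar(100),
    ppe date,        -- pay period ending
    hours numeric,
    PRIMARY KEY (employee, ppe)  -- assumed; one row per employee per period
);
```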
INSERT INTO staff_hours(employee, ppe, hours)
SELECT hours_raw."Employee" AS employee,
    j.key::date AS ppe,
    j.val::numeric AS hours
FROM hours_raw,
    LATERAL jsonb_each_text(to_jsonb(hours_raw)) AS j(key, val)
WHERE j.key NOT IN ('Employee');
The above query is composed of 4 steps:
- A row in a query (not just in a table, though we are only showing a table here) can be output as a single composite value. In this case we take our hours_raw rows, as elements of type hours_raw, and feed them to the to_jsonb function.
- to_jsonb (and its companion to_json) can convert any element (including row elements) into a JSON representation.
- The jsonb_each_text function returns a set of key/value pairs. When used in conjunction with LATERAL (which is the default for set-returning functions in FROM), it generates a new set of rows for each row of hours_raw.
- For this example we assume all columns except Employee are hour columns whose headers are pay period ending (PPE) dates. We only want to return the key/value pairs that do not correspond to the employee name.
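To illustrate the flexibility claim, once the data sits in an unpivoted staff_hours table (assuming the employee, ppe, hours columns from the INSERT above), arbitrary date rollups reduce to plain aggregation:

```sql
-- roll hours up by month per employee; no schema change needed
-- as new pay periods arrive
SELECT employee,
    date_trunc('month', ppe)::date AS month,
    sum(hours) AS total_hours
FROM staff_hours
GROUP BY employee, date_trunc('month', ppe)
ORDER BY employee, month;
```

Rolling up by quarter or year is just a different argument to date_trunc, whereas the spreadsheet layout would require summing an ever-growing list of columns.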
Using PostgreSQL Contribs
FDWs for PostgreSQL Windows 32 and 64-bit
We've updated our binaries for PostgreSQL 10 on Windows, both 32 and 64-bit. The 64-bit builds should work fine with EnterpriseDb Windows as well as BigSQL.
Note this package does not include our favorite FDW ogr_fdw since that is packaged as part of PostGIS packages from EnterpriseDb Stackbuilder and also BigSQL distributions.
If you do not have PostGIS installed (and don't want to for some reason) and want to use ogr_fdw on Windows, you can download it from Winnie's PG 10 Extras. ogr_fdw is a great FDW for querying not just spatial data but also a ton of other file formats and relational sources, since spatial formats are a superset of these.
These packages contain the following FDWs:
- odbc_fdw (recently updated to work with PostgreSQL 10)
- file_textarray_fdw, great and fast for working with arbitrary and pesky delimited data, especially where they weren't considerate enough to give you the same number of columns per row.
- WWW FDW exposes web services as queryable tables. I'll note we haven't really used this since 9.4, because in 9.5 and beyond we had some random crashes we haven't investigated. That said, the http extension (Windows binaries here) is a better alternative, especially for newer PostgreSQL. We'll do another article soon on the new features of the http extension, which is pretty sweet if you have a need to use web services (even ones that require authentication). We've used it for pushing our client data to SalesForce as well as posting and pulling down files from other third-party sources our clients are subscribed to. Both the http extension and www_fdw rely on Curl for their heavy lifting.
PostgresVision 2018 Slides and Impressions
Leo and I attended PostgresVision 2018 which ended a couple of days ago.
Unfortunately there are no slides of the pgRouting part, except the one that says PGRouting Live Demos because Leo will only do live demos. He has no fear of his demos not working.
Side note: if you are on Windows and use the PostGIS Bundle, all the extensions listed in the PostGIS box of the spatial extensions diagram, as well as pointcloud, pgRouting, and ogr_fdw, are included in the bundle.
Postgres Vision 2018 was surprisingly good. We feared the talks would be mostly geared toward upper management folks, but there turned out to be quite a few very technical ones to chew on as well, such as sharding, JIT, WAL, and ZHeap to name some. Though Bruce Momjian's talk Why Postgres will live forever wasn't that technical, it was the best we saw. It was inspiring, comical, and full of computer history tidbits. I love computer history.
Most exciting, I got to meet fellow package maintainer Devrim Gündüz for the first time. Devrim manages yum.postgresql.org and as such does packaging for PostGIS, in addition to the hundreds of other packages he maintains. I only do packaging for the PostGIS bundle for Windows, and even that feels like a lot.