A while back, we discussed using pgAdmin pgScript as a quicky way for
running a repetitive update script where you want each loop to commit right away. Since stored functions have to commit as a whole, you can't use stored functions alone for this kind of processing.
Question: Can you do similar easily with psql?
Answer: yes with the
\watch command described nicely by Michael Paquier a while back.
If you are using the psql client packaged with PostgreSQL 9.3 or above,
then you can take advantage of the \watch
command that was introduced in that version of psql. We'll demonstrate that
by doing a batch geocoding exercise with PostGIS tiger geocoder and also revise our example from the prior article to use the more efficient and terser LATERAL construct introduced in PostgreSQL 9.3.
Coming in PostgreSQL 9.4 are dynamic background workers which will provide more options for doing batch processing directly in the database. That unfortunately will require more coding in C and permissions to install custom built extensions.
Our test dataset
Our test dataset is small so not very interesting. In normal case you'd have like 10,000 or more addresses you are geocoding.
DROP TABLE IF EXISTS addr_to_geocode ;
CREATE TABLE addr_to_geocode(addid serial PRIMARY KEY, address text,
pt geography(point, 4326), new_address text, rating integer);
INSERT INTO addr_to_geocode(address)
VALUES ('529 Main Street, Boston MA, 02129'),
('77 Massachusetts Avenue, Cambridge, MA 02139'),
('25 Wizard of Oz, Walaford, KS 99912323'),
('26 Capen Street, Medford, MA'),
('124 Mount Auburn St, Cambridge, Massachusetts 02138'),
('950 Main Street, Worcester, MA 01610');
Using \watch to geocode
The \watch num_seconds command runs preceding statements every num_seconds. So if we wanted to geocode in batches of 500,
we'd do the below in our psql console. We also revised the example to use the number of record limit that is in the latest postgis_tiger_geocoder geocode function.
UPDATE addr_to_geocode
SET (rating, new_address, pt)
= (COALESCE(g.rating,-1),
COALESCE ((g.addy).address::text, '')
|| COALESCE(' ' || (g.addy).predirabbrev, '')
|| COALESCE(' ' || (g.addy).streetname,'')
|| ' ' || COALESCE(' ' || (g.addy).streettypeabbrev, '')
|| COALESCE(' ' || (g.addy).location || ', ', '')
|| COALESCE(' ' || (g.addy).stateabbrev, '')
|| COALESCE(' ' || (g.addy).zip, '')
,
ST_SetSRID(g.geomout,4326)::geography
)
FROM (SELECT addid, (gc).rating, (gc).addy, (gc).geomout
FROM (SELECT addid, address FROM addr_to_geocode
WHERE rating IS NULL ORDER BY addid LIMIT 500) As a
LEFT JOIN LATERAL geocode(address,1) AS gc ON (true)
) AS g
WHERE g.addid = addr_to_geocode.addid; \watch 5
The downside of the \watch is unlike the pgScript approach you can't say run X number of times, so you'll need to stop it after you have concluded all your addresses have been geocoded. That would be a really nice feature to have for future versions of \watch
. Now what are we doing in this example which may not be obvious:
- We are using ST_SetSRID to cast to wgs 84 (SRID 4326), instead of doing a ST_Transform. Reason is the tiger geocoder data is in NAD 84 long lat (4269) which is close enough to 4326 that it is good enough to assume they are the same, especially given street center-line accuracy won't be exact anyway.
- We do a LEFT JOIN instead of CROSS JOIN (aka ,) because since geocode is a set returning function, if no records are returned, we'd lose a record, and we want to flag records with no matches with a -1 for rating and null for everything else. Using a LEFT JOIN ensures all records are returned and records with no matches return a NULL gc.
- You could leave out the word
LATERAL
entirely and in 9.3 and above things would still work fine since LATERAL is an optional keyword. We like
to include because its sometimes hard visually to distinquish invalid syntax from prior versions (or unintentional usages in newer versions) without that optional keyword.
- Side note: you'll want to make sure autocommit is on in psql for batch processing. By default it is.