Table Of Contents
From the Editors: PostGIS 1.3.6 is out and new upcoming PostGIS book
PostgreSQL Q & A: Creating instance of custom type Intermediate
PostgreSQL Q & A: Restore of functional indexes gotcha Intermediate
Basics: Running totals and sums using PostgreSQL 8.4 Windowing functions Intermediate
Basics: Planner Statistics Beginner
From the Editors
PostGIS 1.3.6 is out and new upcoming PostGIS book
This month is jam packed with a lot of PostGIS news.
PostGIS 1.3.6 is out
PostGIS 1.3.6 has been released. It is mostly a bug-fix release and is the first PostGIS release that can be compiled under PostgreSQL 8.4 beta. Details can be found in the PostGIS 1.3.6 release notes. We don't have Windows binaries ready yet, but expect to see those in the next week or so.
We are writing a PostGIS Book
Leo and I have been writing PostGIS in Action for the past couple of months, and now that it is finally listed on the Manning website, we can talk about it. We are working on chapter 4 right now. If you are interested in learning PostGIS, check it out. The first chapter is free, and with the Manning Early Access Program (MEAP), you can purchase the book now and have great influence on the direction of the book.
The book starts off with what we hope is a gentle introduction to Open Geospatial Consortium (OGC) spatial databases and concepts in general, and PostgreSQL/PostGIS in particular. As we move further into the book, we cover more advanced ground. We plan to cover some of the new PostgreSQL 8.4 features in conjunction with PostGIS, writing stored functions to solve spatial problems, and some of the other exciting new features and ancillary tools for PostGIS such as pgRouting, Tiger Geocoder, and WKT Raster.
Given all that ground, I suspect our estimate of 325 pages may be a little low when all is said and done. It is funny that when we started out, we thought to ourselves, "How can anyone fill up 325 pages?" It turns out: very easily, especially once you start throwing in diagrams and pictures to demonstrate a point. Diagrams are rather important to have when describing GIS and geometry concepts. So far it's been fun, and it has forced us to sit down and walk through all the things we took for granted and thought we understood but didn't. You realize just how little you understand when you try to explain something to someone who really doesn't understand. So perhaps the process of explaining is the greatest of all learning experiences.
PGCon 2009 We'll be speaking
This year we are speaking at PgCon 2009 on PostGIS. Yes, Josh Berkus asked politely, and we thought, "Why the heck not!" We are looking forward to seeing all the faces we've only read about.
We are diligently preparing our presentations hoping not to make total fools of ourselves.
On a side note -- Paul Ramsey will be giving a workshop on spatial databases at Where 2.0. As you can see, he has already started to make a fool of himself by demonstrating evil application architectures that would make any self-respecting database programmer's jaw drop.
PostGIS 1.4.0 not out yet
When will PostGIS 1.4.0 come out? I have no clue. We are still nursing this one. I guess because everyone on the PostGIS team has been so busy with other side activities or making fools of themselves. I'm hoping for late May or early June.
PostgreSQL Q & A
Creating instance of custom type Intermediate
Someone asked me this recently, and since I don't play with custom types much, I'm not sure this is the best solution. At any rate, here it goes. Note this works for PostgreSQL 8.2 and above. I'm not sure about 8.1; I assume it probably works there too.
Let us say you created a custom type something like this:
CREATE TYPE my_type1 AS
(name varchar(150),
rotation_x double precision,
rotation_y double precision,
x_pos integer,
y_pos integer
);
First observe that a ROW expression can be cast to a composite type, because all table rows are really implemented as composite types in PostgreSQL. Therefore you can create an instance of this type by doing this:
SELECT CAST(ROW('motor', 0.5, 0, 10, 11) As my_type1) As an_instance;
Note you could write it using the more abbreviated PostgreSQL-specific way, but I prefer the more verbose CAST since it exists in most relational databases:
SELECT ROW('motor', 0.5, 0, 10, 11)::my_type1 As an_instance;
an_instance
---------------------
(motor,0.5,0,10,11)
If you wanted to select an item from this beast you would do:
SELECT (CAST(ROW('motor', 0.5, 0, 10, 11) As my_type1)).name As motor_name;
motor_name
------------
motor
(1 row)
If you wanted to select all the fields of the type you would do this:
SELECT (CAST(ROW('motor', 0.5, 0, 10, 11) As my_type1)).*;
name | rotation_x | rotation_y | x_pos | y_pos
-------+------------+------------+-------+-------
motor | 0.5 | 0 | 10 | 11
Compound types
What if you had a compound type, how could you instantiate such a thing?
CREATE TYPE my_type2 AS
(compound_name varchar(150),
 right_item my_type1,
 left_item my_type1
);

SELECT CAST(ROW('superduper motor', charged, supercharged) as my_type2) As mycompound
FROM (SELECT CAST(ROW('motor1', 0.5, 0, 10, 11) As my_type1) As charged,
    CAST(ROW('motor2', 0, 0.5, 11, 12) As my_type1) As supercharged) As foo;

mycompound
------------------------------------------------------------------
("superduper motor","(motor1,0.5,0,10,11)","(motor2,0,0.5,11,12)")
(1 row)

Pulling specific elements out
SELECT ((CAST(ROW('superduper motor', charged, supercharged) as my_type2)).right_item).name As r_charged_name
FROM
(SELECT CAST(ROW('motor1', 0.5, 0, 10, 11) As my_type1) As charged,
CAST(ROW('motor2', 0, 0.5, 11, 12) As my_type1) As supercharged) As foo;
r_charged_name
------------
motor1
(1 row)
Custom Constructors, Operators for custom types
Scott Bailey created a nice custom type example demonstrating how to create custom types and operators for your custom types.
Timespan Custom Type
Simon Greener over at SpatialDbAdvisor also has some custom types up his sleeve for PostGIS work. Check out his Vectorization: Exploding a linestring or polygon into individual vectors in PostGIS.
PostgreSQL Q & A
Restore of functional indexes gotcha Intermediate
This has been bugging me for a long time, and I finally complained about it. Tom Lane kindly gave a reason for the problem: it's by design, not a bug.
So I thought I would post the situation here, without getting into too many embarrassing specifics, in case others have suffered a similar fate and can learn from this.
The situation:
- You create a function, let's call it myniftyfunc(), in the public schema.
- Then you create another function that depends on myniftyfunc(), let's call it mysuperniftyfunc(), also in the public schema.
- Then, because your function is such a super nifty function, you decide to create a functional index with that super function on your table that sits in the mysuperdata schema -- mysuperdata.mysupertable.
Your super nifty function is doing its thing; your table is happy; the planner is spitting out your queries lightning fast using the super nifty index on your super table; The world is good.
One day you decide to restore your nifty database backup, and to your chagrin, your nifty index is not there. The planner is no longer happily spitting out your queries lightning fast, and everything has come to a painful crawl. Your super nifty index is gone. What happened to the super nifty functional index?
I have to admit that I'm the type of person who assumes the public schema is always there and always in the search_path -- and that assumption is a flawed one. After all, the public schema is there by default on new databases for convenience, but one can remove it from the search_path, and in fact pg_dump does just that. So if everything you have is kept in the public schema, you don't run into this particular misfortune. If, however, you have your functions in public and your tables in different schemas, then during restore the search_path is changed to the schema being restored, and your super functional indexes based on super functions that depend on other super functions fail because public is no longer in the search_path.
Below is a simple script to recreate the issue so it's clear:
CREATE DATABASE superdata;
CREATE OR REPLACE FUNCTION myniftyfunc(myint integer) RETURNS integer AS
$$ SELECT 1 + $1;$$
LANGUAGE 'sql' IMMUTABLE;
CREATE OR REPLACE FUNCTION mysuperniftyfunc(myint integer) RETURNS integer AS
$$ SELECT myniftyfunc($1); $$
LANGUAGE 'sql' IMMUTABLE;
CREATE SCHEMA mysuperdata;
CREATE TABLE mysuperdata.mysupertable(sid integer PRIMARY KEY, super_key integer);
CREATE INDEX idx_mysupertable_super_index
ON mysuperdata.mysupertable USING btree (mysuperniftyfunc(super_key));
INSERT INTO mysuperdata.mysupertable(sid,super_key)
VALUES(1,1);
--Backup superdata
"C:\Program files\postgresql\8.3\bin\pg_dump" --host=localhost --port=5432 --username=postgres --format=plain --verbose --file="C:\superduper.sql" superdata
--Restore
"C:\Program files\postgresql\8.3\bin\psql" -U postgres -h localhost -p 5432 -d superdata2 -f "C:\superduper.sql"
--Get non-super cool error
psql:C:/superduper.sql:99: ERROR: function myniftyfunc(integer) does not exist
LINE 1: SELECT myniftyfunc($1);
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
QUERY: SELECT myniftyfunc($1);
CONTEXT: SQL function "mysuperniftyfunc" during startup
Normally I do my backup in compressed format, but did it in plain to demonstrate the problem and here is what pg_dump produces.
--
-- PostgreSQL database dump
--
-- Started on 2009-06-18 21:45:59
SET statement_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = off;
SET check_function_bodies = false;
SET client_min_messages = warning;
SET escape_string_warning = off;
--
-- TOC entry 6 (class 2615 OID 1086067)
-- Name: mysuperdata; Type: SCHEMA; Schema: -; Owner: postgres
--
CREATE SCHEMA mysuperdata;
ALTER SCHEMA mysuperdata OWNER TO postgres;
SET search_path = public, pg_catalog;
--
-- TOC entry 21 (class 1255 OID 1086065)
-- Dependencies: 3
-- Name: myniftyfunc(integer); Type: FUNCTION; Schema: public; Owner: postgres
--
CREATE FUNCTION myniftyfunc(myint integer) RETURNS integer
LANGUAGE sql IMMUTABLE
AS $_$ SELECT 1 + $1;$_$;
ALTER FUNCTION public.myniftyfunc(myint integer) OWNER TO postgres;
--
-- TOC entry 22 (class 1255 OID 1086066)
-- Dependencies: 3
-- Name: mysuperniftyfunc(integer); Type: FUNCTION; Schema: public; Owner: postgres
--
CREATE FUNCTION mysuperniftyfunc(myint integer) RETURNS integer
LANGUAGE sql IMMUTABLE
AS $_$ SELECT myniftyfunc($1); $_$;
ALTER FUNCTION public.mysuperniftyfunc(myint integer) OWNER TO postgres;
SET search_path = mysuperdata, pg_catalog;
-- this is not a bug, but it would be super cool if public (or whatever the
-- default search path of the database was) were in here.
SET default_tablespace = '';
SET default_with_oids = false;
--
-- TOC entry 1465 (class 1259 OID 1086068)
-- Dependencies: 6
-- Name: mysupertable; Type: TABLE; Schema: mysuperdata; Owner: postgres; Tablespace:
--
CREATE TABLE mysupertable (
sid integer NOT NULL,
super_key integer
);
ALTER TABLE mysuperdata.mysupertable OWNER TO postgres;
--
-- TOC entry 1735 (class 0 OID 1086068)
-- Dependencies: 1465
-- Data for Name: mysupertable; Type: TABLE DATA; Schema: mysuperdata; Owner: postgres
--
COPY mysupertable (sid, super_key) FROM stdin;
1 1
\.
--
-- TOC entry 1734 (class 2606 OID 1086072)
-- Dependencies: 1465 1465
-- Name: mysupertable_pkey; Type: CONSTRAINT; Schema: mysuperdata; Owner: postgres; Tablespace:
--
ALTER TABLE ONLY mysupertable
ADD CONSTRAINT mysupertable_pkey PRIMARY KEY (sid);
--
-- TOC entry 1732 (class 1259 OID 1086073)
-- Dependencies: 22 1465
-- Name: idx_mysupertable_super_index; Type: INDEX; Schema: mysuperdata; Owner: postgres; Tablespace:
--
CREATE INDEX idx_mysupertable_super_index ON mysupertable USING btree (public.mysuperniftyfunc(super_key));
--
-- TOC entry 1740 (class 0 OID 0)
-- Dependencies: 3
-- Name: public; Type: ACL; Schema: -; Owner: postgres
--
REVOKE ALL ON SCHEMA public FROM PUBLIC;
REVOKE ALL ON SCHEMA public FROM postgres;
GRANT ALL ON SCHEMA public TO postgres;
GRANT ALL ON SCHEMA public TO PUBLIC;
-- Completed on 2009-06-18 21:45:59
--
-- PostgreSQL database dump complete
--
Solution (workaround):
The workaround for this situation is either to explicitly schema-qualify the functions you call from within other functions, or -- for PostgreSQL 8.3 and up -- to attach a setting to your function with ALTER FUNCTION ... SET search_path = public.
Solution 1:

CREATE OR REPLACE FUNCTION mysuperniftyfunc(myint integer) RETURNS integer AS
$$ SELECT public.myniftyfunc($1); $$
LANGUAGE 'sql' IMMUTABLE;

Solution 2: (only works for PostgreSQL 8.3+)

CREATE OR REPLACE FUNCTION mysuperniftyfunc(myint integer) RETURNS integer AS
$$ SELECT myniftyfunc($1); $$
LANGUAGE 'sql' IMMUTABLE;

ALTER FUNCTION mysuperniftyfunc(integer) SET search_path=public;
Of course neither of these solutions is particularly satisfying if you are a package author -- and that is how this mess started. You want people to be able to install your functions in whatever schema they like, and if they want to use them globally, they can add that schema to their database's default search path. Though that is arguable; perhaps all packages should live in specifically named schemas.
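As a sketch of that named-schema approach (the schema name niftypkg here is made up for illustration), the package installs its functions into its own schema and users opt in via the database's default search_path:

```
-- hypothetical: install the package's functions in a dedicated schema
CREATE SCHEMA niftypkg;

CREATE OR REPLACE FUNCTION niftypkg.myniftyfunc(myint integer) RETURNS integer AS
$$ SELECT 1 + $1; $$
LANGUAGE 'sql' IMMUTABLE;

-- users who want the functions available unqualified add the schema
-- to the database's default search_path
ALTER DATABASE superdata SET search_path = "$user", public, niftypkg;
```

Calls inside the package's own functions would still need to be schema-qualified (or use a per-function search_path) to survive a restore, for the same reason described above.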
Basics
Running totals and sums using PostgreSQL 8.4 Windowing functions Intermediate
One thing that is pretty neat about windowing functions in PostgreSQL 8.4 -- aside from the built-in windowing functions (row_number(), rank(), lead(), lag(), dense_rank(), percent_rank(), cume_dist(), first_value(), last_value(), nth_value()) documented in the manual under Windowing Functions -- is that you can use windows with most aggregate functions (built-in or custom defined), as well as define your own specific windowing functions. In a later article, we'll demonstrate creating custom windowing functions.
In our PGCon 2009 PostGIS presentation one of the last slides demonstrates using lead() and lag() windowing functions to show a family's income level in the same record with the income levels of the next door neighbors in the fictitious town we created. This is not terribly useful unless you live in a somewhat dysfunctional neighborhood where everyone is concerned about how rich their neighbors are compared to themselves. Our town was very dysfunctional but mostly geographically dysfunctional. We will have much more useful use cases of this as applied to GIS in our upcoming PostGIS in Action book.
Hitoshi Harada and David Fetter did a presentation of this in PGCon 2009 which sadly we missed since we were giving our own presentation. Check out the PGCon2009 PostgreSQL 8.4 Windowing Functions Video. Also check out the slides at Introducing Windowing Functions.
Those who have used SQL Server 2005+, Oracle, or IBM DB2 are probably familiar with, or have run into, examples of windowing functions in those products. Windowing in PostgreSQL 8.4 works more or less the same way. In a prior article, we demonstrated how to return running totals and sums using rudimentary SQL. To pre-celebrate the imminent arrival of PostgreSQL 8.4 and the current PostgreSQL 8.4 beta 2 release, we shall demonstrate the same exercise using the new ANSI SQL:2003 windowing functionality built into the upcoming PostgreSQL 8.4.
--First create our table and dataset
CREATE TABLE orders(order_id serial PRIMARY KEY, customer_id integer,
order_datetime timestamp, order_total numeric(10,2));
INSERT INTO orders(customer_id, order_datetime, order_total)
VALUES (1,'2009-05-01 10:00 AM', 500),
(1,'2009-05-15 11:00 AM', 650),
(2,'2009-05-11 11:00 PM', 100),
(2,'2009-05-12 11:00 PM', 5),
(3,'2009-04-11 11:00 PM', 100),
(1,'2009-05-20 11:00 AM', 3);
Recall in our previous example we wanted our running total column to include the total of all purchases made by the customer prior to the current purchase. We accomplished this with a self join.
--non windowing version
SELECT n.customer_id, n.order_id, n.order_total,
COALESCE(SUM(o.order_total),0) As past_order_total
FROM orders AS n LEFT JOIN orders AS o
ON (o.customer_id = n.customer_id
AND n.order_datetime > o.order_datetime)
GROUP BY n.customer_id, n.order_datetime, n.order_id, n.order_total
ORDER BY n.customer_id, n.order_datetime, n.order_id;
customer_id | order_id | order_total | past_order_total
-------------+----------+-------------+---------------
1 | 7 | 500.00 | 0
1 | 8 | 650.00 | 500.00
1 | 12 | 3.00 | 1150.00
2 | 9 | 100.00 | 0
2 | 10 | 5.00 | 100.00
3 | 11 | 100.00 | 0
(6 rows)
The above shows an approach to writing running sums using self-joins; if you do not have windowing functions, this is a common approach. With windowing functions, you have the option of doing the following. Note in our example we want to exclude the current order from our running total sum.
SELECT n.customer_id, n.order_id, n.order_total,
SUM(n.order_total)
OVER (PARTITION BY n.customer_id
ORDER BY n.order_datetime) - n.order_total As past_order_total
FROM orders AS n
ORDER BY n.customer_id, n.order_datetime, n.order_id;
customer_id | order_id | order_total | past_order_total
-------------+----------+-------------+------------------
1 | 7 | 500.00 | 0.00
1 | 8 | 650.00 | 500.00
1 | 12 | 3.00 | 1150.00
2 | 9 | 100.00 | 0.00
2 | 10 | 5.00 | 100.00
3 | 11 | 100.00 | 0.00
An equivalent statement to the above that uses pure windowing with no subtraction would probably use something like a RANGE BETWEEN x PRECEDING AND y FOLLOWING clause. Unfortunately PostgreSQL 8.4 doesn't appear to support the Oracle-like:
BETWEEN x PRECEDING AND y FOLLOWING
BETWEEN x PRECEDING AND y PRECEDING
as described in Oracle Windowing functions. Therefore we had to subtract n.order_total to achieve more or less the same result as our previous version. Also note we are making the assumption in our prior version that a given customer will not place 2 orders at the same time. Otherwise the results would differ between the two.
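For reference, in databases that do support offset frames (and, we would hope, a future PostgreSQL release), the subtraction could be avoided with a frame that simply stops one row before the current one. This is a sketch only -- PostgreSQL 8.4 will reject the frame clause:

```
SELECT n.customer_id, n.order_id, n.order_total,
    COALESCE(SUM(n.order_total)
        OVER (PARTITION BY n.customer_id
              ORDER BY n.order_datetime
              ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) As past_order_total
FROM orders AS n
ORDER BY n.customer_id, n.order_datetime, n.order_id;
```

The COALESCE is needed because the first row of each partition has an empty frame and the SUM comes back NULL.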
Named windows
One other nice feature of window functions is that you can name windows and reuse them, have windowing different from your basic sort order, and have multiple windowing function calls in the same query. Below is a demonstration where we sequentially number our orders by order_datetime in descending order, and then reuse a named window to get a per-customer partitioning such that the row_number count restarts for each customer by ascending time.
--multiple windows and multiple window function calls
SELECT row_number() OVER(ORDER BY order_datetime DESC) As rtime_d,
n.customer_id, row_number() OVER(window_custtime) As cr_num, n.order_id, n.order_total,
SUM(n.order_total) OVER (window_custtime) - n.order_total As past_order_total
FROM orders AS n
WINDOW window_custtime AS (PARTITION BY n.customer_id
ORDER BY n.order_datetime)
ORDER BY n.order_datetime, n.order_id, n.customer_id;
rtime_d | customer_id | cr_num | order_id | order_total | past_order_total
---------+-------------+--------+----------+-------------+------------------
6 | 3 | 1 | 11 | 100.00 | 0.00
5 | 1 | 1 | 7 | 500.00 | 0.00
4 | 2 | 1 | 9 | 100.00 | 0.00
3 | 2 | 2 | 10 | 5.00 | 100.00
2 | 1 | 2 | 8 | 650.00 | 500.00
1 | 1 | 3 | 12 | 3.00 | 1150.00
(6 rows)
Below is the PgAdmin graphical explain plan from PgAdmin III 1.10 Beta 3.
Observe that reusing the same window results in only one WindowAgg step, which is a bit clearer in the next example.
--using same window twice results in only one windowagg call
SELECT row_number() OVER(window_custtime) As rtime_d,
n.customer_id, lead(order_id) OVER(window_custtime) As cr_num, n.order_id, n.order_total
FROM orders AS n
WINDOW window_custtime AS (PARTITION BY n.customer_id
ORDER BY n.order_datetime)
ORDER BY n.order_datetime, n.order_id, n.customer_id;
rtime_d | customer_id | cr_num | order_id | order_total
---------+-------------+--------+----------+-------------
1 | 3 | | 5 | 100.00
1 | 1 | 2 | 1 | 500.00
1 | 2 | 4 | 3 | 100.00
2 | 2 | | 4 | 5.00
2 | 1 | 6 | 2 | 650.00
3 | 1 | | 6 | 3.00
(6 rows)
Basics
Planner Statistics Beginner
You'll often hear the term planner statistics thrown around by database geeks: "Did you update your statistics?" This lingo isn't even limited to PostgreSQL, but is part and parcel of how most decent databases work. For example, in PostgreSQL you do a VACUUM ANALYZE to update your planner statistics in addition to cleaning up dead space. In SQL Server you do an UPDATE STATISTICS. In MySQL you do an ANALYZE TABLE or a more invasive OPTIMIZE TABLE.
Normally all this "update your stats so your planner can be happy" business is unnecessary unless you just did a bulk load or a bulk delete, or you notice your queries suddenly slowing down. These stats are generally updated behind the scenes by most databases on an as-needed basis.
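For example, after a bulk load you can refresh the statistics by hand (we are reusing the mysuperdata.mysupertable table from the earlier article purely for illustration):

```
-- refresh planner statistics for one table after a bulk load
ANALYZE mysuperdata.mysupertable;

-- or reclaim dead space and refresh statistics in one pass
VACUUM ANALYZE mysuperdata.mysupertable;
```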
What makes SQL really interesting and a bit different from procedural languages is that it is declarative (like functional and logical programming languages) and relies on the database planner to come up with strategies for navigating the data. Its strategy is not fixed as it is in procedural languages. A big part of this strategy is decided by the query planner, which looks at distributions of data. Given different WHERE conditions for similar queries, it could come up with vastly different strategies if one value has a significantly higher distribution in a table than another. This is also the mystery of why it sometimes refuses to use an index on a field because it has decided a table scan is more efficient, and also why some people consider HINTS evil, because they pollute the declarative nature of the language.
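You can watch the planner make these value-dependent choices with EXPLAIN. Using the assets.land table described below, and assuming (our assumption, not shown in this article) a btree index exists on land_type, a common value may get a sequential scan while a rare one may use the index:

```
-- assumption: an index such as this exists
-- CREATE INDEX idx_land_land_type ON assets.land USING btree (land_type);

EXPLAIN SELECT * FROM assets.land WHERE land_type = 'park';        -- common value: may seq scan
EXPLAIN SELECT * FROM assets.land WHERE land_type = 'vacant land'; -- rare value: may use the index
```

The plans differ only because pg_stats tells the planner that 'park' occurs in roughly 10% of rows while 'vacant land' is far rarer.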
So what do these stat things look like exactly? Well, you can see them in PostgreSQL by interrogating the pg_stats view. The pg_stats view exists for all currently supported versions of PostgreSQL (7.4+); I'm not sure if it existed in prior versions.
Details of the fields are here: pg_stats fields
Below is a sample query from one of our test databases and the output.
SELECT attname As colname, n_distinct,
array_to_string(most_common_vals, E'\n') AS common_vals,
array_to_string(most_common_freqs, E'\n') As dist_freq
FROM pg_stats
WHERE schemaname = 'assets' and tablename = 'land';
colname | n_distinct | common_vals | dist_freq
-----------------+------------+-------------------+-----------
pid | -1 | |
land_name | -1 | |
land_type | 13 | park | 0.104587
: college : 0.0990826
: highschool : 0.0899083
: hospital : 0.0899083
: 2 family : 0.0862385
: 3 family : 0.0825688
: elementary school : 0.0788991
: commercial : 0.0770642
: government : 0.0752294
: condo : 0.0697248
: police station : 0.0623853
: 1 family : 0.0458716
: vacant land : 0.0385321
the_geom | -1 | |
land_type_other | -0.166972 | |
(5 rows)
So the above tells us that land_type has 13 distinct values, with park the most common value in this table (about 10 percent of the sampled data), and that pid, the_geom and land_name are fairly distinct.
Keep in mind the sample may not include all the records (in general it does not), and sampling can be indirectly controlled per column with ALTER TABLE tablename ALTER COLUMN column_name SET STATISTICS.
For columns with no specific sampling set, you can check the default with:
show default_statistics_target;
The statistics setting controls the number of items that can be stored in the most_common_vals, most_common_freqs, and histogram_bounds arrays for each column, and raising it indirectly increases the sampling of records. Prior to 8.4 this defaulted to 10, though it can be increased by changing the postgresql.conf file. In 8.4 the default setting was increased to 100.
Observe what happens to our data set when we raise the statistics target as high as it will go for land_type_other and then reanalyze.
ALTER TABLE assets.land
ALTER COLUMN land_type_other SET STATISTICS 10000;
vacuum analyze assets.land;
SELECT attname As colname, n_distinct,
array_to_string(most_common_vals, E'\n') AS common_vals,
array_to_string(most_common_freqs, E'\n') As dist_freq
FROM pg_stats
WHERE schemaname = 'assets' and tablename = 'land';
colname | n_distinct | common_vals | dist_freq
-----------------+------------+----------------------------------+-------------
land_type | 13 | park | 0.104587
: college : 0.0990826
the_geom | -1 | |
land_type_other | -0.166972 | {"3 family",college} | 0.00550459
: {condo,park} : 0.00550459
: {highschool,"police station"} : 0.00550459
: {"1 family",government,hospital} : 0.00366972
: {"1 family",hospital} : 0.00366972
: {"2 family","police station"} : 0.00366972
: {college,commercial} : 0.00366972
: {college,government} : 0.00366972
: {college,"police station"} : 0.00366972
: {commercial,condo} : 0.00366972
: {government} : 0.00366972
: {highschool,park} : 0.00366972
: {hospital,park} : 0.00366972
: {park,"police station"} : 0.00366972
land_name | -1 | |
pid | -1 | |