Table Of Contents
String Aggregation in PostgreSQL, SQL Server, and MySQL Intermediate
PL/R Part 2: Functions that take arguments and the power of aggregation Intermediate
PL/R Part 3: Sharing Functions across PL/R functions with plr_module Intermediate
Hello LibreOffice, Goodbye OpenOffice
PostgreSQL 9 High Performance Book Review
From the Editors
The State of PostGIS, Joys of Testing, and PLR the Prequel
I've always enjoyed dismantling things. Deconstruction was a good way of analyzing how things were built by cataloging all the ways I could dismantle or destroy them. I experimented with mechanical systems, electrical circuitry, chemicals, and biological systems, sometimes coming close to bodily harm. In later years I decided to play it safe and stick with programming and computer simulation as a convenient channel for my destructive pursuits. Now, getting to the point of this article.
In later articles, I'll start to demonstrate the use of PL/R, the procedural language for PostgreSQL that allows you to program functions in the statistical language and environment R. To make these examples more useful, I'll be analyzing data generated from PostGIS tests I've been working on for stress testing the upcoming PostGIS 2.0. PostGIS 2.0 is a major release and probably the most exciting one for us. Paul Ramsey recently gave a summary talk on the past, present, and future of PostGIS, The State of PostGIS, at FOSS4G Japan (http://www.ustream.tv/recorded/10667125), which provides a brief glimpse of what's in store in 2.0.
Of course, the attendees at PgDay EU will no doubt hear an earful of it from fellow PostGIS core development team members Mark Cave-Ayland and Olivier Courtin in their talk PostGIS 1.5 and beyond: a technical perspective. You won't want to miss the talk Discover PostGIS: GIS for PostgreSQL by Vincent Picavet or any of Mark's tutorials, Intro to PostGIS and Advanced PostGIS. If that were not enough, there are even more PostGIS talks at PgDay Europe: PostGIS, OpenStreetMap, OpenLayers by Hartmut Holzgraefe and PostGIS - das Wo? in der Datenbank (PostGIS - the where? in the database) by Stefan Keller.
We are in the middle of gutting a lot of the PostGIS internal code structure and changing the on-disk format. In addition to significant code refactoring, PostGIS 2.0 introduces raster support and true 3D support.
I don't like writing unit tests. Unit tests are useful, but too tedious to develop to provide satisfactory code coverage. In PostGIS we have unit tests built with CUnit, which many PostGIS development team members take a great deal of pride and joy in building. We also have various PostgreSQL-specific tests, many of which were built from past bug ticket reports. Such things bore me to tears. I have a different dream. I dream of a world that allows for a more automated way of determining incorrectness. A testing engine that can build itself and find its way. A testing machine that doesn't need as much spoon feeding as unit tests do.
My general philosophy of life is to always try to kill at least two birds with every stone you throw. I write a lot of the PostGIS official documentation, but documentation would be a bit of a waste if it existed merely for reading by humans. Around the time we were developing PostGIS 1.4, I wrote an XSL file that takes our PostGIS DocBook documentation and converts it to a merciless battery of SQL tests, by taking a permutation of every function we have documented and a reference example of every kind of geometry we support and cross joining them to create a lot of spatial SQL query statements. A damn lot of valid SQL statements that are otherwise nonsensical. Basically, putting the rules of PostGIS and PostgreSQL grammar together with a population of arguments to formulate valid gibberish. I call this a machine gun test because its main intent is to wreak carnage on PostGIS and find gaping holes in its armor. It provided much better code coverage than our unit tests, but was childish in its questions -- asking what the length of a point is, or what happens if you try to use ST_MakeLine with a polygon and a linestring. Childish is not a bad thing, since a lot of PostGIS users ask a lot of childish questions, so our system needs to handle them without crumbling. It was pretty good at detecting crashable areas, particularly when major code changes were happening, as we are doing in PostGIS 2.0 and as we did in PostGIS 1.4.
It also ended up doing a little more than I had planned for it, like raising flags when we had documented functions that had been accidentally or intentionally removed from the code base, functions that worked with certain new geometry types by accident rather than deliberate design, or even errors in our documentation where the arguments it claimed should go into a function didn't jibe with our codebase. That little more often required some inspection of the test results.
You could use it for regression testing as well, by diffing the log outputs from two different versions of PostGIS run on the same server. Diffing two 300 MB test result files and comparing them visually was not the most productive way of spending time, though it was interesting since you could spot-check changes across various GEOS versions and so on.
In PostGIS 2.0 I started working on enhancements that would log result success, failure, and timing to a table for easier analysis. I also created an XSL for raster support that takes all the documented raster functions and creates a battery of raster tests, a la raster grammar gibberish generator :). Now I'll start analyzing these with R and along the way demonstrate how to use PL/R in PostgreSQL. If you do attend PgDay Europe and are interested in learning more about PL/R, then make sure to attend Joe Conway's Advanced analytics with PL/R.
PostgreSQL Q & A
greatest and least - oldies but goodies Beginner
Problem: You have a set of numbers, or characters, or whatever, and you are trying to find the max or min of the set.
If the values are separate records in a table or query, the answer is well known and respected across all relational databases -- use the aggregate MAX and MIN functions.
But what if you have a set of free-wheeling numbers or text that are not in separate records, and you want the max or min of them? Here is where the least and greatest functions come in handy.
PostgreSQL has had these functions for as far back as I can remember and is not the only database to sport these marvelous functions. Our beloved MySQL and Oracle databases have these functions as well. Sadly, our more beloved SQL Server -- even in the SQL Server 2008 variant -- lacks these functions.
Okay, how do you use these functions? Like this:
SELECT least(1,-2,5) As num_least, greatest('Bobby', 'Catty', 'Kitty') As greatest_cat;
Yields:
 num_least | greatest_cat
-----------+--------------
        -2 | Kitty
We would classify these functions along the lines of COALESCE. Like COALESCE, they take an arbitrary number of arguments, and the datatype returned is the highest datatype that all arguments in the function can be autocast to. If there is no such autocast, then you get an error. To demonstrate, guess what happens when you do this:
SELECT least(-1, 'Kitty');
Do this in PostgreSQL, at least in 8.3+, and you get a nice slap if you haven't installed any deprecated autocasts:
ERROR: invalid input syntax for integer: "Kitty"
LINE 1: SELECT least('Kitty', -1)
Do this in MySQL -- so friendly and forgiving, and a great reader of minds -- and you get:
-1
OF COURSE
I apologize for the ambiguous sarcasm; it's just that sometimes I want my mind read and sometimes I don't, and I just can't figure out whether today is one of those days or the other.
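To round out the picture, here is a small sketch of the autocast working in your favor (our own example, not part of the original Q & A): mixing an integer with a numeric promotes the result to numeric, and unknown string literals settle on text.
SELECT least(1, 2.5) As least_num,
    greatest('apple', 'banana', 'cherry') As greatest_fruit;
Yields:
 least_num | greatest_fruit
-----------+----------------
         1 | cherry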
PostgreSQL Q & A
String Aggregation in PostgreSQL, SQL Server, and MySQL Intermediate
Question: You have a table of people and a table that specifies the activities each person is involved in. You want to return a result that has one record per person and a column that has a listing of activities for each person separated by semicolons and alphabetically sorted by activity. You also want the whole set alphabetically sorted by person's name.
This is a question we are always asked, and since we mentor on various flavors of databases, we need to be able to switch gears and provide an answer that works on the client's database. Most often there is the additional requirement that you can't install new functions in the database. This means that for PostgreSQL and SQL Server, which both support defining custom aggregates, that option is out.
Normally we try to come up with an answer that works in most databases, but sadly the only solution that works in most is to push the problem off to the client front end and throw up your hands and proclaim -- "This ain't something that should be done in the database and is a reporting problem." That is in fact what many database purists do, and all I can say to them is wake up and smell the coffee before you are out of a job. We feel that data transformation is an important function of a database, and if your database is incapable of massaging the data into a format your various client apps can easily digest, WELL THAT's A PROBLEM.
We shall now document this answer rather than trying to answer it for the umpteenth time. For starters, PostgreSQL has a lot of answers to this question, probably more so than any other database, though some are easier to execute than others and many depend on the version of PostgreSQL you are using. SQL Server has two classes of answers, neither of which is terribly appealing, but we'll go over the ones that don't require you to be able to install .NET stored functions in your database, since we said that is often a requirement. MySQL has a fairly simple, elegant, and very portable way that it has had for a really long time.
Is the new way always better?
For PostgreSQL and SQL Server, we'll demonstrate more than one approach. Just because we demonstrate newer approaches doesn't make the older approaches obsolete or necessarily worse. Don't feel the need to use something new because it's new. Think of the newer approaches as just extra options that may or may not be faster for your use-case.
Test Table
Below is the script to setup our test tables. This script will work fine for all databases we will discuss here.
CREATE TABLE people(p_name varchar(75) PRIMARY KEY);
INSERT INTO people(p_name) VALUES('Jack');
INSERT INTO people(p_name) VALUES('Jill');
INSERT INTO people(p_name) VALUES('Humpty');
INSERT INTO people(p_name) VALUES('Dopey');
CREATE TABLE people_activities(p_name varchar(75) NOT NULL,
activity varchar(100) NOT NULL);
ALTER TABLE people_activities
ADD CONSTRAINT pk_people_activities
PRIMARY KEY (p_name,activity);
INSERT INTO people_activities(p_name, activity)
SELECT 'Jack', 'Hiking'
UNION ALL
SELECT 'Jack', 'Skiing'
UNION ALL
SELECT 'Jack', 'Fishing'
UNION ALL
SELECT 'Jill', 'Hiking'
UNION ALL
SELECT 'Jill', 'Bear Hunting'
UNION ALL
SELECT 'Humpty', 'Tumbling'
UNION ALL
SELECT 'Humpty', 'Seating'
UNION ALL
SELECT 'Humpty', 'Cracking';
-- we want our output to look like --
p_name | activities
--------+--------------------------------
Dopey |
Humpty | Cracking;Seating;Tumbling
Jack | Fishing;Hiking;Skiing
Jill | Bear Hunting;Hiking
The PostgreSQL ways
PostgreSQL 9.0 introduced the STRING_AGG function and also the ability to put an ORDER BY clause in any aggregate function, which made this task much easier and probably much more efficient as well. Note that in Oracle 11g Release 2, Oracle introduced a similar aggregate function called LISTAGG. Details of using LISTAGG, the older WM_CONCAT, and other string aggregation approaches for Oracle are described in Oracle String Aggregation Techniques.
-- PostgreSQL 9.0 way --
SELECT p.p_name,
STRING_AGG(a.activity, ';' ORDER BY a.activity) As activities
FROM people AS p
LEFT JOIN people_activities As a ON (p.p_name = a.p_name)
GROUP BY p.p_name
ORDER BY p.p_name;
PostgreSQL 8.4 introduced the ANSI-SQL array_agg function, which IBM DB2 also has (we are not sure about other databases). This provides another option for doing string aggregation. Without the ORDER BY feature for aggregates introduced in PostgreSQL 9.0, this option is less powerful.
-- PostgreSQL 8.4+ --
-- This you should use with caution if you care about order
-- It seems to work but well that is probably by accident rather than design
SELECT p.p_name,
array_to_string(ARRAY_AGG(a.activity),';') As activities
FROM people AS p
LEFT JOIN
(SELECT *
FROM people_activities
ORDER BY p_name, activity) As a ON (p.p_name = a.p_name)
GROUP BY p.p_name
ORDER BY p.p_name;
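If you are on PostgreSQL 9.0 but prefer to stick with array_agg, you can also combine it with the new ORDER BY aggregate syntax instead of the ordered subselect; a minimal sketch:
-- PostgreSQL 9.0 array_agg with ORDER BY --
SELECT p.p_name,
    array_to_string(array_agg(a.activity ORDER BY a.activity), ';') As activities
FROM people AS p
    LEFT JOIN people_activities As a ON (p.p_name = a.p_name)
GROUP BY p.p_name
ORDER BY p.p_name;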
Here is the PostgreSQL 8.4+ equivalent of a third option people have mentioned for SQL Server -- the recursive CTE plus window function approach.
-- This is the PostgreSQL 8.4+ equivalent to the SQL Server Window CTE approach --
WITH RECURSIVE
t1 AS (
SELECT
p.p_name,
a.activity,
ROW_NUMBER() OVER(PARTITION BY a.p_name ORDER BY a.activity) AS rn
FROM people AS p
LEFT JOIN people_activities AS a ON a.p_name = p.p_name
),
t2 AS (
SELECT
t1.p_name,
CAST(t1.activity AS varchar) AS activities,
t1.rn
FROM t1
WHERE t1.rn = 1
UNION ALL
SELECT
t2.p_name,
t2.activities || ';' || t1.activity,
t1.rn
FROM t2
INNER JOIN t1 ON
t1.p_name = t2.p_name AND
t1.rn = t2.rn + 1
)
SELECT
t2.p_name,
COALESCE(t2.activities, '')
FROM t2
INNER JOIN (
SELECT
t1.p_name,
MAX(t1.rn) AS max_rn
FROM t1
GROUP BY t1.p_name
) AS t ON
t.p_name = t2.p_name AND
t.max_rn = t2.rn
ORDER BY t2.p_name;
One thing that PostgreSQL has always had going for it, which both MySQL and SQL Server still lack, is phenomenal array support. Arrays are first-class citizens in PostgreSQL with a rich set of functions to complement them. They are extremely versatile and make creating things like aggregate functions pretty trivial. For example, in MySQL you'd have to become a C programmer to build an aggregate function, which is generally out of the reach of most users. In SQL Server you'd have to learn .NET (VB or C#) and write a non-trivial amount of code, which is much more accessible than C but not without its issues. In PostgreSQL all you need to know is SQL, and you can write a couple of lines of code to build your own custom aggregate functions. That is just SUPER COOL.
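To back up that claim, here is a minimal sketch of what a home-grown string aggregate might look like; the function and aggregate names (concat_semi, str_concat_semi) are ours for illustration, not anything shipped with PostgreSQL, and without extra work the order of items is not guaranteed:
CREATE OR REPLACE FUNCTION concat_semi(text, text) RETURNS text AS
$$ SELECT CASE WHEN $1 IS NULL OR $1 = '' THEN $2 ELSE $1 || ';' || $2 END $$
LANGUAGE sql IMMUTABLE;

CREATE AGGREGATE str_concat_semi(text) (
    SFUNC = concat_semi,
    STYPE = text
);

-- usage sketch:
-- SELECT p_name, str_concat_semi(activity) As activities
-- FROM people_activities GROUP BY p_name;
The query below goes back to the plain array approach, which works on all versions of PostgreSQL.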
-- All versions of PostgreSQL
SELECT p.p_name,
array_to_string(ARRAY(
SELECT activity
FROM people_activities AS a
WHERE a.p_name = p.p_name ORDER BY a.activity) ,';') As activities
FROM people AS p
ORDER BY p.p_name;
SQL Server Ways
SQL Server has a fairly cryptic way of doing this, which one would consider a hack that repurposes functionality (FOR XML) designed for something else. On the plus side, along the way we get to use its CROSS APPLY and OUTER APPLY functionality, which I crave a lot when working with MySQL and a little when working with PostgreSQL (PostgreSQL has similar but not quite as elegant equivalents) -- though not so much for this particular use case.
-- SQL Server 2005+
-- Note this won't work unless you also have SQL Server 2005+ compatibility mode --
-- We haven't done benchmarks to see which approach is faster.
-- use subselect - the stuff is to get rid of the leading ;
-- In theory this should be faster than subselect of SQL 2000
-- Note we use OUTER APPLY here
-- but for this particular use case CROSS APPLY would work
-- fine too since one record seems to be returned regardless
-- As a general rule of thumb you should use OUTER APPLY for LEFT JOIN-like behavior and
-- CROSS APPLY for INNER JOIN-like behavior
SELECT p.p_name,
activities
FROM people AS p OUTER APPLY
(SELECT STUFF(( SELECT ';' + a.activity
FROM people_activities AS a
WHERE a.p_name = p.p_name
ORDER BY a.activity
FOR XML PATH('') ), 1,1,'') AS activities ) AS f
ORDER BY p.p_name;
SQL Server Recursive CTE + Window
This is the Window + CTE approach that someone posted on reddit, which we revised slightly to be a bit more ANSI-SQL compliant. Note: we listed the PostgreSQL 8.4+ equivalent of this approach in the PostgreSQL section.
The reason many people prefer this approach is that it doesn't leave any XML entities you have to strip with REPLACE; e.g. if you had an & in your original text, you would have to use REPLACE(final, '&amp;', '&') to strip off the XML-ish stuff.
The other reason is people felt this was less of a hack. Note that the PostgreSQL 8.4 variant looks pretty close to the SQL Server one, since both support ANSI SQL window functions and CTEs. The main differences:
- PostgreSQL forces you to put in the RECURSIVE keyword if any of the CTE subexpressions is recursive, while SQL Server would reject that word and consider it invalid.
- SQL SERVER uses the non-ANSI SQL + operator for string concatenation while PostgreSQL uses the ANSI SQL ||
- SQL Server won't allow text + varchar concatenation, and if your varchar is unlimited, you must call it varchar(MAX). PostgreSQL assumes that if no length is specified, then it's limitless, so varchar and text mean much the same thing in PostgreSQL, and varchar(MAX) is equivalent to PostgreSQL varchar.
-- SQL Server 2005+ using Window and CTE
WITH
t1 AS (
SELECT
p.p_name,
a.activity,
ROW_NUMBER() OVER(PARTITION BY a.p_name ORDER BY a.activity) AS rn
FROM people AS p
LEFT JOIN people_activities AS a ON a.p_name = p.p_name
),
t2 AS (
SELECT
t1.p_name,
CAST(t1.activity AS varchar(MAX)) AS activities,
t1.rn
FROM t1
WHERE t1.rn = 1
UNION ALL
SELECT
t2.p_name,
t2.activities + ';' + t1.activity,
t1.rn
FROM t2
INNER JOIN t1 ON
t1.p_name = t2.p_name AND
t1.rn = t2.rn + 1
)
SELECT
t2.p_name,
COALESCE(t2.activities, '')
FROM t2
INNER JOIN (
SELECT
t1.p_name,
MAX(t1.rn) AS max_rn
FROM t1
GROUP BY t1.p_name
) AS t ON
t.p_name = t2.p_name AND
t.max_rn = t2.rn
ORDER BY t2.p_name;
This is the old tried and true way, which should work even if you are in SQL Server 2000 compatibility mode or are stuck using SQL Server 2000.
-- SQL Server 2000+
SELECT p.p_name,
    STUFF(( SELECT ';' + a.activity
        FROM people_activities AS a
        WHERE a.p_name = p.p_name
        ORDER BY a.activity
        FOR XML PATH('') ), 1,1,'') AS activities
FROM people AS p
ORDER BY p.p_name;
The MySQL way
Database purists like to make fun of MySQL, but in some ways it's way ahead of the others. Its focus has always been to make the common things easy, and as a result it was more advanced in those areas than some other databases. Some might say it catered too much to the whims of its users at the expense of doing things the right way, but that's another story. This catering is no doubt a large reason for MySQL's popularity. There is something to be said for listening to the needs of users, which I think most other databases dismiss. For example, MySQL was the first of these three databases to have a LIMIT OFFSET feature, which PostgreSQL later copied and which SQL Server (the upcoming Denali release) is finally introducing in the form of OFFSET FETCH NEXT. It was also the first to have the multi-row VALUES insert option, which was later adopted by PostgreSQL 8.2 and SQL Server 2008 and which, as far as I know, Oracle has yet to adopt. Heck, Oracle still has that arcane FROM DUAL requirement to my knowledge. How Oracle users put up with that nonsense I don't know.
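As a quick illustration of that multi-row VALUES syntax (this works in PostgreSQL 8.2+, SQL Server 2008, and MySQL), the UNION ALL insert from our setup script could be collapsed to:
INSERT INTO people_activities(p_name, activity)
VALUES ('Jack', 'Hiking'), ('Jack', 'Skiing'), ('Jack', 'Fishing'),
    ('Jill', 'Hiking'), ('Jill', 'Bear Hunting'),
    ('Humpty', 'Tumbling'), ('Humpty', 'Seating'), ('Humpty', 'Cracking');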
In addition, MySQL has had the GROUP_CONCAT aggregate function for as far back as the MySQL 3.23 days, so it's really convenient to be able to count on it in any version of MySQL. One caveat: GROUP_CONCAT truncates its result at the group_concat_max_len setting (default 1024), so for long lists you may want to raise it first:
SET SESSION group_concat_max_len = 32768;
Below is how you would use it.
-- MySQL Way --
SELECT p.p_name,
GROUP_CONCAT(a.activity ORDER BY a.activity SEPARATOR ';') As activities
FROM people AS p
LEFT JOIN people_activities As a ON (p.p_name = a.p_name)
GROUP BY p.p_name
ORDER BY p.p_name;
PL Programming
Quick Intro to R and PL/R - Part 1 Intermediate
In this article we'll provide a summary of what PL/R is and how to get running with it. Since we don't like repeating ourselves, we'll refer you to an article we wrote a while ago which is still fairly relevant today, Up and Running with PL/R (PLR) in PostgreSQL: An almost Idiot's Guide, and just fill in the parts that have changed. We should note that that particular series was geared more toward the spatial database programmer (PostGIS in particular). There is a lot of overlap between the PL/R, R, and PostGIS user bases, which include many environmental scientists and researchers in need of powerful charting and stats tools to analyze their data and who are high on the smart but low on the money end of the human spectrum.
This series will be more of a general PL/R user perspective. We'll follow more of the same style we did with Quick Intro to PL/Python. We'll end our series with a PL/R cheatsheet similar to what we had for PL/Python.
As stated in our State of PostGIS article, we'll be using log files we generated from our PostGIS stress tests. These stress tests were auto-generated from the PostGIS official documentation. The raster tests are comprised of 2,095 query executions exercising all the pixel types supported. The geometry/geography tests are comprised of 65,892 spatial SQL queries exercising every PostGIS geometry/geography type supported in PostGIS 2.0 -- yes, this includes TINs, Triangles, Polyhedral Surfaces, Curved geometries, and all dimensions of them. Most queries are unique. If you are curious to see what these log tables look like or want to follow along with these exercises, you can download the tables from here.
What is R and PL/R and why should you care?
R is both a language and an environment for doing statistics and generating graphs and plots. It is GNU-licensed and a common favorite of universities and research institutions. PL/R is a procedural language for PostgreSQL that allows you to write database stored functions in R. R is a set-based, domain-specific language similar to SQL, except that unlike the way relational databases treat data, it thinks of data as matrices, lists, and vectors. I tend to think of it as a cross between LISP and SQL, though more experienced Lisp and R users will probably disagree with me on that. This makes it easier in many cases to tabulate data both across columns as well as across rows. The examples we will show in these exercises could be done in SQL, but they are much more succinct to write in R. In addition to the language itself, there is a whole wealth of statistical and graphing functions available in R that you will not find in any relational database, and these functions keep growing as more people contribute packages. Its packaging system, the Comprehensive R Archive Network (CRAN), is similar in concept to Perl's CPAN and the in-the-works PGXN for PostgreSQL.
What do you need before you can use PL/R?
- PostgreSQL and the latest version of PL/R, at this time plr-8.3.0.11, which works on PostgreSQL 8.3-9.0
- R installed on your PostgreSQL server. Note that there are both 32-bit and 64-bit binaries available for Windows, as well as Mac OS X and several variants of Linux. You can get these from R CRAN - http://cran.r-project.org/
- PL/R handler for PostgreSQL. For Windows (both 32-bit and 64-bit), Fedora, and Debian there are installers available which you can find on http://www.joeconway.com/plr/. For others you'll have to compile from source.
NOTE: the Windows binaries were built with R 2.11.1 and the installer won't work out of the box with other versions such as the newer R 2.12, but you can use the binaries with R versions below or above 2.11.1 if you set the environment variables as we described in Up and Running with PL/R (PLR) in PostgreSQL: An almost Idiot's Guide. For Windows R 2.12 the path you need to add to your path environment variable has changed a little: it's now path to R install\bin\i386 or R\bin\x64. So, for example, if I am on 32-bit Windows I would add C:\Program Files\R\R-2.12.0\bin\i386 to my environment path variable, and my R_HOME would be C:\Program Files\R\R-2.12.0. Even if you are on 64-bit Windows, you will need to use the 32-bit package if you are using 32-bit PostgreSQL or if you want to use R packages that are not currently supported in 64-bit Windows R.
Then you run the plr.sql file, which gets installed in share/contrib, in your PostgreSQL database to enable the language in that database.
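Once plr.sql has run, a quick sanity check (our own suggestion, not part of the original guide) is to confirm the language got registered by querying the system catalog:
SELECT lanname FROM pg_language WHERE lanname = 'plr';
-- should return a single row containing plr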
Our first PL/R stored function: creating data frames and saving to RData file
In R there are five basic kinds of data structures: scalars, vectors, arrays, dataframes, and lists. Lists are the most complex of all because they are collections of the other four data structures. You can think of a PostgreSQL query output or table as a data frame. In this simple example, we'll demonstrate how to feed a PostgreSQL query into an R dataframe and save it in an R data file.
CREATE OR REPLACE FUNCTION save_postgis_logs_rdata() RETURNS text AS
$$
pggeom20log <<- pg.spi.exec("SELECT logid ,log_label,func,g1,g2,log_start,log_end,
date_part('epoch',age(log_end, log_start)) As dur_sec
FROM postgis_garden_log ORDER BY log_start")
pgrast20log <<- pg.spi.exec("SELECT logid ,log_label,func,g1,g2,log_start,log_end,
date_part('epoch',age(log_end, log_start)) As dur_sec
FROM raster_garden_log ORDER BY log_start")
save(pggeom20log,pgrast20log, file="/temp/postgisstats20.RData")
return("done")
$$
language 'plr';
The most common function used in PL/R is the pg.spi.exec routine, which takes SQL statements as input and converts their results to R dataframes. In the example above, we are dumping our geometry/geography logging test stats and our raster stats from a PostGIS 2.0 test run and saving them to two R dataframes. In addition we are saving these data frames to a custom R data file. An R data file is a file that stores R objects in native R format. It's useful for saving datasets you are working on for easy loading back.
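Once the function is created, kicking off the export is a one-liner; note that the hard-coded /temp folder in the function body has to be writable by the PostgreSQL service account:
SELECT save_postgis_logs_rdata();
-- returns 'done' and writes /temp/postgisstats20.RData on the database server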
Using R interactively
R is both a scriptable and an interactive environment and one that you can also automate a good bit of with PL/R. Before automating things, I like to test processes out interactively in R. For this next example, we'll load back the data we saved using PL/R into R and play with it.
To do so, launch R-GUI or R command line and at the prompt type:
load("/temp/postgisstats20.RData")
ls()
-- the ls output will return listing of variables we have loaded
[1] "pggeom20log" "pgrast20log"
summary(pgrast20log)
-- yields
     logid        log_label            func                g1
 Min.   :   1.0   Length:2095        Length:2095        Length:2095
 1st Qu.: 524.5   Class :character   Class :character   Class :character
 Median :1048.0   Mode  :character   Mode  :character   Mode  :character
 Mean   :1048.0
 3rd Qu.:1571.5
 Max.   :2095.0
      g2             log_start          log_end            dur_sec
 Length:2095        Length:2095        Length:2095        Min.   :0.000000
 Class :character   Class :character   Class :character   1st Qu.:0.000000
 Mode  :character   Mode  :character   Mode  :character   Median :0.000000
                                                          Mean   :0.008964
                                                          3rd Qu.:0.016000
                                                          Max.   :0.187000
So we know from the above summary that we have 2095 observations and basic stats about the average duration of each process in our raster log tests. We ran the above test on a fairly new, beefed-up Windows 7 machine and an older, clunkier Windows XP machine, both 32-bit and running PostgreSQL 9.0 / PostGIS 2.0. The Windows XP box in general followed the same pattern as the Windows 7 one, except timings were generally 50-80% worse.
One of my favorite functions in R is the table function, because it's always present and fairly fast and easy to use. To get details on the function run this command:
help(table)
Basically what it does is provide a cross tab of counts. Here is an example use where we take our raster log and output the time durations, rounding each duration to 2 decimals so that numbers like 0.11, 0.112, etc. collapse to 0.11.
table(pgrast20log$g1, round(pgrast20log$dur_sec, digits=2))
If you have a lot of variables coming from the same table or want to run a lot of R steps on the same table, you may want to write something like below, which is kind of equivalent to SELECT a, b FROM sometable; as opposed to our version above, which is more like SELECT sometable.a, sometable.b FROM sometable;. I generally like to be more explicit, especially when I have a lot of R variables floating around.
attach(pgrast20log)
table(g1, round(dur_sec, digits=2))
detach(pgrast20log)
From the excerpt below, we can tell that 64-bit floating point (64BF) raster bands have the worst performance in general of any of the raster types. To get a true estimate we would run our simulation on different platforms and many times on the same machine as well. We did run this a couple of times on our 32-bit Windows XP and Windows 7 boxes, and 64BF is consistently slightly worse than the others. Our Windows 7 is about twice as fast in general, but interestingly enough, on the Windows 7 we see some suspicious outliers for 16BUI and 1BB. We think this is just turbulence, since on other runs 64BF almost always shows as the longest running.
        0  0.02  0.03  0.05  0.06  0.08  0.09  0.11  0.16  0.19
16BSI 119    52    12     7     0     0     0     0     0     0
16BUI 122    48    18     1     0     0     0     0     1     0
1BB   121    57     8     1     1     1     0     0     0     1
2BUI  132    52     6     0     0     0     0     0     0     0
32BF  114    39    13    15     6     3     0     0     0     0
32BSI 115    41    10    17     6     0     1     0     0     0
32BUI 116    39    14    19     1     0     1     0     0     0
4BUI  130    53     7     0     0     0     0     0     0     0
64BF  113    41     1     2    10    12    10     1     0     0
8BSI  128    59     3     0     0     0     0     0     0     0
8BUI  124    59     5     2     0     0     0     0     0     0
Now we do the same exercise except flipped, to see the culprit functions. Here we just want to know, for each function, whether it runs better or worse than the 3rd quartile duration of our tests. Though this query can be done in SQL, it's much more succinct in R. Note the two spellings of ST_SetValue below; yes, our machine gun test has spotted an inconsistency in our PostGIS documentation. Shame on us.
table(pgrast20log$func, pgrast20log$dur_sec > summary(pgrast20log$dur_sec)["3rd Qu."])
                               FALSE TRUE
&&                               120    1
&<                               121    0
&>                               121    0
AddRasterColumn                    3    8
DropRasterTable                   11    0
insert data                       10    1
PostGIS_Raster_Lib_Build_Date      1    0
PostGIS_Raster_Lib_Version         1    0
ST_AddBand                        13   53
ST_AsBinary                        6    5
ST_BandHasNoDataValue             22    0
ST_BandMetaData                    7   15
ST_BandNoDataValue                22    0
ST_BandPath                       22    0
ST_BandPixelType                  22    0
ST_Box2D                          11    0
ST_ConvexHull                     11    0
ST_DumpAsPolygons                 21    1
ST_Envelope                       11    0
ST_GeoReference                   22    0
ST_Height                         11    0
ST_Intersection                  220    0
ST_Intersects                    438    2
ST_MakeEmptyRaster                14    0
ST_MetaData                       11    0
ST_NumBands                       11    0
ST_PixelAsPolygon                 18    4
ST_PixelSizeX                     11    0
ST_PixelSizeY                     11    0
ST_Polygon                        22    0
ST_Raster2WorldCoordX             22    0
ST_Raster2WorldCoordY             22    0
ST_SetBandHasNoDataValue          13    9
ST_SetBandNoDataValue             11   11
ST_SetGeoReference                11   11
ST_SetPixelSize                   13    9
ST_SetSkew                        11   11
ST_SetSRID                         6    5
ST_SetUpperLeft                    6    5
ST_SetValue                       41   36
ST_SEtValue                       31   24
ST_SkewX                          11    0
ST_SkewY                          11    0
ST_SRID                           11    0
ST_UpperLeftX                     11    0
ST_UpperLeftY                     11    0
ST_Value                         131    1
ST_Width                           7    4
ST_World2RasterCoordX             77    0
ST_World2RasterCoordY             77    0
Second PL/R stored function, creating a plain text report
When you are writing PL/R code, you are pretty much just writing R code and living in the R environment. However, there are certain functions in R that you may come to prefer when working in PL/R over the ones you would prefer in plain R. In PL/R you want data to be output in variables so you can easily return them from PL/R, and the PostgreSQL side doesn't provide a whole lot of mappings for the objects you get back from R. For our next example we want a simple text representation of the cross tabs we created, and we want it to be returnable from an SQL statement. We are going to lean on our favorite R functions for doing this; others more experienced may have better ways.
- capture.output - an R function that, instead of sending the output of a variable or command to the screen, returns to a variable what would normally be printed to the viewer's screen. So it gives us the same output as we have above. The only caveat is that it returns a string array where each element is a separate line on the screen. This works kind of like the sink function in R, except it doesn't have the messiness of needing a file to redirect screen output to.
- paste - this we commonly use to collapse output such as arrays. It has many uses, but we are using it as a reverse split function, kind of like implode in PHP.
- sprintf - does pretty much the same thing as it does in C and allows you to format text with placeholders for variables.
- subset - the subset function in R is akin to a SELECT FROM WHERE in SQL, except | means OR, & means AND, and == means =. The c() function in R is a way of creating vectors.
So now we show our reporting function which exercises all the above R functions.
CREATE OR REPLACE FUNCTION output_garden_stats() RETURNS text AS
$$
#geometry/geography stats we are only selecting tests that involve only one geometry type
pggeom20log <<- pg.spi.exec("SELECT logid ,log_label,func,g1, g2, log_start,log_end,
date_part('epoch',age(log_end, log_start)) As dur_sec
FROM postgis_garden_log ORDER BY log_start")
pgrast20log <<- pg.spi.exec("SELECT logid ,log_label,func,g1,g2,log_start,log_end,
date_part('epoch',age(log_end, log_start)) As dur_sec
FROM raster_garden_log ORDER BY log_start")
attach(pgrast20log) #make pgrast20log fields global
prsum.breakq <- summary(dur_sec)["3rd Qu."]
prsum <- table(func, dur_sec > prsum.breakq)
prdur <- table(g1, round(dur_sec,2))
detach(pgrast20log)
attach(pggeom20log) #make pggeom20log fields global
pggdur <- table(g1, round(dur_sec,1),useNA="ifany")
pgfuncdur <- table(func, round(dur_sec,1),useNA="ifany")
detach(pggeom20log) #make it not global
#create subset of data only containing CURVEPOLYGONS
pgpoly<-subset(pggeom20log, g1 == 'CURVEPOLYGON' | g2 == 'CURVEPOLYGON', select=c(func, dur_sec))
#create a cross tab count for just curvepolygon
#TRUE count for completions
#FALSE for processes that throw errors
pgpolydur <- table(pgpoly$func, !is.na(pgpoly$dur_sec))
#we use capture.output to capture the string representation of the data tables as would be shown on screen
reportoutput <- sprintf("RASTER results is duration greater than %s secs %s \r\r RASTER results duration breakout \r %s
GEOMETRY GEOGRAPHY by type %s \r GEOMETRY\GEOGRAPHY By function duration %s \r\r CURVEPOLYGON failure = FALSE, completion = TRUE by function %s", prsum.breakq,
paste(capture.output(prsum), sep ="", collapse="\r"), paste(capture.output(prdur), sep ="", collapse="\r"),
paste(capture.output(pggdur), sep ="", collapse="\r"), paste(capture.output(pgfuncdur), sep ="", collapse="\r")
, paste(capture.output(pgpolydur), sep ="", collapse="\r")
)
return(reportoutput)
$$
language 'plr';
To run our report, we run this command: SELECT output_garden_stats();
The report output can be seen at http://www.postgresonline.com/demos/plr/postgis20report.txt
There are a lot of surprising things that we discovered simply looking at counts. Some might be things in the testing scripts (or even the PostGIS documentation) that need to be cleaned up and some may be PostGIS issues or just predictable oddities when a function is applied to 2 different kinds of geometries.
For example for the CURVEPOLYGON report:
- the total counts for the operators are not the same. The && count for most is about twice that of the other operators. This may mean we are overtesting &&; perhaps it's listed twice in the documentation.
- Another oddity is that ST_Split, ST_ShortestLine, and some other functions that are not supposed to work on CURVEPOLYGONs return an answer 3 out of 29 times. What are these 3 times? On closer inspection of the detail records, it turns out that specialty geometries such as our favorite EMPTY geometry, MULTI NULL, and what we call a SINGLE NULL (hmm, is that a real thing? we should probably just change that to 1 NULL) will work because they are special cases. When you split an empty geometry with anything, you get back an empty geometry, for example.
- Another interesting case is ST_SnapToGrid, which again shouldn't work on CURVEPOLYGONs, yet we have 2 cases where it does. This is because there are about 3 variants of ST_SnapToGrid and one of them snaps to another geometry. If you snap any geometry to NULL, evidently you get a NULL back.
So the moral of the story -- You can learn a lot from a dummy.
Join us next time as we continue our exploration of R and PL/R and demonstrate creating aggregates with PL/R and then later creating graphs and plots.
PL Programming
PL/R Part 2: Functions that take arguments and the power of aggregation Intermediate
In Intro to PL/R and R, we covered how to enable PL/R language in the database and wrote some PL/R functions that rendered plain text reports using the R environment. What makes combining R and PostgreSQL in PL/R most powerful is when you can start writing SQL summary queries that use R functions like any other SQL function.
In this next example, we'll be using PostGIS test runs from tests we autogenerated from the Official PostGIS documentation (Documentation Driven Testing (DDT)) as described in the Garden Test section of the PostGIS Developer wiki. We've also updated some of our logging generator and test patterns so future results may not represent what we demonstrated in the last article.
On a side note: Among the changes in the tests was to introduce more variants of the Empty Geometry now supported by PostGIS 2.0. Our beloved PostGIS 2.0 trunk is at the moment somewhat unstable when working with these new forms of emptiness and stuffing geometries in inappropriate places. At the moment it doesn't survive through the mindless machine gun battery of tests we have mercilessly inflicted. It's been great fun trying to build a better dummy while watching Paul run around patching holes to make the software more dummy proof as the dummy stumbles across questionable but amusing PostGIS use cases not gracefully handled by his new serialization and empty logic.
On yet another side note, it's nice to see that others are doing similar wonderful things with documentation. Check out Euler's comment on catalog tables, where he uses the PostgreSQL SGML documentation to autogenerate PostgreSQL catalog table comments, using OpenJade's OSX to convert the SGML to XML and then XSL, similar to what we did with the PostGIS documentation to autogenerate PostGIS function/type comments and as a platform for our test generator.
For our next exercises we'll be using the power of aggregation to push data into R instead of pg.spi.exec. This will make our functions far more reusable and versatile.
A basic PL/R function
For our first example we'll create a tally report similar to what we did in our Part 1, except the tally report data will be completely defined by how we group data in our SQL queries.
CREATE OR REPLACE FUNCTION tally_report(param_row_label text, param_col_label text,
param_row_data text[], param_col_data text[])
RETURNS text AS
$$
#create the tally table
p_break <- table(param_row_data, param_col_data)
#calculate total row
Total <- margin.table(p_break,2)
#define new table composed of
#our original tally with total row added
p_total <- rbind(p_break,Total)
#rename the headings to be what was passed in
names(dimnames(p_total)) <- list(param_row_label,param_col_label)
#we use capture.output to capture the string representation
#of the data tables as would be shown on screen
# use ftable to prevent wrapping
reportoutput <- sprintf("%s",
paste(capture.output(ftable(p_total)) , sep ="", collapse="\r")
)
return(reportoutput)
$$ language plr;
As you can see above, you can use the named arguments directly in PL/R without any specialness. As a general convention we like to prefix our input arguments with param_. This is just our personal convention and not dictated by PL/R or R. We do this so it doesn't conflict with other variable names we use in the function.
Using our PL/R function in SQL
Now we can use our function by writing an SQL statement:
-- report on only geometries that are not of some variant of empty or null
-- if the test did not complete mark it as F in the timings column
SELECT
tally_report(func, 'timings(ms)',
array_agg(g1),
array_agg(
COALESCE(
CAST(date_part('epoch', age(log_end,log_start))*1000 As text),
'F')
)
) As basic_tally
FROM postgis_garden_log
WHERE func LIKE 'ST_3D%'
AND g1 NOT LIKE '%Empt%'
AND g2 NOT LIKE '%Empt%'
AND g1 NOT LIKE '%NULL%' AND g2 NOT LIKE '%NULL%'
GROUP BY func
ORDER BY func;
Our example uses the array_agg function introduced in PostgreSQL 8.4 so will not work on lower versions of PostgreSQL. The output of our query looks like:
                    timings(ms)   0  15  16    F
ST_3DClosestPoint
CIRCULARSTRING                    0   0   0   27
CURVEPOLYGON                      0   0   0   27
GEOMETRYCOLLECTIONM               0   0   0   27
GEOMETRYCOLLECTIONZ               0   0   0   27
LINESTRING                        6   0   0   21
LINESTRINGM                       5   1   0   21
LINESTRINGZ                       6   0   0   21
MULTILINESTRING                   4   1   1   21
MULTILINESTRINGM                  6   0   0   21
MULTILINESTRINGZ                  3   0   3   21
MULTIPOINT                        8   2   2   15
MULTIPOINTM                      12   0   0   15
MULTIPOINTZ                      11   1   0   15
MULTIPOLYGON                      0   0   0   27
MULTIPOLYGONM                     0   0   0   27
MULTIPOLYGONZ                     0   0   0   27
MULTISURFACE                      0   0   0   27
POINT                            11   0   1   15
POINTM                            8   2   2   15
POINTZ                           10   2   0   15
POLYGON                           0   0   0   27
POLYGONM                          0   0   0   27
POLYGONZ                          0   0   0   27
POLYGONZM                         0   0   0   27
POLYHEDRALSURFACE                 0   0   0   27
TIN                               0   0   0   27
TRIANGLE                          0   0   0   27
Total                            90   9   9  621

                    timings(ms)   0  15  16    F
ST_3DDFullyWithin
CIRCULARSTRING                    0   0   0   27
CURVEPOLYGON                      0   0   0   27
GEOMETRYCOLLECTIONM               0   0   0   27
GEOMETRYCOLLECTIONZ               0   0   0   27
LINESTRING                        6   0   0   21
LINESTRINGM                       5   1   0   21
LINESTRINGZ                       5   1   0   21
MULTILINESTRING                   5   0   1   21
MULTILINESTRINGM                  4   2   0   21
MULTILINESTRINGZ                  6   0   0   21
MULTIPOINT                        9   3   0   15
MULTIPOINTM                      10   2   0   15
MULTIPOINTZ                      12   0   0   15
MULTIPOLYGON                      0   0   0   27
MULTIPOLYGONM                     0   0   0   27
MULTIPOLYGONZ                     0   0   0   27
MULTISURFACE                      0   0   0   27
POINT                            11   1   0   15
POINTM                            9   1   2   15
POINTZ                           10   0   2   15
POLYGON                           0   0   0   27
POLYGONM                          0   0   0   27
POLYGONZ                          0   0   0   27
POLYGONZM                         0   0   0   27
POLYHEDRALSURFACE                 0   0   0   27
TIN                               0   0   0   27
TRIANGLE                          0   0   0   27
Total                            92  11   5  621
 :
 :
We can reuse this same function to give us a tally of the kinds of geometry companions that are supported by these functions with:
-- report on only geometries that are not of some variant of empty or null
-- if the test completes
-- mark the geometries. If fails mark first geometry as row and second as F for failure
SELECT
tally_report(func, 'supported',
array_agg(g1),
array_agg(
CASE WHEN log_end IS NOT NULL THEN g2 ELSE 'F' END
)
) As basic_tally
FROM postgis_garden_log
WHERE func LIKE 'ST_3D%Distance'
AND g1 NOT LIKE '%Empt%'
AND g2 NOT LIKE '%Empt%'
AND g1 NOT LIKE '%NULL%' AND g2 NOT LIKE '%NULL%'
GROUP BY func
ORDER BY func;
supported             F LINESTRING LINESTRINGM LINESTRINGZ MULTILINESTRING MULTILINESTRINGM MULTILINESTRINGZ MULTIPOINT MULTIPOINTM MULTIPOINTZ POINT POINTM POINTZ
ST_3DDistance
CIRCULARSTRING       27   0   0   0   0   0   0   0   0   0   0   0   0
CURVEPOLYGON         27   0   0   0   0   0   0   0   0   0   0   0   0
GEOMETRYCOLLECTIONM  27   0   0   0   0   0   0   0   0   0   0   0   0
GEOMETRYCOLLECTIONZ  27   0   0   0   0   0   0   0   0   0   0   0   0
LINESTRING           21   0   0   0   0   0   0   1   1   1   1   1   1
LINESTRINGM          21   0   0   0   0   0   0   1   1   1   1   1   1
LINESTRINGZ          21   0   0   0   0   0   0   1   1   1   1   1   1
MULTILINESTRING      21   0   0   0   0   0   0   1   1   1   1   1   1
MULTILINESTRINGM     21   0   0   0   0   0   0   1   1   1   1   1   1
MULTILINESTRINGZ     21   0   0   0   0   0   0   1   1   1   1   1   1
MULTIPOINT           15   1   1   1   1   1   1   1   1   1   1   1   1
MULTIPOINTM          15   1   1   1   1   1   1   1   1   1   1   1   1
MULTIPOINTZ          15   1   1   1   1   1   1   1   1   1   1   1   1
MULTIPOLYGON         27   0   0   0   0   0   0   0   0   0   0   0   0
MULTIPOLYGONM        27   0   0   0   0   0   0   0   0   0   0   0   0
MULTIPOLYGONZ        27   0   0   0   0   0   0   0   0   0   0   0   0
MULTISURFACE         27   0   0   0   0   0   0   0   0   0   0   0   0
POINT                15   1   1   1   1   1   1   1   1   1   1   1   1
POINTM               15   1   1   1   1   1   1   1   1   1   1   1   1
POINTZ               15   1   1   1   1   1   1   1   1   1   1   1   1
POLYGON              27   0   0   0   0   0   0   0   0   0   0   0   0
POLYGONM             27   0   0   0   0   0   0   0   0   0   0   0   0
POLYGONZ             27   0   0   0   0   0   0   0   0   0   0   0   0
POLYGONZM            27   0   0   0   0   0   0   0   0   0   0   0   0
POLYHEDRALSURFACE    27   0   0   0   0   0   0   0   0   0   0   0   0
TIN                  27   0   0   0   0   0   0   0   0   0   0   0   0
TRIANGLE             27   0   0   0   0   0   0   0   0   0   0   0   0
Total               621   6   6   6   6   6   6  12  12  12  12  12  12

supported             F LINESTRING LINESTRINGM LINESTRINGZ MULTILINESTRING MULTILINESTRINGM MULTILINESTRINGZ MULTIPOINT MULTIPOINTM MULTIPOINTZ POINT POINTM POINTZ
ST_3DMaxDistance
CIRCULARSTRING       27   0   0   0   0   0   0   0   0   0   0   0   0
CURVEPOLYGON         27   0   0   0   0   0   0   0   0   0   0   0   0
GEOMETRYCOLLECTIONM  27   0   0   0   0   0   0   0   0   0   0   0   0
GEOMETRYCOLLECTIONZ  27   0   0   0   0   0   0   0   0   0   0   0   0
LINESTRING           21   0   0   0   0   0   0   1   1   1   1   1   1
LINESTRINGM          21   0   0   0   0   0   0   1   1   1   1   1   1
LINESTRINGZ          21   0   0   0   0   0   0   1   1   1   1   1   1
MULTILINESTRING      21   0   0   0   0   0   0   1   1   1   1   1   1
MULTILINESTRINGM     21   0   0   0   0   0   0   1   1   1   1   1   1
MULTILINESTRINGZ     21   0   0   0   0   0   0   1   1   1   1   1   1
MULTIPOINT           15   1   1   1   1   1   1   1   1   1   1   1   1
MULTIPOINTM          15   1   1   1   1   1   1   1   1   1   1   1   1
MULTIPOINTZ          15   1   1   1   1   1   1   1   1   1   1   1   1
MULTIPOLYGON         27   0   0   0   0   0   0   0   0   0   0   0   0
MULTIPOLYGONM        27   0   0   0   0   0   0   0   0   0   0   0   0
MULTIPOLYGONZ        27   0   0   0   0   0   0   0   0   0   0   0   0
MULTISURFACE         27   0   0   0   0   0   0   0   0   0   0   0   0
POINT                15   1   1   1   1   1   1   1   1   1   1   1   1
POINTM               15   1   1   1   1   1   1   1   1   1   1   1   1
POINTZ               15   1   1   1   1   1   1   1   1   1   1   1   1
POLYGON              27   0   0   0   0   0   0   0   0   0   0   0   0
POLYGONM             27   0   0   0   0   0   0   0   0   0   0   0   0
POLYGONZ             27   0   0   0   0   0   0   0   0   0   0   0   0
POLYGONZM            27   0   0   0   0   0   0   0   0   0   0   0   0
POLYHEDRALSURFACE    27   0   0   0   0   0   0   0   0   0   0   0   0
TIN                  27   0   0   0   0   0   0   0   0   0   0   0   0
TRIANGLE             27   0   0   0   0   0   0   0   0   0   0   0   0
Total               621   6   6   6   6   6   6  12  12  12  12  12  12
PL Programming
PL/R Part 3: Sharing Functions across PL/R functions with plr_module Intermediate
In Part 2 of PL/R we covered how to build PL/R functions that take arrays and output textual outputs of generated R objects. We then used this in an aggregate SQL query using array_agg. Often when you are building PL/R functions you'll have R functions that you want to reuse many times either inside a single PL/R function or across various PL/R functions.
Unfortunately, you cannot call a PL/R function directly from another PL/R function unless you do it via a pg.spi.exec call. There is, however, another way to embed reusable R code in a PostgreSQL database: in order to share database-stored R code across various PL/R functions, PL/R has a feature called a plr_module. In this tutorial we'll learn how to create and register shareable R functions with plr_module. In the next part of this series we'll start to explore generating graphs with PL/R.
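For completeness, here is a rough sketch of that pg.spi.exec workaround, wrapping the output_garden_stats() function from the previous article; treat the function name as illustrative rather than recommended practice:
CREATE OR REPLACE FUNCTION call_report_from_plr() RETURNS text AS
$$
 # the only way to reuse another PL/R function is to go back through SQL
 res <- pg.spi.exec("SELECT output_garden_stats() AS rpt")
 return(res$rpt[1])
$$
language 'plr';
-- usage: SELECT call_report_from_plr();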
Creating and Using PL/R modules
In R you define a function much like you do data: functions are stored in variables. In fact, PL/R functions are stored as anonymous R functions. In our last examples, we had a snippet of code that would capture what would normally be output to the screen. In this next example, we'll rewrite our function to define that snippet as a plr_module and then reuse it in our PL/R function.
The plr_module model is described in Loading R Modules at Startup.
PL/R modules are named R functions that are loaded at server startup. If you install new modules or change existing ones, you can force a reload by calling the reload_plr_modules() PL/R function. In this next example, we'll define a module function called print_variable that we will then reuse in our functions.
When you install PL/R, it doesn't automatically create a plr_modules table. You do that by running the command below:
CREATE TABLE plr_modules (
modseq int4 primary key,
modsrc text
);
The modseq column defines the order in which plr_modules entries are registered.
Next we register our new function, but before we do, we do a:
SET standard_conforming_strings='on';
So we don't have to worry about backslashing our quotes, etc., in the text.
Then we register our new R function as a module in the plr_modules table.
INSERT INTO plr_modules
VALUES (1, 'postgis_tests.print_variable <-function(param_variable){
print(sprintf("%s",
paste(capture.output(param_variable) , sep ="", collapse="\r") )
)
}');
Next we do a:
SELECT reload_plr_modules();
To force a registration of our new function. Now we are able to use this new function in our PL/R functions. The function below is exactly like the other tally_report we created, except we are also outputting the summary and have encapsulated all our capture output logic in a reusable R function.
CREATE OR REPLACE FUNCTION tally_report2(param_row_label text, param_col_label text,
param_row_data text[], param_col_data text[])
RETURNS text AS
$$
#create the tally table
p_break <- table(param_row_data, param_col_data)
#rename the headings to be what was passed in
names(dimnames(p_break)) <- list(param_row_label,param_col_label)
return (sprintf("%s \r%s",postgis_tests.print_variable(ftable(p_break))
,postgis_tests.print_variable(summary(p_break))) )
$$ language plr;
Then we use it as normal:
-- report on only geometries that are not of some variant of empty or null
-- put in a column F and count those that did not complete
SELECT tally_report2('geometry type', func || ' timings(ms)',
array_agg(g1),
array_agg(
COALESCE(
CAST(date_part('epoch', age(log_end,log_start))*1000 As text),
'F')
)
) As basic_tally
FROM postgis_garden_log
WHERE func LIKE 'ST_3D%'
AND g1 NOT LIKE '%Empt%'
AND g2 NOT LIKE '%Empt%'
AND g1 NOT LIKE '%NULL%' AND g2 NOT LIKE '%NULL%'
GROUP BY func
ORDER BY func LIMIT 1;
Which gives us the below - note that we put in a limit 1, but if we left that out, we'd get a report for each function:
                    ST_3DClosestPoint timings(ms)   0  15  16  31  32    F
geometry type
CIRCULARSTRING                                      0   0   0   0   0   28
CURVEPOLYGON                                        0   0   0   0   0   28
CURVEPOLYGON2                                       0   0   0   0   0   28
GEOMETRYCOLLECTIONM                                 0   0   0   0   0   28
GEOMETRYCOLLECTIONZ                                 0   0   0   0   0   28
LINESTRING                                          7   1   4   0   0   16
LINESTRINGM                                         9   1   2   0   0   16
LINESTRINGZ                                         8   1   3   0   0   16
MULTILINESTRING                                    10   0   2   0   0   16
MULTILINESTRINGM                                    9   2   1   0   0   16
MULTILINESTRINGZ                                    2   5   4   0   1   16
MULTIPOINT                                          7   2   3   0   0   16
MULTIPOINTM                                         7   1   4   0   0   16
MULTIPOINTZ                                         9   2   1   0   0   16
MULTIPOLYGON                                        0   0   0   0   0   28
MULTIPOLYGONM                                       0   0   0   0   0   28
MULTIPOLYGONZ                                       0   0   0   0   0   28
MULTISURFACE                                        0   0   0   0   0   28
POINT                                               8   1   2   1   0   16
POINTM                                             11   0   1   0   0   16
POINTZ                                              8   2   2   0   0   16
POLYGON                                             0   0   0   0   0   28
POLYGONM                                            0   0   0   0   0   28
POLYGONZ                                            0   0   0   0   0   28
POLYGONZM                                           0   0   0   0   0   28
POLYHEDRALSURFACE                                   0   0   0   0   0   28
TIN                                                 0   0   0   0   0   28
TRIANGLE                                            0   0   0   0   0   28

Number of cases in table: 784
Number of factors: 2
Test for independence of all factors:
        Chisq = 346.7, df = 135, p-value = 1.466e-20
        Chi-squared approximation may be incorrect
This tells us there were 784 tests executed for ST_3DClosestPoint (the rows with 28 in the F column mean no test involving that geometry type returned an answer).
Product Showcase
PostgreSQL 9 Admin Cookbook Book Review
I was excited when PostgreSQL 9 Admin Cookbook by Simon Riggs and Hannu Krosing and the companion book PostgreSQL 9 High Performance by Greg Smith became available. All three authors are well known experts in the PostgreSQL community and consultants at 2ndQuadrant, so you are sure to learn a lot from both books. Both books are published by Packt Publishing and can be bought directly from Packt Publishing or via Amazon. Packt is currently running a 50% off sale if you buy both books (e-Book version) directly from Packt. In addition, Packt offers free shipping to the US, UK, Europe, and select Asian countries. The pair of books make attractive companions.
The main thing I felt missing in this duo was a book dedicated to PostgreSQL: The platform that would cover all the various PL languages and the various neat ways PostgreSQL is used and has been extended by many to do things one would not normally expect of a database. Some day perhaps someone will write such a book.
This article is a review about PostgreSQL 9 Admin Cookbook and we'll be following up later with PostgreSQL 9 High Performance.
This is my first book review. I have a lot of patience for writing, but little patience when it comes to reading. That said, I found PostgreSQL 9 Admin Cookbook an easy and enjoyable read, and a book that I managed to learn more tricks from than I care to admit. It is a handy book to have for reference regardless of whether you consider yourself a novice, intermediate, or advanced user.
As the book title suggests, it's a cookbook, but a cookbook that combines a question and answer style with a discussion style of writing. The tasks are neatly categorized into 12 chapters and each task smoothly builds on previous tasks discussed. It is still categorized in such a way that you can jump to a particular task you are currently having problems with without having read the other parts of the book.
Although it is titled PostgreSQL 9 -- it covers earlier versions as well.
This is one of those books I wish I had had early on, or when training some of our customers. A lot of the questions we are commonly asked or have stumbled on ourselves - like how to troubleshoot bad queries, how to tell what my biggest tables are, how to deal with data corruption, etc. - are all succinctly covered in this book.
Just to get a taste of what this book offers:
- Chapter 1: First steps This is mostly a newbie chapter that introduces you to PostgreSQL, guides you through connecting to the database using the command line and PgAdmin, shows some other examples of commonly used GUIs, and covers configuring access control and troubleshooting failed connections. It provides tips both for the Linux and the Windows user.
- Chapter 2: Exploring the database starts to get into what I would call intermediate territory. It covers tasks such as determining where your database files are, how to determine disk space utilization for both the whole database and individual tables, getting a quick estimate of the number of rows for large tables where doing a count would be really slow, and using psql and the various system tables to determine object dependency.
- Chapter 3: Configuration A good chapter not just for PostgreSQL users but, I would say, any database designer. I think this chapter holds useful nuggets for all levels of users. It covers basic considerations when planning your database and, based on your needs, how to configure the key PostgreSQL configuration settings by server, user, and database. It has tricks like how to determine which parameters are at non-default settings. I am ashamed to admit that I learned a lot from this chapter, stuff I really should have known before. My only lame excuse is that PostgreSQL has always worked so well that I never had a need to delve into these nuances.
- Chapter 4: Server Control - A very intermediate chapter slowly getting into more nuances. It covers connection pooling, viciously and not so viciously kicking users off, gracefully shutting down, and using schemas to partition data. Again, a lot of stuff here I didn't know, so I will have to reread it when I find myself in these situations.
- Chapter 5: Tables & Data - This chapter covers general good practices for naming tables and other objects, setting up keys, finding duplicate data, deleting duplicate data, and my favorite pastime, techniques to generate test data. It concludes with loading data from flat files and CSV files and the fundamentals of using the pgloader utility created by Dimitri Fontaine.
- Chapter 6: Security Covers new GRANT management in PostgreSQL 9.0, setting up users, auditing DDL and data changes, integrating with LDAP, using SSL, ensuring secure passwords, and encrypting data with pgcrypto.
- Chapter 7: Database Administration - starts getting a bit into creating functions using plpgsql, plproxy. Running SQL scripts in parallel. Fundamentals of PostgreSQL DDL such as creating schemas, tablespaces and setting new configuration options for those introduced in 9.0. Moving objects to different schemas and tablespaces.
- Chapter 8: Monitoring and Diagnosis - covers how to catch blocking queries, query monitoring. Using pgFouine for log processing and alerting you of issues. The thing I most enjoyed about this chapter were the cut and paste queries utilizing various PostgreSQL system tables to track down issues.
- Chapter 9: Regular Maintenance - this topic is normally a very dry one, at least for me. It covers fundamentals you should know, like backup, recovery testing, reclaiming space, and transaction wraparound (less of an issue with newer versions of PostgreSQL), but that puts a narcoleptic like me into a comatose state. I awoke when reading the topic of carefully removing unwanted indexes. It shows you how to turn indexes off without removing them so you can see if they are really useful. I've really got to try this one out.
- Chapter 10: Performance & Concurrency - My favorite chapter. Can't get enough of this stuff. Just read it. You'll learn a lot regardless of your walk of life.
- Chapter 11: Backup & Recovery - Covers using PostgreSQL-specific tools such as pg_dump as well as other tools such as rman and rsync for doing differential backups.
- Chapter 12: Replication and Upgrades - Covers in a nutshell everything you could ever dream of knowing (well all my limited imagination can dream of) about replication and using replication for high availability and how to do upgrades. It covers replication terminology, the new 9.0 streaming replication, hot standby as well as longtime favorites like Londiste, Slony-2.0 and load balancing with pgpool, and using pg_upgrade.
Product Showcase
Hello LibreOffice, Goodbye OpenOffice
Many people have been concerned with Oracle's stewardship of past Sun Microsystems open source projects - Java, MySQL, and OpenSolaris, to name a few. Why are people concerned? Perhaps the abandoning of projects such as OpenSolaris, the suing of Google over Java infringements, the departure of many frontline contributors to core Open Source projects from Oracle, or the idea of forking over license rights to a single company so they can relicense your code. We have no idea. All we know is that there is an awful lot of forking going on.
In Oracle's defense, many do feel that they have done a good job of advancing some of the Open Source projects they have shepherded - for example, getting MySQL patches in place more quickly. For projects where there is not much of a monetary incentive, many feel they have at best neglected them, e.g. OpenSolaris. Perhaps it's more Oracle's size, and the size that Sun was before the takeover, that has made people take notice that no Open Source project is in stable hands when its ecosystem is predominantly controlled by the whims of one big gorilla.
One new fork we were quite interested to hear about is LibreOffice, which is a fork of OpenOffice. In addition to the fork, there is a new organization called the Document Foundation to cradle the new project. The Document Foundation is backed by many OpenOffice developers and corporate entities (Google, Novell, and Canonical, to name a few). The Document Foundation mission statement is outlined here. There is even a Document Foundation planet for LibreOfficerians to call home.
The LibreOffice starter screen looks similar to the OpenOffice starter screen, except instead of the flashy Oracle logo we have come to love and fear, it has a simple Document Foundation text label below the basic multi-colored LibreOffice title. Much of the same tool set found in OpenOffice is present. The project has not yet forked too far in a user-centric way from its OpenOffice ancestor. The main changes so far are the promise of not having to hand over license assignment rights to a single company, as described in LibreOffice - A fresh page for OpenOffice, as well as some general cleanup and the inclusion of plugins that previously had copyright-assignment issues, such as some from Red Hat and Go-OO. My favorite quote listed in the above article, from André Schnabel, is that it feels like Oracle is "a mother who loves her child but is not aware that her child wants to walk alone." So perhaps Oracle's greatest contribution and legacy to Open Source, and perhaps the biggest that any for-profit company can make for an Open Source project, is to force its offspring to grow feet to walk away.
In later posts we'll test drive LibreOffice with PostgreSQL to see how it compares to its OpenOffice ancestor and what additional surprises it has in store.
Product Showcase
PostgreSQL 9 High Performance Book Review
In a prior article we did a review of PostgreSQL 9 Admin Cookbook, by Simon Riggs and Hannu Krosing. In this article we'll take a look at the companion book PostgreSQL 9 High Performance by Greg Smith.
Both books are published by Packt Publishing and can be bought directly from Packt Publishing or via Amazon. Packt is currently running a 50% off sale if you buy both books (e-Book version) directly from Packt. In addition, Packt offers free shipping to the US, UK, Europe, and select Asian countries.
For starters: the PostgreSQL 9 High Performance book is a more advanced book than the PostgreSQL 9 Admin Cookbook and is more of a sit-down book. At about 450 pages, it's a bit longer than the PostgreSQL 9 Admin Cookbook. Unlike the Admin Cookbook, it is more a concepts book and much less of a cookbook. It's not a book you would pick up if you are new to databases and trying to feel your way through PostgreSQL; however, if you feel comfortable with databases in general, not necessarily PostgreSQL specifically, and are trying to eke out the most performance you can, it's a handy book. What surprised me most about this book was how much of it is not specific to PostgreSQL, but is in fact about hardware considerations that are pertinent to most relational databases. In fact, Greg Smith starts the book off with a fairly shocking statement in the section entitled PostgreSQL or another database?: "There are certainly situations where other database solutions will perform better." Those are words you will rarely hear from die-hard PostgreSQL users, bent on defending their database of choice against all criticism and framing PostgreSQL as the tool that will solve famine, bring world peace, and cure cancer if only everyone would stop using that other thing and use PostgreSQL instead :). That, in my mind, made this book a more trustworthy reference if you came from some other DBMS and wanted to know if PostgreSQL could meet your needs comparably or better than what you were using before.
In a nutshell, if I were to compare and contrast the PostgreSQL 9 Admin Cookbook vs. PostgreSQL 9 High Performance, I would say the Cookbook is a much lighter read focused on getting familiar with and getting the most out of the software (PostgreSQL), while PostgreSQL 9 High Performance is focused on getting the most out of your hardware and pushing your hardware to its limits to work with PostgreSQL. There is very little overlap of content between the two, and as you take on more sophisticated projects, you'll definitely want both books on your shelf. The PostgreSQL 9 High Performance book isn't going to teach you too much about writing better queries, day-to-day management, or how to load data, but it will tell you how to determine when your database is under stress or your hardware is about to kick the bucket, and what is causing that stress. It's definitely a book you want to have if you plan to run large PostgreSQL databases or a high-traffic site with PostgreSQL.
PostgreSQL 9 High Performance is roughly 25% hardware and how to choose the best hardware for your budget, 40% in-depth details about how PostgreSQL works with your hardware and the trade-offs made by PostgreSQL developers to get a healthy balance of performance vs. reliability, and another 35% about various useful monitoring tools for PostgreSQL performance and general hardware performance. Its focus is mostly on Linux/Unix, which is not surprising since most production PostgreSQL installs are on Linux/Unix. That said, there is some coverage of Windows, such as a FAT32/NTFS discussion, considerations when deploying terabyte-size databases on Windows, and issues with shared memory on Windows.
Full disclosure: I got a free e-Book copy of this book just as I did with PostgreSQL 9 Admin Cookbook.
Quality of Book Print
One person yelled at me on reddit for not going into details about the quality of the print and so forth. So here is my general biased comment on it. I have just the e-Book, so I can't comment on the hard copy aside from what I know about the Packt books I do have hard copies of. As far as e-Book quality goes, I can only speak for the PDF version. What I can say about it is that it fits my requirements. I can copy and paste code from the book, and the index in the back and the table of contents are clickable and take me to any section when I click on the page number link. The text is fairly easy to read. It's not very colorful though; I think there are only about 3 or 4 color graphs in this book. Some colorful graphical explain plans would have helped A LOT. In short, it needs more color :).
Andrew Dunstan also did a review of PostgreSQL 9 High Performance where he complained about the code wrapping and the explain plans being difficult to read because of the wrapping. I personally find text explain plans difficult to read, period. Sure, things could have been wrapped a bit better, or you could have YAML'd the damn thing, but it wouldn't have helped me much. What I really wanted to see was graphical explain plans alongside the text plans. As strange as it sounds, perhaps because I come from a SQL Server background, I use the graphical explain plan as a roadmap into my information-overloaded textual explain plan. Without that crutch, I feel a bit naked. The other snippets of code, like the SQL examples, I did find well-indented and easy to read, so I have no idea what Andrew was whining about :).
Overview of chapter coverage
I liked David Christensen's review quote about the key theme of this book: "Measure, don't guess." I think that quote sums this book up quite nicely. Here is my slightly more detailed account. I'm not going to bore you with the gory details of each chapter as was suggested on reddit, but will try to summarize the flavor of each chapter. If you don't like that, tough; go write your own review :).
- Chapter 1: PostgreSQL Versions Although the book is called PostgreSQL 9, it really covers PostgreSQL 8.2-9.0, goes through a history of the major enhancements in each release that have improved performance, and dares to ask the question: is PostgreSQL really right for your needs? It will give you a good idea, if you are using version X of PostgreSQL, of whether there is a compelling reason for you to upgrade, and, if you have chosen not to use PostgreSQL in the past, of whether the issue that stopped you has been remedied.
- Chapter 2: Database Hardware This chapter focuses on selecting the best hardware to run PostgreSQL and, in terms of pricing, which hardware features are the biggest bang for the buck. Although it phrases the selection in the context of PostgreSQL, I would say this is a chapter that is almost as applicable to any relational database today, particularly any you would run on a Unix system. It talks about choosing the right RAID configurations, SANs vs. NASes, disk controllers, onboard RAM, solid state drives (SSDs), etc. and how they play a role in query performance and the general health of your PostgreSQL database server. All those things generally bore me to death, but I know they are important and are something I would quickly pass off to Leo (the one that is suspicious of any server he didn't put together with his own bare hands) and say "Make sure you know this stuff".
- Chapter 3: Database Hardware Benchmarking Again, another chapter not too specific to PostgreSQL. It describes a ton of Unix utilities that would make performance fanatics giddy, things like sysbench, bonnie, memtest, hdtune, etc., and various techniques for doing memory tests along with PostgreSQL test queries.
- Chapter 4: Disk Setup Again, stuff I would consider not specific to PostgreSQL. It talks about how you should partition your data and why it's generally a bad idea to put your data on an operating system partition. It talks about various Linux file system types like ext2, ext3, ext4, and XFS, the Solaris and FreeBSD options (UFS, ZFS), the pros and cons of each, file system limits, etc. It does get into PostgreSQL-specific areas such as how to set up your temp tablespaces.
- Chapter 5: Memory for Database Caching This chapter is very PostgreSQL focused and discusses how PostgreSQL does caching and how the various parameters in postgresql.conf affect its caching behavior. It also goes into detail about how caching and the parameters have changed from PostgreSQL 8.2 to present. It's definitely a chapter you want to read if you are upgrading or thinking of upgrading your PostgreSQL server.
- Chapter 6: Server Configuration Tuning This is an extension of the Chapter 5 topics and covers every key parameter in postgresql.conf, issues to watch out for with connection pooling when running on Windows that are pretty much non-issues on Linux, and performance settings that you can set at the client level.
- Chapter 7: Routine Maintenance This is a chapter I would consider an important read for both newbies and long-time PostgreSQL users. It goes into detail about techniques and tools for monitoring the performance of queries, gotchas and advantages of how PostgreSQL implements MVCC, auto_explain, and log file analysis tools such as pgFouine.
- Chapter 8: Database Benchmarking Mostly focused on using pgbench, setting up custom scripts, and analyzing pgbench results.
- Chapter 9: Database Indexing Covers the various kinds of indexes, how to determine the size of an index relative to its table size so you can do a better cost/benefit analysis of creating the index, measuring the speed of index creation, concepts such as clustering and fill factor, and benchmarking how an index improves query performance (a sample index-vs-table size query appears after this chapter list).
- Chapter 10: Query Optimization This particular chapter WILL help you write better queries. It covers various tools for analyzing queries, how to read explain plans, the new XML/YAML/JSON explain output formats introduced in PostgreSQL 9.0, and visual explain plans, as well as the characteristics of the different join strategies employed by PostgreSQL and the PostgreSQL settings that affect query performance. It uses the pagila and Dell Store 2 databases for most of the exercises.
- Chapter 11: Database Activity and Statistics This covers the various built-in views in PostgreSQL for checking on statistics and some sample queries covering everything from reading table I/O, background writer stats to disk usage and table locks.
- Chapter 12: Monitoring and Trending This covers various toolkits (mostly Linux) for monitoring both the server in general and PostgreSQL behavior specifically over time. It doesn't go into too much detail about how to use any of them, but it does tell you where to find more about each and the pros and cons of each. Items covered are the Bucardo check_postgres tool, Staplr, Cacti, etc.
- Chapter 13: Pooling and Caching Covers using pgpool-II, pgBouncer, memcached, and pgmemcache.
- Chapter 14: Scaling and Replication Covers various replication strategies and the pros and cons of each: Hot Standby (introduced in PostgreSQL 9.0), Londiste, Slony, Bucardo, and the replication features of pgpool-II.
- Chapter 15: Partitioning Data Covers how to partition data in PostgreSQL using inheritance, how to arrive at optimal sizes for your partitions, list partitioning, range partitioning and redirecting inserts with partition rules vs. dynamic triggers. It also provides a brief overview of horizontal partitioning and sharding with PL/Proxy and GridSQL.
- Chapter 16: Avoiding Problems This covers common mistakes people make when bulk loading data, cautions when backing up with a newer pg_restore than the database version, trigger memory usage, avoiding running out of memory, etc.
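For the index sizing mentioned under Chapter 9, this is the kind of quick query I have in mind. It's a minimal sketch of our own, not taken from the book, comparing each index to the table it belongs to with pg_relation_size:

-- list indexes next to the size of their parent tables, biggest indexes first
SELECT c.relname AS table_name,
       i.relname AS index_name,
       pg_size_pretty(pg_relation_size(c.oid)) AS table_size,
       pg_size_pretty(pg_relation_size(i.oid)) AS index_size
FROM pg_index x
JOIN pg_class c ON c.oid = x.indrelid
JOIN pg_class i ON i.oid = x.indexrelid
WHERE c.relkind = 'r'
ORDER BY pg_relation_size(i.oid) DESC;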
Special Feature
PostgreSQL 9.0 pg_dump, pg_dumpall, pg_restore cheatsheet overview Beginner
Backup and Restore is probably the most important thing to know how to do when you have a database with data you care about.
The utilities in PostgreSQL that accomplish these tasks are pg_dump, pg_restore, and pg_dumpall, plus psql for restoring plain-text dumps.
A lot of the switches used by pg_dump, pg_restore, pg_dumpall are common to all three. You use pg_dump to do backups of a single database or select database objects and pg_restore to restore it either to another database or to recover portions of a database. You use pg_dumpall to dump all your databases in plain text format.
Rather than trying to keep track of which switch works with which, we decided to combine all into a single cheat sheet with a column denoting which utility the switch is supported in. Pretty much all the text is compiled from the --help switch of each.
We created a similar Backup and Restore cheat sheet for PostgreSQL 8.3, and since then some new features have been added, such as the parallel restore --jobs feature in 8.4. We have now created an updated sheet covering all the features present in the pg_dump, pg_restore, and pg_dumpall command line utilities packaged with PostgreSQL 9.0. A few typical invocations are shown below.
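To give a sense of how the three utilities divide the work, here are a few typical invocations. This is only a sketch; the host, user, database, and file names are all made up:

# compressed custom-format dump of a single database
pg_dump -h localhost -U postgres -Fc -f mydb.dump mydb

# restore it into another (already created) database, using 4 parallel jobs (the 8.4+ --jobs feature)
pg_restore -h localhost -U postgres -d mydb_copy -j 4 mydb.dump

# plain-text dump of the whole cluster, roles and all
pg_dumpall -h localhost -U postgres > all_databases.sql

# plain-text dumps are restored with psql rather than pg_restore
psql -h localhost -U postgres -f all_databases.sql postgres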
A PDF portrait version (8.5 x 11") of this cheat sheet is available at PostgreSQL 9.0 Dump Restore 8.5 x 11, and it is also available in PDF A4 format and HTML.
As usual, please let us know if you find any errors or omissions and we'll be happy to correct them.