In this article we'll give a summary of what PL/R is and how to get up and running with it. Since we don't like repeating ourselves,
we'll refer you to an article we wrote a while ago that is still fairly relevant today, Up and Running with PL/R (PLR) in PostgreSQL: An almost Idiot's Guide,
and just fill in the parts that have changed. We should note that that particular series was geared more toward the spatial database programmer (PostGIS in particular). There is a lot of overlap between the PL/R, R, and PostGIS user base, which includes
many environmental scientists and researchers in need of powerful charting and stats tools to analyze their data, who sit high on the smart but low on the money end of the human spectrum.
This series will take more of a general PL/R user perspective, following the same style we used in Quick Intro to PL/Python. We'll end the series with a PL/R cheatsheet similar to the one we made for PL/Python.
As stated in our State of PostGIS article, we'll be using log files we generated from our PostGIS stress tests. These stress tests were auto-generated from the PostGIS official documentation.
The raster tests comprise 2,095 query executions exercising all the pixel types supported. The geometry/geography tests comprise 65,892 spatial SQL queries exercising every geometry/geography type supported in PostGIS 2.0 -- yes, this includes TINs, Triangles, Polyhedral Surfaces, Curved geometries, and all dimensions of them.
Most queries are unique. If you are curious to see what these log tables look like or want to follow along with these exercises, you can download the tables from here.
What is R and PL/R and why should you care?
R is both a language and an environment for doing statistics and generating graphs and plots. It is GNU-licensed and a common favorite of universities and research institutions. PL/R is a procedural language for PostgreSQL that allows you to write database stored functions
in R. R is a set-based, domain-specific language similar to SQL, except that unlike a relational database, it thinks of data as matrices, lists, and vectors. I tend to think of it as a cross between LISP and SQL, though more experienced Lisp and R users will probably disagree with me on that. This makes it easier in many cases to tabulate data both across columns and across rows.
The examples we will show in these exercises could be done in SQL, but they are much more succinct to write in R. In addition to the language itself, there is a whole wealth of statistical and graphing functions available in R that you will not
find in any relational database, and these functions keep growing as more people contribute packages. Its packaging system, the Comprehensive R Archive Network (CRAN), is similar in concept to Perl's CPAN and the in-the-works PGXN for PostgreSQL.
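To give a tiny taste of that set-based flavor, here is a short R snippet (the data is made up purely for illustration) that does in a couple of vectorized calls what would take a GROUP BY query in SQL:

```r
# made-up durations (seconds) and a group label for six test runs
dur <- c(0.01, 0.25, 0.02, 0.30, 0.015, 0.05)
grp <- c("geom", "rast", "geom", "rast", "geom", "rast")

dur * 1000              # vectorized: converts every element to milliseconds at once
mean(dur)               # aggregate over the whole vector
tapply(dur, grp, mean)  # mean duration per group, akin to GROUP BY grp in SQL
```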
What do you need before you can use PL/R?
- PostgreSQL and the latest version of PL/R (at this time plr-8.3.0.11, which works on PostgreSQL 8.3-9.0)
- R installed on your PostgreSQL server. Note that there are both 32-bit and 64-bit binaries available for Windows as well as Mac OS X and several variants of Linux. You can get these from CRAN - http://cran.r-project.org/
- The PL/R handler for PostgreSQL. For Windows (both 32-bit and 64-bit), Fedora, and Debian there are installers available, which you can find at http://www.joeconway.com/plr/. For others you'll have to
compile from source.
NOTE: the Windows binaries were built against R 2.11.1, so the installer won't work with the newer R 2.12 or with older releases. You can still use the binaries with an R version below or above 2.11.1 if you set the environment variables as we described in Up and Running with PL/R (PLR) in PostgreSQL: An almost Idiot's Guide.
For Windows R 2.12 -- the path you need to add to your PATH environment variable has changed a little. It's now the path to your R install\bin\i386 or \bin\x64 folder. So for example,
if I am on 32-bit Windows I would add C:\Program Files\R\R-2.12.0\bin\i386 to my PATH, and my R_HOME would be C:\Program Files\R\R-2.12.0.
Even if you are on 64-bit Windows, you will need to use the 32-bit package if you are using 32-bit PostgreSQL, or if you want to use R packages that are not currently supported in 64-bit Windows R.
Then run the plr.sql file, which gets installed in share/contrib, in your PostgreSQL database to enable the language in that database.
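Assuming a typical install, enabling and then sanity-checking the language from psql might look like the below (the path is hypothetical; use your actual share/contrib location):

```sql
-- run the install script in the target database (adjust the path for your install)
\i /usr/share/postgresql/9.0/contrib/plr.sql

-- verify the language got registered
SELECT lanname FROM pg_language WHERE lanname = 'plr';
```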
Our first PL/R stored function: creating data frames and saving to RData file
In R there are 5 basic kinds of data structures: scalars, vectors, arrays, data frames, and lists. Lists are the most complex of all because they are collections of the other 4 data structures. You can think of the output of a PostgreSQL
query on a table as a data frame. In this simple example, we'll demonstrate how to feed a PostgreSQL query into an R
data frame and save it to an R data file.
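Before the stored function itself, here's a quick sketch of those structures for readers new to R (all names and values are arbitrary):

```r
x  <- 42                                      # a "scalar" is really a length-1 vector
v  <- c(1, 2, 3)                              # vector
m  <- matrix(1:6, nrow = 2)                   # a 2-dimensional array
df <- data.frame(id = 1:3,                    # data frame: like a query result
                 g1 = c("POINT", "LINESTRING", "POLYGON"))
l  <- list(answer = x, values = v, tbl = df)  # list: can hold all of the above
```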
CREATE OR REPLACE FUNCTION save_postgis_logs_rdata() RETURNS text AS
$$
pggeom20log <<- pg.spi.exec("SELECT logid ,log_label,func,g1,g2,log_start,log_end,
date_part('epoch',age(log_end, log_start)) As dur_sec
FROM postgis_garden_log ORDER BY log_start")
pgrast20log <<- pg.spi.exec("SELECT logid ,log_label,func,g1,g2,log_start,log_end,
date_part('epoch',age(log_end, log_start)) As dur_sec
FROM raster_garden_log ORDER BY log_start")
save(pggeom20log,pgrast20log, file="/temp/postgisstats20.RData")
return("done")
$$
language 'plr';
The most common function used in PL/R is the pg.spi.exec routine, which takes an SQL statement as input and returns the result as an R data frame.
In the example above, we dump our geometry/geography logging test stats and our raster stats from a PostGIS 2.0 test run into
2 R data frames. In addition, we save these data frames to a custom R data file. An R data file is a file that stores R objects in native R format.
It's useful for saving datasets you are working on for easy loading back later.
Using R interactively
R is both a scriptable and an interactive environment, and one you can also automate a good bit of with PL/R. Before automating things, I like to
test processes out interactively in R. For this next example, we'll load the data we saved using PL/R back into R and play with it.
To do so, launch the R GUI or the R command line and at the prompt type:
load("/temp/postgisstats20.RData")
ls()
# the ls output will return a listing of the variables we have loaded
[1] "pggeom20log" "pgrast20log"
summary(pgrast20log)
# yields
logid log_label func g1
Min. : 1.0 Length:2095 Length:2095 Length:2095
1st Qu.: 524.5 Class :character Class :character Class :character
Median :1048.0 Mode :character Mode :character Mode :character
Mean :1048.0
3rd Qu.:1571.5
Max. :2095.0
g2 log_start log_end dur_sec
Length:2095 Length:2095 Length:2095 Min. :0.000000
Class :character Class :character Class :character 1st Qu.:0.000000
Mode :character Mode :character Mode :character Median :0.000000
Mean :0.008964
3rd Qu.:0.016000
Max. :0.187000
So we know from the above summary that we have 2,095 observations, plus basic stats about the duration of each process in our raster log tests. We ran the above test on a fairly new, beefed-up Windows 7 box and an older, clunkier Windows XP box, both 32-bit,
running PostgreSQL 9.0 / PostGIS 2.0. The Windows XP in general followed the same pattern as the Windows 7, except timings were generally 50-80% worse.
One of my favorite functions in R is the table
function, because it's always present and fairly fast and easy to use. To get details on the function, run this command:
help(table)
Basically, what it does is provide a cross tab of counts. Here is an example use where we take our raster log and tabulate the time durations, rounding each duration to 2 decimals so that numbers such as 0.11, 0.112, etc. collapse to 0.11:
table(pgrast20log$g1, round(pgrast20log$dur_sec, digits=2))
If you have a lot of variables coming from the same table, or want to run a lot of R steps on the same table, you may want to write something like the below,
which is kind of equivalent to a SELECT a, b FROM sometable; as opposed to our example above, which is more like a SELECT sometable.a, sometable.b FROM sometable;. I generally like to be
more explicit, especially when I have a lot of R variables floating around.
attach(pgrast20log)
table(g1, round(dur_sec, digits=2))
detach(pgrast20log)
From this excerpt, we can tell that 64-bit floating point (64BF) raster bands have the worst performance in general of any of the raster types.
To get a true estimate we would run our simulation on different platforms, and many times on the same machine as well.
We did run this a couple of times on our 32-bit Windows XP and Windows 7, and 64BF is consistently slightly worse than the others. Our Windows 7 is about twice as fast in general, but interestingly enough, on the Windows 7 we see some suspicious outliers for
16BUI and 1BB. We think this is just turbulence, since on other runs 64BF almost always shows as the longest running.
0 0.02 0.03 0.05 0.06 0.08 0.09 0.11 0.16 0.19
16BSI 119 52 12 7 0 0 0 0 0 0
16BUI 122 48 18 1 0 0 0 0 1 0
1BB 121 57 8 1 1 1 0 0 0 1
2BUI 132 52 6 0 0 0 0 0 0 0
32BF 114 39 13 15 6 3 0 0 0 0
32BSI 115 41 10 17 6 0 1 0 0 0
32BUI 116 39 14 19 1 0 1 0 0 0
4BUI 130 53 7 0 0 0 0 0 0 0
64BF 113 41 1 2 10 12 10 1 0 0
8BSI 128 59 3 0 0 0 0 0 0 0
8BUI 124 59 5 2 0 0 0 0 0 0
Now we do the same exercise, except flipping to see the culprit functions. Here we just want to know, for each function, whether its duration is above the 3rd quartile of all our test durations.
Though this query can be done in SQL, it's much more succinct in R. Note the 2 spellings of ST_SetValue:
yes, our machine-gun test has spotted an inconsistency in our PostGIS documentation. Shame on us.
table(pgrast20log$func, pgrast20log$dur_sec > summary(pgrast20log$dur_sec)["3rd Qu."])
FALSE TRUE
&& 120 1
&< 121 0
&> 121 0
AddRasterColumn 3 8
DropRasterTable 11 0
insert data 10 1
PostGIS_Raster_Lib_Build_Date 1 0
PostGIS_Raster_Lib_Version 1 0
ST_AddBand 13 53
ST_AsBinary 6 5
ST_BandHasNoDataValue 22 0
ST_BandMetaData 7 15
ST_BandNoDataValue 22 0
ST_BandPath 22 0
ST_BandPixelType 22 0
ST_Box2D 11 0
ST_ConvexHull 11 0
ST_DumpAsPolygons 21 1
ST_Envelope 11 0
ST_GeoReference 22 0
ST_Height 11 0
ST_Intersection 220 0
ST_Intersects 438 2
ST_MakeEmptyRaster 14 0
ST_MetaData 11 0
ST_NumBands 11 0
ST_PixelAsPolygon 18 4
ST_PixelSizeX 11 0
ST_PixelSizeY 11 0
ST_Polygon 22 0
ST_Raster2WorldCoordX 22 0
ST_Raster2WorldCoordY 22 0
ST_SetBandHasNoDataValue 13 9
ST_SetBandNoDataValue 11 11
ST_SetGeoReference 11 11
ST_SetPixelSize 13 9
ST_SetSkew 11 11
ST_SetSRID 6 5
ST_SetUpperLeft 6 5
ST_SetValue 41 36
ST_SEtValue 31 24
ST_SkewX 11 0
ST_SkewY 11 0
ST_SRID 11 0
ST_UpperLeftX 11 0
ST_UpperLeftY 11 0
ST_Value 131 1
ST_Width 7 4
ST_World2RasterCoordX 77 0
ST_World2RasterCoordY 77 0
Second PL/R stored function: creating a plain text report
When you are writing PL/R code, you are pretty much just writing R code and living in the R environment. However, there are certain functions in R that you may come to prefer when working in PL/R over the ones you'd reach for in interactive R.
In PL/R you want data to be output in variables so you can easily return them, and PostgreSQL also doesn't provide a whole lot of mappings for the objects you get back from R. For our next example, we want a simple text representation
of the cross tabs we created, and we want it to be returnable from an SQL statement. We are going to lean on our favorite R functions for doing this; others more experienced may have better ways.
- capture.output - an R function that, instead of sending the output of a variable or command to the screen, returns to a variable what would normally be printed to the viewer's screen. So it will give us the same
output as we have above. The only caveat is that it returns a string array where each element is a separate line on the screen. This works kind of like the sink function in R, except it doesn't have the messiness of needing
a file to redirect screen output to.
- paste - we commonly use this to collapse output such as arrays. It has many uses, but here we are using it as a reverse split function, kind of like implode in PHP.
- sprintf - does pretty much the same thing as it does in C, allowing you to format text with placeholders for variables.
- subset - the subset function in R is akin to a SELECT ... FROM ... WHERE in SQL, except | means OR, & means AND, and == means =. The c() function in R is a way of creating vectors.
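To see how these play together before burying them in a stored function, here is a toy interactive R session (the data frame is made up to mimic our log table):

```r
# toy stand-in for our raster log
df <- data.frame(func    = c("ST_Value", "ST_AddBand", "ST_Value"),
                 dur_sec = c(0.01, 0.12, 0.02))

# subset: like SELECT func, dur_sec FROM df WHERE dur_sec > 0.01
slow <- subset(df, dur_sec > 0.01, select = c(func, dur_sec))

# capture.output: grab what printing 'slow' would show, one string per screen line
out_lines <- capture.output(slow)

# paste(..., collapse=...) glues the lines into one string; sprintf adds a header
report <- sprintf("Functions over 0.01 sec:\r%s", paste(out_lines, collapse = "\r"))
```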
So now we show our reporting function, which exercises all the above R functions.
CREATE OR REPLACE FUNCTION output_garden_stats() RETURNS text AS
$$
pggeom20log <<- pg.spi.exec("SELECT logid ,log_label,func,g1, g2, log_start,log_end,
date_part('epoch',age(log_end, log_start)) As dur_sec
FROM postgis_garden_log ORDER BY log_start")
pgrast20log <<- pg.spi.exec("SELECT logid ,log_label,func,g1,g2,log_start,log_end,
date_part('epoch',age(log_end, log_start)) As dur_sec
FROM raster_garden_log ORDER BY log_start")
attach(pgrast20log)
prsum.breakq <- summary(dur_sec)["3rd Qu."]
prsum <- table(func, dur_sec > prsum.breakq)
prdur <- table(g1, round(dur_sec,2))
detach(pgrast20log)
attach(pggeom20log)
pggdur <- table(g1, round(dur_sec,1),useNA="ifany")
pgfuncdur <- table(func, round(dur_sec,1),useNA="ifany")
detach(pggeom20log)
pgpoly<-subset(pggeom20log, g1 == 'CURVEPOLYGON' | g2 == 'CURVEPOLYGON', select=c(func, dur_sec))
pgpolydur <- table(pgpoly$func, !is.na(pgpoly$dur_sec))
reportoutput <- sprintf("RASTER results is duration greater than %s secs %s \r\r RASTER results duration breakout \r %s
GEOMETRY GEOGRAPHY by type %s \r GEOMETRY/GEOGRAPHY by function duration %s \r\r CURVEPOLYGON failure = FALSE, completion = TRUE by function %s", prsum.breakq,
paste(capture.output(prsum), sep ="", collapse="\r"), paste(capture.output(prdur), sep ="", collapse="\r"),
paste(capture.output(pggdur), sep ="", collapse="\r"), paste(capture.output(pgfuncdur), sep ="", collapse="\r")
, paste(capture.output(pgpolydur), sep ="", collapse="\r")
)
return(reportoutput)
$$
language 'plr';
To run our report, we run this command: SELECT output_garden_stats();
The report output can be seen at http://www.postgresonline.com/demos/plr/postgis20report.txt
There are a lot of surprising things we discovered simply by looking at counts. Some might be things in the testing scripts (or even the PostGIS documentation) that need to be cleaned up, and some may be PostGIS issues or just predictable oddities when a function is applied to 2 different kinds of geometries.
For example for the CURVEPOLYGON report:
- The total counts for the operators are not the same. The count for && is about twice that of most of the other operators. This may mean we are overtesting &&; perhaps it's listed twice in the documentation.
- Another oddity is that the ST_Split function, ST_ShortestLine,
and some other functions that are not supposed to work on CURVEPOLYGONs return an answer 3 out of 29 times. What are these 3 times? On closer inspection of the detail records, it turns out that specialty geometries such as our favorite EMPTY geometry, MULTI NULL, and what we call a SINGLE NULL (hmm, is that a real thing? we should probably just change that to 1 NULL)
will work because they are special cases. When you split an empty geometry with anything, you get back an empty geometry, for example.
- Another interesting case is ST_SnapToGrid, which again shouldn't work on CURVEPOLYGONs. Yet we have 2 cases where it does. This is because there are about 3 variants of ST_SnapToGrid, and one snaps to another geometry.
If you snap any geometry to NULL, evidently you get a NULL back.
So the moral of the story -- You can learn a lot from a dummy.
Join us next time as we continue our exploration of R and PL/R and demonstrate creating aggregates with PL/R and then later creating graphs and plots.