In this article we'll provide a summary of what PL/R is and how to get running with it. Since we don't like repeating ourselves, we'll refer you to an article we wrote a while ago which is still fairly relevant today called Up and Running with PL/R (PLR) in PostgreSQL: An almost Idiot's Guide and just fill in the parts that have changed. We should note that particular series was more geared toward the spatial database programmer (PostGIS in particular). There is a lot of overlap between the PL/R, R, and PostGIS user-base which is comprised of many environmental scientists and researchers in need of powerful charting and stats tools to analyse their data who are high on the smart but low on the money human spectrum.
This series will be more of a general PL/R user perspective. We'll follow more of the same style we did with Quick Intro to PL/Python. We'll end our series with a PL/R cheatsheet similar to what we had for PL/Python.
As stated in our State of PostGIS article, we'll be using log files we generated from our PostGIS stress tests. These stress tests were auto-generated from the PostGIS official documentation. The raster tests are comprised of 2,095 query executions exercising all the pixel types supported. The geometry/geograpy tests are comprised of 65,892 spatial SQL queries exercising every PostGIS geometry/geography supported in PostGIS 2.0 -- yes this includes TINS, Triangles,Polyhedral Surfaces, Curved geometries and all dimensions of them. Most queries are unique. If you are curious to see what these log tables look like or want to follow along with these exercises, you can download the tables from here.
R is both a language and an environment for doing statistics and generating graphs and plots. It is GNU-licensed and a common favorite of Universities and Research institutions. PL/R is a procedural language for PostgreSQL that allows you to write database stored functions in R. R is a set-based and domain specific language similar to SQL except unlike the way relational databases treat data, it thinks of data as matrices, lists and vectors. I tend to think of it as a cross between LISP and SQL though more experienced Lisp and R users will probably disagree with me on that. This makes it easier in many cases to tabulate data both across columns as well as across rows. The examples we will show in these exercises, could be done in SQL, but they are much more succinct to write in R. In addition to the language itself, there are a whole wealth of statistical and graphing functions available in R that you will not find in any relational database. These functions are growing as more people contribute packages. Its packaging system called Comprehensive R Archive (CRAN) is similar in concept to Perl's CPAN and the in the works PGXN for PostgreSQL.
NOTE: the windows binaries were built with R 2.11.1 and installer won't work with the newer or older R 2.12, but you can use the binaries if you are using R below 2.11.1 or above R.2.11.1 if you set the environment variables as we described in Up and Running with PL/R (PLR) in PostgreSQL: An almost Idiot's Guide. For Windows R 2.12 -- the path variable you need to add to your path environment variable has changed a little. It's now path to R install\bin\i386 or R\bin\x64. So for example if I am on 32-bit Windows I would add to my environment path variable -- C:\Program Files\R\R-2.12.0\bin\i386 and my R_HOME would be C:\Program Files\R\R-2.12.0. Even if you are on 64-bit windows, you will need to use the 32-bit package if you are using 32-bit PostgreSQL or you want to use R packages that are not currently supported in 64-bit Windows R.
Then you run the plr.sql file which gets installed in share/contrib in your PostgreSQL database to enable the language in that database.
In R there are 5 basic kinds of data structures: scalars, vectors, arrays, dataframes, and lists. Lists are the most complex of all because they are collections of the other 4 data structures. You can think of a PostgreSQL query output of table as a data frame. In this simple example, we'll demonstrate how to feed a PostgreSQL query into an R dataframe and save it in an R data file.
CREATE OR REPLACE FUNCTION save_postgis_logs_rdata() RETURNS text AS
$$
pggeom20log <<- pg.spi.exec("SELECT logid ,log_label,func,g1,g2,log_start,log_end,
date_part('epoch',age(log_end, log_start)) As dur_sec
FROM postgis_garden_log ORDER BY log_start")
pgrast20log <<- pg.spi.exec("SELECT logid ,log_label,func,g1,g2,log_start,log_end,
date_part('epoch',age(log_end, log_start)) As dur_sec
FROM raster_garden_log ORDER BY log_start")
save(pggeom20log,pgrast20log, file="/temp/postgisstats20.RData")
return("done")
$$
language 'plr';
The most common function used in PL/R is the pg.spi.exec routine which takes as input SQL statements and converts them to R dataframes. In the example above, we are dumping our geometry/geography logging test stats and our raster stats from a PostGIS 2.0 test run and saving them to 2 R DataFrames. In addition we are saving these data frames to a custom R Data file. An R Data file is a file that stores R objects in native R format. It's useful for saving datasets you are working on for easy load back.
R is both a scriptable and an interactive environment and one that you can also automate a good bit of with PL/R. Before automating things, I like to test processes out interactively in R. For this next example, we'll load back the data we saved using PL/R into R and play with it.
To do so, launch R-GUI or R command line and at the prompt type:
load("/temp/postgisstats20.RData")
ls()
-- the ls output will return listing of variables we have loaded
[1] "pggeom20log" "pgrast20log"
summary(pgrast20log)
-- yields logid log_label func g1 Min. : 1.0 Length:2095 Length:2095 Length:2095 1st Qu.: 524.5 Class :character Class :character Class :character Median :1048.0 Mode :character Mode :character Mode :character Mean :1048.0 3rd Qu.:1571.5 Max. :2095.0 g2 log_start log_end dur_sec Length:2095 Length:2095 Length:2095 Min. :0.000000 Class :character Class :character Class :character 1st Qu.:0.000000 Mode :character Mode :character Mode :character Median :0.000000 Mean :0.008964 3rd Qu.:0.016000 Max. :0.187000
So we know from the above summary that we have 2095 observations and basic stats about the average duration of each process in our raster log tests. We ran the above test on a fairly new beefed up Winodw 7 and an older clunkier Windows Xp. Both 32-bit running PostgreSQL 9.0 / PostGIS 2.0. The windows xp in general followed the same pattern as the windows 7 except timings were generally 50-80% worse.
One of my favorite functions in R is the table
function because it's always present and fairly fast and easy to use. To get details on the function run this command:
help(table)
Basically what it does is provide a cross tab of counts. Here is an example use where we take our raster log and output the time durations but each time duration we round to 2 decimals so that our numbers in 0.11, 0.112 etc. collapse to 0.11
table(pgrast20log$g1, round(pgrast20log$dur_sec, digits=2))
If you have a lot of variables coming from the same table or want to run a lot of R steps on the same table, you may want to write something like below which is kind of equivalent to a SELECT a, b FROM sometable; As opposed to our above which is more like a SELECT sometable.a, sometable.b FROM sometable;. I generally like to be more explicit especially when I have a lot of R variables floating around.
attach(pgrast20log)
table(g1, round(dur_sec, digits=2))
detach(pgrast20log)
From this excerpt, we can tell that 64 bit floating point raster bands have the worst performance in general of any of the raster types. To get a true estimate we would run our simulation on different platforms and many times on same machine as well. We did run this a couple of times on our 32-bit windows XP and 7 and the 64BF is consistently slightly worse than the others. Our Windows 7 is about twice as fast in general but interestingly enough on the Windows 7, we see some suspicious outliers for 16BUI and 1BB. We think this is just turbulence since on other runs 64BF most always shows as the longest running.
0 0.02 0.03 0.05 0.06 0.08 0.09 0.11 0.16 0.19 16BSI 119 52 12 7 0 0 0 0 0 0 16BUI 122 48 18 1 0 0 0 0 1 0 1BB 121 57 8 1 1 1 0 0 0 1 2BUI 132 52 6 0 0 0 0 0 0 0 32BF 114 39 13 15 6 3 0 0 0 0 32BSI 115 41 10 17 6 0 1 0 0 0 32BUI 116 39 14 19 1 0 1 0 0 0 4BUI 130 53 7 0 0 0 0 0 0 0 64BF 113 41 1 2 10 12 10 1 0 0 8BSI 128 59 3 0 0 0 0 0 0 0 8BUI 124 59 5 2 0 0 0 0 0 0
Now we do the same exercise except flipping to see the culprit functions. Here we just want for each function if its better or worse than the 3rd Quantile standard deviation of our tests. Though this query can be done in SQL its much more succinct in R. Note: the 2 spellings in ST_SetValue, yes our machine gun test has spotted an inconsistency in our PostGIS documentation. Shame on us.
table(pgrast20log$func, pgrast20log$dur_sec > summary(pgrast20log$dur_sec)["3rd Qu."])
FALSE TRUE && 120 1 &< 121 0 &> 121 0 AddRasterColumn 3 8 DropRasterTable 11 0 insert data 10 1 PostGIS_Raster_Lib_Build_Date 1 0 PostGIS_Raster_Lib_Version 1 0 ST_AddBand 13 53 ST_AsBinary 6 5 ST_BandHasNoDataValue 22 0 ST_BandMetaData 7 15 ST_BandNoDataValue 22 0 ST_BandPath 22 0 ST_BandPixelType 22 0 ST_Box2D 11 0 ST_ConvexHull 11 0 ST_DumpAsPolygons 21 1 ST_Envelope 11 0 ST_GeoReference 22 0 ST_Height 11 0 ST_Intersection 220 0 ST_Intersects 438 2 ST_MakeEmptyRaster 14 0 ST_MetaData 11 0 ST_NumBands 11 0 ST_PixelAsPolygon 18 4 ST_PixelSizeX 11 0 ST_PixelSizeY 11 0 ST_Polygon 22 0 ST_Raster2WorldCoordX 22 0 ST_Raster2WorldCoordY 22 0 ST_SetBandHasNoDataValue 13 9 ST_SetBandNoDataValue 11 11 ST_SetGeoReference 11 11 ST_SetPixelSize 13 9 ST_SetSkew 11 11 ST_SetSRID 6 5 ST_SetUpperLeft 6 5 ST_SetValue 41 36 ST_SEtValue 31 24 ST_SkewX 11 0 ST_SkewY 11 0 ST_SRID 11 0 ST_UpperLeftX 11 0 ST_UpperLeftY 11 0 ST_Value 131 1 ST_Width 7 4 ST_World2RasterCoordX 77 0 ST_World2RasterCoordY 77 0
When you are writing PL/R code, you are pretty much just writing R code and living in the R environment. However, there are certain functions in R that you may come to prefer when working in PL/R than you would prefer in R. In PL/R you want data to be output in variables so you can easily return them in PL/R. The PostgreSQL language also doesn't provide a whole lot of mappings for the objects you get back in R. For our next example we want a simple text output of representation of the cross tabs we created and we want it to be returnable as an SQL statement. For this next example, we are going to lean on our favorite R functions for doing this. Other's more experienced may have better ways.
So now we show our reporting function which exercises all the above R functions.
CREATE OR REPLACE FUNCTION output_garden_stats() RETURNS text AS
$$
#geometry/geography stats we are only selecting tests that involve only one geometry type
pggeom20log <<- pg.spi.exec("SELECT logid ,log_label,func,g1, g2, log_start,log_end,
date_part('epoch',age(log_end, log_start)) As dur_sec
FROM postgis_garden_log ORDER BY log_start")
pgrast20log <<- pg.spi.exec("SELECT logid ,log_label,func,g1,g2,log_start,log_end,
date_part('epoch',age(log_end, log_start)) As dur_sec
FROM raster_garden_log ORDER BY log_start")
attach(pgrast20log) #make pgrast20log fields global
prsum.breakq <- summary(dur_sec)["3rd Qu."]
prsum <- table(func, dur_sec > prsum.breakq)
prdur <- table(g1, round(dur_sec,2))
detach(pgrast20log)
attach(pggeom20log) #make pggeom20log fields global
pggdur <- table(g1, round(dur_sec,1),useNA="ifany")
pgfuncdur <- table(func, round(dur_sec,1),useNA="ifany")
detach(pggeom20log) #make it not globlal
#create subset of data only containing CURVEPOLYGONS
pgpoly<-subset(pggeom20log, g1 == 'CURVEPOLYGON' | g2 == 'CURVEPOLYGON', select=c(func, dur_sec))
#create a cross tab count for just curvepolygon
#TRUE count for completions
#FALSE for processes that throw errors
pgpolydur <- table(pgpoly$func, !is.na(pgpoly$dur_sec))
#we use capture.output to capture the string representation of the data tables as would be shown on screen
reportoutput <- sprintf("RASTER results is duration greater than %s secs %s \r\r RASTER results duration breakout \r %s
GEOMETRY GEOGRAPHY by type %s \r GEOMETRY\GEOGRAPHY By function duration %s \r\r CURVEPOLYGON failure = FALSE, completion = TRUE by function %s", prsum.breakq,
paste(capture.output(prsum), sep ="", collapse="\r"), paste(capture.output(prdur), sep ="", collapse="\r"),
paste(capture.output(pggdur), sep ="", collapse="\r"), paste(capture.output(pgfuncdur), sep ="", collapse="\r")
, paste(capture.output(pgpolydur), sep ="", collapse="\r")
)
return(reportoutput)
$$
language 'plr';
To run our report, we run this command: SELECT output_garden_stats();
The report output can be seen http://www.postgresonline.com/demos/plr/postgis20report.txt
There are a lot of surprising things that we discovered simply looking at counts. Some might be things in the testing scripts (or even the PostGIS documentation) that need to be cleaned up and some may be PostGIS issues or just predictable oddities when a function is applied to 2 different kinds of geometries.
For example for the CURVEPOLYGON report:
So the moral of the story -- You can learn a lot from a dummy.
Join us next time as we continue our exploration of R and PL/R and demonstrate creating aggregates with PL/R and then later creating graphs and plots.