Quick Intro to R and PL/R - Part 1 - Postgres OnLine Journal

In this article we'll provide a summary of what PL/R is and how to get running with it. Since we don't like repeating ourselves, we'll refer you to an article we wrote a while ago which is still fairly relevant today called Up and Running with PL/R (PLR) in PostgreSQL: An almost Idiot's Guide and just fill in the parts that have changed. We should note that particular series was more geared toward the spatial database programmer (PostGIS in particular). There is a lot of overlap between the PL/R, R, and PostGIS user-base which is comprised of many environmental scientists and researchers in need of powerful charting and stats tools to analyse their data who are high on the smart but low on the money human spectrum.

This series will be more of a general PL/R user perspective. We'll follow more of the same style we did with Quick Intro to PL/Python. We'll end our series with a PL/R cheatsheet similar to what we had for PL/Python.

As stated in our State of PostGIS article, we'll be using log files we generated from our PostGIS stress tests. These stress tests were auto-generated from the PostGIS official documentation. The raster tests are comprised of 2,095 query executions exercising all the pixel types supported. The geometry/geograpy tests are comprised of 65,892 spatial SQL queries exercising every PostGIS geometry/geography supported in PostGIS 2.0 -- yes this includes TINS, Triangles,Polyhedral Surfaces, Curved geometries and all dimensions of them. Most queries are unique. If you are curious to see what these log tables look like or want to follow along with these exercises, you can download the tables from here.

What is R and PL/R and why should you care?

R is both a language and an environment for doing statistics and generating graphs and plots. It is GNU-licensed and a common favorite of Universities and Research institutions. PL/R is a procedural language for PostgreSQL that allows you to write database stored functions in R. R is a set-based and domain specific language similar to SQL except unlike the way relational databases treat data, it thinks of data as matrices, lists and vectors. I tend to think of it as a cross between LISP and SQL though more experienced Lisp and R users will probably disagree with me on that. This makes it easier in many cases to tabulate data both across columns as well as across rows. The examples we will show in these exercises, could be done in SQL, but they are much more succinct to write in R. In addition to the language itself, there are a whole wealth of statistical and graphing functions available in R that you will not find in any relational database. These functions are growing as more people contribute packages. Its packaging system called Comprehensive R Archive (CRAN) is similar in concept to Perl's CPAN and the in the works PGXN for PostgreSQL.

What do you need before you can use PL/R?

PostgreSQL and latest version of PL/R at this time plr-8.3.0.11 which works on PostgreSQL 8.3-9.0
R installed on your PostgreSQL server. Note: That there are both 32-bit and 64-bit binaries available for Windows as well as Mac OSX and several variants of Linux. You can get these from R CRAN - http://cran.r-project.org/
PL/R handler for PostgreSQL. For Windows both 32-bit and 64-bit, Fedora, and Debian there are installers available which you can find on http://www.joeconway.com/plr/. For others you'll have to compile from source.
NOTE: the windows binaries were built with R 2.11.1 and installer won't work with the newer or older R 2.12, but you can use the binaries if you are using R below 2.11.1 or above R.2.11.1 if you set the environment variables as we described in Up and Running with PL/R (PLR) in PostgreSQL: An almost Idiot's Guide. For Windows R 2.12 -- the path variable you need to add to your path environment variable has changed a little. It's now path to R install\bin\i386 or R\bin\x64. So for example if I am on 32-bit Windows I would add to my environment path variable -- C:\Program Files\R\R-2.12.0\bin\i386 and my R_HOME would be C:\Program Files\R\R-2.12.0. Even if you are on 64-bit windows, you will need to use the 32-bit package if you are using 32-bit PostgreSQL or you want to use R packages that are not currently supported in 64-bit Windows R.

Then you run the plr.sql file which gets installed in share/contrib in your PostgreSQL database to enable the language in that database.

Our first PL/R stored function: creating data frames and saving to RData file

In R there are 5 basic kinds of data structures: scalars, vectors, arrays, dataframes, and lists. Lists are the most complex of all because they are collections of the other 4 data structures. You can think of a PostgreSQL query output of table as a data frame. In this simple example, we'll demonstrate how to feed a PostgreSQL query into an R dataframe and save it in an R data file.

CREATE OR REPLACE FUNCTION save_postgis_logs_rdata() RETURNS text AS
$$ 
    pggeom20log <<- pg.spi.exec("SELECT logid ,log_label,func,g1,g2,log_start,log_end, 
       date_part('epoch',age(log_end, log_start))  As dur_sec 
    FROM postgis_garden_log ORDER BY log_start")
    
    pgrast20log <<- pg.spi.exec("SELECT logid ,log_label,func,g1,g2,log_start,log_end, 
       date_part('epoch',age(log_end, log_start))  As dur_sec 
    FROM raster_garden_log ORDER BY log_start")
    
    save(pggeom20log,pgrast20log, file="/temp/postgisstats20.RData")
    return("done")
$$
language 'plr';

The most common function used in PL/R is the pg.spi.exec routine which takes as input SQL statements and converts them to R dataframes. In the example above, we are dumping our geometry/geography logging test stats and our raster stats from a PostGIS 2.0 test run and saving them to 2 R DataFrames. In addition we are saving these data frames to a custom R Data file. An R Data file is a file that stores R objects in native R format. It's useful for saving datasets you are working on for easy load back.

Using R interactively

R is both a scriptable and an interactive environment and one that you can also automate a good bit of with PL/R. Before automating things, I like to test processes out interactively in R. For this next example, we'll load back the data we saved using PL/R into R and play with it.

To do so, launch R-GUI or R command line and at the prompt type:

load("/temp/postgisstats20.RData")
ls()

-- the ls output will return listing of variables we have loaded 
[1] "pggeom20log" "pgrast20log"

summary(pgrast20log)

-- yields
    logid         log_label             func                g1
Min.   :   1.0   Length:2095        Length:2095        Length:2095
1st Qu.: 524.5   Class :character   Class :character   Class :character
Median :1048.0   Mode  :character   Mode  :character   Mode  :character
Mean   :1048.0
3rd Qu.:1571.5
Max.   :2095.0
     g2             log_start           log_end             dur_sec
Length:2095        Length:2095        Length:2095        Min.   :0.000000
Class :character   Class :character   Class :character   1st Qu.:0.000000
Mode  :character   Mode  :character   Mode  :character   Median :0.000000
                                                         Mean   :0.008964
                                                         3rd Qu.:0.016000
                                                         Max.   :0.187000

So we know from the above summary that we have 2095 observations and basic stats about the average duration of each process in our raster log tests. We ran the above test on a fairly new beefed up Winodw 7 and an older clunkier Windows Xp. Both 32-bit running PostgreSQL 9.0 / PostGIS 2.0. The windows xp in general followed the same pattern as the windows 7 except timings were generally 50-80% worse.

One of my favorite functions in R is the table function because it's always present and fairly fast and easy to use. To get details on the function run this command:

help(table)

Basically what it does is provide a cross tab of counts. Here is an example use where we take our raster log and output the time durations but each time duration we round to 2 decimals so that our numbers in 0.11, 0.112 etc. collapse to 0.11

table(pgrast20log$g1, round(pgrast20log$dur_sec, digits=2))

If you have a lot of variables coming from the same table or want to run a lot of R steps on the same table, you may want to write something like below which is kind of equivalent to a SELECT a, b FROM sometable; As opposed to our above which is more like a SELECT sometable.a, sometable.b FROM sometable;. I generally like to be more explicit especially when I have a lot of R variables floating around.

attach(pgrast20log)
table(g1, round(dur_sec, digits=2))
detach(pgrast20log)

From this excerpt, we can tell that 64 bit floating point raster bands have the worst performance in general of any of the raster types. To get a true estimate we would run our simulation on different platforms and many times on same machine as well. We did run this a couple of times on our 32-bit windows XP and 7 and the 64BF is consistently slightly worse than the others. Our Windows 7 is about twice as fast in general but interestingly enough on the Windows 7, we see some suspicious outliers for 16BUI and 1BB. We think this is just turbulence since on other runs 64BF most always shows as the longest running.

          0 0.02 0.03 0.05 0.06 0.08 0.09 0.11 0.16 0.19
  16BSI 119   52   12    7    0    0    0    0    0    0
  16BUI 122   48   18    1    0    0    0    0    1    0
  1BB   121   57    8    1    1    1    0    0    0    1
  2BUI  132   52    6    0    0    0    0    0    0    0
  32BF  114   39   13   15    6    3    0    0    0    0
  32BSI 115   41   10   17    6    0    1    0    0    0
  32BUI 116   39   14   19    1    0    1    0    0    0
  4BUI  130   53    7    0    0    0    0    0    0    0
  64BF  113   41    1    2   10   12   10    1    0    0
  8BSI  128   59    3    0    0    0    0    0    0    0
  8BUI  124   59    5    2    0    0    0    0    0    0

Now we do the same exercise except flipping to see the culprit functions. Here we just want for each function if its better or worse than the 3rd Quantile standard deviation of our tests. Though this query can be done in SQL its much more succinct in R. Note: the 2 spellings in ST_SetValue, yes our machine gun test has spotted an inconsistency in our PostGIS documentation. Shame on us.

table(pgrast20log$func, pgrast20log$dur_sec > summary(pgrast20log$dur_sec)["3rd Qu."])

                               FALSE TRUE
 &&                              120    1
 &<                              121    0
 &>                              121    0
 AddRasterColumn                   3    8
 DropRasterTable                  11    0
 insert data                      10    1
 PostGIS_Raster_Lib_Build_Date     1    0
 PostGIS_Raster_Lib_Version        1    0
 ST_AddBand                       13   53
 ST_AsBinary                       6    5
 ST_BandHasNoDataValue            22    0
 ST_BandMetaData                   7   15
 ST_BandNoDataValue               22    0
 ST_BandPath                      22    0
 ST_BandPixelType                 22    0
 ST_Box2D                         11    0
 ST_ConvexHull                    11    0
 ST_DumpAsPolygons                21    1
 ST_Envelope                      11    0
 ST_GeoReference                  22    0
 ST_Height                        11    0
 ST_Intersection                 220    0
 ST_Intersects                   438    2
 ST_MakeEmptyRaster               14    0
 ST_MetaData                      11    0
 ST_NumBands                      11    0
 ST_PixelAsPolygon                18    4
 ST_PixelSizeX                    11    0
 ST_PixelSizeY                    11    0
 ST_Polygon                       22    0
 ST_Raster2WorldCoordX            22    0
 ST_Raster2WorldCoordY            22    0
 ST_SetBandHasNoDataValue         13    9
 ST_SetBandNoDataValue            11   11
 ST_SetGeoReference               11   11
 ST_SetPixelSize                  13    9
 ST_SetSkew                       11   11
 ST_SetSRID                        6    5
 ST_SetUpperLeft                   6    5
 ST_SetValue                      41   36
 ST_SEtValue                      31   24
 ST_SkewX                         11    0
 ST_SkewY                         11    0
 ST_SRID                          11    0
 ST_UpperLeftX                    11    0
 ST_UpperLeftY                    11    0
 ST_Value                        131    1
 ST_Width                          7    4
 ST_World2RasterCoordX            77    0
 ST_World2RasterCoordY            77    0

Second PL/R stored function, creating a plain text report

When you are writing PL/R code, you are pretty much just writing R code and living in the R environment. However, there are certain functions in R that you may come to prefer when working in PL/R than you would prefer in R. In PL/R you want data to be output in variables so you can easily return them in PL/R. The PostgreSQL language also doesn't provide a whole lot of mappings for the objects you get back in R. For our next example we want a simple text output of representation of the cross tabs we created and we want it to be returnable as an SQL statement. For this next example, we are going to lean on our favorite R functions for doing this. Other's more experienced may have better ways.

capture.output - this is an R function that will instead of returning to the screen the output of a variable or a command -- will return to a variable what it would normally return to the viewer's screen. So it will give us the same output as we have above. The only caveate is that it returns a string array where each line is a separate line on the screen. THis works kind of like the sink function in R except it doesn't have the messiness of needing a file to redirect screen output to.
paste - this we use to commonly collapse output such as arrays. It has many uses, but we are using it as a reverse split function. Kind of like a implode in PHP.
sprintf - does pretty much the same thing as it does in C and allows you to format text with place holders for variables
subset -- the subset function in R is akin to a SELECT FROM WHERE in SQL. Except | means or & means AND and == means =. The c() function in R is a way of creating vectors

So now we show our reporting function which exercises all the above R functions.


 CREATE OR REPLACE FUNCTION output_garden_stats() RETURNS text AS
$$ 
    #geometry/geography stats we are only selecting tests that involve only one geometry type
    pggeom20log <<- pg.spi.exec("SELECT logid ,log_label,func,g1, g2, log_start,log_end, 
       date_part('epoch',age(log_end, log_start))  As dur_sec 
    FROM postgis_garden_log  ORDER BY log_start")
    
    pgrast20log <<- pg.spi.exec("SELECT logid ,log_label,func,g1,g2,log_start,log_end, 
       date_part('epoch',age(log_end, log_start))  As dur_sec 
    FROM raster_garden_log ORDER BY log_start")

    attach(pgrast20log) #make pgrast20log fields global
    prsum.breakq <- summary(dur_sec)["3rd Qu."]
    prsum <- table(func, dur_sec > prsum.breakq)
    prdur <- table(g1, round(dur_sec,2))
    detach(pgrast20log)

    attach(pggeom20log) #make pggeom20log fields global
    pggdur <- table(g1, round(dur_sec,1),useNA="ifany")
    pgfuncdur <- table(func, round(dur_sec,1),useNA="ifany")
    detach(pggeom20log) #make it not globlal
    #create subset of data only containing CURVEPOLYGONS
    pgpoly<-subset(pggeom20log, g1 == 'CURVEPOLYGON' | g2 == 'CURVEPOLYGON', select=c(func, dur_sec))
    
    #create a cross tab count for just curvepolygon 
    #TRUE count for completions 
    #FALSE for processes that throw errors
    pgpolydur <- table(pgpoly$func, !is.na(pgpoly$dur_sec))
    
    #we use capture.output to capture the string representation of the data tables as would be shown on screen
    reportoutput <- sprintf("RASTER results is duration greater than %s secs %s \r\r RASTER results duration breakout \r %s
        GEOMETRY GEOGRAPHY by type %s \r GEOMETRY\GEOGRAPHY By function duration %s \r\r CURVEPOLYGON failure = FALSE, completion = TRUE by function %s", prsum.breakq, 
            paste(capture.output(prsum), sep ="", collapse="\r"), paste(capture.output(prdur), sep ="", collapse="\r"),
            paste(capture.output(pggdur), sep ="", collapse="\r"), paste(capture.output(pgfuncdur), sep ="", collapse="\r")
            , paste(capture.output(pgpolydur), sep ="", collapse="\r")
            )
    return(reportoutput)
$$
language 'plr';

To run our report, we run this command: SELECT output_garden_stats();

The report output can be seen http://www.postgresonline.com/demos/plr/postgis20report.txt

There are a lot of surprising things that we discovered simply looking at counts. Some might be things in the testing scripts (or even the PostGIS documentation) that need to be cleaned up and some may be PostGIS issues or just predictable oddities when a function is applied to 2 different kinds of geometries.

For example for the CURVEPOLYGON report:

the total counts for the operators are not the same. The && for most is about twice that of most of the other operators. This may mean we are overtesting && perhaps maybe its listed twice in the documentation.
Another oddity is that the ST_Split function, ST_ShortestLine and some other functions that are not supposed to work on CURVEPOLYGONS return an answer 3 out of 29 times. What are these 3 times? On closer inspection of the detail records -- it turns out that specialty geometries such as our favorite EMPTY geometry, MULTI NULL, and what we call a SINGLE NULL (hmm is that a real thing - we should probably just change that to 1 NULL), will work because they are special cases. When you split an empty geometry with anything you get back an empty geometry for example.
Another interesting conclusion is the ST_SnapToGrid which again shouldn't work on CURVEPOLYGONS. Yet we have 2 cases where it does. This is because there are about 3 implementations of ST_SnapToGrid and one is snapping to another geoemtry. If you snap any geometry to NULL evidentally you get a NULL back.

So the moral of the story -- You can learn a lot from a dummy.

Join us next time as we continue our exploration of R and PL/R and demonstrate creating aggregates with PL/R and then later creating graphs and plots.