Quick Intro to R and PL/R - Part 1

In this article we'll provide a summary of what PL/R is and how to get up and running with it. Since we don't like repeating ourselves, we'll refer you to an article we wrote a while ago that is still fairly relevant today, Up and Running with PL/R (PLR) in PostgreSQL: An almost Idiot's Guide, and just fill in the parts that have changed. We should note that that particular series was geared more toward the spatial database programmer (PostGIS in particular). There is a lot of overlap between the PL/R, R, and PostGIS user bases, which include many environmental scientists and researchers in need of powerful charting and stats tools to analyze their data, and who tend to rank high on the smarts but low on the money end of the human spectrum.

This series will take more of a general PL/R user's perspective. We'll follow much the same style we used in Quick Intro to PL/Python, and we'll end the series with a PL/R cheatsheet similar to the one we had for PL/Python.

As stated in our State of PostGIS article, we'll be using log files we generated from our PostGIS stress tests. These stress tests were auto-generated from the official PostGIS documentation. The raster tests comprise 2,095 query executions exercising all the pixel types supported. The geometry/geography tests comprise 65,892 spatial SQL queries exercising every geometry/geography type supported in PostGIS 2.0 -- yes, this includes TINs, Triangles, Polyhedral Surfaces, Curved geometries, and all dimensions of them. Most queries are unique. If you are curious to see what these log tables look like or want to follow along with these exercises, you can download the tables from here.

What is R and PL/R and why should you care?

R is both a language and an environment for doing statistics and generating graphs and plots. It is GPL-licensed and a common favorite of universities and research institutions. PL/R is a procedural language for PostgreSQL that allows you to write database stored functions in R. Like SQL, R is a set-based, domain-specific language, except that instead of treating data as relations the way relational databases do, it thinks of data as matrices, lists, and vectors. I tend to think of it as a cross between LISP and SQL, though more experienced Lisp and R users will probably disagree with me on that. This makes it easier in many cases to tabulate data both across columns and across rows. The examples we will show in these exercises could be done in SQL, but they are much more succinct to write in R. In addition to the language itself, there is a whole wealth of statistical and graphing functions available in R that you will not find in any relational database, and these functions keep growing as more people contribute packages. Its packaging system, the Comprehensive R Archive Network (CRAN), is similar in concept to Perl's CPAN and the in-the-works PGXN for PostgreSQL.
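To give a quick flavor of that vector orientation, here is a trivial bit of base R (made-up numbers, nothing to do with our logs yet) that operates on a whole vector at once, where SQL would need an aggregate or a WHERE clause:

# a vector of five durations in seconds
dur <- c(0.01, 0.02, 0, 0.11, 0.19)
mean(dur)          # average of all elements, like AVG()
dur[dur > 0.1]     # subsetting, like a WHERE clause
round(dur, 1)      # applies to every element at once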

What do you need before you can use PL/R?
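In short: a PostgreSQL server, an R installation on that same server, and the PL/R handler itself, registered in each database you want to use it in. The Idiot's Guide article above walks through this in detail; as a minimal sketch, registration looks something like the following, depending on your PostgreSQL and PL/R versions:

-- PostgreSQL 9.1+ with a recent PL/R: install as an extension
CREATE EXTENSION plr;

-- older versions: run the plr.sql script that ships with PL/R, e.g.
-- psql -d yourdb -f plr.sql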

Our first PL/R stored function: creating data frames and saving to RData file

In R there are five basic kinds of data structures: scalars (really just vectors of length one), vectors, arrays, data frames, and lists. Lists are the most complex of all because they are collections of the other four data structures. You can think of the output of a PostgreSQL query or table as a data frame. In this simple example, we'll demonstrate how to feed a PostgreSQL query into an R data frame and save it in an R data file.
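Before the database version, here is a toy illustration (hand-typed values, not from our logs) showing that a data frame is just a set of equal-length column vectors:

# a small hand-built data frame resembling a two-column query result
df <- data.frame(func = c("ST_Value", "ST_AddBand", "ST_Polygon"),
                 dur_sec = c(0.016, 0.047, 0))
df$dur_sec   # a single column comes back as a vector
nrow(df)     # row count, akin to COUNT(*)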

CREATE OR REPLACE FUNCTION save_postgis_logs_rdata() RETURNS text AS
$$
    # pull the geometry/geography test log into an R data frame
    # (<<- assigns into R's global environment so save() can find it by name)
    pggeom20log <<- pg.spi.exec("SELECT logid, log_label, func, g1, g2, log_start, log_end,
       date_part('epoch', age(log_end, log_start)) As dur_sec
    FROM postgis_garden_log ORDER BY log_start")

    # pull the raster test log into a second data frame
    pgrast20log <<- pg.spi.exec("SELECT logid, log_label, func, g1, g2, log_start, log_end,
       date_part('epoch', age(log_end, log_start)) As dur_sec
    FROM raster_garden_log ORDER BY log_start")

    # write both data frames to a native R data file for easy reloading
    save(pggeom20log, pgrast20log, file="/temp/postgisstats20.RData")
    return("done")
$$
language 'plr';

The most commonly used function in PL/R is the pg.spi.exec routine, which takes an SQL statement and returns the result as an R data frame. In the example above, we dump the geometry/geography logging test stats and the raster stats from a PostGIS 2.0 test run into two R data frames. In addition, we save these data frames to a custom R data file. An R data file stores R objects in R's native format; it's useful for saving datasets you are working on so you can easily load them back later.
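With the function in place, generating the RData file is just an ordinary function call from psql or any SQL client:

SELECT save_postgis_logs_rdata();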

Using R interactively

R is both a scriptable and an interactive environment, and one you can automate a good bit of with PL/R. Before automating things, I like to test processes out interactively in R. For this next example, we'll load the data we saved with PL/R back into R and play with it.

To do so, launch R-GUI or R command line and at the prompt type:

load("/temp/postgisstats20.RData")
ls()

# ls() returns a listing of the variables we have loaded:
[1] "pggeom20log" "pgrast20log"
summary(pgrast20log)
# yields:
    logid         log_label             func                g1
Min.   :   1.0   Length:2095        Length:2095        Length:2095
1st Qu.: 524.5   Class :character   Class :character   Class :character
Median :1048.0   Mode  :character   Mode  :character   Mode  :character
Mean   :1048.0
3rd Qu.:1571.5
Max.   :2095.0
     g2             log_start           log_end             dur_sec
Length:2095        Length:2095        Length:2095        Min.   :0.000000
Class :character   Class :character   Class :character   1st Qu.:0.000000
Mode  :character   Mode  :character   Mode  :character   Median :0.000000
                                                         Mean   :0.008964
                                                         3rd Qu.:0.016000
                                                         Max.   :0.187000

So we know from the above summary that we have 2,095 observations, along with basic stats about the duration of each process in our raster log tests. We ran the above test on a fairly new, beefed-up Windows 7 machine and an older, clunkier Windows XP machine, both 32-bit and running PostgreSQL 9.0 / PostGIS 2.0. The Windows XP machine in general followed the same pattern as the Windows 7 one, except that timings were generally 50-80% worse.

One of my favorite functions in R is the table function, because it's always available and fairly fast and easy to use. To get details on the function, run this command:

help(table)

Basically what it does is provide a cross tab of counts. Here is an example where we take our raster log and cross-tabulate pixel type against time duration, rounding each duration to 2 decimal places so that numbers like 0.11, 0.112, etc. collapse into 0.11:

table(pgrast20log$g1, round(pgrast20log$dur_sec, digits=2))

If you have a lot of variables coming from the same table, or want to run a lot of R steps against the same table, you may want to write something like the block below, which is roughly the equivalent of a SELECT a, b FROM sometable; as opposed to our version above, which is more like a SELECT sometable.a, sometable.b FROM sometable;. I generally like to be more explicit, especially when I have a lot of R variables floating around.

attach(pgrast20log)
table(g1, round(dur_sec, digits=2))
detach(pgrast20log)
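If you'd rather not risk forgetting the detach, base R's with() gives you the same column-name shorthand scoped to a single expression:

# equivalent to the attach/table/detach block above
with(pgrast20log, table(g1, round(dur_sec, digits=2)))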

From this excerpt, we can tell that 64-bit floating point (64BF) raster bands have the worst performance in general of any of the raster pixel types. To get a truer estimate, we would run our simulation on different platforms, and many times on the same machine as well. We did run this a couple of times on our 32-bit Windows XP and Windows 7 machines, and 64BF is consistently slightly worse than the others. Our Windows 7 machine is about twice as fast in general, but interestingly enough, on the Windows 7 we see some suspicious outliers for 16BUI and 1BB. We think this is just turbulence, since on other runs 64BF almost always shows up as the longest running.

          0 0.02 0.03 0.05 0.06 0.08 0.09 0.11 0.16 0.19
  16BSI 119   52   12    7    0    0    0    0    0    0
  16BUI 122   48   18    1    0    0    0    0    1    0
  1BB   121   57    8    1    1    1    0    0    0    1
  2BUI  132   52    6    0    0    0    0    0    0    0
  32BF  114   39   13   15    6    3    0    0    0    0
  32BSI 115   41   10   17    6    0    1    0    0    0
  32BUI 116   39   14   19    1    0    1    0    0    0
  4BUI  130   53    7    0    0    0    0    0    0    0
  64BF  113   41    1    2   10   12   10    1    0    0
  8BSI  128   59    3    0    0    0    0    0    0    0
  8BUI  124   59    5    2    0    0    0    0    0    0

Now we do the same exercise, except flipping it to see the culprit functions. Here we just want to know, for each function, whether its durations fall above or below the 3rd quartile of all our test durations. Though this query can be done in SQL, it's much more succinct in R. Note the two spellings of ST_SetValue: yes, our machine-gun test has spotted an inconsistency in our PostGIS documentation. Shame on us.

table(pgrast20log$func, pgrast20log$dur_sec > summary(pgrast20log$dur_sec)["3rd Qu."])
                               FALSE TRUE
 &&                              120    1
 &<                              121    0
 &>                              121    0
 AddRasterColumn                   3    8
 DropRasterTable                  11    0
 insert data                      10    1
 PostGIS_Raster_Lib_Build_Date     1    0
 PostGIS_Raster_Lib_Version        1    0
 ST_AddBand                       13   53
 ST_AsBinary                       6    5
 ST_BandHasNoDataValue            22    0
 ST_BandMetaData                   7   15
 ST_BandNoDataValue               22    0
 ST_BandPath                      22    0
 ST_BandPixelType                 22    0
 ST_Box2D                         11    0
 ST_ConvexHull                    11    0
 ST_DumpAsPolygons                21    1
 ST_Envelope                      11    0
 ST_GeoReference                  22    0
 ST_Height                        11    0
 ST_Intersection                 220    0
 ST_Intersects                   438    2
 ST_MakeEmptyRaster               14    0
 ST_MetaData                      11    0
 ST_NumBands                      11    0
 ST_PixelAsPolygon                18    4
 ST_PixelSizeX                    11    0
 ST_PixelSizeY                    11    0
 ST_Polygon                       22    0
 ST_Raster2WorldCoordX            22    0
 ST_Raster2WorldCoordY            22    0
 ST_SetBandHasNoDataValue         13    9
 ST_SetBandNoDataValue            11   11
 ST_SetGeoReference               11   11
 ST_SetPixelSize                  13    9
 ST_SetSkew                       11   11
 ST_SetSRID                        6    5
 ST_SetUpperLeft                   6    5
 ST_SetValue                      41   36
 ST_SEtValue                      31   24
 ST_SkewX                         11    0
 ST_SkewY                         11    0
 ST_SRID                          11    0
 ST_UpperLeftX                    11    0
 ST_UpperLeftY                    11    0
 ST_Value                        131    1
 ST_Width                          7    4
 ST_World2RasterCoordX            77    0
 ST_World2RasterCoordY            77    0
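As an aside, the 3rd quartile cutoff we pulled out of summary() can also be computed directly with base R's quantile function, avoiding the named-element lookup (summary just rounds the same value for display):

# essentially the same cutoff as summary(pgrast20log$dur_sec)["3rd Qu."]
quantile(pgrast20log$dur_sec, 0.75)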

Second PL/R stored function: creating a plain text report

When you are writing PL/R code, you are pretty much just writing R code and living in the R environment. However, there are certain R functions you'll come to rely on more heavily in PL/R than you would in interactive R, because in PL/R you want data collected into variables you can easily return, and PL/R doesn't provide PostgreSQL type mappings for many of the objects you get back in R. For our next example, we want a simple text representation of the cross tabs we created, returnable from an SQL statement. We are going to lean on our favorite R functions for doing this; others more experienced may have better ways.
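The workhorses here are capture.output, which grabs what R would have printed to the screen as a character vector of lines, and paste with collapse, which glues those lines into the single string a PL/R function can return as text. A quick interactive sketch using the raster table from earlier:

# each printed line of the cross tab becomes one element of a character vector
tablines <- capture.output(table(pgrast20log$g1, round(pgrast20log$dur_sec, 2)))
# collapse the vector into one string with carriage returns between lines
report <- paste(tablines, sep="", collapse="\r")
nchar(report)   # a single long string, ready to return from PL/R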

So now we show our reporting function which exercises all the above R functions.

CREATE OR REPLACE FUNCTION output_garden_stats() RETURNS text AS
$$ 
    #geometry/geography stats: pull the full test log
    pggeom20log <<- pg.spi.exec("SELECT logid ,log_label,func,g1, g2, log_start,log_end, 
       date_part('epoch',age(log_end, log_start))  As dur_sec 
    FROM postgis_garden_log  ORDER BY log_start")
    
    pgrast20log <<- pg.spi.exec("SELECT logid ,log_label,func,g1,g2,log_start,log_end, 
       date_part('epoch',age(log_end, log_start))  As dur_sec 
    FROM raster_garden_log ORDER BY log_start")

    attach(pgrast20log) #put pgrast20log columns on the search path so they are accessible by name
    prsum.breakq <- summary(dur_sec)["3rd Qu."]
    prsum <- table(func, dur_sec > prsum.breakq)
    prdur <- table(g1, round(dur_sec,2))
    detach(pgrast20log)

    attach(pggeom20log) #put pggeom20log columns on the search path
    pggdur <- table(g1, round(dur_sec,1),useNA="ifany")
    pgfuncdur <- table(func, round(dur_sec,1),useNA="ifany")
    detach(pggeom20log) #take the columns back off the search path
    #create subset of data only containing CURVEPOLYGONS
    pgpoly<-subset(pggeom20log, g1 == 'CURVEPOLYGON' | g2 == 'CURVEPOLYGON', select=c(func, dur_sec))
    
    #create a cross tab count for just curvepolygon 
    #TRUE count for completions 
    #FALSE for processes that throw errors
    pgpolydur <- table(pgpoly$func, !is.na(pgpoly$dur_sec))
    
    #we use capture.output to capture the string representation of the data tables as would be shown on screen
    reportoutput <- sprintf("RASTER results: is duration greater than %s secs %s \r\r RASTER results duration breakout \r %s
        GEOMETRY/GEOGRAPHY by type %s \r GEOMETRY/GEOGRAPHY by function duration %s \r\r CURVEPOLYGON failure = FALSE, completion = TRUE by function %s", prsum.breakq, 
            paste(capture.output(prsum), sep ="", collapse="\r"), paste(capture.output(prdur), sep ="", collapse="\r"),
            paste(capture.output(pggdur), sep ="", collapse="\r"), paste(capture.output(pgfuncdur), sep ="", collapse="\r")
            , paste(capture.output(pgpolydur), sep ="", collapse="\r")
            )
    return(reportoutput)
$$
language 'plr'; 

To run our report, we run this command: SELECT output_garden_stats();

The report output can be seen at http://www.postgresonline.com/demos/plr/postgis20report.txt

There are a lot of surprising things we discovered simply by looking at counts. Some might be things in the testing scripts (or even the PostGIS documentation) that need to be cleaned up, and some may be PostGIS issues, or just predictable oddities when a function is applied to two different kinds of geometries.

For example, the CURVEPOLYGON report shows at a glance which functions complete (TRUE) and which always error out (FALSE) when handed a curved geometry.

So the moral of the story -- You can learn a lot from a dummy.

Join us next time as we continue our exploration of R and PL/R and demonstrate creating aggregates with PL/R and then later creating graphs and plots.