OGR foreign data wrapper on Windows first taste

This christmas I received something very special from Paul Ramsey and Even Roualt as detailed in Foreign Data Wrappers for PostGIS. It's been something I've been patiently waiting for for 4 years. I think it has a few issues I'm working to replicate, but overall it's much faster than I expected and pretty slick.

So why is ogr_fdw so special, because GDAL/OGR is an avenue to many data sources, NOT JUST GEOSPATIAL. It's the NOT JUST that I am most excited about. Though the focus is geospatial you can use it with non-geospatial datasources, as we described a long time ago in OGR2OGR for data loading

Windows Bag O' FDWs for PostgreSQL 9.4 coming soon

To celebrate this new gift, I've compiled ogr_fdw binaries for PostgreSQL 9.4 windows (both 32-bit and 64-bit). I'll be doing the same for other PostgreSQL versions in due time and hope to include this in upcoming PostGIS 2.2 windows bundle. For now, we'll be including the ogr_fdw in our bag-of-fdws for both 64-bit and 32-bit PostgreSQL 9.4. I'm still testing these out before I release, and will provide download links for these in a separate article.

Now this new bag o fdws, contains my other favorite FDWs -- the file_textarray_fdw, which we can't imagine living without for the kind of work we do. I also included www_fdw. Both file_textarray_fdw and www_fdw are show-cased in our new PostgreSQL: Up and Running, 2nd edition book, which just came out in print. Our PostGIS In Action 2nd edition, is still sadly one or 2 months away from hard-copy print and in final stages of proofing.

PostgreSQL 9.4 Windows OGR_FDW

At first pass, I used the same GDAL library (currently 1.11.1) as we use to build PostGIS 2.1 with. This unfortunately was not good enough for our needs, so I recompiled GDAL 1.11.1 to include Expat 2.1 and FreeXL 1.0.0h, as well as Curl (since I had compiled that for www_fdw) dependencies. With expat dependency, you get access to being able to read Microsoft Excel 2007+ XSLX workbooks and OpenOffice spreadsheets via this FDW and with FreeXL you get access to reading older Microsoft spreadsheet binary format .XLS files.

UPDATE: I've recompiled GDAL with ODBC support

For those interested in the gory details of my compile, I have it on my ogr_fdw_depends_build.sh gist which I'm fine tuning a bit.

I wanted to include ODBC, but it seems under mingw the ODBC is called odbc32 and haven't dug in enough to figure out how to trick GDAL into using that instead of the odbc it is looking for. With ODBC you get a lot of other stuff like any data source you can access via ODBC.

This version of GDAL I tested with the PostGIS 2.1.5 windows binaries we recently released on application stackbuilder and seems to be a simple drop and replace affair (for the packaged libgdal-1.dll) (just make sure you also copy the included expat dll, lib-curl,and libfreexl as well) and keep in mind, if you use these, they may be overwritten by a PostGIS upgrade you do.

When we built GDAL for PostGIS, we took only the defaults, because we didn't want to deal with extra dependencies. This meant really useful common drivers like ODBC, Geospatial webservices, OpenOffice Spreadsheet, Excel XLSX, Excel XLS, and SQLite/SpatiaLite will not be accessible with the version of GDAL packaged with PostGIS windows stackbuilder install. I'm rethinking this for PostGIS 2.2. Right now, PostGIS doesn't need the OGR (vector) api of GDAL, and GDAL is only used by the PostGIS raster api to access the raster drivers. So the GDAL vector (OGR) bindings have just been conveniently hybernating waiting for someone to wake them up. Also since this FDW doesn't actually rely on PostGIS, you don't need PostGIS to use it.

I also don't compile with PostgreSQL support, which at first glance seems like a real oddity. The main reason for that is that I use the same GDAL for multiple versions of PostgreSQL, but I don't distribute libpq.dll since that comes packaged with PostgreSQL EDB installers. So I didn't want to run into weird dependency issues with GDAL trying to hook into a different version of libpq.dll than what it was compiled with. If you do compile with PostgreSQL, then you get PostgreSQL (and for raster -- ability to read PostGIS raster), which are very useful for client side use, but of marginal interest for server side (when you've got postgres_fdw already).

The unhighlighted vector drivers are what you get when you build GDAL with no extra dependencies. Still a very rich list. The yellow highlighted ones are the additional ones you get compiling with expat, freexl and Curl. The green highlighted ones are the additional you get compiling with ODBC.

Supported Formats:
  -> "ESRI Shapefile" (read/write)
  -> "MapInfo File" (read/write)
  -> "UK .NTF" (readonly)
  -> "SDTS" (readonly)
  -> "TIGER" (read/write)
  -> "S57" (read/write)
  -> "DGN" (read/write)
  -> "ODBC" (read/write)
  -> "VRT" (readonly)
  -> "REC" (readonly)
  -> "Memory" (read/write)
  -> "BNA" (read/write)
  -> "CSV" (read/write)
  -> "GML" (read/write)
  -> "GPX" (read/write)
  -> "KML" (read/write)
  -> "GeoJSON" (read/write)
  -> "GMT" (read/write)
  -> "WAsP" (read/write)
  -> "PGeo" (readonly)
  -> "MSSQLSpatial" (read/write)
  -> "PCIDSK" (read/write)
  -> "OpenFileGDB" (readonly)
  -> "XPlane" (readonly)
  -> "AVCBin" (readonly)
  -> "AVCE00" (readonly)
  -> "DXF" (read/write)
  -> "Geoconcept" (read/write)
  -> "GeoRSS" (read/write)
  -> "GPSTrackMaker" (read/write)
  -> "PGDump" (read/write)
  -> "GPSBabel" (read/write)
  -> "SUA" (readonly)
  -> "OpenAir" (readonly)
  -> "PDS" (readonly)
  -> "WFS" (readonly)
  -> "HTF" (readonly)
  -> "AeronavFAA" (readonly)
  -> "Geomedia" (readonly)
  -> "EDIGEO" (readonly)
  -> "GFT" (read/write) (Google Fusion tables)
  -> "GME" (read/write) (Google Map tables)
  -> "SVG" (readonly)
  -> "CouchDB" (read/write)
  -> "Idrisi" (readonly)
  -> "ARCGEN" (readonly)
  -> "SEGUKOOA" (readonly)
  -> "SEGY" (readonly)
  -> "XLS" (readonly)
  -> "ODS" (read/write)
  -> "XLSX" (read/write)
  -> "ElasticSearch" (read/write)
  -> "PDF" (read/write)
  -> "Walk" (readonly)
  -> "CartoDB" (readonly)
  -> "SXF" (readonly)

Compile with more, and you could get as many as this. I'll be putting the MapInfo, ESRI Shapefile and MS Office excel drivers to use right away.

Taking ogr_fdw for a test drive

How you connect to datasources varies a bit from driver to driver. For example an excel workbook is considered a server, and each worksheet is a layer (table in PostgreSQL speak). For datasources like ESRI Shapefile or CSV files, the server is the folder that the files sit in and each individual file is a table.

Paul already demonstrated some examples of using Web Feature service (WFS) and ESRI OpenFileGDB (ESRIs geodatabase) examples https://github.com/pramsey/pgsql-ogr-fdw. Those examples may be exciting to folks who get high on spatial data, but for most people, I imagine they are yawn material. But what about good'ol spreadsheets or other relational databases. Something most people work with?

To take ogr_fdw for a ride

  1. I downloaded a random xlsx file from: Student weight status category reporting results: beginning 2010 and saved to my C:/fdw_data/excel folder on my PostgreSQL server.
  2. Then using the command line tool: ogr_fdw_info that comes packaged with ogr_fdw, I checked to see what spreadsheets the xlsx file had in it with this command:
    ogr_fdw_info -s "C:/fdw_data/excel/Student_Weight_Status_Category_Reporting_Results__Beginning_2010.xlsx"
    and got this: Student Weight Status Category

    Sadly the government had no excel workbooks I could find that had more than one worksheet; a big waste of a workbook. I did try with some of the beauties that people send us and confirmed that yes each worksheet gets listed as a separate layer.

  3. To get the table structure for the spreadsheet I do this:
    ogr_fdw_info -s "C:/fdw_data/excel/Student_Weight_Status_Category_Reporting_Results__Beginning_2010.xlsx" -l "Student Weight Status Category"

    Which outputs:

    CREATE SERVER myserver
      FOREIGN DATA WRAPPER ogr_fdw
      OPTIONS (
        datasource 'C:/fdw_data/excel/Student_Weight_Status_Category_Reporting_Results__Beginning_2010.xlsx',
        format 'XLSX' );
    
    CREATE FOREIGN TABLE student_weight_status_category (
      fid integer,
      location_code varchar,
      county varchar,
      area_name varchar,
      region varchar,
      school_years varchar,
      no__overweight real,
      pct_overweight real,
      no__obese real,
      pct_obese real,
      no__overweight_or_obese real,
      pct_overweight_or_obese real,
      grade_level varchar,
      area_type varchar,
      street_address varchar,
      city varchar,
      state varchar,
      zip_code varchar,
      location_1 varchar )
      SERVER myserver
      OPTIONS ( layer 'Student Weight Status Category' );

    Now this isn't a geospatial file, but GDAL has this narrow-minded view of the world that everything worth recording involves location or space. Unfortunately you can't simply remove that geom column, but you can ignore it. Just set the type to bytea and call it a day and ignore that column when you query. Also I'm going to change the name of the server to reflect its a single workbook This was a bug in the outdated version of ogr_fdw_info I was using. There is no geometry column output if you feed it a non-spatial file

So when all is said and done, you should run a script that looks something like:

CREATE SERVER student_weight_workbook
  FOREIGN DATA WRAPPER ogr_fdw
  OPTIONS (
    datasource 'C:/fdw_data/excel/Student_Weight_Status_Category_Reporting_Results__Beginning_2010.xlsx',
    format 'XLSX' );

CREATE FOREIGN TABLE student_weight_status_category (
  fid integer,
  location_code varchar,
  county varchar,
  area_name varchar,
  region varchar,
  school_years varchar,
  no__overweight real,
  pct_overweight real,
  no__obese real,
  pct_obese real,
  no__overweight_or_obese real,
  pct_overweight_or_obese real,
  grade_level varchar,
  area_type varchar,
  street_address varchar,
  city varchar,
  state varchar,
  zip_code varchar,
  location_1 varchar )
  SERVER student_weight_workbook
  OPTIONS ( layer 'Student Weight Status Category' );

Then you can run any sql and even join your new table with any local table like this:

  SELECT county , sum(no__overweight) As tot_overweight 
  FROM student_weight_status_category 
  WHERE region ILIKE '%North%'
  GROUP BY county
  ORDER BY county;

and get an output like:
    county  | tot_overweight
------------+----------------
ALBANY      |           8084
CLINTON     |           2638
COLUMBIA    |           1903
DELAWARE    |           1581
ESSEX       |            943
FRANKLIN    |           2570
FULTON      |           1209
GREENE      |           1805
HAMILTON    |             46
MONTGOMERY  |           1794
N/A         |          18927
OTSEGO      |           2039
RENSSELAER  |           5366
SARATOGA    |           6747
SCHENECTADY |           4700
SCHOHARIE   |            893
WARREN      |           2714
WASHINGTON  |           2646

I did run into one issue that on occasion if I started with just querying a subset of the columns, I got weird errors like: value too long for type character varying(11) which would mysteriously disappear the second time I run the query or if I do a select of the whole table twice the issue would disappear. I haven't determined if its an issue with the specific data source driver, the ogr_fdw, or somethng about my compile. Haven't had quite the same issue with ESRI shapefiles, though I have had others with things like counts not being the same as first time I ran if I run two simulatenous queries or server crashing if I run two queries and cancel one out. So I suspect there is a dirty memory issue somewhere, but not sure where at the moment.