Loading and Processing GPX XML files using PostgreSQL

Wednesday, April 29. 2009

Loading and Processing GPX XML files using PostgreSQL

Recommended Books: XSLT XSLT 2.0 and XPath 2.0 Programmer's Reference

Simon Greener, wrote an article on how to load GPX xml files into Oracle XMLDB. That got me thinking that I haven't really explored all the XML features that PostgreSQL has to offer and to some extent I've been reticent about XML processed in any database for that matter.

In this article we shall attempt to perform the same feats that Simon did, but with PostgreSQL instead of Oracle XMLDB. Note while we are demonstrating this with a GPX file, the same XPath approach can be used to process any XML file.

PostgreSQL since 8.3 has had ANSI SQL 2003 XML functionality built in. Before 8.3, you could use the xml2 contrib module to achieve the same effect in a not so standards compliant sort of way. In this example we shall demonstrate the built in functionality in 8.3 and above. The key function we will use is the xpath function. XPath is a language used to query XML data and PostgreSQL supports the XPath 1.0 version. The following is a quick primer on XPath that seems useful http://www.zvon.org/xxl/XPathTutorial/General/examples.html. You can also refer to the PostgreSQL XML section of the documentation.

Getting the data

We will use the same sample data Simon used, except sadly we had to change it further because the schema wasn't defined in such a way that PostgreSQL liked or rather I was too stupid to construct the XPath statement in such a fashion that would satisfy the PostgreSQL XML thingy. PostgreSQL seems to require that the schema have a name in addition to a location or at least that is what we concluded. The GPX example Simon had a location but no name. The docs have an example of the form. Where the second argument is an array of 2 dimensional arrays with the first item being the schema name and second the URI for the schema. Also note that the xpath function always returns an array even if there is only one element.

SELECT xpath('/my:a/text()', '<my:a xmlns:my="http://example.com">test</my:a>',
			 ARRAY[ARRAY['my', 'http://example.com']]);

 xpath
--------
 {test}
(1 row)

Suffice it to say, our version is like Simon's revised version except we also stripped off the namespace references since the PostgreSQL XML parser seemed unhappy that the name space defined was never referenced in format xsi:.... or something of that sort . We will be using the xpath version that takes no schema references.

Our revision of Simon's revision can be downloaded from here

The change made is very subtle. Simon had this as the first part

<?xml version="1.0"?>
<gpx version="1.1" creator="Toshihiro Hiraoka" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xmlns="http://www.topografix.com/GPX/1/1" 
 xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd">

and we changed it to:

<?xml version="1.0"?>
<gpx version="1.1" creator="Toshihiro Hiraoka">

Getting the data in the database

Simon used Oracle's get LOB fileopen to get the xml file into the db which he calls from an Oracle stored function. When you think about the closest parallel in PostgreSQL, I would say its the lo_* functions that allow import export of files into the db, though that only allows you to import and export files and not read the file. There is also the pg_read_file which does what we want, but can only read files from the PostgreSQL init cluster. Of course their are other ways. You could use perl or python or some other language such as PLPerlU that has system file access.

For now we'll just create a folder called gpxdir in the PostgreSQL cluster. You can determine the location of your cluster by running as super user

SELECT name, setting FROM pg_settings WHERE name='data_directory';

Now we'll create a function to mirror Simon's getClobDocument, except instead of calling it getClobDocument, we'll call it getXMLDocument because it will return an XML object instead of a CLOB object. Please note -- our getXMLDocument function is marked as SECURITY DEFINER because only super users can use the pg_read_file, so to allow regular users access to this, we have this run in the postgres context and then can give rights to this function to those users we want to who may not have super user rights.

--create the function to load xml doc
CREATE OR REPLACE FUNCTION getXMLDocument(p_filename character varying)
  RETURNS xml AS
$$
---we set the end read to some big number
-- because we are too lazy to grab the length
-- and it will cut of at the EOF anyway
SELECT CAST(pg_read_file(E'gpxdir/' || $1 ,0, 100000000) As xml);
$$
  LANGUAGE 'sql' VOLATILE SECURITY DEFINER;
ALTER FUNCTION getxmldocument(character varying) OWNER TO postgres;

Now to use this function we simply do:

Copy the gpxtestrevised.gpx file into the gpxdir and call the below

SELECT getXMLDocument('gpxtestrevised.gpx');

Next we'll create a table similar to what Simon has called gpx and stuff our xml in there

--create table to store  xml docs
CREATE TABLE gpx
(
  object_name character varying(50) NOT NULL PRIMARY KEY,
  object_value xml
);


--insert xml doc
INSERT INTO gpx(object_name, object_value)
VALUES ('gpxtestrevised.gpx', getXMLDocument('gpxtestrevised.gpx'));

Unfortunately PostgreSQL even in 8.4 is not as rich as Oracle's offering for XMLDB and doesn't have all that fancy validation schema stuff, though if you try to pull an obviously malformed XML document with getXMLDocument it will tell you you are missing tags and so forth. So we are skipping that section of Simon's and moving straight to the fun part.


--Get the metadataname
SELECT (xpath('/gpx/metadata/name/text()', g.object_value))[1] As metadataname
FROM GPX As g;

	  metadataname
------------------------
 Manila to Mt. Pinatubo


--Full meta data
SELECT (xpath('/gpx/metadata/name/text()', g.object_value))[1] as Name,
	   (xpath('/gpx/metadata/desc/text()', g.object_value))[1] as Description,
 (xpath('/gpx/metadata/copyright/year/text()', g.object_value))[1] as Copyright_Year,
 (xpath('/gpx/metadata/copyright/license/text()', g.object_value))[1] as Copyright_License,
(xpath('/gpx/metadata/link/@href', g.object_value))[1] as Hyperlink,
(xpath('/gpx/metadata/link/text/text()', g.object_value))[1] as Hyperlink_Text ,
(xpath('/gpx/metadata/link/time/text()', g.object_value))[1] as Document_DateTime ,
(xpath('/gpx/metadata/link/keywords/text()', g.object_value))[1] as keywords ,
(xpath('/gpx/metadata/bounds/@minlon', g.object_value))[1] as MinLong,
(xpath('/gpx/metadata/bounds/@minlat', g.object_value))[1] as MinLat,
(xpath('/gpx/metadata/bounds/@maxlon', g.object_value))[1] as MaxLong,
(xpath('/gpx/metadata/bounds/@maxlat', g.object_value))[1] as MaxLat
 FROM GPX AS g;


 --the same stuff Simon got
		  name          |          description           | copyright_year |
  copyright_license       |           hyperlink           |  hyperlink_text   |
document_datetime | keywords | minlong | minlat | maxlong | maxlat
------------------------+--------------------------------+----------------+-----
--------------------------+-------------------------------+-------------------+-
------------------+----------+---------+--------+---------+--------
 Manila to Mt. Pinatubo | This is test data for gpx2shp. | 2004           | http
://gpx2shp.sourceforge.jp | http://gpx2shp.sourceforge.jp | Toshihiro Hiraoka |
				  |          | -180.0  | -90.0  | 179.9   | 90.0
(1 row)

And now for the finale -- we shall pull the way points just as Simon did

--Lets extract way points (Simon's is a bit shorter)
-- (the offset here is an ugly hack to force Postgres to use our xml value instead of recopying it as
-- suggested by a commenter to our previous post
--note we were using order by before but OFFSET though still ugly seems cleaner --
--With the offset hack -- this finishes in 895ms.  Without offset hack it takes about 3182 ms (~3 seconds)
SELECT CAST((xpath('/wpt/name/text()', wayp.pt))[1] As varchar(20)) As Name,
	CAST(CAST((xpath('/wpt/@lon', wayp.pt))[1] As varchar) As numeric) As longitude,
	CAST(CAST((xpath('/wpt/@lat', wayp.pt))[1] As varchar) As numeric) As latitude,
	CAST(CAST((xpath('/wpt/ele/text()', wayp.pt))[1] As varchar) As numeric) As Elevation
FROM (SELECT (xpath('/gpx/wpt',g.object_value))[it.i] As pt
FROM (SELECT generate_series(1,array_upper(xpath('/gpx/wpt',g.object_value),1)) As i
FROM GPX As g WHERE object_name = 'gpxtestrevised.gpx') As it
 CROSS JOIN (SELECT object_value
	FROM GPX WHERE object_name = 'gpxtestrevised.gpx') As g OFFSET 0) As wayp;

  name  |   longitude    |   latitude   | elevation
--------+----------------+--------------+------------
 001    |  121.043382715 | 14.636015547 |  45.307495
 002    |  121.042653322 | 14.637198653 |  50.594727
 003    |  121.043165457 | 14.640581002 |  46.989868
 004    |  120.155537082 | 14.975596117 |  38.097656
 005    |  120.236538453 | 15.037303017 | 147.687134
 006    |  120.236548427 | 15.037305867 | 145.043579
 007    |  120.237012533 | 15.038105585 | 160.905151
 008    |  120.237643858 | 15.038478328 | 165.231079
 009    |  120.238984879 | 15.038991300 | 173.882935
 010    |  120.239190236 | 15.039099846 | 166.192383
 011    |  120.241263332 | 15.040223943 | 175.324829
 012    |  120.247956365 | 15.042621084 | 186.860474
 013    |  120.253084749 | 15.043179905 | 208.730347
 014    |  120.254095523 | 15.043297336 | 211.374023
 015    |  120.254105665 | 15.043296246 | 213.296631
 016    |  120.247880174 | 15.042568864 | 189.984863
 017    |  120.246971911 | 15.042486135 | 187.100830
 018    |  120.245966502 | 15.042233923 | 185.418579
 019    |  120.244808039 | 15.041693626 | 181.092651
 020    |  120.244476954 | 15.041558258 | 179.410400
 021    |  120.243841019 | 15.041360026 | 178.689453
 022    |  120.241488637 | 15.040351683 | 176.526489
:
:

Scott Bailey has some functions he has written for PostgreSQL that look similar to the extract_value you will find in Oracle that Simon uses in his example. http://scottrbailey.wordpress.com/2009/06/19/xml-parsing-postgres/

Posted by Leo Hsu and Regina Obe in 8.3, 8.4, application development, intermediate, oracle, postgresql versions at 00:28 | Comments (4) | Trackbacks (0)

Trackbacks

Trackback specific URI for this entry

PingBack

Weblog: www.postgresonline.com
Tracked: Nov 11, 02:27

Comments

Display comments as (Linear | Threaded)

After reading this, I felt convicted to share some functions that I wrote to simplify working with XML in Postgres.

http://scottrbailey.wordpress.com/2009/06/19/xml-parsing-postgres/

#1 Scott Bailey on 2009-06-19 17:23 (Reply)

Scott,

Thanks. We've added it to the end of our article.

#1.1 Regina on 2009-06-19 20:43 (Reply)

Thanks for an interesting article. I've found that when the xml file has a unnamed default name space you can define a named one in the third parameter (e.g. 'aaa') of the xpath function and use it in the first parameter (e.g. '/aaa:gpx'). e.g.

select xpath('/aaa:gpx','abc',array[array['aaa','http://example.com']]);
xpath
-----------------------------------------------------------------
{"abc"}
(1 row)

#2 Frank on 2009-09-18 23:10 (Reply)

Thanks for sharing. I've been looking for some way to convert xml to pdf. I saved all of my files by sadly they aren't in the correct format.

#2.1 xml to pdf (Homepage) on 2012-02-17 16:09 (Reply)

Add Comment

Name
Email
Homepage
In reply to
Comment	E-Mail addresses will not be displayed and will only be used for E-Mail notifications. To prevent automated Bots from commentspamming, please enter the string you see in the image below in the appropriate input box. Your comment will only be submitted if the strings match. Please ensure that your browser supports and accepts cookies, or your comment cannot be verified correctly. Enter the string from the spam-prevention image above: Phone* What is one plus one?
	Remember Information? Subscribe to this entry

Loading and Processing GPX XML files using PostgreSQL

Postgres OnLine Journal

PostGIS in Action About the Authors Consulting

Wednesday, April 29. 2009

Loading and Processing GPX XML files using PostgreSQL

Getting the data

Getting the data in the database

Entry's Links

Quicksearch

Calendar

Categories

Archives

Subscribe

Blog Administration

Loading and Processing GPX XML files using PostgreSQL

Postgres OnLine Journal PostGIS in Action About the Authors Consulting

Wednesday, April 29. 2009

Loading and Processing GPX XML files using PostgreSQL

Getting the data

Getting the data in the database

Entry's Links

Quicksearch

Calendar

Categories

Archives

Subscribe

Blog Administration

Postgres OnLine Journal

PostGIS in Action About the Authors Consulting