Simon Greener, wrote an article on how to load GPX xml files into Oracle XMLDB. That
got me thinking that I haven't really explored all the XML features that PostgreSQL has to offer
and to some extent I've been reticent about XML processed in any database for that matter.
In this article we shall attempt to perform the same feats that Simon did, but with PostgreSQL instead of
Oracle XMLDB. Note while we are demonstrating this with a GPX file, the same XPath approach can be used to process any XML file.
PostgreSQL since 8.3 has had ANSI SQL 2003 XML functionality built in. Before 8.3, you could use the xml2 contrib module to achieve the same effect in a not so standards compliant sort of way.
In this example we shall demonstrate the built in functionality in 8.3 and above. The key function we will use is the xpath function. XPath is a language used to query XML data and PostgreSQL supports the XPath
1.0 version. The following is a quick primer on XPath that seems useful http://www.zvon.org/xxl/XPathTutorial/General/examples.html.
You can also refer to the PostgreSQL XML section of the documentation.
Getting the data
We will use the same sample data Simon used, except sadly we had to change it further because the schema wasn't defined in such a way that PostgreSQL liked or rather I was too stupid to construct the XPath statement in such a fashion that would satisfy the
PostgreSQL XML thingy. PostgreSQL seems
to require that the schema have a name in addition to a location or at least that is what we concluded. The GPX example Simon had a location but no name. The docs have an example of the form. Where the second argument is an array of 2 dimensional
arrays with the first item being the schema name and second the URI for the schema. Also note that the xpath function always returns an array even if there is only one element.
SELECT xpath('/my:a/text()', '<my:a xmlns:my="http://example.com">test</my:a>',
ARRAY[ARRAY['my', 'http://example.com']]);
xpath
--------
{test}
(1 row)
Suffice it to say, our version is like Simon's revised version except we also stripped off the namespace references since the PostgreSQL XML parser seemed unhappy that the name space defined was never referenced in format xsi:.... or something of that sort
. We will be using the xpath version that takes no schema references.
Our revision of Simon's revision can be downloaded from here
The change made is very subtle. Simon had this as the first part
<?xml version="1.0"?>
<gpx version="1.1" creator="Toshihiro Hiraoka" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www.topografix.com/GPX/1/1"
xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd">
and we changed it to:
<?xml version="1.0"?>
<gpx version="1.1" creator="Toshihiro Hiraoka">
Getting the data in the database
Simon used Oracle's get LOB fileopen to get the xml file into the db which he calls from an Oracle stored function. When you think about the closest
parallel in PostgreSQL, I would say its the lo_* functions that allow import export of files into the db, though that only allows you to import and export files and not read the file.
There is also the pg_read_file which does what we want, but can only read files from the PostgreSQL init cluster. Of course their are other ways. You could use
perl or python or some other language such as PLPerlU that has system file access.
For now we'll just create a folder called gpxdir in the PostgreSQL cluster. You can determine the location of your cluster by running as super user
SELECT name, setting
FROM pg_settings
WHERE name='data_directory';
Now we'll create a function to mirror Simon's getClobDocument, except instead of calling it getClobDocument, we'll call it getXMLDocument because it will return an XML object instead of a CLOB object.
Please note -- our getXMLDocument function is marked as SECURITY DEFINER because
only super users can use the pg_read_file, so to allow regular users access to this, we have this run in the postgres context and then can give rights to this function to those users we want to who may not have super user rights.
--create the function to load xml doc
CREATE OR REPLACE FUNCTION getXMLDocument(p_filename character varying)
RETURNS xml AS
$$
---we set the end read to some big number
-- because we are too lazy to grab the length
-- and it will cut of at the EOF anyway
SELECT CAST(pg_read_file(E'gpxdir/' || $1 ,0, 100000000) As xml);
$$
LANGUAGE 'sql' VOLATILE SECURITY DEFINER;
ALTER FUNCTION getxmldocument(character varying) OWNER TO postgres;
Now to use this function we simply do:
Copy the gpxtestrevised.gpx file into the gpxdir and call the below
SELECT getXMLDocument('gpxtestrevised.gpx');
Next we'll create a table similar to what Simon has called gpx and stuff our xml in there
--create table to store xml docs
CREATE TABLE gpx
(
object_name character varying(50) NOT NULL PRIMARY KEY,
object_value xml
);
--insert xml doc
INSERT INTO gpx(object_name, object_value)
VALUES ('gpxtestrevised.gpx', getXMLDocument('gpxtestrevised.gpx'));
Unfortunately PostgreSQL even in 8.4 is not as rich as Oracle's offering for XMLDB and doesn't have all
that fancy validation schema stuff, though if you try to pull an obviously malformed XML document with getXMLDocument it will
tell you you are missing tags and so forth. So we are skipping that section of Simon's and moving straight to the fun part.
--Get the metadataname
SELECT (xpath('/gpx/metadata/name/text()', g.object_value))[1] As metadataname
FROM GPX As g;
metadataname
------------------------
Manila to Mt. Pinatubo
--Full meta data
SELECT (xpath('/gpx/metadata/name/text()', g.object_value))[1] as Name,
(xpath('/gpx/metadata/desc/text()', g.object_value))[1] as Description,
(xpath('/gpx/metadata/copyright/year/text()', g.object_value))[1] as Copyright_Year,
(xpath('/gpx/metadata/copyright/license/text()', g.object_value))[1] as Copyright_License,
(xpath('/gpx/metadata/link/@href', g.object_value))[1] as Hyperlink,
(xpath('/gpx/metadata/link/text/text()', g.object_value))[1] as Hyperlink_Text ,
(xpath('/gpx/metadata/link/time/text()', g.object_value))[1] as Document_DateTime ,
(xpath('/gpx/metadata/link/keywords/text()', g.object_value))[1] as keywords ,
(xpath('/gpx/metadata/bounds/@minlon', g.object_value))[1] as MinLong,
(xpath('/gpx/metadata/bounds/@minlat', g.object_value))[1] as MinLat,
(xpath('/gpx/metadata/bounds/@maxlon', g.object_value))[1] as MaxLong,
(xpath('/gpx/metadata/bounds/@maxlat', g.object_value))[1] as MaxLat
FROM GPX AS g;
--the same stuff Simon got
name | description | copyright_year |
copyright_license | hyperlink | hyperlink_text |
document_datetime | keywords | minlong | minlat | maxlong | maxlat
------------------------+--------------------------------+----------------+-----
--------------------------+-------------------------------+-------------------+-
------------------+----------+---------+--------+---------+--------
Manila to Mt. Pinatubo | This is test data for gpx2shp. | 2004 | http
://gpx2shp.sourceforge.jp | http://gpx2shp.sourceforge.jp | Toshihiro Hiraoka |
| | -180.0 | -90.0 | 179.9 | 90.0
(1 row)
And now for the finale -- we shall pull the way points just as Simon did
--Lets extract way points (Simon's is a bit shorter)
-- (the offset here is an ugly hack to force Postgres to use our xml value instead of recopying it as
-- suggested by a commenter to our previous post
--note we were using order by before but OFFSET though still ugly seems cleaner --
--With the offset hack -- this finishes in 895ms. Without offset hack it takes about 3182 ms (~3 seconds)
SELECT CAST((xpath('/wpt/name/text()', wayp.pt))[1] As varchar(20)) As Name,
CAST(CAST((xpath('/wpt/@lon', wayp.pt))[1] As varchar) As numeric) As longitude,
CAST(CAST((xpath('/wpt/@lat', wayp.pt))[1] As varchar) As numeric) As latitude,
CAST(CAST((xpath('/wpt/ele/text()', wayp.pt))[1] As varchar) As numeric) As Elevation
FROM (SELECT (xpath('/gpx/wpt',g.object_value))[it.i] As pt
FROM (SELECT generate_series(1,array_upper(xpath('/gpx/wpt',g.object_value),1)) As i
FROM GPX As g WHERE object_name = 'gpxtestrevised.gpx') As it
CROSS JOIN (SELECT object_value
FROM GPX WHERE object_name = 'gpxtestrevised.gpx') As g OFFSET 0) As wayp;
name | longitude | latitude | elevation
--------+----------------+--------------+------------
001 | 121.043382715 | 14.636015547 | 45.307495
002 | 121.042653322 | 14.637198653 | 50.594727
003 | 121.043165457 | 14.640581002 | 46.989868
004 | 120.155537082 | 14.975596117 | 38.097656
005 | 120.236538453 | 15.037303017 | 147.687134
006 | 120.236548427 | 15.037305867 | 145.043579
007 | 120.237012533 | 15.038105585 | 160.905151
008 | 120.237643858 | 15.038478328 | 165.231079
009 | 120.238984879 | 15.038991300 | 173.882935
010 | 120.239190236 | 15.039099846 | 166.192383
011 | 120.241263332 | 15.040223943 | 175.324829
012 | 120.247956365 | 15.042621084 | 186.860474
013 | 120.253084749 | 15.043179905 | 208.730347
014 | 120.254095523 | 15.043297336 | 211.374023
015 | 120.254105665 | 15.043296246 | 213.296631
016 | 120.247880174 | 15.042568864 | 189.984863
017 | 120.246971911 | 15.042486135 | 187.100830
018 | 120.245966502 | 15.042233923 | 185.418579
019 | 120.244808039 | 15.041693626 | 181.092651
020 | 120.244476954 | 15.041558258 | 179.410400
021 | 120.243841019 | 15.041360026 | 178.689453
022 | 120.241488637 | 15.040351683 | 176.526489
:
:
Scott Bailey has some functions he has written for PostgreSQL that look similar to the extract_value you will find in Oracle that Simon uses in his example. http://scottrbailey.wordpress.com/2009/06/19/xml-parsing-postgres/
Tracked: Nov 11, 02:27