Using HStore for Archiving

I'm not a big proponent of schemaless designs, but they have their place. One particular place where I think they are useful is for archiving of data where even though the underlying table structure of the data you need to archive is changing, you want the archived record to have the same fields as it did back then. This is a case where I think Hstore and the way PostgreSQL has it implemented works pretty nicely.

Side note: one of the new features of PostgreSQL 9.4 is improved GIN indexes (faster and smaller) which is very often used with hstore data (and the new jsonb type). We're really looking forward to the GIN improvements more so than the jsonb feature. We're hoping to test out this improved index functionality with OpenStreetMap data soon and compare with our existing PostgreSQL 9.3. OpenStreetMap pbf and osm extract loaders (osm2pgsql, imposm) provide option for loading tagged data into PostgreSQL hstore fields, in addition to PostGIS geometry and other attribute fields. So 9.4 enhancements should be a nice gift for OSM data users. More on that later.

Test table

Lets suppose we had a contacts table that we wanted to record whenever we delete or update:

CREATE TABLE contacts(id serial primary key, first_name varchar(40)
 , last_name varchar(40), email varchar(75));
INSERT INTO contacts(first_name, last_name, email)
  VALUES ('Joey', 'Bubba', 'joey@xyz.com')
   , ('Forrest', 'Gump', 'fgump@bg1234.com');

Archive Bag

We create a catch-all archive table that can archive most anything.

-- archiving table --
CREATE EXTENSION hstore;
CREATE TABLE archive(table_name text, row_id text
 , archive_type char(1), archive_dt timestamptz DEFAULT CURRENT_TIMESTAMP, data hstore);

For archiving especially if you are going to do crazy things like dump PostGIS geometries/geographies in your hstore (a storelocken), you should use GIST instead of GIN since GIN is lossless, will fail for large sizes. For this example we'll at most just have tiny points in this exercise.

CREATE INDEX idx_archive_data ON archive USING gin (data);

Archiving data

For Archiving we could use triggers or just build into our delete/update logic using a writeable CTE like

-- update data --
WITH cte AS (UPDATE contacts SET email = 'fgump2@bg12345.com' WHERE first_name = 'Forrest' RETURNING *) 
 INSERT INTO archive(table_name, row_id, archive_type, data)
 SELECT 'contacts', cte.id::text, 'U', hstore(cte) As data
 FROM cte;

Later on we might think its nice to store the locations of our contacts with a geography point field

-- change structure
CREATE EXTENSION postgis;
ALTER TABLE contacts ADD COLUMN geog geography(Point,4326);
CREATE INDEX idx_contacts_geog ON contacts USING gist(geog);

Then we update a record

-- update again --
WITH cte AS (UPDATE contacts SET geog = ST_Point(-71.060, 42.358)::geography 
  WHERE first_name = 'Forrest' RETURNING *) 
 INSERT INTO archive(table_name, row_id, archive_type, data)
 SELECT 'contacts', cte.id::text, 'U', hstore(cte) As data
 FROM cte;

Then delete a record

WITH cte AS (DELETE FROM contacts WHERE first_name = 'Joey' RETURNING *) 
 INSERT INTO archive(table_name, row_id, archive_type, data)
 SELECT 'contacts', cte.id::text, 'D', hstore(cte) As data
 FROM cte;

Inspecting our Archive

 -- query data -
 SELECT table_name, row_id, data ? 'geog' As has_geog
  , data->'first_name' As first_name, archive_dt
 FROM archive;
 table_name | row_id | has_geog | first_name |          archive_dt
-----------+--------+----------+------------+-------------------------------
contacts   | 2      | f        | Forrest    | 2014-04-24 23:26:45.407163-04
contacts   | 2      | t        | Forrest    | 2014-04-24 23:31:57.034065-04
contacts   | 1      | t        | Joey       | 2014-04-24 23:32:13.314088-04

Note the output tells us that the first time we updated Forrest, we did not have a geography column and second time we updated Forrest and when we deleted Joey, we had a geography column. Of course you can do more interesting queries like each() etc to get values of every single field as needed.