We've been working on converting some of our SQL Server apps to PostgreSQL. In this article we'll describe some things to watch out for and provide a function we wrote to automate some of the conversion.
Although both databases are fairly ANSI-SQL compliant, there are still differences in their CREATE TABLE statements, data types, and other behaviors that make porting applications non-trivial.
Here are a couple of key differences and similarities in data types between the two. Some are easy to resolve and others are not.
SQL Server, like most other relational databases I can think of, doesn't really have a true boolean type, even in SQL Server 2008 and the upcoming version. Leo loves the PostgreSQL boolean type, so much so that he's willing to trade portability for having a real true/false data type. PostgreSQL also has a bit type, which can be more than 1 bit in length, but it doesn't cast naturally to a boolean in applications the way SQL Server's bit or PostgreSQL's boolean do, so we suggest you stay away from it for boolean use. Most tools we work with (.NET etc.) cast both SQL Server bit and PostgreSQL boolean to the application language's boolean type. When writing raw SQL, however, they are different beasts.
In SQL Server you use a bit in SQL more or less as if it were an integer. There are nuances, such as not being able to add bits unless you cast them to integers, but for the sake of argument we can wave our hands and call it a small integer that would be used like:
WHERE is_active = 1
WHERE is_active = 0
In PostgreSQL, you use a boolean as a boolean, and treating it like a 0/1 number is not allowed unless you define casts for that behavior. So a PostgreSQL boolean in a WHERE clause takes one of these forms:
WHERE is_active = true
WHERE is_active
WHERE is_active = false
WHERE NOT is_active
PostgreSQL will also happily interpret quoted text literals as booleans, which allows you to write Oracle-compatible syntax; any of these will cast: 'TRUE', 'FALSE', 'true', 't', 'f', 'false', so
WHERE is_active = 'FALSE'
WHERE is_active = 'TRUE'
are all valid constructs in PostgreSQL but invalid in SQL Server. You could use 0 and 1 too if you install some autocasts in your database, as we described in Using MS Access with PostgreSQL; while designed for MS Access, they work equally well when writing raw SQL.
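As an alternative sketch (not the autocast approach from that article), the helper below overloads = for boolean and integer so that legacy 0/1 comparisons keep working; the function and operator are our own invention, so treat it as illustrative only:

-- hypothetical helper: make boolean = integer comparisons legal
CREATE OR REPLACE FUNCTION bool_eq_int(boolean, integer) RETURNS boolean AS
$$ SELECT $1 = ($2 <> 0) $$ LANGUAGE sql IMMUTABLE;

CREATE OPERATOR = (LEFTARG = boolean, RIGHTARG = integer, PROCEDURE = bool_eq_int);

-- now the SQL Server style works: WHERE is_active = 1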
The date and time types are another small annoyance, though less of a concern if you are coming from SQL Server 2008+, since 2008 introduced DATE, TIME, and DATETIMEOFFSET, which have close PostgreSQL equivalents, so the mapping is much cleaner.
If you are converting from SQL Server 2000/2005, you have to consciously think about whether the columns you had as datetime should really be timestamps, dates, or even times, and whether you really want the time zone following your data around.
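For example, the kind of choices we mean might play out like this when porting a table (a sketch with made-up table and column names, not a definitive mapping):

CREATE TABLE orders (
    order_date date,                         -- was a DATE in 2008, or a datetime you only ever used for dates
    order_time time,                         -- was a TIME in 2008
    placed_at timestamp with time zone,      -- was a DATETIMEOFFSET, or a datetime where the zone matters
    updated_at timestamp without time zone   -- was a datetime where you decide the zone should not follow the data
);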
For most of the apps we are porting, we don't ever intend to go back, nor have them work with anything but PostgreSQL, so we've been taking liberties with syntax that will probably never be portable, shamelessly using PostgreSQL array support, complex types, and built-in regular expression support just because they make so many tasks so much more succinct and faster.
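To give a flavor of what we mean, here is a made-up query against the contacts table from the examples further down, folding a regular expression filter and an array aggregate into a single statement:

SELECT company_id, array_agg(last_name) As last_names
FROM contacts
WHERE last_name ~* '^(ob|hs)'
GROUP BY company_id;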
It's a hard decision to make consciously because of the WHAT IFS (not just about betting on the success of PostgreSQL, but the fact that you can't market your work to people who need it to run on a different database platform, or even on a lower version of the same platform). Product lock-in is a reality even when you have no vendors and you are using open source. Standards are there to minimize product/vendor lock-in, but they often fall short of covering your needs. Product lock-in is a step that should not be taken lightly and should be considered on an application by application basis. It is best to make the decision consciously and play out the pros and cons of what you are gaining and losing. Most of our apps involve lots of stats and complex analytics, and we've decided it's much more efficient to make the database the engine for that than to write application code that works across multiple database platforms. Database lock-in is the lesser of the evils here.
In SQL Server the convention for naming columns is Pascal case / camel case, as described in our article Of Camels and People: Converting back and forth from Camel Case, Pascal Case to underscore lower case. In PostgreSQL people usually prefer underscores and lower case, because upper or mixed case would require constant ugly quoting of columns. SQL Server can get away with mixed case without penalty because, while it respects the casing of the columns in the table design, it is case insensitive when the columns are used in an SQL statement.
So for example if you had a table structure in SQL Server with columns named:
PersonID, FirstName, LastName
you can reference them as personid, firstname, FIRSTNAME etc.
If you want to keep your code more or less working without changing anything, you are best off just getting rid of the casing of the SQL Server columns so you don't need to quote them in PostgreSQL. The same issue holds for Oracle, which is also sensitive to identifier casing, except that in Oracle columns need to be upper case to avoid quoting, while in PostgreSQL they need to be lower case.
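To make the quoting issue concrete, here is a small hypothetical demonstration in PostgreSQL:

CREATE TABLE casing_demo ("FirstName" varchar(50));
SELECT "FirstName" FROM casing_demo;  -- works
SELECT FirstName FROM casing_demo;    -- ERROR: column "firstname" does not exist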
However, if you get really hot and bothered about not having the word-like breaks in your column names that casing or underscores provide, the thought of always having to quote columns sounds like hell, and this is an application you expect to live with for a while, you may just want to bite the bullet as we often do and change the casing when you migrate. YES, this means you probably have to rewrite a good chunk of your code. We like to think of it as an opportunity to refactor our code. I LOVE refactoring.
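If you go that route on a table that was first brought over with its original casing, the renames themselves are the easy part (hypothetical names, matching the examples below):

ALTER TABLE contacts RENAME COLUMN "FirstName" TO first_name;
ALTER TABLE contacts RENAME COLUMN "LastName" TO last_name;
-- the painful part is every piece of application code that references the old names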
To automate some of the process of converting our table structures from SQL Server to PostgreSQL we opted to write a PostgreSQL function to do it so that we could completely control how it is done.
One of the reasons we are porting some of our apps is that PostgreSQL's string manipulation functions and array support make so many things easier, and this conversion function is one of them. SQL Server's built-in string manipulation features are pretty pathetic by comparison.
CREATE OR REPLACE FUNCTION convert_mssqlddl2pgsql(sql text,
    change_camel_under boolean, default_not_nulls boolean DEFAULT true,
    replace_dbo_with_schema text DEFAULT NULL)
RETURNS text AS
$$
DECLARE
    var_sql text := sql;
    r record;
BEGIN
    IF change_camel_under THEN
        -- match a capital [A-Z] preceded by a lower case [a-z] (and optionally followed by one)
        -- and insert an underscore between the lower case letter and the capital
        var_sql := regexp_replace(var_sql, E'([a-z])([A-Z])([a-z]?)', E'\\1_\\2\\3', 'g');
    END IF;
    var_sql := lower(var_sql);
    -- swap the dbo schema for the requested schema, or strip it if none was given
    var_sql := replace(var_sql, '[dbo].', COALESCE('[' || replace_dbo_with_schema || '].', ''));
    var_sql := replace(var_sql, 'on [primary]', '');
    -- map SQL Server types to PostgreSQL types, optionally tacking a default onto NOT NULL columns;
    -- note: these are plain substring replaces, so column names containing a type name (e.g. bit) will get mangled
    FOR r IN (SELECT * FROM ( VALUES ('datetime', 'timestamp with time zone', 'CURRENT_TIMESTAMP'),
                ('bit', 'boolean', 'true'),
                ('varchar(max)', 'text', ''),
                ('nvarchar', 'varchar', ''),
                ('tinyint', 'smallint', '0'),
                ('[int] identity(1,1)', 'serial', NULL)
            ) As f(ms, pg, def)) LOOP
        IF default_not_nulls AND COALESCE(r.def, '') <> '' THEN
            var_sql := replace(var_sql, '[' || r.ms || '] not null', '[' || r.ms || '] not null DEFAULT ' || r.def);
        END IF;
        var_sql := replace(var_sql, '[' || r.ms || ']', r.pg);
        var_sql := replace(var_sql, r.ms, r.pg);
    END LOOP;
    -- strip the remaining square brackets
    var_sql := regexp_replace(var_sql, '[\[\]]', '', 'g');
    var_sql := regexp_replace(var_sql, '(primary key|unique) (clustered|nonclustered)', E'\\1', 'g');
    -- get rid of all that WITH (PAD_INDEX ...) noise that SQL Server generates for tables
    -- so basically match any phrase WITH ("everything not containing )" )
    var_sql := regexp_replace(var_sql, 'with \([^\)]+\)', '', 'g');
    -- get rid of asc in column constraints
    var_sql := regexp_replace(var_sql, '([a-z]+) asc', E'\\1', 'g');
    -- get rid of collation; for PostgreSQL 9.1+ you might want
    -- to convert it to the 9.1 COLLATE syntax instead
    var_sql := regexp_replace(var_sql, 'collate [a-z0-9\_]+', '', 'g');
    RETURN var_sql;
END;
$$
LANGUAGE plpgsql IMMUTABLE;
-- EXAMPLE USAGE - convert camel case to underscore, dbo to public, and give NOT NULL columns the defaults defined in the function
SELECT convert_mssqlddl2pgsql('
CREATE TABLE [dbo].[Contacts](
[ContactID] [int] IDENTITY(1,1) NOT NULL,
[FirstName] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[LastName] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[CompanyID] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[AddDate] [datetime] NOT NULL,
CONSTRAINT [PK_Contacts] PRIMARY KEY CLUSTERED
(
[ContactID] ASC
)WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
', true, true, 'public');
-- output --
create table public.contacts(
contact_id serial not null,
first_name varchar(50) null,
last_name varchar(50) null,
company_id varchar(50) null,
add_date timestamp with time zone not null DEFAULT CURRENT_TIMESTAMP,
constraint pk_contacts primary key
(
contact_id
)
)
-- EXAMPLE USAGE - Don't convert to underscore
SELECT convert_mssqlddl2pgsql('
CREATE TABLE [dbo].[Contacts](
[ContactID] [int] IDENTITY(1,1) NOT NULL,
[FirstName] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[LastName] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[CompanyID] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[AddDate] [datetime] NOT NULL,
CONSTRAINT [PK_Contacts] PRIMARY KEY CLUSTERED
(
[ContactID] ASC
)WITH (IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
', false, true, 'public');
-- output --
create table public.contacts(
contactid serial not null,
firstname varchar(50) null,
lastname varchar(50) null,
companyid varchar(50) null,
adddate timestamp with time zone not null DEFAULT CURRENT_TIMESTAMP,
constraint pk_contacts primary key
(
contactid
)
)
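The function just returns the rewritten DDL as text, so you still have to run it yourself. One way we might do that (a sketch that assumes you trust the conversion and are on PostgreSQL 9.0+ for DO) is to feed the result straight to EXECUTE:

DO
$$
BEGIN
  EXECUTE convert_mssqlddl2pgsql('
CREATE TABLE [dbo].[Contacts](
 [ContactID] [int] IDENTITY(1,1) NOT NULL,
 [AddDate] [datetime] NOT NULL,
 CONSTRAINT [PK_Contacts] PRIMARY KEY CLUSTERED ([ContactID] ASC)
) ON [PRIMARY]
', true, true, 'public');
END;
$$;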