Question: You have a table of people and a table that specifies the activities each person is involved
in. You want to return a result that has one record per person and a column that has a listing of activities for each person
separated by semicolons and alphabetically sorted by activity. You also want the whole set alphabetically sorted by person's name.
This is a question we are always asked and since we mentor on various flavors of databases,
we need to be able to switch gears and provide an answer that works on the client's database. Most
often the additional requirement is that you can't install new functions in the database. This means that
for PostgreSQL/SQL Server that both support defining custom aggregates, that is out as an option.
Normally we try to come up with an answer that works in most databases, but sadly the only solution that works in
most is to push the problem off to the client front end and throw up your hands and proclaim -- "This ain't something that should be
done in the database and is a reporting problem." That is in fact what many database purists do, and all I can say to them is wake up and smell the coffee before you are out of a job.
We feel that data
transformation is an important function of a database, and if your database is incapable of massaging the data into a format
your various client apps can easily digest, WELL THAT's A PROBLEM.
We shall now document this answer rather than trying to answer for the nteenth time. For starter's
PostgreSQL has a lot of answers to this question, probably more so than any other, though some are easier to execute than others
and many depend on the version of PostgreSQL you are using. SQL Server has 2 classes of answers neither of which is terribly appealing,
but we'll go over the ones that don't require you to be able to install .NET stored functions in your database since we said that is often a requirement.
MySQL has a fairly
simple, elegant and very portable way that it has had for a really long time.
Is the new way always better?
For PostgreSQL and SQL Server, we'll demonstrate more than one approach. Just because we demonstrate newer approaches doesn't make the older approaches obsolete or necessarily worse.
Don't feel the need to use something new because it's new.
Think of the newer approaches as just extra options that may or may not be faster for your use-case.
Test Table
Below is the script to setup our test tables. This script will work fine for all databases we will discuss here.
CREATE TABLE people(p_name varchar(75) PRIMARY KEY);
INSERT INTO people(p_name) VALUES('Jack');
INSERT INTO people(p_name) VALUES('Jill');
INSERT INTO people(p_name) VALUES('Humpty');
INSERT INTO people(p_name) VALUES('Dopey');
CREATE TABLE people_activities(p_name varchar(75) NOT NULL,
activity varchar(100) NOT NULL);
ALTER TABLE people_activities
ADD CONSTRAINT pk_people_activities
PRIMARY KEY (p_name,activity);
INSERT INTO people_activities(p_name, activity)
SELECT 'Jack', 'Hiking'
UNION ALL
SELECT 'Jack', 'Skiing'
UNION ALL
SELECT 'Jack', 'Fishing'
UNION ALL
SELECT 'Jill', 'Hiking'
UNION ALL
SELECT 'Jill', 'Bear Hunting'
UNION ALL
SELECT 'Humpty', 'Tumbling'
UNION ALL
SELECT 'Humpty', 'Seating'
UNION ALL
SELECT 'Humpty', 'Cracking';
p_name | activities
Dopey |
Humpty | Cracking;Seating;Tumbling
Jack | Fishing;Hiking;Skiing
Jill | Bear Hunting;Hiking
The PostgreSQL ways
In PostgreSQL 9.0, PostgreSQL introduced the STRING_AGG function and also the ORDER BY predicate in any aggregate function, which made this task
much easier and probably much more efficient as well. Note that in Oracle 11G Release 2, Oracle introduced an aggregate function call LISTAGG which is similar. Details of using LISTAGG and
Oracle LISTAGG, the older WM_CONCAT and other string aggregation approaches for Oracle are described in Oracle String Aggregation Techniques.
SELECT p.p_name,
STRING_AGG(a.activity, ';' ORDER BY a.activity) As activities
FROM people AS p
LEFT JOIN people_activities As a ON (p.p_name = a.p_name)
GROUP BY p.p_name
ORDER BY p.p_name;
PostgreSQL 8.4 introduced the ANSI-SQL array_agg function which IBM DB2 also has. Not sure about other databases. This provided another option for doing a string aggregation.
Without the ORDER BY feature for aggregates introduced in PostgreSQL 9.0, this option is less powerful.
SELECT p.p_name,
array_to_string(ARRAY_AGG(a.activity),';') As activities
FROM people AS p
LEFT JOIN
(SELECT *
FROM people_activities
ORDER BY p_name, activity) As a ON (p.p_name = a.p_name)
GROUP BY p.p_name
ORDER BY p.p_name;
Here is the PostgreSQL 8.4+ equivalent of a third option that people mentioned is available for SQL Server - the Recursive Window CTE.
WITH RECURSIVE
t1 AS (
SELECT
p.p_name,
a.activity,
ROW_NUMBER() OVER(PARTITION BY a.p_name ORDER BY a.activity) AS rn
FROM people AS p
LEFT JOIN people_activities AS a ON a.p_name = p.p_name
),
t2 AS (
SELECT
t1.p_name,
CAST(t1.activity AS varchar) AS activities,
t1.rn
FROM t1
WHERE t1.rn = 1
UNION ALL
SELECT
t2.p_name,
t2.activities || ';' || t1.activity,
t1.rn
FROM t2
INNER JOIN t1 ON
t1.p_name = t2.p_name AND
t1.rn = t2.rn + 1
)
SELECT
t2.p_name,
COALESCE(t2.activities, '')
FROM t2
INNER JOIN (
SELECT
t1.p_name,
MAX(t1.rn) AS max_rn
FROM t1
GROUP BY t1.p_name
) AS t ON
t.p_name = t2.p_name AND
t.max_rn = t2.rn
ORDER BY t2.p_name;
One thing that PostgreSQL has always had going for it that both MySQL and SQL Server still lack is phenomenal array support. Arrays are first class citizens in PostgreSQL witha rich set of functions to
complement them. They are extremely versatile and make creating things like aggregate functions pretty trivial. For example in MySQL you'd have to become a C programmer
to build an aggregate function which is generally out of the reach of most users. In SQL Server you'd have to learn .NET (VB or C#) and write a non-trivial amount of code which is much more accessible than C but not wihtout its issues. In PostgreSQL all you need to know is SQL and write a couple of lines of code to build your own custom aggregate functions. That
is just SUPER COOL.
SELECT p.p_name,
array_to_string(ARRAY(
SELECT activity
FROM people_activities AS a
WHERE a.p_name = p.p_name ORDER BY a.activity) ,';') As activities
FROM people AS p
ORDER BY p.p_name;
SQL Server Ways
SQL Server has a fairly cryptic way of doing this which one would consider a HACK and repurpose of functionality that was designed for something else. On the plus side,
we do find its CROSS APPLY and OUTER APPLY functionality which I crave a lot when working with MySQL and a little with PostgreSQL (PostgreSQL has similar but not quite as elegant equivalents). Not so much for this particular use case.
SELECT p.p_name,
activities
FROM people AS p OUTER APPLY
(SELECT STUFF(( SELECT ';' + a.activity
FROM people_activities AS a
WHERE a.p_name = p.p_name
ORDER BY a.activity
FOR XML PATH('') ), 1,1,'') AS activities ) AS f
ORDER BY p.p_name;
SQL Server Recursive CTE + Window
Evidentally there is a third way of doing this in SQL Server 2005 that many purists were up in arms about that we neglected to mention.
This is the Window + CTE approach that someone posted on reddit that we revised slightly to be a bit more ANSI-SQL compliant. Note: we listed the PostgrreSQL 8.4+ equivalent of this approach in the PostgreSQL section.
The reason many people prefer this approach is that it doesn't leave any XML entities you have to strip with REPLACE. e.g. if you had & in your original text you would have to use REPLACE(final,'&', '&')
to strip off the xmlish stuff.
The other reason is people felt this was less of a hack. Note: that the PostgreSQL 8.4 variant looks pretty close to SQL Server since both support ANSI SQL Window and CTE. The main differences:
- PostgreSQL forces you to put in a RECURSIVE word if any of the subexpressions is RECURSIVE and SQL SERVER would reject that word and consider it invalid.
- SQL SERVER uses the non-ANSI SQL + operator for string concatenation while PostgreSQL uses the ANSI SQL ||
- SQL Server won't allow text + varchar concatenation and if your varchar is unlimited, you must call it varchar(MAX). PostgreSQL assumes if no length is specified,then its limitless so varchar, text mean much the same in PostgreSQL
and varchar(MAX) is equivalent to PostgreSQL varchar.
As far as speed between the two variants, people did some benchmarks on reddit, but there wasn't a consensus on a clear winner.
WITH
t1 AS (
SELECT
p.p_name,
a.activity,
ROW_NUMBER() OVER(PARTITION BY a.p_name ORDER BY a.activity) AS rn
FROM people AS p
LEFT JOIN people_activities AS a ON a.p_name = p.p_name
),
t2 AS (
SELECT
t1.p_name,
CAST(t1.activity AS varchar(MAX)) AS activities,
t1.rn
FROM t1
WHERE t1.rn = 1
UNION ALL
SELECT
t2.p_name,
t2.activities + ';' + t1.activity,
t1.rn
FROM t2
INNER JOIN t1 ON
t1.p_name = t2.p_name AND
t1.rn = t2.rn + 1
)
SELECT
t2.p_name,
COALESCE(t2.activities, '')
FROM t2
INNER JOIN (
SELECT
t1.p_name,
MAX(t1.rn) AS max_rn
FROM t1
GROUP BY t1.p_name
) AS t ON
t.p_name = t2.p_name AND
t.max_rn = t2.rn
ORDER BY t2.p_name;
This is the old tried and true way which should work even if you are in SQL Server 2000 compatibility mode or are stuck using SQL Server 2000
SELECT p.p_name,
STUFF(( SELECT ';' + a.activity
FROM people_activities AS a
WHERE a.p_name = p.p_name
ORDER BY a.activity
FOR XML PATH('') ), 1,1,'') AS activities
FROM people AS p
ORDER BY p.p_name;
The MySQL way
Database purists like to make fun of MySQL, but in some ways it's way ahead of others.
Its focus has always been to make the common easy and as a result was more advanced than some other databases. Some might say it catered too much to the whims
of its users at the expense of doing things the right way. That's another story. This catering is no doubt a large reason for MySQL's popularity. There is something to be said about listening to the needs of users, which
I think most other databases dismiss. For example
MySQL was the first of these 3 databases to have a LIMIT OFFSET feature, which PostgreSQL later copied, and SQL Server 2010 (Denali) is finally introducing in the form of OFFSET FETCH NEXT.
It was also the first to have the MULTI ROW VALUE insert option which was later adopted by PostgreSQL 8.2 and SQL Server 2008 and I think Oracle is yet to adopt. Heck Oracle still has that arcane FROM DUAL
stupid requirement to my knowledge. How Oracle users put up with that nonsense I don't know.
UPDATE: As noted by some below and in reddit comments, a GOTCHA of the MySQL GROUP_CONCAT is that it silently truncates your result at about 1024 characters. As noted by another
in the
reddit thread "
You can change the length at server level with group_concat_max_len up to your max_allowed_packet." or as
Charles noted at the Session level with
SET SESSION group_concat_max_len = 32768;
In addition it has always had this aggregate function for as far back as MySQL 3.23 days GROUP_CONCAT. It's really convenient to be able to use this on any version of MySQL. Below is how you would use it.
SELECT p.p_name,
GROUP_CONCAT(a.activity ORDER BY a.activity SEPARATOR ';') As activities
FROM people AS p
LEFT JOIN people_activities As a ON (p.p_name = a.p_name)
GROUP BY p.p_name
ORDER BY p.p_name;
Tracked: Jan 18, 18:51