One of the things we love most about PostgreSQL is the ease with which one can define new aggregate functions with even a language as succinct as SQL. Normally when we have needed a median function, we've just used the built-in median function in PL/R as we briefly demonstrated in Language Architecture in PostgreSQL.
If all you demand is a simple median aggregate function ever then installing the whole R statistical environment so you can use PL/R is overkill and much less portable.
In this article we will demonstrate how to create a Median function with nothing but the built-in PostgreSQL SQL language, array constructs, and functions.
PostgreSQL has a very simple but elegant architecture for defining aggregates. Aggregates can be defined using any functions, built-in languages and PL languages you have installed. You can even mix and match languages if you want to take advantage of the unique speed optimization/library features of each language. Below are the basic steps to building an aggregate function in PostgreSQL.
CREATE AGGREGATE median(numeric) (
SFUNC=array_append,
STYPE=numeric[],
FINALFUNC=array_median
);
CREATE AGGREGATE median(timestamp) (
SFUNC=array_append,
STYPE=timestamp[],
FINALFUNC=array_median
);
and also define a array_median function for dates. Keep in mind that PostgreSQL supports function overloading which means
we can have all these functions named the same as long as they take different data type inputs. This allows the final user of our median function
not to worry about whether they are taking a median for dates or numbers and just call the aggregate median().
In the steps that follow we shall flesh out the FINALFUNC function. Please note that array_append is a built-in function in PostgreSQL that takes an array and keeps on appending elements to the array. So conveniently - we don't need to define an SFUNC as we would normally.
Now what makes creating a median aggregate function harder than say an Average is that it cares about order and needs to look at all items to determine what to return. This means that unlike average, sum, max, min etc - we need to look at all values passed to us, resort it based on the data type sorting rules of that data type and return the middle item. Here is where the beauty of array_append saves us.
Now lets get started. We have conveniently everything we need gratis from PostgreSQL. Now all we need are our array_median functions that will take in our array of items collected during the group process, junk the nulls and resort whats left and then return the middle item.
NOTES:
CREATE OR REPLACE FUNCTION array_median(numeric[])
RETURNS numeric AS
$$
SELECT CASE WHEN array_upper($1,1) = 0 THEN null ELSE asorted[ceiling(array_upper(asorted,1)/2.0)] END
FROM (SELECT ARRAY(SELECT ($1)[n] FROM
generate_series(1, array_upper($1, 1)) AS n
WHERE ($1)[n] IS NOT NULL
ORDER BY ($1)[n]
) As asorted) As foo ;
$$
LANGUAGE 'sql' IMMUTABLE;
CREATE OR REPLACE FUNCTION array_median(timestamp[])
RETURNS timestamp AS
$$
SELECT CASE WHEN array_upper($1,1) = 0 THEN null ELSE asorted[ceiling(array_upper(asorted,1)/2.0)] END
FROM (SELECT ARRAY(SELECT ($1)[n] FROM
generate_series(1, array_upper($1, 1)) AS n
WHERE ($1)[n] IS NOT NULL
ORDER BY ($1)[n]
) As asorted) As foo ;
$$
LANGUAGE 'sql' IMMUTABLE;
CREATE AGGREGATE median(numeric) (
SFUNC=array_append,
STYPE=numeric[],
FINALFUNC=array_median
);
CREATE AGGREGATE median(timestamp) (
SFUNC=array_append,
STYPE=timestamp[],
FINALFUNC=array_median
);
Now the tests
----TESTS numeric median - 16ms
SELECT m, median(n) As themedian, avg(n) as theavg
FROM generate_series(1, 58, 3) n, generate_series(1,5) m
WHERE n > m*2
GROUP BY m
ORDER BY m;
--Yields
m | themedian | theavg
---+-----------+---------------------
1 | 31 | 31.0000000000000000
2 | 31 | 32.5000000000000000
3 | 31 | 32.5000000000000000
4 | 34 | 34.0000000000000000
5 | 34 | 35.5000000000000000
SELECT m, n
FROM generate_series(1, 58, 3) n, generate_series(1,5) m
WHERE n > m*2 and m = 1
ORDER BY m, n;
--Yields
m | n
---+----
1 | 4
1 | 7
1 | 10
1 | 13
1 | 16
1 | 19
1 | 22
1 | 25
1 | 28
1 | 31
1 | 34
1 | 37
1 | 40
1 | 43
1 | 46
1 | 49
1 | 52
1 | 55
1 | 58
--Test to ensure feeding numbers out of order still works
SELECT avg(x), median(x)
FROM (SELECT 3 As x
UNION ALL
SELECT - 1 As x
UNION ALL
SELECT 11 As x
UNION ALL
SELECT 10 As x
UNION ALL
SELECT 9 As x) As foo;
--Yields -
avg | median
--+------------
6.4 | 9
---TEST date median -NOTE: average is undefined for dates so we left that out. 16ms
SELECT m, median(CAST('2008-01-01' As date) + n) As themedian
FROM generate_series(1, 58, 3) n, generate_series(1,5) m
WHERE n > m*2
GROUP BY m
ORDER BY m;
m | themedian
---+---------------------
1 | 2008-02-01 00:00:00
2 | 2008-02-01 00:00:00
3 | 2008-02-01 00:00:00
4 | 2008-02-04 00:00:00
5 | 2008-02-04 00:00:00
SELECT m, (CAST('2008-01-01' As date) + n) As thedate
FROM generate_series(1, 58, 3) n, generate_series(1,5) m
WHERE n > m*2 AND m = 1
ORDER BY m,n;
--Yields
m | thedate
--+------------
1 | 2008-01-05
1 | 2008-01-08
1 | 2008-01-11
1 | 2008-01-14
1 | 2008-01-17
1 | 2008-01-20
1 | 2008-01-23
1 | 2008-01-26
1 | 2008-01-29
1 | 2008-02-01
1 | 2008-02-04
1 | 2008-02-07
1 | 2008-02-10
1 | 2008-02-13
1 | 2008-02-16
1 | 2008-02-19
1 | 2008-02-22
1 | 2008-02-25
1 | 2008-02-28