Build Median Aggregate Function in SQL

One of the things we love most about PostgreSQL is the ease with which one can define new aggregate functions with even a language as succinct as SQL. Normally when we have needed a median function, we've just used the built-in median function in PL/R as we briefly demonstrated in Language Architecture in PostgreSQL.

If all you demand is a simple median aggregate function ever then installing the whole R statistical environment so you can use PL/R is overkill and much less portable.

In this article we will demonstrate how to create a Median function with nothing but the built-in PostgreSQL SQL language, array constructs, and functions.

Primer on PostgreSQL Aggregate Structure

PostgreSQL has a very simple but elegant architecture for defining aggregates. Aggregates can be defined using any functions, built-in languages and PL languages you have installed. You can even mix and match languages if you want to take advantage of the unique speed optimization/library features of each language. Below are the basic steps to building an aggregate function in PostgreSQL.

  1. Define a start function that will take in the values of a result set - this can be in the PL/built-in language of your choosing or you can use one that already exists.
  2. Define an end function that will do something with the final output of the start function - this can be in the PL/built-in language of your choosing or you can use one that already exists.
  3. If the intermediary type returned by your start function, does not exist, then create it
  4. Now define the aggregate with steps that look something like this:
    
    	CREATE AGGREGATE median(numeric) (
    		  SFUNC=array_append,
    		  STYPE=numeric[],
    		  FINALFUNC=array_median
    		);
    
  5. NOTE: As Tom Lane pointed out in comments below, the following is not entirely true. Since all arrays can be cast to anyarray datatype. You can use anyarray to use the same function for all data types assuming you want all medians to behave the same regardless of data type. We shall demonstrate this in our next aggregate example
    This part is a bit annoying. You need to define an aggregate for each data type you need it to work with that doesn't automatically cast to a predefined type. The above example will only work for numbers because all numbers can be automatically cast to a numeric. However if we needed a median for dates, we would also need to do
    
    	CREATE AGGREGATE median(timestamp) (
    		  SFUNC=array_append,
    		  STYPE=timestamp[],
    		  FINALFUNC=array_median
    		);
    
    and also define a array_median function for dates. Keep in mind that PostgreSQL supports function overloading which means we can have all these functions named the same as long as they take different data type inputs. This allows the final user of our median function not to worry about whether they are taking a median for dates or numbers and just call the aggregate median().

Build our Median Aggregate

In the steps that follow we shall flesh out the FINALFUNC function. Please note that array_append is a built-in function in PostgreSQL that takes an array and keeps on appending elements to the array. So conveniently - we don't need to define an SFUNC as we would normally.

Now what makes creating a median aggregate function harder than say an Average is that it cares about order and needs to look at all items to determine what to return. This means that unlike average, sum, max, min etc - we need to look at all values passed to us, resort it based on the data type sorting rules of that data type and return the middle item. Here is where the beauty of array_append saves us.

Now lets get started. We have conveniently everything we need gratis from PostgreSQL. Now all we need are our array_median functions that will take in our array of items collected during the group process, junk the nulls and resort whats left and then return the middle item.

NOTES:

  1. you can instead of using the array_append directly, create an intermediary that rejects nulls. That would probably perform better but require a bit more code.
  2. When there are ties, the customary thing is to average the ties, for our particular use case, we wanted the result to be in the list, so we are simply taking the last in the average set.
  3. You see the multiply by 2.0, that is needed because 1/2 is 0 in SQL because it needs to return the same data type as the inputs. To get around that we force the 2 to be a decimal.
So the code looks like this:
CREATE OR REPLACE FUNCTION array_median(numeric[])
  RETURNS numeric AS
$$
    SELECT CASE WHEN array_upper($1,1) = 0 THEN null ELSE asorted[ceiling(array_upper(asorted,1)/2.0)] END
    FROM (SELECT ARRAY(SELECT ($1)[n] FROM
generate_series(1, array_upper($1, 1)) AS n
    WHERE ($1)[n] IS NOT NULL
            ORDER BY ($1)[n]
) As asorted) As foo ;
$$
  LANGUAGE 'sql' IMMUTABLE;


CREATE OR REPLACE FUNCTION array_median(timestamp[])
  RETURNS timestamp AS
$$
    SELECT CASE WHEN array_upper($1,1) = 0 THEN null ELSE asorted[ceiling(array_upper(asorted,1)/2.0)] END
    FROM (SELECT ARRAY(SELECT ($1)[n] FROM
generate_series(1, array_upper($1, 1)) AS n
    WHERE ($1)[n] IS NOT NULL
            ORDER BY ($1)[n]
) As asorted) As foo ;
$$
  LANGUAGE 'sql' IMMUTABLE;


CREATE AGGREGATE median(numeric) (
  SFUNC=array_append,
  STYPE=numeric[],
  FINALFUNC=array_median
);

CREATE AGGREGATE median(timestamp) (
  SFUNC=array_append,
  STYPE=timestamp[],
  FINALFUNC=array_median
);

Now the tests

----TESTS numeric median - 16ms
SELECT m, median(n) As themedian, avg(n) as theavg
FROM generate_series(1, 58, 3) n, generate_series(1,5) m
WHERE n > m*2
GROUP BY m
ORDER BY m;

--Yields
m | themedian |       theavg
---+-----------+---------------------
 1 |        31 | 31.0000000000000000
 2 |        31 | 32.5000000000000000
 3 |        31 | 32.5000000000000000
 4 |        34 | 34.0000000000000000
 5 |        34 | 35.5000000000000000
 
 SELECT m, n
FROM generate_series(1, 58, 3) n, generate_series(1,5) m
WHERE n > m*2 and m = 1
ORDER BY m, n;
--Yields


 m | n
---+----
 1 |  4
 1 |  7
 1 | 10
 1 | 13
 1 | 16
 1 | 19
 1 | 22
 1 | 25
 1 | 28
 1 | 31
 1 | 34
 1 | 37
 1 | 40
 1 | 43
 1 | 46
 1 | 49
 1 | 52
 1 | 55
 1 | 58

--Test to ensure feeding numbers out of order still works
SELECT avg(x), median(x)
FROM (SELECT 3 As x 
    UNION ALL 
    SELECT - 1 As x 
    UNION ALL 
    SELECT 11 As x
    UNION ALL
    SELECT 10 As x
    UNION ALL
    SELECT 9 As x) As foo;
    
--Yields - 


avg | median
--+------------
6.4 | 9

---TEST date median -NOTE: average is undefined for dates so we left that out. 16ms
SELECT m, median(CAST('2008-01-01' As date) + n) As themedian
    FROM generate_series(1, 58, 3) n, generate_series(1,5) m
    WHERE n > m*2
GROUP BY m
ORDER BY m;


 m |      themedian
---+---------------------
 1 | 2008-02-01 00:00:00
 2 | 2008-02-01 00:00:00
 3 | 2008-02-01 00:00:00
 4 | 2008-02-04 00:00:00
 5 | 2008-02-04 00:00:00
 
SELECT m, (CAST('2008-01-01' As date) + n) As thedate
FROM generate_series(1, 58, 3) n, generate_series(1,5) m
WHERE n > m*2 AND m = 1
ORDER BY m,n;

--Yields


m |  thedate
--+------------
1 | 2008-01-05
1 | 2008-01-08
1 | 2008-01-11
1 | 2008-01-14
1 | 2008-01-17
1 | 2008-01-20
1 | 2008-01-23
1 | 2008-01-26
1 | 2008-01-29
1 | 2008-02-01
1 | 2008-02-04
1 | 2008-02-07
1 | 2008-02-10
1 | 2008-02-13
1 | 2008-02-16
1 | 2008-02-19
1 | 2008-02-22
1 | 2008-02-25
1 | 2008-02-28