More Aggregate Fun: Who's on First and Who's on Last

Tuesday, August 12. 2008

Microsoft Access has a peculiar set of aggregates called First and Last. We try to avoid them because, while the concept is useful, we find Microsoft Access's implementation of them a bit broken. MS Access power users we know who move over to something like MySQL, SQL Server, or PostgreSQL often ask: where's First and where's Last? First we shall go over what exactly these aggregates do in MS Access, how they differ from MIN and MAX, and what they should do in an ideal world. Then we shall create our ideal world in PostgreSQL.

Why care who's on First and who's on Last?

This may come as a shock to quite a few DBAs, but there are certain scenarios in life where you want to ask for, say, an Average, Max, Min, or Count, and you also want the system to give you the first or last record of the group (this could be based on physical order or some designated order you ascribe). Even more shocking to DB programmer types who live very orderly lives and dream of predictability where there is none, some people don't care which record of the group is returned, just as long as all the fields returned come from one specific record. Not care, you ask?

Here is a somewhat realistic scenario. Let's say you want to generate a mailing, but you have a ton of people on your list and you only want to send to one person in each company where the number of employees in the company is greater than 100. The boss doesn't care whether that person is Doug Smith or John MacDonald, but if you start making people up, such as a person called Doug MacDonald, that is a reason for some concern. So your mandate is clear: save money on stamps, inventing people is not cool, DO NOT INVENT ANYONE IN THE PROCESS. So you see why MIN and MAX just do not work in this scenario: each is computed per column, so MIN(LastName) and MIN(FirstName) can come from two different people. Yah yah, you say, I'm a top-notch database programmer, I can do this in a hard-to-read but efficient SQL statement that is portable across all databases. Good for you.

With a First or Last function, your query would look like this:

SELECT First(LastName) As LName, First(FirstName) As FName, COUNT(EmployeeID) As numEmployees
FROM CompanyRoster
GROUP BY CompanyID
HAVING COUNT(EmployeeID) > 100;

The above is all fine and dandy, and MS Access will help you nicely. What if you care about order, though? This is where Access fails you, because even if you do something like the query below in hopes of sending to the youngest person in the company (the one with the latest BirthDate), Access will completely ignore your attempts at sorting and return the first person entered for that company. This is where we will improve on Access's less than ideal implementation of First and Last.

SELECT First(LastName) As LName, First(FirstName) As FName, COUNT(EmployeeID) As numEmployees
FROM (SELECT * FROM 
        CompanyRoster 
        ORDER BY CompanyID, BirthDate DESC) As foo
GROUP BY CompanyID
HAVING COUNT(EmployeeID) > 100;

Creating our First and Last Aggregates

Creating First and Last aggregates is much simpler than our Median function example. The First aggregate simply keeps the first entry that comes to it and ignores all the others. The Last aggregate continually replaces its current entry with whatever new entry is passed to it. The Last aggregate is trivial. The First aggregate is a bit more complicated because we don't want to throw out true NULLs, but since our initial state is also NULL, we need a way to tell the two apart. We do that by keeping the state in an array: a NULL array means nothing has been seen yet, while an array holding a single NULL means the first entry really was NULL.

This time we shall also use Tom Lane's suggestion from our median post of using anyelement to make this work for all data types.

CREATE OR REPLACE FUNCTION first_element_state(anyarray, anyelement)
  RETURNS anyarray AS
$$
    SELECT CASE WHEN array_upper($1,1) IS NULL THEN array_append($1,$2) ELSE $1 END;
$$
  LANGUAGE 'sql' IMMUTABLE;

CREATE OR REPLACE FUNCTION first_element(anyarray)
  RETURNS anyelement AS
$$
    SELECT ($1)[1] ;
$$
  LANGUAGE 'sql' IMMUTABLE;

CREATE OR REPLACE FUNCTION last_element(anyelement, anyelement)
  RETURNS anyelement AS
$$
    SELECT $2;
$$
  LANGUAGE 'sql' IMMUTABLE;
  
CREATE AGGREGATE first(anyelement) (
  SFUNC=first_element_state,
  STYPE=anyarray,
  FINALFUNC=first_element
  )
;

CREATE AGGREGATE last(anyelement) (
  SFUNC=last_element,
  STYPE=anyelement
);
--Now some sample tests
--pick the first and last member of each family arbitrarily, by order of input
SELECT max(age) As oldest_age, min(age) As youngest_age, count(*) As numinfamily, family,
    first(name) As firstperson, last(name) as lastperson
FROM (SELECT 2 As age , 'jimmy' As name, 'jones' As family
    UNION ALL SELECT 50 As age, 'c' As name , 'jones' As family
    UNION ALL SELECT 3 As age, 'aby' As name, 'jones' As family
    UNION ALL SELECT 35 As age, 'Bartholemu' As name, 'Smith' As family
    ) As foo
GROUP BY family;
--Result 
 oldest_age | youngest_age | numinfamily | family | firstperson | lastperson
------------+--------------+-------------+--------+-------------+------------
         50 |            2 |           3 | jones  | jimmy       | aby
         35 |           35 |           1 | Smith  | Bartholemu  | Bartholemu
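
To see how the array state tells "nothing seen yet" apart from a genuine NULL, the state and final functions can be called directly. This is only a small check, assuming the first_element_state and first_element functions defined above are already installed; the integer casts are chosen purely for illustration.

```sql
-- Assumes first_element_state() and first_element() from this article exist.
SELECT
    -- initial state (NULL array) + a NULL row: the NULL is recorded
    first_element_state(NULL::integer[], NULL::integer)    AS after_null_row,   -- {NULL}
    -- state already holds one entry, so a later value is ignored
    first_element_state(ARRAY[NULL::integer], 7)           AS after_later_row,  -- {NULL}
    -- initial state + a non-NULL row, then the final function
    first_element(first_element_state(NULL::integer[], 7)) AS first_value;      -- 7
```

Because array_upper returns NULL only for a NULL (or empty) array, the CASE in the state function appends exactly once per group.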


--For each family group list count of members,
--oldest and youngest age, and name of oldest and youngest family members
SELECT max(age) As oldest_age, min(age) As youngest_age, count(*) As numinfamily, family,
    first(name) As youngest_name, last(name) as oldest_name
FROM (SELECT * FROM (SELECT 2 As age , 'jimmy' As name, 'jones' As family
    UNION ALL SELECT 50 As age, 'c' As name , 'jones' As family
    UNION ALL SELECT 3 As age, 'aby' As name, 'jones' As family
    UNION ALL SELECT 35 As age, 'Bartholemu' As name, 'Smith' As family
    ) As foo ORDER BY family, age) as foo2
    WHERE age is not null
GROUP BY family;

--Result 
 oldest_age | youngest_age | numinfamily | family | youngest_name | oldest_name
------------+--------------+-------------+--------+---------------+-------------
         35 |           35 |           1 | Smith  | Bartholemu    | Bartholemu
         50 |            2 |           3 | jones  | jimmy         | c
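
Feeding the aggregate from a sorted subquery usually works, but the planner may reorder rows if the outer query does more processing, such as a join. PostgreSQL 9.0 and later, released after this article was written, accept an ORDER BY inside the aggregate call itself. The query below is a sketch of the same test written that way, assuming the first and last aggregates defined above are already installed:

```sql
-- Assumes the first() and last() aggregates from this article exist
-- and PostgreSQL 9.0 or later (ORDER BY inside an aggregate call).
SELECT max(age) AS oldest_age,
       min(age) AS youngest_age,
       count(*) AS numinfamily,
       family,
       first(name ORDER BY age) AS youngest_name,  -- lowest age is fed first
       last(name ORDER BY age)  AS oldest_name     -- highest age is fed last
FROM (VALUES (2,  'jimmy',      'jones'),
             (50, 'c',          'jones'),
             (3,  'aby',        'jones'),
             (35, 'Bartholemu', 'Smith')) AS foo(age, name, family)
WHERE age IS NOT NULL
GROUP BY family;
```

The jones family should again come back with jimmy as the youngest and c as the oldest, with no dependence on the order in which the rows happen to arrive.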

Posted by Leo Hsu and Regina Obe in intermediate, ms access, mysql, pl programming, sql functions, sql server at 22:58

Comments

This aggregate is really helpful, and is exactly what I needed to find "hot spot" records.

However, it is also useful to know how many "first" matches have been found (using the examples above, "how many joneses have an age of two"; the example above will result in one).

I've been wracking my brain on this one (aggregate functions are relatively new to me), and I've come up with these functions to solve this issue:

CREATE OR REPLACE FUNCTION explode_array(anyarray)
RETURNS SETOF anyelement AS
$$
SELECT ($1)[s] from generate_series(1,array_upper($1, 1)) AS s;
$$
LANGUAGE 'sql' IMMUTABLE
ROWS 1000;

CREATE OR REPLACE FUNCTION count_first_element(anyarray)
RETURNS integer AS
$$
SELECT COUNT(*)::integer FROM explode_array($1) AS e WHERE e=$1[1];
$$
LANGUAGE 'sql' IMMUTABLE;

CREATE AGGREGATE count_first(anyelement) (
SFUNC=array_append,
STYPE=anyarray,
FINALFUNC=count_first_element
);

--and a similar (but modified) test:

SELECT max(age) As oldest_age, min(age) As youngest_age, count_first(age) As youngest_count, count(*) As numinfamily, family,
first(name) As youngest_name, last(name) as oldest_name
FROM (SELECT * FROM (SELECT 2 As age , 'jimmy' As name, 'jones' As family
UNION ALL SELECT 2 As age, 'c' As name , 'jones' As family
UNION ALL SELECT 3 As age, 'aby' As name, 'jones' As family
UNION ALL SELECT 35 As age, 'Bartholemu' As name, 'Smith' As family
) As foo ORDER BY age, family, name) as foo2
WHERE age is not null
GROUP BY family;

I haven't prepared a similar "count_last" aggregate, nor have I thoroughly tested this function. Also, I don't think that I handled NULL records as gracefully as first_element_state above does.

------------------

Also of importance is that the ORDER BY needs to have the first/last index before anything else (i.e., age needs to appear first). Failure to sort on this first will yield errors (e.g., try "ORDER BY family, name, age" which will incorrectly place aby as the youngest). The above example should have sorted in this order to reinforce this point.

#1 Mike on 2008-09-12 03:36

I liked this article very much. It is very helpful to me.

Thanks a lot,

Regards,

Shamsu Zoha

#2 Shamsu Zoha on 2008-11-25 08:06

In order to get the first NOT NULL element:

CREATE OR REPLACE FUNCTION firstnotnull_element_state(anyarray, anyelement)
RETURNS anyarray AS
$$
SELECT CASE WHEN $2 is not null AND $1[1] IS NULL THEN array_prepend($2, $1) ELSE $1 END;
$$
LANGUAGE 'sql' IMMUTABLE;

CREATE AGGREGATE firstnotnull(anyelement) (
SFUNC=firstnotnull_element_state,
STYPE=anyarray,
FINALFUNC=first_element
);

HTH,

M. Mamin

#3 m.mamin on 2010-04-19 04:13

If you are looking for the first not null value, then there is no need to carry around an array to deal with the initial state, which makes the aggregate much simpler.

CREATE OR REPLACE FUNCTION first_notnull_state(anyelement,anyelement)
RETURNS anyelement AS
$$
SELECT COALESCE($1,$2);
$$
LANGUAGE 'sql' IMMUTABLE;

CREATE AGGREGATE first_notnull(anyelement) (
SFUNC=first_notnull_state,
STYPE=anyelement
)
;

#3.1 bitnerd on 2011-04-13 11:46

This post really saved me today Regina, thanks!

Being able to get the first (or any) row in the group is, I think, very useful for things like this:

Select ST_Difference(first(a.the_geom), ST_Union(b.the_geom)) from table1 a inner join table2 b on ST_Intersects(a.the_geom, b.the_geom) group by a.gid;

This is for when it is necessary to union the geometries in the second table before taking the difference, to get the wanted result. From my understanding it is a more robust way of doing it than grouping on a.the_geom, which is the alternative.
Am I right?

Thanks
Nicklas

#4 Nicklas on 2010-08-25 17:31

Nicklas,

I haven't thought much about using it that way. I would think the speed wouldn't be much different, or it might even be faster to do it without first.

e.g.

GROUP BY a.gid, a.the_geom

Remember that grouping by a.the_geom will just group by the bounding box, so it's a pretty light grouping anyway.

#4.1 Regina on 2010-08-25 21:49

You are right that grouping the geometries together with gid is a better idea. I didn't think about that when I read about first and last :-)
But I think I had a quite similar problem recently where first or last would have been the only option. I cannot recall it, though.
Anyway, it is a nice functionality.
/Nicklas

#4.1.1 Nicklas on 2010-08-27 18:27

Nicklas,

In 9.1 you can just group by gid and leave out the grouping by geometry, provided gid is a primary key.

Check out depesz's article:
http://www.depesz.com/index.php/2010/08/08/waiting-for-9-1-recognize-functional-dependency-on-primary-keys/

#4.1.1.1 Regina on 2010-08-28 16:16

Just thanks, a very useful function for me :-)

#5 L. Jégou (Homepage) on 2011-02-17 06:03

It might be worthwhile to add that this workaround is no longer necessary, as PG now has window functions (first_value, last_value) which can do exactly that.

http://www.postgresql.org/docs/current/static/functions-window.html

#6 Tobias on 2012-01-08 16:17

The workaround is still needed. Window functions serve a different purpose. In this case we are using an aggregate to consolidate a number of records. Window functions, unless you throw in a limit or something, are going to return the same number of records you started out with, which is not desirable in this case.

#6.1 Regina on 2012-01-08 20:32
