NOT IN NULL Uniqueness trickery

I know a lot has been said about this beautiful value we affectionately call NULL, which is neither here nor there and that manages to catch many of us off guard with its casual neither here nor thereness. Database analysts who are really just back seat mathematicians in disguise like to philosophize about the unknown and pat themselves on the back when they feel they have mastered the unknown better than any one else. Of course database spatial analysts, the worst kind of back seat mathematicians, like to talk not only about NULL but about EMPTY and compare notes with their brethren and write dissertations about what to do about something that is neither here nor there but is more known than the unknown, but not quite as known as the empty string.

Okay getting to the point, one of our clients asked us about a peculiar problem they had with a query, and the strange results they were getting. We admit this still manages to catch us off guard every once in a while.

Here is a cute little example to demonstrate.

CREATE TABLE testnulls(id integer, id2 integer PRIMARY KEY);

CREATE UNIQUE INDEX uidx_testnulls_id
   ON testnulls USING btree (id);

INSERT INTO testnulls(id, id2)
SELECT 1,5 UNION ALL SELECT 2,6 UNION ALL SELECT NULL, 2;


-- wow this works
INSERT INTO testnulls(id,id2)
VALUES(null,0);



-- case 1  - returns no records
SELECT n.id2
FROM testnulls As n
WHERE n.id2 NOT IN(SELECT id FROM testnulls);

-- case 2 - returns 3 records
SELECT n.id2
FROM testnulls As n
WHERE n.id2 NOT IN(SELECT t.id FROM testnulls As t WHERE t.id IS NOT NULL);

-- case 3 -- returns 1 record
SELECT n.id2
FROM testnulls As n
WHERE n.id2 IN(SELECT t.id FROM testnulls As t );


-- case 4 -- returns 1 record
SELECT n.id2
FROM testnulls As n
WHERE n.id IN(SELECT t.id2 FROM testnulls As t );


-- case 5 -- returns 2 records
SELECT n.id, n.id2
FROM testnulls As n
WHERE n.id IN(SELECT t.id FROM testnulls As t );

Many will be surprised to find out that Case 1 yields no records and wonder how its different from case 2. Some may also wonder why case 5 does not return the records with null values. This is not a bug in PostgreSQL, but just the nature of NULL that you will see in other ANSI SQL compliant non-PostgreSQL databases as well.

The other interesting thing demonstrated by the example above, is the idea of a unique index vs. a primary key. In a primary key, you can't have nulls, but in a unique key, you can have as many nulls as you want, but other values must be unique. This is again ANSI standard compliant behavior you will see in other databases.

Why does case 1 return no records?

Since the NOT IN subquery has NULLS, which is another way of saying UNKNOWN, you can't be aboslutely sure that a value is not in the unknown set, but if asked to say true or false what do you say in such a case? The answer according to the ANSI standard is false.

Why does case 5 not return NULL records?

You can't say a missing value is equal to a set of missing or non-missing values, but should you return true or false when asked to do so? The answer according to the ANSI standard is false.

Just say NO

So in short, when in doubt about any question that requires a Yes or NO answer, just say NO.

THe STRICTNESS problem

This as odd as it may seem is also related to the STRICTNESS issue we brought up that broke the inlining logic we had in place for PostGIS functions. As Anarazel beautifully explains in our bug ticket that all boolean expressions are fundamentally considered to be not strict meaning they expect a true or false answer (not unknown). What STRICTNESS means is that when any of the inputs to your function is NULL, your answer should be NULL, but in the case of booleans, things often get messy when you ask for data and the computer answers, "I DON'T KNOW, IF I SHOULD GIVE YOU THIS DATA".