In our last article we talked about String Aggregation implementing in PostgreSQL, SQL Server, and MySQL. This task is one that makes purist relational database programmers a bit squeamish. In this article we'll talk about the reverse of that, how do you deal with data that someone hands you delimeted in a single field and that you are asked to explode or re-sort based on some lookup table.
What are the benefits of having a structure such as? :
p_name | activities
--------+--------------------------------
Dopey | Tumbling
Humpty | Cracking;Seating;Tumbling
Jack | Fishing;Hiking;Skiing
Jill | Bear Hunting;Hiking
Well for the casual programmer or simple text file database that knows nothing about JOINS and so forth, it makes it simple to pull a list of people who like Tumbling.
You simply do a WHERE ';' || activities || ';' LIKE '%;Tumbling;%'. It's great for security too because you can determine security with a simple like check and also list all the security groups a member belongs in without doing anything.
Quite easy for even the least data-skilled of programmers to work with because most procedural languages have a split function that can easily parse these into an array useful for stuffing into drop down lists and so forth. As a consultant of semi-techie people
I'm often torn by the dilemma of "What is the way I would program for myself vs. the way that provides the most autonomy to the client". By that I mean
for example I try to avoid heavy-weight things like Wizards that add additional bloated dependencies or slow the speed down of an application. These bloated dependencies may provide ease to the client but make my debugging life harder. So I weight the options
and figure out which way works now and also provides me an easy escape route should things like speed or complexity become more of an issue.
This brings us to the topic of, what is wrong with this model? It can be slow because the LIKE condition you have can't easily take advantage of an index unless using a full text index so not ideal where this is the primary filtering factor. It's also prone to pollution because you can't easily validate that the values in the field are in your valid set of lookups or if your lookup changes, the text can be forced to change with a CASCADE UPDATE/DELETE RULE etc. In cases where this is of minor consequence which is many if referential integrity is not high on your list of requirements, this design is not bad. It might make a purist throw up but oh well there is always dramamine to fall back on. As long as you have done your cost benefit analysis, I don't think there should be any shame of following this less than respected route.
While you may despise this model, it has its place and it's a fact of life that one day someone will hand it to you and you may need to flip it around a bit. We shall demonstrate how to do that in this article.
We'll have a table called lu_activities that lists the possible activities and our table of peopel with a column to hold the activities.
CREATE TABLE lu_activities(activity varchar(50) PRIMARY KEY, category varchar(50));
INSERT INTO lu_activities(activity,category) VALUES ('Bear Hunting', 'Outdoor');
INSERT INTO lu_activities(activity,category) VALUES ('Cracking', 'Other');
INSERT INTO lu_activities(activity,category) VALUES ('Fishing', 'Outdoor');
INSERT INTO lu_activities(activity,category) VALUES ('Hiking', 'Outdoor');
INSERT INTO lu_activities(activity,category) VALUES ('Skiing', 'Outdoor');
INSERT INTO lu_activities(activity,category) VALUES ('Tumbling', 'Other');
CREATE TABLE people(p_name varchar(75) PRIMARY KEY, activities text);
INSERT INTO people(p_name, activities) VALUES('Jack','Fishing;Hiking;Skiing');
INSERT INTO people(p_name, activities) VALUES('Jill', 'Bear Hunting;Hiking');
INSERT INTO people(p_name, activities) VALUES('Humpty','Cracking;Seating;Tumbling');
INSERT INTO people(p_name, activities) VALUES('Dopey', 'Tumbling');
You have a table of people and a field for each person that lists all their activities separated by ;. You luckily also have a lookup table that defines all the possible activities you care about. You want to explode this out into separate records. How do you do this in PostgreSQL, MySQL, and SQL Server.
We want our output to look like this:
-- We want out output to look like --
p_name | activity | category
--------+--------------+----------
Dopey | Tumbling | Other
Humpty | Cracking | Other
Humpty | Tumbling | Other
Jack | Fishing | Outdoor
Jack | Hiking | Outdoor
Jack | Skiing | Outdoor
Jill | Bear Hunting | Outdoor
Jill | Hiking | Outdoor
Having a lookup table makes this excersise fairly trivial. We can use more or less the same solution across these 3 databases. We use a left join to guarantee we will include everyone even if they have no activities.
-- PostgreSQL all versions
SELECT p.p_name, a.activity, a.category
FROM people As p
LEFT JOIN lu_activities AS a ON(';' || p.activities || ';' LIKE '%;' || a.activity || ';%')
ORDER BY p,p_name, a.category, a.activity;
-- SQL Server 2005+
-- (for SQL Server 2000 you would replace varchar(max) with varchar(3000) or something along that --
SELECT p.p_name, a.activity, a.category
FROM people As p
LEFT JOIN lu_activities AS a ON(';' + CAST(p.activities As varchar(MAX)) + ';' LIKE '%;' + a.activity + ';%')
ORDER BY p.p_name, a.category, a.activity;
-- MySQL NON-ANSI looks like below (ANSI enabled looks like PostgreSQL solution) --
SELECT p.p_name, a.activity, a.category
FROM people As p
LEFT JOIN lu_activities AS a ON(CONCAT(';' , p.activities , ';') LIKE CONCAT('%;' , a.activity , ';%'))
ORDER BY p.p_name, a.category, a.activity;
I'll leave this one as an exercise for the reader. The basic solution is to combine Question 1 with the solution in String Aggregation implementing in PostgreSQL, SQL Server, and MySQL.