Montag, 20. April 2015

< min(), max() and the collation dependent ordering of character columns on Greenplum (Postgres 8.2) >

Recently I came across an interesting collation related problem while using a Greenplum (Postgres 8.2) database setup with an UTF-8 collation.

Each character column that undergoes an order by, min() or max() will be sorted according to that collation.

But in some cases, I needed to perform ordering or min(), max() aggregations based on the ASCII content of character columns - luckily, this is quite easily achieved on Greenplum using operator classes and custom aggregates:

create temp table col_test (a text);

insert into col_test (a) values ('"quoted"'),('Abc'),('123'),('Öfföff'),('Zürich'),('Zuhause'),('Zahnpasta');

--
-- order by UTF-8 content
--
select * from col_test order by a asc;

123
Abc
Öfföff
"quoted"
Zahnpasta
Zuhause
Zürich

select * from col_test order by a desc;

Zürich
Zuhause
Zahnpasta
"quoted"
Öfföff
Abc
123


--
-- order by ASCII content
-- 
select * from col_test order by a using ~<~;

"quoted"
123
Abc
Zahnpasta
Zuhause
Zürich
Öfföff

select * from col_test order by a using ~>~;

Öfföff
Zürich
Zuhause
Zahnpasta
Abc
123
"quoted"



The min(), max() aggregation can be taken care of by creating two custom aggregates:

drop aggregate ascii_max(bpchar) cascade;

drop aggregate ascii_min(bpchar) cascade;

create or replace function ascii_bpchar_larger(prev bpchar, curr bpchar) returns bpchar AS $$
begin if (prev ~>~ curr) then return prev; end if; return curr; end; $$ language plpgsql immutable strict;

create or replace function ascii_bpchar_smaller(prev bpchar, curr bpchar) returns bpchar AS $$
begin if (prev ~<~ curr) then return prev; end if; return curr; end; $$ language plpgsql immutable strict;

create aggregate ascii_max(bpchar) ( sfunc = ascii_bpchar_larger, stype = bpchar, sortop = '~>~' );

create aggregate ascii_min(bpchar) ( sfunc = ascii_bpchar_smaller, stype = bpchar, sortop = '~<~' );


The custom aggregates can be used like any builtin aggregate:

--
-- aggregate by UTF-8 content
--
select min(a), max(a) from col_test;
 min |  max 
-----+--------
 123 | Zürich

--
-- aggregate by ASCII content
--
select ascii_min(a), ascii_max(a) from col_test;
 ascii_min | ascii_max
-----------+-----------
 "quoted"  | Öfföff


See http://www.postgresql.org/docs/8.2/static/sql-createaggregate.html for more information regarding custom aggregates and http://www.postgresql.org/docs/8.2/static/indexes-opclass.html regarding operator classes.