Alter a column data type in all tables

We must change the timestamp format for all columns of type timestamp.

The current format looks like this :

2017-06-14 17:47:23.295343

and it must become :

2017-06-14 17:47:23

The SQL command to run for one column is :

alter table my_table 
   alter column my_timestamp_column type timestamp(0);

We embed this command in a loop that reads the relevant information from the catalog :

create or replace function public.alter_all_timestamps() 
 returns void as $$
declare
  r       record;
  stmt    varchar;
begin
  for r in 
    select 
        c.table_name
       ,c.column_name
    from 
        information_schema.columns as c
    inner join 
        information_schema.tables as t 
    on 
        c.table_name = t.table_name
    and
        c.table_schema = t.table_schema
    where 
        c.table_schema = 'public'
    and 
        c.data_type = 'timestamp without time zone'
    and 
        t.table_type = 'BASE TABLE'
  loop
    
        stmt := 'alter table ' 
                || quote_ident(r.table_name )
                || ' alter column '
                || quote_ident(r.column_name)
                || ' type timestamp(0)';
                
        execute stmt;

  end loop;

end $$ language 'plpgsql' volatile;

That’s simple !

If there are views on those tables that select a column of type timestamp, those views must be dropped before running the function and recreated afterwards.
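
As a minimal sketch (assuming the views live in the schema public and that my_view is a hypothetical view on my_table), the view definitions can be saved from pg_views before the run and the views recreated afterwards :

-- save the definitions of the views of the schema public, to recreate them later
select viewname, definition
from pg_views
where schemaname = 'public';

-- drop the hypothetical view, run the conversion
drop view if exists my_view;
select public.alter_all_timestamps();

-- recreate the view from the definition saved above, for example :
create view my_view as
  select my_timestamp_column from my_table;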

 

Calculated columns with a trigger

 

Some say that calculated columns are bad design because they do not follow the relational model. That is true, but performance can be greatly enhanced. When a function is applied to an indexed column in a query, the query optimizer will not use the index. It is therefore sometimes better to have an index on the calculated column, because queries from the front-end applications will run faster.
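
As a minimal illustration of that point (the table, columns and indexes below are hypothetical), a predicate that wraps the indexed column in a function cannot use the plain index, while a predicate on an indexed calculated column can :

-- hypothetical table with a calculated column amount_net maintained by a trigger
create table orders (
     order_id    serial primary key
    ,amount      numeric(10,2)
    ,amount_net  numeric(10,2)
);
create index orders_amount_idx     on orders(amount);
create index orders_amount_net_idx on orders(amount_net);

-- the optimizer cannot use orders_amount_idx : the column is wrapped in a function
select * from orders where round(amount) = 100;

-- the optimizer can use orders_amount_net_idx
select * from orders where amount_net = 98.50;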

In this simple example, we have a table with 3 columns. The first and the second are given by the user and the third is calculated by a trigger. We add a condition : x * 0 = 0 * x = x (this fantasy comes from the requirements of a customer). It is implemented with a case expression.

drop table if exists tup;

create table tup (
     col1   numeric(10,2)
    ,col2   numeric(10,2)
    ,col3   numeric(10,2)
);

insert into tup(col1,col2) values
     (2.87      ,3.77)
    ,(4         ,5.11)
    ,(2.12      ,0)
    ,(0.0       ,3);

update tup
    set col3 =     (case col1 when 0 then 1 else col1 end)
                *  (case col2 when 0 then 1 else col2 end);

select * from tup;

will give

col1     col2   col3
---------------------
2.87     3.77   10.82
4.00     5.11   20.44
2.12     0.00    2.12
0.00     3.00    3.00

Let's now add the trigger function and the trigger :

create or replace function calculate_columns() returns trigger as $$
begin

  new.col3 =   (case new.col1 when 0 then 1 else new.col1 end)
             * (case new.col2 when 0 then 1 else new.col2 end);

  return new;

end $$ language plpgsql;

create trigger calculated_columns  
    before insert or update 
    on tup
    for each row
    execute procedure calculate_columns();

Note that it is a before insert or update trigger and that we set the column in the row new.

insert into tup(col1, col2) values
     (6.23      ,2)
    ,(0         ,55.11);

select * from tup;

will give :

col1     col2   col3
---------------------
2.87     3.77   10.82
4.00     5.11   20.44
2.12     0.00    2.12
0.00     3.00    3.00
6.23     2.00   12.46
0.00    55.11   55.11

The deal is in the bag!

To list all the pk’s and fk’s

The system catalogs of PostgreSQL are very well designed. They are not easy to understand at first glance, but they make it possible to write short and efficient queries.

As usual, if the data model is sound, the queries are simple and natural.

Just one example : in the following query, the comma-separated list of columns for composite keys is created automatically by pg_get_constraintdef :

select
     conrelid::regclass as table_name
    ,conname
    ,pg_get_constraintdef(c.oid)
from
    pg_constraint c
join
    pg_namespace n on n.oid = c.connamespace
where
    contype in ('f', 'p')
and
    n.nspname = 'public'
order by
    conrelid::regclass::text
   ,contype desc;

The output looks like this :

...
"SLM_TYPE"           SLM_TYPE_pkey                   PRIMARY KEY ("SLM_TYPE_NR")
"SLM_TYPE_CONTACT"   SLM_TYPE_CONTACT_CONTACT_fkey   FOREIGN KEY ("CONTACT") 
                                                        REFERENCES "CONTACTS"("CONTACT_NR") 
                                                        ON DELETE CASCADE
...

By the way, it is usually better to use serials for pk’s and fk’s. But I am working with databases that I migrate from Access to PostgreSQL. I would like to write a script that adds serials for those pk’s and fk’s and turns the initial pk’s and fk’s into unique constraints. Not so easy…
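
For a single, hypothetical pair of tables "PARENT"("CODE") and "CHILD"("PARENT_CODE"), the per-table DDL might look like the sketch below; automating it for every pk/fk from the catalog is the hard part :

-- hypothetical tables ; adapt the constraint names to the actual ones
alter table "PARENT" add column id bigserial;
alter table "CHILD"  add column parent_id bigint;

update "CHILD" c
   set parent_id = p.id
  from "PARENT" p
 where p."CODE" = c."PARENT_CODE";

alter table "CHILD"  drop constraint "CHILD_PARENT_CODE_fkey";
alter table "PARENT" drop constraint "PARENT_pkey";
alter table "PARENT" add constraint "PARENT_CODE_key" unique ("CODE");
alter table "PARENT" add primary key (id);
alter table "CHILD"  add foreign key (parent_id) references "PARENT"(id);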

To list all the indexes

When we create a pk in PostgreSQL, an index named table_name_pkey is automatically created.

The following script gives all indexes, those created by the pk’s and the others.

Don’t forget that PostgreSQL does not automatically create an index when you create a fk. You have to do it yourself !
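
A minimal sketch, with hypothetical tables named after the output above, of indexing a fk column by hand :

-- hypothetical tables : the fk on contact gets no index automatically
create table contacts (
     contact_nr  serial primary key
    ,name        text
);

create table slm_type_contact (
     slm_type_nr integer
    ,contact     integer references contacts(contact_nr) on delete cascade
);

-- create the index on the fk column yourself
create index slm_type_contact_contact_idx on slm_type_contact(contact);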

By the way, this script works for composite indexes thanks to the handy function array_to_string, which creates a comma-separated list of the column names. It works because the first two columns of the select clause are also the columns of the group by and the order by.

select 
     t.relname as table_name
    ,i.relname as index_name
    ,array_to_string(array_agg(a.attname), ', ') as column_names
from
     pg_class      t
    ,pg_class      i
    ,pg_index      ix
    ,pg_attribute  a
where
    t.oid = ix.indrelid
and 
    i.oid = ix.indexrelid
and 
    a.attrelid = t.oid
and 
    a.attnum = any(ix.indkey)
and 
    t.relkind = 'r' -- takes regular tables
and 
    t.relname not like 'pg_%' -- excludes system catalogs
group by
    t.relname
   ,i.relname
order by
    t.relname
   ,i.relname;

How to safely delete records ?

To collect accurate data is expensive !

To structure data is very expensive !

To record accurate structured data is very very expensive !

Then…

Once data is recorded in a database, it is not a good idea to delete it permanently. It is better to archive the records we no longer want to see in the database.

We will build a very simple framework for a soft delete with PostgreSQL trigger functions. In a previous post, we developed an audit system that we will use again. We just have to modify the pre-delete trigger.

The components are :

  • a schema called ‘deleted’ containing the same tables as the database
  • a pre-delete trigger on each table

and, inherited from the audit system :

  • an additional column on each table called audit_id
  • a table called audit_history

Let’s begin with the creation of the schema ‘deleted’ :

create or replace function public.create_schema_deleted() returns void as $$
declare 
 r	record;
 stmt	varchar;
begin

 execute 'create schema deleted;';

 for r in
  select 
   quote_ident(table_name) as table_name
  from 
   information_schema.tables 
  where 
   table_schema = 'public'
  and
   table_type = 'BASE TABLE'

  loop

   stmt :=    'create table deleted.' 
     || r.table_name 
     || ' (like public.' 
     || r.table_name 
     || ')';

   execute stmt;

  end loop;

  execute 'grant usage on schema deleted to generic_user;';
  execute 'grant insert on all tables in schema deleted to generic_user;';

end $$ language 'plpgsql' volatile;

select public.create_schema_deleted();

This schema contains all the tables of your production database, without indexes, without pk/fk constraints, without sequences and without triggers. The role generic_user is granted to the actual users.

Just after the creation of the schema, the tables are empty.

Each table has a column audit_id. It is a bigserial that is incremented automatically when we insert a record into the production database.

The tables in the schema ‘deleted’ have of course also this column.

We now create the trigger function that will be executed each time a user deletes a record in the production database :

create or replace function public.pre_delete() returns trigger as $$
declare
 stmt 	varchar;
begin

 insert into public.audit_history
  (table_name, operation, audit_id, user_name, audit_date)
 values 
  (TG_TABLE_NAME, TG_OP, old.audit_id, current_user, now());

  stmt :=    'insert into deleted.'
   || quote_ident(TG_TABLE_NAME)
   || ' select * from public.'
   || quote_ident(TG_TABLE_NAME)
   || ' where audit_id = $1;';

  execute stmt using old.audit_id;
        
  return old;

end; $$ language 'plpgsql'; 

and a pre-delete trigger on each table :

create or replace function public.create_pre_delete_triggers() 
   returns void as $$
declare
 r	record;
 stmt	varchar;
begin
 for r in 
    select table_name 
    from information_schema.tables 
    where table_schema = 'public'
    and table_type = 'BASE TABLE'
  loop

    stmt :=    'create trigger ' 
     || quote_ident(r.table_name || '_pre_delete')
     || ' before delete on public.' 
     || quote_ident(r.table_name)
     || ' for each row execute procedure public.pre_delete();';

    execute stmt;

  end loop;

end; $$ language 'plpgsql'; 

select public.create_pre_delete_triggers();

If we want to find a deleted record, we have a look at the table audit_history, which has the following structure :

 id_audit_history  bigserial unique 
,table_name        text 
,operation         text 
,audit_id          bigint 
,user_name         text 
,audit_date        timestamp

And we do a select in the schema ‘deleted’ for this table and for this audit_id.
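
For example, assuming a hypothetical audited table called my_table, the lookup could be sketched like this :

-- find the deletions recorded for the hypothetical table my_table
select audit_id, user_name, audit_date
from public.audit_history
where table_name = 'my_table'
and operation = 'DELETE';

-- fetch the archived row from the schema 'deleted', using an audit_id from the previous query
select *
from deleted.my_table
where audit_id = 42;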

That’s simple !

A simple audit system

Let’s suppose we want to audit all the inserts, updates and deletes committed on a database.

We create a table to store relevant data :

drop table if exists audit_history;
create table audit_history (
  id_audit_history       serial
 ,table_name             text
 ,operation              text
 ,audit_id               bigint
 ,user_name              text
 ,insertion_date         timestamp       default now()
);

drop sequence if exists audit_history_seq;
create sequence audit_history_seq;

We create two test tables, each with a pk constraint, one of them having a fk constraint to the other. Note that this fk constraint has the on delete cascade clause.

drop table if exists app_table_ref;
drop table if exists app_table;

create table app_table (
  pk             integer         primary key
 ,info           varchar
 ,audit_id       bigserial
);

create table app_table_ref (
  pk             integer         primary key
 ,info           varchar
 ,fk             integer references app_table(pk) on delete cascade
 ,audit_id       bigserial
);

Note the presence of the column audit_id, a bigserial that is not a pk.

We now create the trigger function, a generic function for the two triggers : after insert or update, and before delete.

create or replace function populate_audit_history() 
   returns trigger as $$
begin

  if (TG_OP = 'INSERT' or TG_OP = 'UPDATE') then

      insert into audit_history
        (table_name, operation, audit_id, user_name)
      values
        (TG_TABLE_NAME, TG_OP, new.audit_id, current_user);

      return new;

  else
  
     insert into audit_history
        (table_name, operation, audit_id, user_name)
      values
        (TG_TABLE_NAME, TG_OP, old.audit_id, current_user);

     return old;

  end if;
end;
$$ language 'plpgsql';


create or replace function create_audit_triggers() 
   returns void as $$
declare
   r       record;
   stmt    varchar;
begin
  for r in select table_name from information_schema.tables 
              where table_name like 'app_table%' loop

     stmt :=    'create trigger '
             || quote_ident(r.table_name || '_audit_after_iu')
             || ' after insert or update on '
             || quote_ident(r.table_name)
             || ' for each row 
                    execute procedure populate_audit_history();';

     execute stmt;
   
     stmt :=    'create trigger '
             || quote_ident(r.table_name || '_audit_before_d')
             || ' before delete on '
             || quote_ident(r.table_name)
             || ' for each row 
                    execute procedure populate_audit_history();';

     execute stmt;

   end loop;
end;
$$ language 'plpgsql';

select create_audit_triggers();

Note the execute instruction that allows us to execute dynamic SQL statements.

We now run some transactions on those two tables :

insert into app_table values(23,'record 23');
insert into app_table values(56,'record 56');
insert into app_table values(71,'record 71');
insert into app_table values(82,'record 82');
insert into app_table values(85,'record 85');
insert into app_table values(91,'record 91');
insert into app_table values(94,'record 94');
insert into app_table values(97,'record 97');
insert into app_table values(99,'record 99');
update app_table set info = 'modified' where pk = 23;
update app_table set info = 'modified' where pk = 56;
update app_table set info = 'modified' where pk = 97;
delete from app_table where pk = 71;
insert into app_table values(101,'record 101');
insert into app_table values(121,'record 121');
insert into app_table values(167,'record 167');
update app_table set info = 'modified' where pk = 101;
delete from app_table where pk = 121;
insert into app_table_ref values(1, 'ref1', 23);
insert into app_table_ref values(2, 'ref2', 23);
insert into app_table_ref values(3, 'ref3', 82);
delete from app_table where pk = 23;

Here is the table audit_history :

1  | app_table     | INSERT |  1 | mchl | 2017-02-08 15:01:10.28164
2  | app_table     | INSERT |  2 | mchl | 2017-02-08 15:01:10.292839
3  | app_table     | INSERT |  3 | mchl | 2017-02-08 15:01:10.304088
4  | app_table     | INSERT |  4 | mchl | 2017-02-08 15:01:10.315051
5  | app_table     | INSERT |  5 | mchl | 2017-02-08 15:01:10.32618
6  | app_table     | INSERT |  6 | mchl | 2017-02-08 15:01:10.337284
7  | app_table     | INSERT |  7 | mchl | 2017-02-08 15:01:10.348366
8  | app_table     | INSERT |  8 | mchl | 2017-02-08 15:01:10.35954
9  | app_table     | INSERT |  9 | mchl | 2017-02-08 15:01:10.370595
10 | app_table     | UPDATE |  1 | mchl | 2017-02-08 15:01:10.381727
11 | app_table     | UPDATE |  2 | mchl | 2017-02-08 15:01:10.392824
12 | app_table     | UPDATE |  8 | mchl | 2017-02-08 15:01:10.403822
13 | app_table     | DELETE |  3 | mchl | 2017-02-08 15:01:10.414956
14 | app_table     | INSERT | 10 | mchl | 2017-02-08 15:01:10.4261
15 | app_table     | INSERT | 11 | mchl | 2017-02-08 15:01:10.437168
16 | app_table     | INSERT | 12 | mchl | 2017-02-08 15:01:10.448195
17 | app_table     | UPDATE | 10 | mchl | 2017-02-08 15:01:10.459344
18 | app_table     | DELETE | 11 | mchl | 2017-02-08 15:01:10.470463
19 | app_table_ref | INSERT |  1 | mchl | 2017-02-08 15:01:10.481536
20 | app_table_ref | INSERT |  2 | mchl | 2017-02-08 15:01:10.49259
21 | app_table_ref | INSERT |  3 | mchl | 2017-02-08 15:01:10.503579
22 | app_table     | DELETE |  1 | mchl | 2017-02-08 15:01:10.514913
23 | app_table_ref | DELETE |  1 | mchl | 2017-02-08 15:01:10.514913
24 | app_table_ref | DELETE |  2 | mchl | 2017-02-08 15:01:10.514913

And here is the table app_table :

  56 | modified   |        2
  82 | record 82  |        4
  85 | record 85  |        5
  91 | record 91  |        6
  94 | record 94  |        7
  97 | modified   |        8
  99 | record 99  |        9
 101 | modified   |       10
 167 | record 167 |       12

and the table app_table_ref :

  3 | ref3 | 82 |        3

We note that the delete in app_table of the record having pk = 23 has also deleted the two records of app_table_ref having fk = 23.

This design is very handy because you can add auditing to an existing database by adding to each table : 1/ a serial column (that is not a pk) and 2/ the two triggers (after the creation of the generic function).
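
For instance, for a hypothetical existing table called my_table, the retrofit would be sketched as :

-- add the audit_id column (a bigserial that is not a pk)
alter table my_table add column audit_id bigserial;

-- add the two triggers calling the generic function
create trigger my_table_audit_after_iu
    after insert or update on my_table
    for each row execute procedure populate_audit_history();

create trigger my_table_audit_before_d
    before delete on my_table
    for each row execute procedure populate_audit_history();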

The applications are not touched and the referential integrity of the database is preserved.

Having, for each modification, the name of the table and the value of the audit_id, we can find, for inserts and updates, the rows that have changed.
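
A hedged sketch of such a lookup, joining audit_history with app_table on audit_id :

-- current state of the rows of app_table that were inserted or updated, with the audit metadata
select h.id_audit_history, h.operation, h.user_name, h.insertion_date, a.*
from audit_history h
join app_table a on a.audit_id = h.audit_id
where h.table_name = 'app_table'
and h.operation in ('INSERT', 'UPDATE')
order by h.id_audit_history;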

Remove duplicate records from a table

A well-known problem : a table has an id which is the primary key (PK), but some records have all the other fields identical. The script must remove all the duplicates and keep the record with the smallest id.

For example, a table Customers having a SERIAL as PK but no UNIQUE constraint on the other fields. Two or more employees are able to create customers, and after a certain amount of time there is a fairly high probability of having duplicates.

The job is done with a single statement, but we need to create a table with duplicates and to compute some statistics in order to verify our script.

--
-- Create table, populates it, take a backup
--
drop table if exists dups_test;
drop table if exists dups_test_backup;

create table dups_test (
 id int primary key
 ,title text
 ,info int
);

insert into dups_test
 select i, 'Record #' || i::text, i
 from generate_series(1, 10000) i;

create table dups_test_backup as select * from dups_test;

By the way, we can see how the PostgreSQL function generate_series is used.

The first records of dups_test :

select * from dups_test order by id;

gives

id   title         info
---- ----------    -----
1    Record #1	   1
2    Record #2     2
3    Record #3	   3
...

Let’s now introduce some duplicates. The id must be different, because id is a PK, but the other fields, title and info, will be taken from the initial data. In order to guarantee uniqueness, we will use a sequence starting at the size of the initial table + 1.

--
-- insert duplicates into the table
--
create sequence dups_seq minvalue 10001;

insert into dups_test (id, title, info)
 select nextval('dups_seq'), title, info 
 from dups_test 
 order by random()  
 limit 1000;

insert into dups_test (id, title, info)
 select nextval('dups_seq'), title, info 
 from dups_test 
 order by random()  
 limit 500;

insert into dups_test (id, title, info)
 select nextval('dups_seq'), title, info 
 from dups_test 
 order by random()  
 limit 250;

insert into dups_test (id, title, info)
 select nextval('dups_seq'), title, info 
 from dups_test 
 order by random() 
 limit 125;

drop sequence dups_seq;

By the way, we can see how to pick n random records in a table : SELECT * FROM table ORDER BY random() LIMIT n.

We see that 1875 (1000 + 500 + 250 + 125) records have been added :

select count(*) from dups_test;

gives

count
-----
11875

Some records of dups_test :

select title, info, id from dups_test order by title, info, id;

gives

title          info   id
-----------    ----   -----
...
Record #1490   1490   1490
Record #1491   1491   1491
Record #1491   1491   11491
Record #1491   1491   32491
Record #1492   1492   1492
Record #1493   1493   1493
Record #1493   1493   11493
Record #1493   1493   21493
Record #1494   1494   1494
...

In order to know the distribution of the duplicates, we execute the following query

select distinct count(*) as count_records, t.countd as count_dups 
from (
 select info, count(*) as countd
 from dups_test
 group by (info)
 having count(*) > 1) as t
group by t.countd
order by t.countd;

that gives

count_records	count_dups
-------------   ----------
1498            2
167             3
13              4
1               5

It means that 1498 groups of identical fields contain 2 records each, 167 groups contain 3 records each, and so on (the same title and info, but not the same id, because id is the PK).

And now, we delete duplicates with this tricky query :

--
-- remove duplicates
--
delete from dups_test 
 where id in (
  select id
   from (
    select 
     id
     ,row_number() over (partition by title,info order by id ) as r
    from dups_test ) t 
   where t.r > 1);

that gives

DELETE 1875

How does this tricky query work ?

row_number() over (partition by …) is a window function call : a window function performs a calculation across a set of table rows that are somehow related to the current row, like an aggregate function. But unlike regular aggregate functions, the use of a window function does not cause the rows to become grouped into a single output row : the rows retain their separate identities. The partition by clause defines the set of rows (the partition) on which the function operates.

row_number() gives the number of the current row within its partition, counting from 1.

Let’s execute the following query on the subset we have examined before :

select 
  id
  ,title
  ,info
  ,row_number() over (partition by title,info ORDER BY id ) as r
 from dups_test
 where info in (1490, 1491, 1492, 1493, 1494)
 order by title, info;

gives

id      title          info   r
----    ------------   ----   -
1490    Record #1490   1490   1 
1491    Record #1491   1491   1 
11491   Record #1491   1491   2 
32491   Record #1491   1491   3 
1492    Record #1492   1492   1 
1493    Record #1493   1493   1 
11493   Record #1493   1493   2 
21493   Record #1493   1493   3 
1494    Record #1494   1494   1

The deletion of the records in each partition having a row_number > 1 removes the duplicates and keeps the initial record.

We have to verify that all the duplicates, and only the duplicates, have been deleted.

select count(*) from dups_test;

gives the initial number of rows :

count
-----
10000

Good !

And we now run the script that computes the distribution of the duplicates.

select distinct count(*) as count_records, t.countd as count_dups 
from (
 select info, count(*) as countd
 from dups_test 
 group by (info) having count(*) > 1) as t 
group by t.countd order by t.countd; 

gives

count_records	count_dups
-------------   ----------
(0 rows)

Good !

And then we check the final table against the backup :

select count(*)
from dups_test a
full outer join dups_test_backup b
on a.id = b.id
where a.title = b.title and a.info = b.info;

gives

count
-----
10000

This query, comparing all fields, gives the initial number of rows.

A little explanation about OUTER JOINs :

table1 LEFT OUTER JOIN table2 ON table1.id = table2.id gives all the records of table1, even if there is no corresponding record in table2.

table1 RIGHT OUTER JOIN table2 ON table1.id = table2.id gives all the records of table2, even if there is no corresponding record in table1.

table1 FULL OUTER JOIN table2 ON table1.id = table2.id gives all the records of table1 and all the records of table2, with nulls where there is no corresponding record in the other table.
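
A minimal sketch with two hypothetical two-row tables makes the difference concrete :

create temp table t1 (id int, v text);
create temp table t2 (id int, w text);
insert into t1 values (1, 'a'), (2, 'b');
insert into t2 values (2, 'x'), (3, 'y');

select * from t1 left  outer join t2 on t1.id = t2.id;   -- ids 1 and 2     (2 rows)
select * from t1 right outer join t2 on t1.id = t2.id;   -- ids 2 and 3     (2 rows)
select * from t1 full  outer join t2 on t1.id = t2.id;   -- ids 1, 2 and 3  (3 rows)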

Alternatively, the following script does the same job by searching for rows having the same id but not the same title or info :

select count(*)
from dups_test a
full outer join dups_test_backup b
on a.id = b.id
where a.title != b.title or a.info != b.info;

gives

count
-------
0

It seems that everything is OK.

By the way, have a look at the last two queries. The only difference is the WHERE clause

where a.title = b.title and a.info = b.info

giving a count equal to the initial number of rows, and

where a.title != b.title or a.info != b.info

giving a count of zero.

It is a good example of De Morgan’s laws :

The negation of a conjunction is the disjunction of the negations.
The negation of a disjunction is the conjunction of the negations.

or, in other words :

not (A AND B) = (not A) OR (not B)
not (A OR B) = (not A) AND (not B)

And now, a little bit of cleaning before we leave :

drop table dups_test_backup;

And, last but not least, a visualization of the distribution of duplicates !

When we make statistics, it is always better to visualize the numbers. The R programming language is excellent for statistics and visualization ! And R has a PostgreSQL interface.

In order to produce a nice chart, a large number of duplicates has been added.

And in order to avoid putting SQL in the front-end application, as required in a client-server architecture, we create the following view that embeds the query we used before :

create or replace view stat_on_dups as
 select distinct count(*) as count_records, t.countd as count_dups 
 from (
  select info, count(*) as countd
  from dups_test
  group by (info)
  having count(*) > 1) as t
 group by t.countd
 order by t.countd;

that gives

count_records	count_dups
-------------   ----------
2027            2
1469            3
1082            4
857             5
549             6
376             7
260             8
182             9
124             10
73              11
53              12
38              13
30              14
26              15
19              16
7               17
11              18
5               19
2               20
1               21
3               22
1               26

This R script :

require("RPostgreSQL") 
# establish connection 
driver ← dbDriver("PostgreSQL") 
connexion ← dbConnect(driver, dbname="higayone"
                        ,host="localhost", port=5432
                        ,user="mchl", password="secret")                              

# query the database 
list ← dbGetQuery(connexion, "select * from stat_on_dups;") 

# create the plot 
vx ← seq(0, 2000, 100) 
vy ← seq(1, 30, 1) 
plot(list, type = "h", col = "blue", lwd = 1, 
        ,xlab = "Number of records", 
        ,ylab = "Number of duplicates", 
        ,main = "Repartion of duplicates"
        ,xaxt = "n", yaxt = "n") 
axis(side = 2, at = vy) 
axis(side = 1, at = vx) 

# clean before leave 
dbDisconnect(connexion) 
dbUnloadDriver(driver)

displays :

[Chart : distribution of the duplicates]

The x-axis shows the number of records having duplicates and the y-axis shows the number of duplicates for those records.

We see at a glance that we have a lot of records with few duplicates and few records with a lot of duplicates.