Thursday, April 24, 2008

PostgreSQL: In place UPDATE

Disclaimer: MVCC is at the heart of PostgreSQL. Any attempts to hack it may leave your data inconsistent. Please use the trick below only if you are sure about what you are doing. Some may find it cool, but for many it will be an ugly hack.

In one of my previous posts, I talked about MVCC and VACUUM in Postgres. There is a nice document authored by Tom which explains it in more detail. One of the main complaints about MVCC is that it creates multiple versions of a row. That not only bloats the table, but also requires periodic maintenance activity to reclaim the dead tuples.

Recently somebody posted a use case on pgsql-general mailing list where the user wants to UPDATE all rows in a particular table. Its a simple update where a particular column is updated with a new value. The user is worried about the table bloat which in this case would be 100% because each row will have two versions at the end of the update operation. He can update a bunch of rows, run VACUUM, then update the next bunch and continue. This would avoid table bloat, but requires multiple passes over the table and indexes for each VACUUM.

The user further stated that he is not bothered about the transactional atomicity of the data. So he is fine if there is a failure or crash mid-way and some of the rows are updated while others are not. He would of course not want to have data corruption.

I thought about this for a while and it occurred to me that this case can possibly be handled by using BEFORE UPDATE triggers and heap_inplace_update() function. So here is a quick example. Its probably not a good idea to publish such foot gun techniques. But at certain times, you are left with no other option, but to use such hacks to solve your problem.

Before we start, here are few necessary conditions I can think of:
  • The updated tuple should be of the same size as the old tuple. IOW you should be updating only fixed size columns. Also, you must not be changing NULL values to non-NULL and vice versa.
  • There must not be any index involving any of the columns being updated. This would include predicates and expressions for partial/expression indexes.
  • There must not be any foreign key or check constraints on the columns being updated.
  • You are not using this on a system catalog table.
  • You should be absolutely sure that you don't need transaction atomicity and isolation for this update operation.
  • You should understand the technique and validate your use case against this. If you are not sure, please don't use it.
  • AFAIR heap_inplace_update() is available only 8.2 onwards.

If your use case meets these requirements, you can proceed further.

Write a BEFORE UPDATE trigger in C, something like this:

PG_FUNCTION_INFO_V1(inplace_update_trigger);
extern Datum inplace_update_trigger(PG_FUNCTION_ARGS);

Datum
inplace_update_trigger(PG_FUNCTION_ARGS)
{
 TriggerData *trigdata = (TriggerData *)fcinfo->context;
  trigdata->tg_newtuple->t_self = trigdata->tg_trigtuple->t_self;
  heap_inplace_update(trigdata->tg_relation, trigdata->tg_newtuple);
  return NULL;
}

Compile the function and generate an shared object (say trigger.so) from it.

Now, create a BEFORE UPDATE trigger on the table you want to update in-place.

CREATE OR REPLACE FUNCTION inplace_update_trigger()
RETURNS TRIGGER
AS 'trigger.so', 'inplace_update_trigger'
LANGUAGE C STRICT;

CREATE TRIGGER inplace_update_trigger BEFORE UPDATE ON test
 FOR EACH ROW EXECUTE PROCEDURE  inplace_update_trigger();

You are now ready for the in-place update operation. I would recommend acquiring and holding an EXCLUSIVE lock on the relation to ensure that there are no concurrent operations on the table.

When an UPDATE hits the table, the BEFORE UPDATE trigger overwrites the existing tuple with the new version. Since the existing copy is overwritten, any failure or crash after this won't rollback the change. So you should be ready to handle this case i.e. you should know how to undo the changes for the affected rows or retry the operation for the remaining tuples. Remember you may not have any easy way to know which tuples were successfully updated before failure/crash. Of course, you can do some clever tricks to store that information explictly in the rows.

The BEFORE UPDATE trigger returns NULL, so the actual UPDATE operation is skipped by the executor. So you would see "UPDATE 0" as the status of the operation. But remember, your existing rows have been updated in-place.

Once again I remind you: It's a hack. It compromises PostgreSQL MVCC and transaction atomicity/isolation properties. Use it at your own risk :-)