Remove Postgres / Redshift duplicates without unique key

March 3, 2021

Sometimes you got duplicates in database and have to deal with it. There's 2 possibilities - you have an auto-generated unique id on the table or not. The first case is very easy

Let say we have table feedback where we collect all the feedbacks given by our users. This table has field feedback_id which represents unique id of external system from which data was posted.

id	feedback_id	type	date
0001	MX-001	livechat	2020-01-01
0002	MX-001	livechat	2020-01-01
0003	MX-001	livechat	2020-01-01
0004	MX-002	pendo	2020-01-03
0005	MX-002	pendo	2020-01-03

If we created an auto-generated Id on it then we safe

delete from feedback
where id in (
    with dups as (
        select id,
            row_number() over (partition by feedback_id order by date desc) rn
        from feedback
    )
    select id from dups where rn > 1
)

But what to do if there's no unique Id? Temporary table is an approach - first we create feedback_temp where only unique records from all set of duplicates will be stored.

CREATE TABLE feedback_temp as (
    with dups as (
        select feedback_id
        from feedback
        group by 1
        having count(*) > 1
    )
    , sorted as (
        select row_number() over (partition by ff.feedback_id order by date desc) rn, f.*
        from feedback f
            right join dups d on d.feedback_id = f.feedback_id
    )
    select * from sorted where rn = 1
);

Next step is to remove the duplicates from initial table

delete from feedback
where feedback_id in (
    select feedback_id
    from feedback
    group by 1
    having count(*) > 1
);

Then we just need to insert data back and remove temp table

BEGIN TRANSACTION;

insert into feedback
(select * from feedback_temp);

drop table feedback_temp;

END TRANSACTION;