Remove Postgres / Redshift duplicates without unique key
Sometimes you got duplicates in database and have to deal with it. There's 2 possibilities - you have an auto-generated unique id on the table or not. The first case is very easy
Let say we have table feedback where we collect all the feedbacks given by our users. This table has field feedback_id
which represents unique id of external system from which data was posted.
id | feedback_id | type | date |
---|---|---|---|
0001 | MX-001 | livechat | 2020-01-01 |
0002 | MX-001 | livechat | 2020-01-01 |
0003 | MX-001 | livechat | 2020-01-01 |
0004 | MX-002 | pendo | 2020-01-03 |
0005 | MX-002 | pendo | 2020-01-03 |
If we created an auto-generated Id on it then we safe
delete from feedback
where id in (
with dups as (
select id,
row_number() over (partition by feedback_id order by date desc) rn
from feedback
)
select id from dups where rn > 1
)
But what to do if there's no unique Id? Temporary table is an approach - first we create feedback_temp
where only unique records from all set of duplicates will be stored.
CREATE TABLE feedback_temp as (
with dups as (
select feedback_id
from feedback
group by 1
having count(*) > 1
)
, sorted as (
select row_number() over (partition by ff.feedback_id order by date desc) rn, f.*
from feedback f
right join dups d on d.feedback_id = f.feedback_id
)
select * from sorted where rn = 1
);
Next step is to remove the duplicates from initial table
delete from feedback
where feedback_id in (
select feedback_id
from feedback
group by 1
having count(*) > 1
);
Then we just need to insert data back and remove temp table
BEGIN TRANSACTION;
insert into feedback
(select * from feedback_temp);
drop table feedback_temp;
END TRANSACTION;