But some cases can be interesting, say, where table is large or very large, from few hundred millions of rows to few billions of rows and tables are wide with tens to hundreds of columns. One technique that I use in Vertica environment is copying the duplicates to temp table and generating a pseudo row_id and then deleting the duplicates. Recently a table with with more than 150 million rows and 60 columns had just 2 duplicates and needed dedupe without any downtime. The steps I used were:
Step 1: Check for dups
select count(1) from large-table-1; — Just some sanity check before beginning the dedupe
— Find the duplicates
select keys, count(1) from large-table-1
where where-conditions
group by 1
having count(1) > 1
order by count(1) desc ;
Step 2: Dump the duplicates into temp table
create table test.large-table-1-dups
like large-table-1;
alter table test.large-table-1-dups — add row_num column (pseudo row_id)
add column row_num int;
insert into test.large-table-1-dups
select *, ROW_NUMBER() OVER(PARTITION BY key)
from large-table-1
where key in (‘k1’, ‘k2’); — where, say, k1 has n and k2 has m exact duplicates
Step 3: Remove duplicates from the temp table
delete from test.large-table-1-dups
where row_num > 1;
select * from test.dim_line_items_dups; — Sanity test. Should have 1 row each of k1 & k2 rows
Step 4: Delete all duplicates from main table…
delete from large-table-1
— select * from large-table-1
where key in (‘k1’, ‘k2’); — key is like line_item_nk, etc.
Step 5: Insert data back into main table from temp dedupe data
alter table test.large-table-1-dups
drop column row_num;
insert into large-table-1
select * from test.large-table-1-dups;
Thanks for your informative guide on SQL technology. As we all know, SQL is most used database tool and your article offer clear insight about the technology. Your blog is recommended for freshers and experienced ones looking for ideal blog to learn SQL technology.
Java Training in Chennai
Dot Net Training in Chennai
PHP Training in Chennai
Your blog is awesome..You have clearly explained about it …Its very useful for me to know about new things..Keep on blogging..
PHP training in chennai
Angular Js is a powerful framework. New job opportunities are opening up and the Angular Js is the most important one in leading MNC companies.Thanks for taking time to share this post.
Regards,
Angularjs Training Chennai | Angularjs course in Chennai
Updating with the latest technology and implementing it is the only way to survive in our niche. Thanks for making me this article. You have done a great job by sharing this content in here. Keep writing article like this.
SAS Training in Chennai | SAS Training Institute in Chennai
Pretty article! I found some useful information in your blog, it was awesome to read, thanks for sharing this great content to my vision, keep sharing.
Regards,
PHP Training in Chennai | Webdesigning Training in Chennai
Thanks for posting this useful content, Good to know about new things here, Let me share this,
AngularJS Training in Chennai | AngularJS Training | Best AngularJS Training Institute in Chennai
great article
job oriented java j2ee training | java spring training | java hibernate training
Really an interesting post.
Regards,
SAS Training in Chennai | SAS course in Chennai | SAS Institutes in Chennai
Thanks for sharing the information about the hadoop training and keep updating us.This information is useful to me.
Hadoop Training Chennai
Hadoop training in Chennai
SAS (Statistical Analysis System) is the most popular tool for statistical modelling and data analysis and having a great scope in IT industry. It’s an application suite that can change, manage & retrieve data from the variety of origins & perform statistical analysis on it.
Regards,
SAS Training Institutes in Chennai | SAS Training in Chennai | SAS Courses in Chennai
Thanks for posting this useful content, Good to know about new things here, Let me share this,
AngularJS Training in Chennai | AngularJS Training | Best AngularJS Training Institute in Chennai