Check and set analyze statistics on tables-columns in Vertica

It is important to update statistics in any database for better query runs or query optimizations.  Vertica provides ANALYZE_STATISTICS and ANALYZE_HISTOGRAM to collect stats and in big data warehouse environment it is sometime possible that some tables may not have stats collected.  In that case you can run following query to check and add the tables/ columns to collect stats.

SELECT T1.table_schema, T1.table_name, T2.row_ct, T2.wos_ct, T2.ros_ct
-- Get list of tables that don't have 'FULL' stats
SELECT distinct table_schema, table_name
FROM projection_columns
WHERE (statistics_type = 'ROWCOUNT' or statistics_type = 'NONE')
-- and table_schema not like 'v_%'
AND (table_schema = 'schema_1' OR table_schema = 'schema_2' )
) T1
-- Get list of tables with storage details
SELECT anchor_table_schema, anchor_table_name,
sum(row_count) row_ct, sum(wos_row_count) wos_ct, sum(ros_row_count) ros_ct
FROM projection_storage
WHERE (anchor_table_schema = 'schema_1' OR anchor_table_schema = 'schema_2' )
) T2
ON T1.table_schema = T2.anchor_table_schema
AND T1.table_name = T2.anchor_table_name
ORDER BY 3 desc;

From the list generated above add those tables/ columns to your regular stats collecting script which you can run during off peak hours.

Vertica’s ANALYZE_HISTOGRAM returns similar data as ANALYZE_STATISTICS except that one can set the percent of data that is sampled to generate statistics metrics.  In the case of latter it uses fixed 10% sampling.

WOS: Write Optimized Store is a memory resident data structure used to support fast data loads without data compression and indexing.
ROS: Read Optimized Store is disk-based storage with data compressed and sorted.  Tuple Mover moves data out of WOS and into ROS in batches.


Removing duplicates in Vertica data sets

Many a time due to bug or otherwise duplicate rows gets introduced into tables and it is not hard to perform dedupe and let the ETL/ELT process continue without interruption.  In environment where rows can be identified by row_ids (Oracle) or table is not large (say, less than 100 million rows) simple techniques like copying the whole table to temporary table and then removing the dups are suffice to handle the situation. Thus, data users (Scientists, Analysis or Execs) are not negatively affected either due to data issue or downtime.

But some cases can be interesting, say, where table is large or very large, from few hundred millions of rows to few billions of rows and tables are wide with tens to hundreds of columns.  One technique that I use in Vertica environment is copying the duplicates to temp table and generating a pseudo row_id and then deleting the duplicates.  Recently a table with with more than 150 million rows and 60 columns had just 2 duplicates and needed dedupe without any downtime.  The steps I used were:

Step 1:  Check for dups
select count(1) from large-table-1;    — Just some sanity check before beginning the dedupe

— Find the duplicates
select keys, count(1) from large-table-1
where where-conditions
group by 1
having count(1) > 1
order by count(1) desc  ;

Step 2:  Dump the duplicates into temp table
create table test.large-table-1-dups
like large-table-1;

alter table test.large-table-1-dups     — add row_num column (pseudo row_id)
add column row_num int;

insert into test.large-table-1-dups
from large-table-1
where key in (‘k1’, ‘k2’);                    — where, say, k1 has n and k2 has m exact duplicates

Step 3: Remove duplicates from the temp table
delete from test.large-table-1-dups
where row_num > 1;

select * from test.dim_line_items_dups;    —  Sanity test.  Should have 1 row each of k1 & k2 rows

Step 4: Delete all duplicates from main table…
delete from large-table-1
— select * from large-table-1
where key in (‘k1’, ‘k2’);                      — key is like line_item_nk, etc.

Step 5: Insert data back into main table from temp dedupe data
alter table test.large-table-1-dups
drop column row_num;

insert into large-table-1
select * from test.large-table-1-dups;

Purging Vertica tables/ partitions at regular intervals

First thing first, especially, when there is potential to misunderstand due to definitions of specific words – purging vs truncating vs deleting data. In this article, my interest is to purge the tables of deleted rows.

Truncate => Removes all storage associated with a table, while preserving the table definitions.
Purge => Permanently removes deleted data from physical storage so that the disk space can be reused.  You can purge historical data up to and including the epoch in which the Ancient History Mark is contained.
Delete => It marks tuples as no longer valid in the current epoch. It does not delete data from disk storage for base tables.

Vertica purge_table or purge_partitions statement purges all projections of the specified table and can temporarily take significant disk space while performing the purge.

Following query returns list of queries that when executed purges data for each partitioned tables.

   — Look into only large tables (TABLE_SIZE_OF_INTEREST), say 1Billion row,
   —  that have good chunk deletes (>20%)
   SELECT ‘SELECT purge_partition(”’|| P.projection_schema ||’.’|| P.anchor_table_name || ”’,”’||PRT.partition_key||”’);’
    FROM projections P inner join
    (   SELECT projection_id, partition_key
        ,SUM(ros_row_count) as total_rows
        FROM partitions
        GROUP BY 1,2
        HAVING SUM(deleted_row_count)/SUM(ros_row_count) > 0.2
    ) PRT
    ON PRT.projection_id = P.projection_id
    GROUP BY PRT.partition_key, P.projection_schema, P.anchor_table_name
    HAVING max(total_rows) <= {TABLE_SIZE_OF_INTEREST}
    ORDER BY max(total_rows) DESC
    LIMIT 25 ;

Similarly for those tables that are non-partitioned use following query.

    SELECT ‘select purge_table(”’|| projection_schema ||’.’|| anchor_table_name || ”’);’
    FROM projections P inner join
    (   SELECT schema_name, projection_name ,
            sum(deleted_row_count) delete_rows ,
            sum(delete_vector_count) delete_vector_count ,
            sum(total_row_count) total_rows,
            sum(deleted_row_count)/sum(total_row_count) as delete_percent
        FROM storage_containers
        GROUP BY 1,2
        HAVING sum(deleted_row_count)/sum(total_row_count) > 0.2
    ) A
    ON A.schema_name = P.projection_schema
      AND A.projection_name = P.projection_name
    INNER JOIN tables T on P.anchor_table_id = T.table_id
    WHERE length(T.partition_expression) = 0
    GROUP BY 1
    HAVING max(total_rows) <= {TABLE_SIZE_OF_INTEREST}
    ORDER BY max(total_rows) DESC
    LIMIT 25 ;

With a wrapper around these queries in Python/ Perl/ Bash, etc one can easily go through the list executing each statement and cleaning up old deleted data for improved performance.