Not able to vacuum table in greenplum - greenplum

I have created a table in Greenplum and performing insert update delete operation on it. I have run vacuum command on the table, showing it ran successfully.
However when I run the command select * from gp_toolkit.gp_bloat_diag;. It displays same table name.
After repeatedly running vacuum also display table name in list from command select * from gp_toolkit.gp_bloat_diag;
How should I make sure table does not have any bloat and vacuumed properly?

For clarification:
VACUUM does remove bloat (the dead tuples in a table), and allows that space to be re-used by new tuples.
The difference between VACUUM and VACUUM FULL is that FULL re-writes the relfiles ( the table storage ) and reclaims the space for the OS.
gp_toolkit.gp_bloat_diag doesn't update in immediately, but updates shortly after an ANALYZE when the stats for the table have been updated.
I would only recommend that you run VACUUM FULL if the table is very small or if a system catalog table has grown out of proportion, where you don't have a lot of options.
VACUUM FULL is a very expensive operation.
On a very large table can lead to unexpected run time and during this run the table will be on exclusive lock the whole time.
In general, frequent VACUUM will save your tables from growing unnecessarily large. The dead tuples will be removed and the space will be reused.
If you have a large table with significant bloat and a lot of dead space, you will likely want to reorganize -- which is a less expensive way to reclaim space.
alter table <table_name> set with (reorganize=true) distributed (randomly -- or -- by (<column_names1>,<column_names2>....)

Please refer this to know Different option to remove bloat from a table
VACUUM will not remove bloat but VACUUM FULL will. Check below example
Table Creation:
DROP TABLE IF EXISTS testbloat;
CREATE TABLE testbloat
(
id BIGSERIAL NOT NULL
, dat_year INTEGER
)
WITH (OIDS = FALSE)
DISTRIBUTED BY (id);
Inserting 1M records into table :
INSERT INTO testbloat (dat_year) VALUES(generate_series(1,1000000));
Checking the size of the table. Size is 43MB
SELECT 'After Inserting data',pg_size_pretty(pg_relation_size('testbloat'));
Updating all the records in the table
UPDATE testbloat
SET dat_year = dat_year+1;
Checking the size of the table after updating. Size is 85MB. It is increased because of bloat which was caused because of update operation
SELECT 'After updating data',pg_size_pretty(pg_relation_size('testbloat'));
Applying VACUUM on the table
Vacuum testbloat;
Checking the size of the table after VACUUM. Size is still 85MB.
SELECT 'After Vacuum', pg_size_pretty(pg_relation_size('testbloat'));
Applying VACUUM FULL on the table
Vacuum FULL testbloat;
Checking the size of the table after VACUUM FULL. Size is still 43MB. It got reduced as the table bloat was not there
SELECT 'After Vacuum FULL ', pg_size_pretty(pg_relation_size('testbloat'));

Vacuum never releases the space occupied by expired rows rather it mark those space to be reused for later insert of new rows to the same table itself. Hence, even after you run vaccum, the size of the table won't come down.
Instead of using vaccum full , use CTAS , it is faster than vaccum full and unlike vaccum full it does not hold locks on pg_class table.
And after CTAS operations, rename the table to the old table name.

Related

Sessions are locking each other while direct insert into subpartition with name of subpartition specified

We have a one large table that we need to insert data of it into another table.
Target table is partitioned by range (by day) and subpartitioned by departments.
For loading table data, we have used dbms_parallelel_execute and created a task using sql that gets list of departments, level is 20, that is at one time only 20 tasks corresponding to 20 departments will run. Those task will select the department's data from source table and inserts into target table.
Before doing insert, we first get subpartition name and generate the following insert:
INSERT /*+ NO_GATHER_OPTIMIZER_STATISTICS ENABLE_PARALLEL_DML APPEND_VALUES */ into Target_Table subpartition (subpartition_name) values (:B1, :B2, :B3, ....) ;
We read on oracle documentation that specifiying subpartition during insert will lock only that subpartition and append will work . The goal of doing this was to create n jobs that will independently insert into their own give subpartitions. Append itself is working, but when we monitor v$session while loading table data, we see that
BLOCKING_SESSION_STATUS is VALID;
FINAL_BLOCKING_SESSION_STATUS is VALID;
EVENT# is library cache lock
STATE is WAITING,
WAIT_CLASS is Concurrency
From this, we are concluding that one append_values is still locking other sessions to insert to another subpartition, is there something we missed? We have enabled parallel dml, disabled target table's indexes, set skip_unusable_indexes to true, no referential constraints are present in target table, table, partitions and subpartitions are set to nologging.
EDIT: Tested the same thing with another table that is also partitioned, but it doesn't have subpartitions, it is only list partitioned. So instead of subpartition (subpartition_name) inside insert statement there was partition(partition_name) . However, in this case , insert run without sessions waiting for others and no locks were held. I am assuming with subpartitioned interval tables the above won't work.
EDIT2 I have created the same scenario in another database which is also Oracle 19c. Created a table with partitions and subpartitions, set the interval, disabled indexes, set nologging and run the job that inserts into subpartitions. Surprisingly, the insert went without errors and no sessions locking each other. Now I am thinking maybe its some database parameter that should be turned on or changed. Because database versions, table structures, jobs, inserts are the same, but in one it is locking each other, in another it is not.
UPDATE Adding the insert part of the code :
if c_tab_cursor %isopen then
close c_tab_cursor;
end if;
open c_tab_cursor;
loop
fetch c_tab_cursor bulk collect
into v_row limit 100000;
exit when(v_row.count = 0);
forall i in v_row.First .. v_row.Last
insert /*+ NO_GATHER_OPTIMIZER_STATISTICS APPEND_VALUES */ into
Target_Table subpartition(SYS_P68457)
values v_row
(i);
commit;
end loop;
close c_tab_cursor;
Edit3 Adding table info, table is daily partitioned, and each partition has around 150 subpartitions. At the time of writing this, table had total 177845 subpartitions. My other guess is oracle is spending many time to find the right subpartition, which is also arguable because subpartition name is provided during insert.
I'd say it is expected "feature" - when you insert into the same segment. Direct path insert writes data beyond HWM(high water mark) rather than using segment's free space map.
When you commit direct path insert HWM advances, when you rollback HWM stays and data is discarded.
Check Oracle segment parameter "FREELIST", but I'm afraid even this parameter wont help you.
When your inserts touch different subpartitions this should not be happening.
There can be various objects held by library cache lock (maybe due to bug).
IMHO only way how to investigate this would be either to use hanganalyze to check which function in oracle is being blocked or to query P1,P2,P3 parameters of library cache lock and identify which object is blocking parallel run.
PS: I saw bugs like: Only one session could run Java stored procedure at the time because Oracle unnecessarily wanted to hold exclusive lock on some library case object.
v$session reports the wait state at that precise instant that you query it. It's meaningless unless you keep requerying and keep seeing the same thing. Better yet, use v$active_session_history to see Oracle's own 1-second sampling of the wait state. If you see lots of rows with that wait, then it's meaningful.
Assuming that this is meaningful, I would point out that you are using a single row VALUES list and yet are asking for parallel dml. Parallel dml is for multiple row operations, not single row operations. You can use it for an insert-select, for example, but not an insert-values.
If your application is necessarily single-row driven, remove ENABLE_PARALLEL_DML APPEND_VALUES hints. If you are binding arrays to these variables, you can leave the APPEND_VALUES but remove the ENABLE_PARALLEL_DML. For inserts, parallel DML only works with insert-select.
As you clearly intend to have multiple sessions, each loading a separate subpartitions, that's your parallelism right there - you don't need nor want to add another layer of parallelism with PDML.

Gather Stats Required After Truncate and Insert?

I needed to truncate and reload a table.
I learned that truncate needs stats gathering on the table as its successor process so the database gets the actual statistics, otherwise previous stats are not cleared by the truncate statement.
After doing these two operations (truncate and stats gathering on the empty table), ran the insert... but don't see new statistics in all_tab_statistics table for my table. Sample_size is still 0.
Why is that? Shouldn't have Oracle done the automatic stats gathering after the insert?
Do I need to rerun the stats or is it just fine considering the performance around this table (please note it's going to truncate and reload each time)?
Consider the following approach. It has the advantage of the table always being present.
Create an empty new table like the old one.
Load the data into the new table. This is the slowest step.
Do whatever cleanup you might need, such as refreshing the statistics.
RENAME tables to swap the new table in place. This step is fast enough so you won't notice.
I know it's a long time since I posted my question above. But recently, we again faced the similar situation and this time below steps worked towards a much better performance on a table with 800 million rows.
Take a backup of the original table.
Truncate the original table.
Gather stats on the truncated table, so that statistics show 0 in the DB. Us CASCADE=>TRUE in the command to also include indexes in the process.
Drop the indexes on the truncated table and Insert the required data from the backup table.
Recreate the indexes and gather stats again (ofcourse, with CASCADE=>TRUE; however recreation of the indexes should ideally have calculated the appropriate stats).
Drop the backup table if not needed.

Removing bloat from greenplum table

I have created some tables in Greenplum, performing insert update and delete operation. Regularly I am also performing vacuum operation. I Found bloat in it. Found solution to remove bloat https://discuss.pivotal.io/hc/en-us/articles/206578327-What-are-the-different-option-to-remove-bloat-from-a-table
However, if I truncate the table and reinsert the data, it removes bloat. Is it good practice to truncate the data from the table?
If you are performing UPDATE and DELETE statements on a heap table (default storage) and running VACUUM regularly, you will get some bloat by design. Heap storage, which is similar to the default PostgreSQL storage mechanism, provides read consistency using Multi-Version Concurrency Control (MVCC).
When you UPDATE or DELETE a record, the old value is still in the table and is able to be read by transactions that are still inflight and started before you issued the UPDATE or DELETE command. This provides the read consistency to the table.
When you execute a VACUUM statement, the database will mark the stale rows as available to be overwritten. It doesn't shrink the files. It just marks rows so they can be overwritten. The next time you execute an INSERT or UPDATE, the stale rows are now able to be used for the new data.
So if you UPDATE or DELETE 10% of a table between running VACUUM, you will probably have about 10% bloat.
Greenplum also has Append-Optimized (AO) storage which doesn't use MVCC and uses a visibility map instead. The files are bit smaller too so you should get better performance. The stale rows are hidden with the visibility map and VACUUM won't do anything until you hit the gp_appendonly_compaction_threshold percentage. The default is 10%. When you have 10% bloat in an AO table and execute VACUUM, the table will automatically get rebuilt for you.
Append-Optimized is called "appendonly" for backwards compatibility reasons but it does allow UPDATE and DELETE. Here is an example of an AO table:
CREATE TABLE sales
(txn_id int, qty int, date date)
WITH (appendonly=true)
DISTRIBUTED BY (txn_id);
Instead of truncate it is better to use drop the table, create the table and then insert the data.

Bulk update in Oracle12c

I have a situation like to update a column(all rows) in a table having 150 million records.
Creation of duplicate table with updates and dropping of previous table is the best way but there is no available disk space to hold the duplicate table.
So how to perform the update in less time? Partitions are there on the table.
I am using oracle 12c
The cleanest approach is NOT updating the table, but creating a new table with the new column of updated rows. For instance, let's say I needed to update a column called old_value with the max of some value, instead of updating the old_table one does:
create new_table as select foo, bar, max(old_value) from old_table;
drop table old_table;
rename new_table as old_table.
If you need even more speed, you can do this creation using a parallel query with nologging thereby generating very little redo and no undo logs. More details can be ascertained here: https://asktom.oracle.com/pls/asktom/f?p=100:11:0::NO::P11_QUESTION_ID:6407993912330

Oracle 11g Deleting large amount of data without generating archive logs

I need to delete a large amount of data from my database on a regular basis. The process generates huge volume of archive logs. We had a database crash at one point because there was no storage space available on archive destination. How can I avoid generation of logs while I delete data?
The data to be deleted is already marked as inactive in the database. Application code ignores inactive data. I do not need the ability to rollback the operation.
I cannot partition the data in such a way that inactive data falls in one partition that can be dropped. I have to delete the data with delete statements.
I can ask DBAs to set certain configuration at table level/schema level/tablespace level/server level if needed.
I am using Oracle 11g.
What proportion of the data on the table would be deleted, what volume? Are there any referential integrity constraints to manage or is this table childless?
Depending on the answers , you might consider:
"CREATE TABLE keep_data UNRECOVERABLE AS SELECT * FROM ... WHERE
[keep condition]"
Then drop the original table
Then rename keep_table to original table
Rebuild the indexes (again with unrecoverable to prevent redo),constraints etc.
The problem with this approach is it's a multi-step DDL, process, which you will have a job to make fault tolerant and reversible.
A safer option might be to use data-pump to:
Data-pump expdp to extract the "Keep" data
TRUNCATE the table
Data-pump impdp import of data from step 1, with direct-path
At this point I suggest you read the Oracle manual on Data Pump, particularly the section on Direct Path Loads to be sure this will work for you.
MY preferred option would be partitioning.
Of course, the best way would be TenG solution (CTAS, drop and rename table) but it seems it's impossible for you.
Your only problem is the amount of archive logs and database crash problem. In this case, maybe you could partition your delete statement (for example per 10.000 rows).
Something like:
declare
e number;
i number
begin
select count(*) from myTable where [delete condition];
f :=trunc(e/10000)+1;
for i in 1.. f
loop
delete from myTable where [delete condition] and rownum<=10000;
commit;
dbms_lock.sleep(600); -- purge old archive if it's possible
end loop;
end;
After this operation, you should reorganize your table which is surely fragmented.
Alter the table to set NOLOGGING, delete the rows, then turn logging back on.

Resources