Can I delete the detached folder on clickhouse data - clickhouse

There are many ignored_ folders under the detached directory, and they take up too much space. Can I remove them?
[root@cl1-data4 billing_test]# du -h --max-depth=1 ./tb_pay_order_log_local/
40G ./tb_pay_order_log_local/detached

If you are definitely sure that these data will not be used any more, they can be deleted from the file system manually.
I would prefer to remove ClickHouse artifacts using the dedicated operation DROP DETACHED PARTITION:
-- get the list of detached partitions
SELECT database, table, partition_id
FROM system.detached_parts;

-- drop them one by one
ALTER TABLE {database}.{table} DROP DETACHED PARTITION '{partition_id}';
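For example, a hedged sketch against the table from the question (the partition_id value '202001' is hypothetical, and the session must allow the operation via allow_drop_detached, just as the automated script below does with --allow_drop_detached 1):
-- allow DROP DETACHED PARTITION in this session
SET allow_drop_detached = 1;

-- '202001' stands in for a partition_id returned by system.detached_parts
ALTER TABLE billing_test.tb_pay_order_log_local DROP DETACHED PARTITION '202001';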
Or automate it (the idea was borrowed from the ClickHouse GitHub issue "Attach all detached partitions #8183"):
# warning [be careful]: this script removes ALL detached parts of ALL tables
# to affect only one table, add "WHERE table = 'tb_pay_order_log_local'" to the query
clickhouse-client --format=TSVRaw \
-q"select 'ALTER TABLE ' || database || '.' || table || ' DROP DETACHED PARTITION \'' || partition_id || '\';\n' from system.detached_parts group by database, table, partition_id order by database, table, partition_id;" \
| clickhouse-client -mn --allow_drop_detached 1

You can delete the parts prefixed with ignored_.
Google translate:
Inactive parts are not deleted immediately, because when a new part is written, fsync is not called, i.e. for some time the new part exists only in the server's RAM (OS cache). So if the server (HW) reboots spontaneously, a freshly merged part can be lost or damaged. During startup ClickHouse checks the integrity of the parts; if it detects a problem with a merged part, it returns the inactive source parts to the active list and later merges them again. The broken part is then renamed (the prefix broken_ is added) and moved to the detached folder. If the integrity check detects no problems in the merged part, the original inactive parts are renamed (the prefix ignored_ is added) and moved to the detached folder.
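A hedged sketch for checking what is actually sitting in detached before deleting anything; the LIKE filter on the part name is an assumption based on the ignored_/broken_ prefixes described above:
-- list the detached parts of the table in question
SELECT database, table, partition_id, name
FROM system.detached_parts
WHERE table = 'tb_pay_order_log_local'
ORDER BY name;

-- count only the parts whose names carry the ignored prefix
SELECT count() AS ignored_parts
FROM system.detached_parts
WHERE table = 'tb_pay_order_log_local'
  AND name LIKE 'ignored%';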

Related

Maria DB, locking table for replacement?

I have a web application using a MariaDB 10.4.10 InnoDB table which is updated every 5 minutes from a script.
The script works like this:
Create a temp table from a table XY and write data to the temp table from a received CSV file. When the data is written, the script starts a transaction, drops the XY table, renames the temp table to XY, and commits the transaction.
Nevertheless, sometimes a user gets an "XY table does not exist" error while working with the application.
I already tried to LOCK the XY table in the transaction, but it doesn't change a thing.
How can I solve this? Is there any kind of locking I can use (I thought locking is no longer possible with InnoDB)?
Do this another way.
1. Create tempxy (not as a TEMPORARY table, but as a real table). Fill it as needed. Nobody else knows it's there, so you have all the time you need.
2. SET autocommit = 0; -- OR: LOCK TABLES xy WRITE, tempxy READ;
3. DELETE FROM xy;
4. INSERT INTO xy SELECT * FROM tempxy;
5. COMMIT WORK; -- OR: UNLOCK TABLES;
6. DROP TABLE tempxy;
This way, other customers will see the old table until point 5; then they'll start seeing the new table.
If you use LOCK, customers will stall from point 2 to point 5, which, depending on the time needed, might be bad.
At point 3, in some scenarios you might be able to optimize things by deleting only the rows that are not in tempxy, and running an INSERT ... ON DUPLICATE KEY UPDATE at point 4, as sketched below.
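A rough sketch of that optimization, assuming xy and tempxy share a primary key id and the data columns are col1 and col2 (all three names are made up for illustration):
-- point 3 variant: remove only the rows that disappeared from the new data set
DELETE xy FROM xy
LEFT JOIN tempxy ON xy.id = tempxy.id
WHERE tempxy.id IS NULL;

-- point 4 variant: upsert everything else instead of a blind INSERT
INSERT INTO xy (id, col1, col2)
SELECT id, col1, col2 FROM tempxy
ON DUPLICATE KEY UPDATE col1 = VALUES(col1), col2 = VALUES(col2);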
Funnily enough, I recently answered another question that was somewhat like yours.
autoincrement
To prevent autoincrement column from overflowing, you can replace COMMIT WORK with ALTER TABLE xy AUTO_INCREMENT=. This is a dirty hack and relies on the fact that this DDL command in MySQL/MariaDB will execute an implicit COMMIT immediately followed by the DDL command itself. If nobody else inserts in that table, it is completely safe. If somebody else inserts in that table at the exact same time your script is running, it should be safe in MySQL 5.7 and derived releases; it might not be in other releases and flavours, e.g. MySQL 8.0 or Percona.
In practice, you fill up tempxy using a new autoincrement starting from 1 (since tempxy has just been created), then perform the DELETE/INSERT, and update the autoincrement counter to the count of rows you've just inserted.
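ALTER TABLE does not accept a subquery for the AUTO_INCREMENT value, so here is a hedged sketch of one way to set it from the loaded row count (the prepared-statement wrapper is my assumption, not part of the original answer):
-- compute the next id from what was just loaded into tempxy
SELECT COUNT(*) + 1 INTO @next_id FROM tempxy;

-- ALTER TABLE needs a literal value, so build the statement dynamically;
-- running it also issues the implicit COMMIT described above
SET @sql = CONCAT('ALTER TABLE xy AUTO_INCREMENT = ', @next_id);
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;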
To be completely sure, you can use a cooperative lock around the DDL command, on the one hand, and anyone else wanting to perform an INSERT, on the other:
script thread                              other thread
SELECT GET_LOCK('xy-ddl', 30);
                                           SELECT GET_LOCK('xy-ddl', 30);
ALTER TABLE `xy` AUTO_INCREMENT=12345;     # thread waits while
                                           # script thread commits
                                           # and runs DDL
SELECT RELEASE_LOCK('xy-ddl');             # thread can acquire lock
                                           INSERT INTO ...
DROP TABLE tempxy;                         # gets id = 12346
                                           SELECT RELEASE_LOCK('xy-ddl');

Not able to vacuum table in greenplum

I have created a table in Greenplum and am performing insert, update and delete operations on it. I have run the VACUUM command on the table, and it reports that it ran successfully.
However, when I run select * from gp_toolkit.gp_bloat_diag;, it still displays the same table name.
Even after running VACUUM repeatedly, the table still appears in the output of select * from gp_toolkit.gp_bloat_diag;.
How should I make sure the table does not have any bloat and has been vacuumed properly?
For clarification:
VACUUM does remove bloat (the dead tuples in a table), and allows that space to be re-used by new tuples.
The difference between VACUUM and VACUUM FULL is that FULL re-writes the relfiles ( the table storage ) and reclaims the space for the OS.
gp_toolkit.gp_bloat_diag doesn't update immediately; it updates shortly after an ANALYZE, once the stats for the table have been refreshed (see the sketch below).
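A hedged sketch of that check cycle; the table name is a placeholder, and the bdirelname column is an assumption about how gp_bloat_diag identifies the relation:
-- reclaim dead tuples, then refresh stats so gp_bloat_diag can update
VACUUM my_schema.my_table;
ANALYZE my_schema.my_table;

-- re-check the bloat report for this table only
SELECT *
FROM gp_toolkit.gp_bloat_diag
WHERE bdirelname = 'my_table';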
I would only recommend running VACUUM FULL if the table is very small, or if a system catalog table has grown out of proportion and you don't have a lot of other options.
VACUUM FULL is a very expensive operation.
On a very large table it can lead to unexpectedly long run times, and the table is held under an exclusive lock the whole time.
In general, frequent VACUUM will save your tables from growing unnecessarily large. The dead tuples will be removed and the space will be reused.
If you have a large table with significant bloat and a lot of dead space, you will likely want to reorganize -- which is a less expensive way to reclaim space.
alter table <table_name> set with (reorganize=true) distributed randomly;
-- or
alter table <table_name> set with (reorganize=true) distributed by (<column_name1>, <column_name2>, ...);
Please refer to this to learn about the different options for removing bloat from a table.
VACUUM will not remove bloat but VACUUM FULL will. Check the example below.
Table Creation:
DROP TABLE IF EXISTS testbloat;
CREATE TABLE testbloat
(
id BIGSERIAL NOT NULL
, dat_year INTEGER
)
WITH (OIDS = FALSE)
DISTRIBUTED BY (id);
Inserting 1M records into table :
INSERT INTO testbloat (dat_year) VALUES(generate_series(1,1000000));
Checking the size of the table. Size is 43MB
SELECT 'After Inserting data',pg_size_pretty(pg_relation_size('testbloat'));
Updating all the records in the table
UPDATE testbloat
SET dat_year = dat_year+1;
Checking the size of the table after updating. Size is 85MB. It has increased because of the bloat caused by the update operation.
SELECT 'After updating data',pg_size_pretty(pg_relation_size('testbloat'));
Applying VACUUM on the table
Vacuum testbloat;
Checking the size of the table after VACUUM. Size is still 85MB.
SELECT 'After Vacuum', pg_size_pretty(pg_relation_size('testbloat'));
Applying VACUUM FULL on the table
Vacuum FULL testbloat;
Checking the size of the table after VACUUM FULL. Size is back to 43MB. It was reduced because VACUUM FULL removed the bloat.
SELECT 'After Vacuum FULL ', pg_size_pretty(pg_relation_size('testbloat'));
VACUUM never releases the space occupied by expired rows back to the operating system; rather, it marks that space to be reused for later inserts of new rows into the same table. Hence, even after you run VACUUM, the size of the table won't come down.
Instead of using VACUUM FULL, use CTAS; it is faster than VACUUM FULL and, unlike VACUUM FULL, it does not hold locks on the pg_class table.
After the CTAS operation, rename the new table to the old table name, as sketched below.
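A minimal sketch of that CTAS-and-rename approach, reusing the testbloat table from the example above (the DISTRIBUTED BY clause mirrors the original table definition):
-- copy the live rows into a fresh, bloat-free table
CREATE TABLE testbloat_new AS
SELECT * FROM testbloat
DISTRIBUTED BY (id);

-- swap the tables under the original name
DROP TABLE testbloat;
ALTER TABLE testbloat_new RENAME TO testbloat;
Note that indexes, constraints and privileges are not carried over by CTAS and would need to be recreated on the new table.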

Oracle 11g Deleting large amount of data without generating archive logs

I need to delete a large amount of data from my database on a regular basis. The process generates a huge volume of archive logs. We had a database crash at one point because there was no storage space available at the archive destination. How can I avoid generating logs while I delete data?
The data to be deleted is already marked as inactive in the database. Application code ignores inactive data. I do not need the ability to rollback the operation.
I cannot partition the data in such a way that inactive data falls in one partition that can be dropped. I have to delete the data with delete statements.
I can ask DBAs to set certain configuration at table level/schema level/tablespace level/server level if needed.
I am using Oracle 11g.
What proportion of the data in the table would be deleted, and what volume? Are there any referential integrity constraints to manage, or is this table childless?
Depending on the answers, you might consider:
"CREATE TABLE keep_data UNRECOVERABLE AS SELECT * FROM ... WHERE
[keep condition]"
Then drop the original table.
Then rename keep_data to the original table name.
Rebuild the indexes (again with UNRECOVERABLE to prevent redo), constraints, etc.; a sketch of the whole sequence follows.
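A hedged sketch of those steps with made-up names (big_table, a status column for the keep condition, and one index), purely for illustration; NOLOGGING is the current spelling of UNRECOVERABLE:
-- 1. copy the rows to keep, without generating redo for the data load
CREATE TABLE keep_data NOLOGGING AS
  SELECT * FROM big_table WHERE status = 'ACTIVE';

-- 2. drop the original table
DROP TABLE big_table PURGE;

-- 3. put the kept data under the original name
ALTER TABLE keep_data RENAME TO big_table;

-- 4. rebuild indexes and constraints, again minimizing redo
CREATE INDEX big_table_status_ix ON big_table (status) NOLOGGING;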
The problem with this approach is that it's a multi-step DDL process, which you will have a job making fault tolerant and reversible.
A safer option might be to use data-pump to:
Data-pump expdp to extract the "Keep" data
TRUNCATE the table
Data-pump impdp import of data from step 1, with direct-path
At this point I suggest you read the Oracle manual on Data Pump, particularly the section on Direct Path Loads to be sure this will work for you.
My preferred option would be partitioning.
Of course, the best way would be TenG's solution (CTAS, drop and rename the table), but it seems that's impossible for you.
Your only problems are the volume of archive logs and the resulting crash risk. In this case, maybe you could split your delete statement into batches (for example, 10,000 rows at a time).
Something like:
declare
  e number;
  f number;
begin
  select count(*) into e from myTable where [delete condition];
  f := trunc(e / 10000) + 1;
  for i in 1 .. f
  loop
    delete from myTable where [delete condition] and rownum <= 10000;
    commit;
    dbms_lock.sleep(600); -- purge old archive logs here if possible
  end loop;
end;
After this operation, you should reorganize your table, which will surely be fragmented; a sketch follows.
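A hedged sketch of two common ways to reorganize after a mass delete (myTable and myTable_ix are placeholders; SHRINK SPACE needs the segment to live in an ASSM tablespace, while MOVE works everywhere but leaves indexes unusable until rebuilt):
-- option A: shrink the segment in place
ALTER TABLE myTable ENABLE ROW MOVEMENT;
ALTER TABLE myTable SHRINK SPACE CASCADE;

-- option B: rebuild the segment, then rebuild each index
ALTER TABLE myTable MOVE;
ALTER INDEX myTable_ix REBUILD;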
Alter the table to set NOLOGGING, delete the rows, then turn logging back on.

How do I Drop Table from slony

I have a database which is being backed up by Slony. I dropped a table from the replicated DB and re-created the same table using SQL scripts, with nothing going through Slony scripts.
I found this on a post and tried it:
Recreate the table
Get the OID for the recreated table: SELECT OID FROM pg_class WHERE relname = '<your_table>' AND relkind = 'r';
Update the tab_reloid in sl_table for the problem table.
Execute SET DROP TABLE ( ORIGIN = N, ID = ZZZ); where N is the NODE # for the MASTER, and ZZZ is the ID # in sl_table.
But it doesn't seem to work.
How do I drop the table from the replicated DB? Or is there a way to use the newly created table in place of the old one?
The authoritative documentation on dropping things from Slony is here.
It's not really clear what state things were in before you ran the commands above, and you haven't clarified "doesn't seem to work".
There is one significant "gotcha" that I know of with dropping tables from replication with Slony. After you remove a table from replication, you can have trouble actually physically dropping the table on the slaves (but not on the master) with Slony 1.2, getting a cryptic error like this:
ERROR: "table_pkey" is an index
This may be fixed in Slony 2.0, but the problem here is that there is a referential integrity relationship between the unreplicated table on the slave and the replicated table, and Slony 1.2 has intentionally corrupted the system catalogs somewhat as part of its design, causing this issue.
A solution is to run the "DROP TABLE" command through slonik_execute_script. If you have already physically dropped the table on the master, you can use the option "EXECUTE ONLY ON" to run the command only on a specific slave. See the docs for EXECUTE SCRIPT for details; a rough sketch follows.
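A hedged sketch of the underlying slonik EXECUTE SCRIPT command; the cluster name, conninfo strings, node numbers, set id and script path are all placeholders, and the availability of EXECUTE ONLY ON depends on your Slony version:
cluster name = MYCLUSTER;
node 1 admin conninfo = 'dbname=DATABASENAME host=HOST1_MASTER user=postgres port=5432';
node 2 admin conninfo = 'dbname=DATABASENAME host=HOST2_SLAVE user=postgres port=5432';

# /tmp/drop_table.sql contains just: DROP TABLE myschema.mytable;
execute script (
    set id = 1,
    filename = '/tmp/drop_table.sql',
    event node = 1,
    execute only on = 2
);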
You have dropped the table from the database, but you haven't removed it from _YOURCLUSTERNAME.sl_table.
The "_" before YOURCLUSTERNAME is important.
4 STEPS to solve the mess:
1. Get the tab_id
select tab_id from _YOURCLUSTERNAME.sl_table where tab_relname='MYTABLENAME' and tab_nspname='MYSCHEMANAME'
It returns a number, e.g. 2 in MYDATABASE.
2. Delete triggers
select _YOURCLUSTERNAME.altertablerestore(2);
This can return an error, because it's trying to delete the triggers on the original table, and now there is a new one.
3. Delete slony index if were created
select _YOURCLUSTERNAME.tableDropKey(2);
This can return an error, because it's trying to delete an index on the original table, and now there is a new table.
4. Delete the table from sl_table
delete from _YOURCLUSTERNAME.sl_table where tab_id = 2;
The best way for dropping a table is:
1. Drop the table from the cluster:
select tab_id from _YOURCLUSTERNAME.sl_table where tab_relname='MYTABLENAME' and tab_nspname='MYSCHEMANAME'
It returns a number, e.g. 2 in MYDATABASE.
Execute with slonik < myfile.slonik
where myfile.slonik is:
cluster name=MYCLUSTER;
NODE 1 ADMIN CONNINFO = 'dbname=DATABASENAME host=HOST1_MASTER user=postgres port=5432';
NODE 2 ADMIN CONNINFO = 'dbname=DATABASENAME host=HOST2_SLAVE user=postgres port=5432';
SET DROP TABLE (id = 2, origin = 1);
where 2 is the tab_id from sl_table and 1 is node 1, HOST1_MASTER.
2. Drop the table from the slave with a plain SQL DROP TABLE.

PostgreSQL temporary tables

I need to perform a query 2.5 million times. This query generates some rows, of which I need the AVG(column), and I then use that AVG to filter out all values below the average. I then need to INSERT these filtered results into a table.
The only way to do such a thing with reasonable efficiency seems to be by creating a TEMPORARY TABLE for each query-postmaster python-thread. I am just hoping these TEMPORARY TABLEs will not be persisted to the hard drive (at all) and will remain in memory (RAM), unless they run out of working memory, of course.
I would like to know whether a TEMPORARY TABLE will incur disk writes (which would interfere with the INSERTs, i.e. slow the whole process down).
Please note that, in Postgres, the default behaviour for temporary tables is that they are not automatically dropped, and their data is persisted, on commit. See ON COMMIT.
Temporary tables are, however, dropped at the end of a database session:
Temporary tables are automatically dropped at the end of a session, or
optionally at the end of the current transaction.
There are multiple considerations you have to take into account:
If you do want to explicitly DROP a temporary table at the end of a transaction, create it with the CREATE TEMPORARY TABLE ... ON COMMIT DROP syntax.
In the presence of connection pooling, a database session may span multiple client sessions; to avoid clashes in CREATE, you should drop your temporary tables -- either prior to returning a connection to the pool (e.g. by doing everything inside a transaction and using the ON COMMIT DROP creation syntax), or on an as-needed basis (by preceding any CREATE TEMPORARY TABLE statement with a corresponding DROP TABLE IF EXISTS, which has the advantage of also working outside transactions e.g. if the connection is used in auto-commit mode.)
While the temporary table is in use, how much of it will fit in memory before overflowing on to disk? See the temp_buffers option in postgresql.conf
Anything else I should worry about when working often with temp tables? A VACUUM is recommended after you have DROPped temporary tables, to clean up any dead tuples from the catalog. Postgres will automatically vacuum every 3 minutes or so for you when using the default settings (autovacuum).
Also, unrelated to your question (but possibly related to your project): keep in mind that, if you have to run queries against a temp table after you have populated it, then it is a good idea to create appropriate indices and issue an ANALYZE on the temp table in question after you're done inserting into it. By default, the cost-based optimizer will assume that a newly created temp table has ~1000 rows, and this may result in poor performance should the temp table actually contain millions of rows. See the sketch below.
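A minimal sketch pulling these points together for the workload described in the question; source_table, batch_id, val and results are all made-up names:
BEGIN;

-- ON COMMIT DROP: the temp table disappears at COMMIT, so it never clashes across pooled sessions
CREATE TEMPORARY TABLE tmp_rows ON COMMIT DROP AS
    SELECT id, val
    FROM source_table
    WHERE batch_id = 42;

-- index and ANALYZE so the planner knows the real row count before the next query
CREATE INDEX ON tmp_rows (val);
ANALYZE tmp_rows;

-- keep only rows at or above the average and persist them
INSERT INTO results (id, val)
SELECT id, val
FROM tmp_rows
WHERE val >= (SELECT AVG(val) FROM tmp_rows);

COMMIT;  -- tmp_rows is dropped here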
Temporary tables provide only one guarantee - they are dropped at the end of the session. For a small table you'll probably have most of your data in the backing store. For a large table I guarantee that data will be flushed to disk periodically as the database engine needs more working space for other requests.
EDIT:
If you're absolutely in need of RAM-only temporary tables, you can create a tablespace for your database on a RAM disk (/dev/shm works), as sketched below. This reduces the amount of disk IO, but beware that it is currently not possible to avoid a physical disk write entirely; the DB engine will flush the table list to stable storage when you create the temporary table.
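A hedged sketch of that setup; the directory and tablespace name are placeholders, the directory must already exist and be owned by the postgres OS user, and anything under /dev/shm is lost on reboot:
-- run once as a superuser
CREATE TABLESPACE ramspace LOCATION '/dev/shm/pg_tmp';

-- then, per session (or via postgresql.conf), send temporary objects to the RAM-backed tablespace
SET temp_tablespaces = 'ramspace';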
