How to manually delete database when ClickHouse server is shut down? - clickhouse

When a database has thousands of tables (e.g. due to incorrect behavior of a client), the server cannot start in a reasonable time (minutes) because it loads table metadata(?) on startup. At least, this was the case when the huge number of tables exhausted all available disk space on the partition where ClickHouse stores data.
Messages like the following are written to the clickhouse-server.log file for each table while the server is starting and not yet accepting client connections:
2022.07.13 16:18:14.623869 [ 409851 ] {} <Debug> test_db.DL_3050BCE0 (66b2363e-17da-41aa-bb4b-951f898359a6): Loading data parts
2022.07.13 16:18:14.624721 [ 409851 ] {} <Debug> test_db.DL_3050BCE0 (66b2363e-17da-41aa-bb4b-951f898359a6): Loaded data parts (1 items)
2022.07.13 16:18:14.625279 [ 409848 ] {} <Debug> test_db.DL_30535FE0 (6ae3cd38-b0ec-494c-ab13-a1455f6b6e09): Loading data parts
In such a case, it is not possible to issue a "drop database" SQL statement.
Problem statement
How to manually delete a given database created with Atomic engine (default setting) consisting of MergeTree family tables when the ClickHouse server is not started?
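For orientation, a database created with the Atomic engine leaves traces in three places, which is why the failed approach below touches all of them. The listing is only a sketch based on the paths used later in the question; the exact layout on this server is an assumption:
# Table definitions: one <table>.sql file per table, each starting with the table UUID
ls /mnt/volume_01/clh_data/metadata/test_db/

# Table data: one directory per table UUID, grouped under a shared directory named
# after the first three characters of the UUID
ls /mnt/volume_01/test_db/store/add/add024a9-1578-42ac-a7c5-0b174bfa0d84/

# Per-table entries under the classic data/ layout (symlinks into store/ for Atomic databases)
ls -l /mnt/volume_01/clh_data/data/test_db/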
Failed approach
(DO NOT FOLLOW THESE STEPS AS YOU MAY LOSE ALL YOUR DATA)
I tried to locate the metadata SQL files of the tables I wanted to drop and to delete the table folders from the /store/, /metadata/, and /data/ directories, using the following steps:
Find the table UUIDs in the metadata SQL files and generate a batch file with rm commands to remove the corresponding directories from the /store/ subdirectory:
find /mnt/volume_01/clh_data/metadata/test_db/ -name "DL*.*" | xargs head -n1 | grep UUID | cut -d \' -f2 | ./store_rm.py > /tmp/test_store_rm.sh
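For reference, the grep UUID | cut -d \' -f2 part works because the first line of each table's metadata file typically looks like the following (the UUID is taken from the log excerpt above, so it is illustrative):
head -n1 /mnt/volume_01/clh_data/metadata/test_db/DL_3050BCE0.sql
# ATTACH TABLE _ UUID '66b2363e-17da-41aa-bb4b-951f898359a6' ...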
store_rm.py:
#!/usr/bin/python3
import sys

for l in sys.stdin:
    cmd = "find /mnt/volume_01/test_db/store/ -name "+l.strip()+" -print -quit | cut -d \/ -f1-6 | xargs rm -r \n"
    sys.stdout.write(cmd)
The resulting /tmp/test_store_rm.sh (after manually adding the header):
#!/bin/bash
find /mnt/volume_01/test_db/store/ -name add024a9-1578-42ac-a7c5-0b174bfa0d84 -print -quit | cut -d \/ -f1-6 | xargs rm -r
find /mnt/volume_01/test_db/store/ -name fafd4ca8-ec54-41cc-a29e-6c67e6b2cecd -print -quit | cut -d \/ -f1-6 | xargs rm -r
find /mnt/volume_01/test_db/store/ -name 87650b47-3b47-42b7-97b4-1e3fa5db3216 -print -quit | cut -d \/ -f1-6 | xargs rm -r
Find the tables the same way as above and generate rm statements for each table directory in the /data/ subdirectory:
find /mnt/volume_01/clh_data/metadata/test_db/ -name "DL*.*" | cut -d \/ -f7 | cut -d \. -f1 | ./data_rm.py > /tmp/test_data_rm.sh
data_rm.py:
#!/usr/bin/python3
import sys

for l in sys.stdin:
    cmd = "rm -r /mnt/volume_01/clh_data/data/test_db/"+l.strip()+"\n"
    sys.stdout.write(cmd)
Content of /tmp/test_data_rm.sh:
#!/bin/bash
rm -r /mnt/volume_01/clh_data/data/test_db/DL_62CDEEE0
rm -r /mnt/volume_01/clh_data/data/test_db/DL_62CF4060
Delete the metadata SQL files for each table:
find /mnt/volume_01/clh_data/metadata/test_db/ -name "DL*.*" | xargs rm -f
The problem with the approach above is that after executing /tmp/test_store_rm.sh, TABLES FROM ALL DATABASES were gone (most likely because cut -d \/ -f1-6 truncates each matched path to the shared three-character prefix directory under store/, which also contains the UUID directories of tables from other databases). This was not fatal here because the affected node was a replica, but if the other node had failed to execute the "drop database" statement, or had been restarted, the recovery process would have been even more complicated than outlined here.
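If the goal was to remove only the named table, a per-UUID variant along these lines (a sketch, not verified against this setup) would delete exactly the UUID directory and leave the shared prefix directories alone:
# remove only the directory whose name is the table UUID (depth: store/<prefix>/<uuid>)
find /mnt/volume_01/test_db/store/ -mindepth 2 -maxdepth 2 -type d \
    -name add024a9-1578-42ac-a7c5-0b174bfa0d84 -exec rm -r {} +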

Related

bash command inside a function to delete all files except the recent 5

I have a delete-backup-files function which takes as arguments a directory name and a pattern for the backup files of a specific type in that directory, and is called like this: delete_old_backup_files $(dirname $abc) "$abc.*"
The function body is:
local fpath=$1
local fexpr=$2
# delete backup files older than a day
find $fpath -name "${fexpr##*/}" -mmin +1 -type f | xargs rm -f
Currently it deletes files that are older than a day. Now I want to modify the function so that it deletes all backup files of type $abc.*, except the last 5 backup files created. I tried various commands using stat or -printf but couldn't succeed.
What is the correct way of completing this function?
Assuming the filenames do not contain newline characters, would you please try:
delete_old_backup_files() {
    local fpath=$1
    local fexpr=$2
    find "$fpath" -type f -name "${fexpr##*/}" -printf "%T@\t%p\n" | sort -nr | tail -n +6 | cut -f2- | xargs rm -f --
}
-printf "%T#\t%p\n" prints the seconds since epoch (%T#) followed
by a tab character (\t) then the filename (%p) and a newline (\n).
sort -nr numerically sorts the lines in descending order (newer first,
older last).
tail -n +6 prints the 6th and following lines.
cut -f2- removes the prepended timestamp leaving the filename only.
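As a usage example (the paths are hypothetical), this call would keep only the 5 newest db_backup.* files in /var/backups and delete the rest:
abc=/var/backups/db_backup          # hypothetical value of $abc from the question
delete_old_backup_files "$(dirname "$abc")" "$abc.*"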
[Edit]
In the case of macOS, please try this instead (not tested):
find "$fpath" -type f -print0 | xargs -0 stat -f "%m%t%N" | sort -nr | tail -n +6 | cut -f2- | xargs rm --
In the stat command, %m is expanded to the modification time (seconds since epoch), %t is replaced with a tab, and %N to be a filename.
I would use sorting instead of find. You can use ls -t
$ touch a b c
$ sleep 3
$ touch d e f
$ ls -t | tr ' ' '\n' | tail -n +4
a
b
c
$ ls -t | tr ' ' '\n' | tail -n +4 | xargs rm
$ ls
d e f
From man ls:
-t sort by modification time, newest first
Make sure you create backups before you delete stuff :-)

Linux copy directory structure and create symlinks to existing files with different extension

I have a large directory structure similar to the following
/home/user/abc/src1
    /file_a.xxx
    /file_b.xxx
/home/user/abc/src2
    /file_a.xxx
    /file_b.xxx
It contains multiple srcX folders and many files; most of the files have a .xxx extension, and these are the ones that I am interested in.
I would like to create an identical directory structure in say /tmp. This part I have been able to accomplish via rsync
rsync -av -f"+ */" -f"- *" /home/user/abc/ /tmp/xyz/
The next step is what I can't figure out. I need the directory structure in /tmp/xyz to have symlinks to all the files in /home/user/abc with a different file extension (.zzz). The directory structure would look as follows:
/tmp/xyz/src1
    /file_a.zzz -> /home/user/abc/src1/file_a.xxx
    /file_b.zzz -> /home/user/abc/src1/file_b.xxx
/tmp/xyz/src2
    /file_a.zzz -> /home/user/abc/src2/file_a.xxx
    /file_b.zzz -> /home/user/abc/src2/file_b.xxx
I understand that I could just copy the data and do a batch rename. That is not an acceptable solution.
How do I recursively create symlinks for all the .xxx files in /home/user/abc and link them into /tmp/xyz with a .zzz extension?
The find + exec seems like what I want but I can't put 2 and 2 together on this one.
This could work; note that the basename substitution has to happen per file, inside a small shell started by xargs, rather than once when the outer command line is expanded:
cd /tmp/xyz/src1
find /home/user/abc/src1/ -type f -name '*.xxx' -print0 | xargs -r0 -I '{}' sh -c 'ln -s "$1" "$(basename "$1" .xxx).zzz"' _ '{}'
Navigate to /tmp/xyz/ then run the following script:
#!/usr/bin/env bash
# First make src* folders in present directory:
mkdir -p $(find ~/user/abc/src* -type d -name "src*" | rev | cut -d"/" -f1 | rev)
# Then make symbolic links:
while read -r -d' ' file; do
    ln -s ${file} $(echo ${file} | rev | cut -d/ -f-2 | rev | sed 's/\.xxx/\.zzz/')
done <<< $(echo "$(find ~/user/abc/src* -type f -name '*.xxx') dummy")
Thanks for the input, all. Based on the ideas I saw, I was able to come up with a script that fits my needs.
#!/bin/bash

GLOBAL_SRC_DIR="/home/usr/abc"
GLOBAL_DEST_DIR="/tmp/xyz"

create_symlinks ()
{
    local SRC_DIR="${1}"
    local DEST_DIR="${2}"

    # read in our file, use null terminator
    while IFS= read -r -d $'\0' file; do
        # If file ends with .xxx or .yyy
        if [[ ${file} =~ .*\.(xxx|yyy) ]] ; then
            basePath="$(dirname "${file}")"
            fileName="$(basename "${file}")"
            completeSourcePath="${basePath}/${fileName}"
            #echo "${completeSourcePath}"

            # strip off preceding text
            partialDestPath=$(echo "${basePath}" | sed -r "s|^${SRC_DIR}||")
            fullDestPath="${DEST_DIR}/${partialDestPath}"

            # rename file from .xxx to .zzz; don't rename, just link, .yyy
            cppFileName=$(echo "${fileName}" | sed -r "s|\.xxx$|\.zzz|")
            completeDestinationPath="${fullDestPath}/${cppFileName}"

            ln -s "${completeSourcePath}" "${completeDestinationPath}"
        fi
    done < <(find "${SRC_DIR}" -type f -print0)
}

main ()
{
    create_symlinks "${GLOBAL_SRC_DIR}" "${GLOBAL_DEST_DIR}"
}

main
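A quick way to spot-check the result (a sketch; adjust the destination path if yours differs) is to list a few of the created links and their targets:
find /tmp/xyz -type l -exec ls -l {} + | head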

Find duplicates of a specific file on macOS

I have a directory that contains files and other directories. And I have one specific file that I know has duplicates somewhere in the given directory tree.
How can I find these duplicates using Bash on macOS?
Basically, I'm looking for something like this (pseudo-code):
$ find-duplicates --of foo.txt --in ~/some/dir --recursive
I have seen that there are tools such as fdupes, but I'm neither interested in any duplicate files (only duplicates of a specific file) nor am I interested in duplicates anywhere on disk (only within the given directory or its subdirectories).
How do I do this?
For a solution compatible with macOS built-in shell utilities, try this:
find DIR -type f -print0 | xargs -0 md5 -r | grep "$(md5 -q FILE)"
where:
DIR is the directory you are interested in;
FILE is the file (path) you are searching for duplicates of.
If you only need the duplicate file paths, then pipe through this as well:
cut -d' ' -f2
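Putting it together with the names from the question (foo.txt in ~/some/dir), and using -f2- so that paths containing spaces survive the cut, a full invocation might look like this (a sketch):
find ~/some/dir -type f -print0 | xargs -0 md5 -r | grep "$(md5 -q foo.txt)" | cut -d' ' -f2-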
If you're looking for a specific filename, you could do:
find ~/some/dir -name foo.txt
which would return a list of all files with the name foo.txt in the directory. If you want to check whether there are multiple files in the directory with the same name, you could do:
find ~/some/dir -exec basename {} \; | sort | uniq -d
This will give you a list of files with duplicate names (you can then use find again to figure out where those live).
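To then see where each duplicated name lives, one option (a sketch, assuming filenames without newlines) is to feed the duplicate names back into find:
find ~/some/dir -type f -exec basename {} \; | sort | uniq -d | while read -r name; do
    find ~/some/dir -type f -name "$name"
done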
---- EDIT -----
If you're looking for identical files (with the same md5 sum), you could also do:
find . -type f -exec md5sum {} \; | sort | uniq -d --check-chars=32
--- EDIT 2 ----
If your md5sum doesn't output the filename, you can use:
find . -type f -exec echo -n "{} " \; -exec md5sum {} \; | awk {'print $2 $1'} | sort | uniq -d --check-chars=32
--- EDIT 3 ----
If you're looking for a file with a specific md5 sum:
sum=`md5sum foo.txt | cut -f1 -d " "`
find ~/some/dir -type f -exec md5sum {} \; | grep $sum

How to delete all files except the N newest files?

This command allows me to log in to a server and change to a specific directory from my PC:
ssh -t xxx.xxx.xxx.xxx "cd /directory_wanted ; bash"
How can I then do this operation in that directory? I want to be able to basically delete all files except the N newest.
find ./tmp/ -maxdepth 1 -type f -iname '*.tgz' | sort -n | head -n -10 | xargs rm -f
This command should work:
ls -t *.tgz | tail -n +11 | xargs rm -f
Warning: Before doing rm -f, confirm that the files being listed by ls -t *.tgz | tail -n +11 are as expected.
How it works:
ls lists the contents of the directory. The -t flag sorts by modification time (newest first). See the man page of ls.
tail -n +11 outputs starting from line 11. Please refer to the man page of tail for more details.
If the system is macOS, then you can delete based on creation time too: use ls with the -Ut flag. This will sort the contents based on the creation time.
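On macOS that would look something like the following, mirroring the command above but sorting by creation time (an untested sketch):
ls -Ut *.tgz | tail -n +11 | xargs rm -f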
You can use this command:
ssh -t xxx.xxx.xxx.xxx "cd /directory_wanted; ls -t *.tgz | tail -n +11 | xargs rm -f; bash"
Inside the quotes, we can add whatever operations are to be performed on the remote machine, but every command should be terminated with a semicolon (;).
Note: I included the same command suggested by silentMonk. It is simple and it works, but verify it once before performing the operation.

Postgresql separate backups for different clusters

I have this script for backing up PostgreSQL databases:
#!/bin/bash
# Location to place backups.
backup_dir="/path/to/backups/"
# String to append to the name of the backup files
backup_date=`date +%Y-%m-%d`
# Number of days you want to keep copies of your databases
number_of_days=7
databases=`psql -l -t | cut -d'|' -f1 | sed -e 's/ //g' -e '/^$/d'`
for i in $databases; do
    if [ "$i" != "template0" ] && [ "$i" != "template1" ]; then
        echo Dumping $i to $backup_dir$i\_$backup_date
        pg_dump -Fc $i > $backup_dir$i\_$backup_date
    fi
done
find $backup_dir -type f -prune -mtime +$number_of_days -exec rm -f {} \;
Only one cluster was used on the server, so everything was fine. But now a new cluster has been created, so I started wondering whether backups will be done properly, and if not, how to make sure they are done properly for every cluster.
Will this script now go over every cluster and back up all databases in all clusters? If so, there might be name clashes.
How could I make sure it does backups in different directories for different clusters?
I solved this by creating a second script to back up the other cluster's databases. It is not very elegant, but it works. If anyone could write a more universal script (so that one script could be used for all DB backups), taking different backup directories and clusters into consideration, please post it as an answer, as it would be a better solution.
The second script looks like this (of course it would be best to merge both scripts into one):
#!/bin/bash
# Location to place backups.
backup_dir="/path/to/backups2/"
# String to append to the name of the backup files
backup_date=`date +%Y-%m-%d`
# Number of days you want to keep copies of your databases
number_of_days=1
databases=`psql -p 5433 -l -t | cut -d'|' -f1 | sed -e 's/ //g' -e '/^$/d'`
for i in $databases; do
    if [ "$i" != "template0" ] && [ "$i" != "template1" ]; then
        echo Dumping $i to $backup_dir$i\_$backup_date\.gz
        pg_dump -p 5433 -Fc $i | gzip -f > $backup_dir$i\_$backup_date\.gz
    fi
done
find $backup_dir -type f -prune -mtime +$number_of_days -exec rm -f {} \;
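In that spirit, here is a sketch of how the two scripts above might be merged into a single script driven by a per-cluster list of port, backup directory and retention. The ports and directories are the ones used above (5432 assumed for the original cluster), gzip is applied to both as in the second script, and the whole thing is untested:
#!/bin/bash
# One entry per cluster: "port:backup_dir:days_to_keep" (values taken from the two scripts above).
clusters=(
    "5432:/path/to/backups/:7"
    "5433:/path/to/backups2/:1"
)

backup_date=$(date +%Y-%m-%d)

for cluster in "${clusters[@]}"; do
    IFS=':' read -r port backup_dir number_of_days <<< "$cluster"

    databases=$(psql -p "$port" -l -t | cut -d'|' -f1 | sed -e 's/ //g' -e '/^$/d')
    for i in $databases; do
        if [ "$i" != "template0" ] && [ "$i" != "template1" ]; then
            echo "Dumping $i to ${backup_dir}${i}_${backup_date}.gz"
            pg_dump -p "$port" -Fc "$i" | gzip -f > "${backup_dir}${i}_${backup_date}.gz"
        fi
    done

    # Remove backups in this cluster's directory that are older than its retention period.
    find "$backup_dir" -type f -prune -mtime +"$number_of_days" -exec rm -f {} \;
done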
