I have a tar.gz file that contains .yang files, some of which are empty.
I want to go into the tar.gz file and delete only those empty files.
Currently I am using:
for f in *.tar.gz
do
echo "Processing file $f"
gzip -d "$f"
find $PWD -size 0 -print -delete
gzip -9 "${f%.*}"
echo "******************************************"
done
but this is not working, maybe because I am not operating in a directory but inside the tar.gz file.
Is there any other way to do this?
Your find command doesn't do anything useful to your tarballs because it searches and deletes in the current directory, not inside the tarballs.
So we need to first unpack the tarball (tar -xf), delete the empty files (find), and repack it (tar -czf). As a safety measure we will work in temporary directories (mktemp -d) and create new tarballs (*.tar.gz.new) instead of overwriting the old ones. As you want to delete only empty .yang files, we will also use a few more find options. The following is for GNU tar; adapt it to your own tar version (or install GNU tar). Before using it, read what comes next, just in case...
for f in *.tar.gz; do
echo "Processing file $f"
d="$(mktemp -d)"
tar -xf "$f" -C "$d"
find "$d" -type f -name '*.yang' -size 0 -print -delete
tar -C "$d" -czf "$f.new" .
rm -rf "$d"
echo "******************************************"
done
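Once you have checked the new tarballs (and read the caveats below), you can promote them over the originals with a tiny extra loop (a sketch of mine, not part of the original answer):
for f in *.tar.gz.new; do
    mv -- "$f" "${f%.new}"    # x.tar.gz.new -> x.tar.gz, overwriting the original
done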
But what you want is more complex than it seems, because your tarballs could contain files with metadata (owner, permissions...) that you are not allowed to use. If you run the preceding as a regular user, tar will silently change the ownership and permissions of such files and directories, so when re-packing they will have modified metadata. If that is a problem and you absolutely want to preserve the metadata, there are basically two options:
Pretend you are root with fakeroot or an equivalent.
Delete the files inside the tarballs without unpacking.
To use fakeroot just run the above bash script inside a fakeroot environment:
$ fakeroot
# for f in *.tar.gz; do
# ...
# done
# exit
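fakeroot can also run the whole thing as one non-interactive command, which is handier in scripts (a sketch, assuming fakeroot and GNU tar are installed):
fakeroot bash -c '
    for f in *.tar.gz; do
        d=$(mktemp -d)
        tar -xf "$f" -C "$d"
        find "$d" -type f -name "*.yang" -size 0 -print -delete
        tar -C "$d" -czf "$f.new" .
        rm -rf "$d"
    done
'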
The second solution (in-place tarball editing) uses GNU tar and GNU awk:
for f in *.tar.gz; do
echo "Processing file $f"
t="${f%.*}"
gzip -cd "$f" > "$t"
tar -tvf "$t" | awk -vORS=$"\0" '/^-.*\.yang$/ && $3==0 {
match($0,/(\S+\s+){4}\S+\s/); print substr($0,RLENGTH+1)}' |
xargs -0 -n1 tar -f "$t" --delete
gzip -c9 "$t" > "$f.new"
echo "******************************************"
done
Explanations:
We use the GNU tar --delete option to delete files directly inside the tarball, without unpacking it, which is probably more elegant (even if it is also probably slower than a fakeroot-based solution).
Let's first find all empty files in the tarball:
$ tar -tvf foo.tar
drwx------ john/users 0 2021-10-18 14:26 ./
drwx------ john/users 0 2021-10-18 16:34 ./
-rw------- john/users 0 2021-10-18 16:34 ./nonyang
drwx------ john/users 0 2021-10-18 15:22 ./foo.yang/
-rw------- john/users 0 2021-10-18 16:01 ./empty.yang
-rw------- john/users 7 2021-10-18 15:22 ./nonempty.yang
-rw------- john/users 0 2021-10-18 16:01 ./filename with spaces.yang
As you can see, the size is in the third column. Directory names have a leading d and a trailing /. Symbolic links have a leading l. So by keeping only the lines starting with - and ending with .yang we eliminate them. GNU awk can do this twofold filtering:
$ tar -tvf foo.tar | awk '/^-.*\.yang$/ && $3==0 {print}'
-rw------- john/users 0 2021-10-18 16:01 ./empty.yang
-rw------- john/users 0 2021-10-18 16:01 ./filename with spaces.yang
This is more than what we want, so let's print only the name part. We first measure the length of the first 5 fields, including the spaces, with the match function (which sets a variable named RLENGTH) and remove them with substr:
$ tar -tvf foo.tar | awk '/^-.*\.yang$/ && $3==0 {
match($0,/(\S+\s+){4}\S+\s/); print substr($0,RLENGTH+1)}'
./empty.yang
./filename with spaces.yang
We could try to optimize a bit by calling match only on the first line but I am not 100% sure that all output lines are perfectly aligned, so let's call it on each line.
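For the record, that optimization would look something like this (a sketch; only safe if the columns really are aligned on every line):
tar -tvf foo.tar | awk '/^-.*\.yang$/ && $3==0 {
    if (!len) { match($0,/(\S+\s+){4}\S+\s/); len = RLENGTH }    # measure the offset once
    print substr($0,len+1) }'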
We are almost done: just pass this to tar -f foo.tar --delete <filename>, one name at a time. xargs can do this for us, but there is one last trick: as file names can contain spaces, we must use another separator, something that cannot be found in file names, like the NUL character (ASCII code 0). Fortunately, GNU awk can use NUL as its Output Record Separator (ORS) and xargs has the -0 option to use it as the input separator. So, let's put all this together:
$ tar -tvf foo.tar | awk -v ORS='\0' '/^-.*\.yang$/ && $3==0 {
match($0,/(\S+\s+){4}\S+\s/); print substr($0,RLENGTH+1)}' |
xargs -0 -n1 tar -f foo.tar --delete
$ tar -tvf foo.tar
drwx------ john/users 0 2021-10-18 16:34 ./
-rw------- john/users 0 2021-10-18 16:34 ./nonyang
drwx------ john/users 0 2021-10-18 15:22 ./foo.yang/
-rw------- john/users 7 2021-10-18 15:22 ./nonempty.yang
Inside your for loop:
for f in *.tar.gz; do
echo "Processing file $f"
t="${f%.*}"
gzip -cd "$f" > "$t"
tar -tvf "$t" | awk -vORS=$"\0" '/^-.*\.yang$/ && $3==0 {
match($0,/(\S+\s+){4}\S+\s/); print substr($0,RLENGTH+1)}' |
xargs -0 -n1 tar -f "$t" --delete
gzip -c9 "$t" > "$f.new"
echo "******************************************"
done
Note that we must decompress the tarballs before editing them because GNU tar cannot edit compressed tarballs.
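If you do this often, the decompress/edit/recompress dance can be wrapped in a small helper (a sketch of mine; the function name is made up and it assumes GNU tar and gzip):
# usage: delete_in_tgz ARCHIVE.tar.gz MEMBER...
delete_in_tgz() {
    local f=$1; shift
    local t=${f%.gz}                  # temporary uncompressed tarball
    gzip -cd "$f" > "$t" &&           # decompress next to the original
    tar -f "$t" --delete "$@" &&      # GNU tar in-place member deletion
    gzip -c9 "$t" > "$f.new" &&       # recompress into a new file
    rm -f "$t"
}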
Related
I'm trying to create an archive of multiple text files. Sometimes these files are updated, and when they are, I use the --update option in tar to append them to the archive.
Say we have two files, test1.txt, and test2.txt. These files are added to archive test.tar.
Inspecting the tar with tar -tf test.tar
I get as expected:
test1.txt
test2.txt
Now I update test2.txt and append it to the archive using tar -f test.tar -u test2.txt.
I expect the output of running tar -tf test.tar to be:
test1.txt
test2.txt
But instead I get:
test1.txt
test2.txt
test2.txt
So how do I shake this tar to remove the older test2.txt? I know that after extracting the archive, I'd get only the most recent changes to both files, so this problem might seem trivial in this demo, but I'm actually archiving thousands of 5000-line files so the archive sizes get comically large with repeated runs.
What I'm currently doing is I'm extracting the files into a temp directory then re-archiving each time my script is run. This is obviously very inefficient. I'm hoping there's a tar option I'm missing somewhere.
TAR is simply a concatenation of the raw file contents with some metadata mixed in between. As you noticed, updating a file simply appends it to the end of the TAR and, by convention, the last occurrence of a file in the TAR "wins". TAR does not update a file in place because that could mean moving all the file contents after the updated file by some bytes to make space for the larger, newer version.
There actually is a tar option that hasn't been mentioned here and that fits your use case: --occurrence=[NUMBER]. With this option, you can specify which of the multiple versions of a file with the same name/path is to be extracted or deleted. It works fine with your simple example. This is how I set it up:
echo foo > test1.txt
echo foo > test2.txt
tar -cf updated.tar test1.txt test2.txt
sleep 1s
echo barbara > test2.txt
tar --update -f updated.tar test1.txt test2.txt
sleep 1s
echo foobar > test2.txt
tar --update -f updated.tar test1.txt test2.txt
tar tvlf updated.tar
-rwx------ user/group 4 2022-03-29 19:00 test1.txt
-rwx------ user/group 4 2022-03-29 19:00 test2.txt
-rwx------ user/group 8 2022-03-29 19:01 test2.txt
-rwx------ user/group 7 2022-03-29 19:01 test2.txt
Note that tar --update only checks the timestamp, not the contents, and the timestamp has only 1s granularity! Therefore we need to wait 1s to be sure that the new timestamp is at least one second later, or tar will not add the file to the archive. This is especially important when copy-pasting this code.
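For instance, if the file is rewritten twice within the same second, the second --update usually appends nothing because the timestamps compare equal (a sketch):
echo bar > test2.txt
tar --update -f updated.tar test2.txt    # appended: the file is newer than the archive copy
echo baz > test2.txt                     # same second as the previous write
tar --update -f updated.tar test2.txt    # equal timestamp: nothing is appended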
Simply calling --delete will delete all versions:
tar --delete -f updated.tar test2.txt
tar tvlf updated.tar
-rwx------ user/group 4 2022-03-29 19:00 test1.txt
When specifying --occurrence=1, only the first occurrence, i.e., the oldest version will be deleted:
tar --delete --occurrence=1 -f updated.tar test2.txt
tar tvlf updated.tar
-rwx------ user/group 4 2022-03-29 19:00 test1.txt
-rwx------ user/group 8 2022-03-29 19:01 test2.txt
-rwx------ user/group 7 2022-03-29 19:01 test2.txt
Unfortunately, with --delete you can only delete exactly one file version at a time. So you would have to repeat deleting the oldest version until only the most recent one is left. That is possible in bash (see the function further below), and it would at least be more space-efficient than extracting to a temporary folder, but it would probably be slower because it has to go over the archive many times, and each time the archive is basically rewritten in place.
I recommend using ratarmount, which I wrote, instead. It will mount the archive (without actually extracting it) and expose a folder view showing the most recent version of each file. Using this view, you can create the new trimmed-down archive:
python3 -m pip install --user ratarmount
ratarmount updated.tar
ls -lA updated/
-rwx------ 1 user group 4 Mar 29 19:14 test1.txt
-rwx------ 1 user group 7 Mar 29 19:14 test2.txt
tar -c -f most-recent.tar -C updated/ .
tar tvlf most-recent.tar
drwxrwxrwx user/group 0 2022-03-29 19:00 ./
-rwx------ user/group 4 2022-03-29 19:00 ./test1.txt
-rwx------ user/group 7 2022-03-29 19:01 ./test2.txt
And there you have it. The output of tar tvlf looks a bit different, with the leading ./, because we used -C and archived the . folder. Normally this doesn't hurt, but you can circumvent it with any of these slightly more problematic alternatives:
tar -c -f most-recent.tar -C updated/ test1.txt test2.txt
tar -c -f most-recent.tar -C updated/ $( cd updated && find . -mindepth 1 -maxdepth 1 )
( cd updated/ && tar -c -f ../most-recent.tar {[^.],.[!.],..?}*; )
If you encounter problems with ratarmount please open an issue here.
Note that ratarmount even exposes those older versions but in well-hidden special folders:
ratarmount updated.tar
ls -lA updated/test2.txt.versions/
-rwx------ 1 user group 4 Mar 29 20:10 1
-rwx------ 1 user group 8 Mar 29 20:10 2
-rwx------ 1 user group 7 Mar 29 20:10 3
The file names inside the special .versions folder match the arguments given to --occurrence.
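For example, to pull a specific old version straight out of the archive with GNU tar (here the second one, matching .versions/2):
tar -xf updated.tar --occurrence=2 test2.txt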
The above-mentioned bash version with --occurrence would look like this:
function deleteAllButMostRecentInTar()
{
local archive=$1
local filesToDelete=$( mktemp )
while true; do
tar --list --file "$archive" | sort | uniq -c |
sed -n -E '/^[ \t]*1 /d; s|^[ \t]*[0-9]+ ||p' > "$filesToDelete"
if [[ -s "$filesToDelete" ]]; then
local fileCount=$( cat -- "$filesToDelete" | wc -l )
echo -n "Found $fileCount files with more than version. Deleting ..."
tar --delete --occurrence=1 --files-from="$filesToDelete" \
--file "$archive"
echo " OK"
else
break
fi
done
rm -- "$filesToDelete"
echo
}
deleteAllButMostRecentInTar updated.tar
tar tvlf updated.tar
-rwx------ user/group 4 2022-03-29 19:00 test1.txt
-rwx------ user/group 7 2022-03-29 19:01 test2.txt
I set up a daily cron job to back up my server.
In my backup folder, the backup command generates 2 files: the archive itself (.tar.gz) and a .info.json file, like the ones below:
-rw-r--r-- 1 root root 1617 Feb 2 16:17 20200202-161647.info.json
-rw-r--r-- 1 root root 48699726 Feb 2 16:17 20200202-161647.tar.gz
-rw-r--r-- 1 root root 1617 Feb 3 06:25 20200203-062501.info.json
-rw-r--r-- 1 root root 48737781 Feb 3 06:25 20200203-062501.tar.gz
-rw-r--r-- 1 root root 1618 Feb 4 06:25 20200204-062501.info.json
-rw-r--r-- 1 root root 48939569 Feb 4 06:25 20200204-062501.tar.gz
How do I write a bash script that will keep only the last 2 archives and delete all the other backups (.tar.gz and .info.json)?
In this example, that would mean deleting 20200202-161647.info.json and 20200202-161647.tar.gz.
Edit:
I replaced -name with -wholename in the script, but when I run it, it apparently has no effect. The old archives are still there; they have not been deleted.
the script :
#!/bin/bash
DEBUG="";
DEBUG="echo DEBUG..."; #put last to safely debug without deleting files
keep=2;
for suffix in /home/archives .json .tar; do
list=( $( find . -wholename "*$suffix" ) ); #allow for zero names
if [ ${#list[@]} -gt $keep ]; then
# delete all but last $keep oldest files
${DEBUG}rm -f "$( ls -tr "${list[@]}" | head -n-$keep )";
fi
done
Edit 2:
If I run @sorin's script, does it actually delete everything, if I am to believe the script output?
The archive folder before running the script:
https://pastebin.com/7WtwVHCK
The script I run:
find home/archives/ \( -name '*.json' -o -name '*.tar.gz' \) -print0 |\
sort -zr |\
sed -z '3,$p' | \
xargs -0 echo rm -f
The script output:
https://pastebin.com/zd7a2zcq
Edit 3 :
The command find /home/archives/ -daystart \( -name '*.json' -o -name '*.tar.gz' \) -mtime +1 -exec echo rm -f {} + works and does the job.
Marked as solved
If the file is generated daily, a simple approach would be to take advantage of the -mtime find condition:
find /home/archives/ -daystart \( -name '*.json' -o -name '*.tar.gz' \) -mtime +1 -exec echo rm -f {} +
-daystart - use the start of the day for comparing modification times
\( -name '*.json' -o -name '*.tar.gz' \) - select files that end either in *.json or *.tar.gz
-mtime +1 - modification time is older than 24 hours (from the day start)
-exec echo rm -f {} + - remove the files (remove the echo after testing and verifying the result is what you want)
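Once you have removed the echo and verified the result, the command drops straight into a daily crontab entry (a sketch; pick your own schedule):
# m h dom mon dow   command
30 6 * * * find /home/archives/ -daystart \( -name '*.json' -o -name '*.tar.gz' \) -mtime +1 -exec rm -f {} +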
A simpler solution, avoiding ls and its pitfalls and not depending on the modification time of the files:
find /home/archives/ \( -name '*.json' -o -name '*.tar.gz' \) -print0 |\
sort -zr |\
sed -nz '3,$p' | \
xargs -0 echo rm -f
\( -name '*.json' -o -name '*.tar.gz' \) - find files that end in either *.json or *.tar.gz
-print0 - print them null separated
sort -zr - -z tells sort to use null as a line separator, -r sorts them in reverse
sed -nz '3,$p' - -z same as above. '3,$p' - print lines between 3rd and the end ($)
xargs -0 echo rm -f - execute rm with the piped arguments (remove the echo after you tested and you are satisfied with the command)
Note: not all sort and sed implementations support -z, but most (notably the GNU versions) do. If you are stuck in such a situation, you might have to use a higher-level language.
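Before reaching for one, though, a newline-separated fallback of the same pipeline may do, as long as no file name contains a newline (a sketch):
find /home/archives/ \( -name '*.json' -o -name '*.tar.gz' \) |
    sort -r | awk 'NR > 2' |                       # line-based equivalent of sed -nz '3,$p'
    while IFS= read -r f; do echo rm -f -- "$f"; done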
Find the two most recent files in path:
most_recent_json=$(ls -t *.json | head -1)
most_recent_tar_gz=$(ls -t *.tar.gz | head -1)
Remove everything else ignoring the found recent files:
rm -i $(ls -I $most_recent_json -I $most_recent_tar_gz)
Automatic deletion can be hazardous to your mental state if it deletes unwanted files or aborts long scripts early due to unexpected errors, say when there are fewer than 1+2 files as in your example. Make sure the script does not fail if there are no files at all.
tdir=/home/archives/; #target dir
DEBUG="";
DEBUG="echo DEBUG..."; #put last to safely debug without deleting files
keep=2;
for suffix in .json .tar.gz; do
list=( $( find "$tdir" -name "*$suffix" ) ); #allow for zero names
if [ ${#list[@]} -gt $keep ]; then
# delete all but the $keep newest files
${DEBUG}rm -f $( ls -tr "${list[@]}" | head -n -$keep ); #assumes names without whitespace
fi
done
Assuming that you have fewer than 10 files and that they are created in pairs, you can do something straightforward like this:
files_to_delete=$(ls -t1 | tail -n +3)
rm $files_to_delete
The -t1 tells the ls command to list the files in reverse chronological order by modification time (newest first), each on a single line.
The tail -n+3 tells the tail command to start at the third line (skipping the first two lines).
If you have more than 10 files, a more complicated solution will be necessary, or you would need to run this multiple times.
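A sketch of such a variant, which copes with any number of files (including none), assuming GNU xargs and no newlines in the file names:
ls -t1 | tail -n +3 | xargs -d '\n' -r rm --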
I have a directory with hundreds of files (macOS), and I would like a bash script to copy each file into a directory named after the file's creation date. So if temp.mp4 was created on 2019-12-10, the file would be copied into that directory, which may or may not already exist.
I don't want to write anything to a temp file if I don't have to. I found that the following code gets just the file and date in YYYY-MM-DD format:
gls -lt --time-style=+"%Y-%m-%d" | awk '{print $6,$7}'
But I'm not sure how to iterate/test (piping it to another awk instance?). Pseudocode:
if [ -f $7 ] && [ -d $6 ]    # if the file and the date directory exist
then
    mv $7 $6                 # move the file to the existing directory
else
    mkdir $6                 # create the directory from the file's creation date
    mv $7 $6                 # move the file to the directory
fi
I don't need to worry about recursively going through directories. All the files are in a single directory.
============
Based on some feedback below, I'm expanding my question to clarify what I'm seeing. Here's the script that I'm using (note the change from hardcoded "*.mp4" to $1):
#!/bin/bash
gstat -c '%w %n' $1 | gcut --complement -f 2,3 -d' ' | while read FILE
do
NAME=${FILE#* }
DATE=${FILE%% *}
[ -d ${DATE} ] || mkdir ${DATE}
mv -i "${NAME}" ${DATE}/
done
In my directory there are two files that don't get moved. Here's what the directory looks like (ls -l):
╰─ ls -l
total 16
drwxr-xr-x 9 awagoner staff 288 Nov 6 21:44 2019-11-03
drwxr-xr-x 4 awagoner staff 128 Nov 7 11:19 2019-11-06
-rw-r--r-- 1 awagoner staff 138 Mar 19 2019 Pipfile
-rw-r--r--@ 1 awagoner staff 715 Nov 1 13:53 php_errorlog
I'm not sure what to do to move all files to the associated directory. Is there a certain string which will cover all files (AAA, AAA.AAA, .AAA)?
A great tool for accessing file information is stat. So first, ask stat to give creation dates and file names, so we can iterate over each of them:
stat -c '%w %n' *.mp4
Pipe that through cut which will drop away the extra columns you don’t want (time and timezone):
stat -c '%w %n' *.mp4 | cut --complement -f 2,3 -d' '
Now, our output is 1 line per file, with date and filename. It is possible you have filenames with spaces, so watch out for that when parsing. This little bit will do the trick cleanly:
NAME=${FILE#* }
DATE=${FILE%% *}
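A quick illustration of those two expansions (with a hypothetical file name):
FILE='2019-12-10 my movie.mp4'
NAME=${FILE#* }     # shortest prefix match of '* ' removed -> 'my movie.mp4'
DATE=${FILE%% *}    # longest suffix match of ' *' removed  -> '2019-12-10'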
Finally, make sure the directory exists, and move the file to its new home:
[ -d ${DATE} ] || mkdir ${DATE}
mv "${NAME}" ${DATE}/
Put it all together. This is what you want to run:
stat -c '%w %n' *.mp4 | cut --complement -f 2,3 -d' ' | while read FILE
do
NAME=${FILE#* }
DATE=${FILE%% *}
mkdir -p ${DATE}
mv -i "${NAME}" ${DATE}/
done
Updated: use suggestion from jhnc above, adding -i flag on mv
Updated (2): more jhnc feedback (the real brains behind this answer :) )
I would like to archive all files (into one .tar.gz file) in a directory when they are older than X days.
I have this one liner:
find /home/xml/ -maxdepth 1 -mtime +14 -type f -exec sh -c 'tar -czvPf /home/xml/archive/archive_$(date +%F).tar.gz $0' {} \;
When I run this command, I see the correct files selected in this directory, but only the last file ends up in the archive. Is there any way to get all the files into one tar.gz archive?
One more problem after @Alex's answer: many files are still missing.
Maybe the colons (:) in the file names are causing the problem?
-exec runs the command for each file selected, so it's writing a tar with one file in it and then overwriting it for every source file, which explains why you're only getting the last one. You can use find to generate the list of files you want and then pipe that through xargs to pass the list as if they were parameters to your tar command:
find /home/xml/ -maxdepth 1 -mtime +14 -type f | xargs tar -czvPf /home/xml/archive/archive_$(date +%F).tar.gz
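If any of the selected file names may contain spaces or newlines, a null-separated variant of the same pipe is safer (a sketch, assuming GNU find and xargs):
find /home/xml/ -maxdepth 1 -mtime +14 -type f -print0 |
    xargs -0 tar -czvPf /home/xml/archive/archive_$(date +%F).tar.gz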
File names with colons work fine for me:
% dd if=/dev/urandom of=one:1 count=1
% dd if=/dev/urandom of=two:2 count=1
% dd if=/dev/urandom of=three:3 count=1
% dd if=/dev/urandom of=four:4 count=1
% dd if=/dev/urandom of=five:5 count=1
% find . -type f | xargs tar cvf foo.tar
./five:5
./four:4
./two:2
./three:3
./one:1
% tar tvf foo.tar
-rw------- alex/alex 512 2017-07-03 21:08 ./five:5
-rw------- alex/alex 512 2017-07-03 21:08 ./four:4
-rw------- alex/alex 512 2017-07-03 21:08 ./two:2
-rw------- alex/alex 512 2017-07-03 21:08 ./three:3
-rw------- alex/alex 512 2017-07-03 21:08 ./one:1
I need a bash script to tar half of the files in a directory. The files are .gz files with the naming convention x.gz, where x is a number from 1 to 100, and I need to tar the first half of them. How do I do this?
Your question is a little unclear. I assume you have files named x.gz and you want to add 1.gz through 50.gz to a tar file. If that is the case:
tar cjf MyArchive.tar.bz2 {1..50}.gz
The above command will put the first 50 .gz files into an archive named MyArchive.tar.bz2
I understand that you have an arbitrary number of files named x.gz in the current directory and you want to tar half of them.
But as you can see from the answers, your description is not detailed enough.
I tried to provide the most flexible solution.
files=`find . -maxdepth 1 -mindepth 1 -type f -printf '%f\n' | grep -P '^\d+\.gz$' | sort -n`
n=`echo "$files" | wc -l`
half=$(( n / 2 ))
d=`echo "$files" | head -$half`
tar czf archive.tar.gz $d
Here's one way to do this (using KSH, which should be available anywhere BASH is).
Save the script below in x.sh and chmod +x x.sh; then run it:
#!/bin/ksh
#
#
## first create 100 dummy GZ files
##
x=0
while :
do
x=$((${x}+1));
if [ ${x} -gt 100 ]; then
break;
fi
touch ${x}.gz;
done
## next parse the list, sort it numerically, and tar the files one by one, stopping at 50.gz
##
for x in `ls *.gz | sed 's/\.gz//g' | sort -n`
do
if [ $x -gt 50 ]; then
exit 0;
fi
tar -rvf all-50.tar ${x}.gz
done