Remove duplicates from tar archive - bash

I'm trying to create an archive of multiple text files. Sometimes these files are updated; when they are, I use tar's --update option to append them to the archive.
Say we have two files, test1.txt, and test2.txt. These files are added to archive test.tar.
Inspecting the tar with tar -tf test.tar
I get as expected:
test1.txt
test2.txt
Now I update test2.txt and append it to the archive using tar -f test.tar -u test2.txt.
I expect the output of running tar -tf test.tar to be:
test1.txt
test2.txt
But instead I get:
test1.txt
test2.txt
test2.txt
So how do I shake this tar to remove the older test2.txt? I know that after extracting the archive I'd get only the most recent version of each file, so this problem might seem trivial in this demo, but I'm actually archiving thousands of 5000-line files, so the archive sizes get comically large with repeated runs.
What I'm currently doing is extracting the files into a temp directory and re-archiving them each time my script runs. This is obviously very inefficient. I'm hoping there's a tar option I'm missing somewhere.

A TAR archive is simply a concatenation of the raw file contents with some metadata blocks in between. As you noticed, updating a file simply appends the new version to the end of the TAR and, by convention, the last occurrence in the TAR "wins". TAR does not update a file in place because that could mean moving all file contents after the updated file by some bytes to make space for a larger new version.
There actually is a tar option that fits your use case and has not been mentioned here: --occurrence=[NUMBER]. With it, you can specify which of the multiple versions of a file with the same name/path is to be extracted or deleted. It works fine with your simple example. This is how I set it up:
echo foo > test1.txt
echo foo > test2.txt
tar -cf updated.tar test1.txt test2.txt
sleep 1s
echo barbara > test2.txt
tar --update -f updated.tar test1.txt test2.txt
sleep 1s
echo foobar > test2.txt
tar --update -f updated.tar test1.txt test2.txt
tar tvlf updated.tar
-rwx------ user/group 4 2022-03-29 19:00 test1.txt
-rwx------ user/group 4 2022-03-29 19:00 test2.txt
-rwx------ user/group 8 2022-03-29 19:01 test2.txt
-rwx------ user/group 7 2022-03-29 19:01 test2.txt
Note that tar --update only checks the timestamp, not the contents, and the timestamp only has 1-second granularity! Therefore, we wait 1 s to be sure that the new timestamp is at least one second later; otherwise tar will not add the file to the archive. This is especially important when copy-pasting this code.
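You can see this timestamp check in action: if the new content carries the same mtime as the archived version, --update silently skips it. A small sketch in a throwaway directory (GNU tar and GNU touch assumed; all file names are made up):

```shell
# Demonstrate that tar --update compares mtimes, not contents.
workdir=$(mktemp -d)
cd "$workdir"
echo foo > test2.txt
touch -d '2022-03-29 19:00' test2.txt      # pin the mtime
tar -cf demo.tar test2.txt
echo barbara > test2.txt                   # new content ...
touch -d '2022-03-29 19:00' test2.txt      # ... but the SAME mtime
tar --update -f demo.tar test2.txt
count_same=$(tar -tf demo.tar | wc -l)     # still 1 entry: tar skipped it
echo foobar > test2.txt                    # new content with a newer mtime
tar --update -f demo.tar test2.txt
count_newer=$(tar -tf demo.tar | wc -l)    # now 2 entries
echo "$count_same $count_newer"
```

Only the second update, whose mtime is strictly newer than the archived copy's, actually appends an entry.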
Simply calling --delete will delete all versions:
tar --delete -f updated.tar test2.txt
tar tvlf updated.tar
-rwx------ user/group 4 2022-03-29 19:00 test1.txt
When specifying --occurrence=1 (starting again from the archive containing all three versions), only the first occurrence, i.e., the oldest version, will be deleted:
tar --delete --occurrence=1 -f updated.tar test2.txt
tar tvlf updated.tar
-rwx------ user/group 4 2022-03-29 19:00 test1.txt
-rwx------ user/group 8 2022-03-29 19:01 test2.txt
-rwx------ user/group 7 2022-03-29 19:01 test2.txt
Unfortunately, --delete with --occurrence removes exactly one file version per invocation. So you would have to repeat deleting the oldest version until only the most recent one is left. That is possible in bash, and it would at least be more space-efficient than extracting to a temporary folder, but it would probably be slower because it has to go over the archive many times, and each time the archive is basically rewritten in place completely.
I recommend using ratarmount, which I wrote, instead. It will mount the archive (without actually extracting it) and expose a folder view showing the most recent versions of each file. Using this, you can create the new trimmed-down archive:
python3 -m pip install --user ratarmount
ratarmount updated.tar
ls -lA updated/
-rwx------ 1 user group 4 Mar 29 19:14 test1.txt
-rwx------ 1 user group 7 Mar 29 19:14 test2.txt
tar -c -f most-recent.tar -C updated/ .
tar tvlf most-recent.tar
drwxrwxrwx user/group 0 2022-03-29 19:00 ./
-rwx------ user/group 4 2022-03-29 19:00 ./test1.txt
-rwx------ user/group 7 2022-03-29 19:01 ./test2.txt
And there you have it. The output of tar tvlf looks a bit different because of the leading ./, which we get for using -C and archiving the . folder. Normally this doesn't hurt, but you can avoid it with any of these slightly more fragile alternatives:
tar -c -f most-recent.tar -C updated/ test1.txt test2.txt
tar -c -f most-recent.tar -C updated/ $( cd updated && find . -mindepth 1 -maxdepth 1 )
( cd updated/ && tar -c -f ../most-recent.tar {[^.],.[!.],..?}*; )
If you encounter problems with ratarmount, please open an issue on its issue tracker.
Note that ratarmount even exposes those older versions but in well-hidden special folders:
ratarmount updated.tar
ls -lA updated/test2.txt.versions/
-rwx------ 1 user group 4 Mar 29 20:10 1
-rwx------ 1 user group 8 Mar 29 20:10 2
-rwx------ 1 user group 7 Mar 29 20:10 3
The file names inside the special .versions folder match the arguments given to --occurrence.
The above-mentioned bash version with --occurrence would look like this:
function deleteAllButMostRecentInTar()
{
    local archive=$1
    local filesToDelete=$( mktemp )
    while true; do
        # Collect the names of all files that appear more than once.
        tar --list --file "$archive" | sort | uniq -c |
            sed -n -E '/^[ \t]*1 /d; s|^[ \t]*[0-9]+ ||p' > "$filesToDelete"
        if [[ -s "$filesToDelete" ]]; then
            local fileCount=$( wc -l < "$filesToDelete" )
            echo -n "Found $fileCount files with more than one version. Deleting ..."
            tar --delete --occurrence=1 --files-from="$filesToDelete" \
                --file "$archive"
            echo " OK"
        else
            break
        fi
    done
    rm -- "$filesToDelete"
    echo
}
deleteAllButMostRecentInTar updated.tar
tar tvlf updated.tar
-rwx------ user/group 4 2022-03-29 19:00 test1.txt
-rwx------ user/group 7 2022-03-29 19:01 test2.txt

Related

Deleting empty files in tar.gz file in bash

I have a tar.gz file and it contains .yang files, along with some empty .yang files.
I want to go into the tar.gz file and delete only those empty files.
Currently I am using:
for f in *.tar.gz
do
echo "Processing file $f"
gzip -d "$f"
find $PWD -size 0 -print -delete
gzip -9 "${f%.*}"
echo "******************************************"
done
but this is not working, maybe because I am currently working in a directory rather than inside the tar.gz file.
Is there another way to do this?
Your find command doesn't do anything useful to your tarballs because it searches and deletes in the current directory, not inside the tarballs.
So we need to first unpack the tarball (tar -xf), delete the empty files (find), and repack (tar -czf). As a safety measure, we will work in temporary directories (mktemp -d) and create new tarballs (*.tar.gz.new) instead of overwriting the old ones. As you want to delete only empty .yang files, we will also use some more find options. The following is for GNU tar; adapt it to your own tar version (or install GNU tar). Before using it, read what comes next, just in case...
for f in *.tar.gz; do
echo "Processing file $f"
d="$(mktemp -d)"
tar -xf "$f" -C "$d"
find "$d" -type f -name '*.yang' -size 0 -print -delete
tar -C "$d" -czf "$f.new" .
rm -rf "$d"
echo "******************************************"
done
But what you want is more complex than it seems, because your tarballs could contain files with metadata (owner, permissions...) that you are not allowed to recreate. If you run the above as a regular user, tar will silently change the ownership and permissions of such files and directories, so after repacking they will have modified metadata. If this is a problem and you absolutely want to preserve the metadata, there are basically two options:
Pretend you are root with fakeroot or an equivalent.
Delete the files inside the tarballs without unpacking.
To use fakeroot just run the above bash script inside a fakeroot environment:
$ fakeroot
# for f in *.tar.gz; do
# ...
# done
# exit
The second solution (in-place tarball edition) uses GNU tar and GNU awk:
for f in *.tar.gz; do
echo "Processing file $f"
t="${f%.*}"
gzip -cd "$f" > "$t"
tar -tvf "$t" | awk -vORS=$"\0" '/^-.*\.yang$/ && $3==0 {
match($0,/(\S+\s+){4}\S+\s/); print substr($0,RLENGTH+1)}' |
xargs -0 -n1 tar -f "$t" --delete
gzip -c9 "$t" > "$f.new"
echo "******************************************"
done
Explanations:
We use the GNU tar --delete option to delete files directly inside the tarball, without unpacking it, which is probably more elegant (even if it is also probably slower than a fakeroot-based solution).
Let's first find all empty files in the tarball:
$ tar -tvf foo.tar
drwx------ john/users 0 2021-10-18 14:26 ./
drwx------ john/users 0 2021-10-18 16:34 ./
-rw------- john/users 0 2021-10-18 16:34 ./nonyang
drwx------ john/users 0 2021-10-18 15:22 ./foo.yang/
-rw------- john/users 0 2021-10-18 16:01 ./empty.yang
-rw------- john/users 7 2021-10-18 15:22 ./nonempty.yang
-rw------- john/users 0 2021-10-18 16:01 ./filename with spaces.yang
As you can see, the size is in the third column. Directory names have a leading d and a trailing /; symbolic links have a leading l. So by keeping only lines that start with - and end with .yang, we eliminate both. GNU awk can do this twofold filtering:
$ tar -tvf foo.tar | awk '/^-.*\.yang$/ && $3==0 {print}'
-rw------- john/users 0 2021-10-18 16:01 ./empty.yang
-rw------- john/users 0 2021-10-18 16:01 ./filename with spaces.yang
This is more than what we want, so let's print only the name part. We first measure the length of the first 5 fields, including the spaces, with the match function (which sets a variable named RLENGTH) and remove them with substr:
$ tar -tvf foo.tar | awk '/^-.*\.yang$/ && $3==0 {
match($0,/(\S+\s+){4}\S+\s/); print substr($0,RLENGTH+1)}'
./empty.yang
./filename with spaces.yang
We could try to optimize a bit by calling match only on the first line but I am not 100% sure that all output lines are perfectly aligned, so let's call it on each line.
We are almost done: just pass this to tar -f foo.tar --delete <filename>, one name at a time. xargs can do this for us but there is a last trick: as file names can contain spaces we must use another separator, something that cannot be found in file names, like the NUL character (ASCII code 0). Fortunately GNU awk can use NUL as Output Record Separator (ORS) and xargs has the -0 option to use it as input separator. So, let's put all this together:
$ tar -tvf foo.tar | awk -vORS=$"\0" '/^-.*\.yang$/ && $3==0 {
match($0,/(\S+\s+){4}\S+\s/); print substr($0,RLENGTH+1)}' |
xargs -0 -n1 tar -f foo.tar --delete
$ tar -tvf foo.tar
drwx------ john/users 0 2021-10-18 16:34 ./
-rw------- john/users 0 2021-10-18 16:34 ./nonyang
drwx------ john/users 0 2021-10-18 15:22 ./foo.yang/
-rw------- john/users 7 2021-10-18 15:22 ./nonempty.yang
Inside your for loop:
for f in *.tar.gz; do
echo "Processing file $f"
t="${f%.*}"
gzip -cd "$f" > "$t"
tar -tvf "$t" | awk -vORS=$"\0" '/^-.*\.yang$/ && $3==0 {
match($0,/(\S+\s+){4}\S+\s/); print substr($0,RLENGTH+1)}' |
xargs -0 -n1 tar -f "$t" --delete
gzip -c9 "$t" > "$f.new"
echo "******************************************"
done
Note that we must decompress the tarballs before editing them because GNU tar cannot edit compressed tarballs.

zip two file with same content, but final md5sum is different

I have the following operation on my mac:
$ echo "dgrgrrgrgrg" > test1.txt
after a few seconds, copy test1.txt:
$ cp test1.txt test2.txt
$ ls -l
total 16
-rw-r--r-- 1 hqfy staff 12 Mar 31 10:18 test1.txt
-rw-r--r-- 1 hqfy staff 12 Mar 31 10:19 test2.txt
now check the md5sum:
$ md5 *.txt
MD5 (test1.txt) = 8bab5a3e202c901499d83cb25d5a8c80
MD5 (test2.txt) = 8bab5a3e202c901499d83cb25d5a8c80
it's obvious that test1.txt and test2.txt have the same md5sum, now I zip these two files:
$ zip -X test1.zip test1.txt
adding: test1.txt (deflated 8%)
$ zip -X test2.zip test2.txt
adding: test2.txt (deflated 8%)
$ ls -l
total 32
-rw-r--r-- 1 hqfy staff 12 Mar 31 10:18 test1.txt
-rw-r--r-- 1 hqfy staff 127 Mar 31 10:22 test1.zip
-rw-r--r-- 1 hqfy staff 12 Mar 31 10:19 test2.txt
-rw-r--r-- 1 hqfy staff 127 Mar 31 10:23 test2.zip
size of test1.zip and test2.zip are the same, but when I check md5sum:
$ md5 *.zip
MD5 (test1.zip) = af8783f96ce98aef717ecf6229ffb07e
MD5 (test2.zip) = 59e752a03a2930adbe7f30b9cbf14561
I've googled it and used zip with the -X option, but it did not work in my case. How can I create the two zip files with the same md5sum?
Quoting from the zip man page:
With -X, zip strips all old fields and only includes the Unicode and
Zip64 extra fields (currently these two extra fields cannot be
disabled).
So, a different md5sum is expected when zipping (even with -X).
I know that this question is very old, but I may have an answer for you:
The timestamps of the two files (which are obviously different) are stored inside the .zip files; that is why the md5sums differ. Note that the entry's file name is stored as well, so test1.zip and test2.zip could only ever match if the archived files shared the same name. If you remove the timestamp (and name) differences, the md5sums will be the same.
Also note that macOS's built-in archiver adds a __MACOSX folder with extra metadata to zip files it creates; that may also be an issue.
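To test the timestamp claim, give two copies the same entry name (in separate directories) and the same mtime; zip -X should then produce byte-identical archives. A sketch assuming Info-ZIP's zip and GNU coreutils are installed; all paths are made up:

```shell
workdir=$(mktemp -d)
mkdir "$workdir/a" "$workdir/b"
echo "dgrgrrgrgrg" > "$workdir/a/test.txt"
cp "$workdir/a/test.txt" "$workdir/b/test.txt"
touch -r "$workdir/a/test.txt" "$workdir/b/test.txt"   # copy the mtime too
# Same entry name, same content, same mtime, no extra fields (-X):
(cd "$workdir/a" && zip -qX out.zip test.txt)
(cd "$workdir/b" && zip -qX out.zip test.txt)
md5_a=$(md5sum "$workdir/a/out.zip" | cut -d' ' -f1)
md5_b=$(md5sum "$workdir/b/out.zip" | cut -d' ' -f1)
echo "$md5_a"
echo "$md5_b"
```

Without the touch -r step, the differing modification times end up in the archives and the checksums diverge again.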

Zip a directory while retaining relative path

I have a directory of files:
/home/user/files/1.txt
/home/user/files/2.txt
/home/user/files/3.txt
I'd like to zip up the files directory into files.zip so when extracted I get:
files/1.txt
files/2.txt
files/3.txt
I know I can do:
# bash
cd /home/user; zip -r files.zip files/
Is there a way to do this without cding to the user directory?
I know that the --junk-paths flag will store just the filenames and junk the path but I'd like to keep the files directory as a container.
I couldn't find a direct way to do it with the zip command, but you can use tar with the -C option.
$ pwd
/home/shenzi
$ ls -l giga/files
total 3
-rw-r--r-- 1 shenzi Domain Users 3 Aug 5 11:24 1.txt
-rw-r--r-- 1 shenzi Domain Users 4 Aug 5 11:25 2.txt
-rw-r--r-- 1 shenzi Domain Users 9 Aug 5 11:25 3.txt
$ tar -C giga -cvf files.zip files/*
files/1.txt
files/2.txt
files/3.txt
$ tar -tvf files.zip
-rw-r--r-- shenzi/Domain Users 3 2014-08-05 11:24 files/1.txt
-rw-r--r-- shenzi/Domain Users 4 2014-08-05 11:25 files/2.txt
-rw-r--r-- shenzi/Domain Users 9 2014-08-05 11:25 files/3.txt
Use -xvf to extract. (Note that despite the .zip extension, this file is actually a tar archive.)

Recursively touch files with file

I have a directory that contains sub-directories and other files and would like to update the date/timestamps recursively with the date/timestamp of another file/directory.
I'm aware that:
touch -r file directory
changes the date/timestamp for the file or directory with the others, but nothing within it. There's also the find version which is:
find . -exec touch -mt 201309300223.25 {} \;
which would work fine if I could specify the actual file/directory and use another's date/timestamp. Is there a simple way to do this? Even better, is there a way to avoid changing/updating timestamps when doing a cp?
even better, is there a way to avoid changing/updating timestamps when doing a 'cp'?
Yes, use cp with the -p option:
-p
same as --preserve=mode,ownership,timestamps
--preserve
preserve the specified attributes (default:
mode,ownership,timestamps), if possible additional attributes:
context, links, xattr, all
Example
$ ls -ltr
-rwxrwxr-x 1 me me 368 Apr 24 10:50 old_file
$ cp old_file not_maintains <----- does not preserve time
$ cp -p old_file do_maintains <----- does preserve time
$ ls -ltr
total 28
-rwxrwxr-x 1 me me 368 Apr 24 10:50 old_file
-rwxrwxr-x 1 me me 368 Apr 24 10:50 do_maintains <----- does preserve time
-rwxrwxr-x 1 me me 368 Sep 30 11:33 not_maintains <----- does not preserve time
To recursively touch files in a directory based on the corresponding file in another path, you can try something like the following:
find /your/path/ -exec touch -r $(echo {} | sed "s#/your/path#/your/original/path#g") {} \;
It is not working for me yet, but I guess it is a matter of a bit more trial and testing.
In addition to cp -p, you can (re)create an old timestamp using touch -t. See the touch man page for more details.
touch -t 200510071138 old_file.dat
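For the recursive part of the question, find can apply touch -r to every entry under a directory, stamping them all with one reference file's time. A sketch with made-up paths; GNU stat is used only to verify the result:

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/tree/sub"
touch "$workdir/tree/a.txt" "$workdir/tree/sub/b.txt"
touch -t 200510071138 "$workdir/ref"       # create the reference file
# Recursively copy the reference file's timestamp onto the whole tree.
find "$workdir/tree" -exec touch -r "$workdir/ref" {} +
mtime_ref=$(stat -c %Y "$workdir/ref")
mtime_a=$(stat -c %Y "$workdir/tree/a.txt")
mtime_b=$(stat -c %Y "$workdir/tree/sub/b.txt")
echo "$mtime_ref $mtime_a $mtime_b"
```

Using {} + instead of {} \; batches many paths into each touch invocation, which is noticeably faster on large trees.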

Listing the content of a tar file or a directory only down to some level

I wonder how to list the contents of a tar file only down to some level?
I understand tar tvf mytar.tar will list all files, but sometimes I would like to see only directories down to some level.
Similarly, for the ls command, how do I control the depth of subdirectories that will be displayed? By default, it only shows the direct subdirectories and does not go further.
depth=1
tar --exclude="*/*" -tf file.tar
depth=2
tar --exclude="*/*/*" -tf file.tar
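The two commands differ only in the number of */ components, so a small helper can build the exclude pattern for any depth. A sketch assuming GNU tar; tarDepth is a made-up name:

```shell
# List a tar archive's entries down to a given depth by constructing
# the "*/*/..." exclude pattern dynamically.
tarDepth() {
    local archive=$1 depth=$2 pattern='*' i
    for (( i = 0; i < depth; i++ )); do
        pattern="*/$pattern"
    done
    tar --exclude="$pattern" -tf "$archive"
}

# Demo on a throwaway archive with three levels of nesting:
workdir=$(mktemp -d)
mkdir -p "$workdir/a/b/c"
touch "$workdir/a/top.txt" "$workdir/a/b/c/deep.txt"
tar -C "$workdir" -cf "$workdir/demo.tar" a
listing=$(tarDepth "$workdir/demo.tar" 2)
echo "$listing"    # shows a/top.txt but not a/b/c/deep.txt
```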
tar tvf scripts.tar | awk -F/ '{if (NF<4) print }'
drwx------ glens/glens 0 2010-03-17 10:44 scripts/
-rwxr--r-- glens/www-data 1051 2009-07-27 10:42 scripts/my2cnf.pl
-rwxr--r-- glens/www-data 359 2009-08-14 00:01 scripts/pastebin.sh
-rwxr--r-- glens/www-data 566 2009-07-27 10:42 scripts/critic.pl
-rwxr-xr-x glens/glens 981 2009-12-16 09:39 scripts/wiki_sys.pl
-rwxr-xr-x glens/glens 3072 2009-07-28 10:25 scripts/blacklist_update.pl
-rwxr--r-- glens/www-data 18418 2009-07-27 10:42 scripts/sysinfo.pl
Note that the number is 3 plus however many levels you want, because of the / in the username/group field. If you just do
tar tf scripts.tar | awk -F/ '{if (NF<3) print }'
scripts/
scripts/my2cnf.pl
scripts/pastebin.sh
scripts/critic.pl
scripts/wiki_sys.pl
scripts/blacklist_update.pl
scripts/sysinfo.pl
it is only 2 plus the desired depth.
You could probably pipe the output of ls -R to this awk script, and have the same effect.
Another option is archivemount. You mount it, and cd into it. Then you can do anything with it just as with other filesystem.
$ archivemount /path/to/files.tgz /path/to/mnt/folder
It seems faster than the tar method.
It would be nice if we could tell the find command to look inside a tar file, but I doubt that is possible.
I quick and ugly (and not foolproof) way would be to limit the number of directory separators, for example:
$ tar tvf myfile.tar | grep -E '^[^/]*(/[^/]*){1,2}$'
The 2 says to allow at most 2 slashes per line (in my case one is already contributed by the user/group separator), and hence to display files at depth at most one. You might want to try different numbers in place of the 2.
I agree with leonbloy's answer - there's no way to do this straightforwardly within the tarball itself.
Regarding the second part of your question, ls does not have a max depth option. You can recurse everything with ls -R, but that's often not very useful.
However you can do this with both find and tree. For example to list files and directories one level deep, you can do
find -maxdepth 2
or
tree -L 2
tree also has a -d option, which recursively lists directories, but not files, which I find much more useful than -L, in general.
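A quick way to convince yourself of the find variant (the directory names here are made up):

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/a/b/c"
# -maxdepth 2 descends at most two levels below the starting point,
# so ./a/b appears in the output but ./a/b/c does not.
shallow=$(cd "$workdir" && find . -maxdepth 2 | sort)
echo "$shallow"
```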
I was able to show only the directory names at a particular depth using grep:
for depth 3:
tar -tf mytar.tar | grep -Ex '([^/]+/){3}'
or for depth $DEPTH:
tar -tf mytar.tar | grep -Ex "([^/]+/){$DEPTH}"
You can speed that up by combining grep with --exclude from the accepted answer above.
for depth 3:
tar --exclude '*/*/*/*/*' -tf mytar.tar | grep -Ex '([^/]+/){3}'
