When compressing files (zip, tar, etc.) over SSH, what determines the 'sort order' in which files are compressed?

Consider the following command run on a folder with 2TB of recursive folders and files in it.
tar -cvzf _backup.tar.gz /home/wwwsite/public_html
Consider that the folder being compressed is full of sub-folders (with hundreds of sub folders and files in them) and a naming convention that is random, but sequential, short example:
/17688EE/
/18503HH/
/19600SL/
/20659CS/
Consider that there are 10,000+ folders between each block (17000 block, then 18000 block, etc.). Naming convention: number 00000 + letters A-Z (i.e. 17000AZ-17000ZA), so the folders can easily be sorted by name.
Consider that the tar command is being run in a screen with verbose output in order to check on the "progress" of that command.
screen -S compress
In theory, I had assumed I could simply look at the output of that screen, but I notice that tar does not seem to be processing the folders either in the order they were created or sorted by folder name.
Therefore my question is two fold:
Other than looking at the verbose output of the TARBALL and guessing;
Is there any way to find out how long the compression process will take to complete? (such as adding a -tack option onto the tar command to show estimated time to completion, something similar to the % complete of an scp command)
In what order does the TAR command decide to compress the folders? ( and is there a way to tell the command to "sort by" date/name during compression?)
To elaborate: after 20 minutes of waiting for the 17001AA-to-AZ block to compress, I figured the 17001BA-to-BZ block would be next, but this is not the case. The verbose output shows what seems to be tar grabbing folders at random, sorted neither by name nor by date.
Simply put: What determines the sort order during compression?

If you give tar a list of directory names, the order of the entries in the tar file will match the order that readdir returns filenames from the filesystem. The fact that you are compressing the tar file has no bearing on the order.
Here is a quick example to illustrate what happens on a Linux ext4 filesystem. Other filesystems may behave differently.
First create a new directory with three files, a1, a2 and a3
$ mkdir fred
$ cd fred
$ touch a1 a2 a3
Now let's see the order in which readdir returns the files. The -U option makes ls return the filenames unsorted, in the order they are stored in the directory.
$ ls -U
a3 a1 a2
As you can see, on my Linux setup the files are returned in an apparently random order.
Now stick the files in a tar file. Note I'm giving tar a directory name for the input file ("." in this instance) to make sure it has to call readdir behind the scenes.
$ tar cf xxx.tar .
And finally, let's see the order in which tar has stored the files.
$ tar tf xxx.tar
./
./a3
./a1
./a2
The order of the files a1, a2 and a3 matches the order that readdir returned the filenames from the filesystem. The . filename is present because it was explicitly included on the command line passed to tar.
If you want to force an order you will have to give tar a sorted list of filenames. The example below shows how to get tar to read the list of filenames from stdin, using the -T - command line option.
$ ls a* | tar cvf yyy.tar -T -
a1
a2
a3
In this toy example the list of filenames will be automatically sorted because the shell sorts the filenames that match the wildcard a*.
And just to confirm, this is what is in the tar file.
$ tar tf yyy.tar
a1
a2
a3
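In your use-case a combination of the find and sort commands piped into tar should allow you to create a sorted tar file with as many entries as you like. Because find already lists every directory and every file beneath it, tell tar not to recurse into the directories it is handed; otherwise each file ends up in the archive more than once.
Something like this as a starting point.
find /home/wwwsite/public_html | sort | tar -cvzf _backup.tar.gz --no-recursion -T -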
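The find | sort | tar pipeline can be tried out safely on a throwaway tree first. The following is a minimal sketch, assuming a tar that supports --no-recursion and -T - (GNU tar and bsdtar both do); the directory name demo and the file name index.html are made up for illustration.

```shell
#!/bin/sh
# Sketch: build a tar archive whose members are stored in sorted name order.
# Assumption: tar supports --no-recursion and -T -. "demo" is a made-up name.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Create the directories in a deliberately unsorted order.
mkdir -p demo/18503HH demo/17688EE demo/20659CS demo/19600SL
for d in demo/*/; do
    touch "${d}index.html"
done

# find emits every directory and file; sort fixes the order (LC_ALL=C makes
# the sort byte-wise and independent of the locale); --no-recursion stops tar
# from descending again into directories that find has already expanded.
find demo | LC_ALL=C sort | tar -czf "$workdir/sorted.tar.gz" --no-recursion -T -

# List the archive to see the order in which the members were stored.
tar -tzf "$workdir/sorted.tar.gz"
```

Newer GNU tar releases also accept --sort=name, which sorts each directory's entries by name without the external pipeline; whether that option is available depends on the installed tar version.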

Related

Merge files with same name in more than 100 folders

I have a problem similar to "Merge files with same name in different folders". I have about 100 different folders, each containing a .txt file "replaced_txt". I need to merge those files, but since there are 100 different folders I want to know if there is something quicker than doing:
cat /folder1/replaced_txt /folder2/replaced_txt /folder3/replaced_txt ...

The cat command is just about the simplest there is, so there is no obvious and portable way to make the copying of file contents any faster. The bottleneck is probably going to be finding the files, anyway, not in copying them. If indeed the files are all in subdirectories immediately below the root directory,
cat /*/replaced_txt >merged_txt
will expand the wildcard alphabetically (so /folder10/replaced_txt comes before /folder2/replaced_txt) but might run into "Argument list too long" and/or take a long time to expand the wildcard if some of these directories are large (especially on an older Linux system with an ext3 filesystem, which doesn't scale to large directories very well). A more general solution is find, which is better at finding files in arbitrarily nested subdirectories, and won't run into "Argument list too long" because it never tries to assemble all the file names into an alphabetized list; instead, it just enumerates the files it finds as it traverses directories in whichever order the filesystem reports them, and creates a new cat process when the argument list fills up to the point where the system's ARG_MAX limit would be exceeded.
find / -type f -name replaced_txt -xdev -exec cat {} + >merged_txt
If you want to limit how far subdirectories will be traversed or you only want to visit some directories, look at the find man page for additional options.
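As an illustration of limiting the traversal depth and of controlling the merge order, here is a small sketch that runs in a temporary directory instead of /. It assumes GNU findutils and coreutils (-mindepth, -maxdepth, -print0, sort -z and sort -V are extensions, not POSIX); the folder names are invented for the demo.

```shell
#!/bin/sh
# Sketch: merge replaced_txt files found exactly one level below a root,
# in natural numeric order (folder2 before folder10).
# Assumption: GNU find, sort and xargs.
set -eu

root=$(mktemp -d)
for n in 1 2 10; do
    mkdir "$root/folder$n"
    printf 'from folder%s\n' "$n" > "$root/folder$n/replaced_txt"
done

# -mindepth 2 -maxdepth 2 restricts matches to <root>/<dir>/replaced_txt.
# NUL-separated names survive spaces in paths; sort -V is a "version" sort
# that compares embedded numbers numerically; xargs batches names for cat.
find "$root" -mindepth 2 -maxdepth 2 -type f -name replaced_txt -print0 \
    | sort -zV \
    | xargs -0 cat > "$root/merged_txt"

cat "$root/merged_txt"
```

With the three demo folders the merged file lists folder1, then folder2, then folder10, whereas the plain wildcard expansion would have put folder10 before folder2.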

How to create multiple files in each directories and then compress it through tar (BASH)

What I am currently struggling with is creating multiple files and storing them in each directory.
I have made 100 directories like this:
mkdir ./dir.{01..100}
What I want is to create 100 text files for each directory. So that the result will show like:
click on dir.01 at home dir, which has files named: 01.txt to 100.txt
Also, I want to compress all 100 directories, each containing 100 text files into one using tar.
I am struggling with:
creating 100 text files each in 100 directories
using tar to zip all 100 directories together.
I am more interested in creating 100 text files IN 100 directories. Also I am MUCH MORE interested in how to use tar to join all 100 directories together in a specific file (fdtar, for instance).

If you are fine with empty files,
touch ./dir.{01..100}/{01..100}.txt
If you need each file to contain something, use that as the driver in a loop:
for file in ./dir.{01..100}/{01..100}.txt; do
    printf "This is the file %s\n" "$file" >"$file"
done
This could bump into ARG_MAX ("argument list too long") on some platforms, but it works fine on MacOS and should work fine on any reasonably standard Linux.
Splitting the loop into an inner and an outer loop could work around that problem:
for dir in ./dir.{01..100}; do
    for file in {01..100}.txt; do
        # pass both values that the two %s placeholders expect
        printf "This is file %s/%s\n" "$dir" "$file" >"$dir/$file"
    done
done

If I understand correctly, you need two things. First, you have 100 directories and need to create a file in each. With a for loop in bash, run from the parent directory that contains all the directories you created:
for n in dir.*
do
    # strip the leading "dir." to get the suffix, e.g. dir.01 -> 01
    f=${n#dir.}
    echo "This is file $n" >"$n/$f.txt"
done
Regarding tar that is even easier because tar will take multiple directories and glue them together. From the parent directory try:
tar cvf fd.tar dir.*
The c option creates the archive. v tells tar to print everything it is doing so you know what is happening. f fd.tar names the archive file to create.
When you undo the tar operation, you will use:
tar xvf fd.tar
In this case x will extract the contents of the tar archive and will create all 100 directories for you at the directory from which you invoke it.
Note that I have used fd.tar and not fdtar as the .tar extension is the customary way to signal that the file is a tar archive.
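Since the question also mentions compression, the create, list and extract round trip can be sketched with gzip compression added. This is only an illustration under assumptions: three directories stand in for the 100, and the names fd.tar.gz and restore are made up.

```shell
#!/bin/sh
# Sketch: create, list and extract a gzip-compressed tar archive.
# "fd.tar.gz" and "restore" are hypothetical names chosen for the demo.
set -eu

work=$(mktemp -d)
cd "$work"

# Three small directories stand in for the 100 from the question.
for d in dir.01 dir.02 dir.03; do
    mkdir "$d"
    echo "This is file $d" > "$d/01.txt"
done

tar czf fd.tar.gz dir.*          # c = create, z = gzip, f = archive name
tar tzf fd.tar.gz                # t = list the members without extracting

mkdir restore
tar xzf fd.tar.gz -C restore     # x = extract, -C = change into this directory first
```

Listing with t before extracting is a cheap way to confirm that every directory made it into the archive.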

How to find duplicate directories

Let's create a test directory tree:
#!/bin/bash
top="./testdir"
[[ -e "$top" ]] && { echo "$top already exists!" >&2; exit 1; }
mkfile() { printf "%s\n" "$(basename "$1")" > "$1"; }
mkdir -p "$top"/d1/d1{1,2}
mkdir -p "$top"/d2/d1some/d12copy
mkfile "$top/d1/d12/a"
mkfile "$top/d1/d12/b"
mkfile "$top/d2/d1some/d12copy/a"
mkfile "$top/d2/d1some/d12copy/b"
mkfile "$top/d2/x"
mkfile "$top/z"
The structure is: find testdir \( -type d -printf "%p/\n" , -type f -print \)
testdir/
testdir/d1/
testdir/d1/d11/
testdir/d1/d12/
testdir/d1/d12/a
testdir/d1/d12/b
testdir/d2/
testdir/d2/d1some/
testdir/d2/d1some/d12copy/
testdir/d2/d1some/d12copy/a
testdir/d2/d1some/d12copy/b
testdir/d2/x
testdir/z
I need to find the duplicate directories, but I need to consider only files (i.e. I should ignore (sub)directories without files). So, from the above test tree the wanted result is:
duplicate directories:
testdir/d1
testdir/d2/d1some
because both (sub)trees contain only the two identical files a and b (and several directories without files).
Of course, I could run md5deep -Zr ., or walk the whole tree with a perl script (using File::Find+Digest::MD5, or Path::Tiny, or the like) and calculate the files' md5 digests, but this doesn't help with finding the duplicate directories... :(
Any idea how to do this? Honestly, I have no idea.
EDIT
I don't need working code. (I'm able to code myself)
I "just" need some ideas "how to approach" the solution of the problem. :)
Edit2
The rationale behind why I need this: I have approx. 2.5 TB of data copied from many external HDDs as a result of a wrong backup strategy. E.g. over the years, whole $HOME dirs were copied onto (many different) external HDDs. Many sub-directories have the same content, but they're in different paths. So now I'm trying to eliminate the same-content directories.
And I need to do this by directories, because there are directories that have some duplicate files, but not all. Let's say:
/some/path/project1/a
/some/path/project1/b
and
/some/path/project2/a
/some/path/project2/x
i.e. a is a duplicate file (not only by name, but by content too), but it is needed for both projects. So I want to keep a in both directories, even though the files are duplicates. Therefore I'm looking for a "logic" for finding duplicate directories.

Some key points:
If I understand right (from your comment, where you said: "(Also, when me saying identical files I mean identical by their content, not by their name)"), you want to find duplicate directories, i.e. directories whose content is exactly the same as that of some other directory, regardless of the file names.
For this you must calculate a checksum or digest for each file. Identical digest = identical file (with great probability). :) As you already said, md5deep -Zr -of /top/dir is a good starting point.
I added the -of because for such a job you don't want to calculate the contents of symlink targets or other special files like fifos, just plain files.
Calculating the md5 for each file in a 2.5TB tree will surely take a few hours, unless you have a very fast machine. md5deep runs a thread for each CPU core. So, while it runs, you can write some scripts.
Also, consider running md5deep with sudo, because it would be frustrating to get error messages about unreadable files after a long run, only because you forgot to change the file ownerships... (Just a note) :) :)
For the "how to":
To compare "directories" you need to calculate a "directory-digest", for easy comparison and finding duplicates.
The most important thing is to realize the following key points:
You can exclude directories that contain files with unique digests. If a file is unique, i.e. has no duplicates, it is pointless to check its directory: a unique file in a directory means that the directory is unique too. So, the script should ignore every directory that contains files with unique MD5 digests (from md5deep's output).
You don't need to calculate the "directory-digest" from the files themselves (as you are trying to do in your follow-up question). It is enough to calculate the "directory-digest" from the already calculated md5 digests of the files; you just must ensure that you sort them first!
For example, if your directory /path/to/some contains only two files a and b, and
if file "a" has md5 : 0cc175b9c0f1b6a831c399e269772661
and file "b" has md5: 92eb5ffee6ae2fec3ad71c777531578f
you can calculate the "directory-digest" from the above file-digests, e.g. using the Digest::MD5 you could do:
perl -MDigest::MD5=md5_hex -E 'say md5_hex(sort qw( 92eb5ffee6ae2fec3ad71c777531578f 0cc175b9c0f1b6a831c399e269772661))'
and will get 3bc22fb7aaebe9c8c5d7de312b876bb8 as your "directory-digest". The sort is crucial(!) here, because the same command, but without the sort:
perl -MDigest::MD5=md5_hex -E 'say md5_hex(qw( 92eb5ffee6ae2fec3ad71c777531578f 0cc175b9c0f1b6a831c399e269772661))'
produces: 3a13f2408f269db87ef0110a90e168ae.
Note that the digests above aren't the digests of your files, but the resulting directory-digest will differ for directories with different files and will be the same for directories with identical files (because identical files have identical md5 file-digests). The sorting ensures that you always calculate the digest in the same order, e.g. if some other directory contains two files
file "aaa" has md5 : 92eb5ffee6ae2fec3ad71c777531578f
file "bbb" has md5 : 0cc175b9c0f1b6a831c399e269772661
using the above sort and md5 you will again get: 3bc22fb7aaebe9c8c5d7de312b876bb8, i.e. a directory containing the same files as above...
So, in this way you can calculate a "directory-digest" for every directory you have, and you can be sure that if you get another directory digest 3bc22fb7aaebe9c8c5d7de312b876bb8, that means this directory has exactly the above two files a and b (even if their names are different).
This method is fast, because you calculate the "directory-digests" only from small 32-character strings, so you avoid repeating expensive file-digest calculations.
The final part is easy now. Your final data should be in form:
3a13f2408f269db87ef0110a90e168ae /some/directory
16ea2389b5e62bc66b873e27072b0d20 /another/directory
3a13f2408f269db87ef0110a90e168ae /path/to/other/directory
From this it is easy to see that /some/directory and /path/to/other/directory are identical, because they have identical "directory-digests".
Hm... All of the above is only a few lines of perl script. It would probably be faster to write the perl script here directly than the above long textual answer, but you said you don't want code... :) :)
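For readers who do want to experiment, the directory-digest idea can be sketched in shell rather than perl. This is an illustration under assumptions, not the answer's own script: it assumes GNU md5sum and uniq (for -w and -D), paths without tabs, newlines or backslashes, and it hashes the sorted digests joined by newlines, so its values will not match the perl one-liner's. The helper name dir_digests is made up. It only looks at the files directly inside each directory, not at whole subtrees.

```shell
#!/bin/sh
# Sketch of the "directory-digest" idea. Assumptions: GNU md5sum/uniq,
# no tabs, newlines or backslashes in paths. dir_digests is a hypothetical name.
set -eu

# dir_digests LIST: LIST holds md5sum-style lines, "<32 hex digits>  <path>".
# Prints "<directory-digest>  <directory>" for each directory seen in LIST.
dir_digests() {
    bydir=$(mktemp)
    # Rewrite "<md5>  <dir>/<file>" as "<dir><TAB><md5>".
    awk '{ path = substr($0, 35); sub("/[^/]*$", "", path);
           print path "\t" $1 }' "$1" > "$bydir"
    cut -f1 "$bydir" | LC_ALL=C sort -u |
    while IFS= read -r dir; do
        # Sort this directory's file digests, then hash the sorted list.
        digest=$(awk -F '\t' -v d="$dir" '$1 == d { print $2 }' "$bydir" |
                 LC_ALL=C sort | md5sum | cut -c1-32)
        printf '%s  %s\n' "$digest" "$dir"
    done
    rm -f "$bydir"
}

# Demo on a throwaway tree modelled on the question's test directories.
top=$(mktemp -d)
mkdir -p "$top/d1/d12" "$top/d2/d1some/d12copy"
for f in d1/d12/a d1/d12/b d2/d1some/d12copy/a d2/d1some/d12copy/b d2/x z; do
    basename "$f" > "$top/$f"
done

list=$(mktemp)
find "$top" -type f -exec md5sum {} + > "$list"

# Directories sharing a digest end up adjacent after sorting;
# uniq -w32 -D keeps only lines whose first 32 characters repeat.
dups=$(dir_digests "$list" | LC_ALL=C sort | uniq -w32 -D)
printf '%s\n' "$dups"
```

On a real 2.5TB tree the list file would come from the one long md5deep or md5sum run, so the per-directory step never touches the file contents again.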

A traversal can identify directories which are duplicates in the sense you describe. I take it that this is: if all files in a directory are equal to all files of another then their paths are duplicates.
Find all files in each directory and form a string with their names. You can concatenate the names with a comma, say (or some other sequence that is certainly not in any names). This is to be compared. Prepend the path to this string, so to identify directories.
Comparison can be done for instance by populating a hash with keys being strings with filenames and path their values. Once you find that a key already exists you can check the content of files, and add the path to the list of duplicates.
The strings with path don't have to be actually formed, as you can build the hash and dupes list during the traversal. Having the full list first allows for other kinds of accounting, if desired.
This is altogether very little code to write.
An example. Let's say that you have
dir1/subdir1/{a,b} # duplicates (files 'a' and 'b' are considered equal)
dir2/subdir2/{a,b}
and
proj1/subproj1/{a,b,X} # NOT duplicates, since there are different files
proj2/subproj2/{a,b,Y}
The above prescription would give you strings
'dir1/subdir1/a,b',
'dir2/subdir2/a,b',
'proj1/subproj1/a,b,X',
'proj2/subproj2/a,b,Y';
where the (sub)string 'a,b' identifies dir1/subdir1 and dir2/subdir2 as duplicates.
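The first step of this prescription can be sketched in shell. This is a rough illustration under assumptions: find supports -maxdepth, paths contain no tabs or newlines, and the helper name name_keys is invented. It groups directories by their sorted file-name list only; file contents still have to be compared afterwards. The sort is byte-wise, so upper-case names sort before lower-case ones and a key may read X,a,b rather than a,b,X.

```shell
#!/bin/sh
# Sketch: key each directory by the sorted, comma-joined names of its files
# and report keys shared by more than one directory.
# Assumptions: find supports -maxdepth; no tabs/newlines in paths.
set -eu

# name_keys ROOT (hypothetical helper): for every directory under ROOT that
# directly holds regular files, print "<name,name,...><TAB><directory>".
name_keys() {
    find "$1" -type d | LC_ALL=C sort |
    while IFS= read -r dir; do
        names=$(find "$dir" -maxdepth 1 -type f -exec basename {} \; |
                LC_ALL=C sort | paste -sd, -)
        if [ -n "$names" ]; then
            printf '%s\t%s\n' "$names" "$dir"
        fi
    done
}

# Rebuild the example layout in a temporary directory.
root=$(mktemp -d)
for f in dir1/subdir1/a dir1/subdir1/b dir2/subdir2/a dir2/subdir2/b \
         proj1/subproj1/a proj1/subproj1/b proj1/subproj1/X \
         proj2/subproj2/a proj2/subproj2/b proj2/subproj2/Y; do
    mkdir -p "$root/$(dirname "$f")"
    : > "$root/$f"
done

# The awk arrays play the role of the hash: key = file list, value = paths.
candidates=$(name_keys "$root" | awk -F '\t' '
    { count[$1]++; dirs[$1] = dirs[$1] " " $2 }
    END { for (k in count) if (count[k] > 1) print k ":" dirs[k] }')
printf '%s\n' "$candidates"
```

Only the key a,b is reported here, with dir1/subdir1 and dir2/subdir2 as candidate duplicates; the two proj directories have different file lists and drop out.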
I don't see how you can avoid a traversal to build a system that accounts for all files.
The procedure above is the first step, not handling directories with files and subdirectories.
Consider
dirA/ dirB/
a b sdA/ a X sdB/
c d c d
Here the paths dirA/sdA/ and dirB/sdB/ are duplicates by the problem description but the whole dirA/ and dirB/ are distinct. This isn't shown in the question but I'd expect it to be of interest.
The procedure from the first part can be modified for this. Iterate through directories, forming a path component at every step. Get all files in each, and all subdirectories (if none we are done). Append the comma-separated file list to the path component (/sdA/). So the representation of the above is
'dirA/sdA,a,b/c,d', 'dirB/sdB,a,X/c,d'
For each file-list substring (c,d) found to already exist we can check its path against the existing one, component by component. Now a hash with keys like c,d won't do since this example has the same file-list for distinct hierarchies, but a modified (or other) data structure is needed.
Finally, there may be more subdirectories parallel to sdA (say sdA2). We care only about its own path, apart from the parallel files (a,b in that component of the path, dirA/sdA2,a,b/). So keep in mind all bottom-level file-lists (c,d) with their paths and, if file-lists are equal and paths are of the same length, check whether their paths have equal a,b file-lists in each path component.
I don't know whether this is a workable solution for you, but I'd expect "near-duplicates" to be rare -- the backup is either a duplicate or not. So there may not be much need to handle further edge-cases in complex sprawling hierarchies. This procedure should be at least a useful pre-selection mechanism that would greatly reduce the need for further work.
This assumes that equal file-names very likely indicate equal files. A part of that is my expectation that if a file was even just renamed it still cannot be considered a duplicate. If this is not so this approach won't work and one would need something along the lines of the answer by jm666.

I made a tool that searches for duplicate folders.
https://github.com/un1t/dirdups
dirdups testdir -i 1
The -i 1 option treats folders as duplicates if they have at least 1 file in common. Without this option the default value is 10.
In your case it will find the following directories:
testdir/d1/d12/
testdir/d2/d1some/d12copy/

Create a .tar.bz2 file given an array of files

In a Bash script, I have an array that contains a list of files (in the form of their complete file paths):
declare -a individual_files=("/path/to/a" "/path/to/b" "/path/to/c")
I want to create a compressed file in tar.bz2 which contains all the files in the array, using tar command.
So far, I have tried
tar rf files.tar "${individual_files[@]}"
tar cjf files.tar.bz2 files.tar
But for some reason, files.tar.bz2 always contains the last file in the array only.
What is the correct command(s) for doing so, preferably without creating the intermediate .tar file?
UPDATED: using @PanRuochen's answer, this is what I see in the verbose info:
+ tar cfvj /Users/skyork/test.tar.bz2 /Users/skyork/.emacs /Users/skyork/.Rprofile /Users/skyork/.aspell.en.pws /Users/skyork/.bash_profile /Users/skyork/.vimrc /Users/skyork/com.googlecode.iterm2.plist
tar: Removing leading '/' from member names
a Users/skyork/.emacs
a Users/skyork/.Rprofile
a Users/skyork/.aspell.en.pws
a Users/skyork/.bash_profile
a Users/skyork/.vimrc
a Users/skyork/com.googlecode.iterm2.plist
But still, the resulting test.tar.bz2 file has only the last file of the array (/Users/skyork/com.googlecode.iterm2.plist) in it.
My bad, the files are indeed there but hidden.

tar cfvj files.tar.bz2 "${individual_files[@]}"
v should give you verbose information about how the bz2 file is created.

diff of identical tar archive return they are not identical

I have a script that generates a tar archive using the command
tar -zacf /tmp/foo.tar.gz /home/yotam/foo
It then checks whether a tar file exists in a certain folder, and checks whether there are any changes between the two archives; if so, it keeps the new one
if ! [ -e /home/yotam/barr/foo.tar.gz ]; then
cp /tmp/foo.tar.gz /home/yotam/bar/foo.tar.gz
cond=1
else
#compare
diff --brief <(sort /tmp/foo.tar.gz) <(sort /home/yotam/bar/foo.tar.gz) >/dev/null
cond=$?
fi
if [ $cond -eq 1 ]; then
rm /home/yotam/bar/foo.tar.gz
cp /tmp/foo.tar.gz /home/yotam/foo.tar.gz
fi
However, this script always views the two archive files as different, even if I haven't changed anything in either archive or in the foo folder itself. What is wrong with my check?
Edit:
For what it's worth, replacing the diff line with
diff --brief /tmp/foo.tar.gz /home/yotam/bar/foo.tar.gz >/dev/null
yields the same result.

I'm not sure that a gzip archive can be used as a hash function. Perhaps the gzip implementation relies on the current date and time, and so produces different output on each execution.
I'd recommend using a widely used hash function. Take a look at git's internal hash implementation, or at shasum, for example.
More at: How does git compute file hashes?

It looks like you're doing a line-wise comparison of gzipped tar archives, after sorting the lines. There are multiple reasons why this is a bad idea (for one: sorting something that is gzipped by line doesn't make sense). To check whether 2 files are identical, either use diff file1 file2, or calculate a hash for each file (with md5/md5sum filename) and compare those.
The problem is that gzip stores metadata in the archive header, such as the original file name and a timestamp. If you gzip 2 identical files, you can therefore get 2 different archives.
So what can you do to solve this? For one, you can compare gunzipped versions of both files: diff <(gzcat out/out2.tar.gz) <(gzcat out2.tar.gz) (gzcat is the macOS name; on Linux use zcat or gzip -dc). I assume you have the sort in there in case the files get tarred in a different order, but I don't think you have to worry about that. If that is a problem for you, check out something like tarsum. This will give you a better result, since if you use sort, you will not notice a line moving from one file to the other, or two lines switching places within a file.
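The effect of the gzip header can be reproduced with a small sketch. It assumes gzip, cmp and GNU sha256sum are installed; the helper name same_payload and the file names are made up. The same tar file is compressed twice, once with gzip's default header and once with -n, which omits the original name and timestamp.

```shell
#!/bin/sh
# Sketch: two gzip files can differ byte for byte while their decompressed
# content is identical. Assumptions: gzip, cmp, GNU sha256sum.
set -eu

# same_payload A B (hypothetical helper): succeeds when both gzip files
# decompress to the same bytes, whatever their gzip headers contain.
same_payload() {
    a=$(gzip -dc "$1" | sha256sum | cut -d' ' -f1)
    b=$(gzip -dc "$2" | sha256sum | cut -d' ' -f1)
    [ "$a" = "$b" ]
}

work=$(mktemp -d)
mkdir "$work/foo"
echo "hello" > "$work/foo/file.txt"
tar -cf "$work/foo.tar" -C "$work" foo

# Default header (original name and timestamp stored) versus -n (neither).
gzip -c  "$work/foo.tar" > "$work/one.tar.gz"
gzip -nc "$work/foo.tar" > "$work/two.tar.gz"

if cmp -s "$work/one.tar.gz" "$work/two.tar.gz"; then
    echo "compressed files: identical"
else
    echo "compressed files: differ"
fi

if same_payload "$work/one.tar.gz" "$work/two.tar.gz"; then
    echo "tar payload: identical"
else
    echo "tar payload: differs"
fi
```

In a backup script, a check along the lines of same_payload could replace the diff on the compressed files. Note that the tar stream itself also records file metadata such as modification times, so touching a file in foo still changes the payload.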
