Zipping two files with the same content gives different md5sums (macOS)

I ran the following on my Mac:
$ echo "dgrgrrgrgrg" > test1.txt
After a few seconds, I copied test1.txt:
$ cp test1.txt test2.txt
$ ls -l
total 16
-rw-r--r-- 1 hqfy staff 12 Mar 31 10:18 test1.txt
-rw-r--r-- 1 hqfy staff 12 Mar 31 10:19 test2.txt
Now check the md5sum:
$ md5 *.txt
MD5 (test1.txt) = 8bab5a3e202c901499d83cb25d5a8c80
MD5 (test2.txt) = 8bab5a3e202c901499d83cb25d5a8c80
test1.txt and test2.txt clearly have the same md5sum. Now I zip these two files:
$ zip -X test1.zip test1.txt
adding: test1.txt (deflated 8%)
$ zip -X test2.zip test2.txt
adding: test2.txt (deflated 8%)
$ ls -l
total 32
-rw-r--r-- 1 hqfy staff 12 Mar 31 10:18 test1.txt
-rw-r--r-- 1 hqfy staff 127 Mar 31 10:22 test1.zip
-rw-r--r-- 1 hqfy staff 12 Mar 31 10:19 test2.txt
-rw-r--r-- 1 hqfy staff 127 Mar 31 10:23 test2.zip
test1.zip and test2.zip are the same size, but when I check the md5sum:
$ md5 *.zip
MD5 (test1.zip) = af8783f96ce98aef717ecf6229ffb07e
MD5 (test2.zip) = 59e752a03a2930adbe7f30b9cbf14561
I googled it and tried zip with the -X option, but that did not work in my case. How can I create the two zip files with the same md5sum?

Quoting from the zip man page:
With -X, zip strips all old fields and only includes the Unicode and
Zip64 extra fields (currently these two extra fields cannot be
disabled).
So, a different md5sum is expected when zipping (even with -X).

I know that this question is very old, but I may have an answer for you:
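The timestamps of the two files (which are very obviously different) are included in the .zip file. That is why the md5sums are different. The archives also store the member names (test1.txt and test2.txt), which differ as well. If you can somehow make the timestamps and names match, then the md5sums will be the same.
Also note that archives created from the macOS Finder contain an extra folder (__MACOSX) that holds additional metadata. The command-line zip used above normally does not add it, but it is worth ruling out.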
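The idea of removing the timestamp difference can be sketched as follows. This is a minimal sketch, not a definitive recipe: zip_fixed is a hypothetical helper, and the member name payload.txt and the timestamp 2020-01-01 00:00 are arbitrary assumptions. It copies the input to a fixed name, pins the modification time with touch -t, and then zips with -X so that the remaining extra fields are stripped. It assumes both inputs have the same permissions.

```shell
# zip_fixed SRC OUT: archive a copy of SRC under a fixed member name and a
# fixed modification time, so that equal content should give equal archives.
# "payload.txt" and the timestamp are arbitrary choices for this sketch.
zip_fixed() {
  src=$1
  # zip runs inside a temporary directory, so make the output path absolute.
  case $2 in /*) out=$2 ;; *) out=$PWD/$2 ;; esac
  tmp=$(mktemp -d) || return 1
  cp "$src" "$tmp/payload.txt" &&
    touch -t 202001010000 "$tmp/payload.txt" &&
    (cd "$tmp" && zip -q -X "$out" payload.txt)
  status=$?
  rm -rf "$tmp"
  return "$status"
}
```

Usage would look like zip_fixed test1.txt test1.zip followed by zip_fixed test2.txt test2.zip, after which md5 test1.zip test2.zip can be compared. The trade-off is that the original file name is not kept inside the archive.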


Bash - Version Numbers in Filenames. How to list latest versions only?

I have a directory of versioned files. The version of each file is indicated within its filename, e.g. "_v1".
Example
List of files shown by ls:
123_FileA_v1.txt
123_FileA_v2.txt
132_FileB_v1.txt
I want to run a command to see only the latest versions:
123_FileA_v2.txt
132_FileB_v1.txt
My first attempt was to list files by mtime using
ls -ltr
But in my case this doesn't give adequate results. I really want to take the versions from the filenames.
What would be the best way to do it?
This will do it :
ls | awk -F '_' '!prefixes[$1]++'
Hope it helps!
Edit :
If you want to see specific info you can do :
ls | awk -F '_' '!prefixes[$1]++' | xargs ls -lh
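This will work as long as there are no spaces in your filenames. Note that awk keeps the first file it sees for each prefix, so the listing has to put the newest version first (for example with ls -r).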
Edit :
As requested by @PaulHodges, here is the sample output:
$ ls -lh
total 0
drwxr-xr-x 5 Matias-Barrios Matias-Barrios 160B Feb 27 11:40 .
drwxr-xr-x 106 Matias-Barrios Matias-Barrios 3.3K Feb 27 11:39 ..
-rw-r--r-- 1 Matias-Barrios Matias-Barrios 0B Feb 27 11:40 132_FileB_v1.txt
-rw-r--r-- 1 Matias-Barrios Matias-Barrios 0B Feb 27 11:40 123_FileA_v2.txt
-rw-r--r-- 1 Matias-Barrios Matias-Barrios 0B Feb 27 11:40 123_FileA_v1.txt
$ ls | awk -F '_' '!prefixes[$1]++'
.
..
132_FileB_v1.txt
123_FileA_v2.txt
You could do something like
(
PATTERN="[0-9]{3}_[^_]*"
for prefix in `find . | egrep -o "$PATTERN" | sort -u`;
do
ls $prefix* | tail -1;
done
)
It will print
123_FileA_v2.txt
132_FileB_v1.txt
What happens here?
The surrounding parentheses ( ) run the code in a subshell, which supports copy & paste of the provided code.
The variable PATTERN is used to match the prefix shared by all versions of a file.
The for prefix in `find . | egrep -o "$PATTERN" | sort -u` generates a list of file prefixes.
The ls $prefix* lists all files with the same prefix in alphanumerical order.
The | tail -1 shows only the last entry of the former ls $prefix*.
Edit
I decided to use find . instead of ls *. With that I hope to circumvent the issues with ls *. Please correct me, if I'm wrong!
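Both approaches above depend on the order in which the names are listed, and alphabetical order would put _v10 before _v2. A possible alternative is to compare the version numbers numerically. The following is a minimal sketch under stated assumptions: latest_versions is a hypothetical helper name, and it assumes every name ends in _vN followed by a single extension.

```shell
# latest_versions: read file names on stdin and print, for each name stem,
# the entry with the numerically highest _vN suffix.
latest_versions() {
  awk '{
    stem = $0; sub(/_v[0-9]+\.[^.]*$/, "", stem)   # e.g. 123_FileA
    ver = $0;  sub(/^.*_v/, "", ver); sub(/\..*$/, "", ver)
    ver += 0                                        # compare as a number
    if (!(stem in best) || ver > bestver[stem]) {
      best[stem] = $0; bestver[stem] = ver
    }
  } END { for (s in best) print best[s] }' | sort
}
```

It could be called as ls | latest_versions. The final sort is only there because awk does not guarantee the order of a for-in loop.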

OSX How to have ls -l sort in alphabetical order and list directories and files together

I want my ls -l command to list both files and directories together rather than separating them. I also want a case insensitive list. For example, the following commands create the directories a and C and also the file b.txt:
% mkdir a C
% touch b.txt
Then I list them
tyler@Tylers-MacBook-Pro test % ls -l
total 0
drwxr-xr-x 2 tyler staff 64 Feb 12 12:06 C
drwxr-xr-x 2 tyler staff 64 Feb 12 12:06 a
-rw-r--r-- 1 tyler staff 0 Feb 12 12:06 b.txt
Note how the order is C, a, b.txt. I want it to list: a, b.txt, C (like this):
tyler@Tylers-MacBook-Pro test % ls -l
total 0
drwxr-xr-x 2 tyler staff 64 Feb 12 12:06 a
-rw-r--r-- 1 tyler staff 0 Feb 12 12:06 b.txt
drwxr-xr-x 2 tyler staff 64 Feb 12 12:06 C
How do I get this case-insensitive listing that doesn't separate files and directories?
Combined with sort, this should be what's required:
ls -l | sort -f -k 9,9
-f -k 9,9 means sort case-insensitively (-f) by the 9th column (-k 9,9), which is the file name.
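The sort-based answer can be wrapped up as below. This is only a sketch: list_sorted is a hypothetical name, the C locale is forced so that the date always takes three columns, and it assumes owner and group names contain no spaces (otherwise the file name is no longer the 9th column).

```shell
# list_sorted [DIR]: long listing of DIR, ordered case-insensitively by name.
list_sorted() {
  # tail drops the "total" line; -k 9 sorts from the 9th column to the end
  # of the line, so names containing spaces are still compared as a whole.
  LC_ALL=C ls -l "${1:-.}" | tail -n +2 | LC_ALL=C sort -f -k 9
}
```

Using -k 9 rather than -k 9,9 is a small design choice: it makes the comparison cover everything after the 8th column instead of only the first word of the name.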

Concatenated file is smaller than the sum of the files

I have done these commands to concatenate the files into one file:
$ ls -1 | wc -l
16916
$ ls -1 *.txt | wc -l
16916
$ ls -lh | head -1
total 93M
$ cat *.txt > ../nectar_3.txt
$ ls -lh ../nectar_3.txt
-rw-r--r-- 1 llopis llopis 52M May 25 16:03 ../nectar_3.txt
Why is the resulting file size half of the sum of the sizes of all the files? The only explanation I can find is about rounding in the ls -lh command, but I couldn't find anything definite (using ls -lk outputs almost the same, 92.76953125M).
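The total is not the sum of the file sizes. It reports the disk blocks allocated to the files, and every file occupies a whole number of blocks, so the figure is rounded up per file and many small files add up to much more than their combined byte count:
Simple example: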
marc#panic$ ls -lk
total 24
-rw-r--r-- 1 marc marc 6000 May 25 08:39 test1.txt
-rw-r--r-- 1 marc marc 7000 May 25 08:39 test2.txt
-rw-r--r-- 1 marc marc 8000 May 25 08:39 test3.txt
Three simple files, total size = 21,000 bytes, yet the total shows 24 (in KiB), which is consistent with each file occupying two 4 KiB blocks.
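To see the two figures side by side, the byte count of the contents can be compared with the space allocated on disk. A minimal sketch; apparent_bytes and allocated_kib are hypothetical helper names, and the allocated figure depends on the file system.

```shell
# apparent_bytes DIR: total number of bytes in the *.txt files of DIR,
# i.e. the size a concatenation of them would have.
apparent_bytes() {
  echo $(( $(cat "$1"/*.txt | wc -c) ))
}

# allocated_kib DIR: disk space allocated to DIR in 1 KiB units, as reported
# by du (this also counts the directory entry itself).
allocated_kib() {
  du -sk "$1" | awk '{ print $1 }'
}
```

For the three-file example above, apparent_bytes would report 21000, while allocated_kib would be expected to land near the 24 shown by ls on a file system with 4 KiB blocks.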

Zip a directory while retaining relative path

I have a directory of files:
/home/user/files/1.txt
/home/user/files/2.txt
/home/user/files/3.txt
I'd like to zip up the files directory into files.zip so when extracted I get:
files/1.txt
files/2.txt
files/3.txt
I know I can do:
# bash
cd /home/user; zip -r files.zip files/
Is there a way to do this without cding to the user directory?
I know that the --junk-paths flag will store just the filenames and junk the path but I'd like to keep the files directory as a container.
I couldn't find a direct way using the zip command, but you can try the tar command with the -C option.
$ pwd
/home/shenzi
$ ls -l giga/files
total 3
-rw-r--r-- 1 shenzi Domain Users 3 Aug 5 11:24 1.txt
-rw-r--r-- 1 shenzi Domain Users 4 Aug 5 11:25 2.txt
-rw-r--r-- 1 shenzi Domain Users 9 Aug 5 11:25 3.txt
$ tar -C giga -cvf files.zip files/*
files/1.txt
files/2.txt
files/3.txt
$ tar -tvf files.zip
-rw-r--r-- shenzi/Domain Users 3 2014-08-05 11:24 files/1.txt
-rw-r--r-- shenzi/Domain Users 4 2014-08-05 11:25 files/2.txt
-rw-r--r-- shenzi/Domain Users 9 2014-08-05 11:25 files/3.txt
Use -xvf to extract. Note that the result is a tar archive, not a zip file, despite the .zip name.
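If a real zip file is needed, one common workaround is to run the cd inside a subshell, so that the working directory of the calling shell never changes. A minimal sketch; zip_in_parent is a hypothetical helper name, not a zip feature.

```shell
# zip_in_parent DIR ARCHIVE: zip DIR so that member paths start at its
# last path component (files/1.txt rather than home/user/files/1.txt).
zip_in_parent() {
  dir=${1%/}
  # zip runs from the parent of DIR, so make the archive path absolute.
  case $2 in /*) out=$2 ;; *) out=$PWD/$2 ;; esac
  # The parentheses start a subshell; its cd does not affect the caller.
  (cd "$(dirname "$dir")" && zip -q -r "$out" "$(basename "$dir")")
}
```

For the layout in the question this would be called as zip_in_parent /home/user/files files.zip from any directory.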

Bash problem. Xargs and problem using the basename command in argument list substitution

I'm using bash shell.
Hi all,
I would be glad if someone could offer some advice; googling around yielded some answers,
but I still couldn't get the script to work.
I am new to bash scripting and was given a script to modify because it was failing to copy
a large number of files from an input directory to an output directory after the files were processed.
Description:
We have a bunch of PDFs in a large directory.
We process a file called filename.pdf; after it is processed, an additional file called filename.pdf.marker is created.
Then both files, filename.pdf.marker and filename.pdf, should be moved from the input/in directory to the output/out directory.
We work with about 10-15 thousand files.
The script should do the following:
1. select all .marker file names
2. move the .marker files from the input/in directory to the output/out directory (done in a separate line)
3. remove the .marker suffix from the selected file name
4. move the file filename.pdf to the output/out directory
Old script (didn't work for a larger number of files) :
FILELIST=$(ls ${V04}/*.pdf.marker 2> /dev/null | sort)
for FILEMARKER in ${FILELIST}; do
FILENAME=${V04}/$(basename $FILEMARKER .marker)
mv ${FILENAME} ${VLOGDIR}/.
mv ${FILENAME}.marker ${VLOGDIR}/.
done
Because of that I needed to use xargs command.
Problem:
I managed to move the .marker files in a separate line.
Now I need to move the .pdf files with this script line.
find /input/in -iname "*.marker" -print0 | xargs -0 -r -I {} mv `basename {} .marker` /output/out
My problem lies in the part: `basename {} .marker`
Why isn't the string filename.pdf extracted from the string filename.pdf.marker and substituted into the mv command?
Any help is welcome ;)
UPDATED
Corrected description of what script should do: Both filetypes .pdf
and .pdf.marker should be moved in my script not copied.
Added old script that didn't work well for larger amount of files.
The problem is that the command in backticks is executed once before xargs is ever invoked.
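The effect can be reproduced without touching any files. In this small sketch, the file name filename.pdf.marker is just sample input and echo stands in for mv:

```shell
# The shell expands the backticks before xargs starts, so basename only
# ever sees the literal string "{}" and returns it unchanged.
printf '%s\n' filename.pdf.marker | xargs -I {} echo `basename {} .marker`
# prints: filename.pdf.marker

# Handing the name to a child shell defers basename until xargs has
# substituted the real file name.
printf '%s\n' filename.pdf.marker | xargs -I {} sh -c 'basename "$1" .marker' sh {}
# prints: filename.pdf
```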
The fix is a bit harder, not least because your step 2 says 'copy' but the previous description suggests 'move'. I'd probably create a simple script to be invoked by xargs:
find /input/in -name '*.marker' -print0 | xargs -0 mover.sh
The contents of mover.sh might be:
for mrk_source in "$@"
do
pdf_source=$(echo "$mrk_source" | sed 's/\.marker$//')
mrk_target=$(echo "/output/out/$mrk_source" | sed 's%/input/in%%')
pdf_target=$(echo "/output/out/$pdf_source" | sed 's%/input/in%%')
mv "$mrk_source" "$mrk_target"
mv "$pdf_source" "$pdf_target"
done
Note that this code preserves any directory structure under /input/in but assumes that the corresponding directory exists under /output/out (without checking). It would be possible to alter the code to flatten any directory structure, or to create the directories as needed (exercise for the reader). There is a small sleight-of-hand going on in the file name manipulation in the two xxx_target assignment lines; I think it will work OK for relative names as well as absolute names, but be a little cautious with that part (test before using, in other words).
tripleee commented:
The echo and sed invocations are very brittle -- for example, echo on some platforms will interpret backslashes in the filename as escape sequences. Fortunately, you can use the shell's substitution mechanisms to mv "${mrk_source%.marker}" /output/out instead. (Why would you want to calculate the destination file name, when all you need to give to mv is the destination directory?)
I explained the destination file name - preserving sub-directories, so /input/in/dir1/abc.pdf goes to /output/out/dir1/abc.pdf; if you want to flatten the directory structure (or there is no directory structure), then simply specifying the destination is sufficient.
The problem with echo 'should not' be a problem in the sense that the original design of echo was simple and all the later additional ... baggage simply makes what should be utterly reliable into something horrendously unreliable. That said, there could be problems with names containing backticks, $(...) and so on. There are no problems with backticks or $(...) in the names. There is a problem with backslashes in the name.
$ mkdir -p input/in output/out
$ for name in a b 'c d' 'e f g' '$(cat x)' '`cat y`' 'a\\nb'
> do
> cp /dev/null input/in/"$name.pdf"
> cp /dev/null "input/in/$name.pdf.marker"
> done
$ ls -lR [io]*
input:
total 0
drwxr-xr-x 16 jleffler staff 544 Aug 22 00:45 in
input/in:
total 0
-rw-r--r-- 1 jleffler staff 0 Aug 22 00:45 $(cat x).pdf
-rw-r--r-- 1 jleffler staff 0 Aug 22 00:45 $(cat x).pdf.marker
-rw-r--r-- 1 jleffler staff 0 Aug 22 00:45 `cat y`.pdf
-rw-r--r-- 1 jleffler staff 0 Aug 22 00:45 `cat y`.pdf.marker
-rw-r--r-- 1 jleffler staff 0 Aug 22 00:45 a.pdf
-rw-r--r-- 1 jleffler staff 0 Aug 22 00:45 a.pdf.marker
-rw-r--r-- 1 jleffler staff 0 Aug 22 00:45 a\\nb.pdf
-rw-r--r-- 1 jleffler staff 0 Aug 22 00:45 a\\nb.pdf.marker
-rw-r--r-- 1 jleffler staff 0 Aug 22 00:45 b.pdf
-rw-r--r-- 1 jleffler staff 0 Aug 22 00:45 b.pdf.marker
-rw-r--r-- 1 jleffler staff 0 Aug 22 00:45 c d.pdf
-rw-r--r-- 1 jleffler staff 0 Aug 22 00:45 c d.pdf.marker
-rw-r--r-- 1 jleffler staff 0 Aug 22 00:45 e f g.pdf
-rw-r--r-- 1 jleffler staff 0 Aug 22 00:45 e f g.pdf.marker
output:
total 0
drwxr-xr-x 2 jleffler staff 68 Aug 22 00:45 out
output/out:
$ find input/in -name '*.marker' -print0 | xargs -0 sh mover.sh
mv: rename input/in/a\nb.pdf to ./output/out/a
b.pdf: No such file or directory
$ ls -lR [io]*
input:
total 0
drwxr-xr-x 3 jleffler staff 102 Aug 22 00:46 in
input/in:
total 0
-rw-r--r-- 1 jleffler staff 0 Aug 22 00:45 a\\nb.pdf
output:
total 0
drwxr-xr-x 15 jleffler staff 510 Aug 22 00:46 out
output/out:
total 0
-rw-r--r-- 1 jleffler staff 0 Aug 22 00:45 $(cat x).pdf
-rw-r--r-- 1 jleffler staff 0 Aug 22 00:45 $(cat x).pdf.marker
-rw-r--r-- 1 jleffler staff 0 Aug 22 00:45 `cat y`.pdf
-rw-r--r-- 1 jleffler staff 0 Aug 22 00:45 `cat y`.pdf.marker
-rw-r--r-- 1 jleffler staff 0 Aug 22 00:45 a.pdf
-rw-r--r-- 1 jleffler staff 0 Aug 22 00:45 a.pdf.marker
-rw-r--r-- 1 jleffler staff 0 Aug 22 00:45 a\nb.pdf.marker
-rw-r--r-- 1 jleffler staff 0 Aug 22 00:45 b.pdf
-rw-r--r-- 1 jleffler staff 0 Aug 22 00:45 b.pdf.marker
-rw-r--r-- 1 jleffler staff 0 Aug 22 00:45 c d.pdf
-rw-r--r-- 1 jleffler staff 0 Aug 22 00:45 c d.pdf.marker
-rw-r--r-- 1 jleffler staff 0 Aug 22 00:45 e f g.pdf
-rw-r--r-- 1 jleffler staff 0 Aug 22 00:45 e f g.pdf.marker
$
Using the Bash built-ins is sensible; I'm still stuck in the 1980s on occasion, and need reminding of that.
Solution that works with backslashes etc
for mrk_source in "$@"
do
pdf_source=${mrk_source%.marker}
mrk_target=${mrk_source/\/input\/in/\/output\/out}
pdf_target=${pdf_source/\/input\/in/\/output\/out}
mv "$mrk_source" "$mrk_target"
mv "$pdf_source" "$pdf_target"
done
With the same set of input files, this code works cleanly.
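The ${var/pattern/replacement} form used above is a bash extension, so it may not work when the script is run with plain sh. A POSIX-only variant can be sketched with prefix and suffix removal instead; the sample path below is an assumption for illustration.

```shell
# Hypothetical marker path with a literal backslash in the file name.
mrk_source='/input/in/dir1/a\nb.pdf.marker'

pdf_source=${mrk_source%.marker}               # drop the .marker suffix
mrk_target=/output/out${mrk_source#/input/in}  # swap the leading directory
pdf_target=/output/out${pdf_source#/input/in}

# printf with %s prints the names without interpreting the backslash.
printf '%s\n' "$pdf_source" "$mrk_target" "$pdf_target"
```

Unlike the substitution form, ${var#prefix} only matches at the very start of the value, so this variant assumes the names really begin with /input/in.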
EDIT: As pointed out in the comments, this will not work if there are spaces in the filenames. In that case see @Jonathan Leffler's answer (even if there are no spaces now, you should probably use his version anyway, to avoid breakage when there suddenly are spaces...).
Since the command is expanded before it is executed, you can't use it that way. The command you'll give xargs would look like this:
xargs -0 -r -I {} mv {} /output/out
This is because basename runs on the literal string {} and tries to remove any path components, and the .marker suffix, from it, which leaves {} unchanged.
I'd say you want to use a loop in this case:
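# Note: word splitting on $(find ...) breaks file names that contain spaces.
for f in $(find /input/in -iname "*.marker"); do
    # Strip the suffix in place; basename would also drop the directory part.
    mv "${f%.marker}" /output/out
done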
With GNU Parallel you should be able to do:
ls "$V04"/*.pdf.marker | parallel -q mv {.} {} "$VLOGDIR"
This will work even if $V04 and $VLOGDIR contain ' " space \t.
Watch the intro video to learn more: http://www.youtube.com/watch?v=OpaiGYxkSuQ
The problem with the old script was that it tries to capture all entries in a single variable, which has size limits.
If the entries do not need to be sorted, you can avoid that like this:
ls -1 "${V04}"/*.pdf.marker | while read -r FM;
do
mv "${FM}" "${VLOGDIR}/"
mv "${V04}/$(basename "${FM}" .marker)" "${VLOGDIR}/"
done;
The backticks are executed at evaluation time, not when xargs runs. Perhaps try something like this?
find /input/in -iname "*.marker" -print0 |
xargs -r0 -i sh -c 'mv `basename "{}" .marker` /output/out; mv "{}" /output/out'
Edit: The shell is still problematic here; if the file name contains double quotes, it will not parse correctly. Using a separate script might be better:
find /input/in -iname "*.marker" -exec ./myscript {} \;
where myscript contains the simple moving commands:
#!/bin/sh
# Move the PDF that belongs to the marker path in $1, then the marker itself.
mv "${1%.marker}" /output/out
mv "$1" /output/out
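Pulling the answers above together, one way to keep xargs and still strip the suffix per file is to pass the names to a child shell as positional parameters. This is a minimal sketch rather than a drop-in replacement: move_marked and its two directory arguments are illustrative names, and the output directory is treated as flat (no sub-directories are recreated).

```shell
# move_marked SRC_DIR DEST_DIR: for every *.marker file under SRC_DIR, move
# the marker and the file it marks (same name without .marker) to DEST_DIR.
move_marked() {
  find "$1" -name '*.marker' -print0 |
    xargs -0 sh -c '
      dest=$1; shift
      for m in "$@"; do
        mv "${m%.marker}" "$m" "$dest"/
      done' sh "$2"
}
```

With the directories from the question this would be move_marked /input/in /output/out. Because the names travel as NUL-separated arguments and are only ever used inside double quotes, spaces, backslashes and backticks in file names should not need special handling, and xargs splits very long lists into several invocations on its own.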
