Shell BASH---SHA256 Hash Collisions occur in file extraction

Shell BASH---SHA256 Hash Collisions occur in file extraction - bash

I use sha256 hash value to encrypt a password( no matter digit form or char form ).
when I unzip the password-protected file, I can use at least two different hash to unzip my file -- it occurs a Hash collision. Although there is no searching point out this situation, I know md5 and sha-1 have hash collision. so what is the problem?
Case1: I use 5566 sha256 hash zip my file
zip -P be41b7f1fa56ba2b0582910053c86cf6ee7e311efc51300220df0918bb9a287b abc.zip abc
Reference Sha256(0138) = 687d579d0992a7895190ad126ba8051704753bdc85d52481a83da4670e2321d7
Reference Sha256(5566) = be41b7f1fa56ba2b0582910053c86cf6ee7e311efc51300220df0918bb9a287b
However, apart from 5566 hash value, I also can use 0138 hash value to unzip this file. The following code both success in file extraction.
unzip -P 687d579d0992a7895190ad126ba8051704753bdc85d52481a83da4670e2321d7 abc.zip
unzip -P be41b7f1fa56ba2b0582910053c86cf6ee7e311efc51300220df0918bb9a287b abc.zip
Case2: I can use 'daniel' sha256 hash and 'pivate' sha256 hash to unzip a file.
Thank you for your attention. Hope someone can solve my problems.
I am doing a password brute-forcing assignment, and I cannot get the correct cracked password
because of the above problem.

[Solved]
empty zip files may lead to problems in unzip.

Related

SHA256 hash doesn't match download - what now?

Hello stackoverflow World,
I'm investigating using the miniconda package manager for the first time.
I downloaded the files from here: https://docs.conda.io/en/latest/miniconda.html
I'm on a windows machine so downloaded the following file:
As I am hoping is obvious from my title, the check sum that my machine produces using the Windows certUtil -hashfile function produces a different check sum.
Now, my main issue is what to do now...!
Do I run screaming to the hills burning all my IT kit as I go, or is there a way to get to the bottom of this?
Thanks in advance

So interestingly, using the PowerShell approach rather than the cmd line, as specified in the miniconda download reference did result in a matching Hash key.
I thought that these were supposed to be independent of the program used to unpack the HASH...?

hash is not a universally defined algorithm:
A hash function is any function that can be used to map data of arbitrary size to fixed-size values (https://en.wikipedia.org/wiki/Hash_function)
So when you use a program to hash a file and want to compare it to a published value, you must make sure that you are using the same hash function. In your case, the miniconda download page already clarifies that it is a SHA256 hash, which you need to specify when calling certutil.
Proof:
Without specifying the hash function (SHA1 is used and - as expected - produces a different hash value):
certutil -hashfile Miniconda3-latest-Windows-x86_64.exe
SHA1 hash of Miniconda3-latest-Windows-x86_64.exe:
0b553f6b77926db707c4406cafc612d74301b24e
CertUtil: -hashfile command completed successfully.
Specifying the correct function produces the right hash value:
certutil -hashfile Miniconda3-latest-Windows-x86_64.exe SHA256
SHA256 hash of Miniconda3-latest-Windows-x86_64.exe:
6013152b169c2c2d4bcd75bb03a1b8bf208b8545d69116a59351af695d9a0081
CertUtil: -hashfile command completed successfully.

how to determine if a file is completely downloaded using kqueue?

I want to implement a function which monitor a directory and perform some action when a new file is downloaded from the Internet, but found it difficult to determine if the file is completely downloaded, is there a way to do that?

Usually tools that show the hash of a file will give the state of a file - this should be compared to the hash of another file - if identical then we know the file has downloaded successfully.
md5 (native to bsd) is available - but is only practical on a local file -
If you are retrieving the remote file via HTTP , then there is no way to get the hash of the file without downloading it first (whether it is to STDOUT or piped to file , using wget -O- or curl )
If the file host has a second file that contains the md5 hash of the file being downloaded - then a comparison of the locally downloaded hash is comparable to the hash provided by the file provider.
To do anything more swish will require a comprehensive program to be written - such as the combination of this question and accepted answer :
Python Compare local and remote file MD5 Hash

Besides MD5, there is a simple way to do this:
Partially downloaded file usually has a temporary filename, and it will be renamed to original filename after fully downloaded. You can make your program to ignore or monitor only certain filename extensions.

How to find duplicate directories

Let create some testing directory tree:
#!/bin/bash
top="./testdir"
[[ -e "$top" ]] && { echo "$top already exists!" >&2; exit 1; }
mkfile() { printf "%s\n" $(basename "$1") > "$1"; }
mkdir -p "$top"/d1/d1{1,2}
mkdir -p "$top"/d2/d1some/d12copy
mkfile "$top/d1/d12/a"
mkfile "$top/d1/d12/b"
mkfile "$top/d2/d1some/d12copy/a"
mkfile "$top/d2/d1some/d12copy/b"
mkfile "$top/d2/x"
mkfile "$top/z"
The structure is: find testdir \( -type d -printf "%p/\n" , -type f -print \)
testdir/
testdir/d1/
testdir/d1/d11/
testdir/d1/d12/
testdir/d1/d12/a
testdir/d1/d12/b
testdir/d2/
testdir/d2/d1some/
testdir/d2/d1some/d12copy/
testdir/d2/d1some/d12copy/a
testdir/d2/d1some/d12copy/b
testdir/d2/x
testdir/z
I need find the duplicate directories, but I need consider only files (e.g. I should ignore (sub)directories without files). So, from the above test-tree the wanted result is:
duplicate directories:
testdir/d1
testdir/d2/d1some
because in both (sub)trees are only two identical files a and b. (and several directories, without files).
Of course, I could md5deep -Zr ., also could walk the whole tree using perl script (using File::Find+Digest::MD5 or using Path::Tiny or like.) and calculate the file's md5-digests, but this doesn't helps for finding the duplicate directories... :(
Any idea how to do this? Honestly, I haven't any idea.
EDIT
I don't need working code. (I'm able to code myself)
I "just" need some ideas "how to approach" the solution of the problem. :)
Edit2
The rationale behind - why need this: I have approx 2.5 TB data copied from many external HDD's as a result of wrong backup-strategy. E.g. over the years, the whole $HOME dirs are copied into (many different) external HDD's. Many sub-directories has the same content, but they're in different paths. So, now I trying to eliminate the same-content directories.
And I need do this by directories, because here are directories, which has some duplicates files, but not all. Let say:
/some/path/project1/a
/some/path/project1/b
and
/some/path/project2/a
/some/path/project2/x
e.g. the a is a duplicate file (not only the name, but by the content too) - but it is needed for the both projects. So i want keep the a in both directories - even if they're duplicate files. Therefore me looking for a "logic" how to find duplicate directories.

Some key points:
If I understand right (from your comment, where you said: "(Also, when me saying identical files I mean identical by their content, not by their name)" , you want find duplicate directories, e.g. where their content is exactly the same as in some other directory, regardless of the file-names.
for this you must calculate some checksum or digest for the files. Identical digest = identical file. (with great probability). :) As you already said, the md5deep -Zr -of /top/dir is a good starting point.
I added the -of, because for such job you don't want calculate the contents of the symlinks-targets, or other special files like fifo - just plain files.
calculating the md5 for each file in 2.5TB tree, sure will take few hours of work, unless you have very fast machine. The md5deep runs a thread for each cpu-core. So, while it runs, you can make some scripts.
Also, consider run the md5deep as sudo, because it could be frustrating if after a long run-time you will get some error-messages about unreadable files, only because you forgot to change the files-ownerships...(Just a note) :) :)
For the "how to":
For comparing "directories" you need calculate some "directory-digest", for easy compare and finding duplicates.
The one most important thing is realize the following key points:
you could exclude directories, where are files with unique digests. If the file is unique, e.g. has not any duplicates, that's mean that is pointless checking it's directory. Unique file in some directory means, that the directory is unique too. So, the script should ignore every directory where are files with unique MD5 digests (from the md5deep's output.)
You don't need calculate the "directory-digest" from the files itself. (as you trying it in your followup question). It is enough to calculate the "directory digest" using the already calculated md5 for the files, just must ensure that you sort them first!
e.g. for example if your directory /path/to/some containing only two files a and b and
if file "a" has md5 : 0cc175b9c0f1b6a831c399e269772661
and file "b" has md5: 92eb5ffee6ae2fec3ad71c777531578f
you can calculate the "directory-digest" from the above file-digests, e.g. using the Digest::MD5 you could do:
perl -MDigest::MD5=md5_hex -E 'say md5_hex(sort qw( 92eb5ffee6ae2fec3ad71c777531578f 0cc175b9c0f1b6a831c399e269772661))'
and will get 3bc22fb7aaebe9c8c5d7de312b876bb8 as your "directory-digest". The sort is crucial(!) here, because the same command, but without the sort:
perl -MDigest::MD5=md5_hex -E 'say md5_hex(qw( 92eb5ffee6ae2fec3ad71c777531578f 0cc175b9c0f1b6a831c399e269772661))'
produces: 3a13f2408f269db87ef0110a90e168ae.
Note, even if the above digests aren't the digests of your files, but they're will be unique for every directory with different files and will be the same for the identical files. (because identical files, has identical md5 file-digest). The sorting ensures, that you will calculate the digest always in the same order, e.g. if some other directory will contain two files
file "aaa" has md5 : 92eb5ffee6ae2fec3ad71c777531578f
file "bbb" has md5 : 0cc175b9c0f1b6a831c399e269772661
using the above sort and md5 you will again get: 3bc22fb7aaebe9c8c5d7de312b876bb8 - e.g. the directory containing same files as above...
So, in such way you can calculate some "directory-digest" for every directory you have and could be ensured that if you get another directory digest 3bc22fb7aaebe9c8c5d7de312b876bb8 thats means: this directory has exactly the above two files a and b (even if their names are different).
This method is fast, because you will calculate the "directory-digests" only from small 32bytes strings, so you avoids excessive multiple file-digest-caclulations.
The final part is easy now. Your final data should be in form:
3a13f2408f269db87ef0110a90e168ae /some/directory
16ea2389b5e62bc66b873e27072b0d20 /another/directory
3a13f2408f269db87ef0110a90e168ae /path/to/other/directory
so, from this is easy to get: the
/some/directory and the /path/to/other/directory are identical, because they has identical "directory-digests".
Hm... All the above is only a few lines long perl script. Probably would be faster to write here directly the perl-script as the above long textual answer - but, you said - you don't want code... :) :)

A traversal can identify directories which are duplicates in the sense you describe. I take it that this is: if all files in a directory are equal to all files of another then their paths are duplicates.
Find all files in each directory and form a string with their names. You can concatenate the names with a comma, say (or some other sequence that is certainly not in any names). This is to be compared. Prepend the path to this string, so to identify directories.
Comparison can be done for instance by populating a hash with keys being strings with filenames and path their values. Once you find that a key already exists you can check the content of files, and add the path to the list of duplicates.
The strings with path don't have to be actually formed, as you can build the hash and dupes list during the traversal. Having the full list first allows for other kinds of accounting, if desired.
This is altogether very little code to write.
An example. Let's say that you have
dir1/subdir1/{a,b} # duplicates (files 'a' and 'b' are considered equal)
dir2/subdir2/{a,b}
and
proj1/subproj1/{a,b,X} # NOT duplicates, since there are different files
proj2/subproj2/{a,b,Y}
The above prescription would give you strings
'dir1/subdir1/a,b',
'dir2/subdir2/a,b',
'proj1/subproj1/a,b,X',
'proj2/subproj2/a,b,Y';
where the (sub)string 'a,b' identifies dir1/subdir1 and dir2/subdir2 as duplicates.
I don't see how you can avoid a traversal to build a system that accounts for all files.
The procedure above is the first step, not handling directories with files and subdirectories.
Consider
dirA/ dirB/
a b sdA/ a X sdB/
c d c d
Here the paths dirA/sdA/ and dirB/sdB/ are duplicates by the problem description but the whole dirA/ and dirB/ are distinct. This isn't shown in the question but I'd expect it to be of interest.
The procedure from the first part can be modified for this. Iterate through directories, forming a path component at every step. Get all files in each, and all subdirectories (if none we are done). Append the comma-separated file list to the path component (/sdA/). So the representation of the above is
'dirA/sdA,a,b/c,d', 'dirB/sdB,a,X/c,d'
For each file-list substring (c,d) found to already exist we can check its path against the existing one, component by component. Now a hash with keys like c,d won't do since this example has the same file-list for distinct hierarchies, but a modified (or other) data structure is needed.
Finally, there may be more subdirectories parallel to sdA (say sdA2). We care only for its own path, but except for the parallel files (a,b, in that component of the path dirA/sdaA2,a,b/). So keep in mind all bottom-level file-lists (c,d) with their paths and, if file-lists are equal and paths are of same length, check whether their paths have a,b file-lists equal in each path component.
I don't know whether this is a workable solution for you, but I'd expect "near-duplicates" to be rare -- the backup is either a duplicate or not. So there may not be much need to handle futher edge-cases in complex sprawling hierarchies. This procedure should be at least a useful pre-selection mechanism, that would greatly reduce the need for further work.
This assumes that equal file-names very likely indicate equal files. A part of that is my expectation that if a file was even just renamed it still cannot be considered a duplicate. If this is not so this approach won't work and one would need something along the lines of the answer by jm666.

I make a tool which searches duplicate folders.
https://github.com/un1t/dirdups
dirdups testdir -i 1
-i 1 option consider folders as duplicates if they have at least 1 file in common. Without this option default value is 10.
In your case it will find the following directories:
testdir/d1/d12/
testdir/d2/d1some/d12copy/

Downloading Only Newest File Using Wget / Curl

How would I use wget or curl to download the newest file in a directory?
This seems really easy, however the filename won't always be predictable, and as new data comes in it'll be replaced with a random filename.
Specifically, the directory I wish to download data from has the following naming structure, where the last string of characters is a randomly generated timestamp:
MRMS_RotationTrackML1440min_00.50_20160530-175837.grib2.gz
MRMS_RotationTrackML1440min_00.50_20160530-182639.grib2.gz
MRMS_RotationTrackML1440min_00.50_20160530-185637.grib2.gz
The randomly generated timestamp is in the format of: {hour}{minute}{second}
The directory in question is here: http://mrms.ncep.noaa.gov/data/2D/RotationTrackML1440min/
Could it have to be something with something in the headers, where you'd use curl to sift through the last-modified timestamp?
Any help would be appreciated here, thanks in advance.

You can just run following command periodically:
wget -r -nc --level=1 http://mrms.ncep.noaa.gov/data/2D/RotationTrackML1440min/
It will download recursively whatever is new in the directory after last run.

bash scripting de-dupe

I have a shell script. A cron job runs it once a day. At the moment it just downloads a file from the web using wget, appends a timestamp to the filename, then compresses it. Basic stuff.
This file doesn't change very frequently though, so I want to discard the downloaded file if it already exists.
Easiest way to do this?
Thanks!

Do you really need to compress the file ?
wget provides -N, --timestamping which obviously, turns on time-stamping. What that does is say your file is located at www.example.com/file.txt
The first time you do:
$ wget -N www.example.com/file.txt
[...]
[...] file.txt saved [..size..]
The next time it'll be like this:
$ wget -N www.example.com/file.txt
Server file no newer than local file “file.txt” -- not retrieving.
Except if the file on the server was updated.
That would solve your problem, if you didn't compress the file.
If you really need to compress it, then I guess I'd go with comparing the hash of the new file/archive and the old. What matters in that case is, how big is the downloaded file ? is it worth compressing it first then checking the hashes ? is it worth decompressing the old archive and comparing the hashes ? is it better to store the old hash in a txt file ? do all these have an advantage over overwriting the old file ?
You only know that, make some tests.
So if you go the hash way, consider sha256 and xz (lzma2 algorithm) compression.
I would do something like this (in Bash):
newfilesum="$(wget -q www.example.com/file.txt -O- | tee file.txt | sha256sum)"
oldfilesum="$(xzcat file.txt.xz | sha256sum)"
if [[ $newfilesum != $oldfilesum ]]; then
xz -f file.txt # overwrite with the new compressed data
else
rm file.txt
fi
and that's done;

Calculate a hash of the content of the file and check against the new one. Use for instance md5sum. You only have to save the last MD5 sum to check if the file changed.
Also, take into account that the web is evolving to give more information on pages, that is, metadata. A well-founded web site should include file version and/or date of modification (or a valid, expires header) as part of the response headers. This, and quite other things, is what makes up the scalability of Web 2.0.

How about downloading the file, and checking it against a "last saved" file?
For example, the first time it downloads myfile, and saves it as myfile-[date], and compresses it. It also adds a symbolic link, such as lastfile pointing to myfile-[date]. The next time the script runs, it can check if the contents of whatever lastfile points to is the same as the new downloaded file.
Don't know if this would work well, but it's what I could think of.

You can compare the new file with the last one using the sum command. This takes the checksum of the file. If both files have the same checksum, they are very, very likely to be exactly the same. There's another command called md5 that takes the md5 fingerprint, but the sum command is on all systems.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio