I looked at https://go.dev/doc/modules/gomod-ref and https://go.dev/ref/mod#go-mod-tidy, and on neither page could I find any documentation that explains how the checksums in go.sum are computed.
How are the checksums in go.sum computed?
The checksums are hashes of the dependencies. The document you are looking for is https://go.dev/ref/mod#go-sum-files.
Each line in go.sum has three fields separated by spaces: a module path, a version (possibly ending with /go.mod), and a hash.
The module path is the name of the module the hash belongs to.
The version is the version of the module the hash belongs to. If the version ends with /go.mod, the hash is for the module’s go.mod file only; otherwise, the hash is for the files within the module’s .zip file.
The hash column consists of an algorithm name (like h1) and a base64-encoded cryptographic hash, separated by a colon (:). Currently, SHA-256 (h1) is the only supported hash algorithm. If a vulnerability in SHA-256 is discovered in the future, support will be added for another algorithm (named h2 and so on).
Example go.sum lines, showing the module path, version, and hash fields:
github.com/go-chi/chi v1.5.4 h1:QHdzF2szwjqVV4wmByUnTcsbIg7UGaQ0tPF2t5GcAIs=
github.com/go-chi/chi v1.5.4/go.mod h1:uaf8YgoFazUOkPBG7fxPftUylNumIev9awIWOENIuEg=
If you are asking how you actually compute the hash, i.e. what inputs you feed to the SHA-256 function, it is described here: https://cs.opensource.google/go/x/mod/+/refs/tags/v0.5.0:sumdb/dirhash/hash.go
Here is a gist that allows you to compute the module hash for an arbitrary directory, without using go:
https://gist.github.com/MarkLodato/c03659d242ea214ef3588f29b582be70
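If you just want to see the shape of the algorithm, here is a rough Ruby sketch of that "h1:" scheme, assuming the module's files are already unpacked into a directory and are named the way they appear inside the module zip (module@version/relative/path). It is only an illustration of the hashing steps described in hash.go, not a replacement for the go tooling or the gist above:

require "digest"
require "base64"

# Hash every file with SHA-256, build one "<sha256 hex>  <name>\n" line per
# file, sort by file name, SHA-256 the concatenated lines, and base64-encode
# the result. (This simple glob skips dotfiles, which the real tool does not.)
def h1_hash(prefix, dir)
  relative_paths = Dir.glob("**/*", base: dir)
                      .select { |rel| File.file?(File.join(dir, rel)) }
                      .sort
  summary = relative_paths.map do |rel|
    file_digest = Digest::SHA256.file(File.join(dir, rel)).hexdigest
    format("%s  %s\n", file_digest, "#{prefix}/#{rel}")
  end.join
  "h1:" + Base64.strict_encode64(Digest::SHA256.digest(summary))
end

# Hypothetical usage (the directory layout is an assumption):
# puts h1_hash("github.com/go-chi/chi@v1.5.4", "unpacked/chi")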
Related
Hi there. I'm preparing a Laravel test, and there's a question that I think is not correct.
When should you use a hash?
The available answers are:
When you want to compress the contents of a file.
When you want to securely store credit card information so you can use it later.
When you want to securely send a password over email.
When you want to identify the contents of a file without storing the entire file.
Since hashing is meant for protecting stored passwords (not for sending them over email), none of these answers seems correct to me. What do you think?
Option 4. Identifying contents of a file.
A hash is a function that returns a fixed-length output for every input. Another property of hash functions is that for any input a they always return the same value b: if you provide file a and store its hash b, then whenever you supply file a again you will get hash b back. The last property is that for different inputs c and d and a hash function f, f(c) should differ from f(d) (or the chance of the two outputs being equal should be near 0).
In a real-world scenario, you often encounter hashes when downloading software and wanting to verify that the file you've downloaded is correct. The developer puts a hash of the executable on their site. You download the file and calculate its checksum (hash) just to compare it with the one from the website. If it matches, then you know it's the same file (as long as the hash algorithm is not known to have collisions...).
It is quite a good approach to comparing file contents, because hashes take up much less space than the actual files.
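For instance, a minimal Ruby sketch of that verification step (the published checksum value and the file name below are just placeholders):

require "digest"

# Compare the checksum copied from the developer's website with one computed
# locally from the downloaded file.
published = "paste-the-published-sha256-here"
actual    = Digest::SHA256.file("downloaded_installer.bin").hexdigest

puts(actual == published ? "checksum matches" : "checksum mismatch, do not trust the file")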
I am calculating an MD5 sum for a file to compare it with values supplied in a text file. I use the following code to create the checksum:
require 'digest'

# Read the file in binary mode and hash its entire contents.
cksum = File.open(File.join(File.dirname(path), file), 'rb') do |f|
  Digest::MD5.hexdigest(f.read)
end
Every once in a while I get one that does not match, but running md5 manually at the system level shows that the file has the correct MD5.
Does anyone see any issue with the process I am using to calculate the MD5 value or have any idea why they sometimes do not match when calculated by this ruby method?
For followers, there's also a method for hashing a file directly:
Digest::MD5.file('filename').hexdigest
At this point MD5 is a well-exercised message digest with an extensive suite of test vectors. It is extremely unlikely that there is an issue with Ruby's implementation of it.
There is almost certainly another explanation, such as that when your checksum code executes, the file has not yet been fully written (e.g. by another process). When troubleshooting, it may be helpful to note the length of the result of f.read and verify it against the file size. You could even save the read contents to a separate file for later comparison when you discover a discrepancy. That could offer a clue.
You're correctly opening the file with binary mode, so that is good.
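Here is a rough sketch of that troubleshooting idea, assuming path and file are the same variables as in your snippet (the .snapshot suffix is just a made-up convention for the saved copy):

require 'digest'

full_path = File.join(File.dirname(path), file)

# Hash the bytes that were actually read, and remember how many there were.
data  = File.open(full_path, 'rb', &:read)
cksum = Digest::MD5.hexdigest(data)

# A short read (e.g. the file is still being written by another process)
# would explain an occasional mismatch.
if data.bytesize != File.size(full_path)
  warn "read #{data.bytesize} bytes, but the file is now #{File.size(full_path)} bytes: #{full_path}"
  # Keep the bytes that were hashed, for later comparison.
  File.binwrite("#{full_path}.snapshot", data)
end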
I am looking for a way to prevent duplication of the same PDF document in one of my applications.
I know it's quite an easy task if I use the name of the document, but I don't want to detect PDF duplicates by name.
Here comes the challenge: I want to check for duplication based on the contents of the uploaded document, not on the name given by the end user.
I have never dealt with such a scenario before, but I'd like to know if someone has a way to solve this problem.
Your solutions or tricks would be really helpful.
Thanks in advance.
I think the best way is to generate a checksum from the uploaded file, store it in the database (or some other place), and then check whether the checksums of newly uploaded files are already present in the database.
In Ruby you can use the Digest module to do that:
require "digest"
data = File.read("some_file_path")
checksum = Digest::MD5.hexdigest(data)
You don't have to check the filename; just use this checksum.
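For example, a hedged sketch of that lookup, assuming a hypothetical ActiveRecord model Document with a string checksum column (the model and column names are made up for illustration):

require "digest"

# Returns true if a file with the same contents has already been stored.
def duplicate_upload?(uploaded_path)
  checksum = Digest::MD5.hexdigest(File.binread(uploaded_path))
  Document.exists?(checksum: checksum)
end

# When accepting a new upload, store its checksum alongside the record so
# later uploads can be compared against it, e.g.:
# Document.create!(name: original_filename, checksum: Digest::MD5.file(upload_path).hexdigest)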
One simple way is to compare MD5 checksums. Instead of reading or parsing the files line by line, generate MD5 digests for them and match those: files with the same MD5 value have the same contents.
How to generate MD5 for a file in Ruby?
require 'digest'
Digest::MD5.file("path/to/pdf").hexdigest
# md5 string
The torrent specification says this about the "pieces" field:
pieces: string consisting of the concatenation of all 20-byte SHA1 hash values, one per piece
But in the case of a directory there are multiple files, so to be broken into pieces the files must be taken in some order. When I use a bencode editor on existing torrents, I see that the files come in neither alphabetical order nor last-modification order. Yet two different tools generate torrents with identical hashes, so there must be some defined order. I still cannot find it in the torrent specification.
When it comes to piece hashes, the metafile creation sees the content as one big blob, as if the files from info.files were concatenated. The order in info.files is a choice of the client: µTorrent has defaulted to "order by size" for quite a few versions, while other clients sort by relative path names.
Info hashes can and will differ when different creators choose different file orders, just like with piece size choices.
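A minimal sketch of how that pieces string is built, assuming files holds the file paths in the exact order they appear in info.files and piece_length matches the torrent's piece length (real clients stream the data rather than loading it all into memory):

require "digest"

# Concatenate all file contents into one blob (in info.files order), then
# hash each fixed-size piece with SHA-1 and concatenate the 20-byte digests.
def pieces_field(files, piece_length)
  blob = files.map { |path| File.binread(path) }.join
  pieces = +""
  offset = 0
  while offset < blob.bytesize
    pieces << Digest::SHA1.digest(blob.byteslice(offset, piece_length))
    offset += piece_length
  end
  pieces
end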
After reading this, it sounds like a great idea to store files using the SHA-1 for the directory.
I have no idea what this means however, all I know is that SHA-1 and MD5 are hashing algorithms. If I calculate the SHA-1 hash using this ruby script, and I change the file's content (which changes the hash), how do I know where the file is stored then?
My question is then, what are the basics of implementing a SHA-1/file-storage system?
If all of the files are changing content all the time, is there a better solution for storing them, or do you just have to keep updating the hash?
I'm just thinking about how to create a generic file storing system like GoogleDocs, Flickr, Youtube, DropBox, etc., something that you could reuse in different environments (such as storing PubMed journal articles or Cramster homework assignments and tests, or just images like on Flickr). I'd probably store them on Amazon EC2. Just some system so I can say "this is how I'll 99% of the time do file storing from now on", so I can stop thinking about building a solid/consistent way to store files and get onto some real problems.
First of all, if the contents of the files are changing, a filename-from-SHA-digest approach is not very suitable, because the name and location of the file in the filesystem must change when the contents of the file change.
Basically you first compute a SHA-1 or MD5 digest (= hash value) from the contents of the file.
When you have a digest, for example 00e4f56c0de1c61fdb926e79e8a0a65bd12930c9, you generate a file location and filename from it. For example, you split the first few characters of the digest into a directory structure and use the rest of the characters as the file name:
00e4f56c0de1c61fdb926e79e8a0a65bd12930c9 => some/path/00/e4/f5/6c0de1c61fdb926e79e8a0a65bd12930c9.txt
This way you only need to store the SHA-1 digest of the file to database. You can then always find out the right location and the name of the file.
Directories usually also have a maximum number of entries they can contain, for example a maximum of 32000 subdirectories and files per directory. A directory structure based on this kind of hashing makes it unlikely that you store too many files in the same directory. Hashing like this also ensures that every directory holds about the same number of files, so you won't get into a situation where all your files are in the same directory.
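A small Ruby sketch of that mapping (the some/path root and the .txt extension are just the placeholders from the example above):

require "digest"

# Map a file's SHA-1 digest onto a nested directory path plus a file name.
def storage_path(file_path, root: "some/path")
  digest = Digest::SHA1.file(file_path).hexdigest
  File.join(root, digest[0, 2], digest[2, 2], digest[4, 2], "#{digest[6..-1]}.txt")
end

# "00e4f56c0de1c61fdb926e79e8a0a65bd12930c9" becomes
# "some/path/00/e4/f5/6c0de1c61fdb926e79e8a0a65bd12930c9.txt"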
The idea is not to change the file content, but rather its name (and path), by using a hash value.
Changing the content with a hash would be disastrous since a hash is normally not reversible.
I'm not sure of the motivation for using a hash rather than the file name (or even rather than a long random number), but here are a few advantages of the hash approach:
the file names on the disk are uniform
the upper or lower parts of the hash value can be used to name the directories and hence distribute the files relatively uniformly
the name becomes a code, making it difficult for someone to
a) guess a file name
b) categorize pictures (should someone steal the hard drive contents)
you can retrieve the filename and location from the file contents themselves, assuming the hash comes from that content (not quite sure which use case would involve this... a bit contrived...)
The general interest of using a hash is that, unlike a file name, a hash is meaningless, and therefore one would require the database to relate images to "bibliographic"-type data (name of uploader, date of upload, tags, ...).
In thinking about it, re-reading the referenced SO response, I don't really see much of an advantage of a hash, as compared to, say, a random number...
Furthermore... some hashes produce a numeric value, typically expressed in hexadecimal (as seen in the referenced SO question), and this could be seen as wasteful, making the file names longer than they need to be and hence putting more stress on the file system (bigger directories...).
One advantage I see in storing files under their hash is that the file data only needs to be stored once and can then be referenced multiple times within your database. This will save you space if you have different users uploading the exact same file.
However, the downside is that when a user deletes what they think is their file from your app, you can't just physically delete the file from disk, because other users who uploaded the exact same file may still be using it.
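A hedged sketch of handling that delete case with a reference count, assuming a hypothetical Attachment model where many rows can point to the same stored blob via its checksum (the model, column, and helper names are made up):

# Deleting a user's attachment only removes the underlying blob once no other
# record references the same checksum.
def delete_attachment(attachment)
  checksum = attachment.checksum
  attachment.destroy

  if Attachment.where(checksum: checksum).none?
    blob_path = stored_path_for(checksum) # hypothetical helper mapping checksum -> path on disk
    File.delete(blob_path) if File.exist?(blob_path)
  end
end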
The idea is that you need to come up with a name for the photo, and you probably want to scatter the files among a number of directories. One easy way to come up with a unique name is to use the hash.
So the beginning of the hash was peeled off for a multi-level directory structure and the rest of the hash was used for a filename for the jpg.
This has the additional benefit of detecting duplicate uploads.