Thing is I have a file that has room for metadata. I want to store a hash for integrity verification in it. Problem is, once I store the hash, the file and the hash along with it changes.
I perfectly understand that this is by definition impossible with one way cryptographic hash methods like md5/sha.
I am also aware of the possibility of containers that store verification data separated from the content as zip & co do.
I am also aware of the possibility to calculate the hash separately and send it along with the file or to append it at the end or somewhere where the client, when calculating the hash, ignores it.
This is not what I want.
I want to know whether there is an algorithm where its possible to get the resulting hash from data where the very result of the hash itself is included.
It doesn't need to be cryptographic or fullfill a lot of criterias. It can also be based on some heuristics that after a realistic amount of time deliver the desired result.
I am really not so into mathematics, but couldn't there be some really advanced exponential modulo polynom cyclic back-reference devision stuff that makes this possible?
And if not, whats (if there is) the proof against it?
The reason why i need tis is because i want (ultimately) to store a hash along with MP4 files. Its complicated, but other solutions are not easy to implement as the file walks through a badly desigend production pipeline...
It's possible to do this with a CRC, in a way. What I've done in the past is to set aside 4 bytes in a file as a placeholder for a CRC32, filling them with zeros. Then I calculate the CRC of the file.
It is then possible to fill the placeholder bytes to make the CRC of the file equal to an arbitrary fixed constant, by computing numbers in the Galois field of the CRC polynomial.
(Further details possible but not right at this moment. You basically need to compute (CRC_desired - CRC_initial) * 2-8*byte_offset in the Galois field, where byte_offset is the number of bytes between the placeholder bytes and the end of the file.)
Note: as per #KeithS's comments this solution is not to prevent against intentional tampering. We used it on one project as a means to tie metadata within an embedded system to the executable used to program it -- the embedded system itself does not have direct knowledge of the file(s) used to program it, and therefore cannot calculate a CRC or hash itself -- to detect inadvertent mismatch between an embedded system and the file used to program it. (In later systems I've just used UUIDs.)
Of course this is possible, in a multitude of ways. However, it cannot prevent intentional tampering.
For example, let
hash(X) = sum of all 32-bit (non-overlapping) blocks of X modulo 65521.
Let
Z = X followed by the 32-bit unsigned integer (hash(X) * 65521)
Then
hash(Z) == hash(X) == last 32-bits of Z
The idea here is just that any 32-bit integer congruent to 0 modulo 65521 will have no effect on the hash of X. Then, since 65521 < 2^16, hash has a range less then 2^16, and there are at least 2^16 values less than 2^32 congruent to 0 modulo 65521. And so we can encode the hash into a 32 bit integer that will not affect the hash. You could actually use any number less than 2^16, 65521 just happens to be the largest such prime number.
I remember an old DOS program that was able to embed in a text file the CRC value of that file. However, this is possible only with simple hash functions.
Altough in theory you could create such file for any kind of hash function (given enough time or the right algorithm), the attacker would be able to use exactly the same approach. Even more, he would have a chose: to use exactly your approach to obtain such file, or just to get rid of the check.
It means that now you have two problems instead of one, and both should be implemented with the same complexity. It's up to you to decide if it worth it.
EDIT: you could consider hashing some intermediary results (like RAW decoded output, or something specific to your codec). In this way the decoder would have it anyway, but for another program it would be more difficult to compute.
No, not possible. You either you a separate file for hashs ala md5sum, or the embedded hash is only for the "data" portion of the file.
the way the nix package manager does this is by when calculating the hash you pretend the contents of the hash in the file are some fixed value like 20 x's and not the hash of the file then you write the hash over those 20 x's and when you check the hash you read that and ignore again it pretending the hash was just the fixed value of 20 x's when hashing
they do this because the paths at which a package is installed depend on the hash of the whole package so as the hash is of fixed length they set it as some fixed value and then replace it with the real hash and when verifying they ignore the value they placed and pretend it's that fixed value
but if you don't use such a method is it impossible
It depends on your definition of "hash". As you state, obviously with any pseudo-random hash this would be impossible (in a reasonable amount of time).
Equally obvious, there are of course trivial "hashes" where you can do this. Data with an odd number of bits set to 1 hash to 00 and an even number of 1s hash to 11, for example. The hash doesn't modify the odd/evenness of the 1 bits, so files hash the same when their hash is included.
Related
I'm looking for an algorithm, which is a one-way function, like Hash function. And the algorithm accept a small input(serveral bits, less than 512 bits), and map it to a long output(1K Byte or more). Do you know an algorithm or a function like this?
From the Shannon theorem you don't gain any security by having a cyphertext of a size bigger than your plain text, unless the key (or the procedure to create the cyphertext) is different for any input. Even in this case, you will need to assign only one key (or mechanism) for each input x otherwise you violate the definition of a function. So if you apply an encryption mechanism f: X (set of inputs) -> Y (set of outputs), then |Y| <= |X|.
All this to say that if your input is less than 512 bits, you gain nothing by producing a 1KB output. Now, I recommend you to use one of the functions listed on the one-way function wiki page
Keccak has variable length output, (although not evaluated for in SHA-3), it's "security claim is disentangled from the output length. There is a minimum output length..." and Skein hash function has a variable output of up to 16 exabytes
Whatever your reasons are, you can calculate hashes of the same small data using different algorithms, then concatenate those hashes. If the output is not large enough, calculate hashes of hashes and append them.
As pointed in other answers, this doesn't have much sense from security perspective.
I would like an algorithm for a function that takes n integers and returns one integer. For small changes in the input, the resulting integer should vary greatly. Even though I've taken a number of courses in math, I have not used that knowledge very much and now I need some help...
An important property of this function should be that if it is used with coordinate pairs as input and the result is plotted (as a grayscale value for example) on an image, any repeating patterns should only be visible if the image is very big.
I have experimented with various algorithms for pseudo-random numbers with little success and finally it struck me that md5 almost meets my criteria, except that it is not for numbers (at least not from what I know). That resulted in something like this Python prototype (for n = 2, it could easily be changed to take a list of integers of course):
import hashlib
def uniqnum(x, y):
return int(hashlib.md5(str(x) + ',' + str(y)).hexdigest()[-6:], 16)
But obviously it feels wrong to go over strings when both input and output are integers. What would be a good replacement for this implementation (in pseudo-code, python, or whatever language)?
A "hash" is the solution created to solve exactly the problem you are describing. See wikipedia's article
Any hash function you use will be nice; hash functions tend to be judged based on these criteria:
The degree to which they prevent collisions (two separate inputs producing the same output) -- a by-product of this is the degree to which the function minimizes outputs that may never be reached from any input.
The uniformity the distribution of its outputs given a uniformly distributed set of inputs
The degree to which small changes in the input create large changes in the output.
(see perfect hash function)
Given how hard it is to create a hash function that maximizes all of these criteria, why not just use one of the most commonly used and relied-on existing hash functions there already are?
From what it seems, turning integers into strings almost seems like another layer of encryption! (which is good for your purposes, I'd assume)
However, your question asks for hash functions that deal specifically with numbers, so here we go.
Hash functions that work over the integers
If you want to borrow already-existing algorithms, you may want to dabble in pseudo-random number generators
One simple one is the middle square method:
Take a digit number
Square it
Chop off the digits and leave the middle digits with the same length as your original.
ie,
1111 => 01234321 => 2342
so, 1111 would be "hashed" to 2342, in the middle square method.
This way isn't that effective, but for a few number of hashes, this has very low collision rates, a uniform distribution, and great chaos-potential (small changes => big changes). But if you have many values, time to look for something else...
The grand-daddy of all feasibly efficient and simple random number generators is the (Mersenne Twister)[http://en.wikipedia.org/wiki/Mersenne_twister]. In fact, an implementation is probably out there for every programming language imaginable. Your hash "input" is something that will be called a "seed" in their terminology.
In conclusion
Nothing wrong with string-based hash functions
If you want to stick with the integers and be fancy, try using your number as a seed for a pseudo-random number generator.
Hashing fits your requirements perfectly. If you really don't want to use strings, find a Hash library that will take numbers or binary data. But using strings here looks OK to me.
Bob Jenkins' mix function is a classic choice, at when n=3.
As others point out, hash functions do exactly what you want. Hashes take bytes - not character strings - and return bytes, and converting between integers and bytes is, of course, simple. Here's an example python function that works on 32 bit integers, and outputs a 32 bit integer:
import hashlib
import struct
def intsha1(ints):
input = struct.pack('>%di' % len(ints), *ints)
output = hashlib.sha1(input).digest()
return struct.unpack('>i', output[:4])
It can, of course, be easily adapted to work with different length inputs and outputs.
Have a look at this, may be you can be inspired
Chaotic system
In chaotic dynamics, small changes vary results greatly.
A x-bit block cipher will take an number and convert it effectively to another number. You could combine (sum/mult?) your input numbers and cipher them, or iteratively encipher each number - similar to a CBC or chained mode. Google 'format preserving encyption'. It is possible to create a 32-bit block cipher (not widely 'available') and use this to create a 'hashed' output. Main difference between hash and encryption, is that hash is irreversible.
If you only use the first 4 bytes of an MD5 hash, would that mean theoretically only 1 in 255^4 chance of collision? That is, are hashes designed such that you only have to use a small portion of the returned hash (say the hash is of a file of some size)?
Remember that, even without considering a smart attacker deliberately trying to cause collisions, you need to start worrying about accidental collisions once the number of objects you're hashing get comparable to the square root of the hash space... just a few tens of thousands of objects for a 32-bit hash key. This comes from the so-called birthday paradox.
It is 256, not 255.
Assuming that MD5 is a secure hash function (it turns out it is not secure, but, for the sake of the discussion, let's suppose that it is secure), then it should behave like a random oracle, a mythical object which outputs uniformly random values, under the sole constraint that it "remembers" its previous outputs and returns the same value again, given the same input.
Truncating the output of a random oracle yields another random oracle. Thus, if you keep 32 bits, then the probability of a collision with two distinct input messages is 1 in 2^32 (i.e. 1 in 256^4).
Now there is a thing known as the birthday paradox which says that, with about 2^16 distinct inputs, there are good chances that two of the 2^16 corresponding outputs collide.
MD5 has been shown to be insecure for some purposes -- in particular anything which is related to collisions. The current default recommendation is SHA-2 (a family of four functions, with output sizes 224, 256, 384 and 512 bits, respectively). A new (american) standard is currently being defined, through an open competition, under the code name SHA-3. This is a long process; the new function shall be chosen by mid-2012. Some of the remaining candidates (currently 14, out of an initial 51) are substantially faster than SHA-2, some approaching MD5 in performance, while being considerably more secure. But this is a bit new, so right now you shall use SHA-2 by default.
Assume we have a pre-determined message1. hash1 = md5(message1)
Now choose a message2 randomly, and set hash2 = md5(message2).
In theory there is a 1/255^4 chance that the first four characters of hash2 match the first four of pre-determined hash1.
It is also supposed to be very hard for an attacker that knows message1 to come up with a different message2 that has the same hash. This is called second pre-image resistance. However, even with the full MD5, there are better than theoretical pre-image attacks.
MD5 is completely broken for collisions. This means it is quite feasible for an attacker (in a few hours) to come up with two messages with the same hash (let alone the same first four bytes). The attacker gets to choose both messages, but this can still cause major damage. See for instance the poisoned message example.
If you're generating unique identifiers, you might want to use a UUID instead. These are designed to minimize the change of collisions so that in practice they should never occur.
If you're worried about filenames being too long, which is a peculiar thing to be concerned about when most operating systems support names as long as 255 characters, you can always split the filename into a path and filename component. This has the advantage of splitting up the files into different directories:
fdadda221fd71619e6c0139730b012577dd4de90
fdadda221fd71619e6c/0139730b012577dd4de90
fdad/da22/1fd7/1619/e6c0/1397/30b0/1257/7dd4/de90
Depends on the purpose of the hash.
Hash functions for use in hash tables tend to have more "randomness" in the lower bits (which are used to find the array index) than in the higher bits. Checksum and cryptographic hash functions are more evenly distributed.
I'd like to know which algorithm is employed. I strongly assume it's something simple and hopefully common. There's no lag in generating the results, for instance.
Input: any string
Output: 5 hex characters (0-F)
I have access to as many keys and results as I wish, but I don't know how exactly I could harness this to attack the function. Is there any method? If I knew any functions that converted to 5-chars to start with then I might be able to brute force for a salt or something.
I know for example that:
a=06a07
b=bfbb5
c=63447
(in case you have something in mind)
In normal use it converts random 32-char strings into 5-char strings.
The only way to derive a hash function from data is through brute force, perhaps combined with some cleverness. There are an infinite number of hash functions, and the good ones perform what is essentially one-way encryption, so it's a question of trial and error.
It's practically irrelevant that your function converts 32-character strings into 5-character hashes; the output is probably truncated. For fun, here are some perfectly legitimate examples, the last 3 of which are cryptographically terrible:
Use the MD5 hashing algorithm, which generates a 16-character hash, and use the 10th through the 14th characters.
Use the SHA-1 algorithm and take the last 5 characters.
If the input string is alphabetic, use the simple substitution A=1, B=2, C=3, ... and take the first 5 digits.
Find each character on your keyboard, measure its distance from the left edge in millimeters, and use every other digit, in reverse order, starting with the last one.
Create a stackoverflow user whose name is the 32-bit string, divide 113 by the corresponding user ID number, and take the first 5 digits after the decimal. (But don't tell 'em I told you to do it!)
Depending on what you need this for, if you have access to as many keys and results as you wish, you might want to try a rainbow table approach. 5 hex chars is only 1mln combinations. You should be able to brute-force generate a map of strings that match all of the resulting hashes in no time. Then you don't need to know the original string, just an equivalent string that generates the same hash, or brute-force entry by iterating over the 1mln input strings.
Following on from a comment I just made to Pontus Gagge, suppose the hash algorithm is as follows:
Append some long, constant string to the input
Compute the SHA-256 hash of the result
Output the last 5 chars of the hash.
Then I'm pretty sure there's no computationally feasible way from your chosen-plaintext attack to figure out what the hashing function is. To even prove that SHA-256 is in use (assuming it's a good hash function, which as far as we currently know it is), I think you'd need to know the long string, which is only stored inside the "black box".
That said, if I knew any published 20-bit hash functions, then I'd be checking those first. But I don't know any: all the usual non-crypto string hashing functions are 32 bit, because that's the expected size of an integer type. You should perhaps compare your results to those of CRC, PJW, and BUZ hash on the same strings, as well as some variants of DJB hash with different primes, and any string hash functions built in to well-known programming languages, like java.lang.String.hashCode. It could be that the 5 output chars are selected from the 8 hex chars generated by one of those.
Beyond that (and any other well-known string hashes you can find), I'm out of ideas. To cryptanalyse a black box hash, you start by looking for correlations between the bits of the input and the bits of the output. This gives you clues what functions might be involved in the hash. But that's a huge subject and not one I'm familiar with.
This sounds mildly illicit.
Not to rain on your parade or anything, but if the implementors have done their work right, you wouldn't notice lags beyond a few tens of milliseconds on modern CPU's even with strong cryptographic hashes, and knowing the algorithm won't help you if they have used salt correctly. If you don't have access to the code or binaries, your only hope is a trivial mistake, whether caused by technical limitations or carelesseness.
There is an uncountable infinity of potential (hash) functions for any given set of inputs and outputs, and if you have no clue better than an upper bound on their computational complexity (from the lag you detect), you have a very long search ahead of you...
I want to convert windows pathname to unique integer.
Eg:
For pathname C:\temp\a.out, if i add ascii value of all the characters, i get 1234. But some other path can also generate the same number. So, what is the best way to generate unique numbers for different pathnames?
Look into Hash functions. Make sure to consider the case-insensitive nature of most Windows filenames when performing the hash.
Most likely, the language you are using provides a library function (or collection of functions) which can take the hash of a string (or just data). SHA1 is popular and has low collision.
Here on Stackoverflow there are many questions pertaining to hash functions. To get you started, simply search for "hash function". This may be a useful SO question for your case: What is a performant string hashing function that results in a 32 bit integer with low collision rates?.
there are more possible pathnames than integers, therefore you can't have true uniqueness. You could settle for something like an MD5 hash.
Perfect hashing
Yes, you'll need to use some kind of hash function, simply because the domain of your input is greater than the range of your output. In other words, there are almost certainly more valid pathnames than there are numbers representable in your target language's data type.
So it will not be possible to completely avoid collisions. If this guarantee is essential to your application, you won't be able to do it by translation to integers.
How about something like this:
Use a hash of (String->n bits) for each directory level. Alloting 20 bits for each of 10 directory levels is clearly not going to scale, but maybe a telescoping level of bits, under the assumption that the lowest directory level will be the most populated -
e.g. if you have (from root) /A/B/C/D/E/F,
output some sort of n-bit number where
bits n/2 - n hashes F
bits n/4 - n/2 bits hashes E
n/8 - n/4 bits hashes D
etc. etc.
If this is on Unix, you could just grab its inode number. ls -i shows it on the command line. The stat() command allows you to retrive it from a program.
Soft links would show up as the same file, while hard links would show up as a different file. This may or may not be behavior you want.
I see a lot of folks talking about hashing. That could work, but theoretically if your hash does anything more than compress out integer values that are not allowable in file names, then you could have clashes. If that is unacceptable for you, then your hash is always going to be nearly as many digits as the file name. At that point, you might as well just use the file name.
Jimmy Said
there are more possible pathnames than
integers, therefore you can't have
true uniqueness. You could settle for
something like an MD5 hash.
I don't think there are more possible path names then integers. As a construction to create a unique number from a pathname we can convert each letter to a (two-digit) number (so from 10-25,26=., then other special chars, and 27 being / --this is assuming there are less then 89 different characters, else, we can move to three digit encoding)
home/nlucaroni/documents/cv.pdf
1724221427232130121027242318271324122827123136251315
This forms a bijection (although, if you count only valid path names then the surjective property fails, but normally one doesn't care about that holding) --Come up with a path that isn't an integer.
This number obviously doesn't fit in a 64_bit unsigned int (max being 18446744073709551615), so it's not practical, but this isn't the point of my response.
You can read here Best way to determine if two path reference to same file in C# how you can uniquely identify a path. You need three numbers (dwVolumeSerialNumber, nFileIndexHigh and nFileIndexLow), maybe you can combine those three numbers to a new number with three times more bits. See also here: What are your favorite extension methods for C#? (codeplex.com/extensionoverflow) .
To all the people saying "it's not possible because you have more possible paths than integers to store them in": no. The poster never specified an implementation language; some languages support arbitrary-length integers. Python, for example.
Say we take the 32,000 character paths as the limit mentioned in one of the other comments. If we have 256 different characters to use with paths we get:
Python 2.5.1 (r251:54863, May 18 2007, 16:56:43)
[GCC 3.4.4 (cygming special, gdc 0.12, using dmd 0.125)] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 32000L**256L
20815864389328798163850480654728171077230524494533409610638224700807216119346720596024478883464648369684843227908562015582767132496646929816279813211354641525848259018778440691546366699323167100945918841095379622423387354295096957733925002768876520583464697770622321657076833170056511209332449663781837603694136444406281042053396870977465916057756101739472373801429441421111406337458176000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000L
>>>
Notice how Python represents that just fine? Yes, there's probably a better way to do it, but that doesn't mean it's impossible.
EDIT: rjack pointed out that it's actually 256^32000, not the other way around. Python still handles it just fine. The performance may leave something to be desired, but saying it's mathematically impossible is wrong.