algorithm to convert pathname to unique number - algorithm

I want to convert windows pathname to unique integer.
Eg:
For pathname C:\temp\a.out, if i add ascii value of all the characters, i get 1234. But some other path can also generate the same number. So, what is the best way to generate unique numbers for different pathnames?

Look into Hash functions. Make sure to consider the case-insensitive nature of most Windows filenames when performing the hash.
Most likely, the language you are using provides a library function (or collection of functions) which can take the hash of a string (or just data). SHA1 is popular and has low collision.
Here on Stackoverflow there are many questions pertaining to hash functions. To get you started, simply search for "hash function". This may be a useful SO question for your case: What is a performant string hashing function that results in a 32 bit integer with low collision rates?.

there are more possible pathnames than integers, therefore you can't have true uniqueness. You could settle for something like an MD5 hash.

Perfect hashing

Yes, you'll need to use some kind of hash function, simply because the domain of your input is greater than the range of your output. In other words, there are almost certainly more valid pathnames than there are numbers representable in your target language's data type.
So it will not be possible to completely avoid collisions. If this guarantee is essential to your application, you won't be able to do it by translation to integers.

How about something like this:
Use a hash of (String->n bits) for each directory level. Alloting 20 bits for each of 10 directory levels is clearly not going to scale, but maybe a telescoping level of bits, under the assumption that the lowest directory level will be the most populated -
e.g. if you have (from root) /A/B/C/D/E/F,
output some sort of n-bit number where
bits n/2 - n hashes F
bits n/4 - n/2 bits hashes E
n/8 - n/4 bits hashes D
etc. etc.

If this is on Unix, you could just grab its inode number. ls -i shows it on the command line. The stat() command allows you to retrive it from a program.
Soft links would show up as the same file, while hard links would show up as a different file. This may or may not be behavior you want.
I see a lot of folks talking about hashing. That could work, but theoretically if your hash does anything more than compress out integer values that are not allowable in file names, then you could have clashes. If that is unacceptable for you, then your hash is always going to be nearly as many digits as the file name. At that point, you might as well just use the file name.

Jimmy Said
there are more possible pathnames than
integers, therefore you can't have
true uniqueness. You could settle for
something like an MD5 hash.
I don't think there are more possible path names then integers. As a construction to create a unique number from a pathname we can convert each letter to a (two-digit) number (so from 10-25,26=., then other special chars, and 27 being / --this is assuming there are less then 89 different characters, else, we can move to three digit encoding)
home/nlucaroni/documents/cv.pdf
1724221427232130121027242318271324122827123136251315
This forms a bijection (although, if you count only valid path names then the surjective property fails, but normally one doesn't care about that holding) --Come up with a path that isn't an integer.
This number obviously doesn't fit in a 64_bit unsigned int (max being 18446744073709551615), so it's not practical, but this isn't the point of my response.

You can read here Best way to determine if two path reference to same file in C# how you can uniquely identify a path. You need three numbers (dwVolumeSerialNumber, nFileIndexHigh and nFileIndexLow), maybe you can combine those three numbers to a new number with three times more bits. See also here: What are your favorite extension methods for C#? (codeplex.com/extensionoverflow) .

To all the people saying "it's not possible because you have more possible paths than integers to store them in": no. The poster never specified an implementation language; some languages support arbitrary-length integers. Python, for example.
Say we take the 32,000 character paths as the limit mentioned in one of the other comments. If we have 256 different characters to use with paths we get:
Python 2.5.1 (r251:54863, May 18 2007, 16:56:43)
[GCC 3.4.4 (cygming special, gdc 0.12, using dmd 0.125)] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 32000L**256L
20815864389328798163850480654728171077230524494533409610638224700807216119346720596024478883464648369684843227908562015582767132496646929816279813211354641525848259018778440691546366699323167100945918841095379622423387354295096957733925002768876520583464697770622321657076833170056511209332449663781837603694136444406281042053396870977465916057756101739472373801429441421111406337458176000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000L
>>>
Notice how Python represents that just fine? Yes, there's probably a better way to do it, but that doesn't mean it's impossible.
EDIT: rjack pointed out that it's actually 256^32000, not the other way around. Python still handles it just fine. The performance may leave something to be desired, but saying it's mathematically impossible is wrong.

Related

Is there a hash function for binary data which produces closer hashes when the data is more similar?

I'm looking for something like a hash function but for which it's output is closer the closer two different inputs are?
Something like:
f(1010101) = 0 #original hash
f(1010111) = 1 #very close to the original hash as they differ by one bit
f(0101010) = 9999 #not very close to the original hash they all bits are different
(example outputs for demonstration purposes only)
All of the input data will be of the same length.
I want to make comparisons between a file a lots of other files and be able to determine which other file has the fewest differences from it.
You may try this algorithm.
http://en.wikipedia.org/wiki/Levenshtein_distance
Since this is string only.
You may convert all your binary to string
for example:
0 -> "00000000"
1 -> "00000001"
You might be interested in either simhashing or shingling.
If you are only trying to detect similarity between documents, there are other techniques that may suit you better (like TF-IDF.) The second link is part of a good book whose other chapters delve into general information retrieval topics, including these other techniques.
You should not use a hash for this.
You must compute signatures containing several characteristic values like :
file name
file size
Is binary / Is ascii only
date (if needed)
some other more complex like :
variance of the values of bytes
average value of bytes
average length of same value bits sequence (in compressed files there are no long identical bit sequences)
...
Then you can compare signatures.
But the most important is to know what kind of data is in these files. If it is images, the size and main color are more important. If it is sound, you could analyse only some frequencies...
You might want to look at the source code to unix utilities like cmp or the FileCmp stuff in Python and use that to try to determine a reasonable algorithm.
In my uninformed opinion, calculating a hash is not likely to work well. First, it can be expensive to calculate a hash. Second, what you're trying to do sounds more like a job for encoding than a hash; once you start thinking of it that way, it's not clear that it's even worth transforming the file that way.
If you have some constraints, specifying them might be useful. For example, if all the files are the exact same length, that may simplify things. Or if you are only interested in differences between bits in the same position and not interested in things that are similar only if you compare bits in different positions (e.g., two files are identical, except that one has everything shifted three bits--should those be considered similar or not similar?).
You could calculate the population count of the XOR of the two files, which is exactly the number of bits that are not the same between the two files. So it just does precisely what you asked for, no approximations.
You can represent your data as a binary vector of features and then use dimensionality reduction either with SVD or with random indexing.
What you're looking for is a file fingerprint of sorts. For plain text, something like Nilsimsa (http://ixazon.dynip.com/~cmeclax/nilsimsa.html) works reasonably well.
There are a variety of different names for this type of technique. Fuzzy Hashing/Locality Sensitive Hashing/Distance Based Hashing/Dimensional reduction and a few others. Tools can generate a fixed length output or variable length output, but the outputs are generally comparable (eg by levenshtein distance) and similar inputs yield similar outputs.
The link above for nilsimsa gives two similar spam messages and here are the example outputs:
773e2df0a02a319ec34a0b71d54029111da90838cbc20ecd3d2d4e18c25a3025 spam1
47182cf0802a11dec24a3b75d5042d310ca90838c9d20ecc3d610e98560a3645 spam2
* * ** *** * ** ** ** ** * ******* **** ** * * *
Spamsum and sdhash are more useful for arbitrary binary data. There are also algorithms specifically for images that will work regardless of whether it's a jpg or a png. Identical images in different formats wouldn't be noticed by eg spamsum.

Chicken/Egg problem: Hash of file (including hash) inside file! Possible?

Thing is I have a file that has room for metadata. I want to store a hash for integrity verification in it. Problem is, once I store the hash, the file and the hash along with it changes.
I perfectly understand that this is by definition impossible with one way cryptographic hash methods like md5/sha.
I am also aware of the possibility of containers that store verification data separated from the content as zip & co do.
I am also aware of the possibility to calculate the hash separately and send it along with the file or to append it at the end or somewhere where the client, when calculating the hash, ignores it.
This is not what I want.
I want to know whether there is an algorithm where its possible to get the resulting hash from data where the very result of the hash itself is included.
It doesn't need to be cryptographic or fullfill a lot of criterias. It can also be based on some heuristics that after a realistic amount of time deliver the desired result.
I am really not so into mathematics, but couldn't there be some really advanced exponential modulo polynom cyclic back-reference devision stuff that makes this possible?
And if not, whats (if there is) the proof against it?
The reason why i need tis is because i want (ultimately) to store a hash along with MP4 files. Its complicated, but other solutions are not easy to implement as the file walks through a badly desigend production pipeline...
It's possible to do this with a CRC, in a way. What I've done in the past is to set aside 4 bytes in a file as a placeholder for a CRC32, filling them with zeros. Then I calculate the CRC of the file.
It is then possible to fill the placeholder bytes to make the CRC of the file equal to an arbitrary fixed constant, by computing numbers in the Galois field of the CRC polynomial.
(Further details possible but not right at this moment. You basically need to compute (CRC_desired - CRC_initial) * 2-8*byte_offset in the Galois field, where byte_offset is the number of bytes between the placeholder bytes and the end of the file.)
Note: as per #KeithS's comments this solution is not to prevent against intentional tampering. We used it on one project as a means to tie metadata within an embedded system to the executable used to program it -- the embedded system itself does not have direct knowledge of the file(s) used to program it, and therefore cannot calculate a CRC or hash itself -- to detect inadvertent mismatch between an embedded system and the file used to program it. (In later systems I've just used UUIDs.)
Of course this is possible, in a multitude of ways. However, it cannot prevent intentional tampering.
For example, let
hash(X) = sum of all 32-bit (non-overlapping) blocks of X modulo 65521.
Let
Z = X followed by the 32-bit unsigned integer (hash(X) * 65521)
Then
hash(Z) == hash(X) == last 32-bits of Z
The idea here is just that any 32-bit integer congruent to 0 modulo 65521 will have no effect on the hash of X. Then, since 65521 < 2^16, hash has a range less then 2^16, and there are at least 2^16 values less than 2^32 congruent to 0 modulo 65521. And so we can encode the hash into a 32 bit integer that will not affect the hash. You could actually use any number less than 2^16, 65521 just happens to be the largest such prime number.
I remember an old DOS program that was able to embed in a text file the CRC value of that file. However, this is possible only with simple hash functions.
Altough in theory you could create such file for any kind of hash function (given enough time or the right algorithm), the attacker would be able to use exactly the same approach. Even more, he would have a chose: to use exactly your approach to obtain such file, or just to get rid of the check.
It means that now you have two problems instead of one, and both should be implemented with the same complexity. It's up to you to decide if it worth it.
EDIT: you could consider hashing some intermediary results (like RAW decoded output, or something specific to your codec). In this way the decoder would have it anyway, but for another program it would be more difficult to compute.
No, not possible. You either you a separate file for hashs ala md5sum, or the embedded hash is only for the "data" portion of the file.
the way the nix package manager does this is by when calculating the hash you pretend the contents of the hash in the file are some fixed value like 20 x's and not the hash of the file then you write the hash over those 20 x's and when you check the hash you read that and ignore again it pretending the hash was just the fixed value of 20 x's when hashing
they do this because the paths at which a package is installed depend on the hash of the whole package so as the hash is of fixed length they set it as some fixed value and then replace it with the real hash and when verifying they ignore the value they placed and pretend it's that fixed value
but if you don't use such a method is it impossible
It depends on your definition of "hash". As you state, obviously with any pseudo-random hash this would be impossible (in a reasonable amount of time).
Equally obvious, there are of course trivial "hashes" where you can do this. Data with an odd number of bits set to 1 hash to 00 and an even number of 1s hash to 11, for example. The hash doesn't modify the odd/evenness of the 1 bits, so files hash the same when their hash is included.

What's the name of this algorithm/routine?

I am writing a utility class which converts strings from one alphabet to another, this is useful in situations where you have a target alphabet you wish to use, with a restriction on the number of characters available. For example, if you can use lower case letters and numbers, but only 12 characters its possible to compress a timestamp from the alphabet 01234567989 -: into abcdefghijklmnopqrstuvwxyz01234567989 so 2010-10-29 13:14:00 might become 5hhyo9v8mk6avy (19 charaters reduced to 16).
The class is designed to convert back and forth between alphabets, and also calculate the longest source string that can safely be stored in a target alphabet given a particular number of characters.
Was thinking of publishing this through Google code, however I'd obviously like other people to find it and use it - hence the question on what this is called. I've had to use this approach in two separate projects, with Bloomberg and a proprietary system, when you need to generate unique file names of a certain length, but want to keep some plaintext, so GUIDs aren't appropriate.
Your examples bear some similarity to a Dictionary coder with a fixed target and source dictionaries. Also worthwhile to look at is Fibonacci coding, which has a fixed target dictionary (of variable-length bits), which is variably targeted.
I think it also depends whether it is very important that your target alphabet has fixed width entries - if you allow for a fixed alphabet with variable length codes, your compression ratio will approach your entropy that much more optimally! If the source alphabet distribution is known in advance, a static Huffman tree could easily be generated.
Here is a simple algorithm:
Consider that you don't have to transmit the alphabet used for encoding. Also, you don't use (and transmit) the probabilities of the input symbols, as in standard compressions, so we just re-encode somehow the data.
In this case we can consider that the input data are in number represented with base equal to the cardinality of the input alphabet. We just have to change its representation to another base, that is a simple task.
EDITED example:
input alpabet: ABC, output alphabet: 0123456789
message ABAC will translate to 0102 in base 3, that is 11 (9 + 2) in base 10.
11 to base 10: 11
We could have a problem decoding it, because we don't know how many 0-es to use at the begining of the decoded result, so we have to use one of the modifications:
1) encode somehow in the stream the size of compressed data.
2) use a dummy 1 at the start of the stream: in this way our example will become:
10102 (base 3) = 81 + 9 + 2 = 92 (base 10).
Now after decoding we just have to ignore the first 1 (this also provides a basic error detection).
The main problem of this approach is that in most cases (GCD == 1) each new encoded character will completely change the output. This will be very inneficient and difficult to implement. We end up with arithmetic coding as the best solution (actually a simplified version of it).
You probably know about Base64 which does the same thing just usually the other way around. Too bad there are way too many Google results on BaseX or BaseN...

A function where small changes in input always result in large changes in output

I would like an algorithm for a function that takes n integers and returns one integer. For small changes in the input, the resulting integer should vary greatly. Even though I've taken a number of courses in math, I have not used that knowledge very much and now I need some help...
An important property of this function should be that if it is used with coordinate pairs as input and the result is plotted (as a grayscale value for example) on an image, any repeating patterns should only be visible if the image is very big.
I have experimented with various algorithms for pseudo-random numbers with little success and finally it struck me that md5 almost meets my criteria, except that it is not for numbers (at least not from what I know). That resulted in something like this Python prototype (for n = 2, it could easily be changed to take a list of integers of course):
import hashlib
def uniqnum(x, y):
return int(hashlib.md5(str(x) + ',' + str(y)).hexdigest()[-6:], 16)
But obviously it feels wrong to go over strings when both input and output are integers. What would be a good replacement for this implementation (in pseudo-code, python, or whatever language)?
A "hash" is the solution created to solve exactly the problem you are describing. See wikipedia's article
Any hash function you use will be nice; hash functions tend to be judged based on these criteria:
The degree to which they prevent collisions (two separate inputs producing the same output) -- a by-product of this is the degree to which the function minimizes outputs that may never be reached from any input.
The uniformity the distribution of its outputs given a uniformly distributed set of inputs
The degree to which small changes in the input create large changes in the output.
(see perfect hash function)
Given how hard it is to create a hash function that maximizes all of these criteria, why not just use one of the most commonly used and relied-on existing hash functions there already are?
From what it seems, turning integers into strings almost seems like another layer of encryption! (which is good for your purposes, I'd assume)
However, your question asks for hash functions that deal specifically with numbers, so here we go.
Hash functions that work over the integers
If you want to borrow already-existing algorithms, you may want to dabble in pseudo-random number generators
One simple one is the middle square method:
Take a digit number
Square it
Chop off the digits and leave the middle digits with the same length as your original.
ie,
1111 => 01234321 => 2342
so, 1111 would be "hashed" to 2342, in the middle square method.
This way isn't that effective, but for a few number of hashes, this has very low collision rates, a uniform distribution, and great chaos-potential (small changes => big changes). But if you have many values, time to look for something else...
The grand-daddy of all feasibly efficient and simple random number generators is the (Mersenne Twister)[http://en.wikipedia.org/wiki/Mersenne_twister]. In fact, an implementation is probably out there for every programming language imaginable. Your hash "input" is something that will be called a "seed" in their terminology.
In conclusion
Nothing wrong with string-based hash functions
If you want to stick with the integers and be fancy, try using your number as a seed for a pseudo-random number generator.
Hashing fits your requirements perfectly. If you really don't want to use strings, find a Hash library that will take numbers or binary data. But using strings here looks OK to me.
Bob Jenkins' mix function is a classic choice, at when n=3.
As others point out, hash functions do exactly what you want. Hashes take bytes - not character strings - and return bytes, and converting between integers and bytes is, of course, simple. Here's an example python function that works on 32 bit integers, and outputs a 32 bit integer:
import hashlib
import struct
def intsha1(ints):
input = struct.pack('>%di' % len(ints), *ints)
output = hashlib.sha1(input).digest()
return struct.unpack('>i', output[:4])
It can, of course, be easily adapted to work with different length inputs and outputs.
Have a look at this, may be you can be inspired
Chaotic system
In chaotic dynamics, small changes vary results greatly.
A x-bit block cipher will take an number and convert it effectively to another number. You could combine (sum/mult?) your input numbers and cipher them, or iteratively encipher each number - similar to a CBC or chained mode. Google 'format preserving encyption'. It is possible to create a 32-bit block cipher (not widely 'available') and use this to create a 'hashed' output. Main difference between hash and encryption, is that hash is irreversible.

Guessing the hash function?

I'd like to know which algorithm is employed. I strongly assume it's something simple and hopefully common. There's no lag in generating the results, for instance.
Input: any string
Output: 5 hex characters (0-F)
I have access to as many keys and results as I wish, but I don't know how exactly I could harness this to attack the function. Is there any method? If I knew any functions that converted to 5-chars to start with then I might be able to brute force for a salt or something.
I know for example that:
a=06a07
b=bfbb5
c=63447
(in case you have something in mind)
In normal use it converts random 32-char strings into 5-char strings.
The only way to derive a hash function from data is through brute force, perhaps combined with some cleverness. There are an infinite number of hash functions, and the good ones perform what is essentially one-way encryption, so it's a question of trial and error.
It's practically irrelevant that your function converts 32-character strings into 5-character hashes; the output is probably truncated. For fun, here are some perfectly legitimate examples, the last 3 of which are cryptographically terrible:
Use the MD5 hashing algorithm, which generates a 16-character hash, and use the 10th through the 14th characters.
Use the SHA-1 algorithm and take the last 5 characters.
If the input string is alphabetic, use the simple substitution A=1, B=2, C=3, ... and take the first 5 digits.
Find each character on your keyboard, measure its distance from the left edge in millimeters, and use every other digit, in reverse order, starting with the last one.
Create a stackoverflow user whose name is the 32-bit string, divide 113 by the corresponding user ID number, and take the first 5 digits after the decimal. (But don't tell 'em I told you to do it!)
Depending on what you need this for, if you have access to as many keys and results as you wish, you might want to try a rainbow table approach. 5 hex chars is only 1mln combinations. You should be able to brute-force generate a map of strings that match all of the resulting hashes in no time. Then you don't need to know the original string, just an equivalent string that generates the same hash, or brute-force entry by iterating over the 1mln input strings.
Following on from a comment I just made to Pontus Gagge, suppose the hash algorithm is as follows:
Append some long, constant string to the input
Compute the SHA-256 hash of the result
Output the last 5 chars of the hash.
Then I'm pretty sure there's no computationally feasible way from your chosen-plaintext attack to figure out what the hashing function is. To even prove that SHA-256 is in use (assuming it's a good hash function, which as far as we currently know it is), I think you'd need to know the long string, which is only stored inside the "black box".
That said, if I knew any published 20-bit hash functions, then I'd be checking those first. But I don't know any: all the usual non-crypto string hashing functions are 32 bit, because that's the expected size of an integer type. You should perhaps compare your results to those of CRC, PJW, and BUZ hash on the same strings, as well as some variants of DJB hash with different primes, and any string hash functions built in to well-known programming languages, like java.lang.String.hashCode. It could be that the 5 output chars are selected from the 8 hex chars generated by one of those.
Beyond that (and any other well-known string hashes you can find), I'm out of ideas. To cryptanalyse a black box hash, you start by looking for correlations between the bits of the input and the bits of the output. This gives you clues what functions might be involved in the hash. But that's a huge subject and not one I'm familiar with.
This sounds mildly illicit.
Not to rain on your parade or anything, but if the implementors have done their work right, you wouldn't notice lags beyond a few tens of milliseconds on modern CPU's even with strong cryptographic hashes, and knowing the algorithm won't help you if they have used salt correctly. If you don't have access to the code or binaries, your only hope is a trivial mistake, whether caused by technical limitations or carelesseness.
There is an uncountable infinity of potential (hash) functions for any given set of inputs and outputs, and if you have no clue better than an upper bound on their computational complexity (from the lag you detect), you have a very long search ahead of you...

Resources