Generating unique N-valued key - random

I want to generate a unique, random N-valued key.
This key can contain numbers and latin characters, i.e. A-Za-z0-9.
The only solution I am thinking about is something like this (pseudocode):
key = "";
smb = "ABC…abc…0123456789"; // allowed symbols
for (i = 0; i < N; i++) {
key += smb[rnd(0, smb.length() - 1)]; // select symbol at random position
}
Is there any better solution? What can you suggest?

I would look into GUIDs. From the Wikipedia entry, "the primary purpose of the GUID is to have a totally unique number," which sounds exactly like what you are looking for. There are several implementations out there that generate GUIDs, so it's likely you will not have to reinvent the wheel.

Keep in mind that the whole field of cryptography relies on, amongst other things, generating random numbers. The NSA, the CIA, and some of the best mathematicians in the world are working on this, so I guarantee you that there are better ideas.
Me? I'd just do what fbrereto suggests, and just get a guid. Or look into cryptographic key generators, or y'know, some lava lamps and a camera.
Oh, and as to the code you have; depending on the language, you may need to seed the RNG, or it'll generate the same key every time.

Whatever you do, if you wind up generating a key that uses all numbers and all letters, and if a person is ever going to see that key (which is likely if you are using numbers and letters), omit the characters l, I, 1, O, and 0. People get them confused.
Nothing in your post addresses the question of uniqueness. You're going to have to have some way of not generating the same key twice. Usually, when I need a unique key, I have some unique information to start with. I usually take a one-way hash like MD5, then there are ways to convert that to a key with varying degrees of readability:
Convert to hex
Base64 encode it
Use bits of the key to index into a list of words.
Example: the unique string computed by hashing the part of this answer above the horizontal line is
abduction's brogue's melted bragger's
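The three conversions listed above can be sketched in Python roughly as follows. This is only an illustration: `WORDS` is a made-up placeholder list (a real word list would be far longer), `key_forms` is a hypothetical helper name, and MD5 is used only because the answer mentions it.

```python
import base64
import hashlib

# Placeholder word list, assumed for illustration only.
WORDS = ["apple", "brick", "cloud", "delta", "ember", "flint", "grove", "haze"]

def key_forms(unique_info):
    """Return hex, base64, and word-based renderings of an MD5 digest."""
    digest = hashlib.md5(unique_info.encode("utf-8")).digest()
    as_hex = digest.hex()
    # Strip the '=' padding that base64 appends to a 16-byte input.
    as_b64 = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    # Use each of the first four digest bytes to index into the word list.
    as_words = " ".join(WORDS[b % len(WORDS)] for b in digest[:4])
    return as_hex, as_b64, as_words
```

The same input always yields the same three strings, so uniqueness of the key still depends on the uniqueness of the information being hashed.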

You could do a base64 encoding of some random data and remove the +, /, and = characters from the result? I don't know if this would make a predictable distribution. Also, it seems like more work than what you're doing now, which is a fine solution.

Assuming you're using a language/library without an utterly pathetic random number generator, what you've got looks pretty good. N symbols uniformly distributed over a reasonable alphabet works for me, and no amount of applying fancier code is likely to make it more random (just slower).
(For the record, pathetic would include ditching the high-order bits of the underlying random numbers when choosing a value from the given range. While ideally all RNGs would make every bit equally random, in practice that's not so; the higher-order bits tend to be more random. This means that the modulus operator is totally the wrong thing to use when clamping to a restricted range.)
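For what it's worth, here is a minimal Python sketch of the loop from the question. It assumes the standard `secrets` module is acceptable as the random source (it draws from the operating system, so no seeding is needed) and that previously issued keys are tracked in an in-memory set called `seen`; both function names are hypothetical.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # A-Za-z0-9

def random_key(n, alphabet=ALPHABET):
    """Build an n-symbol key by picking each symbol uniformly at random."""
    return "".join(secrets.choice(alphabet) for _ in range(n))

def unique_random_key(n, seen, alphabet=ALPHABET):
    """Retry until the key is not already in `seen`, then record it."""
    while True:
        key = random_key(n, alphabet)
        if key not in seen:
            seen.add(key)
            return key
```

Randomness alone does not guarantee uniqueness, which is why the second function checks against the keys already handed out. In a real system that set would more likely be a database column with a unique constraint.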

Related

A function where small changes in input always result in large changes in output

I would like an algorithm for a function that takes n integers and returns one integer. For small changes in the input, the resulting integer should vary greatly. Even though I've taken a number of courses in math, I have not used that knowledge very much and now I need some help...
An important property of this function should be that if it is used with coordinate pairs as input and the result is plotted (as a grayscale value for example) on an image, any repeating patterns should only be visible if the image is very big.
I have experimented with various algorithms for pseudo-random numbers with little success and finally it struck me that md5 almost meets my criteria, except that it is not for numbers (at least not from what I know). That resulted in something like this Python prototype (for n = 2, it could easily be changed to take a list of integers of course):
import hashlib
def uniqnum(x, y):
    # hashlib needs bytes in Python 3, so encode the "x,y" string first
    data = (str(x) + ',' + str(y)).encode()
    return int(hashlib.md5(data).hexdigest()[-6:], 16)
But obviously it feels wrong to go over strings when both input and output are integers. What would be a good replacement for this implementation (in pseudo-code, python, or whatever language)?
A "hash" is the solution created to solve exactly the problem you are describing. See wikipedia's article
Any hash function you use will be nice; hash functions tend to be judged based on these criteria:
The degree to which they prevent collisions (two separate inputs producing the same output) -- a by-product of this is the degree to which the function minimizes outputs that may never be reached from any input.
The uniformity of the distribution of its outputs given a uniformly distributed set of inputs
The degree to which small changes in the input create large changes in the output.
(see perfect hash function)
Given how hard it is to create a hash function that maximizes all of these criteria, why not just use one of the most commonly used and relied-on existing hash functions there already are?
From what it seems, turning integers into strings almost seems like another layer of encryption! (which is good for your purposes, I'd assume)
However, your question asks for hash functions that deal specifically with numbers, so here we go.
Hash functions that work over the integers
If you want to borrow already-existing algorithms, you may want to dabble in pseudo-random number generators
One simple one is the middle square method:
Take an n-digit number
Square it
Chop off the outer digits and keep the middle n digits, so the result has the same length as your original.
ie,
1111 => 01234321 => 2343
so, 1111 would be "hashed" to 2343, in the middle square method.
This way isn't that effective, but for a small number of hashes it has very low collision rates, a uniform distribution, and great chaos-potential (small changes => big changes). But if you have many values, time to look for something else...
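The middle square steps above can be written in a few lines of Python. This is a sketch under the assumption that the input has at most `digits` digits and that `digits` is even; the function name is made up for illustration.

```python
def middle_square(value, digits=4):
    """Square `value` and keep the middle `digits` digits of the result."""
    # Left-pad the square with zeros so it always has 2 * digits digits.
    squared = str(value * value).zfill(2 * digits)
    start = digits // 2
    return int(squared[start:start + digits])
```

For the example input, `middle_square(1111)` squares to 1234321, pads it to 01234321, and returns the middle four digits, 2343.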
The grand-daddy of all feasibly efficient and simple random number generators is the Mersenne Twister (http://en.wikipedia.org/wiki/Mersenne_twister). In fact, an implementation is probably out there for every programming language imaginable. Your hash "input" is something that will be called a "seed" in their terminology.
In conclusion
Nothing wrong with string-based hash functions
If you want to stick with the integers and be fancy, try using your number as a seed for a pseudo-random number generator.
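As a rough sketch of the seed idea: Python's `random.Random` is a Mersenne Twister, so the coordinates can be packed into one integer seed and the first output used as the "hash". The packing scheme (two values masked to 32 bits each) and the name `seeded_value` are assumptions made for this example.

```python
import random

def seeded_value(x, y, bits=24):
    """Use (x, y) as a PRNG seed and return the generator's first output."""
    # Pack both coordinates into one non-negative integer seed.
    # Assumes each coordinate fits in 32 bits; larger values would wrap.
    seed = ((x & 0xFFFFFFFF) << 32) | (y & 0xFFFFFFFF)
    return random.Random(seed).getrandbits(bits)
```

Constructing a new generator for every pixel is slow compared with a dedicated integer hash, so this is more a demonstration of the idea than something to use for large images.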
Hashing fits your requirements perfectly. If you really don't want to use strings, find a Hash library that will take numbers or binary data. But using strings here looks OK to me.
Bob Jenkins' mix function is a classic choice, at least when n=3.
As others point out, hash functions do exactly what you want. Hashes take bytes - not character strings - and return bytes, and converting between integers and bytes is, of course, simple. Here's an example python function that works on 32 bit integers, and outputs a 32 bit integer:
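import hashlib
import struct
def intsha1(ints):
    # Pack the integers as big-endian signed 32-bit values
    packed = struct.pack('>%di' % len(ints), *ints)
    digest = hashlib.sha1(packed).digest()
    # unpack returns a tuple, so take its single element
    return struct.unpack('>i', digest[:4])[0]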
It can, of course, be easily adapted to work with different length inputs and outputs.
Have a look at this; maybe it will inspire you:
Chaotic system
In chaotic dynamics, small changes vary results greatly.
An x-bit block cipher will take a number and convert it effectively to another number. You could combine (sum/mult?) your input numbers and cipher them, or iteratively encipher each number, similar to CBC or another chained mode. Google 'format preserving encryption'. It is possible to create a 32-bit block cipher (not widely 'available') and use this to create a 'hashed' output. The main difference between a hash and encryption is that a hash is irreversible.

Simple integer encryption

Is there a simple algorithm to encrypt integers? That is, a function E(i,k) that accepts an n-bit integer and a key (of any type) and produces another, unrelated n-bit integer that, when fed into a second function D(E(i,k),k) (along with the key), produces the original integer?
Obviously there are some simple reversible operations you can perform, but they all seem to produce clearly related outputs (e.g. consecutive inputs lead to consecutive outputs). Also, of course, there are cryptographically strong standard algorithms, but they don't produce small enough outputs (e.g. 32-bit). I know any 32-bit cryptography can be brute-forced, but I'm not looking for something cryptographically strong, just something that looks random. Theoretically speaking it should be possible; after all, I could just create a dictionary by randomly pairing every integer. But I was hoping for something a little less memory-intensive.
Edit: Thanks for the answers. Simple XOR solutions will not work because similar inputs will produce similar outputs.
Wouldn't this amount to a block cipher with a block size of 32 bits?
Not very popular, because it's easy to break. But theoretically feasible.
Here is one implementation in Perl :
http://metacpan.org/pod/Crypt::Skip32
UPDATE: See also Format preserving encryption
UPDATE 2: RC5 supports 32-64-128 bits for its block size
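To make the 32-bit block cipher idea concrete, here is a toy Feistel network in Python. It is not Skip32 or RC5, just an illustrative sketch: the round function (truncated SHA-256 of key, round number, and half-block), the round count, and all names are assumptions, and it makes no claim to cryptographic strength.

```python
import hashlib

def _round(half, key, rnd):
    """Round function: 16 bits derived from the key, round number, and half-block."""
    data = key + bytes([rnd]) + half.to_bytes(2, "big")
    return int.from_bytes(hashlib.sha256(data).digest()[:2], "big")

def encrypt32(value, key, rounds=4):
    """Map a 32-bit integer to another 32-bit integer (key is bytes)."""
    left, right = value >> 16, value & 0xFFFF
    for r in range(rounds):
        left, right = right, left ^ _round(right, key, r)
    return (left << 16) | right

def decrypt32(value, key, rounds=4):
    """Undo encrypt32 by running the rounds in reverse order."""
    left, right = value >> 16, value & 0xFFFF
    for r in reversed(range(rounds)):
        left, right = right ^ _round(left, key, r), left
    return (left << 16) | right
```

Because a Feistel network is a permutation of its block space, distinct inputs always give distinct outputs, and consecutive inputs should no longer produce visibly related results.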
I wrote an article some time ago about how to generate a 'cryptographically secure permutation' from a block cipher, which sounds like what you want. It covers using folding to reduce the size of a block cipher, and a trick for dealing with non-power-of-2 ranges.
A simple one:
rand = new Random(k);
return (i xor rand.Next())
(the point of XOR-ing with rand.Next() rather than k is that otherwise, given i and E(i,k), you can get k by k = i xor E(i,k))
Ayden is an algorithm that I developed. It is compact, fast and looks very secure. It is currently available for 32 and 64 bit integers. It is on public domain and you can get it from http://github.com/msotoodeh/integer-encoder.
You could take an n-bit hash of your key (assuming it's private) and XOR that hash with the original integer to encrypt, and with the encrypted integer to decrypt.
Probably not cryptographically solid, but depending on your requirements, may be sufficient.
If you just want it to look random and don't care about security, how about just swapping bits around? You could simply reverse the bit string, so the high bit becomes the low bit, the second highest becomes the second lowest, etc., or you could do some other random permutation (e.g. 1 to 4, 2 to 7, 3 to 1, etc.).
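The bit-reversal variant is short enough to sketch in Python. The 32-bit width and the function name are assumptions for the example.

```python
def reverse_bits32(value):
    """Reverse the order of the 32 bits in `value`."""
    # Render as a fixed-width binary string, reverse it, and parse it back.
    return int(format(value & 0xFFFFFFFF, "032b")[::-1], 2)
```

Reversal is its own inverse, so the same function both encodes and decodes. Note that consecutive inputs still differ in only a few output bits, which may or may not look random enough for a given use.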
How about XORing it with a prime or two? Swapping bits around seems very random when trying to analyze it.
Try something along the lines of XORing it with a prime and itself after bit shifting.
How many integers do you want to encrypt? How much key data do you want to have to deal with?
If you have few items to encrypt, and you're willing to deal with key data that's just as long as the data you want to encrypt, then the one-time-pad is super simple (just an XOR operation) and mathematically unbreakable.
The drawback is that the problem of keeping the key secret is about as large as the problem of keeping your data secret.
It also has the flaw (that is run into time and again whenever someone decides to try to use it) that if you take any shortcuts - like using a non-random key or the common one of using a limited length key and recycling it - that it becomes about the weakest cipher in existence. Well, maybe ROT13 is weaker.
But in all seriousness, if you're encrypting an integer, what are you going to do with the key no matter which cipher you decide on? Keeping the key secret will be a problem about as big (or bigger) than keeping the integer secret. And if you're encrypting a bunch of integers, just use a standard, peer reviewed cipher like you'll find in many crypto libraries.
RC4 will produce as little output as you want, since it's a stream cipher.
XOR it with /dev/random

Guessing the hash function?

I'd like to know which algorithm is employed. I strongly assume it's something simple and hopefully common. There's no lag in generating the results, for instance.
Input: any string
Output: 5 hex characters (0-F)
I have access to as many keys and results as I wish, but I don't know how exactly I could harness this to attack the function. Is there any method? If I knew any functions that converted to 5-chars to start with then I might be able to brute force for a salt or something.
I know for example that:
a=06a07
b=bfbb5
c=63447
(in case you have something in mind)
In normal use it converts random 32-char strings into 5-char strings.
The only way to derive a hash function from data is through brute force, perhaps combined with some cleverness. There are an infinite number of hash functions, and the good ones perform what is essentially one-way encryption, so it's a question of trial and error.
It's practically irrelevant that your function converts 32-character strings into 5-character hashes; the output is probably truncated. For fun, here are some perfectly legitimate examples, the last 3 of which are cryptographically terrible:
Use the MD5 hashing algorithm, which generates a 16-byte hash (32 hex characters), and use the 10th through the 14th characters.
Use the SHA-1 algorithm and take the last 5 characters.
If the input string is alphabetic, use the simple substitution A=1, B=2, C=3, ... and take the first 5 digits.
Find each character on your keyboard, measure its distance from the left edge in millimeters, and use every other digit, in reverse order, starting with the last one.
Create a stackoverflow user whose name is the 32-character string, divide 113 by the corresponding user ID number, and take the first 5 digits after the decimal. (But don't tell 'em I told you to do it!)
Depending on what you need this for, if you have access to as many keys and results as you wish, you might want to try a rainbow table approach. 5 hex chars is only about 1 million combinations (16^5 = 1,048,576). You should be able to brute-force generate a map of strings that match all of the resulting hashes in no time. Then you don't need to know the original string, just an equivalent string that generates the same hash, or you can brute-force entry by iterating over the roughly 1 million input strings.
Following on from a comment I just made to Pontus Gagge, suppose the hash algorithm is as follows:
Append some long, constant string to the input
Compute the SHA-256 hash of the result
Output the last 5 chars of the hash.
Then I'm pretty sure there's no computationally feasible way from your chosen-plaintext attack to figure out what the hashing function is. To even prove that SHA-256 is in use (assuming it's a good hash function, which as far as we currently know it is), I think you'd need to know the long string, which is only stored inside the "black box".
That said, if I knew any published 20-bit hash functions, then I'd be checking those first. But I don't know any: all the usual non-crypto string hashing functions are 32 bit, because that's the expected size of an integer type. You should perhaps compare your results to those of CRC, PJW, and BUZ hash on the same strings, as well as some variants of DJB hash with different primes, and any string hash functions built in to well-known programming languages, like java.lang.String.hashCode. It could be that the 5 output chars are selected from the 8 hex chars generated by one of those.
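A quick way to run this kind of comparison is to compute a few well-known digests for a known input and look for the observed 5 characters anywhere inside them. The sketch below covers only MD5, SHA-1, SHA-256, and CRC-32 from the Python standard library; the candidate list and function names are assumptions, and it does not cover salted or otherwise transformed inputs.

```python
import hashlib
import zlib

def candidate_digests(text):
    """Hex digests of `text` under a few common hash functions."""
    data = text.encode("utf-8")
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "crc32": format(zlib.crc32(data) & 0xFFFFFFFF, "08x"),
    }

def find_matches(text, observed):
    """Map each candidate name to the offset where `observed` appears in its digest."""
    observed = observed.lower()
    return {name: digest.find(observed)
            for name, digest in candidate_digests(text).items()
            if observed in digest}
```

Calling `find_matches("a", "06a07")` with the sample from the question returns a dictionary of any candidates whose digest contains those characters. An empty result rules out only these unsalted candidates.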
Beyond that (and any other well-known string hashes you can find), I'm out of ideas. To cryptanalyse a black box hash, you start by looking for correlations between the bits of the input and the bits of the output. This gives you clues what functions might be involved in the hash. But that's a huge subject and not one I'm familiar with.
This sounds mildly illicit.
Not to rain on your parade or anything, but if the implementors have done their work right, you wouldn't notice lags beyond a few tens of milliseconds on modern CPUs even with strong cryptographic hashes, and knowing the algorithm won't help you if they have used salt correctly. If you don't have access to the code or binaries, your only hope is a trivial mistake, whether caused by technical limitations or carelessness.
There is an uncountable infinity of potential (hash) functions for any given set of inputs and outputs, and if you have no clue better than an upper bound on their computational complexity (from the lag you detect), you have a very long search ahead of you...

Symmetric Bijective String Algorithm?

I'm looking for an algorithm that can do a one-to-one mapping of a string onto another string.
I want an algorithm that given an alphabet I can perform a symmetric mapping function.
For example:
Let's consider that I have the alphabet "A","B","C","D","E","F". I want something like F("ABC") = "CEA" and F("CEA") = "ABC" for every N letter permutation.
Surely, an algorithm like this exists. If you know of an algorithm, please post the name of it and I can research it. If I haven't been clear enough in my request, please let me know.
Thanks in advance.
Edit 1:
I should clarify that I want enough entropy so that F("ABC") would equal "CEA" and F("CEA") = "ABC" but then I do NOT want F("ABD") to equal "CEF". Notice how two input letters stayed the same and the two corresponding output letters stayed the same?
So a Caesar Cipher/ROT13 or shuffling the array would not be sufficient. However, I don't need any "real" security. Just enough entropy for the output of the function to appear random. Weak encryption algorithms welcome.
Just create an array of objects that contain 2 fields -- a letter, and a random number. Sort the array by the random numbers. This creates a mapping where the i-th letter of the alphabet now maps to the i-th letter in the array.
If simple transposition or substitution isn't quite enough, it sounds like you want to advance to a polyalphabetic cipher. The Vigenère cipher is extremely easy to implement in code, but is still difficult to break without using a computer.
I suggest the following.
Perform a dense coding of the input to positive integers - with an alphabet size of n and string length of m you can code the string into integers between zero and n^m - 1. In your example this would be the range [0,215]. Now perform a fixed involution on the encoded number and decode it again.
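Here is one way this could look in Python. The involution is built by shuffling the range with a seeded generator and swapping adjacent pairs, which is just one possible choice; the seed, the table-based approach, and all function names are assumptions. The table has n^m entries, so this only suits small alphabets and short strings.

```python
import random

def make_involution(size, seed):
    """Pair up the numbers 0..size-1 at random so that mapping[mapping[x]] == x."""
    rng = random.Random(seed)
    items = list(range(size))
    rng.shuffle(items)
    mapping = list(range(size))  # with an odd size, one leftover maps to itself
    for a, b in zip(items[0::2], items[1::2]):
        mapping[a], mapping[b] = b, a
    return mapping

def encode(s, alphabet):
    """Read the string as a base-n number, n being the alphabet size."""
    value = 0
    for ch in s:
        value = value * len(alphabet) + alphabet.index(ch)
    return value

def decode(value, length, alphabet):
    """Inverse of encode for strings of the given length."""
    chars = []
    for _ in range(length):
        value, digit = divmod(value, len(alphabet))
        chars.append(alphabet[digit])
    return "".join(reversed(chars))

def symmetric_map(s, alphabet="ABCDEF", seed=0):
    """Encode, apply the fixed involution, and decode again."""
    mapping = make_involution(len(alphabet) ** len(s), seed)
    return decode(mapping[encode(s, alphabet)], len(s), alphabet)
```

With the six-letter alphabet and three-letter strings, applying `symmetric_map` twice returns the original string, and neighbouring inputs such as "ABC" and "ABD" are paired independently of each other.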
Take RC4, settle for some password, and you're done. (Not that this would be very safe.)
Take the set of all permutations of your alphabet, shuffle it, and map the first half of the set onto the second half. Bad for large alphabets, of course. :)
Nah, thought that over, I forgot about character repetitions. Maybe divide the input into chunks without repeating chars and apply my suggestion to all of those chunks.
I would restate your problem thus, and give you a strategy for that restatement:
"A substitution cypher where a change in input leads to a larger change in output".
The blocking of characters is irrelevant-- in the end, it's just mappings between numbers. I'll speak of letters here, but you can extend it to any block of n characters.
One of the easiest routes for this is a rotating substitution based on input. Since you already looked at the Vigenere cipher, it should be easy to understand. Instead of making the key be static, have it be dependent on the previous letter. That is, rotate through substitutions a different amount per each input.
The variable rotation satisfies the condition of making each small change push out to a larger change. Note that the algorithm will only push changes in one direction such that changes towards the end have smaller effects. You could run the algorithm both ways (front-to-back, then back-to-front) so that every letter of cleartext changed has the possibility of changing the entire string.
The internal rotation strategy elides the need for keys, while of course losing most of the cryptographic security. It makes sense in context, though, as you are aiming for entropy rather than security.
You can solve this problem with Format-preserving encryption.
A Java library can be found at https://github.com/EVGStudents/FPE.git. There you can define a regex and encrypt/decrypt string values matching this regex.

What is a good Hash Function?

What is a good hash function? I saw a lot of hash functions and applications in my data structures courses in college, but what I mostly got out of it is that it's pretty hard to make a good hash function. As a rule of thumb to avoid collisions my professor said that:
function Hash(key)
return key mod PrimeNumber
end
(mod is the % operator in C and similar languages)
with the prime number being the size of the hash table. I get that this is a reasonably good function for avoiding collisions, and a fast one, but how can I make a better one? Are there better hash functions for string keys as opposed to numeric keys?
There's no such thing as a “good hash function” for universal hashes (ed. yes, I know there's such a thing as “universal hashing” but that's not what I meant). Depending on the context different criteria determine the quality of a hash. Two people already mentioned SHA. This is a cryptographic hash and it isn't at all good for hash tables which you probably mean.
Hash tables have very different requirements. But still, finding a good hash function universally is hard because different data types expose different information that can be hashed. As a rule of thumb it is good to consider all information a type holds equally. This is not always easy or even possible. For reasons of statistics (and hence collision), it is also important to generate a good spread over the problem space, i.e. all possible objects. This means that when hashing numbers between 100 and 1050 it's no good to let the most significant digit play a big part in the hash because for ~ 90% of the objects, this digit will be 0. It's far more important to let the last three digits determine the hash.
Similarly, when hashing strings it's important to consider all characters – except when it's known in advance that the first three characters of all strings will be the same; considering these then is a waste.
This is actually one of the cases where I advise to read what Knuth has to say in The Art of Computer Programming, vol. 3. Another good read is Julienne Walker's The Art of Hashing.
For doing "normal" hash table lookups on basically any kind of data - this one by Paul Hsieh is the best I've ever used.
http://www.azillionmonkeys.com/qed/hash.html
If you care about cryptographically secure or anything else more advanced, then YMMV. If you just want a kick ass general purpose hash function for a hash table lookup, then this is what you're looking for.
There are two major purposes of hashing functions:
to disperse data points uniformly into n bits.
to securely identify the input data.
It's impossible to recommend a hash without knowing what you're using it for.
If you're just making a hash table in a program, then you don't need to worry about how reversible or hackable the algorithm is... SHA-1 or AES is completely unnecessary for this, you'd be better off using a variation of FNV. FNV achieves better dispersion (and thus fewer collisions) than a simple prime mod like you mentioned, and it's more adaptable to varying input sizes.
If you're using the hashes to hide and authenticate public information (such as hashing a password, or a document), then you should use one of the major hashing algorithms vetted by public scrutiny. The Hash Function Lounge is a good place to start.
This is an example of a good one and also an example of why you would never want to write one.
It is a Fowler / Noll / Vo (FNV) Hash which is equal parts computer science genius and pure voodoo:
unsigned fnv_hash_1a_32 ( void *key, int len ) {
    unsigned char *p = key;
    unsigned h = 0x811c9dc5;
    int i;
    for ( i = 0; i < len; i++ )
        h = ( h ^ p[i] ) * 0x01000193;
    return h;
}
unsigned long long fnv_hash_1a_64 ( void *key, int len ) {
    unsigned char *p = key;
    unsigned long long h = 0xcbf29ce484222325ULL;
    int i;
    for ( i = 0; i < len; i++ )
        h = ( h ^ p[i] ) * 0x100000001b3ULL;
    return h;
}
Edit:
Landon Curt Noll recommends on his site the FNV-1a algorithm over the original FNV-1 algorithm: the improved algorithm better disperses the last byte in the hash. I adjusted the algorithm accordingly.
I'd say that the main rule of thumb is not to roll your own. Try to use something that has been thoroughly tested, e.g., SHA-1 or something along those lines.
A good hash function has the following properties:
Given a hash of a message it is computationally infeasible for an attacker to find another message such that their hashes are identical.
It is computationally infeasible to find any pair of distinct messages, m and m', such that h(m) = h(m').
The two cases are not the same. In the first case, there is a pre-existing hash that you're trying to find a collision for. In the second case, you're trying to find any two messages that collide. The second task is significantly easier due to the birthday "paradox."
Where performance is not that great an issue, you should always use a secure hash function. There are very clever attacks that can be performed by forcing collisions in a hash. If you use something strong from the outset, you'll secure yourself against these.
Don't use MD5 or SHA-1 in new designs. Most cryptographers, me included, would consider them broken. The principal source of weakness in both of these designs is that the second property, which I outlined above, does not hold for these constructions. If an attacker can generate two messages, m and m', that both hash to the same value, they can use these messages against you. SHA-1 and MD5 also suffer from message extension attacks, which can fatally weaken your application if you're not careful.
A more modern hash such as Whirlpool is a better choice. It does not suffer from these message extension attacks and uses the same mathematics as AES uses to prove security against a variety of attacks.
Hope that helps!
What you're saying here is that you want one that has collision resistance. Try using SHA-2. Or try using a (good) block cipher in a one-way compression function (never tried that before), like AES in Miyaguchi-Preneel mode. The problem with that is that you need to:
1) have an IV. Try using the first 256 bits of the fractional parts of Khinchin's constant or something like that.
2) have a padding scheme. Easy. Borrow it from a hash like MD5 or SHA-3 (Keccak [pronounced 'ket-chak']).
If you don't care about the security (a few others said this), look at FNV or lookup2 by Bob Jenkins (actually I'm the first one who recommends lookup2). Also try MurmurHash, it's fast (check this: .16 cpb).
A good hash function should
be bijective so as not to lose information, where possible, and have the fewest collisions
cascade as much and as evenly as possible, i.e. each input bit should flip every output bit with probability 0.5 and without obvious patterns.
if used in a cryptographic context there should not exist an efficient way to invert it.
A prime number modulus does not satisfy any of these points. It is simply insufficient. It is often better than nothing, but it's not even fast. Multiplying with an unsigned integer and taking a power-of-two modulus distributes the values just as well, that is not well at all, but with only about 2 cpu cycles it is much faster than the 15 to 40 a prime modulus will take (yes integer division really is that slow).
To create a hash function that is fast and distributes the values well the best option is to compose it from fast permutations with lesser qualities like they did with PCG for random number generation.
Useful permutations, among others, are:
multiplication with an odd integer
binary rotations
xorshift
Following this recipe we can create our own hash function or we take splitmix which is tested and well accepted.
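As an illustration of the recipe, the mixing step of splitmix64 can be written in Python as below. It chains an addition, two xorshift-and-multiply steps with odd constants, and a final xorshift. The constants are the commonly published splitmix64 values reproduced from memory, so treat them as an assumption and check them against a reference implementation before relying on them.

```python
MASK64 = (1 << 64) - 1  # Python ints are unbounded, so wrap to 64 bits by hand

def splitmix64_mix(x):
    """Mix a 64-bit integer using the splitmix64 sequence of permutations."""
    z = (x + 0x9E3779B97F4A7C15) & MASK64               # add an odd constant
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64  # xorshift, odd multiply
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64  # xorshift, odd multiply
    return z ^ (z >> 31)                                 # final xorshift
```

Every step is a permutation of the 64-bit integers, so the composition is one too: distinct inputs can never collide before the result is reduced to a table index.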
If cryptographic qualities are needed I would highly recommend to use a function of the sha family, which is well tested and standardised, but for educational purposes this is how you would make one:
First you take a good non-cryptographic hash function, then you apply a one-way function like exponentiation on a prime field or k many applications of (n*(n+1)/2) mod 2^k interspersed with an xorshift when k is the number of bits in the resulting hash.
I highly recommend the SMhasher GitHub project https://github.com/rurban/smhasher which is a test suite for hash functions. The fastest state-of-the-art non-cryptographic hash functions without known quality problems are listed here: https://github.com/rurban/smhasher#summary.
Different application scenarios have different design requirements for hash algorithms, but a good hash function should have the following three points:
Collision Resistance: try to avoid conflicts. If it is difficult to find two inputs that are hashed to the same output, the hash function is anti-collision
Tamper Resistant: As long as one byte is changed, its hash value will be very different.
Computational Efficiency: Hash table is an algorithm that can make a trade-off between time consumption and space consumption.
In 2022, we can choose the SHA-2 family for security-sensitive use; SHA-3 is safer but has a greater performance cost. A safer approach is to add salt and mix encryption.

Resources