What is ColdFusion 10 random number generation best practice? - algorithm

Do you still need to use Randomize if you are using RandRange with an algorithm? For example:
RandRange(1, 37, "SHA1PRNG")
Adobe's documentation says:
SHA1PRNG: generates a number using the Sun Java SHA1PRNG algorithm.
This algorithm provides greater randomness than the default algorithm.
It would be nice if there was one function which provided the most randomized sequence possible. The example given by Adobe uses both Randomize and RandRange.
Any clarification would be welcome.
Additional info:
In this context I am choosing random characters from a list of about 40 to allocate a password of 7 characters. I'd like to avoid duplicates although there are also separate (though not necessarily unique) usernames. Nothing financial or confidential is at stake - just need to identify users of an educational website.

For non-repeating characters, you've got to reduce randRange's range and select from a list of unused characters.
Sure, use RandRange with SHA1PRNG and don't worry about it.
You don't really need Randomize. It's only used for seeding the random functions when you want a predictable random sequence for debugging purposes.
An alternative solution would be shuffling a collection of characters using java.util.Collections' shuffle(), then using left() to get the desired number of non-repeating characters. See: http://www.bennadel.com/blog/2284-Using-java-util-Collections-To-Shuffle-A-ColdFusion-Query-Column-Corrupts-Column-Values.htm
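A minimal sketch of that shuffle-then-take-a-prefix approach, written in Python rather than ColdFusion (the roughly 40-character candidate list below is just an illustrative assumption):
import secrets

CANDIDATES = list("abcdefghjkmnpqrstuvwxyz23456789ABCDEFGHJ")   # 40 characters, no lookalikes

def random_password(length=7):
    # Shuffle with a cryptographically strong RNG, then take a prefix:
    # this guarantees no repeated characters within one password.
    pool = CANDIDATES[:]
    secrets.SystemRandom().shuffle(pool)
    return "".join(pool[:length])

print(random_password())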

Related

How can I generate a unique identifier that is apparently not progressive [duplicate]

A few months back I was tasked with implementing a unique and random code for our web application. The code would have to be user friendly and as small as possible, but still be essentially random (so users couldn't easily predict the next code in the sequence).
It ended up generating values that looked something like this:
Af3nT5Xf2
Unfortunately, I was never satisfied with the implementation. GUIDs were out of the question; they were simply too big and difficult for users to type in. I was hoping for something more along the lines of 4 or 5 characters/digits, but our particular implementation would generate noticeably patterned sequences if we encoded to fewer than 9 characters.
Here's what we ended up doing:
We pulled a unique sequential 32-bit id from the database. We then inserted it into the center bits of a 64-bit random integer. We created a lookup table of easily typed and recognized characters (A-Z, a-z, 2-9, skipping easily confused characters such as L, l, 1, O, 0, etc.). Finally, we used that lookup table to base-54 encode the 64-bit integer. The high bits were random, the low bits were random, but the center bits were sequential.
The final result was a code that was much smaller than a guid and looked random, even though it absolutely wasn't.
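A rough Python sketch of the layout described above (the exact bit widths and the 54-character lookup table here are my assumptions; the original table isn't given):
import secrets

# Example lookup table: A-Z, a-z, 2-9 minus the confusable I, L, O, i, l, o (54 characters)
ALPHABET = "ABCDEFGHJKMNPQRSTUVWXYZabcdefghjkmnpqrstuvwxyz23456789"

def make_code(sequential_id):
    # random high bits | 32-bit sequential id in the middle | random low bits
    high = secrets.randbits(16)
    low = secrets.randbits(16)
    value = (high << 48) | ((sequential_id & 0xFFFFFFFF) << 16) | low
    digits = []
    while value:                       # base-54 encode the 64-bit integer
        value, rem = divmod(value, len(ALPHABET))
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits)) or ALPHABET[0]

print(make_code(123456))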
I was never satisfied with this particular implementation. What would you guys have done?
Here's how I would do it.
I'd obtain a list of common English words with usage frequency and some grammatical information (like whether each is a noun or a verb). I think you can look around the intertubes for a copy. Firefox is open source and it has a spellchecker... so it must be obtainable somehow.
Then I'd run a filter on it so that obscure words are removed and words which are too long are excluded.
Then my generation algorithm would pick 2 words from the list, concatenate them, and add a random 3-digit number.
I can also randomize the word-selection pattern between verbs and nouns, like
eatCake778
pickBasket524
rideFlyer113
etc..
The case needn't be camel case; you can randomize that as well. You can also randomize the placement of the number and of the verb/noun.
And since that's a lot of randomizing, Jeff's The Danger of Naïveté is a must-read. Also make sure to study dictionary attacks well in advance.
And after I'd implemented it, I'd run a test to make sure that my algorithm doesn't collide in practice. If the collision rate was high, I'd play with the parameters (number of nouns used, number of verbs used, length of the random number, total number of words, different kinds of casing, etc.).
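A quick Python sketch of that word+word+number scheme (the tiny word lists and the casing rules below are placeholders, not a vetted dictionary):
import secrets

VERBS = ["eat", "pick", "ride", "build", "carry"]      # placeholder lists, not a real dictionary
NOUNS = ["cake", "basket", "flyer", "ladder", "garden"]

def word_code():
    rng = secrets.SystemRandom()
    verb, noun = rng.choice(VERBS), rng.choice(NOUNS)
    number = rng.randrange(100, 1000)                   # random 3-digit number
    if rng.random() < 0.5:                              # randomize verb/noun placement
        verb, noun = noun, verb
    if rng.random() < 0.5:                              # randomize the casing
        noun = noun.capitalize()
    return f"{verb}{noun}{number}"

print(word_code())   # e.g. eatCake778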
In .NET you can use the RNGCryptoServiceProvider method GetBytes(), which will "fill an array of bytes with a cryptographically strong sequence of random values" (from the MS documentation).
using System.Security.Cryptography;

byte[] randomBytes = new byte[4];
RNGCryptoServiceProvider rng = new RNGCryptoServiceProvider();
rng.GetBytes(randomBytes);   // fills the array with cryptographically strong random bytes
You can increase the length of the byte array and pluck out the character values you want to allow.
In C#, I have used the System.IO.Path.GetRandomFileName() method... but I was generating salt for debug file names. This method returns strings that look like your first example, except with a random '.xyz' file extension too.
If you're in .NET and just want a simpler (but not 'nicer' looking) solution, I would say this is it... you could remove the random file extension if you like.
At the time of this writing, this question's title is:
How can I generate a unique, small, random, and user-friendly key?
To that, I should note that it's not possible in general to create a random value that's also unique, at least if each random value is generated independently of any other. In addition, there are many things you should ask yourself if you want to generate unique identifiers (which come from my section on unique random identifiers):
Can the application easily check identifiers for uniqueness within the desired scope and range (e.g., check whether a file or database record with that identifier already exists)?
Can the application tolerate the risk of generating the same identifier for different resources?
Do identifiers have to be hard to guess, be simply "random-looking", or be neither?
Do identifiers have to be typed in or otherwise relayed by end users?
Is the resource an identifier identifies available to anyone who knows that identifier (even without being logged in or authorized in some way)?
Do identifiers have to be memorable?
In your case, you have several conflicting goals: You want identifiers that are—
unique,
easy to type by end users (including small), and
hard to guess (including random).
Important points you don't mention in the question include:
How will the key be used?
Are other users allowed to access the resource identified by the key, whenever they know the key? If not, then additional access control or a longer key length will be necessary.
Can your application tolerate the risk of duplicate keys? If so, then the keys can be completely randomly generated (such as by a cryptographic RNG). If not, then your goal will be harder to achieve, especially for keys intended for security purposes.
Note that I don't go into the issue of formatting a unique value into a "user-friendly key". There are many ways to do so, and they all come down to mapping unique values one-to-one with "user-friendly keys" — if the input value was unique, the "user-friendly key" will likewise be unique.
If by user friendly you mean that a user could type the answer in, then I think you would want to look in a different direction. I've seen and done implementations of initial random passwords that pick random words and numbers as an easier and less error-prone string.
If, though, you're looking for a way to encode a random code in the URL string, which is an issue I've dealt with for a while, then what I have done is use Base64-encoded GUIDs.
You could load your list of words, as chakrit suggested, into a data table or XML file with a unique sequential key. When getting your random word, use a random number generator to determine which words to fetch by their key. If you concatenate 2 of them, I don't think you need to include the numbers in the string unless "true randomness" is part of the goal.

Need an Algorithm to generate Serialnumber

I want to generate 16-digit hexadecimal serial numbers like: F204-8BE2-17A2-CFF3.
(This pattern gives me 16^16 distinct serial numbers, but I don't need all of them.)
I need you to suggest an algorithm to generate these serial numbers randomly, with one special characteristic:
any two serial numbers have at least 6 different digits
(meaning that even the two most similar serial numbers should still differ at 6 positions)
I know that an algorithm which strictly guarantees this characteristic needs to remember previously generated serial numbers, and I don't want that.
In fact, I need an algorithm which does this with the least probability of a chosen pair colliding (less than 0.001 seems sufficient).
PS:
I've just tried creating 10K strings randomly using an MD5 hash, and it gave similar strings (similar = more than 3 matching digits) with probability 0.00018.
It is possible to construct a correct generator without having to remember all previously generated codes. You can generate serial numbers that are at least 6 digits apart by using an error-correcting code such as a Hamming code, designed so that any two distinct codewords are spaced far enough apart. Obviously, the greater the distance, the more redundancy you will have to add, resulting in more complex code and longer numbers.
First you design such a code to your liking, one that encodes a number into a sequence of hexadecimal digits. Then you can take any sequence of numbers, such as the primes, and use it as the seed; you just always need to remember what number was used last and use the next one.
That being said, if you don't need to strictly ensure a minimal distance between two serials and would settle for a small error, any half-decent hash function or cipher should produce decently spaced-out outputs. So the first thing I would try is to take MD5 or SHA hashes and test-drive them on the numbers 1 - 1000. My hope is that the results will be quite satisfactory.
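A small Python sketch of that test drive, assuming we simply keep the first 16 hex digits of each MD5 hash and then measure how far apart the closest pair of serials is:
import hashlib
from itertools import combinations

def serial(n):
    # first 16 hex digits of MD5, grouped like F204-8BE2-17A2-CFF3
    h = hashlib.md5(str(n).encode()).hexdigest()[:16].upper()
    return "-".join(h[i:i + 4] for i in range(0, 16, 4))

serials = [serial(n) for n in range(1, 1001)]

def differing_digits(a, b):
    return sum(1 for x, y in zip(a, b) if x != y)

worst = min(differing_digits(a, b) for a, b in combinations(serials, 2))
print(serials[0], serials[1])
print("smallest pairwise distance:", worst, "digits")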
I suggest you look into the ANSI X9.17 pseudorandom bit generator. An algorithmic sketch is given in these slides. ANSI X9.17 generates 64-bit pseudorandom strings which is what you want.
A revised and enhanced version of this generator was approved by NIST. Please have a look at this page.
Now whether you use ANSI X9.17 generator, another generator, or develop your own, it's a good idea to have the generator pass some statistical tests in order to ensure the quality of its pseudorandom bits.
Example tests include the ENT battery, the DIEHARD battery, and the NIST battery.

Why are some random() functions deemed "not secure?"

I've heard people being warned all over the place not to rely on a language's random() function to generate a random number or string sequence "for security reasons." Java even has a SecureRandom class. Why is this?
When people talk about predicting the output of a random number generator, they don't even need to get the actual "next number". Even something subtle like noticing that the random numbers aren't evenly distributed, or that they never produce the same number twice in a row, or that "bit 5 is always set", can go a long way towards turning an attack based on guessing a "random" number from taking years, to taking days.
There is a tradeoff, generally, too. Without specific hardware to do it, generating large quantities of random numbers quickly can be really hard, since there isn't enough "randomness" available to the computer so it has to fake it.
If you're not using the randomness for security (cryptography, passwords, etc), but instead for things like simulations or numerical work, then it doesn't matter too much if they're predictable, only that they're statistically random.
Almost every random number generator is 'pseudo random' in that it uses a table of random numbers or a predictable formula. A seed is sometimes used to "start" the random sequence at a specific point, e.g. seedRandom(timer).
This was especially prevalent in the days of BASIC programming, because its random number generator always started with exactly the same sequence of numbers, making it unusable for any kind of GUID generation.
Back in the day, the Z-80 microprocessor had a truly random number generator, although it only produced a number between 0 and 127. It used a processor cycle function and was unpredictable.
Pseudo-random numbers that can be determined in advance can lead to security holes that are vulnerable to a random number generator attack.
Predictability of a random number is a big issue. Most "random" functions derive their seed from the time. Given the right set of conditions, you could end up with two "random" numbers of a large value that are the same.
In the Windows/.NET world, a CSPRNG (cryptographically secure pseudorandom number generator) can be found in System.Security.Cryptography.RandomNumberGenerator, which sits on top of the underlying Win32 APIs.
On Linux there is a random "device" (/dev/random and /dev/urandom).
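To make the distinction concrete, here is a short Python illustration (mine, not from the answers above): a PRNG seeded with a known value replays the exact same sequence, while the secrets module draws from the OS CSPRNG and cannot be replayed from a seed.
import random
import secrets

a = random.Random(12345)   # Mersenne Twister, seeded
b = random.Random(12345)
print([a.randrange(100) for _ in range(5)])   # same seed...
print([b.randrange(100) for _ in range(5)])   # ...identical "random" sequence

print(secrets.randbelow(100), secrets.randbelow(100))   # OS-backed, not reproducible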

YouTube URL algorithm?

How would you go about generating the unique video URL's that YouTube uses?
Example:
http://www.youtube.com/watch?v=CvUN8qg9lsk
YouTube uses Base64 encoding to generate IDs for each video. The characters involved in generating the IDs consist of
(A-Z) + (a-z) + (0-9) + (-) + (_). (64 Characters).
Using Base64 encoding and only up to 11 characters, they can generate 73+ quintillion unique IDs. How large a pool of IDs is that?
Well, it's enough for everyone on Earth to produce a video every single minute for 18,000 years.
And they have achieved such a huge number using only 11 characters (64*64*64*64*64*64*64*64*64*64*64); if they need more IDs, they will just have to add one more character.
So when a video is uploaded to YouTube, they basically pick randomly from those 73+ quintillion possibilities and check whether the ID is already taken or not; if not, they use it, otherwise they look for another one.
Refer to this video for a detailed explanation.
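A tiny Python sketch of that pick-and-check idea (the alphabet constant and the in-memory "taken" set are stand-ins, not YouTube's actual implementation):
import secrets
import string

ID_ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"   # 64 characters

def new_video_id(already_taken, length=11):
    while True:
        candidate = "".join(secrets.choice(ID_ALPHABET) for _ in range(length))
        if candidate not in already_taken:      # re-roll on the (extremely unlikely) collision
            already_taken.add(candidate)
            return candidate

taken = set()
print(new_video_id(taken))   # shaped like CvUN8qg9lsk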
Using some non-trivial hashing function. The probability of collision is very low, depending on the function, the parameters and the input domain. Keep in mind that cryptographic hashes were specifically designed to have very low collision rates for non-random input (i.e. completely different hashes for two close-but-unequal inputs).
This post by Jeff Atwood is a nice overview of the topic.
And here is an online hash calculator you can play with.
There is no need to use a hash. It is probably just a quasi-random 64-bit value passed through base64 or some equivalent.
By quasi-random, I mean it is just a one-to-one mapping with the counting integers, just shuffled.
For example, you could take a monotonically increasing database id and multiply it by some prime near 2^64, then base64 the result. If you did not want people to be able to guess, you might choose a more complex mapping or just pick a random number that is not in the database yet.
Normal base64 would add an equals sign at the end, but in this case it is implied because the size is known. The character mapping could easily be something other than the standard one.
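A rough Python sketch of that "shuffle the counting integers" idea; the particular prime and the use of URL-safe base64 are my assumptions, not YouTube's actual scheme:
import base64

PRIME = 0xFFFFFFFFFFFFFFC5        # 2^64 - 59, the largest prime below 2^64
MASK = (1 << 64) - 1

def quasi_random_id(db_id):
    value = (db_id * PRIME) & MASK                   # one-to-one mapping over 64-bit integers
    raw = value.to_bytes(8, "big")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()   # 11 chars, padding implied

for db_id in (1, 2, 3):
    print(db_id, quasi_random_id(db_id))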
Eli's link to Jeff's article is, in my opinion, irrelevant. URL shortening is not the same thing as presenting an ID to the world. Instead, a nicer way would be to convert your existing integer ID to a different radix.
An example in PHP:
$id = 9999;
//$url_id = base_convert($id, 10, 26+26+10); // PHP doesn't like this
$url_id = base_convert($id, 10, 26+10); // Works, but only digits + lowercase
Sadly, PHP only supports up to base 36 (digits + alphabet). Base 62 would support alphabet in both upper-case and lower-case.
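Since base_convert tops out at base 36, here is what a base-62 conversion looks like, sketched in Python rather than PHP purely for illustration:
import string

BASE62 = string.digits + string.ascii_lowercase + string.ascii_uppercase   # 0-9, a-z, A-Z

def to_base62(n):
    if n == 0:
        return BASE62[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(BASE62[rem])
    return "".join(reversed(digits))

print(to_base62(9999))   # the $id from the PHP example above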
People are talking about these other systems:
Random numbers/letters - Why? If you want people not to see the next video (id+1), then just make it private. On a website like YouTube, where it actively shows any video it has, why bother with random ids?
Hashing an ID - This design concept really stinks. Think about it: you have an ID guaranteed by your database software to be unique, and you hash it (introducing a collision factor)? Give me one reason even to consider this idea.
Using the ID in the URL - To be honest, I don't see any problem with this either, though it will grow to be large when in fact you can express the same number with fewer letters (hence my solution).
Using Base64 - Base64 expects bytes of data, literally anything from nulls to spaces. Why use this function when your data consists of a number (i.e., a mix of 10 different characters instead of 256)?
You can use any library, or some languages like Python provide it in the standard library.
Example:
import secrets

id_length = 12   # note: token_urlsafe takes a byte count, not an output length
random_video_id = secrets.token_urlsafe(id_length)   # 12 random bytes -> 16 URL-safe characters
print(random_video_id)
You could generate a GUID and have that as the ID for the video.
GUIDs are very unlikely to collide.
Your best bet is probably to simply generate random strings and keep track (in a DB, for example) of which strings you've already used so you don't duplicate. This is very easy to implement, and it cannot fail if properly implemented (no duplicates, etc.).
I don't think that the URL v parameter has anything to do with the content (video properties, title, description etc).
It's a randomly generated string of fixed length and contains a very specific set of characters. No duplicates are allowed.
I suggest using a perfect hash function:
Perfect Hash Function for Human Readable Order Codes
As the accepted answer indicates, take a number, then apply a sequence of "bijective" (or reversible) operations on the number to get a hashed number.
The input numbers should be in sequence: 0, 1, 2, 3, and so on.
Typically you're hiding a numeric identifier in the form of something that doesn't look numeric. One simple method is something like base-36 encoding the number. You should be able to pull that off with one or another variant of itoa() in the language of your choice.
Just pick random values until you have one never seen before.
Randomly picking and exhausting all values from a set runs in expected time O(n log n): What is O value for naive random selection from finite set?
In your case you wouldn't exhaust the set, so you should get constant time picks. Just use a fast data structure to do the duplication lookups.

Algorithm that Generates Unique Serial Number for Each English Word

For an application I need to generate unique serial numbers for each English word.
What would be the best approach?
One constraint is that the serial number generation algorithm should be very efficient on an ordinary desktop computer.
Thanks
Do you have a list of all possible words? If yes, start from 0 at the first word and increment the serial by 1 for each word.
If not, then a simple way to guarantee they are unique is to use the word itself as the serial. For example, ABC = 0x41 0x42 0x43 = 4276803.
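A minimal Python version of that "word itself as the serial" idea, interpreting the word's bytes as one big integer:
word = "ABC"
serial = int.from_bytes(word.encode("ascii"), "big")   # 0x41 0x42 0x43
print(serial)   # 4276803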
As suggested in the comments there are other ways (that however require more work), such as compressing the words first with, for example, Huffman.
This of course gets awkward with long words: The serial of Pneumonoultramicroscopicsilicovolcanoconiosis would require around 100 digits, for example.
Otherwise you can use a hash, but there is no guarantee it will be unique for all English words.
You appear to be asking about a perfect hashing function. If so, take a look at this Wikipedia article, and at the gperf utility.
Here is an algorithm (in Python) that allows you to encode and decode any combination of lowercase letters:
def encode(s):
    r = 1                                  # leading 1 marks the length, so e.g. "a" and "aa" differ
    for ch in s:
        r = r * 26 + (ord(ch) - ord('a'))
    return r
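The matching decoder (my addition; the original answer only shows the encoder) simply peels the base-26 digits back off:
def decode(r):
    letters = []
    while r > 1:                           # stop at the leading 1 marker
        r, d = divmod(r, 26)
        letters.append(chr(d + ord('a')))
    return "".join(reversed(letters))

print(decode(encode("hello")))   # hello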
Using 64 bits you can code up to 12-letter words. You can use the remaining unused serials as an index into a table containing low-frequency, very long words.
Just use a 64-bit hash function, like Fowler-Noll-Vo. You're not likely to get collisions using a 64-bit integer, as this gives you 2^64 possible values, and there are certainly far fewer English words than that. You'd need to normalize each word, of course (convert to lower case, etc.).
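A compact Python sketch of the 64-bit FNV-1a variant of Fowler-Noll-Vo, applied to a normalized (lower-cased) word:
FNV_OFFSET = 0xcbf29ce484222325   # standard 64-bit FNV-1a parameters
FNV_PRIME = 0x100000001b3
MASK64 = (1 << 64) - 1

def fnv1a_64(word):
    h = FNV_OFFSET
    for byte in word.lower().encode("utf-8"):   # normalize to lower case first
        h ^= byte
        h = (h * FNV_PRIME) & MASK64
    return h

print(hex(fnv1a_64("Serial")))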
Do you really need it to be "serial"? If not, did you try the various hash algorithms? Several of them are built into .NET (MD5 and SHA1, if I remember correctly). I am not sure which one will be good enough, especially with short strings.
Are you looking for every word, or every word in the English dictionary? Are you using standard words - i.e. from the Oxford English Dictionary - or are slang words included too? I guess what I'm getting at is: "How big is your dictionary?" You could use an MD5 hash, which has a theoretical possibility of collisions - albeit roughly 1 in billions of hashes - although I can't say I understand the purpose of using a hash over using the actual word. Unless perhaps you want to calculate the serial client side so that it references the correct dictionary item on the server side without having to parse the dictionary looking for its serial. Of course, the word obviously has to be sufficiently unique in order for us to understand it as humans, and we're way more efficient at parsing the meaning of words than a computer is at doing the same.
Are you looking to separate words that look the same but are pronounced differently? Words that look and sound the same but have different meanings? If so, then you're going to come unstuck with a hash, as the same spelling with a different meaning will produce the same hash, so it won't work for this scenario. In that case you'd need some kind of incremental system. If you add words to the dictionary after the fact, will they be added at the end and just given the next serial number in sequence? What if that word is spelled the same as another word but sounds different, or sounds the same but has a different meaning? What then?
I guess it depends on the purpose of the serialization as to what would be the most suitable output for your serial number and hence what would be the most efficient algorithm.
The most efficient algorithm would probably be to split your dictionary into the same number of chunks as you have processors and have a thread on each processor serialize the words in its chunk, recombining the output from each thread at the end. This (in theory) would run slightly slower than O(n / number of processors) in real-world performance; however, I think for mathematical correctness that's still O(n), because you still have to parse the whole dictionary once to serialize each word.
I think the safest way to go is:
Worry about what you've got now
Order them in the most logical sequence (alphabetically?)
Number them in sequence
Add new words (whether spelled the same or not and having different semantics) at the end; give them the next number in the sequence, regardless of their rightful place in the dictionary alphabetically.
This way you don't have to worry about leaving spaces in the serial numbers to account for insertions between words, you don't have to worry about reindexing any dependent data to account for changes in indexes when words are inserted, you just carry on as normal. You don't have to worry about collisions, and you still get the most efficient indexing mechanism for storage purposes meaning you're not storing MD5 hashes that are potentially longer than the original word - which makes no sense for real world use.
If you need to access the dictionary alphabetically, just sort by the word, otherwise, don't.
I still think I'm at a loss as to the necessity of serializing the word - except for storage purposes where you can store your dictionary and link tables by the word's key.
I wonder if an answer is even possible.
Are color and colour the same word? Do they get one serial number or two?
Are polish and Polish the same word?
Are watch (noun) and watch (verb) the same word?
Are multiply (verb) and multiply (adverb) the same word?
Analysis (singular noun) and analyses (plural noun) are not the same word. Are analyse (plural verb) and analyze (plural verb) the same word? Are analyses (singular verb) and analyzes (singular verb) the same word? Are analyses (singular verb) and analyses (plural noun) the same word?
Are wont and won't the same word?
Are Beijing and Peking the same word? Or maybe they aren't English, since Londres and Frankreich aren't English, but then what is the English word for the capital of the Middle Country?
How about the MD5 hash algorithm? Do something like this:
serialNumber = MD5( ToLower ( english word ) )
