I have a list of 1000 session ids. the session-id lengths are of 32 characters each. What is the most efficient algorithm which I can use to determine the randomness or variation at each character level? I am new to python, can somebody help me develop a python code snippet for the same?
Just for reference, Sequencer tool in Burpsuite gives a randomness graph for each 10 character positions if the token length is 10 characters. (algorithm is unknown to me)
I don't know how Burp does it but one way to determine the variation at each character level would be to do character frequency analysis for each position in the session ids.
The premise is that you'd expect that all character are equally likely to appear at a position across all session ids (the distribution of characters is uniform). Let's say you have collected/generated 100 session ids which are numeric (so possible characters at each position would be 0-9) you'd expect that each digit would appear 100/10=10 times at each position.
Now for each position in the sequences build a histogram with how many time a character actually appears at that position across all session ids.
To figure out how likely is your observed character distribution at each position given that you'd expect them to be uniformly distributed you can use a statistical test like the Chi Squared test.
I've written a simple Python character count tester using the Chi Squared test here: https://github.com/decbis/salr. I'll add more tests in the future.
Related
Say I need to generate a random username for an user. So I first set it as a random number generated using a function. In the next case I set the username to random number + timestamp. Which case has more probability of a collision ? Is it even decidable ?
Edit
Please note that I am not generating usernames to be used by people. Its just a placeholder.
And most importantly I'm not appending the timestamp. I'm rather performing an addition.
To further elaborate, take the random number to be 105 and the timestamp to be 106. So the username will be 211. Later the timestamp will be a bigger number but there is still a probability that during the timestamp 201 the generated random number might be 10, making it 211 again.
Obviously, the more characters you have, the less likely you are to have a collision. Chance of collision will be 1/(#OfcharsInUsername ^ #ofpossiblecharacters supported). For example, a 5 character password using only the lowercase alphabet will have a 1/(5^26) chance of a collision. Since you are appending a timestamp to the username it will be hard to have a collision occur. I see 3 potential issues however. The first is that you need to make sure you use a time fed random function, or you will have the same username multiple times. The second is that if you use a time fed random number generator than if you generate two usernames at the same time they will be exactly the same (even if you append the time to the end). The last one is that if you append the timestamp to the end of the username its going to be quite annoying for people entering the username, as timestamps have a bunch of colons and other garbage associated with them. I would prompt you to look for a different solution, producing a mix of letters and numbers. A different solution might be to grab a set amount of random numbers and use their ascii values to concatenate a username. Going from 21 to 126 would give you all of the lower and upper case letters, along with the numbers and some symbols. In addition this would make it very difficult to have collision because it wouldn't execute perfectly in time with another password. Likelyhood of collision would be 1/(n^105) where n is the number of characters.
I want to generate 16-digits hexadecimal serial-number like: F204-8BE2-17A2-CFF3.
(This pattern give me 16^16 distinct serial-number But I don't need all of them)
I need you all to suggest me an algorithm to generate these serial-numbers randomly with an special characteristic which is:
each two serial-numbers have (at-least) 6 different digits
(= It means if you are given two most similar serial-number, they should still have difference in 6 indexes)
I know that a good algorithm with this characteristic needs to remember previously generated serial-numbers and I don't want that much.
In fact, I need an algorithm which do this with least probability for a chosen pair to collide (less than 0.001 seems sufficient )
PS:
I've just tried to create 10K string randomly using MD5 hash and It gave similar string( similar=more than 3 same digits) with 0.00018 probability.
It is possible to construct a correct generator without having to remember all previously generated codes. You can generate serial numbers that are spaced 6 characters apart by using Hamming code. A hamming code can be designed to arbitrarily space out two distinct generated values. Obviously, the greater the distance, the higher redundancy you will have to use, resulting in more complex code and longer numbers.
First you design a hamming code to your liking, that encodes a number into a sequence of hexadecimal digits and then you can take any sequence of numbers and use it as a seed, such as prime numbers. You just always need to remember, what number was used last and use the next one.
That being said, if you don't need to properly ensure minimal distance of two serials, and would settle for a small error, I would suggest that any half decent hash function or cypher should produce decently spaced out outputs. Therefore the first thing I would try to do is to take MD5 or SHA hashes and test-drive them on numbers 1 - 1000. My hopes are, the results will be quite satisfactory.
I suggest you look into the ANSI X9.17 pseudorandom bit generator. An algorithmic sketch is given in these slides. ANSI X9.17 generates 64-bit pseudorandom strings which is what you want.
A revised and enhanced version of this generator was approved by NIST. Please have a look at this page.
Now whether you use ANSI X9.17 generator, another generator, or develop your own, it's a good idea to have the generator pass some statistical tests in order to ensure the quality of its pseudorandom bits.
Example tests include the ENT battery, the DIEHARD battery, and the NIST battery.
Let's say I have a document & the document is spread across 4 different machines, I would like to get a character which has the highest repeated count (all 4 machines combined).
One approach I have is to use a hashmap in each machine and calculate the frequency on each machine individually and then pass that hashmap to the main server where hashmaps from all the 4 machines will be merged.
Thus we'll get the character with the highest frequency.
But the cache here is that I want to minimize the data transferred from each machine.
What improvements can be made ?
[EDIT]
Each machine holds a part of the document
If you don't mind it taking longer...
Each computer passes the most frequent character(s). Hopefully, the number of characters with the highest frequency is low. Ideally, it would be almost always only one.
Main server combines them into a set. If the set has a single character done. Otherwise this set is passed along to the computers, likely as an array or list. Assuming only one character from each computer, this list would have only 2-4 characters.
Each computer returns the frequencies of each character in the set.
Main server sums the frequencies, obtaining the most frequent.
I assert that without prior knowledge of the distribution of characters in the document then any approach you take will have to reduce the data from all 4 computers onto one of them. To minimise the data transferred it is necessary to minimise the size of the data structure which holds the character counts on each computer.
Supposing that you are working with an alphabet with N characters your problem is now the design of a data structure which can hold N integers (in some range [0..m], m being the number of characters in the alphabet) and there is any number of such data structures to be found.
Of course, if you have prior knowledge of the distribution of characters, for example if you know that it is pure text written in English, you have a range of possible approaches to data compression.
Given the relatively small values for N and m likely to be found in practice I agree with the general thrust of the commentary, that it is probably not worth devising a complicated structure to minimise the amount of data transferred, sending an array of N integers would be adequate in most conceivable circumstances.
I am making an application that stores documents and gives each one a UID based on a SHA1 digest of a few things including the timestamp. The digest has a lot of characters, and I want to allow users to identify the documents by using the first x characters of the full digest. What's a good value for x if the number of documents is maybe around 10K - 100K?
Adapting the formulas on on wikipedia for the Birthday problem, you can approximate the probability of collision as 1 - e^(-n^2/(2^(b+1))), where n is the document count and b is the number of bits. Graphing this formula with n=100,000, it looks like you'll want b > 45 at least. I'd be more inclined to go with 64 to make it a nice and round number. That said, do have a plan to deal with collisions if they occur (maybe alter the timestamp slightly, or add a nonce?)
For that matter, if the sha1 is based on more than just the content of the document, why not simply make it a random ID? In this case collisions are less of a problem, as you can always generate a new random number and try again (the probability of a collision with a single try is the same, however).
Be careful of truncation as there is no reduction in proof that the smaller hash is secure. See Kelsey's http://csrc.nist.gov/groups/ST/hash/documents/Kelsey_Truncation.pdf. Kelsey gives to heuristic arguments stating the same ("Related Hash Outputs" and "Near Collisions"). Biham/Chen offer examples of Near Collisions; and Knudsen demonstrates Truncated Differentials.
In the end, you probably want to feed your data into an HMAC with the truncated size (the size is digested by the HMAC, too) and then use the truncated HMAC.
There really isn't a value for this; part of what makes SHA a good general-purpose hashing algorithm is that similar data does not necessarily produce similar hashed values. Your best bet (without knowing anything else about your system) would just be to search the list of documents whose hashes start with the value supplied by the user, then either present them with a list of documents to select from or go directly to the document if there's only one.
It's a generalization of the birthday problem. In you case n is number of documents, and instead of constant 365 you'd have number of possibilities the cutoff gives you (so for k bits it's 2k).
Of course exact calculation is out of the question, but you might use approximation.
Well, here's a possibly too simplistic of an answer..
If with full sha1 you get about 1 in 2^160 chance of collision, then by truncating one character you increase the chances of collision by 16 (all possible values of the truncated character)... which is 2^4.. So, if you truncate x characters you get 1 in 2^(160 - 4*x) chances of collision.. right?
How would you go about generating the unique video URL's that YouTube uses?
Example:
http://www.youtube.com/watch?v=CvUN8qg9lsk
YouTube uses Base64 encoding to generate IDs for each video.Characters involved in generating Ids consists of
(A-Z) + (a-z) + (0-9) + (-) + (_). (64 Characters).
Using Base64 encoding and only up to 11 characters they can generate 73+ Quintilian unique IDs.How much large pool of ID is that?
Well, it's enough for everyone on earth to produce video every single minute for 18000 years.
And they have achieved such huge number by only using 11 characters (64*64*64*64*64*64*64*64*64*64*64) if they need more IDs they will just have to add 1 more character to their IDs.
So when video is uploaded on YouTube they basically randomly select from 73+ Quintilian possibility and see if its already taken or not.if not use it otherwise look for another one.
Refer to this video for detailed explanation.
Using some non-trivial hashing function. The probability of collision is very low, depending on the function, the parameters and the input domain. Keep in mind that cryptographic hashes were specifically designed to have very low collision rates for non-random input (i.e. completely different hashes for two close-but-unequal inputs).
This post by Jeff Attwood is a nice overview of the topic.
And here is an online hash calculator you can play with.
There is no need to use a hash. It is probably just a quasi-random 64 bit value passed through base64 or some equivalent.
By quasi-random, I mean it is just a one-to-one mapping with the counting integers, just shuffled.
For example, you could take a monotonically increasing database id and multiply it by some prime near 2^64, then base64 the result. If you did not want people to be able to guess, you might choose a more complex mapping or just pick a random number that is not in the database yet.
Normal base64 would add an equals at the end, but in this case it is implied because the size is known. The character mapping could easily be something besides the standard.
Eli's link to Jeff's article is, in my opinion, irrelevant. URL shortening is not the same thing as presenting an ID to the world. Instead, a nicer way would be to convert your existing integer ID to a different radix.
An example in PHP:
$id = 9999;
//$url_id = base_convert($id, 10, 26+26+10); // PHP doesn't like this
$url_id = base_convert($id, 10, 26+10); // Works, but only digits + lowercase
Sadly, PHP only supports up to base 36 (digits + alphabet). Base 62 would support alphabet in both upper-case and lower-case.
People are talking about these other systems:
Random number/letters - Why? If you want people to not see the next video (id+1), then just make it private. On a website like youtube, where it actively shows any video it has, why bother with random ids?
Hashing an ID - This design concept really stinks. Think about it; so you have an ID guaranteed by your DBM software to be unique, and you hash it (introducing a collision factor)? Give me one reason why to even consider this idea.
Using the ID in URL - To be honest, I don't see any problems with this either, though it will grow to be large when in fact you can express the same number with fewer letters (hence my solution).
Using Base64 - Base64 expects bytes of data, literally anything from nulls to spaces. Why use this function when your data consists of a number (ie, a mix of 10 different characters, instead of 256)?
You can use any library or some languages like python provides it in standard library.
Example:
import secrets
id_length = 12
random_video_id = secrets.token_urlsafe(id_length)
You could generate a GUID and have that as the ID for the video.
Guids are very unlikely to collide.
Your best bet is probably to simply generate random strings, and keep track (in a DB for example) of which strings you've already used so you don't duplicate. This is very easy to implement and it cannot fail if properly implemented (no duplicates, etc).
I don't think that the URL v parameter has anything to do with the content (video properties, title, description etc).
It's a randomly generated string of fixed length and contains a very specific set of characters. No duplicates are allowed.
I suggest using a perfect hash function:
Perfect Hash Function for Human Readable Order Codes
As the accepted answer indicates, take a number, then apply a sequence of "bijective" (or reversible) operations on the number to get a hashed number.
The input numbers should be in sequence: 0, 1, 2, 3, and so on.
Typically you're hiding a numeric identifier in the form of something that doesn't look numeric. One simple method is something like base-36 encoding the number. You should be able to pull that off with one or another variant of itoa() in the language of your choice.
Just pick random values until you have one never seen before.
Randomly picking and exhausting all values form a set runs in expected time O(nlogn): What is O value for naive random selection from finite set?
In your case you wouldn't exhaust the set, so you should get constant time picks. Just use a fast data structure to do the duplication lookups.