Code that maps numbers from one set to another where each pair of numbers has a distance greater than 1 - number-theory

I need to tag a load of books with a unique ID. Because human error would really mess with the system, I need the code to detect if one of the numbers is wrong. That means that no two elements of the code can have a Hamming distance of 1. Alternatively, a parity-check method or something similar, such that some errors can be detected, would work. I would normally post what I've done so far, but I don't know where to start, really.
Thanks

If you are dealing with human-readable data, I would go with something like the Luhn algorithm. It is designed for simple manual computation, and it results in decimal-encoded data.
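For instance, here is a minimal Python sketch of Luhn check-digit generation and validation (book_id is just a placeholder payload):

def luhn_check_digit(payload: str) -> str:
    # Walk right to left; double every second digit starting with the rightmost,
    # since those positions become "every second digit" once the check digit is appended.
    total = 0
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)

def luhn_valid(number: str) -> bool:
    return number[-1] == luhn_check_digit(number[:-1])

book_id = "7992739871"
print(book_id + luhn_check_digit(book_id))  # 79927398713
print(luhn_valid("79927398713"))            # True; any single-digit typo makes this False

The scheme detects every single-digit error and most adjacent transpositions, which covers the bulk of human copying mistakes.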
The problem you will have with binary codes is that they will scramble the data a little. So unless you plan to encode your IDs in an image, such as a barcode or QR code, they're probably not the right choice. Also, optimal decoders are complicated algorithms; they are certainly not practical to check by hand.
If you insist on going with a binary code, then you'll have to decide how many bit errors you want to detect. You'll need a Hamming distance of at least 2 to detect an error; otherwise a single bit flip can transform one code into another equally valid code, and the error will go unnoticed. If you want to correct N errors, then you'll need to choose a code with a distance of 2N+1.
If you are planning to encode hexadecimal digits, then you'll need 4 bits of storage per digit, which will require a code with 9 bits of redundancy per message in order to correct a single digit error. I'm not even sure such a perfect code exists, and in reality you might find you need more redundancy to protect all bits equally.
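To make the distance idea concrete, here is a minimal Python sketch of the classic Hamming(7,4) code, which protects one 4-bit nibble with 3 parity bits (distance 3, so it corrects any single bit flip):

def hamming74_encode(nibble: int) -> int:
    # Encode 4 data bits as a 7-bit codeword with minimum distance 3.
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    bits = [p1, p2, d[0], p3, d[1], d[2], d[3]]  # codeword positions 1..7
    return sum(b << i for i, b in enumerate(bits))

def hamming74_syndrome(word: int) -> int:
    # Return 0 for a valid codeword, else the 1-based position of a single flipped bit.
    b = [(word >> i) & 1 for i in range(7)]
    s1 = b[0] ^ b[2] ^ b[4] ^ b[6]
    s2 = b[1] ^ b[2] ^ b[5] ^ b[6]
    s3 = b[3] ^ b[4] ^ b[5] ^ b[6]
    return s1 | (s2 << 1) | (s3 << 2)

code = hamming74_encode(0b1011)
print(hamming74_syndrome(code))             # 0: consistent, no error detected
print(hamming74_syndrome(code ^ (1 << 4)))  # 5: the bit at position 5 was flipped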

Related

How to get truly random data, not random data fed into a PRNG seed like CSRNGs do?

From what I understand, a CSRNG like RNGCryptoServiceProvider still passes the truly random user data (mouse movements, etc.) through a PRNG to sort of sanitize the output and give it a uniform distribution. The bits need to be completely independent.
(This is for a theoretical attacker with infinite computing power.)
If the CSRNG takes 1KB of true random data and expands it to 1MB, all the attacker has to do is generate every combination of 1KB of data, expand it, and see which 1MB of data generates a one-time pad that returns sensible English output. I read somewhere that if the one-time pad has a PRNG anywhere in the chain, it is just a glorified stream cipher.

I was wondering whether the truly random starting data comes in large enough quantities to just use directly instead of cryptographically expanding it. I need truly random output for a one-time pad, not just a cryptographically secure RNG. Or perhaps there are other ways to somehow get truly random data, so that all bits are independent of each other. I was thinking of XOR'ing with the mouse coordinates for a few seconds, then perhaps the last digits of Environment.TickCount, then maybe getting microphone input as well.

However, as some point out on Stack Overflow, I should really just let the OS handle it all. Unfortunately that isn't possible, since a PRNG is used there too. I would like to avoid a hardware solution, since this is meant to be an easy-to-use program, and I'd also rather not use RDRAND, since it ALSO uses a PRNG (unless RDRAND can return the truly random data before it goes through a PRNG?).

Would appreciate any responses on whether such a thing is even possible; I've been working on this for weeks under the impression that RNGCryptoServiceProvider was sufficient for a one-time pad. Thanks.
(Side note: some say that for most crypto functions you don't need true entropy, just unpredictability. For a one-time pad, it MUST be truly random, otherwise it is not a one-time pad.)
As you know, "truly random" means each of the bits is independent of everything else as well as uniformly distributed. However, this ideal is hard, if not impossible, to achieve in practice. In general, the closest way to get "truly random data" in practice is to gather hard-to-guess bits from nondeterministic sources, then condense those bits into a random block of data.
There are many issues involved with getting this close to "truly random data", including the following:
The sources must be nondeterministic, that is, their output cannot be determined by their inputs. Examples of nondeterministic sources include timings of input devices; thermal noise; and the noise registered by microphone and camera outputs.
The sources' output must be hard to guess. This is more formally known as entropy, such as 32 bits of entropy per 64 bits of output. However, measuring entropy is far from trivial. If you need 1 MB (8 million bits) of truly random data, you need to have data with at least 8 million bits of entropy (which in practice will be many times more than 1 MB long depending on the sources), then condense the data somehow into 1 MB of data while preserving that entropy.
The sources must be independent of each other.
There should be two or more independent sources. This is because it's impossible to extract full randomness from just one source (see McInnes and Pinkas 1990). On the other hand, extracting randomness from three or more independent sources is relatively trivial, but there is still a matter of choosing an appropriate randomness extractor, and a survey of randomness extractors would be beyond the scope of this answer.
In general, for random number generation purposes, the more sources available, the better.
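To make the condensing step concrete, here is a hedged Python sketch that hashes several sources into one 32-byte block. Note that SHA-256 here is only a stand-in for a proper randomness extractor, which comes with stronger guarantees:

import hashlib, os, time

def condense(*sources: bytes) -> bytes:
    # Hash several independent entropy sources into one 32-byte block.
    h = hashlib.sha256()
    for s in sources:
        h.update(len(s).to_bytes(4, "big"))  # length-prefix so sources cannot blur together
        h.update(s)
    return h.digest()

block = condense(os.urandom(64),                             # OS entropy pool
                 time.perf_counter_ns().to_bytes(8, "big"))  # timing jitter

Remember the accounting rule from above: a 32-byte output is only as good as the entropy fed in, so the inputs must jointly carry at least 256 bits of entropy.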
REFERENCES:
McInnes, J. L., & Pinkas, B. (1990, August). On the impossibility of private key cryptography with weakly random keys. In Conference on the Theory and Application of Cryptography (pp. 421-435).

Why does a finite sum take so long to calculate?

I'm trying to compute the following sum:
It is calculated instantly. So I raise the number of points to 24^3 and it still works fast:
But when the number of points is 25^3, it's almost impossible to wait for the result! Moreover, there is a warning:
Why is it so time-consuming to calculate a finite sum? How can I get a precise answer?
Try
max=24;
Timing[N[
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,2,max},{j,1,max},{k,1,max}]+
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,1,1},{j,2,max},{k,1,max}]+
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,1,1},{j,1,1},{k,2,max}]]]
which quickly returns
{0.143978,14330.9}
and
max=25;
Timing[N[
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,2,max},{j,1,max},{k,1,max}]+
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,1,1},{j,2,max},{k,1,max}]+
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,1,1},{j,1,1},{k,2,max}]]]
which quickly returns
{0.156976,14636.6}
and even
max=50;
Timing[N[
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,2,max},{j,1,max},{k,1,max}]+
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,1,1},{j,2,max},{k,1,max}]+
Sum[1/(E^((i^2+j^2+k^2-3)/500)-1),{i,1,1},{j,1,1},{k,2,max}]]]
which quickly returns
{1.36679,16932.5}
Changing your code in this way avoids hundreds or thousands of If tests that almost always evaluate to True. And it potentially lets Mathematica use symbolic algorithms to find those results instead of adding up each individual value.
Compare those results and times if you replace Sum with NSum and if you replace /500 with *.002
As for why the times suddenly change as you increment the bound: others have noticed in the past that there appear to be some hard-coded thresholds inside some of the numerical algorithms. When a range is small enough, Mathematica will use one algorithm, but when the range is just large enough to exceed that threshold, it will switch to another, potentially slower, algorithm. It is difficult or impossible to know exactly why you see this change without being able to inspect the decisions being made inside the algorithms, and nobody outside Wolfram gets to see that information.
To get a more precise numerical value you can change N[...] to N[...,64] or N[...,256] or eliminate the N entirely and get a large complicated exact numeric result.
Be cautious with this: check the results carefully to make certain that I have not made any mistakes. Some of this is just guesswork on my part.
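If you want an independent cross-check outside Mathematica, here is a machine-precision sketch in Python with numpy, with the singular (1,1,1) term knocked out just like in the split Sum above:

import numpy as np

def partial_sum(max_n: int) -> float:
    r = np.arange(1, max_n + 1)
    i, j, k = np.meshgrid(r, r, r, indexing="ij")
    s = (i**2 + j**2 + k**2 - 3) / 500.0
    s[0, 0, 0] = np.inf  # 1/(e^inf - 1) = 0, removing the singular (1,1,1) term
    return float(np.sum(1.0 / (np.exp(s) - 1.0)))

print(partial_sum(24))  # about 14330.9, matching the Mathematica result above
print(partial_sum(50))  # about 16932.5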

Need an algorithm to generate serial numbers

I want to generate 16-digit hexadecimal serial numbers like: F204-8BE2-17A2-CFF3.
(This pattern gives me 16^16 distinct serial numbers, but I don't need all of them.)
I need an algorithm to generate these serial numbers randomly with one special characteristic:
each two serial numbers differ in at least 6 digits
(= it means that even the two most similar serial numbers should still differ in 6 positions)
I know that a good algorithm with this characteristic needs to remember previously generated serial numbers, and I don't want that.
In fact, I need an algorithm that does this with the least probability for a chosen pair to collide (less than 0.001 seems sufficient).
PS:
I've just tried creating 10K strings randomly using an MD5 hash, and it gave similar strings (similar = more than 3 identical digits) with probability 0.00018.
It is possible to construct a correct generator without having to remember all previously generated codes. You can generate serial numbers that are spaced at least 6 digits apart by using an error-correcting code in the Hamming-distance sense: such a code can be designed to put any desired minimum distance between two distinct generated values. Obviously, the greater the distance, the more redundancy you will have to use, resulting in more complex code and longer numbers.
First you design a code to your liking that encodes a number into a sequence of hexadecimal digits, and then you can take any sequence of numbers, such as the primes, and use it as a seed. You just always need to remember which number was used last, and use the next one.
That being said, if you don't need to strictly guarantee a minimal distance between two serials and would settle for a small error probability, any half-decent hash function or cipher should produce decently spaced-out outputs. So the first thing I would try is to take MD5 or SHA hashes and test-drive them on the numbers 1-1000. My hope is that the results will be quite satisfactory.
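As one way to run that test-drive, here is a Python sketch that derives 16-hex-digit serials from the MD5 of sequential integers and reports the closest pair; everything here, including MD5 itself, is just illustrative:

import hashlib
from itertools import combinations

def serial(n: int) -> str:
    # First 16 hex digits of MD5(n): a stand-in for a real generator.
    return hashlib.md5(str(n).encode()).hexdigest()[:16].upper()

serials = [serial(n) for n in range(1000)]
closest = min(sum(a != b for a, b in zip(s, t)) for s, t in combinations(serials, 2))
print("closest pair differs in", closest, "of 16 digits")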
I suggest you look into the ANSI X9.17 pseudorandom bit generator. An algorithmic sketch is given in these slides. ANSI X9.17 generates 64-bit pseudorandom strings, which is what you want.
A revised and enhanced version of this generator was approved by NIST. Please have a look at this page.
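For reference, here is a minimal Python sketch of the original X9.17 construction. It assumes pycryptodome for the triple-DES primitive, and the key and seed handling is deliberately simplified, so treat it as an illustration rather than a production generator:

import os, struct, time
from Crypto.Cipher import DES3  # pycryptodome, an external dependency

class AnsiX917:
    def __init__(self, key: bytes, seed: bytes):
        self.cipher = DES3.new(key, DES3.MODE_ECB)  # 24-byte 3DES key
        self.v = seed                               # 8-byte secret seed

    def next_block(self) -> bytes:
        # I = E_K(timestamp); R = E_K(I xor V); V = E_K(R xor I)
        t = struct.pack(">d", time.time())
        i = self.cipher.encrypt(t)
        r = self.cipher.encrypt(bytes(a ^ b for a, b in zip(i, self.v)))
        self.v = self.cipher.encrypt(bytes(a ^ b for a, b in zip(r, i)))
        return r  # 64 pseudorandom bits

gen = AnsiX917(key=os.urandom(24), seed=os.urandom(8))
print(gen.next_block().hex())  # 16 hex digits = one 64-bit serial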
Now whether you use the ANSI X9.17 generator, another generator, or develop your own, it's a good idea to have the generator pass some statistical tests in order to ensure the quality of its pseudorandom bits.
Example tests include the ENT battery, the DIEHARD battery, and the NIST battery.

How much can you truncate a SHA1 hash and be reasonably sure of having a unique ID?

I am making an application that stores documents and gives each one a UID based on a SHA1 digest of a few things, including the timestamp. The digest has a lot of characters, and I want to allow users to identify the documents by using the first x characters of the full digest. What's a good value for x if the number of documents is maybe around 10K-100K?
Adapting the formulas on Wikipedia for the birthday problem, you can approximate the probability of collision as 1 - e^(-n^2/2^(b+1)), where n is the document count and b is the number of bits. Graphing this formula with n = 100,000, it looks like you'll want b > 45 at least. I'd be more inclined to go with 64 to make it a nice round number. That said, do have a plan to deal with collisions if they occur (maybe alter the timestamp slightly, or add a nonce?).
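To check a few candidate sizes numerically, here is a quick Python sketch of that approximation (the bit counts are just sample values):

import math

def p_collision(n: int, bits: int) -> float:
    # Birthday approximation: P = 1 - e^(-n^2 / 2^(bits+1)); expm1 keeps tiny values accurate.
    return -math.expm1(-n * n / 2 ** (bits + 1))

for bits in (40, 45, 48, 64):
    print(bits, "bits ->", f"{p_collision(100_000, bits):.1e}")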
For that matter, if the sha1 is based on more than just the content of the document, why not simply make it a random ID? In this case collisions are less of a problem, as you can always generate a new random number and try again (the probability of a collision with a single try is the same, however).
Be careful with truncation, as there is no reduction proof that the smaller hash is secure. See Kelsey's http://csrc.nist.gov/groups/ST/hash/documents/Kelsey_Truncation.pdf. Kelsey gives two heuristic arguments stating the same ("Related Hash Outputs" and "Near Collisions"). Biham/Chen offer examples of near collisions, and Knudsen demonstrates truncated differentials.
In the end, you probably want to feed your data into an HMAC with the truncated size (the size is digested by the HMAC, too) and then use the truncated HMAC.
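One way to realize that suggestion in Python; folding the target length into the MACed data is just one possible convention, and the key and inputs are placeholders:

import hashlib, hmac

def truncated_id(key: bytes, data: bytes, n_hex: int = 16) -> str:
    # Bind the output length into what gets MACed, then truncate the result.
    mac = hmac.new(key, n_hex.to_bytes(2, "big") + data, hashlib.sha1)
    return mac.hexdigest()[:n_hex]

print(truncated_id(b"app-secret", b"document bytes + timestamp"))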
There really isn't a value for this; part of what makes SHA a good general-purpose hashing algorithm is that similar data does not necessarily produce similar hashed values. Your best bet (without knowing anything else about your system) would just be to search the list of documents whose hashes start with the value supplied by the user, then either present them with a list of documents to select from or go directly to the document if there's only one.
It's a generalization of the birthday problem. In your case n is the number of documents, and instead of the constant 365 you'd have the number of possibilities the cutoff gives you (so for k bits it's 2^k).
Of course an exact calculation is out of the question, but you can use an approximation.
Well, here's a possibly too-simplistic answer.
If with the full SHA1 you get about a 1 in 2^160 chance of collision, then by truncating one character you increase the chance of collision by a factor of 16 (all possible values of the truncated character), which is 2^4. So if you truncate x characters you get a 1 in 2^(160 - 4x) chance of collision, right?

YouTube URL algorithm?

How would you go about generating the unique video URLs that YouTube uses?
Example:
http://www.youtube.com/watch?v=CvUN8qg9lsk
YouTube uses Base64 encoding to generate IDs for each video. The characters involved in generating the IDs are
(A-Z) + (a-z) + (0-9) + (-) + (_) (64 characters).
Using Base64 encoding with only up to 11 characters, they can generate 73+ quintillion unique IDs. How large a pool of IDs is that?
Well, it's enough for everyone on earth to produce a video every single minute for 18,000 years.
And they have achieved that huge number by using only 11 characters (64^11); if they need more IDs, they will just add one more character.
So when a video is uploaded to YouTube, they basically pick at random from those 73+ quintillion possibilities and check whether the ID is already taken; if not, they use it, otherwise they look for another one.
Refer to this video for a detailed explanation.
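A minimal Python sketch of that pick-and-check loop; the alphabet matches the 64 characters listed above, and the taken argument stands in for whatever store of used IDs you keep:

import secrets

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"

def new_video_id(taken: set, length: int = 11) -> str:
    while True:
        vid = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if vid not in taken:  # a collision is astronomically rare; just retry
            taken.add(vid)
            return vid

used = set()
print(new_video_id(used))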
Using some non-trivial hashing function. The probability of collision is very low, depending on the function, the parameters and the input domain. Keep in mind that cryptographic hashes were specifically designed to have very low collision rates for non-random input (i.e. completely different hashes for two close-but-unequal inputs).
This post by Jeff Atwood is a nice overview of the topic.
And here is an online hash calculator you can play with.
There is no need to use a hash. It is probably just a quasi-random 64-bit value passed through base64 or some equivalent.
By quasi-random, I mean it is just a one-to-one mapping with the counting integers, just shuffled.
For example, you could take a monotonically increasing database id and multiply it by some prime near 2^64, then base64 the result. If you did not want people to be able to guess, you might choose a more complex mapping or just pick a random number that is not in the database yet.
Normal base64 would add an equals at the end, but in this case it is implied because the size is known. The character mapping could easily be something besides the standard.
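A Python sketch of such a mapping; the multiplier is an arbitrary odd 64-bit constant, not anything YouTube actually uses, and any odd multiplier works because it is invertible mod 2^64:

import base64, struct

MASK = (1 << 64) - 1
MULT = 0x9E3779B97F4A7C15  # arbitrary odd constant; odd means one-to-one mod 2^64

def obfuscate(db_id: int) -> str:
    n = (db_id * MULT) & MASK
    raw = struct.pack(">Q", n)
    # 8 bytes -> 11 base64 characters once the trailing '=' padding is dropped
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

print(obfuscate(1), obfuscate(2))  # consecutive IDs map to unrelated-looking strings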
Eli's link to Jeff's article is, in my opinion, irrelevant. URL shortening is not the same thing as presenting an ID to the world. Instead, a nicer way would be to convert your existing integer ID to a different radix.
An example in PHP:
$id = 9999;
//$url_id = base_convert($id, 10, 26+26+10); // PHP doesn't like this
$url_id = base_convert($id, 10, 26+10); // Works, but only digits + lowercase
Sadly, PHP only supports up to base 36 (digits + alphabet). Base 62 would support alphabet in both upper-case and lower-case.
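Since base_convert stops at base 36, a base-62 encoder is easy to hand-roll; here is a sketch in Python (the same loop translates directly to PHP):

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base62(n: int) -> str:
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, r = divmod(n, 62)  # peel off the least significant base-62 digit
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

print(to_base62(9999))  # '2Bh'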
People are talking about these other systems:
Random numbers/letters - Why? If you want people not to see the next video (id+1), then just make it private. On a website like YouTube, where it actively shows any video it has, why bother with random IDs?
Hashing an ID - This design concept really stinks. Think about it: you have an ID guaranteed by your DBMS to be unique, and you hash it (introducing a collision factor)? Give me one reason to even consider this idea.
Using the ID in the URL - To be honest, I don't see any problem with this either, though it will grow large when in fact you can express the same number with fewer letters (hence my solution).
Using Base64 - Base64 expects bytes of data, literally anything from nulls to spaces. Why use it when your data consists of a number (i.e., a mix of 10 different characters, instead of 256)?
You can use any library, and some languages, like Python, provide it in the standard library.
Example:
import secrets
n_bytes = 12  # token_urlsafe takes a byte count; 12 random bytes encode to a 16-character string
random_video_id = secrets.token_urlsafe(n_bytes)
You could generate a GUID and use that as the ID for the video.
GUIDs are very unlikely to collide.
Your best bet is probably to simply generate random strings, and keep track (in a DB for example) of which strings you've already used so you don't duplicate. This is very easy to implement and it cannot fail if properly implemented (no duplicates, etc).
I don't think that the URL's v parameter has anything to do with the content (video properties, title, description, etc.).
It's a randomly generated string of fixed length drawn from a very specific set of characters. No duplicates are allowed.
I suggest using a perfect hash function:
Perfect Hash Function for Human Readable Order Codes
As the accepted answer indicates, take a number, then apply a sequence of "bijective" (or reversible) operations on the number to get a hashed number.
The input numbers should be in sequence: 0, 1, 2, 3, and so on.
Typically you're hiding a numeric identifier in the form of something that doesn't look numeric. One simple method is something like base-36 encoding the number. You should be able to pull that off with one or another variant of itoa() in the language of your choice.
Just pick random values until you get one never seen before.
Randomly picking and exhausting all values from a set runs in expected time O(n log n): What is O value for naive random selection from finite set?
In your case you wouldn't exhaust the set, so you should get constant-time picks. Just use a fast data structure for the duplicate lookups.
