Find numbers that differ by 1 digit from a set of 15,000 12-digit numbers - algorithm

I have a list of ~15,000 12-digit barcoded tickets. Most of the time they are scanned off paper or phone screens, but sometimes they are typed in (cracked screens, etc.) How would I go about finding if we have any sets of codes that differ by 1 digit, so typing the first one with a mis-typed digit might end up with another valid code?
The code numbers are 12-digit integers that are fairly random in the range 100000000000 to 999999999999 (we don't want leading zeroes to give problems with other systems)
e.g. given the three code numbers
123456789012
123456789013
223456789012
The first and second differ by only one digit, and so do the first and third. The second and third differ by 2 digits, so that pair is ignored.

Use a hash set. Go through each of your 15,000 numbers in turn, and for each one, generate the 108 different numbers that differ from it in one place (12 digits times 9 possible alternate digits in each place). Check if each of those 108 numbers exists in the hash set (without inserting them). If any one of them does then you have a hit. If not then add the unmodified number to the hash set and move onto the next one.
You could also check transpositions of adjacent digits, which would give you up to another 11 candidates to try on top of the 108.
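A minimal sketch of the hash-set approach in Python, assuming the codes are held as 12-character strings. The function names are made up for illustration. Unlike the description above, this version always adds the code to the set, so that every close pair gets reported rather than only the first hit.

```python
def one_digit_variants(code):
    """Yield every string that differs from `code` in exactly one position."""
    for i, ch in enumerate(code):
        for d in "0123456789":
            if d != ch:
                yield code[:i] + d + code[i + 1:]


def adjacent_swaps(code):
    """Yield every string obtained by swapping two different adjacent digits."""
    for i in range(len(code) - 1):
        if code[i] != code[i + 1]:
            yield code[:i] + code[i + 1] + code[i] + code[i + 2:]


def find_close_pairs(codes):
    """Return (earlier, later) pairs of codes that differ by a single digit."""
    seen = set()
    pairs = []
    for code in codes:
        for variant in one_digit_variants(code):
            if variant in seen:  # lookup only, variants are never inserted
                pairs.append((variant, code))
        seen.add(code)
    return pairs


tickets = ["123456789012", "123456789013", "223456789012"]
for earlier, later in find_close_pairs(tickets):
    print(earlier, later)
```

To cover transpositions as well, the inner loop could chain `adjacent_swaps(code)` onto the variants. For 15,000 codes that is still under two million set lookups.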

Related

Return the count of all prime numbers in range [a,b] such that all the digits are from set {1,5,9} . 1<=a<=b<=10⁹

Return the count of all prime numbers in range [a,b] such that all the digits are from set {1,5,9} . 1<=a<=b<=10⁹.
My approach -
I was trying to generate all the numbers whose digits come from the set {1,5,9}, which comes to 3^9 (19683) nine-digit numbers, or 29523 once the shorter ones are included, and then checking each one for primality.
Can I do this in a better time complexity?
Never generate a large set and after check all elements of the set, ruling out most. That requires a lot of memory to store things you'll be discarding. Instead, find a single number with "valid" digits, check for primeness, and only then store it in a set. Accessing large arrays of memory is very time-intense on modern computers compared to doing math.
"I produced all the numbers": I hope you're doing this smartly! You never have to check a number with a last digit being 5 for primeness (there's only a single prime that ends in 5; that's 5 itself!), for example. Also, you hopefully don't just build all combinations of digits "manually". Say, you find a number 19551, then 19559 is also a candidate, you never have to manually "combine" digits to try out the last digit.
Of course, your prime-checking algorithm needs to be matching your kind of problem: You can remove the initial check for divisibility by 2 (you never produce even numbers), for example. You never need to check for divisibility by 5, because you never use 5 or 0 as last digit. Depending on your prime checking algorithm, you also would want to save the factor that "killed" the xxxx1 – that's one factor you don't have to check xxxx9 against. Do your 3-factor-checking based on the count of 1,5 and 9 in your number; you can directly infer cross-sum and hence 3-divisibility from that.
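A rough sketch of these ideas in Python. The function names are invented for the example, and plain trial division is just one possible primality test, not necessarily the fastest. Candidates ending in 5 are skipped (apart from 5 itself), and divisibility by 3 is decided from the digit counts alone: 1, 5 and 9 leave remainders 1, 2 and 0 modulo 3, so the digit sum is divisible by 3 exactly when (count of 1s + 2 * count of 5s) is.

```python
from itertools import product


def is_prime(n):
    """Trial division by 2, 3 and then numbers of the form 6k +/- 1."""
    if n < 2:
        return False
    if n < 4:
        return True
    if n % 2 == 0 or n % 3 == 0:
        return False
    f = 5
    while f * f <= n:
        if n % f == 0 or n % (f + 2) == 0:
            return False
        f += 6
    return True


def count_159_primes(a, b):
    """Count primes in [a, b] whose digits all come from {1, 5, 9}."""
    count = 0
    for length in range(1, len(str(b)) + 1):
        for digits in product("159", repeat=length):
            if length > 1:
                if digits[-1] == "5":
                    continue  # multiple of 5
                if (digits.count("1") + 2 * digits.count("5")) % 3 == 0:
                    continue  # digit sum divisible by 3
            n = int("".join(digits))
            if a <= n <= b and is_prime(n):
                count += 1
    return count


print(count_159_primes(1, 100))
```

Each candidate is tested as soon as it is built, so nothing is stored beyond the running count.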

LC-3 How to store a number larger than 16-bit and print it out to console?

I'm having difficulty storing and displaying numbers greater than 32767 in LC-3 since a register can only hold values from -32768 to 32767. My apologies for not being able to come up with any idea for the algorithm. Please give me some suggestions. Thanks!
You'll need a representation to store the larger number in a pair or more of words.
There are several approaches to how big integers are stored: in a fixed number of words, and in a variable number of words or bytes.  The critical part is being able to detect the presence and amount of overflow/carry on mathematical operations like *10.
For that reason, one simple approach is to use a variable number of words/bytes (for a single number), and store only one decimal digit in each of the words/bytes.  That way multiplication by 10 means simply adding a digit on the end (which has the effect of moving each existing digit to the next higher power of ten position).  Adding numbers of this form is fairly easy as well: we line up the digits, add them up, and detect when a sum is >= 10, in which case there is a carry (of 1) to be added to the next higher order digit of the sum.  (If adding two such (variable length) numbers is desired, I would store the decimal digits in reverse order, because then the low order digits are already lined up for addition.)  See also https://en.wikipedia.org/wiki/Binary-coded_decimal .  (In some sense, this is like storing numbers in a form like a string, but using binary values instead of ASCII characters.)
To simplify this approach for your needs, you can fix the number of words to use, e.g. at 7, for 7 digits.
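To make the digit-per-word idea concrete, here is a small model of it in Python rather than LC-3 assembly. The function names are made up, and the digits are stored least significant first, as suggested above for addition. Each list element stands for one memory word holding a single decimal digit.

```python
def add_digits(x, y):
    """Add two numbers stored one decimal digit per element, low digit first."""
    result = []
    carry = 0
    for i in range(max(len(x), len(y))):
        total = carry
        if i < len(x):
            total += x[i]
        if i < len(y):
            total += y[i]
        if total >= 10:  # overflow of one digit: carry 1 upward
            total -= 10
            carry = 1
        else:
            carry = 0
        result.append(total)
    if carry:
        result.append(1)
    return result


def times_ten(x):
    """Multiply by 10: every digit moves up one position."""
    return [0] + x


def to_text(x):
    """Printing is just emitting the digits, highest first."""
    return "".join(str(d) for d in reversed(x))


big = add_digits([7, 6, 7, 2, 3], [1])  # 32767 + 1
print(to_text(big))             # 32768
print(to_text(times_ten(big)))  # 327680
```

The loop only needs addition, comparison against 10 and subtraction of 10, all of which are straightforward to express with LC-3's ADD and branch instructions.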
A variation on (unpacked) Binary-coded Decimal is to pack two decimal digits per byte.  It's a bit more complicated but saves some storage.
Another approach is to store as many decimal digits as will fit fully in a word, minus 1.  Which is to say, an unsigned 16-bit word holds values up to 65535, which is only 4 full decimal digits, so we put 3 digits at a time into a word.  You'd need 3 words for 9 digits.  Multiplication by 10 means multiplying each word by 10 numerically, and then checking for a result larger than 999; if it is larger, carry the thousands digit (the result divided by 1000) to the next higher order word and keep only the remainder (the result modulo 1000) in the overflowing word.
This approach will require actual multiplication and division by 10 on each of the individual words.
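A sketch of the three-digits-per-word variant, again modelled in Python with an invented function name and the lowest-order word first. After multiplying a word by 10 and adding the incoming carry, the largest possible value is 9999, which still fits in a signed 16-bit word.

```python
BASE = 1000  # three decimal digits per word


def times_ten_words(words):
    """Multiply a number stored as base-1000 words (low word first) by 10."""
    out = []
    carry = 0
    for word in words:
        value = word * 10 + carry           # at most 999 * 10 + 9 = 9999
        carry, low = divmod(value, BASE)    # carry is 0..9
        out.append(low)
    if carry:
        out.append(carry)
    return out


# 123,456,789 stored as [789, 456, 123]
print(times_ten_words([789, 456, 123]))  # [890, 567, 234, 1]
```

On the LC-3 itself the `divmod` step would have to be done with repeated subtraction of 1000, since there is no divide instruction.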
There are other approaches, such as using all 16-bits in a word as magnitude, but the difficulty there is determining the amount of overflow/carry on *10 operations.  It is not a monumental task but will require work.  See https://stackoverflow.com/a/1815371/471129, for example.
(If you also want to store negative numbers, that is also an issue for representation.  We can either store the sign separately, known as sign-magnitude form (stored in its own word/byte or packed into the highest byte), or store the number in a complement form.  The former is better for variable length implementations and the latter can be made to work for fixed length implementations.)

What is the probability that a UUID, stripped of all of its letters and dashes, is unique?

Say I have a UUID a9318171-2276-498c-a0d6-9d6d0dec0e84.
I then remove all the letters and dashes to get 9318171227649806960084.
What is the probability that this is unique, given a set of ID's that are generated in the same way? How does this compare to a normal set of UUID's?
UUIDs are represented as 32 hexadecimal (base-16) digits, displayed in 5 groups separated by hyphens. The issue with your question is that for any generated UUID we could get any valid hexadecimal number from the set of [ 0-9,A-F ] inclusive.
This leaves us with a dilemma since we don't know, beforehand, how many of the hexadecimal digits generated for each UUID would be an alpha character: [A-F]. The only thing that we can be certain of is that each randomly generated character of the UUID has a 6/16 chance of being an alpha character, since [A-F] covers six of the sixteen hexadecimal digits. Knowing this makes it impossible to answer this question accurately, since removing the hyphens and alpha characters leaves us with variable length UUIDs for each generated UUID...
With that being said, to give you something to think about, we know that each UUID is 36 characters in length, including the hyphens. So if we simplify and say we have no hyphens, each UUID can only be 32 characters in length. Building on this, if we further simplify and say that each of the 32 characters can only be a numeric character: [0-9], we could now give an accurate probability for uniqueness of each generated, simplified UUID (according to our above mentioned simplifications):
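To illustrate the variable-length problem, here is a small Python sketch. The helper name `strip_uuid` is made up for the example, and the second UUID is an artificial extreme case.

```python
def strip_uuid(text):
    """Keep only the decimal digits of a UUID string."""
    return "".join(ch for ch in text if ch.isdigit())


a = strip_uuid("a9318171-2276-498c-a0d6-9d6d0dec0e84")
b = strip_uuid("ffffffff-ffff-4fff-bfff-ffffffffffff")
print(a, len(a))  # 9318171227649806960084 22
print(b, len(b))  # 4 1
```

Two different UUIDs can also collapse to the same stripped string, for instance when they differ only in which letters they contain.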
Assuming a UUID is represented by 32 characters, where each character is a numerical character from the set of [0-9]. We know that we need to generate 32 numbers in order to create a valid simplified UUID. Now the chances of selecting any given number: [0-9] is 1/10. Another way to think about this is the following: each number has an equal opportunity of being generated and since there are 10 numbers: each number has a 10% chance of being generated.
Furthermore, when a number is generated, the number is generated independently of the previously generated numbers, i.e. each number generated doesn't depend on the outcome of the previous number generated. Therefore, for each of the 32 numeric characters generated, each number is independent of the others, and since the outcome of any selection is exactly one number from [0-9], the possible outcomes for each position are mutually exclusive.
Knowing these facts we can take advantage of the Product Rule, which states that the probability of the occurrence of two independent events is the product of their individual probabilities. For example, the probability of getting two heads on two coin tosses is 0.5 x 0.5 or 0.25. Therefore, the probability that a newly generated UUID is identical to a given one would be:
1/10 * 1/10 * 1/10 * .... * 1/10 where the number of 1/10s would be 32.
Simplifying to 1/(10^32), or in general to 1/(10^n), where n is the length of your UUID. So with all that being said, the probability of generating two identical UUIDs, given our assumptions, is vanishingly small.
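The product can be checked exactly with Python's `fractions` module. This is only a sketch of the simplified model above (independent, uniformly chosen characters), and `match_probability` is a name invented for the example.

```python
from fractions import Fraction


def match_probability(length, alphabet_size=10):
    """Probability that two independent random strings of `length` match."""
    return Fraction(1, alphabet_size) ** length


p = match_probability(32)
print(p == Fraction(1, 10**32))  # True
print(float(p))                  # 1e-32
```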
Hopefully that helps!

Algorithm:how to find the minimum number of combined number from a Char array so the number exceeds the target number?

We have a char array. All chars in the array are from 0 to 9. For example : 1,9,2,3.
We need to find the minimum number, built by combining chars from the array, that is greater than the target value (for example, if the target is 92, then 93 is the value I want).
one example : 1,9,2,3
target : 192
The minimum number which is greater than 192 : 193(i.e.:'1'+'9'+'3').
one more example:2,1,3
target :99
The minimum number which is greater than 99: 123
one more example:2,1,4
target :12
The minimum number which is greater than 12: 14
Please advise and help.
This is not home work, for sure. and there is no order in the char array.
for example:
target:23
the one i want:31
My question: do you need to find all possible combinations (two-digit integers/three-digit integers/four-digit integers) and then find the closest integer to the target number?
and length of char array could be 10. the target number could be greater than one million...
No repeated characters are allowed. For instance, for target 10, will the answer be 12 instead of 11?
Any ideas?
Since no repeated digits are allowed, the very first thing to do is to remove repeated digits from the array. Also, sorting the array is a good idea.
If the target has d digits, the solution is either also a d-digit number or a d+1-digit number. If it's a d+1 digit number, it is the smallest you can construct from the array values. That part is very easy:
digit[1] = minimum of nonzero array elements
for p = 2 to d+1:
    digit[p] = minimum of array elements not yet taken
If the solution is a d-digit number, its first digit is either equal to the first digit of the target, or it's larger. If it's larger, the constructed number will be larger than the target no matter what the following digits are, so for the remaining digits, you can copy part of the above case. If the first digit of the solution is equal to the first digit of the target, you have reduced the problem to that of finding a solution for a d-1-digit target with a smaller array of eligible digits. You can then recur.
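One way to turn this recursion into code, as a sketch in Python. The function names are invented, the target is assumed to be a positive integer, and each distinct digit is used at most once, as described above.

```python
def same_length(pool, target):
    """Smallest string of len(target) built from distinct digits in the
    sorted `pool` that is strictly greater than `target`, or None."""
    if not target or len(pool) < len(target):
        return None  # an empty remainder means "equal", not "greater"
    first = target[0]
    # Case 1: keep the same leading digit and recur on the rest.
    if first in pool:
        rest = same_length([c for c in pool if c != first], target[1:])
        if rest is not None:
            return first + rest
    # Case 2: smallest larger leading digit, then the smallest leftovers.
    for c in pool:
        if c > first:
            others = [x for x in pool if x != c]
            return c + "".join(others[:len(target) - 1])
    return None


def one_digit_longer(pool, length):
    """Smallest number with `length` digits and a nonzero leading digit."""
    nonzero = [c for c in pool if c != "0"]
    if len(pool) < length or not nonzero:
        return None
    first = nonzero[0]
    return first + "".join([c for c in pool if c != first][:length - 1])


def smallest_greater(chars, target):
    pool = sorted(set(chars))  # drop repeats, sort ascending
    text = str(target)
    best = same_length(pool, text) or one_digit_longer(pool, len(text) + 1)
    return int(best) if best is not None else None


print(smallest_greater("1923", 192))  # 193
print(smallest_greater("213", 99))    # 123
print(smallest_greater("214", 12))    # 14
```

Case 1 is tried first because any number that keeps the target's leading digit is smaller than any number with a larger leading digit.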
For a dynamic programming approach, preserving the order in the original array, you could work out, after the first N characters, the maximum number possible using only 1,2,3...N characters. Then for the N+1th position the maximum number possible with i characters is either as before, or the previous answer with i-1 characters extended with the current character.
A hack, if you don't have to preserve the order, is to sort the original array.
Example given 1923
At position 1 you care about 1.
At position 2 you care about 19 and 9.
At position 3 you care about 192, 92, and 9.
At the end you care about 1923, 923, and 93.
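The table in this example can be produced with a short Python sketch. The function name is made up. Note that it only tabulates the largest number obtainable with each count of characters, as described above; it does not by itself pick the smallest number above the target.

```python
def best_by_length(chars):
    """best[i] = largest number using i characters, keeping original order."""
    best = [0]  # best[0] is the empty selection
    for ch in chars:
        digit = int(ch)
        best.append(-1)  # room for one more character
        # Go from long to short so best[i - 1] is still the previous value.
        for i in range(len(best) - 1, 0, -1):
            best[i] = max(best[i], best[i - 1] * 10 + digit)
    return best[1:]


print(best_by_length("1923"))  # [9, 93, 923, 1923]
```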
Further comments:
There is an article on dynamic programming at http://en.wikipedia.org/wiki/Dynamic_programming. The main idea is to solve small problems, and then use those solutions to solve slightly larger problems, and then use those... and so on until you have worked your way up to the problem you actually want to solve.
In your case, you want to find how to take a small number of characters from 1923 so as to make a large number. Suppose you know how to take a small number of characters from 192 to make a large number. In that case, the best solution for 1923 will either be a best solution for 192 or that solution with the 3 that ends 1923 added on. This is because if you had a solution for 1923 that was better than any of the ones you could get as I described, you could get a better solution for 192 by taking it, and perhaps deleting its final character.
Of course, at the beginning you don't know the solution for 192 either, so you have to start at the very beginning, with the solution for 1, and from that work out the best solutions for 19, and then 192, and finally for 1923 - which is what I have shown in the example above.
Finally, I couldn't work out from your question whether e.g. 9321 or 932 are possible solutions. If they are, the problem is easier, but if you really want to you can solve it with much the same method. Just sort 1923 to give 9321 and then solve that as you solved for 1923.

Algorithm Question on File Search Indexing

There is one question and I have the solution to it also. But I couldn't understand the solution. Kindly help with some set of examples and shower some experience.
Question
Given a file containing roughly 300 million social security numbers (9-digit numbers), find a 9-digit number that is not in the file. You have unlimited drive space but only 2MB of RAM at your disposal.
Answer
In the first step, we build an array of 2^16 integers that is initialized to 0, and for every number in the file, we take its 16 most significant bits to index into this array and increment that entry.
Since there are fewer than 2^32 numbers in the file, there is bound to be (at least) one entry in the array that is less than 2^16. This tells us that there is at least one number missing among the possible numbers with those upper bits.
In the second pass, we can focus only on the numbers that match this criterion and use a bit-vector of size 2^16 to identify one of the missing numbers.
To make the explanation simpler, let's say you have a list of two-digit numbers, where each digit is between 0 and 3, but you can't spare the 16 bits to remember for each of the 16 possible numbers, whether you have already encountered it. What you do is to create an array a of 4 3-bit integers and in a[i], you store how many numbers with the first digit i you encountered. (Two-bit integers wouldn't be enough, because you need the values 0, 4 and all numbers between them.)
If you had the file
00, 12, 03, 31, 01, 32, 02
your array would look like this:
4, 1, 0, 2
Now you know that all numbers starting with 0 are in the file, but for each of the remaining first digits, there is at least one number missing. Let's pick 1. We know there is at least one number starting with 1 that is not in the file. So, create an array of 4 bits, for each number starting with 1 set the appropriate bit, and in the end pick one of the bits that wasn't set; in our example it could be 0. Now we have the solution: 10.
In this case, using this method is the difference between 12 bits and 16 bits. With your numbers, it's the difference between 32 kB and 119 MB.
In round terms, you have about 1/3 of the numbers that could exist in the file, assuming no duplicates.
The idea is to make two passes through the data. Treat each number as a 32-bit (unsigned) number. In the first pass, keep a track of how many numbers have the same number in the most significant 16 bits. In practice, there will be a number of codes where there are zero (all those for 10-digit SSNs, for example; quite likely, all those with a zero for the first digit are missing too). But of the ranges with a non-zero count, most will not have 65536 entries, which would be how many would appear if there were no gaps in the range. So, with a bit of care, you can choose one of the ranges to concentrate on in the second pass.
If you're lucky, you can find a range in the 100,000,000..999,999,999 with zero entries - you can choose any number from that range as missing.
Assuming you aren't quite that lucky, choose the range with the lowest count (or any of them with fewer than 65536 entries); call it the target range. Reset the array to all zeroes. Reread the data. If the number you read is not in your target range, ignore it. If it is in the range, record the number by setting the array value to 1 for the low-order 16 bits of the number. When you've read the whole file, any of the numbers with a zero in the array represents a missing SSN.
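The two passes can be sketched in Python as follows. The name `find_missing` and the `read_numbers` callable (a stand-in for rereading the file from disk) are assumptions made for the example, and a `bytearray` is used in place of a true bit-vector for simplicity. The last range is only partly below 1,000,000,000, so its capacity is capped.

```python
LIMIT = 10**9        # 9-digit numbers: 0 .. 999,999,999
LOW_BITS = 16


def find_missing(read_numbers):
    """Return a number below LIMIT that never appears in the data.
    `read_numbers` is called once per pass and must return a fresh iterator."""
    # Pass 1: count how many numbers share each value of the upper bits.
    counts = [0] * (1 << 16)
    for n in read_numbers():
        counts[n >> LOW_BITS] += 1

    # Pick any range holding fewer numbers than it has room for.
    for bucket, count in enumerate(counts):
        low = bucket << LOW_BITS
        capacity = min(1 << LOW_BITS, LIMIT - low)
        if capacity > 0 and count < capacity:
            break
    else:
        return None  # every range is full

    # Pass 2: mark which low-order values occur inside the chosen range.
    seen = bytearray(capacity)
    for n in read_numbers():
        if n >> LOW_BITS == bucket:
            seen[n - low] = 1
    return low + seen.index(0)


data = [n for n in range(70000) if n != 65540]
print(find_missing(lambda: iter(data)))  # 65540
```

The counter array and the marker array each hold at most 65536 entries, so on a real implementation with fixed-width integers both fit comfortably in 2MB.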
