Efficient database lookup based on input where not all digits are significant - algorithm

I would like to do a database lookup based on a 10 digit numeric value where only the first n digits are significant. Assume that there is no way in advance to determine n by looking at the value.
For example, I receive the value 5432154321. The corresponding entry (if it exists) might have key 54 or 543215 or any value based on n being somewhere between 1 and 10 inclusive.
Is there any efficient approach to matching on such a string short of simply trying all 10 possibilities?
Some background
The value is from a barcode scan. The barcodes are EAN13 restricted circulation numbers so they have the following structure:
02[1234567890]C
where C is a check sum. The 10 digits in between the 02 and the check sum consist of an item identifier followed by an item measure. There might be a check digit after the item identifier.
Since I can't depend on the data to adhere to any single standard, I would like to be able to define, on an ad-hoc basis, how particular barcodes are structured. This means that the portion of the 10 digit number that I extract can be any length between 1 and 10.

Just a few ideas here:
1)
Maybe store these numbers in reversed form in your DB.
If you have N = 54321 you store it as N = 12345 in the DB.
Say N is the name of the column you stored it in.
When you read K = 5432154321, reverse this one too,
you get K1 = 1234512345, now check the DB column N
(whose value is let's say P), if K1 % 10^s == P,
where s = floor(log10(P)) + 1.
Note: floor(log10(P)) + 1 is a formula for
the count of digits of the number P > 0
(the logarithm must be base 10, not the natural log).
You may also store the value floor(log10(P)) + 1
in the DB as a precomputed column, so that
you don't need to compute it each time.
2) As this 1) is kind of sick (but maybe best of the 3 ideas here),
maybe you just store them in a string column and check it with
'like operator'. But this is trivial, you probably considered it
already.
3) Or ... you store the numbers reversed, but you also
store all their residues mod 10^k for k=1...10.
col1, col2,..., col10
Then you can compare numbers almost directly,
the check will be something like
N % 10 == col1
or
N % 100 == col2
or
...
(N % 10^10) == col10.
Still not very elegant though (and not quite sure
if applicable to your case).
I decided to check my idea 1).
So here is an example
(I did it in SQL Server).
insert into numbers
(number, cnt_dig)
values
(1234, 1 + floor(log10(1234)))
insert into numbers
(number, cnt_dig)
values
(51234, 1 + floor(log10(51234)))
insert into numbers
(number, cnt_dig)
values
(7812334, 1 + floor(log10(7812334)))
select * From numbers
/*
Now we have this in our table:
id number cnt_dig
4 1234 4
5 51234 5
6 7812334 7
*/
-- Note that the actual numbers stored here
-- are the reversed ones: 4321, 43215, 4332187.
-- So far so good.
-- Now we read say K = 433218799 on the input
-- We reverse it and we get K1 = 997812334
declare @K1 bigint
set @K1 = 997812334
select * From numbers
where
-- cast to bigint so that power() does not overflow int for 10 digits
@K1 % power(cast(10 as bigint), cnt_dig) = number
-- So from the last 3 queries,
-- we get this row:
-- id number cnt_dig
-- 6 7812334 7
--
-- meaning we have a match
-- i.e. the actual number 433218799
-- was matched successfully with the
-- actual number (from the DB) 4332187.
So this idea 1) doesn't seem that bad after all.
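For anyone who wants to try idea 1) outside the database, here is a minimal Python sketch of the same reversed-key modulo check. The names `build_table` and `match` are made up for illustration. Unlike the SQL above, it stores the digit count of the original key rather than of the reversed number; that is an assumption I added so that keys ending in 0 (whose reversed form loses a leading zero) still match.

```python
def build_table(keys):
    """Store each key reversed, together with its original digit count."""
    return [(int(key[::-1]), len(key)) for key in keys]


def match(table, scanned):
    """Return every stored key that is a prefix of the scanned digits."""
    k1 = int(scanned[::-1])  # reversed input, like K1 above
    hits = []
    for number, cnt_dig in table:
        # The last cnt_dig digits of k1 are the first cnt_dig digits of the scan.
        if k1 % 10 ** cnt_dig == number:
            hits.append(str(number).zfill(cnt_dig)[::-1])
    return hits


table = build_table(["4321", "43215", "4332187"])
print(match(table, "433218799"))  # ['4332187']
```

Like the SQL query, this still looks at every row, so it shows the matching rule rather than any speed-up.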

Related

Quick way to compute n-th sequence of bits of size b with k bits set?

I want to develop a way to represent all combinations of b bits with k bits set (equal to 1). Given an index, it should quickly produce the corresponding binary sequence, and the other way around too. For instance, the traditional approach I thought of would be to generate the numbers in order, like:
For b=4 and k=2:
0- 0011
1- 0101
2- 0110
3- 1001
4- 1010
5- 1100
If I am given the sequence '1010', I want to be able to quickly generate the number 4 as a response, and if I give the number 4, I want to be able to quickly generate the sequence '1010'. However I can't figure out a way to do these things without having to generate all the sequences that come before (or after).
It is not necessary to generate the sequences in that order, you could do 0-1001, 1-0110, 2-0011 and so on, but there has to be no repetition between 0 and the (combination of b choose k) - 1 and all sequences have to be represented.
How would you approach this? Is there a better algorithm than the one I'm using?
pkpnd's suggestion is on the right track, essentially process one digit at a time and if it's a 1, count the number of options that exist below it via standard combinatorics.
nCr() can be replaced by a table precomputation requiring O(n^2) storage/time. There may be another property you can exploit to reduce the number of nCr's you need to store by leveraging the absorption property along with the standard recursive formula.
Even with 1000's of bits, that table shouldn't be intractably large. Storing the answer also shouldn't be too bad, as 2^1000 is ~300 digits. If you meant hundreds of thousands, then that would be a different question. :)
import math

def nCr(n,r):
    return math.factorial(n) // math.factorial(r) // math.factorial(n-r)

def get_index(value):
    b = len(value)
    k = sum(c == '1' for c in value)
    count = 0
    for digit in value:
        b -= 1
        if digit == '1':
            if b >= k:
                count += nCr(b, k)
            k -= 1
    return count

print(get_index('0011')) # 0
print(get_index('0101')) # 1
print(get_index('0110')) # 2
print(get_index('1001')) # 3
print(get_index('1010')) # 4
print(get_index('1100')) # 5
Nice question, btw.
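The code above only covers sequence to index. The other direction can be sketched with the same counting argument: at each position, if the index is at least the number of sequences that would put a 0 there, the bit must be a 1. `get_value` is my own name for it, and it assumes Python 3.8+ for `math.comb`.

```python
from math import comb  # Python 3.8+


def get_value(index, b, k):
    """Return the b-bit string with k ones whose rank is index."""
    bits = []
    for remaining in range(b - 1, -1, -1):
        # Number of sequences that keep a 0 here and fit all k ones in the rest.
        below = comb(remaining, k)
        if k > 0 and index >= below:
            bits.append('1')
            index -= below
            k -= 1
        else:
            bits.append('0')
    return ''.join(bits)


for i in range(6):
    print(i, get_value(i, 4, 2))
```

For b=4 and k=2 this walks through the same six sequences, in the same order, as the list in the question.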

Linear Hashing calculation?

I am currently studying for my exams and have come up against this question:
(5d) Suppose we are using linear hashing, and start with an empty table with 2 buckets (M = 2), split = 0 and a load factor of 0.9. Explain the steps we go through when the following hashes are added (in order):
5,7,12,11,9
The answer provided for this is:
*— —5— (0,1)
* — —5,7 —
split —*—5,7— — (0,1,2)
—12*—5,7— — —
split —12—5—*—7— (0,1,2,3)
split =M, M = 2*M, split = 0
*—12—5— —7—
*—12—5— —7,11—
split —*—5— —7,11—12— (0,1,2,3,4)
—*—5,9— —7,11—12—
split — —9*— —7,11—12—5— (0,1,2,3,4,5)
This answer doesn't make any sense to me and the lecturer did not go through this.
How do I tackle this question?
I edited your question because the answer looks like a list of descriptions of the hash table state as each operation is performed. Did your professor cover linear hashing at all? The Wikipedia description doesn't mention a load factor precisely, but it's in the original LH paper by Witold Litwin, where it's integral to when a controlled split occurs. I also found these descriptions:
Let l denote the Linear Hashing scheme’s load factor, i.e., l = S/b where S is the total number of records and b is the number of buckets used.
Linear Hashing by Zhang, et al (PDF)
The linear hashing algorithm performs splits in a deterministic order, rather than splitting at a bucket that overflowed. The splits are performed in linear order (bucket 0 first, then bucket 1, then 2, ...), and a split is performed when any bucket overflows. If the bucket that overflows is not the bucket that is split (which is the common case), overflow techniques such as chaining are used, but the common case is that few overflow buckets are needed.
snip
Instead of splitting on every collision, you can do a split when the "load" (which is bytes stored / (num buckets * bucket size), i.e. utilization of the data structure) crosses some watermark. This is called controlled splitting; the previously described is called uncontrolled splitting.
Linear Hashing: A new Tool for File and Table Addressing Witold Litwin, Summary by: Steve Gribble and Armando Fox, Online Berkley.edu retrieved June 16
So basically, a load factor is a means of predictably controlling when a split will occur. One implementation of linear hashing appears to be called 'uncontrolled split', which adds a new bucket and performs a split whenever a collision occurs. Using a load factor of 0.9 only has a split occur when 90% of the table's buckets are full - or rather, would be full based on the prediction that the buckets are uniformly assigned to.
Based on this and the Wikipedia article I just read, the setup is this:
Table is initially empty with two buckets (N = 2) - - (numbered 0 and 1)
N for number of buckets makes so much more sense to me than M, so I'm using that in my answer.
Apparently N is never changed even as new buckets are added to the table.
Our growth factor (L for bucket level) is 0. It is incremented every time every bucket in the table has been split once, which coincides with when our table has doubled in size.
Step pointer S (also called a split pointer) points to 0th bucket. It indicates which bucket will have a split applied to it next.
This follows the wikipedia article description I linked to above. Now we need to cover the hash and bucket assignment.
A decent hash function for integers you expect to have a normal distribution is to just use the integer itself. So for an input integer I, our hash H(I) is just I. I think this follows the answer key, which is good because the question is unanswerable without knowing H.
To determine which bucket an integer I is added to, one of two function values will be used, depending on whether or not the assignment points to before or after S.
First, calculate H(I) mod (N x 2^L), which is really just I mod (N x 2^L). I'm going to call this B(I) below for brevity (also for bucket). Call this the assignment address A.
If A is greater than or equal to S, we assign input I to address A and move on.
If A (B(I)) is less than S, we actually use a different hash function, which I'll call B'(I), calculated as I mod (N x 2^(L+1)), giving us an actual assignment address of A'.
I think the reasoning for this is to keep the assignment to buckets more even as buckets are split along the way, but I don't have the mathematical proof of its importance.
I think the * in the answer's notation above denotes the location of the split pointer S. In my notation for the rest of the question below:
Let - denote an empty bucket, i denote a bucket with the Integer i in it, and i,j denote a bucket with both i and j in it.
So the first step of your answer key "— —5— (0,1)" is saying bucket 0 is empty and bucket 1 has 5 in it. I would rewrite this as - 5 for clarity.
I'm thinking your answer breakdown reads like this:
Add 5 to the table.
The linear hashing algorithm puts it into the second bucket (index 1) because:
B(5) = 5 mod (2 x 2^0) = 5 mod (2 x 1) = 5 mod 2 = 1
1 is greater than S, which is still 0, so we use 1 as the address.
Table now has - 5 (0th bucket empty, 1st bucket with 5 in it).
N, L, and S are unchanged
Add 7 to the table.
B(7) = 7 mod 2 = 1, so 7 is added to the same bucket as 5. S still hasn't changed, so again 1 is used as the address.
Table now has - 5,7
A split occurs! Not because a bucket has overflowed, but because the load factor has been exceeded. 2 items added, 2 total buckets, 2/2 = 1.0 > 0.9 = do a split.
First a new bucket is added at the end of the table.
S is incremented to 1. N is not incremented. L is unchanged
The split is done on a bucket. A split means all the items in the bucket get their assignment recalculated based on the new hash table size. However, one key to linear hashing is that the actual buckets are split in order, so the 0th bucket is split even though the 1st bucket is the one that's full.
Post split, the table is now - 5,7 -, with buckets 0 and 2 empty, and 1 still with 5 and 7 in it.
Add 12 to the table.
B(12) = 12 mod (2 x 2^0) = 12 mod 2 = 0
S is 1 and B(12) is 0, so we calculate B'(12) instead for our address.
Coincidentally, this is 12 mod (2 x 2^(0+1)) = 12 mod 4, which is still 0, so 12 is added to the 0th bucket.
Table now has 12 5,7 -, only the 3rd, new bucket is empty.
A split occurs again, because 3/3 = 1.0 > 0.9. This split promises to be more interesting than the last!
A new bucket is added to the end of the table, giving us 12 5,7 - -
S = 1, so the bucket with 5,7 is split. That means new buckets are picked for 5 and 7.
Increment S to 2. This is done after the split target bucket is picked, but before the new buckets are assigned. This ensures the new table is more evenly distributed (again, my supposition, don't have proof).
5 mod 2 = 1, 1 < S, calculate 5 mod (2 x 2^1) = 5 mod 4 = 1. 5 is re-assigned to its same bucket.
7 mod 2 = 1, 1 < S, calculate 7 mod (2 x 2^1) = 7 mod 4 = 3. 7 is re-assigned to 3.
Table now has 12 5 - 7
S = 2, N still equals 2, and L still = 0. S has now reached N x 2^L = 2 x 2^0 = 2, so S is reset to 0 and L is incremented to 1.
Add 11 to the table.
B(11) = 11 mod (2 x 2^1) = 11 mod 4 = 3. 11 is assigned to the 3rd bucket.
Table now has 12 5 - 7,11, 4 items and 4 buckets, so a split occurs again.
S is 0 again, so the 0th bucket with 12 is reassigned after a new bucket is added. S is incremented to 1 before choosing a new bucket for 12.
B(12) = 12 mod (2 x 2^1) = 12 mod 4 = 0. 0 < 1, so recalculate
B'(12) = 12 mod (2 x 2^(1+1)) = 12 mod 8 = 4. 12 is assigned to the 4th bucket.
Table now contains - 5 - 7,11 12
Add 9 to the table.
I'll leave the steps to the last one for you. There are a few nuances to the LH algorithm that I'm not quite grasping. I might ask additional questions about them. But hopefully that's enough for you to get going on. In the future, I would recommend asking the course instructor directly.
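If you want to check your working, the procedure as I've described it can be simulated in a few lines of Python. This is only a sketch of my reading of the algorithm (H(I) = I, a split whenever items / buckets exceeds the load factor, S incremented as part of the split); the function name `simulate` and its parameters are my own, not from your course material.

```python
def simulate(keys, n=2, max_load=0.9):
    """Insert keys one by one and return the final list of buckets."""
    buckets = [[] for _ in range(n)]
    level, split, count = 0, 0, 0
    for key in keys:
        # Pick the bucket: B(I), or B'(I) if that bucket was already split.
        addr = key % (n * 2 ** level)
        if addr < split:
            addr = key % (n * 2 ** (level + 1))
        buckets[addr].append(key)
        count += 1
        if count / len(buckets) > max_load:
            # Controlled split: always split the bucket at the split pointer.
            buckets.append([])
            old, buckets[split] = buckets[split], []
            for item in old:
                buckets[item % (n * 2 ** (level + 1))].append(item)
            split += 1
            if split == n * 2 ** level:  # every bucket of this level was split
                split, level = 0, level + 1
        print(key, buckets, "S =", split, "L =", level)
    return buckets


simulate([5, 7, 12, 11, 9])
```

Each printed line shows the table after one insert (and its split, if any), which you can compare against the rows of the answer key.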

String to Number and back algorithm

This is a hard one (for me) I hope people can help me. I have some text and I need to transfer it to a number, but it has to be unique just as the text is unique.
For example:
The word 'kitty' could produce 12432, but only the word kitty produces that number. The text could be anything and a proper number should be given.
One problem: the resulting integer must be a 32-bit integer, which means the largest possible number is 2147483647 (that is the signed limit; an unsigned 32-bit integer goes up to 4294967295). I don't mind if there is a text length restriction, but I hope it can be as large as possible.
My attempts. You have the letters A-Z and 0-9 so one character can have a number between 1-36. But if A = 1 and B = 2 and the text is A(1)B(2) and you add it you will get the result of 3, the problem is the text BA produces the same result, so this algorithm won't work.
Any ideas to point me in the right direction or is it impossible to do?
Your idea is generally sane, only needs to be developed a little.
Let f(c) be a function converting character c to a unique number in range [0..M-1]. Then you can calculate result number for the whole string like this.
f(s[0]) + f(s[1])*M + f(s[2])*M^2 + ... + f(s[n])*M^n
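You can easily prove that the number will be unique for a particular string of a given length (and you can get the string back from the number). To tell strings of different lengths apart as well, let f map to [1..M] instead, so that no character encodes as 0.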
Obviously, you can't use very long strings here (up to 6 characters for your case), as 36^n grows fast.
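Here is a minimal Python sketch of that positional idea, under a few assumptions of my own: the alphabet is A-Z followed by 0-9, characters map to 1..36 rather than 0..35 (so "A" and "AA" get different numbers), and the first character is treated as the most significant one, which is the mirror image of the formula above. The names `ALPHABET`, `encode` and `decode` are just for illustration.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
M = len(ALPHABET)  # 36


def encode(text):
    """Map a string over ALPHABET to a unique positive integer."""
    number = 0
    for ch in text:
        # Digits run from 1 to M, so a leading 'A' is never lost.
        number = number * M + ALPHABET.index(ch) + 1
    return number


def decode(number):
    """Invert encode()."""
    chars = []
    while number > 0:
        number, digit = divmod(number - 1, M)
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))


print(encode("KITTY"), decode(encode("KITTY")))
print(encode("AB"), encode("BA"))  # different numbers
```

With this mapping every string of up to 6 characters stays below 2^32; a 7th character no longer fits.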
Imagine you were trying to store Strings from the character set "0-9" only in a number (the equivalent of obtaining a number of a string of digits). What would you do?
Char 9 8 7 6 5 4 3 2 1 0
Str 0 5 2 1 2 5 4 1 2 6
Num = 6 * 10^0 + 2 * 10^1 + 1 * 10^2...
Apply the same thing to your characters.
Char 5 4 3 2 1 0
Str A B C D E F
L = 36
C(I): transforms character to number: C(0)=0, C(A)=10, C(B)=11, ...
Num = C(F) * L ^ 0 + C(E) * L ^ 1 + ...
Build a dictionary out of words mapped to unique numbers and use that, that's the best you can do.
I doubt there are more than 2^32 number of words in use, but this is not the problem you're facing, the problem is that you need to map numbers back to words.
If you were only mapping words over to numbers, some hash algorithm might work, although you'd have to work a bit to guarantee that you have one that won't produce collisions.
However, for numbers back to words, that's quite a different problem, and the easiest solution to this is to just build a dictionary and map both ways.
In other words:
AARDUANI = 0
AARDVARK = 1
...
If you want to map numbers to base 26 characters, you can only store 6 characters (or 5 or 7 if I miscalculated), but not 12 and certainly not 20.
Unless you only count actual words, and they don't follow any good countable rules. The only way to do that is to just put all the words in a long list, and start assigning numbers from the start.
If it's correctly spelled text in some language, you can have a number for each word. However you'd need to consider all possible plurals, place and people names etc. which is generally impossible. What sort of text are we talking about? There's usually going to be some existing words that can't be coded in 32 bits in any way without prior knowledge of them.
Can you build a list of words as you go along? Just give the first word you see the number 1, second number 2 and check if a word has a number already or it needs a new one. Then save your newly created dictionary somewhere. This would likely be the only workable solution if you require 100% reliable, reversible mapping from the numbers back to original words given new unknown text that doesn't follow any known pattern.
With 64 bits and a sufficiently good hash like MD5 it's extremely unlikely to have collisions, but for 32 bits it doesn't seem likely that a safe hash would exist.
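The build-as-you-go dictionary suggested above could look roughly like this in Python. `WordIndex` and its method names are hypothetical, and saving the mapping to disk is left out; the sketch only shows the two-way lookup.

```python
class WordIndex:
    """Two-way mapping between words and the order in which they were first seen."""

    def __init__(self):
        self._numbers = {}  # word -> number
        self._words = []    # position (number - 1) -> word

    def number_for(self, word):
        """Return the word's number, assigning the next free one if it is new."""
        if word not in self._numbers:
            self._words.append(word)
            self._numbers[word] = len(self._words)  # first word gets 1
        return self._numbers[word]

    def word_for(self, number):
        """Return the word that was assigned this number."""
        if not 1 <= number <= len(self._words):
            raise KeyError(number)
        return self._words[number - 1]


index = WordIndex()
print(index.number_for("kitty"), index.number_for("cat"), index.number_for("kitty"))
print(index.word_for(2))
```

The mapping is only reversible for words that have already been seen, which is the trade-off described above.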
Just treat each character as a digit in base 36, and calculate the decimal equivalent?
So:
'A' = 0
'B' = 1
[...]
'Z' = 25
'0' = 26
[...]
'9' = 35
'BA' = 36
'BB' = 37
[...]
'CAB' = 2593

How to balance the number of items across multiple columns

I need to find out a method to determine how many items should appear per column in a multiple column list to achieve the most visual balance. Here are my criteria:
The list should only be split into multiple columns if the item count is greater than 10.
If multiple columns are required, they should contain no less than 5 (except for the last column in case of a remainder) and no more than 10 items.
If all columns cannot contain an equal number of items
All but the last column should be equal in number.
The number of items in each column should be optimized to achieve the smallest difference between the last column and the other column(s).
Well, your requirements and your examples appear a bit contradictory. For instance, your second example could be divided into two columns with 11 items in each, and satisfy your criteria. Let's assume that for rule #2 you meant that there should be <= 10 items / column.
In addition, I think you need to add another rule to make the requirements sensible:
The number of columns must not be greater than what is required to accommodate overflow.
Otherwise, you will often end up with degenerate solutions where you have far more columns than you need. For example, in the case of 26 items you probably don't want 13 columns of 2 items each.
If that's the case, here's a simple calculation that should work well and is easy to understand:
int numberOfColumns = CEILING(numberOfItems / 10.0);
int numberOfItemsPerColumn = CEILING(numberOfItems / (double) numberOfColumns);
Now you'll create N-1 columns of items (each having numberOfItemsPerColumn items) and the overflow will go in the last column. By this definition, the overflow should be minimized in the last column.
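As a quick check of that calculation, here is a small Python sketch that returns the number of items in each column. The function name `column_sizes` and the `max_per_column` parameter are my own; the ceiling is done with integer arithmetic to avoid any integer-division or rounding surprises.

```python
def column_sizes(number_of_items, max_per_column=10):
    """Return the item count of each column, left to right."""
    if number_of_items <= 0:
        return []
    # Integer ceiling division: -(-a // b) equals ceil(a / b) for positive b.
    number_of_columns = -(-number_of_items // max_per_column)
    per_column = -(-number_of_items // number_of_columns)
    sizes = []
    remaining = number_of_items
    while remaining > 0:
        size = min(per_column, remaining)
        sizes.append(size)
        remaining -= size
    return sizes


print(column_sizes(26))  # [9, 9, 8]
print(column_sizes(11))  # [6, 5]
```

Note that this keeps the number of columns minimal first and balances second, so 77 items come out as seven columns of 10 and one of 7 rather than eleven columns of 7.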
If you want to automatically determine the appropriate number of columns, and have no restrictions on its limits, I would suggest the following:
Calculate the square root of the total number of items. That would make a square layout.
Divide that number by 1.618, and assign that to the total number of rows.
Multiply that same number by 1.618, and assign that to the total number of columns.
All columns but the right most one will have the same number of items.
By the way, the constant 1.618 is the Golden Ratio. That will achieve a more pleasant layout than a square one.
Divide and multiply the other way round for vertical displays.
Hope this algorithm helps anyone with a similar problem.
Here's what you're trying to solve:
minimize y - z where n = xy + z and 5 <= y <= 10 and 0 <= z <= y
where you have n items split into x full columns of y items and one remainder column of z items.
There is almost certainly a smart way of doing this, but given these constraints a brute force implementation exploring all 6 + 7 + 8 + 9 + 10 = 40 possible combinations for y and z would take no time at all (only assignments where (n - z) mod y = 0 are solutions).
I think a brute force solution is easy, given the constraint on the number of items per column: let v be the number of items per column (except the last one), then v belongs to [5,10] and can thus take a whopping 6 different values.
Evaluating 6 values is easy enough. Python one-liner (or not so far) to prove it:
# compute the difference between the number of items for the normal columns
# and for the last column, lesser is better
def helper(n,v):
    modulo = n % v
    if modulo == 0: return 0
    else: return v - modulo

# values can only be in [5,10]
# we compute the difference with the last column for each
# build a list of tuples (difference, - number of items)
# (because the greater the value the better, it means less columns)
# extract the min automatically (in case of equality, less is privileged)
# and then pick the number of items from the tuple and re-inverse it
def compute(n): return - min([(helper(n,v), -v) for v in [5,6,7,8,9,10]])[1]

For 77 this yields: 7, meaning 7 items per column
For 22 this yields: 8, meaning 8 items per column

Number base conversion as a stream operation

Is there a way, in constant working space, to do arbitrary-size and arbitrary-base conversions? That is, to convert a sequence of n numbers in the range [1,m] to a sequence of ceiling(n*log(m)/log(p)) numbers in the range [1,p] using a 1-to-1 mapping that (preferably but not necessarily) preserves lexicographical order and gives sequential results?
I'm particularly interested in solutions that are viable as a pipe function, i.e. are able to handle a larger dataset than can be stored in RAM.
I have found a number of solutions that require "working space" proportional to the size of the input but none yet that can get away with constant "working space".
Does dropping the sequential constraint make any difference? That is: allow lexicographically sequential inputs to result in non lexicographically sequential outputs:
F(1,2,6,4,3,7,8) -> (5,6,3,2,1,3,5,2,4,3)
F(1,2,6,4,3,7,9) -> (5,6,3,2,1,3,5,2,4,5)
some thoughts:
might this work?
streamBasen -> convert(n, lcm(n,p)) -> convert(lcm(n,p), p) -> streamBasep
(where lcm is least common multiple)
I don't think it's possible in the general case. If m is a power of p (or vice-versa), or if they're both powers of a common base, you can do it, since each group of log_m(p) digits is then independent. However, in the general case, suppose you're converting the number a_1 a_2 a_3 ... a_n. The equivalent number in base p is
sum(a_i * m^(i-1) for i in 1..n)
If we've processed the first i digits, then we have the ith partial sum. To compute the i+1'th partial sum, we need to add a_(i+1) * m^i. In the general case, this number is going to have non-zero digits in most places, so we'll need to modify all of the digits we've processed so far. In other words, we'll have to process all of the input digits before we'll know what the final output digits will be.
In the special case where m and p are both powers of a common base, or equivalently if log_m(p) is a rational number, then m^i will only have a few non-zero digits in base p near the front, so we can safely output most of the digits we've computed so far.
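To make the easy case concrete, here is a minimal Python sketch for when the input base is an exact power of the output base (say base 16 to base 2). Each input digit expands independently into k output digits, so a generator can process the stream with constant working space. The name `expand_digits` and its parameters are my own, and digits are taken as integers starting from 0, most significant first.

```python
def expand_digits(digits, p, k):
    """Yield base-p digits for a stream of base-(p**k) digits."""
    for d in digits:
        group = []
        for _ in range(k):
            d, r = divmod(d, p)
            group.append(r)
        # divmod produces the least significant digit first, so reverse the group.
        yield from reversed(group)


hex_digits = [0x1, 0x9, 0x2, 0xF, 0xE]
print("".join(str(b) for b in expand_digits(hex_digits, 2, 4)))
```

Because `digits` can be any iterable, this works as a pipe stage; it does not help with the general case discussed above.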
I think there is a way of doing radix conversion in a stream-oriented fashion in lexicographic order. However, what I've come up with isn't sufficient for actually doing it, and it has a couple of assumptions:
The length of the positional numbers are already known.
The numbers described are integers. I've not considered what happens with the maths and -ive indices.
We have a sequence of values a of length p, where each value is in the range [0,m-1]. We want a sequence of values b of length q in the range [0,n-1]. We can work out the kth digit of our output sequence b from a as follows:
b_k = floor[ sum(a_i * m^i for i in 0 to p-1) / n^k ] mod n
Let's rearrange that sum into two parts, splitting it at an arbitrary point z
b_k = floor[ ( sum(a_i * m^i for i in z to p-1) + sum(a_i * m^i for i in 0 to z-1) ) / n^k ] mod n
Suppose that we don't yet know the values of a between [0,z-1] and can't compute the second sum term. We're left with having to deal with ranges. But that still gives us information about b_k.
The minimum value b_k can be is:
b_k >= floor[ sum(a_i * m^i for i in z to p-1) / n^k ] mod n
and the maximum value b_k can be is:
b_k <= floor[ ( sum(a_i * m^i for i in z to p-1) + m^z - 1 ) / n^k ] mod n
We should be able to do a process like this:
1) Initialise z to be p. We will count down from p as we receive each character of a.
2) Initialise k to the index of the most significant value in b. If my brain is still working, ceil[ log_n(m^p) ].
3) Read a value of a. Decrement z.
4) Compute the min and max value for b_k.
5) If the min and max are the same, output b_k, and decrement k. Goto 4. (It may be possible that we already have enough values for several consecutive values of b_k)
6) If z != 0 then we expect more values of a. Goto 3.
7) Hopefully, at this point we're done.
I've not considered how to efficiently compute the range values as yet, but I'm reasonably confident that computing the sum from the incoming characters of a can be done much more reasonably than storing all of a. Without doing the maths, though, I won't make any hard claims about it!
Yes, it is possible
For every I character(s) you read in, you will write out O character(s)
based on Ceiling(Length * log(In) / log(Out)).
Allocate enough space
Set x to 1
Loop over digits from end to beginning # Horner's method
    Set a to x * digit
    Set t to O - 1
    Loop while a > 0 and t >= 0
        Set a to a + out digit at position t
        Set out digit at position t to a mod to base
        Set a to a / to base
        Set t to t - 1
    Set x to x * from base
Return converted digit(s)
Thus, for base 16 to 2 (which is easy), using "192FE" we read '1' and convert it, then repeat on '9', then '2' and so on giving us '0001', '1001', '0010', '1111', and '1110'.
Note that for bases that are not common powers, such as base 17 to base 2, this would mean reading 1 character and writing 5.
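For reference, the pseudocode above translates to Python roughly as follows. This is a sketch under my own assumptions: digits are integers starting from 0, most significant first, the output length is found with integer arithmetic instead of the logarithm formula (to avoid floating-point rounding), and the name `convert` is made up. Note that `x` and the output buffer both grow with the input, so this version holds the whole result in memory.

```python
def convert(digits, from_base, to_base):
    """Convert a most-significant-first digit list between bases."""
    # Smallest output length whose capacity covers every possible input.
    out_len, capacity = 0, 1
    while capacity < from_base ** len(digits):
        capacity *= to_base
        out_len += 1
    out = [0] * out_len
    x = 1  # from_base ** (position of the current input digit)
    for digit in reversed(digits):
        a = x * digit
        t = out_len - 1
        # Add a into the output number, carrying towards the front.
        while a > 0 and t >= 0:
            a += out[t]
            out[t] = a % to_base
            a //= to_base
            t -= 1
        x *= from_base
    return out


print(convert([0x1, 0x9, 0x2, 0xF, 0xE], 16, 2))
print(convert([1, 0], 10, 2))
```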
