Representing a multiset as a sequence of bits - data-structures

We can use a bitmask to represent set presence in a finite (or at least indexed) domain efficiently, for instance to represent the letters in car we could represent this in a 26-bit set like so:
abcdefghijklmnopqrstuvwxyz
10100000000000000100000000
However of course this can only represent presence, not duplicates - carry for instance actually has two rs, but a set cannot represent that.
A multiset represents a count, not just existence, so we can count duplicates, however it's not clear to me if this can be represented logically in a single number.
One idea, suggested by a coworker, would be to use primes as our indices, and represent a multiset by its prime factorization. So our cases above would become:
car = 2^1 * 3^0 * 5^1 * ... * 61^1 * ....
carry = 2^1 * 3^0 * 5^1 * ... * 61^2 * ... 97^1 * 101^0
Is this a sound way to represent multisets? Are there better binary representations of such a concept?

Trivial: Use k bits instead of 1 bit for each element of the universe. Concatenate them to get a single number, if you care about that, but you can equivalently consider it an array of numbers (the bitset equivalent, an array of booleans, is also valid and useful).
This probably takes more space than the prime factor approach, but on the bright side, it's still very space efficient and you can test presence (and extract the count) with an array lookup and some bit fiddling, as opposed to looking up/computing the relevant prime and performing integer division.
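A minimal Python sketch of the k-bits-per-element idea, assuming a universe of the 26 lowercase letters and k = 4 bits per letter (so each count is capped at 15). The names `pack` and `count` are illustrative, not from the answer.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
K = 4                      # bits reserved per element (assumption)
MASK = (1 << K) - 1        # largest count a single field can hold

def pack(word):
    """Pack the letter counts of `word` into one integer, K bits per letter."""
    bits = 0
    for ch in word:
        shift = ALPHABET.index(ch) * K
        if (bits >> shift) & MASK == MASK:
            raise OverflowError(f"more than {MASK} copies of {ch!r}")
        bits += 1 << shift
    return bits

def count(bits, ch):
    """Extract the count of `ch` with a shift and a mask."""
    return (bits >> (ALPHABET.index(ch) * K)) & MASK
```

With this layout, `count(pack("carry"), "r")` is a shift and a mask rather than a division by a prime, which is the trade-off the answer describes.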

Related

Generating a perfect hash function given known list of strings?

Suppose I have a list of N strings, known at compile-time.
I want to generate (at compile-time) a function that will map each string to a distinct integer between 1 and N inclusive. The function should take very little time or space to execute.
For example, suppose my strings are:
{"apple", "orange", "banana"}
Such a function may return:
f("apple") -> 2
f("orange") -> 1
f("banana") -> 3
What's a strategy to generate this function?
I was thinking to analyze the strings at compile time and look for a couple of constants I could mod or add by or something?
The compile-time generation time/space can be quite expensive (but obviously not ridiculously so).
Say you have m distinct strings, and let a_{i,j} be the jth character of the ith string. In the following, I'll assume that they all have the same length. This can be easily translated into any reasonable programming language by treating a_{i,j} as the null character if j ≥ |a_i|.
The idea I suggest is composed of two parts:
1. Find (at most) m - 1 positions differentiating the strings, and store these positions.
2. Create a perfect hash function by considering the strings as length-m vectors, and storing the parameters of the perfect hash function.
Obviously, in general, the hash function must check at least m - 1 positions. It's easy to see this by induction. For 2 strings, at least 1 character must be checked. Assume it's true for i strings: i - 1 positions must be checked. Create a new set of strings by appending 0 to the end of each of the i strings, and add a new string that is identical to one of the strings, except it has a 1 at the end.
Conversely, it's possible to find at most m - 1 positions sufficient for differentiating the strings (for some sets the number might of course be lower, as low as the logarithm of m to the base of the alphabet size). Again, this is easy to see by induction. Two distinct strings must differ at some position. Placing the strings in a matrix with m rows, there must be some column where not all characters are the same. Partitioning the matrix into two or more parts by the character in that column, and applying the argument recursively to each part with at least 2 rows, shows this.
Say the m - 1 positions are p_1, ..., p_{m-1}. In the following, recall the meaning above for a_{i,p_j} when p_j ≥ |a_i|: it is the null character.
Let us define h(a_i) = (∑_{j=1}^{m-1} q_j · a_{i,p_j}) mod n, for random q_j and some n. Then h is known to be a universal hash function: the probability of pair-collision P(x ≠ y ∧ h(x) = h(y)) ≤ 1/n.
Given a universal hash function, there are known constructions for creating a perfect hash function from it. Perhaps the simplest is creating a vector of size m^2 and successively trying the above h with n = m^2 and randomized coefficients, until there are no collisions. The expected number of attempts needed is 2, and the probability that more attempts are needed decreases exponentially.
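The two parts can be sketched in Python as follows. This is an illustration under stated assumptions, not the answer's exact construction: the strings are distinct and contain no NUL characters, and the sum is first reduced modulo a large prime `P` (an addition of mine, so that character codes larger than n cannot make every attempt collide) before being reduced modulo n = m^2. The names `find_positions` and `build_perfect_hash` are hypothetical.

```python
import random

P = 2_147_483_647  # a prime far larger than any character code (assumption)

def char_at(s, j):
    """Character code at position j, or 0 (null) past the end of the string."""
    return ord(s[j]) if j < len(s) else 0

def find_positions(strings):
    """Collect at most m - 1 positions that together tell all strings apart."""
    positions = []
    def split(group):
        if len(group) < 2:
            return
        for j in range(max(len(s) for s in group)):
            buckets = {}
            for s in group:
                buckets.setdefault(char_at(s, j), []).append(s)
            if len(buckets) > 1:          # column j separates this group
                if j not in positions:
                    positions.append(j)
                for part in buckets.values():
                    split(part)
                return
    split(list(strings))
    return positions

def build_perfect_hash(strings, seed=0):
    """Retry random coefficients until no two strings collide in a table of size m^2."""
    rng = random.Random(seed)
    n = len(strings) ** 2
    positions = find_positions(strings)
    while True:
        coeffs = [rng.randrange(1, P) for _ in positions]
        def h(s, coeffs=coeffs):
            return sum(q * char_at(s, p) for q, p in zip(coeffs, positions)) % P % n
        if len({h(s) for s in strings}) == len(strings):
            return h, n
```

Only `positions` and `coeffs` need to be stored in the generated function. The values land in 0..m^2 - 1 rather than 1..N, so a final lookup table would still be needed to get the compact numbering the question asks for.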
It is simple. Make a dictionary and assign 1 to the first word, 2 to the second, ... No need to make things complicated, just number your words.
To make the lookup effective, use trie or binary search or whatever tool your language provides.

Calculating unique value from given numbers

Let's say I have some 6 random numbers and I want to calculate some unique value from these numbers.
Edit:
Allowed operations are +, -, *, and /. Every number can be used only once. You don't have to use all the numbers.
Example:
Given numbers: 3, 6, 100, 50, 25, 75
Requested result: 953
3 + 6 = 9
9 * 100 = 900
900 + 50 = 950
75 / 25 = 3
3 + 950 = 953
What could be easiest algorithmic approach to write a program that solves this problem?
The easiest approach is to try them all: you have six numbers, meaning that there are up to five spots where you can place an operator, and up to 6! permutations. Given that there are only four operators, you need to go through 6!*4^5, or 737280 possibilities. This can be easily done with a recursive function, or even with nested loops. Depending on the language, you could use a library function to deal with permutations.
A language-agnostic recursive approach would have you define three functions:
int calc(int nums[6], int ops[5], int countNums) {
    // Calculate the result for a given sequence of numbers
    // with the specified operators.
    // nums are your numbers; only countNums need to be used
    // ops are your operators; only countNums-1 need to be used
    // countNums is the number of items to use; it must be from 1 to 6
}
void permutations(int nums[6], int perm[6], int pos) {
    // Produces all permutations of the original numbers
    // nums are the original numbers
    // perm, 0 through pos, holds the indexes of nums used in the permutation so far
    // pos is the number of perm items filled so far
}
void solveRecursive(int numPerm[6], int permLen, int ops[5], int pos) {
    // Tries all combinations of operations on the given permutation.
    // numPerm is the permutation of the original numbers
    // permLen is the number of items used in the permutation
    // ops 0 through pos are operators to be placed between elements
    // of the permutation
    // pos is the number of operators provided so far.
}
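For comparison, here is a compact Python sketch of the same try-everything idea, using `itertools` in place of the hand-written permutation and operator recursion. It assumes operators are applied strictly left to right and uses `Fraction` so that division stays exact; `solve` and `OPS` are illustrative names.

```python
from fractions import Fraction
from itertools import permutations, product
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def solve(numbers, target):
    """Return (numbers_used, operators) reaching target, or None if none exists."""
    for count in range(1, len(numbers) + 1):          # how many numbers to use
        for perm in permutations(numbers, count):     # which numbers, in which order
            for ops in product(OPS, repeat=count - 1):
                value = Fraction(perm[0])
                try:
                    for op, num in zip(ops, perm[1:]):
                        value = OPS[op](value, num)    # evaluate left to right
                except ZeroDivisionError:
                    continue
                if value == target:
                    return perm, ops
    return None
```

Calling `solve([3, 6, 100, 50, 25, 75], 953)` searches the same space the answer counts. Expressions that need parentheses around a later sub-result are only reachable here when an equivalent left-to-right ordering exists.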
The easiest algorithmic approach would, I think, be backtracking. It's fairly easy to implement and will always find a solution if one exists. The basic idea is recursive: make an arbitrary choice at each step of building a solution and proceed from there. If it doesn't work out, try a different choice. When you run out of choices, report failure to the previous choice point (or report failure to find a solution if there is no previous choice point).
Your choices are: how many numbers will be involved, what each number is (a choice each number position), and how they are connected by operators (a choice for each operator position).
When you mention "unique numbers", I'm assuming you mean a result in the universe of possible results generated using all the numbers at hand.
If so, why not try a permutation of all operators and available numbers for a start?
If you want to guarantee that you generate a unique number from those numbers, with no chance of getting the same number from a different set of numbers, then you should use radix arithmetic, similar to decimal, hex, etc.
But you need to know the maximum value of each number. Below, MAX_X stands for the number of distinct values X can take, i.e. one more than its largest value, so that 0 <= X < MAX_X.
Basically, it would be A + B * MAX_A + C * MAX_A * MAX_B + D * MAX_A * MAX_B * MAX_C + E * MAX_A * MAX_B * MAX_C * MAX_D + F * MAX_A * ... * MAX_E
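A small Python sketch of this mixed-radix packing, assuming each value is a non-negative integer below a known bound. The `sizes` list (number of possible values per slot) and the function names are assumptions for illustration.

```python
def encode(values, sizes):
    """Pack values into one integer; values[i] must satisfy 0 <= values[i] < sizes[i]."""
    code = 0
    multiplier = 1
    for value, size in zip(values, sizes):
        if not 0 <= value < size:
            raise ValueError(f"{value} is outside 0..{size - 1}")
        code += value * multiplier
        multiplier *= size
    return code

def decode(code, sizes):
    """Recover the original values by repeated division."""
    values = []
    for size in sizes:
        code, value = divmod(code, size)
        values.append(value)
    return values
```

Because `decode` inverts `encode`, two different sequences can never share a code, which is the uniqueness guarantee described above.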
Use recursion to permute the numbers and operators. It's O(6!*4^5).

Bijective "Integer <-> String" function

Here's a problem I'm trying to create the best solution for. I have a finite set of non-negative integers in the range of [0...N]. I need to be able to represent each number in this set as a string and be able to convert such string backwards to original number. So this should be a bijective function.
Additional requirements are:
1. String representation of a number should obfuscate original number at least to some degree. So primitive solution like f(x) = x.toString() will not work.
2. String length is important: the less the better.
3. If one knows the string representation of K, I would like it to be non-trivial (to some degree) to guess the string representation of K+1.
For p.1 & p.2 the obvious solution is to use something like Base64 (or whatever BaseXXX to fit all the values) notation. But can we fit into p.3 with minimal additional effort? Common sense tells me that I additionally need a bijective "String <-> String" function for BaseXXX values. Any suggestions?
Or maybe there's something better than BaseXXX to use to fit all 3 requirements?
If you do not need this to be too secure, you can just use a simple symmetric cipher after encoding in BaseXXX. For example you can choose a key sequence of integers [n₁, n₂, n₃...] and then use a Vigenere cipher.
The basic idea behind the cipher is simple--encode each character C as C + K (mod 26) where K is an element from the key. As you go along, just get the next number from the key for the next character, wrapping around once you run out of values in the key.
You really have two options here: you can first convert a number to a string in baseXXX and then encrypt, or you can use the same idea to just encrypt each number as a single character. In that case, you would want to change it from mod 26 to mod N + 1.
Come to think of it, an even simpler option would be to just xor the element from the key and the value. (As opposed to using the Vigenere formula.) I think this would work just as well for obfuscation.
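As a rough Python sketch of the "encode in BaseXXX, then apply a Vigenère-style shift" option: the 62-symbol alphabet, the fixed output width, and the key values below are all assumptions chosen for illustration. The shift is taken modulo the alphabet size instead of 26.

```python
import string

ALPHABET = string.digits + string.ascii_letters   # 62 symbols (assumption)
BASE = len(ALPHABET)
KEY = [17, 4, 51, 30]                              # hypothetical secret key

def encode(number, width):
    """Write number in base 62 (least significant digit first), shifting each digit by the key."""
    if not 0 <= number < BASE ** width:
        raise ValueError("number does not fit in the requested width")
    chars = []
    for i in range(width):
        number, digit = divmod(number, BASE)
        chars.append(ALPHABET[(digit + KEY[i % len(KEY)]) % BASE])
    return "".join(chars)

def decode(text):
    """Undo the key shift and rebuild the number from its base-62 digits."""
    number = 0
    for i in reversed(range(len(text))):
        digit = (ALPHABET.index(text[i]) - KEY[i % len(KEY)]) % BASE
        number = number * BASE + digit
    return number
```

This is a bijection on 0..62^width - 1, but it is only light obfuscation: consecutive numbers usually differ in a single character, so it satisfies the third requirement weakly at best.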
This method meets requirements 1-3, but it is perhaps a bit too computationally expensive:
find a prime p > N+2, not too much larger
find a primitive root g modulo p, that is, a number whose multiplicative order modulo p is p-1
for 0 <= k <= N, let enc(k) = min {j > 0 : g^j == (k+2) (mod p)}
f(k) = enc(k).toString()
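The steps above can be sketched in Python for a small case. The brute-force primitive-root check and the precomputed logarithm table are my own simplifications and only make sense for small p; `make_codec` is a hypothetical name.

```python
def is_primitive_root(g, p):
    """True if the powers of g hit every nonzero residue modulo the prime p."""
    seen = set()
    x = 1
    for _ in range(p - 1):
        x = x * g % p
        seen.add(x)
    return len(seen) == p - 1

def make_codec(p, g):
    """Build enc/dec for 0 <= k <= p - 3 using discrete logarithms base g modulo p."""
    log = {}
    x = 1
    for j in range(1, p):
        x = x * g % p
        log[x] = j                  # g^j == x (mod p), with 1 <= j <= p - 1
    def enc(k):
        return str(log[k + 2])
    def dec(s):
        return pow(g, int(s), p) - 2
    return enc, dec
```

For example, with N = 20 one can take p = 23 and g = 5. Decoding is a single modular exponentiation, while encoding is the expensive discrete-logarithm direction that the answer warns about.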
Construct a table of length M. This table should map the numbers 0 through M-1 to distinct short strings with a random ordering. Express the integer as a base-M number, using the strings from the table to represent the digits in the number. Decode with a straightforward reversal.
With M=26, you could just use a letter for each of the digits. Or take M=256 and use a byte for each digit.
Not even remotely a good cryptographic approach!
So you need a string that obfuscates the original number, but allows one to determine str(K+1) when str(K) is known?
How about just doing f(x) = (x + a).toString(), where a is secret? Then an outside user can't determine x from f(x), but they can be confident that if they have a string "1234", say, for an unknown x then "1235" maps to x+1.
p. 1 and p. 3 are slightly contradicting and a bit vague, too.
I would propose using hex representation of the integer numbers.
17 => 0x11
123123 => 0x1E0F3

Lists Hash function

I'm trying to make a hash function so I can tell if two lists with same sizes contain the same elements.
For example, this is what I want:
f((1 2 3))=f((1 3 2))=f((2 1 3))=f((2 3 1))=f((3 1 2))=f((3 2 1)).
Any idea how I can approach this problem? I've tried summing the squares of all elements, but it turned out that there are collisions, for example f((2 2 5))=33=f((1 4 4)), which is wrong as the lists are not the same.
I'm looking for a simple approach if there is any.
Sort the list and then:
hash = 0
list.each do |current_element|
  hash = (37 * hash + current_element) % MAX_HASH_VALUE
end
You're probably out of luck if you really want no collisions. There are N choose k sets of size k with elements in 1..N (and worse, if you allow repeats). So imagine you have N=256, k=8, then N choose k is ~4 x 10^14. You'd need a very large integer to distinctly hash all of these sets.
Possibly you have N, k such that you could still make this work. Good luck.
If you allow occasional collisions, you have lots of options. From simple things like your suggestion (add squares of elements) and computing xor the elements, to complicated things like sort them, print them to a string, and compute MD5 on them. But since collisions are still possible, you have to verify any hash match by comparing the original lists (if you keep them sorted, this is easy).
So you are looking for something that provides these properties:
1. If h(x1) == y1, then there is an inverse function h_inverse(y1) == x1
2. Because the inverse function exists, there cannot be a value x2 such that x1 != x2, and h(x2) == y1.
Knuth's Multiplicative Method
In Knuth's "The Art of Computer Programming", section 6.4, a multiplicative hashing scheme is introduced as a way to write a hash function. The key is multiplied by 2654435761, which is approximately 2^32 divided by the golden ratio, to produce a hash result.
hash(i)=i*2654435761 mod 2^32
Since 2654435761 and 2^32 have no common factors, the multiplication produces a complete mapping of the key to hash result with no overlap. This method works pretty well if the keys have small values. Bad hash results are produced if the keys vary only in the upper bits. As is true in all multiplications, variations of upper digits do not influence the lower digits of the multiplication result.
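A short Python sketch of the claims above, assuming 32-bit keys. The inverse function is my addition to make the "no overlap" property concrete; it relies on `pow(a, -1, m)`, available in Python 3.8 and later.

```python
A = 2654435761          # odd, hence coprime to 2**32
M = 2 ** 32
A_INV = pow(A, -1, M)   # modular inverse of A (Python 3.8+)

def knuth_hash(key):
    """Multiplicative hash of a 32-bit key."""
    return (key * A) % M

def knuth_unhash(value):
    """Inverse of knuth_hash, showing that the mapping is a bijection on 32-bit values."""
    return (value * A_INV) % M
```

The weakness mentioned in the text is also easy to observe: `knuth_hash(5) % 256` and `knuth_hash(5 + (7 << 8)) % 256` are equal, because the low 8 bits of the product depend only on the low 8 bits of the key.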
Robert Jenkins' 96 bit Mix Function
Robert Jenkins has developed a hash function based on a sequence of subtraction, exclusive-or, and bit shift.
All the sources in this article are written as Java methods, where the operator '>>>' represents the concept of unsigned right shift. If the source were to be translated to C, then the Java 'int' data type should be replaced with C 'uint32_t' data type, and the Java 'long' data type should be replaced with C 'uint64_t' data type.
The following source is the mixing part of the hash function.
int mix(int a, int b, int c)
{
    a=a-b; a=a-c; a=a^(c >>> 13);
    b=b-c; b=b-a; b=b^(a << 8);
    c=c-a; c=c-b; c=c^(b >>> 13);
    a=a-b; a=a-c; a=a^(c >>> 12);
    b=b-c; b=b-a; b=b^(a << 16);
    c=c-a; c=c-b; c=c^(b >>> 5);
    a=a-b; a=a-c; a=a^(c >>> 3);
    b=b-c; b=b-a; b=b^(a << 10);
    c=c-a; c=c-b; c=c^(b >>> 15);
    return c;
}
You can read details from here
If all the elements are numbers and they have a maximum, this is not too complicated: sort the elements, then write them one after the other as digits in base maximum+1.
Hard to describe in words...
For example, if your maximum is 9 (that makes it easy to understand), you'd have :
f(2 3 9 8) = f(3 8 9 2) = 2389
If your maximum were 99, you'd have:
f(16 2 76 8) = (0)2081676
In your example with 2,2 and 5, if you know you would never get anything higher than 5, you could "compose" the result in base 6, so that would be :
f(2 2 5) = 2*6^2 + 2*6 + 5 = 89
f(1 4 4) = 1*6^2 + 4*6 + 4 = 64
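The same computation as a minimal Python sketch, assuming non-negative integers no larger than a known `maximum` and lists of equal length (as in the question). `order_free_code` is an illustrative name.

```python
def order_free_code(items, maximum):
    """Sort the items and read them as digits of a number in base maximum + 1."""
    base = maximum + 1
    code = 0
    for item in sorted(items):
        if not 0 <= item <= maximum:
            raise ValueError(f"{item} is outside 0..{maximum}")
        code = code * base + item
    return code
```

For lists of a fixed length this is collision-free, at the cost of a result that grows with both the list length and the maximum.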
Combining hash values is hard, I've found this way (no explanation, though perhaps someone would recognize it) within Boost:
template <class T>
void hash_combine(size_t& seed, T const& v)
{
    seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}
It should be fast since there is only shifting, additions and xor taking place (apart from the actual hashing).
However, the requirement that the order of the list does not influence the end result would mean that you first have to sort it, which is an O(N log N) operation, so it may not fit.
Also, since it's impossible to provide a collision-free hash function without more stringent bounds, you'll still have to actually compare the sorted lists whenever the hashes are equal.
I'm trying to make a hash function so I can tell if two lists with same sizes contain the same elements.
[...] but it turned out that there are collisions
These two sentences suggest you are using the wrong tool for the job. The point of a hash (unless it is a 'perfect hash', which doesn't seem appropriate to this problem) is not to guarantee equality, or to provide a unique output for every given input. In the usual case it cannot, because there are more potential inputs than potential outputs.
Whatever hash function you choose, your hashing system is always going to have to deal with the possibility of collisions. And while different hashes imply inequality, it does not follow that equal hashes imply equality.
As regards your actual problem: a start might be to sort the list in ascending order, then use the sorted values as if they were the prime powers in the prime decomposition of an integer. Reconstruct this integer (modulo the maximum hash value) and there is a hash value.
For example:
2 1 3
sorted becomes
1 2 3
Treating this as prime powers gives
2^1 * 3^2 * 5^3
which evaluates to
2 * 9 * 125 = 2250
giving 2250 as your hash value, which will be the same hash value as for any other ordering of 1 2 3, and also different from the hash value for any other sequence of three numbers that do not overflow the maximum hash value when computed.
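A Python sketch of this prime-power construction, assuming the elements are non-negative integers. The trial-division prime generator and the optional `modulus` parameter are illustrative choices of mine.

```python
def first_primes(count):
    """Return the first `count` primes by trial division (fine for short lists)."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

def prime_power_hash(items, modulus=None):
    """Use the sorted items as exponents of 2, 3, 5, ... and multiply the powers."""
    code = 1
    for prime, exponent in zip(first_primes(len(items)), sorted(items)):
        code *= prime ** exponent
    return code if modulus is None else code % modulus
```

Without a modulus the value is unique per multiset of a given size, by unique factorization, but it grows very quickly. Once it is reduced modulo a maximum hash value, collisions become possible again.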
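A naïve approach to solving your essential problem (comparing lists in an order-insensitive manner) is to convert all lists being compared to a set (set in Python or HashSet in Java). Note that a plain set discards duplicates, so if the lists can repeat elements, as in (2 2 5), use a multiset instead (for example collections.Counter in Python) or compare sorted copies. This is more effective than making a hash function, since your problem seems to require a perfect hash. For almost any other approach, collisions are inevitable for some inputs.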
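A minimal Python sketch of the order-insensitive comparison. It uses `collections.Counter` rather than `set`, on the assumption that duplicates matter (as in the question's (2 2 5) example); `same_elements` is an illustrative name.

```python
from collections import Counter

def same_elements(first, second):
    """True if both lists hold the same elements with the same multiplicities."""
    return Counter(first) == Counter(second)
```

A plain `set` comparison would treat `[2, 2, 5]` and `[2, 5, 5]` as equal, which is why the multiset is used here.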

Map strings to numbers maintaining the lexicographic ordering

I'm looking for an algorithm or function that is able to map a string to a number in such a way that the resulting values correspond to the lexicographic ordering of the strings. Example:
"book" -> 50000
"car" -> 60000
"card" -> 65000
"a longer string" -> 15000
"another long string" -> 15500
"awesome" -> 16000
As a function it should be something like: f(x) = y, so that for any x1 < x2 => f(x1) < f(x2), where x is an arbitrary string and y is a number.
If the input set of x is finite, then I could always do a sort and assign the proper values, but I'm looking for something generic for an unlimited input set for x.
If you require that f map to integers this is impossible.
Suppose that there is such a map f. Consider the strings a, aa, aaa, etc. Consider the values f(a), f(aa), f(aaa), etc. As we require that f(a) < f(aa) < f(aaa) < ... we see that f(a_n) tends to infinity as n tends to infinity; here I am using the obvious notation that a_n is the character a repeated n times. Now consider the string b. We require that f(a_n) < f(b) for all n. But f(b) is some finite integer and we just showed that f(a_n) goes to infinity. We have a contradiction. No such map is possible.
Maybe you could tell us what you need this for? This is fairly abstract and we might be able to suggest something more suitable. Further, don't necessarily worry about solving "it" generally. YAGNI and all that.
As a corollary to Jason's answer, if you can map your strings to rational numbers, such a mapping is very straightforward. If code(c) is the ASCII code of the character c and s[i] is the ith character in the string s, just sum as follows:
result <- 0
scale <- 1
for i from 1 to length(s)
    scale <- scale / 27
    index <- (1 + code(s[i]) - code('a'))
    result <- result + index * scale
end for
return result
This maps the empty string to 0, and every other string of lowercase letters to a rational number between 0 and 1, maintaining lexicographical order. The base is 27 rather than 26 so that the digit 0 is reserved for "end of string"; with base 26, "az" and "b" would map to the same value. If you have arbitrary-precision decimal floating-point numbers, you can replace the division by powers of 27 with powers of 100 and still have exactly representable numbers; with arbitrary-precision binary floating-point numbers, you can divide by powers of 32.
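A Python sketch of this mapping using exact rationals, assuming strings made only of the lowercase letters a to z and a base of 27, so that each letter is a digit from 1 to 26 and 0 is left free to mean "end of string". `string_to_rational` is an illustrative name.

```python
from fractions import Fraction

def string_to_rational(s):
    """Map a lowercase string to a rational in [0, 1) that preserves lexicographic order."""
    result = Fraction(0)
    scale = Fraction(1)
    for ch in s:
        scale /= 27                                   # next base-27 digit position
        result += (1 + ord(ch) - ord("a")) * scale    # letters become digits 1..26
    return result
```

Sorting a list of such strings with `key=string_to_rational` gives the same order as a plain lexicographic sort. The denominators grow as 27^length, which is the price of mapping an unbounded set of strings into a bounded interval.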
What you are asking for is a temporary suspension of the pigeonhole principle (http://en.wikipedia.org/wiki/Pigeonhole_principle).
The strings are the pigeons, the numbers are the holes.
There are more pigeons than holes, so you can't put each pigeon in its own hole.
You would be much better off writing a comparator which you can supply to a sort function. The comparator takes two strings and returns -1, 0, or 1. Even if you could create such a map, you still have to sort on it. If you need both a "hash" and the order, then keep stuff in two data structures - one that preserves the order, and one that allows fast access.
Maybe a Radix Tree is what you're looking for?
A radix tree, Patricia trie/tree, or crit bit tree is a specialized set data structure based on the trie that is used to store a set of strings. In contrast with a regular trie, the edges of a Patricia trie are labelled with sequences of characters rather than with single characters. These can be strings of characters, bit strings such as integers or IP addresses, or generally arbitrary sequences of objects in lexicographical order. Sometimes the names radix tree and crit bit tree are only applied to trees storing integers and Patricia trie is retained for more general inputs, but the structure works the same way in all cases.
LWN.net also has an article describing this data structure's use in the Linux kernel.
I have posted a question here: https://stackoverflow.com/questions/22798824/what-lexicographic-order-means
As a workaround, you can append empty symbols with code zero to the right side of the string, and use the expansion from case II.
Without such an expansion with extra empty symbols, I actually don't know how to make such a mapping.
But if you have a finite set of symbols (V), then |V*| is equal to |N|, a fact from discrete math.

Resources