adversary argument for finding n-bit strings

Given:
S, a set containing an odd number of n-bit strings
A, a particular n-bit string
show that any algorithm that decides whether A is in S must examine all n bits of A in the worst case.
Usually of course we would expect to have to look at all the parts of a string to do the matching, but there's something particular about S having an odd size that's escaping me.

Let's say we have an algorithm M that decides membership in S correctly: for any input n-bit string, it says whether the string is in S or not.
Suppose that for a given input n-bit string s1, the algorithm M never looks at bit i of s1 and goes on to say "s1 is in (not in) S". Then the string s2 equal to s1 except with bit i flipped must get the same answer, so s2 is also in (not in) S! That is, for any string we feed into M, if M doesn't look at a particular bit, there is a second string, differing only in that bit, that receives the same verdict.
Then what is special about odd-sized sets S? Think of M as a decision tree: each execution path reads some set of bits and then answers. A path that reads only k < n bits covers 2^(n-k) input strings (every setting of the unread bits), all of which get the same answer; that is an even number of strings. If every path that answers "in S" read fewer than n bits, S would be a disjoint union of even-sized chunks, hence even. Since |S| is odd, there must be at least one string s3 in S that forces M down a path reading all n of its bits.

I guess the odd number clue is to find the end of your set or array in memory.
Assume you are using a 32-bit system.
Perhaps the compiler aligns the data structures of your program in memory on eight-byte boundaries. You have a whole load of string pointers in your data segment. If there is an odd number of strings, the next thing that needs an eight-byte alignment has four bytes of padding in front of it. If there is an even number of strings, there is no padding.

If I understand this correctly, it's irrelevant whether S has an odd or even number of strings. To check that any particular string in S matches an arbitrary string A, you must compare the two character by character. You can stop early if one string is shorter than the other or if a character you're checking doesn't match.

Related

Look for a data structure to match words by letters

Given a list of random lowercase words, all of the same length, and many patterns in which letters at some positions are specified while the other letters are unknown, find all the words that match each pattern.
For example, words list is:
["ixlwnb","ivknmt","vvqnbl","qvhntl"]
And patterns are:
i-----
-v---l
-v-n-l
With a naive algorithm, one can do an O(NL) scan for each pattern, where N is the word count and L is the word length.
But since many patterns may run over the same word list, is there a good data structure in which to preprocess and store the word list, so that matching is efficient for all patterns?
One simple idea is to use an inverted index. First, number your words -- you'll refer to them using these indices rather than the words themselves for speed and space efficiency. Probably the index fits in a 32-bit int.
Now your inverted index: for each letter in each position, construct a sorted list of IDs for words that have that letter in that location.
To do your search, you take the lists of IDs for each of the letters in the positions you're given, and take the intersection of the lists, using an algorithm like the "merge" in merge-sort. All IDs in the intersection match the input.
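A minimal Python sketch of this inverted index (the function names and the "-" blank marker are my own assumptions, not from the question):

```python
from collections import defaultdict

def build_index(words):
    """Map (position, letter) -> sorted list of IDs of words with that letter there."""
    index = defaultdict(list)
    for wid, word in enumerate(words):
        for pos, letter in enumerate(word):
            index[(pos, letter)].append(wid)  # IDs are appended in increasing order
    return index

def intersect_sorted(a, b):
    """Merge-style intersection of two sorted ID lists, as in merge-sort."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def match(index, words, pattern, blank="-"):
    """Return all words matching a pattern such as '-v-n-l'."""
    lists = [index[(pos, ch)] for pos, ch in enumerate(pattern) if ch != blank]
    if not lists:
        return list(words)  # an all-blank pattern matches every word
    ids = lists[0]
    for lst in lists[1:]:
        ids = intersect_sorted(ids, lst)
    return [words[i] for i in ids]
```

With the question's word list, `match(index, words, "-v-n-l")` returns the two words with "v", "n", "l" in the required positions.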
Alternatively, if your words are short enough (12 characters or fewer), you could compress them into 64-bit words, using 5 bits per letter with the letters numbered 1-26. Construct a bit-mask with binary 11111 in places where you have a letter and 00000 in places where you have a blank, and a bit-test from your input with the 5-bit code for each letter in each place, using 00000 where you have blanks. For example, if your input is a-c then your bitmask will be binary 111110000011111 and your bit-test binary 000010000000011. Go through your word list, take the bitwise AND of each word with the bit-mask, and test whether the result equals the bit-test value. This is cache-friendly and the inner loop is tight, so it may be competitive with algorithms that look faster on paper.
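A sketch of this bit-packing variant in Python, using ints in place of 64-bit machine words (the function names are mine):

```python
def pack(word):
    """Pack a lowercase word into an integer, 5 bits per letter, a=1 .. z=26."""
    value = 0
    for ch in word:
        value = (value << 5) | (ord(ch) - ord("a") + 1)
    return value

def pattern_mask(pattern, blank="-"):
    """Build (mask, test): 11111 per known letter in mask, that letter's code in test."""
    mask = test = 0
    for ch in pattern:
        mask <<= 5
        test <<= 5
        if ch != blank:
            mask |= 0b11111
            test |= ord(ch) - ord("a") + 1
    return mask, test

def match_bitwise(words, pattern):
    """Keep the words whose masked bits equal the test value."""
    mask, test = pattern_mask(pattern)
    return [w for w in words if pack(w) & mask == test]
```

For the input "a-c" this reproduces the mask 111110000011111 and the bit-test 000010000000011 from the example above.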
I'll preface this by saying it's more of a comment than an answer (I don't have enough reputation to comment, though). I can't think of any data structure that will satisfy the requirements out of the box. It was interesting to think about, and I figured I'd share one potential solution that popped into my head.
I keyed in on the "same length" part, and figured I could come up with something based on that.
In theory we could have N maps of char -> set, N being the word length.
When a string is added, we go through each character and add the string to the corresponding set. Pseudocode:
firstCharMap[s[0]].insert(s);
secondCharMap[s[1]].insert(s);
thirdCharMap[s[2]].insert(s);
fourthCharMap[s[3]].insert(s);
fifthCharMap[s[4]].insert(s);
sixthCharMap[s[5]].insert(s);
Then to determine which strings match the pattern, we just take an intersection of the sets. For example, "-v-n-l" would be:
intersection of sets: secondCharMap[v], fourthCharMap[n], sixthCharMap[l]
One edge case that jumps out is if I wanted to just get all of the strings, so if that's a requirement--we may also need an additional set of all of the strings.
This solution feels clunky, but I think it could work. Depending on the language, number of strings, etc--I wouldn't be surprised if it performed worse than just iterating over all strings and checking a predicate.
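A runnable Python version of the per-position maps sketched above (using a list of dicts of sets; the names are mine):

```python
from collections import defaultdict

def build_maps(words):
    """One map per position: letter -> set of words having that letter there."""
    maps = [defaultdict(set) for _ in range(len(words[0]))]
    for w in words:
        for i, ch in enumerate(w):
            maps[i][ch].add(w)
    return maps

def query(maps, words, pattern, blank="-"):
    """Intersect the per-position sets for the specified letters."""
    sets = [maps[i][ch] for i, ch in enumerate(pattern) if ch != blank]
    if not sets:
        return set(words)  # the "all blanks" edge case mentioned above
    return set.intersection(*sets)
```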

How good is hash function that is linear combination of values?

I was reading a text about hashing, and I found out that a naive hash code of a char string can be implemented as a polynomial hash function:
h(S0, S1, S2, ..., SN-1) = S0*A^(N-1) + S1*A^(N-2) + S2*A^(N-3) + ... + SN-1*A^0, where Si is the character at index i and A is some integer.
But can't we straightaway sum it as
h(S0, S1, S2, ..., SN-1) = S0*N + S1*(N-1) + S2*(N-2) + ... + SN-1*1?
I see this function as good too, since the two reversed strings hash to different values: 2*S0 + S1 != 2*S1 + S0 (when S0 != S1). But nowhere have I found this type of hash function.
Suppose we work with strings of 30 characters. That's not long, but it's not so short that problems with the hash should arise purely because the strings are too short.
The sum of the weights is 465 (1+2+...+30); with printable ASCII characters that makes the maximum hash 58590, attained by "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~". There are a lot more possible printable ASCII strings of 30 characters than that (95^30 ≈ 2E59), but they all hash into the range of 0 to 58590. Naturally you cannot actually have that many strings at the same time, but you could have a lot more than 58590, and that would guarantee collisions just based on counting (it is very likely to happen much sooner, of course).
The maximum hash grows only slowly: you'd need strings of roughly 8000 characters (126 * n(n+1)/2 >= 2^32 gives n ≈ 8260) before the entire range of a 32-bit integer is used.
The other way, multiplying by powers of A, does not have this problem. (This can be evaluated with Horner's scheme, so no powers need to be calculated explicitly; it still costs only an addition and a multiplication per character, though the naive way is not the fastest way to compute that hash.) The powers of A quickly get big and start wrapping around, which is fine as long as A is odd, so strings of 30 characters stand a good chance of covering the entire range of whatever integer type you're using.
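A small Python sketch of Horner's scheme for the polynomial hash (the choices of A = 31 and a 64-bit wrap are mine, not from the answer):

```python
def poly_hash(s, A=31, mask=(1 << 64) - 1):
    """Evaluate S0*A^(N-1) + S1*A^(N-2) + ... + S(N-1)*A^0 left to right.

    One multiply and one add per character; the mask emulates the
    wrap-around of a fixed-width integer type.
    """
    h = 0
    for ch in s:
        h = (h * A + ord(ch)) & mask
    return h
```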
The problem with a linear hash function is that it's much easier to generate collisions.
Consider a string with 3 chars: S0, S1, S2.
The proposed hash code would be 3 * S0 + 2 * S1 + S2.
Every time we decrease char S2 by two (e.g. e --> c), and increase char S1 by one (e.g. m --> n), we obtain the same hash code.
Even the mere fact that it's so easy to describe a hash-preserving operation should be an alarm (because some algorithm might process the string in exactly that manner). As a more extreme case, consider just summing the characters. In that situation all the anagrams of the original string generate the same hash code, so the hash would be useless in an application processing anagrams.
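Both collisions are easy to reproduce (Python sketch; weighted_hash is the questioner's proposal, sum_hash the extreme case, and the names are mine):

```python
def weighted_hash(s):
    """The proposed linear hash: S0*N + S1*(N-1) + ... + S(N-1)*1."""
    n = len(s)
    return sum(ord(ch) * (n - i) for i, ch in enumerate(s))

def sum_hash(s):
    """Plain character sum: collides on every anagram."""
    return sum(ord(ch) for ch in s)

# "ame" vs "anc": S1 increased by one (m -> n), S2 decreased by two (e -> c),
# so 3*S0 + 2*S1 + S2 is unchanged.
```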

Map string to unique 0..1 float value, while keeping order

I would like to use Redis to sort string values (using sorted sets), but I can only use floats for that purpose. I am looking for an algorithm to convert string to a float 0..1 value, while keeping order.
I mean that s1 < s2 (alphabetically) should imply that f(s1) < f(s2).
Is there such an algorithm?
P.S. I will use such an algorithm for sorting usernames, and in most cases players with matching scores will have quite different usernames. So in most cases either approach should work, though there is still room for collisions. On the other hand, strings will be sorted more or less properly, and it's acceptable if almost-identical usernames are sorted incorrectly.
Each character can be mapped to its ASCII number. If you convert each string to its float equivalent by concatenating all the ASCII numbers (padded with leading zeros so that every character maps to three digits), you will keep the ordering. But if your strings are long, your floats will be huge, and your mapping might not be unique: if several strings begin with the same characters, rounding inside the floats can make them collide.
For example:
'hello' -> 104101108108111
If you know which subsets of characters your strings contain (for instance, only lowercase letters, or only uppercase letters and numbers) you can create your own mapping to use less numbers per character.
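A quick Python sketch of this concatenation mapping (to_key is my name; per the caveat above, strings longer than about five characters start colliding in a 64-bit float):

```python
def to_key(s):
    """Concatenate 3-digit ASCII codes behind a radix point.

    'hello' -> float("0.104101108108111"); ordering of keys follows
    the lexicographic ordering of the strings, up to float precision.
    """
    digits = "".join("%03d" % ord(ch) for ch in s)
    return float("0." + digits) if digits else 0.0
```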
Mathematically, such an algorithm exists and is trivial: Simply put a radix point (“.”) before the string and interpret it as a base-256 numeral (assuming your string uses 8-bit characters). Analogously, if your string had just the characters “0” to “9”, you would read it as a decimal numeral, such as .58229 for the string “58229”. You are doing the same thing, just with base 256 instead of base 10.
Practically, this is not possible without a severely restricted set of potential strings or special floating-point software. Since a typical floating-point object has a finite size, it has a finite number of possible values. E.g., a floating-point object with 64 bits has at most 2^64 values, even neglecting those that stand for special notions such as NaN. Conversely, a string of arbitrary length has infinitely many potential values. Even if you limit the string to something reasonable in today's computer memories, it has hugely more potential values than a normal floating-point object does.
To solve this, you must either decrease the number of potential strings (by limiting their length or otherwise restricting which strings are allowed) or increase the number of potential floating-point values (perhaps by using special arbitrary-precision floating-point software).
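The base-256 construction described above, sketched in Python with Fraction standing in for arbitrary-precision floating point (the names are mine; with fixed-size floats, long shared prefixes would collide, as the answer explains):

```python
from fractions import Fraction

def to_fraction(s):
    """Read the bytes of s as a base-256 numeral placed after the radix point."""
    f = Fraction(0)
    for i, b in enumerate(s.encode("ascii"), start=1):
        f += Fraction(b, 256 ** i)
    return f
```

For NUL-free strings this is exact and strictly order-preserving; converting the result to a fixed-size float reintroduces the collisions discussed above.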

left to right radix sort

Radix sort sorts the numbers starting from the least significant digit to the most significant digit.
I have the following scenario :
My alphabet is the English alphabet, and therefore my "numbers" are English-language strings. The characters of these strings are revealed one at a time, left to right; that is, the most significant digit, for all strings, is revealed first, and so on. At any stage I will have a set of k-character-long strings that is sorted. At this point one more character is revealed for every string, and I want to sort the new set of strings. How do I do this efficiently without starting from scratch?
For example if i had the following sorted set { for, for, sta, sto, sto }
And after one more character each is revealed, the set is { form, fore, star, stop, stoc }
The new sorted set should be {fore, form, star, stoc, stop }
I'm hoping for a complexity of O(n) after each new character is added, where n is the size of the set.
If you want to do this in O(n) you have to somehow keep track of "groups":
for, for | sta | sto, sto
Within these groups, you can sort the strings according to their last character, keeping the set sorted.
Storing groups can be done in various ways. At first sight, I would recommend remembering the offsets of group beginnings/endings. However, this consumes extra memory.
Another possibility might be storing the strings in some kind of prefix tree, which corresponds quite naturally to "adding one char after another", but I don't know if this is suitable for your application.
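One way the group-based step might be sketched in Python (the representation is my own assumption: a sorted list of string IDs plus (start, end) group boundaries):

```python
def reveal_step(order, groups, next_char):
    """
    order:     list of string IDs, sorted by the k characters revealed so far
    groups:    list of (start, end) half-open index ranges into order; the IDs
               in one group all share the same k-character prefix
    next_char: dict mapping each ID to its newly revealed (k+1)-th character
    Returns the new order and groups in O(n), since the alphabet is constant.
    """
    new_order, new_groups = [], []
    for start, end in groups:
        buckets = {}  # letter -> IDs in this group whose next character is that letter
        for wid in order[start:end]:
            buckets.setdefault(next_char[wid], []).append(wid)
        for letter in sorted(buckets):  # at most 26 keys: constant work per group
            ids = buckets[letter]
            new_groups.append((len(new_order), len(new_order) + len(ids)))
            new_order.extend(ids)
    return new_order, new_groups
```

With IDs 0..4 for the example set { for, for, sta, sto, sto } and the newly revealed characters m, e, r, p, c, one step reorders the IDs to fore, form, star, stoc, stop.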

Tokenize valid words from a long string

Suppose you have a dictionary that contains valid words.
Given an input string with all spaces removed, determine whether the string is composed of valid words or not.
You can assume the dictionary is a hashtable that provides O(1) lookup.
Some examples:
helloworld -> hello world (valid)
isitniceinhere -> is it nice in here (valid)
zxyy -> invalid
If a string has multiple possible parsings, just returning true is sufficient.
The string can be very long, hence think of an algorithm that is both space- and time-efficient.
I think the set of all strings that occur as concatenations of valid words (words taken from a finite dictionary) forms a regular language over the alphabet of characters. You can then build a finite automaton that accepts exactly the strings you want; computation time is O(n).
For instance, let the dictionary consist of the words {bat, bag}. Then we construct the following automaton: the states are denoted 0, 1, 2. Edges: (0,1,b), (1,2,a), (2,0,t), (2,0,g), where the triple (x,y,z) means an edge leading from x to y on input z. The only accepting state is 0. In each step, on reading the next input symbol, you have to calculate the set of states that are reachable on that input. Given that the number of states in the automaton is constant, this is of complexity O(n). As for space complexity, I think you can do with O(number of words) with the construction hinted at above.
For another example, with the words {bag, bat, bun, but}, the corresponding automaton has four states: edges (0,1,b), (1,2,a), (1,3,u) for the shared prefixes, and (2,0,g), (2,0,t), (3,0,n), (3,0,t) back to the accepting start state.
Supposing that the automaton has already been built (the time to do this has something to do with the length and number of words :-) we now argue that the time to decide whether a string is accepted by the automaton is O(n) where n is the length of the input string.
More formally, our algorithm is as follows:
1. Let S be a set of states, initially containing the starting state.
2. Read the next input character; let us denote it by a.
3. For each element s in S, determine the state that we move into from s on reading a, that is, the state r such that, with the notation above, (s,r,a) is an edge. Let us denote the set of these states by R. That is, R = {r | s in S, (s,r,a) is an edge}.
4. If R is empty, the string is not accepted and the algorithm halts.
5. If there are no more input symbols, check whether any of the accepting states is in R. (In our case, there is only one accepting state, the starting state.) If so, the string is accepted; if not, the string is not accepted.
6. Otherwise, take S := R and go to 2.
Now, there are as many executions of this cycle as there are input symbols. The only thing we have to examine is that steps 3 and 5 take constant time. Given that the size of S and R is not greater than the number of states in the automaton, which is constant and that we can store edges in a way such that lookup time is constant, this follows. (Note that we of course lose multiple 'parsings', but that was not a requirement either.)
I think this is actually called the membership problem for regular languages, but I couldn't find a proper online reference.
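Here is one way the construction and simulation might look in Python: an NFA stored as a trie whose word-ending edges loop back to the accepting start state (all names are mine, and the set-valued transitions handle words that are prefixes of other words):

```python
def build_automaton(words):
    """NFA over a trie: the last character of each word also leads back to the
    start state 0, which is the only accepting state."""
    trans = [{}]  # trans[state][char] = set of successor states

    def trie_child(state, ch):
        # follow (or create) the deterministic trie edge, skipping the root
        for nxt in trans[state].get(ch, set()):
            if nxt != 0:
                return nxt
        trans.append({})
        child = len(trans) - 1
        trans[state].setdefault(ch, set()).add(child)
        return child

    for word in words:
        state = 0
        for ch in word[:-1]:
            state = trie_child(state, ch)
        trans[state].setdefault(word[-1], set()).add(0)  # word complete -> start
    return trans

def accepts(trans, s):
    """Simulate the NFA in O(len(s)): the state-set size is bounded by a constant."""
    states = {0}
    for ch in s:
        states = {r for q in states for r in trans[q].get(ch, set())}
        if not states:
            return False
    return 0 in states
```

For the dictionary {bat, bag} this reproduces the three-state automaton described above.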
I'd go for a recursive algorithm with implicit backtracking. Function signature: f: input -> result, where input is the string and result is either true or false depending on whether the entire string can be tokenized correctly.
Works like this:
If input is the empty string, return true.
Look at the length-one prefix of input (i.e., the first character). If it is in the dictionary, run f on the suffix of input. If that returns true, return true as well.
If the length-one prefix from the previous step is not in the dictionary, or the invocation of f in the previous step returned false, make the prefix longer by one and repeat at step 2. If the prefix cannot be made any longer (already at the end of the string), return false.
Rinse and repeat.
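The steps above can be sketched in Python like this (can_tokenize and the use of a set as the dictionary are my choices):

```python
def can_tokenize(s, dictionary):
    """Recursive backtracking: try each dictionary prefix, then recurse on the suffix."""
    if s == "":
        return True  # step 1: the empty string tokenizes trivially
    for end in range(1, len(s) + 1):  # steps 2-3: grow the prefix one char at a time
        if s[:end] in dictionary and can_tokenize(s[end:], dictionary):
            return True
    return False  # no prefix worked: backtrack
```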
For dictionaries with a low to moderate number of ambiguous prefixes, this should achieve a pretty good running time in practice (O(n) in the average case, I'd say), though in theory pathological cases with O(2^n) complexity can probably be constructed. However, I doubt we can do any better, since we need backtracking anyway, so the "instinctive" O(n) approach using a conventional pre-computed lexer is out of the question. ...I think.
EDIT: the estimate for the average-case complexity is likely incorrect, see my comment.
Space complexity would be only stack space, so O(n) even in the worst-case.
