Convert rank-per-candidate format to OpenSTV BLT format

I recently gathered, using a questionnaire, a set of opinions on the importance of various software components. Figuring that some form of Condorcet voting method would be the best way to obtain an overall rank, I opted to use OpenSTV to analyze it.
My data is in tabular format, space delimited, and looks more or less like:
A B C D E F G # Candidates
5 2 4 3 7 6 1 # First ballot. G is ranked first, and E is ranked 7th
4 2 6 5 1 7 3 # Second ballot
etc
In this format, the number indicates the rank and the sequence order indicates the candidate.
Each "candidate" has a rank (required) from 1 to 7, where a 1 means most important and a 7 means least important. No duplicates are allowed.
This format struck me as the most natural way to represent the output, being a direct representation of the ballot format.
The OpenSTV/BLT format uses a different method of representing the same info, conceptually as follows:
G B D C A F E # Again, G is ranked first and E is ranked 7th
E B G A D C F #
etc
The actual numeric file format uses the (1-based) index of the candidate, rather than the label, and so is more like:
7 2 4 3 1 6 5 # Same ballots as before.
5 2 7 1 4 3 6 # A -> 1, G -> 7
In this format, the number indicates the candidate, and the sequence order indicates the rank. The actual, real, BLT format also includes a leading weight and a following zero to indicate the end of each ballot, which I don't care too much about for this.
My question is, what is the most elegant way to convert from the first format to the (numeric) second?

Here's my solution in Python; it works OK but feels a little clumsy. I'm sure there's a cleaner way (perhaps in another language?).
This took me longer than it should have to wrap my head around yesterday afternoon, so maybe somebody else can use this too.
Given:
ballot = '5 2 4 3 7 6 1'
Python one(ish)-liner to convert it:
rank = [c for r, c in sorted((int(r), i + 1) for i, r in enumerate(ballot.split()))]
rank = " ".join(str(c) for c in rank)
Alternatively, in a slightly more understandable form:
# Split into a list and convert to integers
int_ballot = [int(x) for x in ballot.split()]
# This is the important bit.
# enumerate(int_ballot) yields pairs of (zero-based-candidate-index, rank)
# Use a list comprehension to swap to (rank, one-based-candidate-index)
ranked_ballot = [(rank,index+1) for index,rank in enumerate(int_ballot)]
# Sort by the ranking. Python sorts tuples in lexicographic order
# (ie sorts on first element)
# Use a comprehension to extract the candidate from each pair
rank = " ".join([candidate for rank,candidate in sorted(ranked_ballot)])

Related

Checking if two strings are equal after removing a subset of characters from both

I recently came across this problem:
You are given two strings, s1 and s2, comprised entirely of lowercase letters 'a' through 'r', and need to process a series of queries. Each query provides a subset of lowercase English letters from 'a' through 'r'. For each query, determine whether s1 and s2, when restricted only to the letters in the query, are equal.
s1 and s2 can contain up to 10^5 characters, and there are up to 10^5 queries.
For instance, if s1 is "aabcd" and s2 is "caabd", and you are asked to process a query with the subset "ac", then s1 becomes "aac" while s2 becomes "caa". These don't match, so the query would return false.
I was able to solve this in O(N^2) time as follows: for each query, I checked whether s1 and s2 would be equal by iterating through both strings one character at a time, skipping the characters that do not lie within the subset of allowed characters, and checking that the "allowed" characters from both s1 and s2 match. If at some point the characters don't match, the strings are not equal; otherwise, s1 and s2 are equal when restricted to the letters in the query. Each query takes O(N) time to process, and there are N queries, for a total of O(N^2) time.
However, I was told that there was a way to solve this faster in O(N). Does anyone know how this might be done?
The first obvious speedup is to ensure your set membership test is O(1). There are a couple of options (a sketch of the first follows this list):
Represent every letter as a single bit -- each character becomes an 18-bit value with only one bit set. The set of allowed characters is then a mask with these bits ORed together, and you can test membership of a character with a bitwise-AND;
Alternatively, you can have an 18-value array and index it by character (c - 'a' gives a value between 0 and 17). The test for membership is then basically the cost of an array lookup (and you can save an operation by not doing the subtraction -- instead just make the array larger and index directly by character).
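As a minimal Python sketch of the first option (names are mine; 'a'..'r' map to bits 0..17):
def char_bit(c):
    # One bit per letter: 'a' -> bit 0, ..., 'r' -> bit 17
    return 1 << (ord(c) - ord('a'))

allowed = 0
for ch in "ac":                 # the query's letter subset
    allowed |= char_bit(ch)

print(bool(char_bit('a') & allowed))  # True: O(1) membership test
print(bool(char_bit('b') & allowed))  # False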
Thought experiment
The next potential speedup is to recognize that any character which does not appear exactly the same number of times in both strings instantly fails the match. You can count all character frequencies in both strings with a histogram, which takes O(N) time. This lets you prune the search space: if such a character appears in a query, you can reject that query in constant time.
Of course, that won't help against a real stress test, which will make sure every letter appears the same number of times in both strings. So what do you do then?
Well, you extend the above premise by recognizing that if some position of character x in string 1 and some position of that character in string 2 would be a valid match (i.e. the same number of occurrences of x appear in both strings up to those respective positions), then the total count of every other character up to those positions must also be equal. Any character for which that is not true cannot possibly be compatible with character x.
Concept
Let's start by thinking about this in terms of memoization, a technique where you leverage precomputed or partially-computed information. Consider two strings like this:
a b a a c d e | e a b a c a d
What useful thing can you do here? Well, why not store the partial sums of counts for each letter:
a b a a c d e | e a b a c a d
----------------------|----------------------
freq(a) = 1 1 2 3 3 3 3 | 0 1 1 2 2 3 3
freq(b) = 0 1 1 1 1 1 1 | 0 0 1 1 1 1 1
freq(c) = 0 0 0 0 1 1 1 | 0 0 0 0 1 1 1
freq(d) = 0 0 0 0 0 1 1 | 0 0 0 0 0 0 1
freq(e) = 0 0 0 0 0 0 1 | 1 1 1 1 1 1 1
This uses a whole lot of memory, but don't worry -- we'll deal with that later. Instead, take the time to absorb what we're actually doing.
Looking at the table above, we have the running character count totals for both strings at every position in those strings.
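As a sketch (assuming the 18-letter alphabet 'a'..'r'; the function name is mine), such a table can be built in one O(18N) pass per string:
def prefix_counts(s, alpha=18):
    # counts[i][x] = number of occurrences of letter x in s[0..i]
    running = [0] * alpha
    counts = []
    for ch in s:
        running[ord(ch) - ord('a')] += 1
        counts.append(list(running))   # O(18) snapshot per position
    return counts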
Now let's see how our matching rules work by showing an example of a matching query "ab" and a non-matching query "acd":
For "ab":
a b a a c d e | e a b a c a d
----------------------|----------------------
freq(a) = 1 1 2 3 3 3 3 | 0 1 1 2 2 3 3
freq(b) = 0 1 1 1 1 1 1 | 0 0 1 1 1 1 1
^ ^ ^ ^ ^ ^ ^ ^
We scan the frequency arrays until we locate one of our letters in the query. The locations I have marked with ^ above. If you remove all the unmarked columns, you'll see the remaining columns match on both sides. So this is a match.
For "acd":
a b a a c d e | e a b a c a d
----------------------|----------------------
freq(a) = 1 1 2 3 3 3 3 | 0 1 1 2 2 3 3
freq(c) = 0 0 0 0 1 1 1 | 0 0 0 0 1 1 1
freq(d) = 0 0 0 0 0 1 1 | 0 0 0 0 0 0 1
^ ^ # # ^ ^ ^ # # ^
Here, all columns are matching except those marked with #.
Putting it together
All right, so you can see how this works, but you may be wondering about the runtime, because those examples above seem to do even more scanning than you were doing before!
Here's where things get interesting, and where our character frequency counts really kick in.
First, observe what we're actually doing on those marked columns. For any one character that we care about (for example, the 'a'), we're looking for only the positions in both strings where its count matches, and then we're comparing these two columns to see what other values match. This gives us a set of all other characters that are valid when used with 'a'. And of course, 'a' itself is also valid.
And what was our very first optimization? A bitset -- an 18-bit value that represents valid characters. You can now use this. For the two columns in each string, you set a 1-bit for characters with matching counts and a 0-bit for characters with non-matching counts. If you process every pair of matching 'a' positions in this manner, you get a collection of such sets. And you can keep a running "master" set that represents their intersection -- you just intersect it with each intermediate set you calculate, which is a single bitwise-AND.
By the time you reach the end of both strings, you have performed an O(N) scan, examining 18 rows each time you encountered an 'a'. The result is the set of all characters that work with 'a'.
Now repeat for the other characters, one at a time. Each pass is an O(N) scan as above, and you wind up with the set of all other characters that work with the one you're processing.
After processing all the characters, you have an array of 18 values, where each entry is the set of all characters that work with one particular character. The whole operation took O(18N) time and O(18N) storage.
Query
Since you now have an array giving, for each character, the set of all characters that work with it, you simply look up each character in the query and intersect their sets (again, with a bitwise-AND). One further intersection, with the set of all characters present in the query, prunes off the characters that are not relevant.
This leaves you with the set of all query characters that can result in a matching string. If this set is equal to the query set, you have a match. Otherwise, you don't.
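A sketch of that query step in Python (assuming masks[c] holds the compatibility set for character c, built as described above):
def run_query(masks, query):
    # Bitset of the letters present in the query
    qmask = 0
    for ch in set(query):
        qmask |= 1 << (ord(ch) - ord('a'))
    # Intersect the compatibility sets of the query's letters,
    # then prune to the letters actually in the query
    result = qmask
    for ch in set(query):
        result &= masks[ord(ch) - ord('a')]
    return result == qmask   # match iff every query letter survives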
This is the part that is now fast. It has essentially reduced your query tests to constant time. However, the original indexing did cost us a lot of memory. What can be done about that?
Dynamic Programming
Is it really necessary to allocate all that storage for our frequency arrays?
Well actually, no. Laying the counts out in tabular form was useful as a visual tool to explain the method conceptually, but most of those values are never actually used, and the full table makes the method seem more complicated than it is.
The good news is that we can compute our master sets at the same time as the character counts, without storing any frequency arrays. Remember that when we compute the counts we use a histogram, which is as simple as one small 18-value array where you say count[c] += 1 (where c is a character or an index derived from it). So we can save a ton of memory if we just do the following:
Processing the set (mask) of all compatible characters for character c:
1. Initialize the mask for character c to all 1s (mask[c] = (1 << 18) - 1) -- this represents that all characters are currently compatible. Initialize a character histogram (count) to all zeros.
2. Walk through string 1 until you reach character c. For every character x you encounter along the way, increase its count in the histogram (count[x]++).
3. Walk through string 2 until you reach character c. For every character x you encounter along the way, decrease its count in the histogram (count[x]--).
4. Construct a 'good' set in which any character that currently has a zero count gets a 1-bit, otherwise a 0-bit. Intersect this with the current mask for c (using bitwise-AND): mask[c] &= good.
5. Continue from step 2 until you have reached the end of both strings. If you reach the end of one of the strings prematurely, then the character counts do not match, so you set the mask for this character to zero: mask[c] = 0.
6. Repeat from step 1 for every character, until all characters are done (see the sketch below).
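Here is a compact Python sketch of steps 1-6 (the names are mine; letters 'a'..'r' map to 0..17):
def build_masks(s1, s2):
    ALPHA = 18
    full = (1 << ALPHA) - 1
    masks = []
    for c in range(ALPHA):
        count = [0] * ALPHA      # net count: occurrences in s1 minus s2
        mask = full              # step 1: everything starts compatible
        i = j = 0
        while True:
            # step 2: walk s1 up to the next occurrence of c
            while i < len(s1) and ord(s1[i]) - ord('a') != c:
                count[ord(s1[i]) - ord('a')] += 1
                i += 1
            # step 3: walk s2 up to the next occurrence of c
            while j < len(s2) and ord(s2[j]) - ord('a') != c:
                count[ord(s2[j]) - ord('a')] -= 1
                j += 1
            if i == len(s1) and j == len(s2):
                break            # both strings exhausted together: done
            if i == len(s1) or j == len(s2):
                mask = 0         # step 5: unbalanced number of c's
                break
            # step 4: letters whose running counts agree stay compatible
            good = 0
            for x in range(ALPHA):
                if count[x] == 0:
                    good |= 1 << x
            mask &= good
            i += 1               # step over this matching pair of c's
            j += 1
        masks.append(mask)
    return masks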
Above, we have the same O(18N) time complexity, but with absolutely minimal extra storage -- only one array of counts (reused for each character) and one array of masks.
Combining techniques like the above to solve seemingly complex combinatorial problems really fast is commonly referred to as Dynamic Programming. We reduced the problem down to a truth table representing all possible characters that work with any single character. The time complexity remains linear with respect to the length of the strings, and only scales by the number of possible characters.
Here is the algorithm above hacked together in C++: https://godbolt.org/z/PxzYvGs8q
Let RESTRICT(s,q) be the restriction of string s to the letters in the set q.
If q contains more than two letters, then the full string RESTRICT(s,q) can be reconstructed from all the strings RESTRICT(s,qij) where qij is a pair of letters in q.
Therefore, RESTRICT(s1,q) = RESTRICT(s2,q) if and only if RESTRICT(s1,qij) = RESTRICT(s2,qij) for all pairs qij in q.
Since you are restricted to 18 letters, there are only 153 letter pairs, or only 171 fundamental single- or double-letter queries.
If you precalculate the results of these 171 queries, then you can answer any other more complicated query just by combining their results, without inspecting the string at all.
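A sketch of that idea in Python (names are mine; restrict_equal is the direct O(N) check, so the precomputation is O(171N) and each query is just table lookups):
from itertools import combinations

ALPHABET = "abcdefghijklmnopqr"    # the 18 possible letters

def restrict_equal(s1, s2, q):
    # Direct check: compare the two strings restricted to the letters in q
    qs = set(q)
    return [c for c in s1 if c in qs] == [c for c in s2 if c in qs]

def precompute(s1, s2):
    # Answers for all 18 single-letter and 153 two-letter queries
    table = {a: restrict_equal(s1, s2, a) for a in ALPHABET}
    for a, b in combinations(ALPHABET, 2):
        table[a + b] = restrict_equal(s1, s2, a + b)
    return table

def answer(table, query):
    # A query matches iff all of its single- and double-letter sub-queries do
    q = sorted(set(query))
    return (all(table[a] for a in q) and
            all(table[a + b] for a, b in combinations(q, 2)))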

How to perform range updates in O(sqrt(n)) time?

I have an array and I have to perform queries and updates on it.
For a query, I have to find the frequency of a particular number in a range l to r; for an update, I have to add x to every element in a range l to r.
How can I do this?
I thought of sqrt(n) decomposition, but I don't know how to perform range updates with this time complexity.
Edit: since some people are asking for an example, here is one.
Suppose the array is of size n = 8
and it is
1 3 3 4 5 1 2 3
Here are 3 queries to explain what I am trying to say:
q 1 5 3 - This means you have to find the frequency of 3 in the range 1 to 5, which is 2, as 3 appears at the 2nd and 3rd positions.
The second is an update query - u 2 4 6 -> This means you have to add 6 to every element in the range 2 to 4. So the new array will become
1 9 9 10 5 1 2 3
The last query is the same as the first one; it will now return 0, as there is no 3 in positions 1 to 5 anymore.
I believe things must be more clear now. :)
I developed this algorithm a long time (20+ years) ago for an arithmetic coder.
Both update and retrieve are performed in O(log N).
I named this algorithm the "Method of Intervals". Let me show you an example.
Imagine we have 8 intervals, numbered 0-7:
+--0--+--1--+--2-+--3--+--4--+--5--+--6--+--7--+
Let's create an additional set of intervals, each spanning a pair of the original ones:
+----01-----+----23----+----45-----+----67-----+
Then we create one more layer of intervals, each spanning a pair from the second layer:
+---------0123---------+---------4567----------+
And at last, we create a single interval covering all 8:
+------------------01234567--------------------+
As you can see, in this structure, to retrieve the right border of interval [5] you just add together the lengths of intervals [0123] and [45]. To retrieve the left border of interval [5], you add the lengths of intervals [0123] and [4] (the left border of 5 is the right border of 4).
Of course, the left border of interval [0] is always 0.
If you look at this structure carefully, you will see that the odd elements in each layer aren't needed. That is, you do not need elements 1, 3, 5, 7, 23, 67, or 4567, since they are never used during retrieval or update.
Let's remove the odd elements and renumber as follows:
+--1--+--x--+--3-+--x--+--5--+--x--+--7--+--x--+
+-----2-----+-----x----+-----6-----+-----x-----+
+-----------4----------+-----------x-----------+
+----------------------8-----------------------+
As you can see, this renumbering uses the numbers [1-8]. Let them be array indexes; the structure then uses O(N) memory.
To retrieve the right border of interval [7], you add together the values at indexes 4, 6, and 7. To update the length of interval [7], you add the difference to every stored value whose interval covers it (here, the values at indexes 7 and 8). As a result, both retrieval and update are performed in O(log N) time.
Now we need an algorithm to compute, from the original interval number, the set of indexes in this data structure. For instance, how to convert:
1 -> 1
2 -> 2
3 -> 3,2
...
7 -> 7,6,4
This is easy once we look at the binary representations of these numbers:
1 -> 1
10 -> 10
11 -> 11,10
111 -> 111,110,100
As you can see, in each chain the next value is the previous value with the rightmost "1" changed to "0". Using the simple bit operation "x & (x - 1)", we can write a simple loop to iterate over the array indexes related to an interval number:
int interval = 7;
do {
    int index = interval;
    do_something(index);
} while (interval &= interval - 1);
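For reference, this structure matches what is nowadays commonly called a Fenwick tree (binary indexed tree). Here is a minimal Python sketch of both operations under the same 1-based numbering (class and method names are mine); note that the update walks the complementary chain i += i & -i, touching every stored interval that covers index i:
class Intervals:
    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)   # tree[i]: length of the merged interval ending at i

    def right_border(self, i):
        # Sum the stored lengths along the chain i, i & (i-1), ...
        total = 0
        while i > 0:
            total += self.tree[i]
            i &= i - 1
        return total

    def add_length(self, i, delta):
        # Add delta to interval i's length: touch every stored interval covering i
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i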

Palindrome partitioning with interval scheduling

So I was looking at various algorithms for solving the palindrome partitioning problem.
For example, for the string "banana" the minimum number of cuts so that each substring is a palindrome is 1, i.e. "b|anana".
Now I tried solving this problem using interval scheduling like:
Input: banana
Transformed string: # b # a # n # a # n # a #
P[] = lengths of palindromes considering each character as center of palindrome.
I[] = intervals
String: # b # a # n # a # n # a #
P[i]: 0 1 0 1 0 3 0 5 0 3 0 1 0
I[i]: 0 1 2 3 4 5 6 7 8 9 10 11 12
Example: the palindrome considering 'a' (index 7) as its center is "anana", of length 5.
Now constructing intervals for each character based on P[i]:
b = (0,2)
a = (2,4)
n = (2,8)
a = (2,12)
n = (6,12)
a = (10,12)
Now, if I have to schedule these intervals over time 0 to 12 such that the minimum number of intervals is scheduled and no time slot remains empty, I would choose intervals (0,2) and (2,12); hence the answer would be 1, as I have broken the given string into two palindromes.
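For concreteness, here is a small Python sketch of the interval construction above (it uses naive center expansion, which is O(n^2); Manacher's algorithm computes P in O(n)):
def palindrome_intervals(s):
    # Transformed string: '#' between (and around) all characters
    t = '#' + '#'.join(s) + '#'
    intervals = []
    for i in range(len(t)):
        # Expand around center i to find the palindrome radius P[i]
        p = 0
        while i - p - 1 >= 0 and i + p + 1 < len(t) and t[i - p - 1] == t[i + p + 1]:
            p += 1
        if t[i] != '#':                  # one interval per original character
            intervals.append((i - p, i + p))
    return intervals

print(palindrome_intervals("banana"))
# -> [(0, 2), (2, 4), (2, 8), (2, 12), (6, 12), (10, 12)]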
Another test case:
String: # E # A # B # A # E # A # B #
P[i]: 0 1 0 1 0 5 0 1 0 5 0 1 0 1 0
I[i]: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Plotting these intervals on a graph, the minimum sets of intervals that can be scheduled are either:
1(0,2), 2(2,4), 5(4,14) OR
3(0,10), 6(10,12), 7(12,14)
Hence we have 3 parts, so the number of cuts would be 2, giving either
E|A|BAEAB
EABAE|A|B
These are just examples. I would like to know whether this algorithm works for all cases, or whether there are cases where it would definitely fail.
Please help me reach a proof that it will work in every scenario.
Note: please don't discourage me if this post makes no sense, as I have put a lot of time and effort into this problem; just state a reason or provide a link from which I can move forward with this solution. Thank you.
As long as you can get a partition of the string, your algorithm will work.
Recall that a partition P of a set S is a set of non-empty subsets A1, ..., An such that:
The union of the sets A1, ..., An gives the set S
The intersection of any Ai, Aj (with i != j) is empty
Even though palindrome partitioning deals with strings (which are a bit different from sets), the properties of a partition still hold.
Hence, if you have a partition, you consequently have a set of time intervals without "holes" to schedule.
Choosing the partition with the minimum number of subsets gives you the minimum number of time intervals and therefore the minimum number of cuts.
Furthermore, you always have at least one palindrome partition of a string: in the worst case, you get a palindrome partition made of single characters.

Find longest common substring of multiple strings using factor oracle enhanced with LRS array

Can we use a factor-oracle with suffix link (paper here) to compute the longest common substring of multiple strings? Here, substring means any part of the original string. For example "abc" is the substring of "ffabcgg", while "abg" is not.
I've found a way to compute the maximum-length common substring of two strings s1 and s2. It works by concatenating the two strings using a character that occurs in neither, '$' for example. Then for each prefix of the concatenated string s with length i >= |s1| + 2, we calculate its LRS (longest repeated suffix) length lrs[i] and sp[i] (the end position of the first occurrence of its LRS). Finally, the answer is
max{lrs[i]| i >= |s1| + 2 and sp[i] <= |s1|}
I've written a C++ program that uses this method, which can solve the problem within 200ms on my laptop when |s1|+|s2| <= 200000, using the factor oracle.
s1 = 'ffabcgg'
s2 = 'gfbcge'
s = s1+'$'+s2
= 'ffabcgg$gfbcge'
p: 0 1 2 3 4 5 6 7 8 9 10 11 12 13
s: f f a b c g g $ g f b c g e
sp: 0 1 0 0 0 0 6 0 6 1 4 5 6 0
lrs:0 1 0 0 0 0 1 0 1 1 1 2 3 0
ans = lrs[13] = 3
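For reference, a small sketch of that final selection in Python, assuming lrs and sp are indexed by prefix length (1-based, entry 0 unused) -- the exact indexing convention is an assumption on my part:
def lcs_from_oracle(lrs, sp, len_s1):
    # Keep prefixes that end inside s2 (i >= |s1| + 2) whose longest repeated
    # suffix first occurs inside s1 (sp[i] <= |s1|)
    return max((lrs[i] for i in range(len_s1 + 2, len(lrs)) if sp[i] <= len_s1),
               default=0)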
I know that both problems can be solved with high efficiency using a suffix array or a suffix tree, but I wonder if there is a method using the factor oracle. I am interested because the factor oracle is easy to construct (about 30 lines of C++; a suffix array needs about 60, and a suffix tree about 150), and it runs faster than the suffix array and suffix tree.
You can test your method for the first problem on this OnlineJudge, and for the second problem here.
Can we use a factor-oracle with suffix link (paper here) to compute
the longest common substring of multiple strings?
I don't think the algorithm is a very good fit (it is designed to factor a single string) but you can use it by concatenating the original strings with a unique separator.
Given abcdefg and hijcdekl and mncdop, find the longest common substring cd:
# combine with unique joiners
>>> s = "abcdefg" + "1" + "hijcdekl" + "2" + "mncdop"
>>> factor_oracle(s)
"cd"
As part of its linear-time-and-space algorithm, the factor oracle quickly rediscovers the break points between the input strings during its search for common factors (the unique joiners provide an immediate cue to stop extending the best factor found so far).

How can I maximally partition a set?

I'm trying to solve one of the Project Euler problems. As a consequence, I need an algorithm that will help me find all possible partitions of a set, in any order.
For instance, given the set 2 3 3 5:
2 | 3 3 5
2 | 3 | 3 5
2 | 3 3 | 5
2 | 3 | 3 | 5
2 5 | 3 3
and so on. Pretty much every possible combination of the members of the set. I've searched the net of course, but haven't found much that's directly useful to me, since I speak programmer-ese not advanced-math-ese.
Can anyone help me out with this? I can read pretty much any programming language, from BASIC to Haskell, so post in whatever language you wish.
Have you considered a search tree? Each node would represent a choice of where to put an element and the leaf nodes are answers. I won't give you code because that's part of the fun of Project Euler ;)
Take a look at:
The Art of Computer Programming, Volume 4, Fascicle 3: Generating All Combinations and Partitions
7.2.1.5. Generating all set partitions
In general I would look at the structure of the recursion used to compute the number of configurations, and build a similar recursion for enumerating them. Best is to compute a one-to-one mapping between integers and configurations. This works well for permutations, combinations, etc. and ensures that each configuration is enumerated only once.
Now even the recursion for the number of partitions of some identical items is rather complicated.
For partitions of multisets the counting amounts to solving the generalization of Project Euler problem 181 to arbitrary multisets.
Well, the problem has two aspects.
First, the items can be arranged in any order. So for N items, there are N! permutations (assuming the items are treated as unique).
Secondly, you can envision the grouping as a bit flag between each item indicating a divide. There would be N-1 of these flags, so for a given permutation there would be 2^(N-1) possible groupings.
This means that for N items, there would be a total of N!*(2^(N-1)) groupings/permutations, which gets big very very fast.
In your example, the top four items are groupings of one permutation. The last item is a grouping of another permutation. Your items can be viewed as:
2 on 3 off 3 off 5
2 on 3 on 3 off 5
2 on 3 off 3 on 5
2 on 3 on 3 on 5
2 off 5 on 3 off 3
The permutations (the order of display) can be derived by looking at them like a tree, as mentioned by the other two answers. This would almost certainly involve recursion.
The grouping is independent of them in many ways. Once you have all the permutations, you can link them with the groupings if needed.
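A sketch of that scheme in Python (treating the items as unique, as above, so a multiset like [2, 3, 3, 5] will produce repeated-looking groupings):
from itertools import permutations

def groupings(items):
    # For each permutation, each of the 2**(n-1) bit patterns marks
    # where a divider goes between adjacent positions
    n = len(items)
    for perm in permutations(items):
        for flags in range(2 ** (n - 1)):
            groups, current = [], [perm[0]]
            for i in range(n - 1):
                if flags >> i & 1:       # divider between positions i and i+1
                    groups.append(current)
                    current = [perm[i + 1]]
                else:
                    current.append(perm[i + 1])
            groups.append(current)
            yield groups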
Here is the code you need for this part of your problem:
def memoize(f):
    memo = {}
    def helper(x):
        if x not in memo:
            memo[x] = f(x)
        return memo[x]
    return helper

@memoize
def A000041(n):
    # Number of partitions of n (OEIS A000041), via Euler's pentagonal number recurrence
    if n == 0:
        return 1
    S = 0
    J = n - 1
    k = 2
    while 0 <= J:
        T = A000041(J)
        S = S + T if k // 2 % 2 != 0 else S - T
        J -= k if k % 2 != 0 else k // 2
        k += 1
    return S

print(A000041(100))  # p(100), as an example
I quickly whipped up some code to do this. However, I left out separating every possible combination of the given list, because I wasn't sure it was actually needed, but it should be easy to add, if necessary.
Anyway, the code runs quite well for small amounts, but, as CodeByMoonlight already mentioned, the amount of possibilities gets really high really fast, so the runtime increases accordingly.
Anyway, here's the python code:
import time

def separate(toseparate):
    "Find every possible way to separate a given list."
    # The list of every possibility
    possibilities = []
    n = len(toseparate)
    # We can distribute up to n-1 separators in the given list, so iterate from 0 to n-1
    for i in range(n):
        # Create a copy of the list to avoid modifying the already existing list
        copy = list(toseparate)
        # A boolean list indicating where a separator is put. 'True' indicates
        # a separator and 'False', of course, no separator.
        # The list will contain i separators; the rest is filled with 'False'.
        separators = [True]*i + [False]*(n-i-1)
        for j in range(len(separators)):
            # We insert the separators into our given list. The separators have to
            # be between two elements. The index between two elements is always
            # 2*[index of the left element]+1.
            copy.insert(2*j+1, separators[j])
        # The first possibility is, of course, the one we just created
        possibilities.append(list(copy))
        # The following is a modification of the QuickPerm algorithm, which finds
        # all possible permutations of a given list. It was modified to only permute
        # the spaces between two elements, so it finds every possibility to insert
        # i separators in the given list.
        m = len(separators)
        hi, lo = 1, 0
        p = [0]*m
        while hi < m:
            if p[hi] < hi:
                lo = (hi % 2) * p[hi]
                copy[2*lo+1], copy[2*hi+1] = copy[2*hi+1], copy[2*lo+1]
                # Since the items are non-unique, some possibilities will show up
                # more than once, so we avoid this by checking first.
                if copy not in possibilities:
                    possibilities.append(list(copy))
                p[hi] += 1
                hi = 1
            else:
                p[hi] = 0
                hi += 1
    return possibilities

t1 = time.time()
separations = separate([2, 3, 3, 5])
print(time.time() - t1)
sepmap = {True: "|", False: ""}
for a in separations:
    for b in a:
        if b in sepmap:
            print(sepmap[b], end=" ")
        else:
            print(b, end=" ")
    print()
It's based on the QuickPerm algorithm, which you can read more about here: QuickPerm
Basically, my code generates a list of separators, inserts them into the given list, and then finds all possible permutations of the separators in the list.
So, if we use your example we would get:
2 3 3 5
2 | 3 3 5
2 3 | 3 5
2 3 3 | 5
2 | 3 | 3 5
2 3 | 3 | 5
2 | 3 3 | 5
2 | 3 | 3 | 5
In 0.000154972076416 seconds.
However, I read through the description of the problem you are working on, and I see how you are trying to solve it, but seeing how quickly the runtime increases, I don't think it would work as fast as you expect. Remember that Project Euler's problems should solve in around a minute.
