native string matching algorithm - algorithm

Following is a very famous question in native string matching. Please can someone explain me the answer.
Suppose that all characters in the pattern P are different. Show how to accelerate NAIVE-STRING MATCHER to run in time O(n) on an n-character text T.

The basic idea:
Iterate through the input and the pattern at the same time, comparing their characters to each other
Whenever you get a non-matching character between the two, you can just reset the pattern position and keep the input position as is
This works because the pattern characters are all different, which means that whenever you have a partial match, there can be no other match overlapping with that, so we can just start looking from the end of the partial match.
Here's some pseudo-code that shouldn't be too difficult to understand:
input[n]
pattern[k]
pPos = 0
iPos = 0
while iPos < n
if pPos == k
FOUND!
if pattern[pPos] == input[iPos]
pPos++
iPos++
else
// if pPos is already 0, we need to increase iPos,
// otherwise we just keep comparing the same characters
if pPos == 0
iPos++
pPos = 0
It's easy to see that iPos increases at least every second loop, thus there can be at most 2n loop runs, making the running time O(n).

When T[i] and P[j] mismatches in NAIVE-STRING-MATCHER, we can skip all characters before T[i] and begin new matching from T[i + 1] with P[1].
NAIVE-STRING-MATCHER(T, P)
1 n length[T]
2 m length[P]
3 for s 0 to n - m
4 do if P[1 . . m] = T[s + 1 . . s + m]
5 then print "Pattern occurs with shift" s

Naive string search algorithm implementations in Python 2.7:
https://gist.github.com/heyhuyen/4341692
In the middle of implementing Boyer-Moore's string search algorithm, I decided to play with my original naive search algorithm. It's implemented as an instance method that takes a string to be searched. The object has an attribute 'pattern' which is the pattern to match.
1) Here is the original version of the search method, using a double for-loop.
Makes calls to range and len
def search(self, string):
for i in range(len(string)):
for j in range(len(self.pattern)):
if string[i+j] != self.pattern[j]:
break
elif j == len(self.pattern) - 1:
return i
return -1
2) Here is the second version, using a double while-loop instead.
Slightly faster, not making calls to range
def search(self, string):
i = 0
while i < len(string):
j = 0
while j < len(self.pattern) and self.pattern[j] == string[i+j]:
j += 1
if j == len(self.pattern):
return i
i += 1
return -1
3) Here is the original, replacing range with xrange.
Faster than both of the previous two.
def search(self, string):
for i in xrange(len(string)):
for j in xrange(len(self.pattern)):
if string[i+j] != self.pattern[j]:
break
elif j == len(self.pattern) - 1:
return i
return -1
4) Storing values in local variables = win! With the double while loop, this is the fastest.
def search(self, string):
len_pat = len(self.pattern)
len_str = len(string)
i = 0
while i < len_str:
j = 0
while j < len_pat and self.pattern[j] == string[i+j]:
j += 1
if j == len_pat:
return i
i += 1
return -1

Related

Dynamic Programming for shortest subsequence that is not a subsequence of two strings

Problem: Given two sequences s1 and s2 of '0' and '1'return the shortest sequence that is a subsequence of neither of the two sequences.
E.g. s1 = '011' s2 = '1101' Return s_out = '00' as one possible result.
Note that substring and subsequence are different where substring the characters are contiguous but in a subsequence that needs not be the case.
My question: How is dynamic programming applied in the "Solution Provided" below and what is its time complexity?
My attempt involves computing all the subsequences for each string giving sub1 and sub2. Append a '1' or a '0' to each sub1 and determine if that new subsequence is not present in sub2.Find the minimum length one. Here is my code:
My Solution
def get_subsequences(seq, index, subs, result):
if index == len(seq):
if subs:
result.add(''.join(subs))
else:
get_subsequences(seq, index + 1, subs, result)
get_subsequences(seq, index + 1, subs + [seq[index]], result)
def get_bad_subseq(subseq):
min_sub = ''
length = float('inf')
for sub in subseq:
for char in ['0', '1']:
if len(sub) + 1 < length and sub + char not in subseq:
length = len(sub) + 1
min_sub = sub + char
return min_sub
Solution Provided (not mine)
How does it work and its time complexity?
It looks that the below solution looks similar to: http://kyopro.hateblo.jp/entry/2018/12/11/100507
def set_nxt(s, nxt):
n = len(s)
idx_0 = n + 1
idx_1 = n + 1
for i in range(n, 0, -1):
nxt[i][0] = idx_0
nxt[i][1] = idx_1
if s[i-1] == '0':
idx_0 = i
else:
idx_1 = i
nxt[0][0] = idx_0
nxt[0][1] = idx_1
def get_shortest(seq1, seq2):
len_seq1 = len(seq1)
len_seq2 = len(seq2)
nxt_seq1 = [[len_seq1 + 1 for _ in range(2)] for _ in range(len_seq1 + 2)]
nxt_seq2 = [[len_seq2 + 1 for _ in range(2)] for _ in range(len_seq2 + 2)]
set_nxt(seq1, nxt_seq1)
set_nxt(seq2, nxt_seq2)
INF = 2 * max(len_seq1, len_seq2)
dp = [[INF for _ in range(len_seq2 + 2)] for _ in range(len_seq1 + 2)]
dp[len_seq1 + 1][len_seq2 + 1] = 0
for i in range( len_seq1 + 1, -1, -1):
for j in range(len_seq2 + 1, -1, -1):
for k in range(2):
if dp[nxt_seq1[i][k]][nxt_seq2[j][k]] < INF:
dp[i][j] = min(dp[i][j], dp[nxt_seq1[i][k]][nxt_seq2[j][k]] + 1);
res = ""
i = 0
j = 0
while i <= len_seq1 or j <= len_seq2:
for k in range(2):
if (dp[i][j] == dp[nxt_seq1[i][k]][nxt_seq2[j][k]] + 1):
i = nxt_seq1[i][k]
j = nxt_seq2[j][k]
res += str(k)
break;
return res
I am not going to work it through in detail, but the idea of this solution is to create a 2-D array of every combinations of positions in the one array and the other. It then populates this array with information about the shortest sequences that it finds that force you that far.
Just constructing that array takes space (and therefore time) O(len(seq1) * len(seq2)). Filling it in takes a similar time.
This is done with lots of bit twiddling that I don't want to track.
I have another approach that is clearer to me that usually takes less space and less time, but in the worst case could be as bad. But I have not coded it up.
UPDATE:
Here is is all coded up. With poor choices of variable names. Sorry about that.
# A trivial data class to hold a linked list for the candidate subsequences
# along with information about they match in the two sequences.
import collections
SubSeqLinkedList = collections.namedtuple('SubSeqLinkedList', 'value pos1 pos2 tail')
# This finds the position after the first match. No match is treated as off the end of seq.
def find_position_after_first_match (seq, start, value):
while start < len(seq) and seq[start] != value:
start += 1
return start+1
def make_longer_subsequence (subseq, value, seq1, seq2):
pos1 = find_position_after_first_match(seq1, subseq.pos1, value)
pos2 = find_position_after_first_match(seq2, subseq.pos2, value)
gotcha = SubSeqLinkedList(value=value, pos1=pos1, pos2=pos2, tail=subseq)
return gotcha
def minimal_nonsubseq (seq1, seq2):
# We start with one candidate for how to start the subsequence
# Namely an empty subsequence. Length 0, matches before the first character.
candidates = [SubSeqLinkedList(value=None, pos1=0, pos2=0, tail=None)]
# Now we try to replace candidates with longer maximal ones - nothing of
# the same length is better at going farther in both sequences.
# We keep this list ordered by descending how far it goes in sequence1.
while candidates[0].pos1 <= len(seq1) or candidates[0].pos2 <= len(seq2):
new_candidates = []
for candidate in candidates:
candidate1 = make_longer_subsequence(candidate, '0', seq1, seq2)
candidate2 = make_longer_subsequence(candidate, '1', seq1, seq2)
if candidate1.pos1 < candidate2.pos1:
# swap them.
candidate1, candidate2 = candidate2, candidate1
for c in (candidate1, candidate2):
if 0 == len(new_candidates):
new_candidates.append(c)
elif new_candidates[-1].pos1 <= c.pos1 and new_candidates[-1].pos2 <= c.pos2:
# We have found strictly better.
new_candidates[-1] = c
elif new_candidates[-1].pos2 < c.pos2:
# Note, by construction we cannot be shorter in pos1.
new_candidates.append(c)
# And now we throw away the ones we don't want.
# Those that are on their way to a solution will be captured in the linked list.
candidates = new_candidates
answer = candidates[0]
r_seq = [] # This winds up reversed.
while answer.value is not None:
r_seq.append(answer.value)
answer = answer.tail
return ''.join(reversed(r_seq))
print(minimal_nonsubseq('011', '1101'))

Algorithm for simple string compression

I would like to find the shortest possible encoding for a string in the following form:
abbcccc = a2b4c
[NOTE: this greedy algorithm does not guarantee shortest solution]
By remembering all previous occurrences of a character it is straight forward to find the first occurrence of a repeating string (minimal end index including all repetitions = maximal remaining string after all repetitions) and replace it with a RLE (Python3 code):
def singleRLE_v1(s):
occ = dict() # for each character remember all previous indices of occurrences
for idx,c in enumerate(s):
if not c in occ: occ[c] = []
for c_occ in occ[c]:
s_c = s[c_occ:idx]
i = 1
while s[idx+(i-1)*len(s_c) : idx+i*len(s_c)] == s_c:
i += 1
if i > 1:
rle_pars = ('(',')') if len(s_c) > 1 else ('','')
rle = ('%d'%i) + rle_pars[0] + s_c + rle_pars[1]
s_RLE = s[:c_occ] + rle + s[idx+(i-1)*len(s_c):]
return s_RLE
occ[c].append(idx)
return s # no repeating substring found
To make it robust for iterative application we have to exclude a few cases where a RLE may not be applied (e.g. '11' or '))'), also we have to make sure the RLE is not making the string longer (which can happen with a substring of two characters occurring twice as in 'abab'):
def singleRLE(s):
"find first occurrence of a repeating substring and replace it with RLE"
occ = dict() # for each character remember all previous indices of occurrences
for idx,c in enumerate(s):
if idx>0 and s[idx-1] in '0123456789': continue # no RLE for e.g. '11' or other parts of previous inserted RLE
if c == ')': continue # no RLE for '))...)'
if not c in occ: occ[c] = []
for c_occ in occ[c]:
s_c = s[c_occ:idx]
i = 1
while s[idx+(i-1)*len(s_c) : idx+i*len(s_c)] == s_c:
i += 1
if i > 1:
print("found %d*'%s'" % (i,s_c))
rle_pars = ('(',')') if len(s_c) > 1 else ('','')
rle = ('%d'%i) + rle_pars[0] + s_c + rle_pars[1]
if len(rle) <= i*len(s_c): # in case of a tie prefer RLE
s_RLE = s[:c_occ] + rle + s[idx+(i-1)*len(s_c):]
return s_RLE
occ[c].append(idx)
return s # no repeating substring found
Now we can safely call singleRLE on the previous output as long as we find a repeating string:
def iterativeRLE(s):
s_RLE = singleRLE(s)
while s != s_RLE:
print(s_RLE)
s, s_RLE = s_RLE, singleRLE(s_RLE)
return s_RLE
With the above inserted print statements we get e.g. the following trace and result:
>>> iterativeRLE('xyabcdefdefabcdefdef')
found 2*'def'
xyabc2(def)abcdefdef
found 2*'def'
xyabc2(def)abc2(def)
found 2*'abc2(def)'
xy2(abc2(def))
'xy2(abc2(def))'
But this greedy algorithm fails for this input:
>>> iterativeRLE('abaaabaaabaa')
found 3*'a'
ab3abaaabaa
found 3*'a'
ab3ab3abaa
found 2*'b3a'
a2(b3a)baa
found 2*'a'
a2(b3a)b2a
'a2(b3a)b2a'
whereas one of the shortest solutions is 3(ab2a).
Since a greedy algorithm does not work, some search is necessary. Here is a depth first search with some pruning (if in a branch the first idx0 characters of the string are not touched, to not try to find a repeating substring within these characters; also if replacing multiple occurrences of a substring do this for all consecutive occurrencies):
def isRLE(s):
"is this a well nested RLE? (only well nested RLEs can be further nested)"
nestCnt = 0
for c in s:
if c == '(':
nestCnt += 1
elif c == ')':
if nestCnt == 0:
return False
nestCnt -= 1
return nestCnt == 0
def singleRLE_gen(s,idx0=0):
"find all occurrences of a repeating substring with first repetition not ending before index idx0 and replace each with RLE"
print("looking for repeated substrings in '%s', first rep. not ending before index %d" % (s,idx0))
occ = dict() # for each character remember all previous indices of occurrences
for idx,c in enumerate(s):
if idx>0 and s[idx-1] in '0123456789': continue # sub-RLE cannot start after number
if not c in occ: occ[c] = []
for c_occ in occ[c]:
s_c = s[c_occ:idx]
if not isRLE(s_c): continue # avoid RLEs for e.g. '))...)'
if idx+len(s_c) < idx0: continue # pruning: this substring has been tried before
if c_occ-len(s_c) >= 0 and s[c_occ-len(s_c):c_occ] == s_c: continue # pruning: always take all repetitions
i = 1
while s[idx+(i-1)*len(s_c) : idx+i*len(s_c)] == s_c:
i += 1
if i > 1:
rle_pars = ('(',')') if len(s_c) > 1 else ('','')
rle = ('%d'%i) + rle_pars[0] + s_c + rle_pars[1]
if len(rle) <= i*len(s_c): # in case of a tie prefer RLE
s_RLE = s[:c_occ] + rle + s[idx+(i-1)*len(s_c):]
#print(" replacing %d*'%s' -> %s" % (i,s_c,s_RLE))
yield s_RLE,c_occ
occ[c].append(idx)
def iterativeRLE_depthFirstSearch(s):
shortestRLE = s
candidatesRLE = [(s,0)]
while len(candidatesRLE) > 0:
candidateRLE,idx0 = candidatesRLE.pop(0)
for rle,idx in singleRLE_gen(candidateRLE,idx0):
if len(rle) <= len(shortestRLE):
shortestRLE = rle
print("new optimum: '%s'" % shortestRLE)
candidatesRLE.append((rle,idx))
return shortestRLE
Sample output:
>>> iterativeRLE_depthFirstSearch('tctttttttttttcttttttttttctttttttttttct')
looking for repeated substrings in 'tctttttttttttcttttttttttctttttttttttct', first rep. not ending before index 0
new optimum: 'tc11tcttttttttttctttttttttttct'
new optimum: '2(tctttttttttt)ctttttttttttct'
new optimum: 'tctttttttttttc2(ttttttttttct)'
looking for repeated substrings in 'tc11tcttttttttttctttttttttttct', first rep. not ending before index 2
new optimum: 'tc11tc10tctttttttttttct'
new optimum: 'tc11t2(ctttttttttt)tct'
new optimum: 'tc11tc2(ttttttttttct)'
looking for repeated substrings in 'tc5(tt)tcttttttttttctttttttttttct', first rep. not ending before index 2
...
new optimum: '2(tctttttttttt)c11tct'
...
new optimum: 'tc11tc10tc11tct'
...
new optimum: 'tc11t2(c10t)tct'
looking for repeated substrings in 'tc11tc2(ttttttttttct)', first rep. not ending before index 6
new optimum: 'tc11tc2(10tct)'
...
new optimum: '2(tc10t)c11tct'
...
'2(tc10t)c11tct'
Following is my C++ implementation to do it in-place with O(n) time complexity and O(1) space complexity.
class Solution {
public:
int compress(vector<char>& chars) {
int n = (int)chars.size();
if(chars.empty()) return 0;
int left = 0, right = 0, currCharIndx = left;
while(right < n) {
if(chars[currCharIndx] != chars[right]) {
int len = right - currCharIndx;
chars[left++] = chars[currCharIndx];
if(len > 1) {
string freq = to_string(len);
for(int i = 0; i < (int)freq.length(); i++) {
chars[left++] = freq[i];
}
}
currCharIndx = right;
}
right++;
}
int len = right - currCharIndx;
chars[left++] = chars[currCharIndx];
if(len > 1) {
string freq = to_string(len);
for(int i = 0; i < freq.length(); i++) {
chars[left++] = freq[i];
}
}
return left;
}
};
You need to keep track of three pointers - right is to iterate, currCharIndx is to keep track the first position of current character and left is to keep track the write position of the compressed string.

Erroneous dynamic programming algorithm

Transferred from Code Review. If this question is not suitable for SO either, please let me know. I will remove it.
I am working on an algorithm puzzle here at: https://www.hackerrank.com/challenges/hackerland-radio-transmitters/forum
It cannot pass all of the test cases. Some test cases have such large arrays that it gets so hard to debug. Simple cases from my end seem all work fine. Can anyone look into this and share what is wrong with this algorithm? Basically it just loops through the array and find every furthest covered station (as origin). A counter-like variable result record the origins (radio stations).
def solution(k, arr, origin=0):
arr = sorted(list(set(arr)))
result = 1
cur = 0
origin = 0
for i in range(1, len(arr)):
if arr[i] - arr[origin] <= k:
pass
else:
origin = i - 1
j = 1
while origin + j < len(arr):
if arr[origin + j] - arr[origin] <= k:
pass
else:
origin = origin + j
i = origin + j + 1
result += 1
continue
j += 1
return result
Most of your code is correct. Only problem is with the usage of For range outer loop and continue in the inner loop.
For range loop doesn't change the i value # runtime (it is more like a ForEach loop).
The continue will not terminate the inner loop - you may want to use break.
The following code passed all the test cases
def solution(k, arr, origin=0):
arr = sorted(list(set(arr)))
print arr
result = 1
cur = 0
origin = 0
i = 0
while (i < len(arr)):
if arr[i] - arr[origin] <= k:
i = i + 1
pass
else:
origin = i - 1
j = 1
while origin + j < len(arr):
if arr[origin + j] - arr[origin] <= k:
pass
else:
# Start for next station position from this point
i = origin + j
origin = i
# need another radio station
result += 1
break
j += 1
return result
hope it helps!
You're placing the first object on the first index. The first object in the optimal solution can be placed later too.
solution(1,[1,2,3,4,5,6]) prints 3, when it should be 2 (by placing the two objects on 2 and 5). You place your first object on 1, then 3 and then 5. It should ideally be placed on 2, then 5.

Bignum too big to convert into 'long' (RangeError)

Trying to teach myself ruby - I'm working on Project Euler problem 14 in ruby.
n = 1000000
array = Array.new(n,0)
#array[x] will store the number of steps to get to one if a solution has been found and 0 otherwise. x will equal the starting number. array[0] will be nonsensical for these purposes
i = n-1#We will start at array[n-1] and work down to 1
while i > 1
if array[i] == 0
numstep = 0 #numstep will hold the number of loops that j makes until it gets to 1 or a number that has already been solved
j = i
while j > 1 && (array[j] == 0 || array[j] == nil)
case j%2
when 1 # j is odd
j = 3*j + 1
when 0 # j is even
j = j/2
end
numstep += 1
end
stop = array[j] #if j has been solved, array[j] is the number of steps to j = 1. If j = 1, array[j] = 0
j = i
counter = 0
while j > 1 && (array[j] == 0 || array[j] == nil)
if j < n
array[j] = numstep + stop - counter #numstep + stop should equal the solution to the ith number, to get the jth number we subtract counter
end
case j%2
when 1 #j is odd
j = 3*j+1
when 0 #j is even
j = j/2
end
counter += 1
end
end
i = i-1
end
puts("The longest Collatz sequence starting below #{n} starts at #{array.each_with_index.max[1]} and is #{array.max} numbers long")
This code works fine for n = 100000 and below, but when I go up to n = 1000000, it runs for a short while (until j = 999167 *3 + 1 = 2997502). When it tries access the 2997502th index of array, it throws the error
in '[]': bignum too big to convert into 'long' (RangeError)
on line 27 (which is the while statement:
while j > 1 && (array[j] == 0 || array[j] == nil)
How can I get this to not throw an error? Checking if the array is zero saves code efficiency because it allows you to not recalculate something that's already been done, but if I remove the and statement, it runs and gives the correct answer. I'm pretty sure that the problem is that the index of an array can't be a bignum, but maybe there's a way to declare my array such that it can be? I don't much care about the answer itself; I've actually already solved this in C# - just trying to learn ruby, so I'd like to know why my code is doing this (if I'm wrong about why) and how to fix it.
The code above runs happily for me for any input that produces output in acceptable time. I believe this is because you might experience problems being on 32bit arch, or like. Anyway, the solution of the problem stated would be simple (unless you might run out of memory, which is another possible glitch.)
Array indices are limited, as is follows from the error you got. Cool, let’s use hash instead!
n = 1000000
array = Hash.new(0)
#array[x] will store the number of steps to get to one if a solution has been found and 0 otherwise. x will equal the starting number. arr
i = n-1#We will start at array[n-1] and work down to 1
while i > 1
if array[i].zero?
numstep = 0 #numstep will hold the number of loops that j makes until it gets to 1 or a number that has already been solved
j = i
while j > 1 && array[j].zero?
case j%2
when 1 # j is odd
j = 3*j + 1
when 0 # j is even
j = j/2
end
numstep += 1
end
stop = array[j] #if j has been solved, array[j] is the number of steps to j = 1. If j = 1, array[j] = 0
j = i
counter = 0
while j > 1 && array[j].zero?
if j < n
array[j] = numstep + stop - counter #numstep + stop should equal the solution to the ith number, to get the jth number we
end
case j%2
when 1 #j is odd
j = 3*j+1
when 0 #j is even
j = j/2
end
counter += 1
end
end
i = i-1
end
puts("Longest Collatz below #{n} ##{array.sort_by(&:first).map(&:last).each_with_index.max[1]} is #{arr
Please note, that since I used the hash with initializer, array[i] can’t become nil, that’s why the check is done for zero values only.

number which appears more than n/3 times in an array

I have read this problem
Find the most common entry in an array
and the answer from jon skeet is just mind blowing .. :)
Now I am trying to solve this problem find an element which occurs more than n/3 times in an array ..
I am pretty sure that we cannot apply the same method because there can be 2 such elements which will occur more than n/3 times and that gives false alarm of the count ..so is there any way we can tweak around jon skeet's answer to work for this ..?
Or is there any solution that will run in linear time ?
Jan Dvorak's answer is probably best:
Start with two empty candidate slots and two counters set to 0.
for each item:
if it is equal to either candidate, increment the corresponding count
else if there is an empty slot (i.e. a slot with count 0), put it in that slot and set the count to 1
else reduce both counters by 1
At the end, make a second pass over the array to check whether the candidates really do have the required count. This isn't allowed by the question you link to but I don't see how to avoid it for this modified version. If there is a value that occurs more than n/3 times then it will be in a slot, but you don't know which one it is.
If this modified version of the question guaranteed that there were two values with more than n/3 elements (in general, k-1 values with more than n/k) then we wouldn't need the second pass. But when the original question has k=2 and 1 guaranteed majority there's no way to know whether we "should" generalize it as guaranteeing 1 such element or guaranteeing k-1. The stronger the guarantee, the easier the problem.
Using Boyer-Moore Majority Vote Algorithm, we get:
vector<int> majorityElement(vector<int>& nums) {
int cnt1=0, cnt2=0;
int a,b;
for(int n: A){
if (n == a) cnt1++;
else if (n == b) cnt2++;
else if (cnt1 == 0){
cnt1++;
a = n;
}
else if (cnt2 == 0){
cnt2++;
b = n;
}
else{
cnt1--;
cnt2--;
}
}
cnt1=cnt2=0;
for(int n: nums){
if (n==a) cnt1++;
else if (n==b) cnt2++;
}
vector<int> result;
if (cnt1 > nums.size()/3) result.push_back(a);
if (cnt2 > nums.size()/3) result.push_back(b);
return result;
}
Updated, correction from #Vipul Jain
You can use Selection algorithm to find the number in the n/3 place and 2n/3.
n1=Selection(array[],n/3);
n2=Selection(array[],n2/3);
coun1=0;
coun2=0;
for(i=0;i<n;i++)
{
if(array[i]==n1)
count1++;
if(array[i]==n2)
count2++;
}
if(count1>n)
print(n1);
else if(count2>n)
print(n2);
else
print("no found!");
At line number five, the if statement should have one more check:
if(n!=b && (cnt1 == 0 || n == a))
I use the following Python solution to discuss the correctness of the algorithm:
class Solution:
"""
#param: nums: a list of integers
#return: The majority number that occurs more than 1/3
"""
def majorityNumber(self, nums):
if nums is None:
return None
if len(nums) == 0:
return None
num1 = None
num2 = None
count1 = 0
count2 = 0
# Loop 1
for i, val in enumerate(nums):
if count1 == 0:
num1 = val
count1 = 1
elif val == num1:
count1 += 1
elif count2 == 0:
num2 = val
count2 = 1
elif val == num2:
count2 += 1
else:
count1 -= 1
count2 -= 1
count1 = 0
count2 = 0
for val in nums:
if val == num1:
count1 += 1
elif val == num2:
count2 += 1
if count1 > count2:
return num1
return num2
First, we need to prove claim A:
Claim A: Consider a list C which contains a majority number m which occurs more floor(n/3) times. After 3 different numbers are removed from C, we have C'. m is the majority number of C'.
Proof: Use R to denote m's occurrence count in C. We have R > floor(n/3). R > floor(n/3) => R - 1 > floor(n/3) - 1 => R - 1 > floor((n-3)/3). Use R' to denote m's occurrence count in C'. And use n' to denote the length of C'. Since 3 different numbers are removed, we have R' >= R - 1. And n'=n-3 is obvious. We can have R' > floor(n'/3) from R - 1 > floor((n-3)/3). So m is the majority number of C'.
Now let's prove the correctness of the loop 1. Define L as count1 * [num1] + count2 * [num2] + nums[i:]. Use m to denote the majority number.
Invariant
The majority number m is in L.
Initialization
At the start of the first itearation, L is nums[0:]. So the invariant is trivially true.
Maintenance
if count1 == 0 branch: Before the iteration, L is count2 * [num2] + nums[i:]. After the iteration, L is 1 * [nums[i]] + count2 * [num2] + nums[i+1:]. In other words, L is not changed. So the invariant is maintained.
if val == num1 branch: Before the iteration, L is count1 * [nums[i]] + count2 * [num2] + nums[i:]. After the iteration, L is (count1+1) * [num[i]] + count2 * [num2] + nums[i+1:]. In other words, L is not changed. So the invariant is maintained.
f count2 == 0 branch: Similar to condition 1.
elif val == num2 branch: Similar to condition 2.
else branch: nums[i], num1 and num2 are different to each other in this case. After the iteration, L is (count1-1) * [num1] + (count2-1) * [num2] + nums[i+1:]. In other words, three different numbers are moved from count1 * [num1] + count2 * [num2] + nums[i:]. From claim A, we know m is the majority number of L.So the invariant is maintained.
Termination
When the loop terminates, nums[n:] is empty. L is count1 * [num1] + count2 * [num2].
So when the loop terminates, the majority number is either num1 or num2.
If there are n elements in the array , and suppose in the worst case only 1 element is repeated n/3 times , then the probability of choosing one number that is not the one which is repeated n/3 times will be (2n/3)/n that is 1/3 , so if we randomly choose N elements from the array of size ‘n’, then the probability that we end up choosing the n/3 times repeated number will be atleast 1-(2/3)^N . If we eqaute this to say 99.99 percent probability of getting success, we will get N=23 for any value of “n”.
Therefore just choose 23 numbers randomly from the list and count their occurrences , if we get count greater than n/3 , we will return that number and if we didn’t get any solution after checking for 23 numbers randomly , return -1;
The algorithm is essentially O(n) as the value 23 doesn’t depend on n(size of list) , so we have to just traverse array 23 times at worst case of algo.
Accepted Code on interviewbit(C++):
int n=A.size();
int ans,flag=0;
for(int i=0;i<23;i++)
{
int index=rand()%n;
int elem=A[index];
int count=0;
for(int i=0;i<n;i++)
{
if(A[i]==elem)
count++;
}
if(count>n/3)
{
flag=1;
ans=elem;
}
if(flag==1)
break;
}
if(flag==1)
return ans;
else return -1;
}

Resources