Algorithm for simple string compression

Algorithm for simple string compression - algorithm

I would like to find the shortest possible encoding for a string in the following form:
abbcccc = a2b4c

[NOTE: this greedy algorithm does not guarantee shortest solution]
By remembering all previous occurrences of a character it is straight forward to find the first occurrence of a repeating string (minimal end index including all repetitions = maximal remaining string after all repetitions) and replace it with a RLE (Python3 code):
def singleRLE_v1(s):
occ = dict() # for each character remember all previous indices of occurrences
for idx,c in enumerate(s):
if not c in occ: occ[c] = []
for c_occ in occ[c]:
s_c = s[c_occ:idx]
i = 1
while s[idx+(i-1)*len(s_c) : idx+i*len(s_c)] == s_c:
i += 1
if i > 1:
rle_pars = ('(',')') if len(s_c) > 1 else ('','')
rle = ('%d'%i) + rle_pars[0] + s_c + rle_pars[1]
s_RLE = s[:c_occ] + rle + s[idx+(i-1)*len(s_c):]
return s_RLE
occ[c].append(idx)
return s # no repeating substring found
To make it robust for iterative application we have to exclude a few cases where a RLE may not be applied (e.g. '11' or '))'), also we have to make sure the RLE is not making the string longer (which can happen with a substring of two characters occurring twice as in 'abab'):
def singleRLE(s):
"find first occurrence of a repeating substring and replace it with RLE"
occ = dict() # for each character remember all previous indices of occurrences
for idx,c in enumerate(s):
if idx>0 and s[idx-1] in '0123456789': continue # no RLE for e.g. '11' or other parts of previous inserted RLE
if c == ')': continue # no RLE for '))...)'
if not c in occ: occ[c] = []
for c_occ in occ[c]:
s_c = s[c_occ:idx]
i = 1
while s[idx+(i-1)*len(s_c) : idx+i*len(s_c)] == s_c:
i += 1
if i > 1:
print("found %d*'%s'" % (i,s_c))
rle_pars = ('(',')') if len(s_c) > 1 else ('','')
rle = ('%d'%i) + rle_pars[0] + s_c + rle_pars[1]
if len(rle) <= i*len(s_c): # in case of a tie prefer RLE
s_RLE = s[:c_occ] + rle + s[idx+(i-1)*len(s_c):]
return s_RLE
occ[c].append(idx)
return s # no repeating substring found
Now we can safely call singleRLE on the previous output as long as we find a repeating string:
def iterativeRLE(s):
s_RLE = singleRLE(s)
while s != s_RLE:
print(s_RLE)
s, s_RLE = s_RLE, singleRLE(s_RLE)
return s_RLE
With the above inserted print statements we get e.g. the following trace and result:
>>> iterativeRLE('xyabcdefdefabcdefdef')
found 2*'def'
xyabc2(def)abcdefdef
found 2*'def'
xyabc2(def)abc2(def)
found 2*'abc2(def)'
xy2(abc2(def))
'xy2(abc2(def))'
But this greedy algorithm fails for this input:
>>> iterativeRLE('abaaabaaabaa')
found 3*'a'
ab3abaaabaa
found 3*'a'
ab3ab3abaa
found 2*'b3a'
a2(b3a)baa
found 2*'a'
a2(b3a)b2a
'a2(b3a)b2a'
whereas one of the shortest solutions is 3(ab2a).

Since a greedy algorithm does not work, some search is necessary. Here is a depth first search with some pruning (if in a branch the first idx0 characters of the string are not touched, to not try to find a repeating substring within these characters; also if replacing multiple occurrences of a substring do this for all consecutive occurrencies):
def isRLE(s):
"is this a well nested RLE? (only well nested RLEs can be further nested)"
nestCnt = 0
for c in s:
if c == '(':
nestCnt += 1
elif c == ')':
if nestCnt == 0:
return False
nestCnt -= 1
return nestCnt == 0
def singleRLE_gen(s,idx0=0):
"find all occurrences of a repeating substring with first repetition not ending before index idx0 and replace each with RLE"
print("looking for repeated substrings in '%s', first rep. not ending before index %d" % (s,idx0))
occ = dict() # for each character remember all previous indices of occurrences
for idx,c in enumerate(s):
if idx>0 and s[idx-1] in '0123456789': continue # sub-RLE cannot start after number
if not c in occ: occ[c] = []
for c_occ in occ[c]:
s_c = s[c_occ:idx]
if not isRLE(s_c): continue # avoid RLEs for e.g. '))...)'
if idx+len(s_c) < idx0: continue # pruning: this substring has been tried before
if c_occ-len(s_c) >= 0 and s[c_occ-len(s_c):c_occ] == s_c: continue # pruning: always take all repetitions
i = 1
while s[idx+(i-1)*len(s_c) : idx+i*len(s_c)] == s_c:
i += 1
if i > 1:
rle_pars = ('(',')') if len(s_c) > 1 else ('','')
rle = ('%d'%i) + rle_pars[0] + s_c + rle_pars[1]
if len(rle) <= i*len(s_c): # in case of a tie prefer RLE
s_RLE = s[:c_occ] + rle + s[idx+(i-1)*len(s_c):]
#print(" replacing %d*'%s' -> %s" % (i,s_c,s_RLE))
yield s_RLE,c_occ
occ[c].append(idx)
def iterativeRLE_depthFirstSearch(s):
shortestRLE = s
candidatesRLE = [(s,0)]
while len(candidatesRLE) > 0:
candidateRLE,idx0 = candidatesRLE.pop(0)
for rle,idx in singleRLE_gen(candidateRLE,idx0):
if len(rle) <= len(shortestRLE):
shortestRLE = rle
print("new optimum: '%s'" % shortestRLE)
candidatesRLE.append((rle,idx))
return shortestRLE
Sample output:
>>> iterativeRLE_depthFirstSearch('tctttttttttttcttttttttttctttttttttttct')
looking for repeated substrings in 'tctttttttttttcttttttttttctttttttttttct', first rep. not ending before index 0
new optimum: 'tc11tcttttttttttctttttttttttct'
new optimum: '2(tctttttttttt)ctttttttttttct'
new optimum: 'tctttttttttttc2(ttttttttttct)'
looking for repeated substrings in 'tc11tcttttttttttctttttttttttct', first rep. not ending before index 2
new optimum: 'tc11tc10tctttttttttttct'
new optimum: 'tc11t2(ctttttttttt)tct'
new optimum: 'tc11tc2(ttttttttttct)'
looking for repeated substrings in 'tc5(tt)tcttttttttttctttttttttttct', first rep. not ending before index 2
...
new optimum: '2(tctttttttttt)c11tct'
...
new optimum: 'tc11tc10tc11tct'
...
new optimum: 'tc11t2(c10t)tct'
looking for repeated substrings in 'tc11tc2(ttttttttttct)', first rep. not ending before index 6
new optimum: 'tc11tc2(10tct)'
...
new optimum: '2(tc10t)c11tct'
...
'2(tc10t)c11tct'

Following is my C++ implementation to do it in-place with O(n) time complexity and O(1) space complexity.
class Solution {
public:
int compress(vector<char>& chars) {
int n = (int)chars.size();
if(chars.empty()) return 0;
int left = 0, right = 0, currCharIndx = left;
while(right < n) {
if(chars[currCharIndx] != chars[right]) {
int len = right - currCharIndx;
chars[left++] = chars[currCharIndx];
if(len > 1) {
string freq = to_string(len);
for(int i = 0; i < (int)freq.length(); i++) {
chars[left++] = freq[i];
}
}
currCharIndx = right;
}
right++;
}
int len = right - currCharIndx;
chars[left++] = chars[currCharIndx];
if(len > 1) {
string freq = to_string(len);
for(int i = 0; i < freq.length(); i++) {
chars[left++] = freq[i];
}
}
return left;
}
};
You need to keep track of three pointers - right is to iterate, currCharIndx is to keep track the first position of current character and left is to keep track the write position of the compressed string.

Related

Dynamic Programming for shortest subsequence that is not a subsequence of two strings

Problem: Given two sequences s1 and s2 of '0' and '1'return the shortest sequence that is a subsequence of neither of the two sequences.
E.g. s1 = '011' s2 = '1101' Return s_out = '00' as one possible result.
Note that substring and subsequence are different where substring the characters are contiguous but in a subsequence that needs not be the case.
My question: How is dynamic programming applied in the "Solution Provided" below and what is its time complexity?
My attempt involves computing all the subsequences for each string giving sub1 and sub2. Append a '1' or a '0' to each sub1 and determine if that new subsequence is not present in sub2.Find the minimum length one. Here is my code:
My Solution
def get_subsequences(seq, index, subs, result):
if index == len(seq):
if subs:
result.add(''.join(subs))
else:
get_subsequences(seq, index + 1, subs, result)
get_subsequences(seq, index + 1, subs + [seq[index]], result)
def get_bad_subseq(subseq):
min_sub = ''
length = float('inf')
for sub in subseq:
for char in ['0', '1']:
if len(sub) + 1 < length and sub + char not in subseq:
length = len(sub) + 1
min_sub = sub + char
return min_sub
Solution Provided (not mine)
How does it work and its time complexity?
It looks that the below solution looks similar to: http://kyopro.hateblo.jp/entry/2018/12/11/100507
def set_nxt(s, nxt):
n = len(s)
idx_0 = n + 1
idx_1 = n + 1
for i in range(n, 0, -1):
nxt[i][0] = idx_0
nxt[i][1] = idx_1
if s[i-1] == '0':
idx_0 = i
else:
idx_1 = i
nxt[0][0] = idx_0
nxt[0][1] = idx_1
def get_shortest(seq1, seq2):
len_seq1 = len(seq1)
len_seq2 = len(seq2)
nxt_seq1 = [[len_seq1 + 1 for _ in range(2)] for _ in range(len_seq1 + 2)]
nxt_seq2 = [[len_seq2 + 1 for _ in range(2)] for _ in range(len_seq2 + 2)]
set_nxt(seq1, nxt_seq1)
set_nxt(seq2, nxt_seq2)
INF = 2 * max(len_seq1, len_seq2)
dp = [[INF for _ in range(len_seq2 + 2)] for _ in range(len_seq1 + 2)]
dp[len_seq1 + 1][len_seq2 + 1] = 0
for i in range( len_seq1 + 1, -1, -1):
for j in range(len_seq2 + 1, -1, -1):
for k in range(2):
if dp[nxt_seq1[i][k]][nxt_seq2[j][k]] < INF:
dp[i][j] = min(dp[i][j], dp[nxt_seq1[i][k]][nxt_seq2[j][k]] + 1);
res = ""
i = 0
j = 0
while i <= len_seq1 or j <= len_seq2:
for k in range(2):
if (dp[i][j] == dp[nxt_seq1[i][k]][nxt_seq2[j][k]] + 1):
i = nxt_seq1[i][k]
j = nxt_seq2[j][k]
res += str(k)
break;
return res

I am not going to work it through in detail, but the idea of this solution is to create a 2-D array of every combinations of positions in the one array and the other. It then populates this array with information about the shortest sequences that it finds that force you that far.
Just constructing that array takes space (and therefore time) O(len(seq1) * len(seq2)). Filling it in takes a similar time.
This is done with lots of bit twiddling that I don't want to track.
I have another approach that is clearer to me that usually takes less space and less time, but in the worst case could be as bad. But I have not coded it up.
UPDATE:
Here is is all coded up. With poor choices of variable names. Sorry about that.
# A trivial data class to hold a linked list for the candidate subsequences
# along with information about they match in the two sequences.
import collections
SubSeqLinkedList = collections.namedtuple('SubSeqLinkedList', 'value pos1 pos2 tail')
# This finds the position after the first match. No match is treated as off the end of seq.
def find_position_after_first_match (seq, start, value):
while start < len(seq) and seq[start] != value:
start += 1
return start+1
def make_longer_subsequence (subseq, value, seq1, seq2):
pos1 = find_position_after_first_match(seq1, subseq.pos1, value)
pos2 = find_position_after_first_match(seq2, subseq.pos2, value)
gotcha = SubSeqLinkedList(value=value, pos1=pos1, pos2=pos2, tail=subseq)
return gotcha
def minimal_nonsubseq (seq1, seq2):
# We start with one candidate for how to start the subsequence
# Namely an empty subsequence. Length 0, matches before the first character.
candidates = [SubSeqLinkedList(value=None, pos1=0, pos2=0, tail=None)]
# Now we try to replace candidates with longer maximal ones - nothing of
# the same length is better at going farther in both sequences.
# We keep this list ordered by descending how far it goes in sequence1.
while candidates[0].pos1 <= len(seq1) or candidates[0].pos2 <= len(seq2):
new_candidates = []
for candidate in candidates:
candidate1 = make_longer_subsequence(candidate, '0', seq1, seq2)
candidate2 = make_longer_subsequence(candidate, '1', seq1, seq2)
if candidate1.pos1 < candidate2.pos1:
# swap them.
candidate1, candidate2 = candidate2, candidate1
for c in (candidate1, candidate2):
if 0 == len(new_candidates):
new_candidates.append(c)
elif new_candidates[-1].pos1 <= c.pos1 and new_candidates[-1].pos2 <= c.pos2:
# We have found strictly better.
new_candidates[-1] = c
elif new_candidates[-1].pos2 < c.pos2:
# Note, by construction we cannot be shorter in pos1.
new_candidates.append(c)
# And now we throw away the ones we don't want.
# Those that are on their way to a solution will be captured in the linked list.
candidates = new_candidates
answer = candidates[0]
r_seq = [] # This winds up reversed.
while answer.value is not None:
r_seq.append(answer.value)
answer = answer.tail
return ''.join(reversed(r_seq))
print(minimal_nonsubseq('011', '1101'))

DNA subsequence dynamic programming question

I'm trying to solve DNA problem which is more of improved(?) version of LCS problem.
In the problem, there is string which is string and semi-substring which allows part of string to have one or no letter skipped. For example, for string "desktop", it has semi-substring {"destop", "dek", "stop", "skop","desk","top"}, all of which has one or no letter skipped.
Now, I am given two DNA strings consisting of {a,t,g,c}. I"m trying to find longest semi-substring, LSS. and if there is more than one LSS, print out the one in the fastest order.
For example, two dnas {attgcgtagcaatg, tctcaggtcgatagtgac} prints out "tctagcaatg"
and aaaattttcccc, cccgggggaatatca prints out "aattc"
I'm trying to use common LCS algorithm but cannot solve it with tables although I did solve the one with no letter skipped. Any advice?

This is a variation on the dynamic programming solution for LCS, written in Python.
First I'm building up a Suffix Tree for all the substrings that can be made from each string with the skip rule. Then I'm intersecting the suffix trees. Then I'm looking for the longest string that can be made from that intersection tree.
Please note that this is technically O(n^2). Its worst case is when both strings are the same character, repeated over and over again. Because you wind up with a lot of what logically is something like, "an 'l' at position 42 in the one string could have matched against position l at position 54 in the other". But in practice it will be O(n).
def find_subtree (text, max_skip=1):
tree = {}
tree_at_position = {}
def subtree_from_position (position):
if position not in tree_at_position:
this_tree = {}
if position < len(text):
char = text[position]
# Make sure that we've populated the further tree.
subtree_from_position(position + 1)
# If this char appeared later, include those possible matches.
if char in tree:
for char2, subtree in tree[char].iteritems():
this_tree[char2] = subtree
# And now update the new choices.
for skip in range(max_skip + 1, 0, -1):
if position + skip < len(text):
this_tree[text[position + skip]] = subtree_from_position(position + skip)
tree[char] = this_tree
tree_at_position[position] = this_tree
return tree_at_position[position]
subtree_from_position(0)
return tree
def find_longest_common_semistring (text1, text2):
tree1 = find_subtree(text1)
tree2 = find_subtree(text2)
answered = {}
def find_intersection (subtree1, subtree2):
unique = (id(subtree1), id(subtree2))
if unique not in answered:
answer = {}
for k, v in subtree1.iteritems():
if k in subtree2:
answer[k] = find_intersection(v, subtree2[k])
answered[unique] = answer
return answered[unique]
found_longest = {}
def find_longest (tree):
if id(tree) not in found_longest:
best_candidate = ''
for char, subtree in tree.iteritems():
candidate = char + find_longest(subtree)
if len(best_candidate) < len(candidate):
best_candidate = candidate
found_longest[id(tree)] = best_candidate
return found_longest[id(tree)]
intersection_tree = find_intersection(tree1, tree2)
return find_longest(intersection_tree)
print(find_longest_common_semistring("attgcgtagcaatg", "tctcaggtcgatagtgac"))

Let g(c, rs, rt) represent the longest common semi-substring of strings, S and T, ending at rs and rt, where rs and rt are the ranked occurences of the character, c, in S and T, respectively, and K is the number of skips allowed. Then we can form a recursion which we would be obliged to perform on all pairs of c in S and T.
JavaScript code:
function f(S, T, K){
// mapS maps a char to indexes of its occurrences in S
// rsS maps the index in S to that char's rank (index) in mapS
const [mapS, rsS] = mapString(S)
const [mapT, rsT] = mapString(T)
// h is used to memoize g
const h = {}
function g(c, rs, rt){
if (rs < 0 || rt < 0)
return 0
if (h.hasOwnProperty([c, rs, rt]))
return h[[c, rs, rt]]
// (We are guaranteed to be on
// a match in this state.)
let best = [1, c]
let idxS = mapS[c][rs]
let idxT = mapT[c][rt]
if (idxS == 0 || idxT == 0)
return best
for (let i=idxS-1; i>=Math.max(0, idxS - 1 - K); i--){
for (let j=idxT-1; j>=Math.max(0, idxT - 1 - K); j--){
if (S[i] == T[j]){
const [len, str] = g(S[i], rsS[i], rsT[j])
if (len + 1 >= best[0])
best = [len + 1, str + c]
}
}
}
return h[[c, rs, rt]] = best
}
let best = [0, '']
for (let c of Object.keys(mapS)){
for (let i=0; i<(mapS[c]||[]).length; i++){
for (let j=0; j<(mapT[c]||[]).length; j++){
let [len, str] = g(c, i, j)
if (len > best[0])
best = [len, str]
}
}
}
return best
}
function mapString(s){
let map = {}
let rs = []
for (let i=0; i<s.length; i++){
if (!map[s[i]]){
map[s[i]] = [i]
rs.push(0)
} else {
map[s[i]].push(i)
rs.push(map[s[i]].length - 1)
}
}
return [map, rs]
}
console.log(f('attgcgtagcaatg', 'tctcaggtcgatagtgac', 1))
console.log(f('aaaattttcccc', 'cccgggggaatatca', 1))
console.log(f('abcade', 'axe', 1))

Find the minimum number of edits to balance parentheses?

I was very confused about this question. I know about finding the edit distance between 2 strings using recursion and dynamic programming as an improvement, however am confused about how to go with this one.
Not sure if my thinking is correct. But we have a string of parenthesis which is unbalanced say
String s = "((())))";
How to find the String with balanced Parenthesis which requires minimum number of edits ?
Can some one explain this with an example ?
I am still not sure if I am explaining it correctly.

Given a string consisting of left and right parentheses, we are asked to balance it by performing a minimal number of delete, insert, and replace operations.
To begin with, let's look at the input string and distinguish matched pairs from unmatched characters. We can mark all the characters belonging to matched pairs by executing the following algorithm:
Find an unmarked '(' that is followed by an unmarked ')', with zero or more marked characters between the two.
If there is no such pair of characters, terminate the algorithm.
Otherwise, mark the '(' and the ')'.
Return to step 1.
The marked pairs are already balanced at zero cost, so the optimal course of action is to do nothing further with them.
Now let's consider the unmarked characters. Notice that no unmarked '(' is followed by an unmarked ')', or else the pair would have been marked. Therefore, if we scan the unmarked characters from left to right, we will find zero or more ')' characters followed by zero or more '(' characters.
To balance the sequence of ')' characters, it is optimal to rewrite every other one to '(', starting with the first one and excluding the last one. If there is an odd number of ')' characters, it is optimal to delete the last one.
As for the sequence of '(' characters, it is optimal to rewrite every other one to ')', starting with the second one. If there is a leftover '(' character, we delete it.
The following Python code implements the steps described above and displays the intermediate results.
def balance(s): # s is a string of '(' and ')' characters in any order
n = len(s)
print('original string: %s' % s)
# Mark all matched pairs
marked = n * [ False ]
left_parentheses = []
for i, ch in enumerate(s):
if ch == '(':
left_parentheses.append(i)
else:
if len(left_parentheses) != 0:
marked[i] = True
marked[left_parentheses.pop()] = True
# Display the matched pairs and unmatched characters.
matched, remaining = [], []
for i, ch in enumerate(s):
if marked[i]:
matched.append(ch)
remaining.append(' ')
else:
matched.append(' ')
remaining.append(ch)
print(' matched pairs: %s' % ''.join(matched))
print(' unmatched: %s' % ''.join(remaining))
cost = 0
deleted = n * [ False ]
new_chars = list(s)
# Balance the unmatched ')' characters.
right_count, last_right = 0, -1
for i, ch in enumerate(s):
if not marked[i] and ch == ')':
right_count += 1
if right_count % 2 == 1:
new_chars[i] = '('
cost += 1
last_right = i
if right_count % 2 == 1: # Delete the last ')' if we couldn't match it.
deleted[last_right] = True # The cost was incremented during replacement.
# Balance the unmatched '(' characters.
left_count, last_left = 0, -1
for i, ch in enumerate(s):
if not marked[i] and ch == '(':
left_count += 1
if left_count % 2 == 0:
new_chars[i] = ')'
cost += 1
else:
last_left = i
if left_count % 2 == 1: # Delete the last '(' if we couldn't match it.
deleted[last_left] = True # This character wasn't replaced, so we must
cost += 1 # increment the cost now.
# Display the outcome of replacing and deleting.
balanced = []
for i, ch in enumerate(new_chars):
if marked[i] or deleted[i]:
balanced.append(' ')
else:
balanced.append(ch)
print(' balance: %s' % ''.join(balanced))
# Display the cost of balancing and the overall balanced string.
print(' cost: %d' % cost)
result = []
for i, ch in enumerate(new_chars):
if not deleted[i]: # Skip deleted characters.
result.append(ch)
print(' new string: %s' % ''.join(result))
balance(')()(()())))()((())((')
For the test case ')()(()())))()((())((', the output is as follows.
original string: )()(()())))()((())((
matched pairs: ()(()()) () (())
unmatched: ) )) ( ((
balance: ( ) ( )
cost: 4
new string: (()(()()))()((()))

The idea is simple:
Find final string having left over open and close brackets which couldn't make pair. Remember that in this final string, close brackets will be present 1st and then open brackets.
Now we will have to edit open brackets and close brackets separately.
eg: for close brackets:
(1) if it is of even length:
min edit to balance will be to change half close brackets to open brackets.
So minEdit = closeBracketCount/2 .
(2) If it is of odd length:
min edit to balance will be to do above step 1 and remove the remaining 1 bracket.
So minEdit = closeBracketCount/2 + 1
For open brackets:
(1) if it is of even length:
min edit to balance will be to change half open brackets to close brackets.
So minEdit = openBracketCount/2.
(2) If it is of odd length:
min edit to balance will be to do above step 1 and remove the remaining 1 bracket.
So minEdit = openBracketCount/2 + 1
Here is the running code: http://codeshare.io/bX1Dt
Let me know your thoughts.

While this interesting problem can be solved with dynamic programming as mentioned in the comments, there exists an easier solution to it. You can solve it with the greedy algorithm.
Idea for this greedy algorithm comes from how we check the validity of parentheses expression. You set counter to 0 and traverse the parentheses string, add 1 at "(" and substract 1 at ")". If counter always stays above or at 0 and finishes at 0, you have a valid string.
This implies that if the lowest value that we encountered while traversing is -maxi, we need to add exactly -maxi "(" at the start. Adjust final counter value for added "(" and add enough ")" at the end to finish at 0.
Here is the pseudo-code for the algorithm:
counter = 0
mini = 0
for each p in string:
if p == "(":
counter++
else:
counter--
mini = min(counter, mini)
add -mini "(" at the start of the string
counter -= mini
add counter ")" at the end of the string

I tired to solve the problem with DP algorithm and it passed a few test cases made up by myself. Let me know if you think it's correct.
Let P(i,j) be the minimum number of edits to make string S[i..j] balanced.
When S[i] equals S[j], the number of minimum edits is obviously P(i+1,j-1)
There are a few options to make the string balanced when S[i] != S[j], but in the end we could either add '(' to the front of i or ')' at the end of j, or remove the parenthesis at i or j. In all these cases, the minimum number of edits is min{P(i+1, j), P(i, j-1)} + 1.
We therefore have below DP formula:
P(i,j) = 0 if i > j
= P(i + 1, j - 1) if S[i] matches S[j] OR S[i] and S[j] are not parenthesis
= min{P(i + 1, j), P(i, j - 1)} + 1

I would use stack to balance them efficiently. Here is python code:
a=['(((((','a(b)c)','((())))',')()(()())))()((())((']
def balance(s):
st=[]
l=len(s)
i=0
while i<l:
if s[i]=='(':
st.append(i)
elif s[i]==')':
if st:
st.pop()
else:
del s[i]
i-=1
l-=1
i+=1
while st:
del s[st.pop()]
return ''.join(s)
for i in a:
print balance(list(i))
Output:
Empty
a(b)c
((()))
()(()())()(())

//fisher
public int minInsertions(String s) {
Stack<Character> stack = new Stack<>();
int insertionsNeeded = 0;
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (c == '(') {
if (stack.isEmpty()) {
stack.add(c);
} else {
if (stack.peek() == ')') {
//in this case, we need to add one more ')' to get two consecutive right paren, then we could pop the one ')' and one '(' off the stack
insertionsNeeded++;
stack.pop();
stack.pop();
stack.add(c);
} else {
stack.add(c);
}
}
} else if (c == ')') {
if (stack.isEmpty()) {
//in this case, we need to add one '(' before we add this ')' onto this stack
insertionsNeeded++;
stack.add('(');
stack.add(c);
} else {
if (stack.peek() == ')') {
//in this case, we could pop the one ')' and one '(' off the stack
stack.pop();
stack.pop();
} else {
stack.add(c);
}
}
}
}
if (stack.isEmpty()) {
return insertionsNeeded;
} else {
while (!stack.isEmpty()) {
char pop = stack.pop();
if (pop == '(') {
insertionsNeeded += 2;
} else {
insertionsNeeded++;
stack.pop();
}
}
return insertionsNeeded;
}
}
}

Find all possible combinations of a String representation of a number

Given a mapping:
A: 1
B: 2
C: 3
...
...
...
Z: 26
Find all possible ways a number can be represented. E.g. For an input: "121", we can represent it as:
ABA [using: 1 2 1]
LA [using: 12 1]
AU [using: 1 21]
I tried thinking about using some sort of a dynamic programming approach, but I am not sure how to proceed. I was asked this question in a technical interview.
Here is a solution I could think of, please let me know if this looks good:
A[i]: Total number of ways to represent the sub-array number[0..i-1] using the integer to alphabet mapping.
Solution [am I missing something?]:
A[0] = 1 // there is only 1 way to represent the subarray consisting of only 1 number
for(i = 1:A.size):
A[i] = A[i-1]
if(input[i-1]*10 + input[i] < 26):
A[i] += 1
end
end
print A[A.size-1]

To just get the count, the dynamic programming approach is pretty straight-forward:
A[0] = 1
for i = 1:n
A[i] = 0
if input[i-1] > 0 // avoid 0
A[i] += A[i-1];
if i > 1 && // avoid index-out-of-bounds on i = 1
10 <= (10*input[i-2] + input[i-1]) <= 26 // check that number is 10-26
A[i] += A[i-2];
If you instead want to list all representations, dynamic programming isn't particularly well-suited for this, you're better off with a simple recursive algorithm.

First off, we need to find an intuitive way to enumerate all the possibilities. My simple construction, is given below.
let us assume a simple way to represent your integer in string format.
a1 a2 a3 a4 ....an, for instance in 121 a1 -> 1 a2 -> 2, a3 -> 1
Now,
We need to find out number of possibilities of placing a + sign in between two characters. + is to mean characters concatenation here.
a1 - a2 - a3 - .... - an, - shows the places where '+' can be placed. So, number of positions is n - 1, where n is the string length.
Assume a position may or may not have a + symbol shall be represented as a bit.
So, this boils down to how many different bit strings are possible with the length of n-1, which is clearly 2^(n-1). Now in order to enumerate the possibilities go through every bit string and place right + signs in respective positions to get every representations,
For your example, 121
Four bit strings are possible 00 01 10 11
1 2 1
1 2 + 1
1 + 2 1
1 + 2 + 1
And if you see a character followed by a +, just add the next char with the current one and do it sequentially to get the representation,
x + y z a + b + c d
would be (x+y) z (a+b+c) d
Hope it helps.
And you will have to take care of edge cases where the size of some integer > 26, of course.

I think, recursive traverse through all possible combinations would do just fine:
mapping = {"1":"A", "2":"B", "3":"C", "4":"D", "5":"E", "6":"F", "7":"G",
"8":"H", "9":"I", "10":"J",
"11":"K", "12":"L", "13":"M", "14":"N", "15":"O", "16":"P",
"17":"Q", "18":"R", "19":"S", "20":"T", "21":"U", "22":"V", "23":"W",
"24":"A", "25":"Y", "26":"Z"}
def represent(A, B):
if A == B == '':
return [""]
ret = []
if A in mapping:
ret += [mapping[A] + r for r in represent(B, '')]
if len(A) > 1:
ret += represent(A[:-1], A[-1]+B)
return ret
print represent("121", "")

Assuming you only need to count the number of combinations.
Assuming 0 followed by an integer in [1,9] is not a valid concatenation, then a brute-force strategy would be:
Count(s,n)
x=0
if (s[n-1] is valid)
x=Count(s,n-1)
y=0
if (s[n-2] concat s[n-1] is valid)
y=Count(s,n-2)
return x+y
A better strategy would be to use divide-and-conquer:
Count(s,start,n)
if (len is even)
{
//split s into equal left and right part, total count is left count multiply right count
x=Count(s,start,n/2) + Count(s,start+n/2,n/2);
y=0;
if (s[start+len/2-1] concat s[start+len/2] is valid)
{
//if middle two charaters concatenation is valid
//count left of the middle two characters
//count right of the middle two characters
//multiply the two counts and add to existing count
y=Count(s,start,len/2-1)*Count(s,start+len/2+1,len/2-1);
}
return x+y;
}
else
{
//there are three cases here:
//case 1: if middle character is valid,
//then count everything to the left of the middle character,
//count everything to the right of the middle character,
//multiply the two, assign to x
x=...
//case 2: if middle character concatenates the one to the left is valid,
//then count everything to the left of these two characters
//count everything to the right of these two characters
//multiply the two, assign to y
y=...
//case 3: if middle character concatenates the one to the right is valid,
//then count everything to the left of these two characters
//count everything to the right of these two characters
//multiply the two, assign to z
z=...
return x+y+z;
}
The brute-force solution has time complexity of T(n)=T(n-1)+T(n-2)+O(1) which is exponential.
The divide-and-conquer solution has time complexity of T(n)=3T(n/2)+O(1) which is O(n**lg3).
Hope this is correct.

Something like this?
Haskell code:
import qualified Data.Map as M
import Data.Maybe (fromJust)
combs str = f str [] where
charMap = M.fromList $ zip (map show [1..]) ['A'..'Z']
f [] result = [reverse result]
f (x:xs) result
| null xs =
case M.lookup [x] charMap of
Nothing -> ["The character " ++ [x] ++ " is not in the map."]
Just a -> [reverse $ a:result]
| otherwise =
case M.lookup [x,head xs] charMap of
Just a -> f (tail xs) (a:result)
++ (f xs ((fromJust $ M.lookup [x] charMap):result))
Nothing -> case M.lookup [x] charMap of
Nothing -> ["The character " ++ [x]
++ " is not in the map."]
Just a -> f xs (a:result)
Output:
*Main> combs "121"
["LA","AU","ABA"]

Here is the solution based on my discussion here:
private static int decoder2(int[] input) {
int[] A = new int[input.length + 1];
A[0] = 1;
for(int i=1; i<input.length+1; i++) {
A[i] = 0;
if(input[i-1] > 0) {
A[i] += A[i-1];
}
if (i > 1 && (10*input[i-2] + input[i-1]) <= 26) {
A[i] += A[i-2];
}
System.out.println(A[i]);
}
return A[input.length];
}

Just us breadth-first search.
for instance 121
Start from the first integer,
consider 1 integer character first, map 1 to a, leave 21
then 2 integer character map 12 to L leave 1.

This problem can be done in o(fib(n+2)) time with a standard DP algorithm.
We have exactly n sub problems and button up we can solve each problem with size i in o(fib(i)) time.
Summing the series gives fib (n+2).
If you consider the question carefully you see that it is a Fibonacci series.
I took a standard Fibonacci code and just changed it to fit our conditions.
The space is obviously bound to the size of all solutions o(fib(n)).
Consider this pseudo code:
Map<Integer, String> mapping = new HashMap<Integer, String>();
List<String > iterative_fib_sequence(string input) {
int length = input.length;
if (length <= 1)
{
if (length==0)
{
return "";
}
else//input is a-j
{
return mapping.get(input);
}
}
List<String> b = new List<String>();
List<String> a = new List<String>(mapping.get(input.substring(0,0));
List<String> c = new List<String>();
for (int i = 1; i < length; ++i)
{
int dig2Prefix = input.substring(i-1, i); //Get a letter with 2 digit (k-z)
if (mapping.contains(dig2Prefix))
{
String word2Prefix = mapping.get(dig2Prefix);
foreach (String s in b)
{
c.Add(s.append(word2Prefix));
}
}
int dig1Prefix = input.substring(i, i); //Get a letter with 1 digit (a-j)
String word1Prefix = mapping.get(dig1Prefix);
foreach (String s in a)
{
c.Add(s.append(word1Prefix));
}
b = a;
a = c;
c = new List<String>();
}
return a;
}

old question but adding an answer so that one can find help
It took me some time to understand the solution to this problem – I refer accepted answer and #Karthikeyan's answer and the solution from geeksforgeeks and written my own code as below:
To understand my code first understand below examples:
we know, decodings([1, 2]) are "AB" or "L" and so decoding_counts([1, 2]) == 2
And, decodings([1, 2, 1]) are "ABA", "AU", "LA" and so decoding_counts([1, 2, 1]) == 3
using the above two examples let's evaluate decodings([1, 2, 1, 4]):
case:- "taking next digit as single digit"
taking 4 as single digit to decode to letter 'D', we get decodings([1, 2, 1, 4]) == decoding_counts([1, 2, 1]) because [1, 2, 1, 4] will be decode as "ABAD", "AUD", "LAD"
case:- "combining next digit with the previous digit"
combining 4 with previous 1 as 14 as a single to decode to letter N, we get decodings([1, 2, 1, 4]) == decoding_counts([1, 2]) because [1, 2, 1, 4] will be decode as "ABN" or "LN"
Below is my Python code, read comments
def decoding_counts(digits):
# defininig count as, counts[i] -> decoding_counts(digits[: i+1])
counts = [0] * len(digits)
counts[0] = 1
for i in xrange(1, len(digits)):
# case:- "taking next digit as single digit"
if digits[i] != 0: # `0` do not have mapping to any letter
counts[i] = counts[i -1]
# case:- "combining next digit with the previous digit"
combine = 10 * digits[i - 1] + digits[i]
if 10 <= combine <= 26: # two digits mappings
counts[i] += (1 if i < 2 else counts[i-2])
return counts[-1]
for digits in "13", "121", "1214", "1234121":
print digits, "-->", decoding_counts(map(int, digits))
outputs:
13 --> 2
121 --> 3
1214 --> 5
1234121 --> 9
note: I assumed that input digits do not start with 0 and only consists of 0-9 and have a sufficent length

For Swift, this is what I came up with. Basically, I converted the string into an array and goes through it, adding a space into different positions of this array, then appending them to another array for the second part, which should be easy after this is done.
//test case
let input = [1,2,2,1]
func combination(_ input: String) {
var arr = Array(input)
var possible = [String]()
//... means inclusive range
for i in 2...arr.count {
var temp = arr
//basically goes through it backwards so
// adding the space doesn't mess up the index
for j in (1..<i).reversed() {
temp.insert(" ", at: j)
possible.append(String(temp))
}
}
print(possible)
}
combination(input)
//prints:
//["1 221", "12 21", "1 2 21", "122 1", "12 2 1", "1 2 2 1"]

def stringCombinations(digits, i=0, s=''):
if i == len(digits):
print(s)
return
alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
total = 0
for j in range(i, min(i + 1, len(digits) - 1) + 1):
total = (total * 10) + digits[j]
if 0 < total <= 26:
stringCombinations(digits, j + 1, s + alphabet[total - 1])
if __name__ == '__main__':
digits = list()
n = input()
n.split()
d = list(n)
for i in d:
i = int(i)
digits.append(i)
print(digits)
stringCombinations(digits)

native string matching algorithm

Following is a very famous question in native string matching. Please can someone explain me the answer.
Suppose that all characters in the pattern P are different. Show how to accelerate NAIVE-STRING MATCHER to run in time O(n) on an n-character text T.

The basic idea:
Iterate through the input and the pattern at the same time, comparing their characters to each other
Whenever you get a non-matching character between the two, you can just reset the pattern position and keep the input position as is
This works because the pattern characters are all different, which means that whenever you have a partial match, there can be no other match overlapping with that, so we can just start looking from the end of the partial match.
Here's some pseudo-code that shouldn't be too difficult to understand:
input[n]
pattern[k]
pPos = 0
iPos = 0
while iPos < n
if pPos == k
FOUND!
if pattern[pPos] == input[iPos]
pPos++
iPos++
else
// if pPos is already 0, we need to increase iPos,
// otherwise we just keep comparing the same characters
if pPos == 0
iPos++
pPos = 0
It's easy to see that iPos increases at least every second loop, thus there can be at most 2n loop runs, making the running time O(n).

When T[i] and P[j] mismatches in NAIVE-STRING-MATCHER, we can skip all characters before T[i] and begin new matching from T[i + 1] with P[1].
NAIVE-STRING-MATCHER(T, P)
1 n length[T]
2 m length[P]
3 for s 0 to n - m
4 do if P[1 . . m] = T[s + 1 . . s + m]
5 then print "Pattern occurs with shift" s

Naive string search algorithm implementations in Python 2.7:
https://gist.github.com/heyhuyen/4341692
In the middle of implementing Boyer-Moore's string search algorithm, I decided to play with my original naive search algorithm. It's implemented as an instance method that takes a string to be searched. The object has an attribute 'pattern' which is the pattern to match.
1) Here is the original version of the search method, using a double for-loop.
Makes calls to range and len
def search(self, string):
for i in range(len(string)):
for j in range(len(self.pattern)):
if string[i+j] != self.pattern[j]:
break
elif j == len(self.pattern) - 1:
return i
return -1
2) Here is the second version, using a double while-loop instead.
Slightly faster, not making calls to range
def search(self, string):
i = 0
while i < len(string):
j = 0
while j < len(self.pattern) and self.pattern[j] == string[i+j]:
j += 1
if j == len(self.pattern):
return i
i += 1
return -1
3) Here is the original, replacing range with xrange.
Faster than both of the previous two.
def search(self, string):
for i in xrange(len(string)):
for j in xrange(len(self.pattern)):
if string[i+j] != self.pattern[j]:
break
elif j == len(self.pattern) - 1:
return i
return -1
4) Storing values in local variables = win! With the double while loop, this is the fastest.
def search(self, string):
len_pat = len(self.pattern)
len_str = len(string)
i = 0
while i < len_str:
j = 0
while j < len_pat and self.pattern[j] == string[i+j]:
j += 1
if j == len_pat:
return i
i += 1
return -1

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Algorithm for simple string compression - algorithm

I would like to find the shortest possible encoding for a string in the following form: abbcccc = a2b4c

Related

Dynamic Programming for shortest subsequence that is not a subsequence of two strings

DNA subsequence dynamic programming question

Find the minimum number of edits to balance parentheses?

Find all possible combinations of a String representation of a number

native string matching algorithm

Categories

Resources