Find a substitution that sorts the list - algorithm

Consider the following words:
PINEAPPLE
BANANA
ARTICHOKE
TOMATO
The goal is to sort the list (in lexicographical order) without moving the words themselves, but by using letter substitutions. In this example, I can replace the letter P with A and the letter A with P, so:
AINEPAALE
BPNPNP
PRTICHOKE
TOMPTO
This list is in lexicographical order. If you switch two letters, they are switched in all words. It is worth noting that you can use the whole alphabet, not just the letters in the words of the list.
I spent considerable time on this problem, but was not able to think of anything other than brute force (trying all letter-switch combinations), nor was I able to come up with the conditions that define when the list can be sorted.
Some more examples:
ABC
ABB
ABD
can be turned into
ACB
ACC
ACD
which satisfies the condition.

Let's assume the problem is possible for a particular case, just for now. Also, for simplicity, assume all the words are distinct (if two words are identical, they must be adjacent and one can be ignored).
The problem then turns into topological sort, though the details are slightly different from suspicious dog's answer, which misses a couple of cases.
Consider a graph of 26 nodes, labeled A through Z. Each pair of adjacent words contributes one directed edge to the partial ordering, corresponding to the first character in which the words differ. For example, with the two words ABCEF and ABRKS in order, the first difference is in the third character, so sigma(C) < sigma(R).
The result can be obtained by doing a topological sort on this graph, and substituting A for the first node in the ordering, B for the second, etc.
Note that this also gives a useful characterization of when the problem is impossible to solve. This occurs when two identical words are not adjacent (identical words must form one contiguous cluster), when one word is a prefix of another but comes after it, or when the graph has a cycle and topological sort is impossible.
Here is a fully functional solution in Python, complete with detection of when a particular instance of the problem is unsolvable.
def topoSort(N, adj):
    stack = []
    visited = [False for _ in range(N)]
    current = [False for _ in range(N)]
    def dfs(v):
        if current[v]: return False  # there's a cycle!
        if visited[v]: return True
        visited[v] = current[v] = True
        for x in adj[v]:
            if not dfs(x):
                return False
        current[v] = False
        stack.append(v)
        return True
    for i in range(N):
        if not visited[i]:
            if not dfs(i):
                return None
    return list(reversed(stack))
def solve(wordlist):
    N = 26
    adj = [set([]) for _ in range(N)]  # adjacency list
    for w1, w2 in zip(wordlist[:-1], wordlist[1:]):
        idx = 0
        while idx < len(w1) and idx < len(w2):
            if w1[idx] != w2[idx]: break
            idx += 1
        else:
            # no differences found between the words
            if len(w1) > len(w2):
                return None
            continue
        c1, c2 = w1[idx], w2[idx]
        # we want c1 < c2 after the substitution
        adj[ord(c1) - ord('A')].add(ord(c2) - ord('A'))
    li = topoSort(N, adj)
    if li is None:
        return None  # the constraint graph has a cycle: unsolvable
    sub = {}
    for i in range(N):
        sub[chr(ord('A') + li[i])] = chr(ord('A') + i)
    return sub
def main():
    words = ['PINEAPPLE', 'BANANA', 'ARTICHOKE', 'TOMATO']
    print('Before: ' + ' '.join(words))
    sub = solve(words)
    if sub is None:
        print('No substitution can sort this list.')
        return
    nwords = [''.join(sub[c] for c in w) for w in words]
    print('After : ' + ' '.join(nwords))

if __name__ == '__main__':
    main()
EDIT: The time complexity of this solution is a provably-optimal O(S), where S is the length of the input. Thanks to suspicious dog for this; the original time complexity was O(N^2 L).

Update: the original analysis was wrong and failed on some class of test cases, as pointed out by Eric Zhang.
I believe this can be solved with a form of topological sort. Your initial list of words defines a partial order or a directed graph on some set of letters. You wish to find a substitution that linearizes this graph of letters. Let's use one of your non-trivial examples:
P A R K O V I S T E
P A R A D O N T O Z A
P A D A K
A B B A
A B E C E D A
A B S I N T
Let x <* y indicate that substitution(x) < substitution(y) for some letters (or words) x and y. We want word1 <* word2 <* word3 <* word4 <* word5 <* word6 overall, but in terms of letters, we just need to look at each pair of adjacent words and find the first pair of differing characters in the same column:
K <* A (from PAR[K]OVISTE <* PAR[A]DONTOZA)
R <* D (from PA[R]ADONTOZA <* PA[D]AK)
P <* A (from [P]ADAK <* [A]BBA)
B <* E (from AB[B]A <* AB[E]CEDA)
E <* S (from AB[E]CEDA <* AB[S]INT)
If we find no mismatched letters, then there are 3 cases:
word 1 and word 2 are the same
word 1 is a prefix of word 2
word 2 is a prefix of word 1
In cases 1 and 2, the words are already in lexicographic order, so we don't need to perform any substitutions (although we still might), and they add no extra constraints that we need to adhere to. In case 3, there is no substitution at all that will fix this (think of ["DOGGO", "DOG"]), so there's no possible solution and we can quit early.
Otherwise, we build the directed graph corresponding to the partial ordering information we obtained and perform a topological sort. If the sorting process indicates that no linearization is possible, then there is no solution for sorting the list of words. Otherwise, you get back something like:
P <* K <* R <* B <* E <* A <* D <* S
Depending on how you implement your topological sort, you might get a different linear ordering. Now you just need to assign each letter a substitution that respects this ordering and is itself sorted alphabetically. A simple option is to pair the linear ordering with itself sorted alphabetically, and use that as the substitution:
P <* K <* R <* B <* E <* A <* D <* S
|    |    |    |    |    |    |    |
A <  B <  D <  E <  K <  P <  R <  S
But you could implement a different substitution rule if you wish.
Here's a proof-of-concept in Python:
import collections
import itertools

# a pair of outgoing and incoming edges
Edges = collections.namedtuple('Edges', 'outgoing incoming')

# a mapping from nodes to edges
Graph = lambda: collections.defaultdict(lambda: Edges(set(), set()))

def substitution_sort(words):
    graph = build_graph(words)
    if graph is None:
        return None
    ordering = toposort(graph)
    if ordering is None:
        return None
    # create a substitution that respects `ordering`
    substitutions = dict(zip(ordering, sorted(ordering)))
    # apply substitutions
    return [
        ''.join(substitutions.get(char, char) for char in word)
        for word in words
    ]

def build_graph(words):
    graph = Graph()
    # loop over every pair of adjacent words and find the first
    # pair of corresponding characters where they differ
    for word1, word2 in zip(words, words[1:]):
        for char1, char2 in zip(word1, word2):
            if char1 != char2:
                break
        else:  # no differing characters found...
            if len(word1) > len(word2):
                # ...but word2 is a prefix of word1 and comes after;
                # therefore, no solution is possible
                return None
            else:
                # ...so no new information to add to the graph
                continue
        # add edge from char1 -> char2 to the graph
        graph[char1].outgoing.add(char2)
        graph[char2].incoming.add(char1)
    return graph

def toposort(graph):
    "Kahn's algorithm; returns None if graph contains a cycle"
    result = []
    working_set = {node for node, edges in graph.items() if not edges.incoming}
    while working_set:
        node = working_set.pop()
        result.append(node)
        outgoing = graph[node].outgoing
        while outgoing:
            neighbour = outgoing.pop()
            neighbour_incoming = graph[neighbour].incoming
            neighbour_incoming.remove(node)
            if not neighbour_incoming:
                working_set.add(neighbour)
    if any(edges.incoming or edges.outgoing for edges in graph.values()):
        return None
    else:
        return result
def print_all(items):
    for item in items:
        print(item)
    print()

def test():
    test_cases = [
        ('PINEAPPLE BANANA ARTICHOKE TOMATO', True),
        ('ABC ABB ABD', True),
        ('AB AA AB', False),
        ('PARKOVISTE PARADONTOZA PADAK ABBA ABECEDA ABSINT', True),
        ('AA AB CA', True),
        ('DOG DOGGO DOG DIG BAT BAD', False),
        ('DOG DOG DOGGO DIG BIG BAD', True),
    ]
    for words, is_sortable in test_cases:
        words = words.split()
        print_all(words)
        subbed = substitution_sort(words)
        if subbed is not None:
            assert subbed == sorted(subbed), subbed
            print_all(subbed)
        else:
            print('<no solution>')
            print()
        print('expected solution?', 'yes' if is_sortable else 'no')
        print()

if __name__ == '__main__':
    test()
Now, it's not ideal--for example, it still performs a substitution even if the original list of words is already sorted--but it appears to work. I can't formally prove it works though, so if you find a counter-example, please let me know!

Extract the first letter of each word in the list: (P, B, A, T).
Sort those letters: (A, B, P, T).
Replace each original first letter, in all words, with the letter at the same position in the sorted list:
Replace P (PINEAPPLE) with A in all words.
Replace B with B in all words.
Replace A with P in all words.
Replace T with T in all words.
This will give you the intended result.
Edit:
Compare two adjacent strings. If the first is greater than the second, find the first position at which their characters differ, then swap those two letters throughout all the words.
Repeat this over the entire list, as in bubble sort.
Example -
We need ABC to sort before ABB. The first character mismatch is at the 3rd position, so we swap all C's with B's.


Efficient way to find maximum number in previous N numbers in an array

Input:
listi = [9, 7, 8, 4, 6, 1, 3, 2, 5]
Output:
# m=3
listo = [9, 8, 8, 6, 6, 3, 5]
Given a random list composed of n numbers, I need to find all of the sublists of m consecutive elements, take the largest value from each sublist, and put those maxima in a new list.
def convert(listi, m):
    listo = []
    n = len(listi)
    for i in range(n-m+1):
        listo.append(max(listi[i:i+m]))
    return listo
The time complexity of this implementation is O(m(n-m+1)), which is pretty bad if listi is long. Is there a way to implement this with O(n) complexity?
Surprisingly, the easily accessible descriptions of this algorithm are not that easy to understand, so the trick is this:
As you slide a window of length m over your list of length n, you maintain a deque of all the elements in the current window that might, at some point, become the maximum in any window.
An element in the current window might become the maximum if it is greater than all the elements that occur after it in the window. Note that this always includes the last element in the current window.
Since every element in the deque is > all the elements after it, elements in the deque are monotonically decreasing, and the first one is therefore the maximum element in the current window.
As the window slides one position to the right, you can maintain this deque as follows: remove all the elements from the end that are <= the new element, then add the new element to the end of the deque. If the element that drops off the front of the window is the first element in the deque, remove it. Since each element is added and removed at most once, the total time required to maintain this deque is in O(n).
To make it easy to tell when the element at the front of the deque falls out of the window, store the elements' indexes in the deque instead of their values.
Here's a reasonably efficient python implementation:
def windowMax(listi, m):
    # the part of this list at positions >= qs is a deque
    # with elements monotonically decreasing. Each one
    # may be the max in a window at some point
    q = []
    qs = 0
    listo = []
    for i in range(len(listi)):
        # remove items from the end of the q that are <= the new one
        while len(q) > qs and listi[q[-1]] <= listi[i]:
            del q[-1]
        # add new item
        q.append(i)
        if i >= m-1:
            listo.append(listi[q[qs]])
            # element falls off start of window
            if i - q[qs] >= m-1:
                qs += 1
            # don't waste storage in q. This doesn't change the deque
            if qs > m:
                del q[0:m]
                qs -= m
    return listo
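For instance, on the question's sample input this reproduces the expected output (a quick sanity check I added, not part of the original answer):
print(windowMax([9, 7, 8, 4, 6, 1, 3, 2, 5], 3))  # -> [9, 8, 8, 6, 6, 3, 5]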
There is a beautiful solution with a running time independent of M.
In the figure below, the first row represents the initial sequence. In the second row, we have the maxima of groups of 1, 2, … M consecutive elements from left to right ("prefix" maxima). In the third row, we have the maxima of groups of 1, 2, … M consecutive elements from right to left ("suffix" maxima). And in the fourth row, the maxima of elements of the second and third rows.
a b c d e f g h i j k l m n o
a ab abc d de def g gh ghi j jk jkl m mn mno
abc bc c def ef f ghi hi i jkl kl l mno no o
abc bcd cde def efg fgh ghi hij ijk jkl klm lmn mno
Note that there are replicated elements in row three, which we needn't compute.
The computation of the second row takes M-1 comparisons per slice of M elements; the third row takes M-2 (skipping the replicated elements), and the fourth takes M. So, ignoring the effects at the ends, we perform slightly fewer than 3 comparisons per element.
The required storage is an additional array of M elements to temporarily evaluate slices of the third row.
Source: Efficient Dilation, Erosion, Opening and Closing Algorithms, Joseph (Yossi) Gil & Ron Kimmel.
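To make the rows concrete, here is a hedged Python sketch of the idea (the function name is mine; pre and suf are the second and third rows, computed per block of m elements, and the last line is the fourth row):
def windowMaxBlocks(listi, m):
    n = len(listi)
    # second row: prefix maxima within each block of m elements
    pre = listi[:]
    for i in range(n):
        if i % m != 0:
            pre[i] = max(pre[i], pre[i - 1])
    # third row: suffix maxima within each block of m elements
    suf = listi[:]
    for i in range(n - 2, -1, -1):
        if (i + 1) % m != 0:
            suf[i] = max(suf[i], suf[i + 1])
    # fourth row: window [i, i+m-1] spans at most two blocks,
    # so its max is the suffix max of the first block part
    # combined with the prefix max of the second
    return [max(suf[i], pre[i + m - 1]) for i in range(n - m + 1)]

print(windowMaxBlocks([9, 7, 8, 4, 6, 1, 3, 2, 5], 3))  # -> [9, 8, 8, 6, 6, 3, 5]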
I tried timing with zip, and the result seems about 50% faster than your current function, though I can't really tell the time complexity difference.
import timeit

setup = """
from random import randint
listi = [randint(1, 100) for _ in range(1000)]
def convert(iterable, m):
    t = [iterable[x:] for x in range(m)]
    result = [max(combo) for combo in zip(*t)]
    return result"""

print(min(timeit.Timer('a=listi; convert(a,3)', setup=setup).repeat(7, 1000)))
# 0.250054761

setup2 = """
from random import randint
listi = [randint(1, 100) for _ in range(1000)]
def convert2(listi, m):
    listo = []
    n = len(listi)
    for i in range(n-m+1):
        listo.append(max(listi[i:i+m]))
    return listo"""

print(min(timeit.Timer('a=listi; convert2(a,3)', setup=setup2).repeat(7, 1000)))
# 0.400374625

Algorithm to group items in groups of 3

I am trying to solve a problem where I have pairs like:
A C
B F
A D
D C
F E
E B
A B
B C
E D
F D
and I need to group them in groups of 3, where each group must form a triangle of matches from that list. Basically, I need to determine whether or not it is possible to group the whole collection this way.
So the possible groupings are (ACD and BFE), or (ABC and DEF), and this collection is groupable, since all letters can be placed in groups of 3 with none left out.
I made a script that achieves this for small amounts of input, but for big amounts it gets too slow.
My logic is:
make a nested loop to find the first match (looping until I find a match)
> remove the 3 elements from the collection
> run again
and I do this until I run out of letters. Since there can be different combinations, I run this multiple times, starting at different letters, until I find a match.
I understand that this gives me loops on the order of at least N^N, which can get too slow. Is there better logic for such problems? Can a binary tree be used here?
This problem can be modeled as a graph clique cover problem. Every letter is a node and every pair is an edge, and you want to partition the graph into vertex-disjoint cliques of size 3 (triangles). If you want the partitioning to be of minimum cardinality, then you want a minimum clique cover.
Strictly speaking, this would be a k-clique cover problem, because in the general clique cover problem you can have cliques of arbitrary/different sizes.
As Alberto Rivelli already stated, this problem is reducible to the Clique Cover problem, which is NP-hard.
It is also reducible to the problem of finding a clique of a particular/maximum size. Maybe there are other, non-NP-hard problems your particular case could be reduced to, but I didn't think of any.
However, there do exist algorithms which can find the solution in polynomial time, although not always for worst cases. One of them is the Bron–Kerbosch algorithm, which is by far the most efficient known algorithm for finding the maximum clique, with a worst-case running time of O(3^(n/3)). I don't know the size of your inputs, but I hope it will be sufficient for your problem.
Here is the code in Python, ready to go:
#!/usr/bin/python3
# by DeFazer
# Solution to:
#   stackoverflow.com/questions/40193648/algorithm-to-group-items-in-groups-of-3
# Input:
#   N P - number of vertices and number of pairs
#   P pairs, 1 pair per line
# Output:
#   "YES" and the groups themselves if grouping is possible, "NO" otherwise
# Input example:
#   6 10
#   1 3
#   2 6
#   1 4
#   4 3
#   6 5
#   5 2
#   1 2
#   2 3
#   5 4
#   6 4
# Output example:
#   YES
#   1-2-3
#   4-5-6
# Output commentary:
#   There are 2 possible coverages: 1-2-3*4-5-6 and 2-5-6*1-3-4.
#   If required, it can easily be modified to return all possible groupings
#   rather than just one.
# Algorithm:
#   1) List *all* existing triangles (1-2-3, 1-3-4, 2-5-6...)
#   2) Build a graph where vertices represent triangles and edges connect
#      triangles that share no common vertices.
#   3) Use a slightly modified Bron-Kerbosch algorithm
#      (en.wikipedia.org/wiki/Bron-Kerbosch_algorithm) to find a clique of
#      size N/3. The grouping is possible if such a clique exists.

N, P = map(int, input().split())
assert (N % 3 == 0) and (N > 0)
cliquelength = N // 3
pairs = {}  # {a:{b, d, c}, b:{a, c, f}, c:{a, b}...}

# Get input
# [(0, 1), (1, 3), (3, 2)...]
pairlist = []
for pair in range(P):
    a, b = map(int, input().split())
    if a > b:
        b, a = a, b
    a, b = a-1, b-1
    pairlist.append((a, b))
pairlist.sort()

for a, b in pairlist:
    if a not in pairs:
        pairs[a] = set()
    pairs[a].add(b)

# Make a list of triangles
triangles = []
for a in range(N-2):
    for b in pairs.get(a, []):
        for c in pairs.get(b, []):
            if c in pairs[a]:
                # note: no break here, since *all* triangles on the
                # edge (a, b) are needed, not just the first one found
                triangles.append((a, b, c))

def no_mutual_elements(sortedtupleA, sortedtupleB):
    # Utility function
    # TODO: if too slow, this can be improved to O(n) since the tuples are
    # sorted. However, there are only 9 comparisons in the case of triangles.
    return all((a not in sortedtupleB) for a in sortedtupleA)

# Make a graph out of that list
tgraph = []  # if a<b and (b in tgraph[a]), then triangles[a] shares no elements with triangles[b]
T = len(triangles)
for t1 in range(T):
    s = set()
    for t2 in range(t1+1, T):
        if no_mutual_elements(triangles[t1], triangles[t2]):
            s.add(t2)
    tgraph.append(s)

def connected(a, b):
    if a > b:
        b, a = a, b
    return (b in tgraph[a])

# Finally, the magic algorithm!
CSUB = set()
def extend(CAND: set, NOT: set) -> bool:
    # while CAND is not empty and there is no vertex in NOT connected
    # to *all* vertices in CAND
    while CAND and all((any(not connected(n, c) for c in CAND)) for n in NOT):
        v = CAND.pop()
        CSUB.add(v)
        newCAND = {c for c in CAND if connected(c, v)}
        newNOT = {n for n in NOT if connected(n, v)}
        if (not newCAND) and (not newNOT) and (len(CSUB) == cliquelength):
            # the last condition is the algorithm modification
            return True
        elif extend(newCAND, newNOT):
            return True
        else:
            CSUB.remove(v)
            NOT.add(v)
    return False

if extend(set(range(T)), set()):
    print("YES")
    # If the clique itself is not needed, just remove the following 2 lines
    for a, b, c in (triangles[c] for c in CSUB):
        print("{}-{}-{}".format(a+1, b+1, c+1))
else:
    print("NO")
If this solution is still too slow, perhaps it would be more efficient to solve the Clique Cover problem instead. If that's the case, I can try to find a proper algorithm for it.
Hope that helps!
Well, I have implemented the job in JS, where I feel most confident. I also tried it with 100,000 edges randomly selected from 26 letters. Provided that they are all unique and there is no degenerate edge such as ["A","A"], it resolves in around 90~500 msecs. The most convoluted part was obtaining the non-identical groups, i.e. those that differ by more than just the order of the triangles. For the given edge data it resolves within 1 msec.
As a summary, the first reduce stage finds the triangles and the second reduce stage groups the disconnected ones.
function getDisconnectedTriangles(edges){
  return edges.reduce(function(p,e,i,a){
           var ce = a.slice(i+1)
                     .filter(f => f.some(n => e.includes(n))), // connected edges
               re = [];                                        // resulting edges
           if (ce.length > 1){
             re = ce.reduce(function(r,v,j,b){
                    var xv = v.find(n => e.indexOf(n) === -1), // find the external vertex
                        xe = b.slice(j+1)                      // find the external edges
                              .filter(f => f.indexOf(xv) !== -1);
                    return xe.length ? (r.push([...new Set(e.concat(v,xe[0]))]),r) : r;
                  },[]);
           }
           return re.length ? p.concat(re) : p;
         },[])
         .reduce((s,t,i,a) => t.used ? s
                                     : (s.push(a.map((_,j) => a[(i+j)%a.length])
                                                .reduce((p,c,k) => k-1 ? p.every(t => t.every(n => c.every(v => n !== v))) ? (c.used = true, p.push(c), p) : p
                                                                       : [p].every(t => t.every(n => c.every(v => n !== v))) ? (c.used = true, [p,c]) : [p])), s)
                ,[]);
}

var edges = [["A","C"],["B","F"],["A","D"],["D","C"],["F","E"],["E","B"],["A","B"],["B","C"],["E","D"],["F","D"]],
    ps = 0,
    pe = 0,
    result = [];

ps = performance.now();
result = getDisconnectedTriangles(edges);
pe = performance.now();
console.log("Disconnected triangles are calculated in", pe-ps, "msecs and the result is:");
console.log(result);
You may generate random edges in different lengths and play with the code here

Group a range of integers such as no pair of numbers is shared by two or more groups

You are given two numbers, N and G. Your goal is to split the range of integers [1..N] into equal groups of G numbers each, so that each pair of numbers is placed in exactly one group. Order does not matter.
For example, given N=9 and G=3, I could get these 12 groups:
1-2-3
1-4-5
1-6-7
1-8-9
2-4-6
2-5-8
2-7-9
3-4-9
3-5-7
3-6-8
4-7-8
5-6-9
As you can see, each possible pair of numbers from 1 to 9 is found in exactly one group. I should also mention that such grouping cannot be done for every possible combination of N and G.
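As a sanity check, here is a small helper of my own (not part of the question) that verifies the defining property on the 12 groups above:
import itertools

def is_valid_grouping(n, groups):
    seen = set()
    for group in groups:
        for pair in itertools.combinations(sorted(group), 2):
            if pair in seen:
                return False               # this pair is shared by two groups
            seen.add(pair)
    return len(seen) == n * (n - 1) // 2   # every pair is covered

groups = [(1,2,3),(1,4,5),(1,6,7),(1,8,9),(2,4,6),(2,5,8),
          (2,7,9),(3,4,9),(3,5,7),(3,6,8),(4,7,8),(5,6,9)]
print(is_valid_grouping(9, groups))  # -> True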
I believe this problem is best modelled with hypergraphs: numbers are vertexes, groups are hyperedges, each hyperedge must connect exactly G vertexes, and no pair of vertexes can be shared by any two hyperedges.
At first, I tried to brute-force this problem: recursively pick valid vertexes until either running out of vertexes or finding a solution. It was way too slow, so I started looking for ways to cut off some definitely wrong groups: if a smaller set of groups is found to be invalid, then any other set of groups which includes it can be predicted to be invalid too.
Here is the code I have so far (I hope that lack of comments is not a big concern):
#!/usr/bin/python3
# Input format:
#   vertexes group_size
# Example:
#   9 3

from collections import deque
from sys import stderr
import itertools

def log(frmt, *args, **kwargs):
    'Lovely logging subroutine'
    if len(args) == len(kwargs) == 0:
        print(frmt, file=stderr)
    else:
        print(frmt.format(*args, **kwargs), file=stderr)

v, g = map(int, input().split())
linkcount = (v*(v-1)) // 2
if (linkcount % g) != 0:
    print("INVALID GROUP SIZE")
    exit()
groupcount = linkcount // g

def pairs(it):
    return itertools.combinations(it, 2)

# --- Vertex connection routines ---
connections = [[False for dst in range(v)] for src in range(v)]
# TODO: optimize the matrix to eat up less space for large graphs
# ...that is, when the freaking SLOWNESS is fixed for graph size to make any difference. >_<

def connected(a, b):
    if a == b:
        return True
    if a > b:
        a, b = b, a
    # assert a < b
    return connections[a][b]

def setconnect(a, b, value):
    if a == b:
        return False
    if a > b:
        a, b = b, a
    # assert a < b
    connections[a][b] = value

def connect(*vs):
    for v1, v2 in pairs(vs):
        setconnect(v1, v2, True)

def disconnect(*vs):
    for v1, v2 in pairs(vs):
        setconnect(v1, v2, False)
# --

# --- Failure prediction routines ---
failgroups = {}

def addFailure(groupId):
    'Mark current group set as unsuccessful'
    cnode = failgroups
    sgroups = sorted(groups[:groupId+1])
    for gp in sgroups:
        if gp not in cnode:
            cnode[gp] = {}
        cnode = cnode[gp]
    cnode['!'] = True  # aka "end of node"

def findInSubtree(node, string, stringptr):
    if stringptr >= len(string):
        return False
    c = string[stringptr]
    if c in node:
        if '!' in node[c]:
            return True
        else:
            return findInSubtree(node[c], string, stringptr+1)
    else:
        return findInSubtree(node, string, stringptr+1)

def predictFailure(groupId) -> bool:
    'Predict if the current group set will be unsuccessful'
    sgroups = sorted(groups[:groupId+1])
    return findInSubtree(failgroups, sgroups, 0)
# --

groups = [None for grp in range(groupcount)]

def debug_format_groups():
    # fluffy formatting for debugging
    return ' '.join(('-'.join(str(i+1) for i in group) if group else '?') for group in groups)

def try_group(groupId):
    for cg in itertools.combinations(range(v), g):
        groups[groupId] = cg
        # Predict whether or not this group will be unsuccessful
        if predictFailure(groupId):
            continue
        # Verify that all vertexes are unconnected
        if any(connected(v1, v2) for v1, v2 in pairs(cg)):
            continue
        # Connect all vertexes
        connect(*cg)
        if groupId == groupcount-1:
            return True  # Last group is successful! Yupee!
        elif try_group(groupId+1):
            # Next group was successful, so -
            return True
        # Disconnect these vertexes
        disconnect(*cg)
        # Mark this group set as unsuccessful
        addFailure(groupId)
    else:
        groups[groupId] = None
        return False

result = try_group(0)
if result:
    formatted_groups = sorted('-'.join(str(i+1) for i in group) for group in groups)
    for f in formatted_groups:
        print(f)
else:
    print("NO SOLUTION")
Is this an NP-complete problem? Can it be generalized as another, well-known problem, and if so, which one?
P.S. This is not a homework or contest task, if anything.
P.P.S. Sorry for my bad English; it's not my native language. I have no objections if someone edits my question for clearer phrasing. Thanks! ^^
In the meantime, I'm ready to clarify any confusing points here.
UPDATE:
I've been thinking about it and realized that there is a better way than backtracking. First, we could build a graph where vertices represent all possible groups and edges connect all groups without common pairs of numbers. Then, each clique of size N(N-1)/(2G) would represent a solution! Unfortunately, the clique problem is NP-complete, and generating all binom(N, G) possible groups would eat up too much memory for large values of N and G. Is it possible to find a better solution?
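For illustration, here is a minimal sketch of that construction (my own code, not a full solver); it also makes the memory blow-up tangible, since the vertex set alone has binom(N, G) members:
import itertools

def build_group_graph(n, g):
    # vertices: all candidate groups; edges: groups sharing no pair
    groups = list(itertools.combinations(range(1, n + 1), g))
    def pairset(group):
        return set(itertools.combinations(group, 2))
    edges = [(i, j)
             for i, j in itertools.combinations(range(len(groups)), 2)
             if not (pairset(groups[i]) & pairset(groups[j]))]
    # a clique of size n*(n-1)//(2*g) in this graph is a solution
    return groups, edges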

Boyer-Moore Galil Rule

I was implementing the Boyer-Moore Algorithm for substring search in Python when I learned about the Galil Rule. I've looked around online for the Galil Rule but I haven't found anything more than a couple of sentences, and I cannot get access to the original paper. How can I implement this into my current algorithm?
i = 0
while i < (N - M + 1):
    skip = 0
    for j in reversed(range(0, M)):
        if pattern[j] != text[i + j]:
            skip = max(1, j - offsets[text[i + j]])
            break
    if skip == 0:
        return i
    i += skip
return -1
Notes:
offsets[c] = -1 if c is not in the pattern
offsets[c] = last index of c in the pattern
Example:
aaabcb
offsets[a] = 2
offsets[b] = 5
offsets[c] = 4
offsets[d] = -1
The few sentences I have found have said to keep track of when the first mismatch occurs in my inner loop (j, if the if-statement inside the inner loop is True) and the position in which I started the comparisons (i + j, in my case). I understand the intuition that I've already checked all the indices in between those, so I shouldn't have to do those comparisons again. I just don't understand how to connect the dots and arrive at an implementation.
The Galil rule is about exploiting periodicity in the pattern to reduce comparisons. Say you have a pattern abcabcab. It's periodic with smallest period abc. In general, a pattern P is periodic if there's a string U such that P is a prefix of UUUUU.... (In the above example, abcabcab is clearly a prefix of the repeating string abc = U.) We call the shortest such string the period of P. Let the length of that period be k (in the example above k = 3 since U = abc).
First of all, keep in mind that the Galil rule applies only after you've found an occurrence of P in the text. When you do that, the Galil rule says that you could shift by k (the periodicity of the pattern) and you only have to compare the last k characters of the now shifted pattern to determine if there was a match.
Here's an example:
P = ababa
T = bababababab
U = ab
k = 2
First occurrence: b[ababa]babab. Now you can shift by k = 2 and you only have to check the last two characters of the pattern:
T = bababa[ba]bab
P = aba[ba] // Only need to compare chars inside brackets for next match.
The rest of P must match since P is periodic and you shifted it by its period k from an existing match (this is crucial) so the repeating parts will nicely line up.
If you've found another match, just repeat. If you find a mismatch, however, you revert to the standard Boyer-Moore algorithm until you find another match. Remember, you can only use the Galil rule when you find a match and you shift by k (otherwise the pattern is not guaranteed to line up with the previous occurrence).
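To connect the dots, here is a hedged sketch of one way the rule could be wired into the question's loop (the function name and the dict-based offsets lookup are my assumptions; it reports all matches, which is where the Galil rule pays off):
def bm_galil_search(text, pattern, offsets, k):
    # bad-character Boyer-Moore with the Galil rule added (a sketch)
    N, M = len(text), len(pattern)
    matches = []
    i = 0
    start = 0  # leftmost pattern position that still needs comparing
    while i < N - M + 1:
        skip = 0
        for j in reversed(range(start, M)):
            if pattern[j] != text[i + j]:
                skip = max(1, j - offsets.get(text[i + j], -1))
                start = 0      # mismatch: next alignment needs a full check
                break
        if skip == 0:
            matches.append(i)  # full match found
            skip = k           # Galil rule: shift by the period...
            start = M - k      # ...and only compare the last k chars next time
        i += skip
    return matches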
Now, you might wonder, how to determine k for a given pattern P. You'll need to calculate the suffixes array N first, where N[i] will be the length of the longest common suffix of the prefix P[0, i] and P. (You can calculate the suffixes array by calculating the prefixes array Z on the reverse of P using the Z algorithm, as described here, for example.) Once you have the suffixes array, you can easily find k since it'll be the smallest k > 0 such that N[m - k - 1] == m - k (where m = |P|).
For example:
P = ababa
m = 5
N = [1, 0, 3, 0, 5]
k = 2 because N[m - k - 1] == N[5 - 2 - 1] == N[2] == 3 == 5 - k
The answer by Lajos Nagy has explained the idea of the Galil rule perfectly; however, there is a more straightforward way to calculate k:
Just use the prefix function of the KMP algorithm.
Here prefix[i] is the length of the longest proper prefix of P[0..i] which is also a suffix of P[0..i].
And k = m - prefix[m-1].
This article has explained the details.
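For reference, a minimal sketch of that calculation (the function name is mine; the loop is the textbook KMP prefix-function construction):
def period(pattern):
    m = len(pattern)
    prefix = [0] * m
    for i in range(1, m):
        j = prefix[i - 1]
        while j > 0 and pattern[i] != pattern[j]:
            j = prefix[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        prefix[i] = j
    return m - prefix[m - 1]

print(period("ababa"))     # -> 2
print(period("abcabcab"))  # -> 3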

The Movie Scheduling _Problem_

Currently I'm reading "The Algorithm Design Manual" by Skiena (well, beginning to read)
He poses a problem he calls the "Movie Scheduling Problem":
Problem: Movie Scheduling Problem
Input: A set I of n intervals on the line.
Output: What is the largest subset of mutually non-overlapping intervals which can be selected from I?
Example: (Each dashed line is a movie, you want to find a set with the highest quantity of movies)
----a---
-----b---- -----c--- ---d---
-----e--- -------f---
--g-- --h--
The algorithm I thought of to solve it was like this:
I could throw out the "worst offender" (the movie that intersects with the most other movies) until there are no worst offenders left (zero intersections). The only problem I see is a tie: say two different movies each intersect with 3 other movies; could it matter which one I throw out?
Basically, I'm wondering how to turn this idea into "math" and how to prove it correct or incorrect.
The algorithm is incorrect. Let's consider the following example:
Counterexample
        |----F----|         |-----G------|
     |-------D-------|    |--------E--------|
|-----A------|   |------B------|   |------C-------|
You can see that there is a solution of size at least 3 because you can pick A, B and C.
First, let's count, for each interval, the number of intersections:
A = 2 [F, D]
B = 4 [D, F, E, G]
C = 2 [E, G]
D = 3 [A, B, F]
E = 3 [B, C, G]
F = 3 [A, B, D]
G = 3 [B, C, E]
Now consider a run of your algorithm. In the first step we delete B, because it intersects with the largest number of intervals, and we get:
        |----F----|         |-----G------|
     |-------D-------|    |--------E--------|
|-----A------|                     |------C-------|
It's easy to see that from {A, D, F} you can now choose only one interval, because each pair of them intersects. The same goes for {G, E, C}; so after deleting B you can choose at most one from {A, D, F} and at most one from {G, E, C}, for a total of 2, which is smaller than the size of {A, B, C}.
The conclusion is that after deleting B, the interval which intersects the largest number of other intervals, you can't reach the maximum number of non-intersecting movies.
Correct solution
The problem is very well known, and one solution is to pick the interval which ends first, delete all intervals intersecting it, and continue until there are no intervals left to examine. This is an example of a greedy method, and you can find or develop a proof that it's correct.
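As an illustration only (my sketch, not part of the original answer), here is that greedy rule with movies represented as (start, end) pairs:
def max_nonoverlapping(intervals):
    # earliest-finishing-time greedy: sort by end, keep whatever fits
    chosen = []
    last_end = float('-inf')
    for start, end in sorted(intervals, key=lambda iv: iv[1]):
        if start >= last_end:      # does not overlap the last chosen interval
            chosen.append((start, end))
            last_end = end
    return chosen

print(max_nonoverlapping([(0, 13), (17, 31), (35, 50), (5, 21), (26, 44)]))
# -> [(0, 13), (17, 31), (35, 50)]   (i.e. A, B, C from the example above)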
This looks like a dynamic programming problem to me:
Define the following functions:
sched(t) = best schedule starting at time t
next(t) = set of movies that start next after time t
len(m) = length of movie m
next returns a set because there may be more than one movie that starts at the same time.
then sched should be defined as follows:
sched(t) = max { 1 + sched(t + len(m)), sched(t+1) } where m in next(t)
This recursive function selects a movie m from next(t) and compares the largest possible sets that either include or don't include m.
Invoke sched with the time of your first movie and you will get the size of the optimal set. Getting the optimal set itself just requires a little extra logic to remember which movies you select at each invocation.
I think this recursive (as opposed to iterative) algorithm runs in O(n^2) if you use memoization, where n is the number of movies.
It's correct, but I'd have to consult my algorithms textbook to give you an explicit proof; hopefully this algorithm makes intuitive sense as to why it is correct.
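Here is a hedged sketch of what that memoized recursion might look like, assuming integer start times and lengths (all names are mine):
from functools import lru_cache

def best_schedule(movies):
    # movies: list of (start, length) pairs with integer times (assumption)
    horizon = max(s + l for s, l in movies)

    @lru_cache(maxsize=None)
    def sched(t):
        if t > horizon:
            return 0
        best = sched(t + 1)                         # skip time t entirely
        for s, l in movies:
            if s == t:                              # a movie in next(t)
                best = max(best, 1 + sched(t + l))  # watch it, resume at t + len(m)
        return best

    return sched(min(s for s, _ in movies))

print(best_schedule([(0, 3), (2, 3), (4, 2)]))  # -> 2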
# go through the database and create a 2-D matrix indexed a..h by a..h. Set each
# element of the matrix to 1 if the row index movie overlaps the column index movie.
mtx = []
for i in range(8):
    column = []
    for j in range(8):
        column.append(0)
    mtx.append(column)

# b <> e
mtx[1][4] = 1
mtx[4][1] = 1
# e <> g
mtx[4][6] = 1
mtx[6][4] = 1
# e <> c
mtx[4][2] = 1
mtx[2][4] = 1
# c <> a
mtx[2][0] = 1
mtx[0][2] = 1
# c <> f
mtx[2][5] = 1
mtx[5][2] = 1
# c <> g
mtx[2][6] = 1
mtx[6][2] = 1
# c <> h
mtx[2][7] = 1
mtx[7][2] = 1
# d <> f
mtx[3][5] = 1
mtx[5][3] = 1
# a <> f
mtx[0][5] = 1
mtx[5][0] = 1
# a <> d
mtx[0][3] = 1
mtx[3][0] = 1
# a <> h
mtx[0][7] = 1
mtx[7][0] = 1
# e <> h
mtx[4][7] = 1
mtx[7][4] = 1

# print out constraints
for line in mtx:
    print(line)

# keep track of which movies are still allowed
allowed = set(range(8))

# loop through in greedy fashion, picking the movie that throws out the least
# number of other movies at each step
while allowed:
    best_col = None
    best_lost = set()
    best = 8  # score if movie does not overlap with any other
    # each step, only try movies still allowed
    for col in allowed:
        lost = set()
        for row in range(8):
            # keep track of other movies eliminated by this selection
            if mtx[row][col] == 1:
                lost.add(row)
        # this was the best of all the allowed choices so far
        if len(lost) < best:
            best_col = col
            best_lost = lost
            best = len(lost)
    # there was a valid selection, process it
    if best_col is not None:
        print('watch movie: ' + chr(best_col + ord('a')))
        for row in best_lost:
            # now eliminate the other movies you can't watch any more
            if row in allowed:
                print('throwing out: ' + chr(row + ord('a')))
                allowed.remove(row)
        # also throw out this movie from the allowed list (can't watch twice)
        allowed.remove(best_col)

# this is just a greedy algorithm, not guaranteed optimal!
# you could also iterate through all possible combinations of movies
# and simply eliminate all illegal possibilities (brute force search)
