Suppose I have two vectors of "starts" and "stops", sorted in ascending order.
Vector 1 = [start1 stop1;
start2 stop2;
start3 stop3];
Vector 2 = [start4 stop4;
start5 stop5;
start6 stop6];
What is the most efficient way of determining the intersection/overlap of these two vectors?
I've had to do this on a couple of occasions. It's a simple task, but the logic can get quite messy.
One thing you must decide upon, up-front, is whether intervals are closed or open. That is, do the intervals [1,3] and [3,5] have an intersection at [3,3], or no intersection? I strongly advise "no intersection" (closed intervals tend to be much more painful to reason about than open or half-open intervals), but your use case may require otherwise.
I think the cleanest way to do this is to maintain a "current partial interval" from each list. By "partial" I mean that each interval may be "eaten away" from the bottom as intersections with intervals from the other list are recognized and output. This simplifies the logic by only forcing you to consider two intervals at a time, rather than processing all the V2 intervals which are relevant to some interval in V1.
To simplify the code further, you can allow intervals to be temporarily invalid, and start with both current intervals invalid. This makes the loop slightly more branchy, but it means you only have to handle updating the intervals in one place and with one rule.
So the pseudocode goes like this (I'm destructively reading from V1 and V2, and writing to VI):
v1a,v1b = 0,0   # Empty and hence invalid
v2a,v2b = 0,0   # intervals to start with.
while True:
    if v1a >= v1b:              # Handle an invalid V1 interval
        if V1.empty():          # If there's no more V1s,
            return              # no more intersections are possible.
        else:
            v1a,v1b = V1.pop()  # Grab the next full interval from V1
    if v2a >= v2b:              # Likewise for V2.
        if V2.empty():
            return
        else:
            v2a,v2b = V2.pop()
    lower_bound = max(v1a, v2a)  # Determine the overlap, if any, between
    upper_bound = min(v1b, v2b)  # the current two intervals.
    if lower_bound < upper_bound:
        VI.push(lower_bound, upper_bound)  # Output the overlapping interval.
    v1a = max(v1a, upper_bound)  # Snip away the region which has now been
    v2a = max(v2a, upper_bound)  # handled. This may make one or both invalid.
The last two lines are the tricky bit. If there was an intersection, then upper_bound is its upper end: There are no remaining intersecting ranges below it, so they can be removed from either or both current intervals. If, however, the two current intervals did not overlap, then it has the effect of setting the lower interval's a to its own b, making it invalid and causing it to be replaced on the next iteration.
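For reference, here is a minimal runnable Python version of that pseudocode (a sketch only; it assumes half-open intervals supplied as sorted lists of (start, stop) pairs, and the function name is illustrative):

def intersect_intervals(V1, V2):
    VI = []
    V1, V2 = list(V1), list(V2)   # copies we can consume destructively
    v1a = v1b = v2a = v2b = 0     # start with invalid (empty) current intervals
    while True:
        if v1a >= v1b:            # current V1 interval is invalid/used up
            if not V1:
                return VI
            v1a, v1b = V1.pop(0)
        if v2a >= v2b:            # current V2 interval is invalid/used up
            if not V2:
                return VI
            v2a, v2b = V2.pop(0)
        lower = max(v1a, v2a)     # overlap, if any, of the two current intervals
        upper = min(v1b, v2b)
        if lower < upper:
            VI.append((lower, upper))
        v1a = max(v1a, upper)     # snip away the handled region
        v2a = max(v2a, upper)

For example, intersect_intervals([(1, 5), (8, 12)], [(3, 9)]) returns [(3, 5), (8, 9)].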
I believe you can take advantage of the fact that the lists are sorted and then do the following (pseudo code):
* Grab the first spans
* Determine if the spans overlap
* If the spans overlap, then max(starts) to min(ends) is an overlap
* Increment the span with the smallest end
* Merge results - you may need to pass over your results and merge overlapping elements
Here's some python that implements this (note - a and b would be lists that contain (start,end) tuples):
overlaps = []
try:
    # prime the current spans (raises IndexError immediately if either list is empty)
    a_span = a.pop(0)
    b_span = b.pop(0)
    while True:
        if b_span[1] > a_span[0] and b_span[0] < a_span[1]:
            overlaps.append((a_span[0] if a_span[0] > b_span[0] else b_span[0],
                             a_span[1] if a_span[1] < b_span[1] else b_span[1]))
        if a_span[1] < b_span[1]:
            a_span = a.pop(0)
        else:
            b_span = b.pop(0)
except IndexError:
    pass
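For example, with a = [(1, 5), (8, 12)] and b = [(3, 9)], overlaps ends up as [(3, 5), (8, 9)] once the loop finishes.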
The details are a bit cringe, fair warning lol:
I want to set up meters on the floor of my building to catch someone; assume my floor is a number line from 0 to length L. The specific type of meter I am designing has a radius of detection that is 4.7 meters in the -x and +x direction (diameter of 9.4 meters of detection). I want to set them up in such a way that if the person I am trying to find sets foot anywhere on the floor, I will know. However, I can't just set up a meter anywhere (it may annoy other residents); therefore, there are only n valid locations where I can set up a meter. Additionally, these meters are expensive and time-consuming to make, so I would like to use as few as possible.
For simplicity, you can assume the meter has 0 width, and that each valid location is just a point on the aforementioned number line. What is a greedy algorithm that places as few meters as possible while detecting the entire hallway of length L, or, if detecting the entire hallway is not possible, outputs false for the set of n locations I have (and, if it isn't able to detect the whole hallway, still uses as few meters as possible while attempting to do so)?
Edit: some clarification on being able to detect the entire hallway or not
Given:
L (hallway length)
a list of N valid positions to place a meter (p_0 ... p_N-1) of radius 4.7
You can determine in O(N) either a valid and minimal ("good") covering of the whole hallway or a proof that no such covering exists given the constraints as follows (pseudo-code):
// total = total length;
// start = current starting position, initially 0
// possible = list of possible meter positions
// placed = list of (optimal) meter placements, initially empty
boolean solve(float total, float start, List<Float> possible, List<Float> placed):
    if (total-start <= 0):
        return true;    // problem solved with no additional meters - woo!
    else:
        Float next = extractFurthestWithinRange(start, possible, 4.7);
        if (next == null):
            return false;   // no way to cover end of hall: report failure
        else:
            placed.add(next);   // placement decided
            return solve(total, next + 4.7, possible, placed);
Where extractFurthestWithinRange(float start, List<Float> candidates, float range) returns null if there are no candidates within range of start, or returns the last position p in candidates such that p <= start + range -- and also removes p and every candidate c such that c <= p.
The key here is that, by always choosing to place a meter in the next position that a) leaves no gaps and b) is furthest from the previously-placed position, we are simultaneously creating a valid covering (= no gaps) and an optimal covering (= no possible valid covering could have used fewer meters - because our gaps are already as wide as possible). At each iteration, we either completely solve the problem, or take a greedy bite to reduce it to a (guaranteed) smaller problem.
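For concreteness, here is a small Python sketch of this greedy strategy (an iterative rendering of the pseudo-code above; the function name and return convention are illustrative, and the valid positions are assumed to be sorted in ascending order):

def cover_hallway(L, positions, radius=4.7):
    positions = sorted(positions)
    placed = []
    covered = 0.0                 # everything in [0, covered] is already detected
    i = 0
    while covered < L:
        best = None
        # take the furthest position whose detection range still reaches `covered`
        while i < len(positions) and positions[i] <= covered + radius:
            best = positions[i]
            i += 1
        if best is None:
            return False, placed  # a gap that no remaining position can cover
        placed.append(best)
        covered = best + radius
    return True, placed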
Note that there can be other optimal coverings with different meter positions, but they will use the exact same number of meters as those returned from this pseudo-code. For example, if you adapt the code to start from the end of the hallway instead of from the start, the covering would still be good, but the gaps could be rearranged. Indeed, if you need the lexicographically minimal optimal covering, you should use the adapted algorithm that places meters starting from the end:
// remaining = length (starts at hallway length)
// possible = positions to place meters at, starting by closest to end of hallway
// placed = positions where meters have been placed
boolean solve(float remaining, List<Float> possible, Queue<Float> placed):
    if (remaining <= 0):
        return true;    // problem solved with no additional meters - woo!
    else:
        // extracts points p up to and including p such that p >= remaining - range
        Float next = extractFurthestWithinRange2(remaining, possible, 4.7);
        if (next == null):
            return false;   // no way to cover start of hall: report failure
        else:
            placed.add(next);   // placement decided
            return solve(next - 4.7, possible, placed);
To prove that your solution is optimal if it is found, you merely have to prove that it finds the lexicographically last optimal solution.
And you do that by induction on the size of the lexicographically last optimal solution. The case of a zero length floor and no monitor is trivial. Otherwise you demonstrate that you found the first element of the lexicographically last solution. And covering the rest of the line with the remaining elements is your induction step.
Technical note: for this to work, you have to be allowed to place monitoring stations outside of the line.
Problem. The weighted mean/average can be used to give differing weight in a mean computation to elements of differing importance. I need to figure out an extension that would in turn 'scale' or 'weigh' the resulting weighted mean with regards to zero, depending on the actual (non-normalized) values of the weights:
if the weights are low, the scaled weighted mean should be close to 0.
if at least some weights are close to the max weight, then the scaled weighted mean should be more or less equivalent with simple weighted mean.
Rationale and details. I need such an extension in order to produce a more sensible mean value in a case where:
the weights are proximity/similarity scores (of interval (0,1)) of the elements (let's call them neighbors for simplicity) of a target element, in some space, and
the values on the neighbors (being averaged) reflect a change in some quality of theirs (because it is assumed to have an effect on the target, if they are close enough)
elements that are further away should have less weight, so using weighted mean seems reasonable - but in some cases, all the neighbors are far away - in these cases, they presumably should have little to no effect on the target (so their mean should reflect this, and be closer to zero).
Reproducible example. This requirement is not met when using a simple weighted mean:
# Using R for example code (answer doesn't have to use R)
weighted.mean = function(x, w){
  return( sum(x*w)/sum(w) ) # standard way to calculate weighted mean
}
# Example data:
weights1 = c(0.9, 0.1, 0.01) # proximity of neighbors to target
weights2 = c(0.1, 0.1, 0.01) # proximity of neighbors to some other target
values = c(1,2,10) # values on these neighbors
mean(values)
# 4.333333 # not useful, doesn't take into account distance of elements at all
weighted.mean(values, weights1)
# 1.188119 # useful result, reflects distance/weight!
weighted.mean(values, weights2)
# 1.904762 # not useful result - none of them should have any effect, being all distant; the mean should be close to 0 (no effect) instead
What I've tried so far (1): removing the normalizing sum(weights) business and just taking the mean of values*weights:
weighted.mean2 = function(x, w){
  return( mean(x*w) )
}
weighted.mean2(values, weights1)
# 0.4 # lower value, but should be viewed relatively in comparison now
weighted.mean2(values, weights2)
# 0.1333333 # makes more sense, low proximity leads to low(er) mean value
What I've tried so far (2): call weighted.mean() on 0 and the weighted mean, with the new weights for this two-element vector being 1 (max proximity/identity) and the proximity of the closest neighbor as a scale; the reasoning being that if the target has no close neighbors, then the effect in question should be about 0:
weighted.mean3 = function(x, w){
  tmp = weighted.mean(x, w)
  maxw = max(w)
  return( weighted.mean( c(0, tmp), c(1, maxw) ) )
}
weighted.mean3(values, weights1)
# 0.5627931
weighted.mean3(values, weights2)
# 0.1731602 # also makes sense, low proximity leads to low(er) mean value
Both approaches seem to yield a smaller value for the target with distant neighbors, and a comparatively higher value for a target with closer neighbors. However, this feels rather hacky to me, and I'm not sure whether there might be cases where either approach fails - surely there must be a more principled/established algorithm for something like this out there (perhaps it's not called 'mean' or 'average', though; also, if one of my attempts is equivalent to one, then the answer could just confirm that). Long story short:
Is there an established/published method to weigh/scale a weighted mean in the way I've described above?
Note on previous version of the question: it was initially flagged as too broad, so I rewrote it and applied to reopen, but it was auto-closed as being abandoned; so I rewrote a new question; this one also now has a clear yes or no answer (rationale and/or references beyond a simple yes/no are of course appreciated)
I have a complex algorithm which calculates the result of a function f(x). In the real world, f(x) is a continuous function. However, due to rounding errors in the algorithm, this is not the case in the computer program. The following diagram gives an example:
Furthermore, I have a list of several thousand values Fi.
I am looking for all the x values which match an Fi value, i.e. f(xi) = Fi.
I can solve this problem by simply iterating through the x values, as in the following pseudo code:
for i=0 to NumberOfChecks-1 do
begin
  //calculate the function result with the algorithm
  x=i*(xmax-xmin)/NumberOfChecks;
  FunctionResult=CalculateFunctionResultWithAlgorithm(x);
  //loop through the value list to see if the function result matches a value in the list
  for j=0 to NumberOfValuesInTheList-1 do
  begin
    if Abs(FunctionResult-ListValues[j])<Epsilon then
    begin
      //mark that element j of the list matches
      //and store the corresponding x value in the list
    end
  end
end
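(For reference, the same brute-force scan as a small Python function; calculate_f and the parameter names are placeholders, and xmin is assumed to be the start of the range:)

def brute_force_matches(calculate_f, list_values, xmin, xmax, number_of_checks, epsilon):
    matches = []                          # (index into list_values, x) pairs
    for i in range(number_of_checks):
        x = xmin + i * (xmax - xmin) / number_of_checks
        fx = calculate_f(x)               # the expensive algorithm
        for j, v in enumerate(list_values):
            if abs(fx - v) < epsilon:
                matches.append((j, x))
    return matches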
Of course it is necessary to use a high number of checks; otherwise I will miss some x values. The higher the number of checks, the more complete and accurate the result. It is acceptable for the list to be 90% or 95% complete.
The problem is that this brute force approach takes too much time. As I mentioned before the algorithm for f(x) is quite complex and with a high number of checks it takes too much time.
What would be a better solution for this problem?
Another way to do this is in two parts: generate and sort all of the results, then merge them with the sorted list of values.
The first step is to compute all of the results and save them along with the x value that generated them. That is:
results = list of <x, result>
for i = 0 to NumberOfChecks-1
    //calculate the function result with the algorithm
    x=i*(xmax-xmin)/NumberOfChecks;
    FunctionResult=CalculateFunctionResultWithAlgorithm(x);
    results.Add(x, FunctionResult)
end for
Now, sort the results list by FunctionResult, and also sort the ListValues array.
You now have two sorted lists that you can move through linearly:
i = 0, j = 0;
while (i < results.length && j < ListValues.length)
{
    diff = ListValues[j] - results[i].result;
    if (Abs(diff) < Epsilon)
    {
        // mark this one with the x value
        // and move to the next result
        i = i + 1
    }
    else if (diff > 0)
    {
        // list value is much larger than result. Move to next result.
        i = i + 1
    }
    else
    {
        // list value is much smaller than result. Move to next list value.
        j = j + 1
    }
}
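Here is a runnable Python sketch of the whole approach, for reference (the function name and the (x, result) tuple layout are illustrative):

def match_results(results, list_values, epsilon):
    # results: list of (x, f(x)) pairs; list_values: the Fi values to match
    results = sorted(results, key=lambda r: r[1])
    values = sorted((v, j) for j, v in enumerate(list_values))
    matches = []                   # (original index of Fi, x) pairs
    i = j = 0
    while i < len(results) and j < len(values):
        x, fx = results[i]
        v, orig_j = values[j]
        diff = v - fx
        if abs(diff) < epsilon:
            matches.append((orig_j, x))
            i += 1                 # matched: move to the next result
        elif diff > 0:
            i += 1                 # list value larger: advance through results
        else:
            j += 1                 # list value smaller: advance through list values
    return matches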
Sort the list, producing an array SortedListValues that contains the sorted ListValues and an array SortedListValueIndices that contains the index in the original array of each entry in SortedListValues. You only actually need the second of these, and you can create both of them with a single sort by sorting an array of tuples of (value, index) using value as the sort key.
Iterate over your range in 0..NumberOfChecks-1 and compute the value of the function at each step, and then use a binary chop method to search for it in the sorted list.
Pseudo-code:
// sort as described above
SortedListValueIndices = sortIndices(ListValues);
for i=0 to NumberOfChecks-1 do
begin
  //calculate the function result with the algorithm
  x=i*(xmax-xmin)/NumberOfChecks;
  FunctionResult=CalculateFunctionResultWithAlgorithm(x);
  // do a binary chop to find the closest element in the list
  highIndex = NumberOfValuesInTheList-1;
  lowIndex = 0;
  while true do
  begin
    if Abs(FunctionResult-ListValues[SortedListValueIndices[lowIndex]])<Epsilon then
    begin
      // find all elements in the range that match, breaking out
      // of the loop as soon as one doesn't
      for j=lowIndex to NumberOfValuesInTheList-1 do
      begin
        if Abs(FunctionResult-ListValues[SortedListValueIndices[j]])>=Epsilon then
          break
        //mark that element SortedListValueIndices[j] of the list matches
        //and store the corresponding x value in the list
      end
      // break out of the binary chop loop
      break
    end
    // break out of the loop once the indices match
    if highIndex <= lowIndex then
      break
    // do the binary chop searching, adjusting the indices:
    middleIndex = (lowIndex + 1 + highIndex) / 2;
    if ListValues[SortedListValueIndices[middleIndex]] < FunctionResult then
      lowIndex = middleIndex;
    else
    begin
      highIndex = middleIndex;
      lowIndex = lowIndex + 1;
    end
  end
end
Possible complications:
The binary chop isn't taking the epsilon into account. Depending on your data this may or may not be an issue. If it is acceptable that the list is only 90 or 95% complete this might be OK. If not, then you'll need to widen the range to take it into account.
I've assumed you want to be able to match multiple x values for each FunctionResult. If that's not necessary you can simplify the code.
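As an illustration of the same binary-search idea (not the exact pseudo-code above), here is a compact Python sketch using the standard bisect module; this variant also widens the search window by epsilon, which addresses the first complication:

import bisect

def find_matches(list_values, xs_and_results, epsilon):
    # xs_and_results: iterable of (x, f(x)) pairs; returns (original index, x) pairs
    indexed = sorted((v, j) for j, v in enumerate(list_values))
    sorted_vals = [v for v, _ in indexed]
    matches = []
    for x, fx in xs_and_results:
        lo = bisect.bisect_left(sorted_vals, fx - epsilon)
        hi = bisect.bisect_right(sorted_vals, fx + epsilon)
        for k in range(lo, hi):
            matches.append((indexed[k][1], x))
    return matches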
Naturally this depends very much on the data, and especially on the numeric distribution of Fi. Another problem is that f(x) looks very jumpy, which rules out making assumptions based on nearby values.
But one could optimise the search.
Picture below.
Walking through f(x) at sufficient granularity, define a rough min (red line) and max (green line), using a suitable tolerance (the "air" or "gap" in between). The area between min and max is "AREA".
See where each Fi value hits AREA, and do a stacked marking ("MARKING") on the X-axis accordingly (it can be multiple segments of X).
Where lots of MARKINGs sit on top of each other (higher sum - the vertical black "sum" arrows), do dense hit tests, increasing the overall chance of getting as many hits as possible. Elsewhere do sparser tests.
Tighten this scheme (decrease the tolerance) as much as you dare.
EDIT: Fi is a bit confusing. Is it an ordered array or does it have random order (as I assumed)?
Jim Mischel's solution runs in O(i+j) instead of the O(i*j) of the solution you currently have. But there is a (very) minor bug in his code. The correct code would be:
diff = ListValues[j] - results[i]; //no abs() here
if (abs(diff) < Epsilon)           //add abs() here
{
    // mark this one with the x value
    // and move to the next result
    i = i + 1
}
The best methods will rely on the nature of your function f(x). The best solution is if you can construct the inverse of f(x) and use it directly.
1. As you said, f(x) is continuous: therefore you can start by evaluating a small number of widely spaced points, then find ranges that make sense, and refine your "assumption" for the x such that f(x)=Fi. It is not bullet-proof, but it is an option. E.g. Fi=5.7; f(1)=1.4, f(4)=4, f(16)=12.6, f(10)=10.1, f(7)=6.5, f(5)=5.1, f(6)=5.8, so you can take 5 < x < 7 (see the refinement sketch after this list).
2. Along the same lines as #1, and if f(x) is hard to calculate, you can use interpolation, and then evaluate f(x) only at the values that are probable.
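A tiny sketch of the "find a range, then refine" step in #1, assuming f is monotonically increasing over the bracketing range (as in the example values above); the function name and tolerance are illustrative:

def refine(f, Fi, lo, hi, tol=1e-6):
    # shrink the bracket [lo, hi] around the x where f(x) = Fi by bisection
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < Fi:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

With the example above, refine(f, 5.7, 5, 7) would home in on the x where f crosses 5.7.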
Not sure about title.
Here is what I need.
Let's for example have this set of elements: 20*A, 10*B, 5*C, 5*D, 2*E, 1*F.
I need to mix them so that no two identical elements are next to each other, and I can also, for example, say that I don't want B and C to be next to each other. Elements have to be evenly spread (if there are 2 Es, one should be near the beginning/in the first half and the second near the end/in the second half). The number of elements can of course change.
I haven't done anything like this yet. Is there some knowledge base for this kind of algorithm where I could find some hints and methods for solving this kind of problem, or do I have to work it all out myself?
I think the solution is pretty easy.
Start with an array x initialised to empty values such that there is one space for each item you need to place.
Then, for each (item, frequency) pair in descending order of frequency, assign item values to x in alternating slots starting from the first empty slot.
Here's how it works for your example:
20*A A_A_A_A_A_A_A_A_A_A_A_A_A_A_A_A_A_A_A_A
10*B ABABABABABABABABABABA_A_A_A_A_A_A_A_A_A
5*C ABABABABABABABABABABACACACACACA_A_A_A_A
2*E ABABABABABABABABABABACACACACACAEAEA_A_A
1*F ABABABABABABABABABABACACACACACAEAEAFA_A
At this point we fail, since x still has an empty slot. Note that we could have identified this right from the start since we need at least 19 slots between the As, but we only have 18 other items.
UPDATE
Leonidas has now explained that the items should be distributed "evenly" (that is, if we have k items of a particular kind, and n slots to fill, each "bucket" of n/k slots must contain one item of that kind).
We can adapt to this constraint by spreading out our allocations rather than simply going for alternating slots. In this case (and let's assume 2 Fs so we can solve this), we would have
20*A A_A_A_A_A_A_A_A_A_A_A_A_A_A_A_A_A_A_A_A
10*B ABA_ABA_ABA_ABA_ABA_ABA_ABA_ABA_ABA_ABA
5*C ABACABA_ABACABA_ABACABA_ABACABA_ABACABA
2*E ABACABAEABACABA_ABACABAEABACABA_ABACABA
2*F ABACABAEABACABAFABACABAEABACABAFABACABA
You can solve this problem recursively:
def generate(lastChar, remDict):
    # base case: nothing left to place -> one valid (empty) arrangement
    if all(v == 0 for v in remDict.values()):
        return ['']
    res = []
    for i in remDict:
        if i != lastChar and remDict[i] > 0:
            newRemDict = dict(remDict)   # copy, so sibling branches aren't affected
            newRemDict[i] -= 1
            subres = generate(i, newRemDict)
            res += [i + j for j in subres]
    return res
Note that I am leaving out corner conditions and many checks that need to be done. But only the core recursion is shown. You can also quit pursuing a branch if more than half+1 of the remaining letters are the same letter.
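For example, using '' as the "no previous character" marker, generate('', {'A': 2, 'B': 1}) returns ['ABA'].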
I ran into a similar problem, and after evaluating various metrics, I came up with the idea of grabbing the first item for which the proportion through the source array is less than the proportion through the result array. There is a case where all of these values may come out as 1, for instance when halfway through merging a group of even arrays - everything's exactly half done - so I grab something from the first array in that case.
This solution does use the source array order, which is something that I wanted. If the calling routine wants to merge arrays A, B, and C, where A has 3 elements but B and C have 2, we should get A,B,C,A,B,C,A, not A,C,B,A,C,B,A or other possibilities. I find that by choosing the first of my source arrays that's "overdue" (by having a proportion that's lower than our overall progress), I get a nice spacing with all arrays.
Source in Python:
@classmethod
def intersperse_arrays(cls, arrays: list):
    # general idea here is to produce a result with as even a balance as possible between all the arrays as we go down.
    # Make sure we don't have any component arrays of length 0 to worry about.
    arrays = [array for array in arrays if len(array) > 0]
    # Handle basic cases:
    if len(arrays) == 0:
        return []
    if len(arrays) == 1:
        return arrays[0]
    ret = []
    num_used = []
    total_count = 0
    for j in range(0, len(arrays)):
        num_used.append(0)
        total_count += len(arrays[j])
    while len(ret) < total_count:
        first_overdue_array = None
        first_remaining_array = None
        overall_prop = len(ret) / total_count
        for j in range(0, len(arrays)):
            # Continue if this array is already done.
            if len(arrays[j]) <= num_used[j]:
                continue
            current_prop = num_used[j] / len(arrays[j])
            if current_prop < overall_prop:
                first_overdue_array = j
                break
            elif first_remaining_array is None:
                first_remaining_array = j
        if first_overdue_array is not None:
            next_array = first_overdue_array
        else:
            # Think this only happens in an exact tie. (Halfway through all arrays, for example.)
            next_array = first_remaining_array
        if next_array is None:
            log.error('Internal error in intersperse_arrays')  # assumes a module-level logger `log`
            break  # Shouldn't happen - hasn't been seen.
        ret.append(arrays[next_array][num_used[next_array]])
        num_used[next_array] += 1
    return ret
When used on the example given, I got:
ABCADABAEABACABDAFABACABADABACDABAEABACABAD
(Seems reasonable.)
Given a long string L and a shorter string S (the constraint is that L.length must be >= S.length), I want to find the minimum Hamming distance between S and any substring of L with length equal to S.length. Let's call the function for this minHamming(). For example,
minHamming(ABCDEFGHIJ, CDEFGG) == 1.
minHamming(ABCDEFGHIJ, BCDGHI) == 3.
Doing this the obvious way (enumerating every substring of L) requires O(S.length * L.length) time. Is there any clever way to do this in sublinear time? I search the same L with several different S strings, so doing some complicated preprocessing to L once is acceptable.
Edit: The modified Boyer-Moore would be a good idea, except that my alphabet is only 4 letters (DNA).
Perhaps surprisingly, this exact problem can be solved in just O(|A| n log n) time using Fast Fourier Transforms (FFTs), where n is the length of the larger sequence L and |A| is the size of the alphabet.
Here is a freely available PDF of a paper by Donald Benson describing how it works:
Fourier methods for biosequence analysis (Donald Benson, Nucleic Acids Research 1990 vol. 18, pp. 3001-3006)
Summary: Convert each of your strings S and L into several indicator vectors (one per character, so 4 in the case of DNA), and then convolve corresponding vectors to determine match counts for each possible alignment. The trick is that convolution in the "time" domain, which ordinarily requires O(n^2) time, can be implemented using multiplication in the "frequency" domain, which requires just O(n) time, plus the time required to convert between domains and back again. Using the FFT, each conversion takes just O(n log n) time, so the overall time complexity is O(|A| n log n). For greatest speed, finite field FFTs are used, which require only integer arithmetic.
Note: For arbitrary S and L this algorithm is clearly a huge performance win over the straightforward O(mn) algorithm as |S| and |L| become large, but OTOH if S is typically shorter than log|L| (e.g. when querying a large DB with a small sequence), then obviously this approach provides no speedup.
UPDATE 21/7/2009: Updated to mention that the time complexity also depends linearly on the size of the alphabet, since a separate pair of indicator vectors must be used for each character in the alphabet.
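To make the indicator-vector idea concrete, here is a small NumPy sketch (an illustration only, not the paper's finite-field implementation; a plain floating-point FFT is used and the function name is mine):

import numpy as np

def min_hamming_fft(L, S):
    n, m = len(L), len(S)
    matches = np.zeros(n - m + 1)
    size = 1 << (n + m).bit_length()          # FFT length >= n + m - 1
    for ch in set(L) | set(S):
        # indicator vectors: 1 where this character occurs, 0 elsewhere
        lv = np.array([1.0 if c == ch else 0.0 for c in L])
        sv = np.array([1.0 if c == ch else 0.0 for c in S])
        # correlate L against S via FFT; index p + m - 1 of the convolution counts
        # matches of ch when S is aligned at position p of L
        corr = np.fft.irfft(np.fft.rfft(lv, size) * np.fft.rfft(sv[::-1], size), size)
        matches += np.round(corr[m - 1:n])
    return int(m - matches.max())

This reproduces the values in the question: min_hamming_fft("ABCDEFGHIJ", "CDEFGG") == 1 and min_hamming_fft("ABCDEFGHIJ", "BCDGHI") == 3.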
Modified Boyer-Moore
I've just dug up some old Python implementation of Boyer-Moore I had lying around and modified the matching loop (where the text is compared to the pattern). Instead of breaking out as soon as the first mismatch is found between the two strings, simply count up the number of mismatches, but remember the first mismatch:
current_dist = 0
while pattern_pos >= 0:
    if pattern[pattern_pos] != text[text_pos]:
        if first_mismatch == -1:
            first_mismatch = pattern_pos
            tp = text_pos
        current_dist += 1
        if current_dist == smallest_dist:
            break
    pattern_pos -= 1
    text_pos -= 1
smallest_dist = min(current_dist, smallest_dist)
# if the distance is 0, we've had a match and can quit
if current_dist == 0:
    return 0
else:  # shift
    pattern_pos = first_mismatch
    text_pos = tp
...
If the string did not match completely at this point, go back to the point of the first mismatch by restoring the values. This makes sure that the smallest distance is actually found.
The whole implementation is rather long (~150LOC), but I can post it on request. The core idea is outlined above, everything else is standard Boyer-Moore.
Preprocessing on the Text
Another way to speed things up is preprocessing the text to have an index on character positions. You only want to start comparing at positions where at least a single match between the two strings occurs, otherwise the Hamming distance is |S| trivially.
import sys
from collections import defaultdict
import bisect
def char_positions(t):
    pos = defaultdict(list)
    for idx, c in enumerate(t):
        pos[c].append(idx)
    return dict(pos)
This method simply creates a dictionary which maps each character in the text to the sorted list of its occurrences.
The comparison loop is more or less unchanged from the naive O(mn) approach, apart from the fact that we do not advance the position at which comparison starts by 1 each time, but based on the character positions:
def min_hamming(text, pattern):
    best = len(pattern)
    pos = char_positions(text)
    i = find_next_pos(pattern, pos, 0)
    while i < len(text) - len(pattern):
        dist = 0
        for c in range(len(pattern)):
            if text[i+c] != pattern[c]:
                dist += 1
                if dist == best:
                    break
        else:
            if dist == 0:
                return 0
            best = min(dist, best)
        i = find_next_pos(pattern, pos, i + 1)
    return best
The actual improvement is in find_next_pos:
def find_next_pos(pattern, pos, i):
    smallest = sys.maxint
    for idx, c in enumerate(pattern):
        if c in pos:
            x = bisect.bisect_left(pos[c], i + idx)
            if x < len(pos[c]):
                smallest = min(smallest, pos[c][x] - idx)
    return smallest
For each new position, we find the lowest index at which a character from S occurs in L. If there is no such index any more, the algorithm will terminate.
find_next_pos is certainly complex, and one could try to improve it by only using the first several characters of the pattern S, or use a set to make sure characters from the pattern are not checked twice.
Discussion
Which method is faster largely depends on your dataset. The more diverse your alphabet is, the larger will be the jumps. If you have a very long L, the second method with preprocessing might be faster. For very, very short strings (like in your question), the naive approach will certainly be the fastest.
DNA
If you have a very small alphabet, you could try to get the character positions for character bigrams (or larger) rather than unigrams.
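For instance, a bigram variant of char_positions might look like this (just a sketch; find_next_pos would then need to be adapted to look up two-character keys):

from collections import defaultdict

def bigram_positions(t):
    # map each two-character substring of the text to the sorted list of its start indices
    pos = defaultdict(list)
    for idx in range(len(t) - 1):
        pos[t[idx:idx + 2]].append(idx)
    return dict(pos)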
You're stuck as far as big-O is concerned. At a fundamental level, you're going to need to test whether every letter in the target matches each eligible letter in the substring.
Luckily, this is easily parallelized.
One optimization you can apply is to keep a running count of mismatches for the current position. If it's greater than the lowest Hamming distance so far, then obviously you can skip to the next possibility.
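A minimal sketch of that running-count scan (illustrative only; the parallelisation mentioned above is left out):

def min_hamming_naive(L, S):
    best = len(S)
    for start in range(len(L) - len(S) + 1):
        mismatches = 0
        for a, b in zip(L[start:start + len(S)], S):
            if a != b:
                mismatches += 1
                if mismatches >= best:
                    break              # can't beat the best so far; skip ahead
        else:
            best = mismatches
            if best == 0:
                return 0
    return best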