Boyer-Moore Galil Rule - algorithm

I was implementing the Boyer-Moore Algorithm for substring search in Python when I learned about the Galil Rule. I've looked around online for the Galil Rule but I haven't found anything more than a couple of sentences, and I cannot get access to the original paper. How can I implement this into my current algorithm?
i = 0
while i < (N - M + 1):
    skip = 0
    for j in reversed(range(0, M)):
        if pattern[j] != text[i + j]:
            skip = max(1, j - offsets[text[i + j]])
            break
    if skip == 0:
        return i
    i += skip
return -1
Notes:
offsets[c] = -1 if c is not in the pattern
offsets[c] = last index of c in the pattern
Example:
aaabcb
offsets[a] = 2
offsets[b] = 5
offsets[c] = 4
offsets[d] = -1
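For reference, one possible way to build such a table (using a defaultdict so that absent characters return -1 automatically; this snippet is for illustration, not part of the original question):

from collections import defaultdict

def build_offsets(pattern):
    # offsets[c] = last index of c in the pattern, -1 if c is absent
    offsets = defaultdict(lambda: -1)
    for j, c in enumerate(pattern):
        offsets[c] = j
    return offsets

offsets = build_offsets('aaabcb')
assert offsets['a'] == 2 and offsets['b'] == 5 and offsets['d'] == -1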
The few sentences I have found have said to keep track of when the first mismatch occurs in my inner loop (j, if the if-statement inside the inner loop is True) and the position in which I started the comparisons (i + j, in my case). I understand the intuition that I've already checked all the indices in between those, so I shouldn't have to do those comparisons again. I just don't understand how to connect the dots and arrive at an implementation.

The Galil rule is about exploiting periodicity in the pattern to reduce comparisons. Say you have a pattern abcabcab. It's periodic with smallest period abc. In general, a pattern P is periodic if there's a string U such that P is a prefix of UUUUU... (in the example above, abcabcab is clearly a prefix of UUU... with U = abc). We call the shortest such U the period of P. Let the length of that period be k (in the example above, k = 3 since U = abc).
First of all, keep in mind that the Galil rule applies only after you've found an occurrence of P in the text. Once you have, the Galil rule says that you can shift by k (the period of the pattern), and you only have to compare the last k characters of the shifted pattern to determine whether there is another match.
Here's an example:
P = ababa
T = bababababab
U = ab
k = 2
First occurrence: b[ababa]babab. Now you can shift by k = 2 and you only have to check the last two characters of the pattern:
T = bababa[ba]bab
P = aba[ba] // Only need to compare chars inside brackets for next match.
The rest of P must match since P is periodic and you shifted it by its period k from an existing match (this is crucial) so the repeating parts will nicely line up.
If you've found another match, just repeat. If you find a mismatch, however, you revert to the standard Boyer-Moore algorithm until you find another match. Remember, you can only use the Galil rule when you find a match and you shift by k (otherwise the pattern is not guaranteed to line up with the previous occurrence).
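To make this concrete, here is a hedged sketch of how the rule slots into a bad-character-rule loop like the one in the question (offsets as defined there; k is the period length, computed as described below). Since the Galil rule only pays off when searching for all occurrences, this version yields every match instead of returning the first:

def boyer_moore_galil(text, pattern, offsets, k):
    N, M = len(text), len(pattern)
    i = 0
    memory = 0  # number of leading pattern chars already known to match
    while i < N - M + 1:
        skip = 0
        for j in reversed(range(memory, M)):  # skip the remembered prefix
            if pattern[j] != text[i + j]:
                skip = max(1, j - offsets[text[i + j]])
                memory = 0  # a bad-character shift gives no such guarantee
                break
        if skip == 0:
            yield i         # full match at position i
            skip = k        # Galil shift: move by the period...
            memory = M - k  # ...and only the last k chars need checking
        i += skip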
Now, you might wonder how to determine k for a given pattern P. You'll need to calculate the suffixes array N first, where N[i] is the length of the longest common suffix of the prefix P[0..i] and P. (You can calculate the suffixes array by computing the prefix array Z of the reverse of P using the Z algorithm, as described here, for example.) Once you have the suffixes array, you can easily find k, since it is the smallest k > 0 such that N[m - k - 1] == m - k (where m = |P|).
For example:
P = ababa
m = 5
N = [1, 0, 3, 0, 5]
k = 2 because N[m - k - 1] == N[5 - 2 - 1] == N[2] == 3 == 5 - k
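A sketch of that computation (the Z function below is a standard implementation added here for illustration, not part of the answer above):

def z_array(s):
    # z[i] = length of the longest common prefix of s and s[i:]
    n = len(s)
    z = [0] * n
    z[0] = n
    l = r = 0
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z

def period(pattern):
    m = len(pattern)
    # suffixes array: N[i] = longest common suffix of pattern[0..i] and pattern
    N = z_array(pattern[::-1])[::-1]
    # smallest k > 0 with N[m - k - 1] == m - k; fall back to m if none exists
    for k in range(1, m):
        if N[m - k - 1] == m - k:
            return k
    return m

assert period('ababa') == 2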

The answer by Lajos Nagy explains the idea of the Galil rule perfectly; however, there is a more straightforward way to calculate k:
Just use the prefix function of the KMP algorithm.
prefix[i] is the length of the longest proper prefix of P[0..i] that is also a suffix of P[0..i].
Then k = m - prefix[m-1].
This article explains the details.
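A sketch of that shortcut (a standard prefix-function implementation, again added for illustration):

def prefix_function(p):
    # pi[i] = length of the longest proper prefix of p[0..i]
    # that is also a suffix of p[0..i]
    pi = [0] * len(p)
    for i in range(1, len(p)):
        j = pi[i - 1]
        while j > 0 and p[i] != p[j]:
            j = pi[j - 1]
        if p[i] == p[j]:
            j += 1
        pi[i] = j
    return pi

m = len('ababa')
k = m - prefix_function('ababa')[-1]  # 5 - 3 == 2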


Number of ways to form a string from a matrix of characters with the optimal approach in terms of time complexity?

(UPDATED)
We need to find the number of ways a given string can be formed from a matrix of characters.
We can start forming the word from any position (i, j) in the matrix, and from each cell (i, j) we can move to any unvisited cell in the 8 available directions, i.e.
(i + 1, j)
(i + 1, j + 1)
(i + 1, j - 1)
(i - 1, j)
(i - 1, j + 1)
(i - 1, j - 1)
(i, j + 1)
(i, j - 1)
Sample test cases:
(1) input:
N = 3 (length of string)
string = "fit"
matrix: fitptoke
orliguek
ifefunef
tforitis
output: 7
(2) input:
N = 5 (length of string)
string = "pifit"
matrix: qiq
tpf
pip
rpr
output: 5
Explanation:
num of ways to make 'fit' are as given below:
(0,0)(0,1)(0,2)
(2,1)(2,0)(3,0)
(2,3)(1,3)(0,4)
(3,1)(2,0)(3,0)
(2,3)(3,4)(3,5)
(2,7)(3,6)(3,5)
(2,3)(1,3)(0,2)
I approached the solution in a naive way: go to every possible position (i, j) in the matrix, start forming the string from that cell (i, j) by performing a DFS on the matrix, and add the number of ways to form the given string from that position (i, j) to a total_num_ways variable.
pseudocode:
W = 0
for i : 0 - n:
    for j : 0 - m:
        visited[n][m] = {false}
        W += DFS(i, j, 0, str, matrix, visited)
But it turns out that this solution is exponential in time complexity, as we visit every possible n * m position and then traverse every possible path of length k (the length of the string).
How can we improve the solution efficiency?
Suggestion - 1: Preprocessing the matrix and the input string
We only care about a cell of the matrix if its character appears anywhere in the input string. So we aren't concerned about a cell containing the letter 'z' if our input string is 'fit'.
Using that, here is a suggestion.
Take the input string and put its characters in a set S. This is an O(k) step, where k is the length of the string.
Next we iterate over the matrix (an O(m*n) step) and:
If the character in the cell does not appear in S, we continue to the next one;
If the character in the cell appears, we add the cell's position to a map from character to list of positions, called M.
Now, iterating over the input (not the matrix), for each position where the current char c appears, get the unvisited neighbouring positions of the current cell (the 8 directions listed in the question);
If any of these positions are present in M's list of cells for the next character, then:
Recursively go to the next character of the input string, until you have exhausted all the characters.
What is better in this solution? We get the next cell we need to explore in O(1) because it is already present in the map. As a result, the complexity is not exponential anymore, but is O(c), where c is the total number of occurrences of the input string in the matrix.
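A minimal sketch of that preprocessing step (the names preprocess and positions are illustrative, not from the suggestion):

from collections import defaultdict

def preprocess(matrix, word):
    # Map each character of the word to the list of cells containing it
    wanted = set(word)                # O(k)
    positions = defaultdict(list)     # the map called M above
    for i, row in enumerate(matrix):  # O(m*n)
        for j, ch in enumerate(row):
            if ch in wanted:
                positions[ch].append((i, j))
    return positions

M = preprocess(['fitptoke', 'orliguek', 'ifefunef', 'tforitis'], 'fit')
# M['f'] == [(0, 0), (2, 1), (2, 3), (2, 7), (3, 1)]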
Suggestion - 2: Dynamic Programming
DP helps in cases where there is optimal substructure and overlapping subproblems. So, in situations where the same substring is part of multiple solutions, using DP could help.
Ex: if we found 'fit' somewhere, then if there is an 'f' in an adjacent cell, it could reuse the count for the substring 'it' from the first 'fit' we found. This way we avoid recursing down the rest of the string whenever we encounter a suffix that was previously explored.
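A hedged sketch of this idea: memoize on (cell, position in the word). Note that this is only safe as written when the word's characters are all distinct (as in 'fit'), because then a path can never revisit a cell, so the 'unvisited' constraint is vacuous:

from functools import lru_cache

def count_ways(matrix, word):
    rows, cols = len(matrix), len(matrix[0])

    @lru_cache(maxsize=None)
    def ways(i, j, p):
        # number of ways to match word[p:] starting at cell (i, j)
        if matrix[i][j] != word[p]:
            return 0
        if p == len(word) - 1:
            return 1
        total = 0
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                if (di or dj) and 0 <= i + di < rows and 0 <= j + dj < cols:
                    total += ways(i + di, j + dj, p + 1)
        return total

    return sum(ways(i, j, 0) for i in range(rows) for j in range(cols))

print(count_ways(['fitptoke', 'orliguek', 'ifefunef', 'tforitis'], 'fit'))  # 7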
# Checking if the given (x,y) coordinates are within the boundaries
# of the matrix
def in_bounds(x, y, rows, cols):
    return x >= 0 and x < rows and y >= 0 and y < cols

# Finding all possible moves from the current (x,y) position
def possible_moves(position, path_set, rows, cols):
    moves = []
    move_range = [-1, 0, 1]
    for i in range(len(move_range)):
        for j in range(len(move_range)):
            x = position[0] + move_range[i]
            y = position[1] + move_range[j]
            if in_bounds(x, y, rows, cols):
                if x in path_set:
                    if y in path_set[x]:
                        continue
                moves.append((x, y))
    return moves

# Determine which of the possible moves lead to the next letter
# of the goal string
def check_moves(goal_letter, candidates, search_space):
    moves = []
    for x, y in candidates:
        if search_space[x][y] == goal_letter:
            moves.append((x, y))
    return moves

# Recursively expanding the paths of each starting coordinate
def search(goal, path, search_space, path_set, rows, cols):
    # Base case: no letters left to match, so the path is complete
    if goal == '':
        return [path]
    x = path[-1][0]
    y = path[-1][1]
    if x in path_set:
        path_set[x].add(y)
    else:
        path_set.update([(x, set([y]))])
    results = []
    moves = possible_moves(path[-1], path_set, rows, cols)
    moves = check_moves(goal[0], moves, search_space)
    for move in moves:
        result = search(goal[1:], path + [move], search_space, path_set, rows, cols)
        if result is not None:
            results += result
    return results

# Finding the coordinates in the matrix where the first letter of the goal
# string appears, which is where all potential paths will begin.
def find_paths(goal, search_space):
    results = []
    rows, cols = len(search_space), len(search_space[0])
    # Finding starting coordinates for candidate paths
    for i in range(len(search_space)):
        for j in range(len(search_space[i])):
            if search_space[i][j] == goal[0]:
                # Expanding path from root letter
                results += search(goal[1:], [(i, j)], search_space, dict(), rows, cols)
    return results

goal = "fit"
matrix = [
    'fitptoke',
    'orliguek',
    'ifefunef',
    'tforitis'
]

paths = find_paths(goal, matrix)
for path in paths:
    print(path)
print('# of paths:', len(paths))
Instead of expanding paths from every coordinate of the matrix, the matrix is first scanned for all the (i, j) coordinates containing the same letter as the first letter of the goal string. This scan takes O(n*m) time for an n-by-m matrix.
Then, for each (i, j) coordinate that contained the first letter of the goal string, the paths are expanded from there by searching for the second letter of the goal string, keeping only the moves that match it. This is repeated for each letter of the goal string, recursively finding all valid paths from the starting coordinates.

Homework: Implementing Karp-Rabin; For the hash values modulo q, explain why it is a bad idea to use q as a power of 2?

I have a two-fold homework problem: implement Karp-Rabin and run it on a test file, and, for the second part:
For the hash values modulo q, explain why it is a bad idea to use q as a power of 2. Can you construct a terrible example, e.g. for q = 64 and n = 15?
This is my implementation of the algorithm:
def karp_rabin(text, pattern):
    # setup
    alphabet = 'ACGT'
    d = len(alphabet)
    n = len(pattern)
    d_n = d**n
    q = 2**32 - 1
    m = {char: i for i, char in enumerate(alphabet)}
    positions = []

    def kr_hash(s):
        return sum(d**(n - i - 1) * m[s[i]] for i in range(n))

    def update_hash():
        return d * text_hash + m[text[i + n - 1]] - d_n * m[text[i - 1]]

    pattern_hash = kr_hash(pattern)
    for i in range(0, len(text) - n + 1):
        text_hash = update_hash() if i else kr_hash(text[i:n])
        if pattern_hash % q == text_hash % q and pattern == text[i:i + n]:
            positions.append(i)

    return ' '.join(map(str, positions))
...The second part of the question is referring to this part of the code/algo:

pattern_hash = kr_hash(pattern)
for i in range(0, len(text) - n + 1):
    text_hash = update_hash() if i else kr_hash(text[i:n])
    # the modulo q used to check if the hashes are congruent
    if pattern_hash % q == text_hash % q and pattern == text[i:i + n]:
        positions.append(i)
I don't understand why it would be a bad idea to use a power of 2 for q. I've tried running the algorithm on the test file provided (the genome of E. coli) and there's no discernible difference.
I tried looking at the formula for how the hash is derived (I'm not good at math), trying to find common factors that would be really bad for powers of two, but found nothing. I feel like if q is a power of 2 it should cause a lot of hash collisions, so you'd need to compare strings much more often, but I didn't find anything along those lines either.
I'd really appreciate help on this since I'm stumped. If someone wants to point out what I can do better in the first part (code efficiency, readability, correctness, etc.), I'd also be thrilled to hear your input.
There is a problem if q divides some power of d, because then only a few characters contribute to the hash. For example, in your code d = 4; if you take q = 64, only the last three characters determine the hash (since d**3 = 64).
I don't really see a problem if q is a power of 2 but gcd(d, q) = 1.
Your implementation looks a bit strange because instead of
if pattern_hash % q == text_hash % q and pattern == text[i:i+n]:
you could also use
if pattern_hash == text_hash and pattern == text[i:i+n]:
which would be better because you get fewer collisions.
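To see the "q divides a power of d" problem concretely, here is a small demonstration (my own, with d = 4 as in the question's code and q = 64): any two strings agreeing in their last three characters collide, because every weight d**j with j >= 3 vanishes modulo 64.

d, q = 4, 64
m = {c: i for i, c in enumerate('ACGT')}

def h(s):
    n = len(s)
    return sum(d**(n - i - 1) * m[s[i]] for i in range(n)) % q

print(h('ACGTACGT'), h('TTTTTCGT'))  # both 27: the first five chars are ignored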
The Thue–Morse sequence has, among its properties, that its polynomial hash quickly becomes zero when the hash modulus is a power of 2, for any polynomial base d. So if you try to search for a short Thue–Morse sequence in a longer one, you will get a great many hash collisions.
For example, your code, slightly adapted:
def karp_rabin(text, pattern):
    # setup
    alphabet = '01'
    d = 15
    n = len(pattern)
    d_n = d**n
    q = 32
    m = {char: i for i, char in enumerate(alphabet)}
    positions = []

    def kr_hash(s):
        return sum(d**(n - i - 1) * m[s[i]] for i in range(n))

    def update_hash():
        return d * text_hash + m[text[i + n - 1]] - d_n * m[text[i - 1]]

    pattern_hash = kr_hash(pattern)
    for i in range(0, len(text) - n + 1):
        text_hash = update_hash() if i else kr_hash(text[i:n])
        if pattern_hash % q == text_hash % q:  # and pattern == text[i:i+n]:
            positions.append(i)

    return ' '.join(map(str, positions))

print(karp_rabin('0110100110010110100101100110100110010110011010010110100110010110',
                 '0110100110010110'))
outputs a lot of positions, although only three of them are proper matches.
Note that I have dropped the and pattern == text[i:i+n] check. Obviously, if you restore it, the result will be correct, but it is also obvious that the algorithm will do much more work checking this additional condition than it would for other q. In fact, with this many collisions the whole idea of the algorithm stops working: you might almost as well write a simple algorithm that checks every position for a match.
Also note that your implementation is quite strange. The whole idea of polynomial hashing is to apply the modulo operation every time you compute the hash. Otherwise your pattern_hash and text_hash are very big numbers. In other languages this might mean arithmetic overflow, but in Python it invokes big-integer arithmetic, which is slow and again defeats the point of the algorithm.
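For comparison, a sketch of the usual formulation where the modulo is applied at every step, so the hash values always stay below q (the specific prime q = 2**61 - 1 is this sketch's choice, not part of the assignment):

def karp_rabin_mod(text, pattern, d=4, q=2**61 - 1):
    m = {c: i for i, c in enumerate('ACGT')}
    n = len(pattern)
    msb = pow(d, n - 1, q)  # weight of the character leaving the window
    p_hash = t_hash = 0
    for i in range(n):
        p_hash = (d * p_hash + m[pattern[i]]) % q
        t_hash = (d * t_hash + m[text[i]]) % q
    positions = []
    for i in range(len(text) - n + 1):
        if p_hash == t_hash and pattern == text[i:i + n]:
            positions.append(i)
        if i + n < len(text):
            # slide the window: drop text[i], add text[i + n]
            t_hash = (d * (t_hash - msb * m[text[i]]) + m[text[i + n]]) % q
    return positions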

How to use KMP failure function to determine minimum length repeated substring?

I want to solve the UVA 10298 "Power Strings" problem using the KMP algorithm. This blog shows a technique for using the failure function to calculate the minimum-length repeated substring. The technique is as follows:
Compute prefix-suffix table pi[ ] for the given string.
Let len be the string length, and last_in_pi be the value stored at the last index of pi table.
Check whether len % (len - last_in_pi) == 0 is true or not. If it is true then the length of the minimum length repeated substring is (len - last_in_pi), otherwise it is the length of the given string.
I understand what the failure function is and how it is used to find a pattern in a text, but I am struggling to understand the proof of correctness of this technique.
Remember that Pi[i] is defined as the (length of the) longest prefix of your_string that is a proper suffix (so not the whole string) of the substring your_string[0 ... i].
There is an example on the blog post you linked to:
0 1 2 3 4 5
S : a b a b a b
Pi: 0 0 1 2 3 4
Where we have, for example:
a b a      (Pi[4] = 3: 'aba' is both a prefix and a proper suffix of 'ababa')
a b a b    (Pi[5] = 4: 'abab' is both a prefix and a proper suffix of 'ababab')
Etc. I hope this makes it clear what Pi (the prefix function / table) does.
Now, the blog says:
"The last value of the prefix table is 4. If the string is a repeated string, its minimal period length would be 2 (6 (string length) - 4)."
So you have to check whether len % (len - last_in_pi) == 0. If yes, then len - last_in_pi is the length of the shortest repeated string (the period string).
This works because if you rotate a string by len(period) positions either way, it matches itself; len - last_in_pi tells you how much you'd need to rotate.
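Putting the technique into code (a standard prefix-function implementation; the final check mirrors the three steps from the question):

def prefix_table(s):
    pi = [0] * len(s)
    for i in range(1, len(s)):
        j = pi[i - 1]
        while j > 0 and s[i] != s[j]:
            j = pi[j - 1]
        if s[i] == s[j]:
            j += 1
        pi[i] = j
    return pi

def min_repeated_length(s):
    n = len(s)
    last_in_pi = prefix_table(s)[-1]
    p = n - last_in_pi
    return p if n % p == 0 else n

assert min_repeated_length('ababab') == 2  # period 'ab'
assert min_repeated_length('abcab') == 5   # not periodic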
Problem
S (of length Ls) is the given string. M (of length Lm) is the longest proper suffix of S that is also a prefix of S. We have to prove that Ls - Lm is the length of the shortest period of S.
Proof by Contradiction
Suppose there were a period Y whose length Ly < Ls - Lm (i.e., shorter than the one the above technique gives).
An important property: since Y is a period of S and M is a prefix of S, M consists of copies of Y plus a leftover piece. We can write M = n*Y + Z, where n >= 0, Z is the leftover part, and Lz < Ly. Z is a prefix of Y, since Y repeats itself; write Y = Z + W.
Consider M as a suffix of S. Extend it to the left by the preceding Ly characters of S. This does not run past the start of the string, because Ly < Ls - Lm. The new suffix is (n + 1)*Y + Z.
Consider M as a prefix of S. Now append the next Ly characters of S to it. The new prefix is
M + (next Ly characters of S)
-> n*Y + Z + (Ly characters)
-> n*Y + Z + (Ly - Lz characters) + (Lz characters)
-> n*Y + (Z + W) + Z
-> (n + 1)*Y + Z
(The Ly - Lz characters must be W, because Z and they together form Y; the last Lz characters are the first Lz characters of Y, which is nothing but Z.)
Now we have a proper suffix of S that is also a prefix of S and is longer than M. But we started by saying M is the longest proper suffix that is also a prefix. So we have a contradiction, implying such a Y cannot exist.
Assume you have a string s of size n: s = x1 x2 x3 ... x[n-2] x[n-1] x[n].
Assume s has a maximal common prefix/suffix of length len.
Then its period is p = n - len, iff n % p == 0.
Induction:
Denote prefix = s[1...len] and postfix = s[p+1...n].
Then prefix[1...p] == postfix[1...p] == s[p+1...2p].
Since s[p+1...2p] == prefix[p+1...2p], we get postfix[1...p] == postfix[p+1...2p].
Recursively, postfix[p+1...2p] == s[2p+1...3p] == prefix[2p+1...3p], and so on.

Optimized way of finding similar strings

Suppose I have a large list of words (about 4-5 thousand, and growing). Someone searches for a string, but it is not found in the word list. What would be the best and most optimized way to find words similar to the input string? The first thing that came to my mind was calculating the Levenshtein distance between each entry of the word list and the input string, but is that really the optimized way to do it?
(Note that this is not a language-specific question.)
EDIT: new solution
Yes, calculating Levenshtein distances between your input and the word list is a reasonable approach, but it takes a lot of time. BK-trees improve on this, but they become slow quickly as the allowed Levenshtein distance grows. It seems we can speed up the Levenshtein distance calculations using a trie, as described in this excellent blog post:
Fast and Easy Levenshtein distance using a Trie
It relies on the fact that the dynamic-programming table for Levenshtein distance has common rows across different invocations, e.g. levenshtein(kate, cat) and levenshtein(kate, cats).
Running the Python program given on that page with the TWL06 dictionary gives:
> python dict_lev.py HACKING 1
Read 178691 words into 395185 nodes
('BACKING', 1)
('HACKING', 0)
('HACKLING', 1)
('HANKING', 1)
('HARKING', 1)
('HAWKING', 1)
('HOCKING', 1)
('JACKING', 1)
('LACKING', 1)
('PACKING', 1)
('SACKING', 1)
('SHACKING', 1)
('RACKING', 1)
('TACKING', 1)
('THACKING', 1)
('WHACKING', 1)
('YACKING', 1)
Search took 0.0189998 s
That's really fast, and it would be even faster in other languages. Most of the time is spent building the trie, which is irrelevant since it only needs to be done once and kept in memory.
The only minor downside is that tries take up a lot of memory (which can be reduced with a DAWG, at the cost of some speed).
Another approach: Peter Norvig has a great article (with complete source code) on spelling correction:
http://norvig.com/spell-correct.html
The idea is to generate the possible edits of the word and then choose the most likely spelling correction.
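The heart of that article is a function that generates every string within one edit of a word (deletion, transposition, replacement, insertion); candidates are then ranked by word frequencies from a corpus. A condensed sketch of the edit generation (the ranking step is omitted here):

def edits1(word, alphabet='abcdefghijklmnopqrstuvwxyz'):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)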
I think something better than this exists, but BK-trees are at least a good optimization over brute force.
They use the property that Levenshtein distance is a metric; hence, if the Levenshtein distance between your query and an arbitrary string s from the dict is d, then all your results must be at a distance between (d - n) and (d + n) from s. Here n is the maximum Levenshtein distance from the query you want to allow.
It's explained in detail here: http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees
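A minimal BK-tree sketch (my own condensation of the idea above; dist must be a metric, such as the Levenshtein implementation further below, and the pruning in query is exactly the (d - n) .. (d + n) window just described):

class BKTree:
    def __init__(self, dist, words):
        self.dist = dist
        it = iter(words)
        self.root = (next(it), {})  # node = (word, children keyed by distance)
        for w in it:
            self.add(w)

    def add(self, word):
        node = self.root
        while True:
            d = self.dist(word, node[0])
            if d == 0:
                return  # word already present
            if d in node[1]:
                node = node[1][d]
            else:
                node[1][d] = (word, {})
                return

    def query(self, word, n):
        # all stored words within distance n of word
        results, stack = [], [self.root]
        while stack:
            w, children = stack.pop()
            d = self.dist(word, w)
            if d <= n:
                results.append((w, d))
            # metric property: only subtrees whose edge k satisfies
            # d - n <= k <= d + n can contain a result
            stack.extend(children[k] for k in children if d - n <= k <= d + n)
        return results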
If you are interested in the code itself, I implemented an algorithm for finding the optimal alignment between two strings. It basically shows how to transform one string into the other with k operations (where k is the Levenshtein/edit distance of the strings). It could be simplified a bit for your needs (as you only need the distance itself). By the way, it works in O(mn), where m and n are the lengths of the strings. My implementation is based on: this and this.
# optimization: using ints instead of strings:
# 1  ~ "left", "insertion"
# 7  ~ "up", "deletion"
# 17 ~ "up-left", "match/mismatch"

def GetLevenshteinDistanceWithBacktracking(sequence1, sequence2):
    distances = [[0 for y in range(len(sequence2) + 1)] for x in range(len(sequence1) + 1)]
    backtracking = [[1 for y in range(len(sequence2) + 1)] for x in range(len(sequence1) + 1)]
    for i in range(1, len(sequence1) + 1):
        distances[i][0] = i
    for i in range(1, len(sequence2) + 1):
        distances[0][i] = i
    for j in range(1, len(sequence2) + 1):
        for i in range(1, len(sequence1) + 1):
            if sequence1[i - 1] == sequence2[j - 1]:
                distances[i][j] = distances[i - 1][j - 1]
                backtracking[i][j] = 17
            else:
                deletion = distances[i - 1][j] + 1
                substitution = distances[i - 1][j - 1] + 1
                insertion = distances[i][j - 1] + 1
                distances[i][j] = min(deletion, substitution, insertion)
                if distances[i][j] == deletion:
                    backtracking[i][j] = 7
                elif distances[i][j] == insertion:
                    backtracking[i][j] = 1
                else:
                    backtracking[i][j] = 17
    return (distances[len(sequence1)][len(sequence2)], backtracking)

def Alignment(sequence1, sequence2):
    cost, backtracking = GetLevenshteinDistanceWithBacktracking(sequence1, sequence2)
    alignment1 = alignment2 = ""
    i = len(sequence1)
    j = len(sequence2)
    # from the backtracking matrix we recover an optimal alignment
    while i > 0 or j > 0:
        if i > 0 and j > 0 and backtracking[i][j] == 17:
            alignment1 = sequence1[i - 1] + alignment1
            alignment2 = sequence2[j - 1] + alignment2
            i -= 1
            j -= 1
        elif i > 0 and backtracking[i][j] == 7:
            alignment1 = sequence1[i - 1] + alignment1
            alignment2 = "-" + alignment2
            i -= 1
        elif j > 0 and backtracking[i][j] == 1:
            alignment2 = sequence2[j - 1] + alignment2
            alignment1 = "-" + alignment1
            j -= 1
        elif i > 0:
            alignment1 = sequence1[i - 1] + alignment1
            alignment2 = "-" + alignment2
            i -= 1
    return (cost, (alignment1, alignment2))
It depends on the broader context and on how accurate you want to be. But what I would (probably) start with:
Only consider the subset of words that start with the same character as the query word. That alone would decrease the amount of work by a factor of roughly 20 for a single query.
I would categorize words by their lengths and allow a different maximal distance for each category. With 4 categories, e.g.:
0 -- if the length is between 0 and 2; 1 -- if between 3 and 5; 2 -- if between 6 and 8; 3 -- if 9+. Then, based on the query length, you only check the words from the matching category (a sketch of this bucketing follows below). Moreover, it should not be hard to make the algorithm stop once the maximal distance has been exceeded.
If needed, I would start thinking about some machine-learning approach.
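A sketch of the bucketing from the second suggestion (one possible reading; the thresholds are the ones given above):

from collections import defaultdict

def max_dist(length):
    # allowed edit distance per length category
    return 0 if length <= 2 else 1 if length <= 5 else 2 if length <= 8 else 3

def build_buckets(words):
    buckets = defaultdict(list)
    for w in words:
        buckets[len(w)].append(w)
    return buckets

def candidates(query, buckets):
    n = max_dist(len(query))
    # only words whose length differs from the query's by at most n
    # can be within edit distance n
    for length in range(len(query) - n, len(query) + n + 1):
        yield from buckets.get(length, [])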

Find the minimum number of operations required to compute a number using a specified range of numbers

Let me start with an example -
I have a range of numbers from 1 to 9. And let's say the target number that I want is 29.
In this case the minimum number of operations required would be 2: (9*3)+2 = 29. Similarly, for 18 the minimum number of operations is 1 (9*2 = 18).
I can use any of the 4 arithmetic operators - +, -, / and *.
How can I programmatically find out the minimum number of operations required?
Thanks in advance for any help provided.
clarification: integers only, no decimals allowed mid-calculation, i.e. the following is not valid (from comments below): ((9/2) + 1) * 4 == 22
I must admit I didn't think about this thoroughly; for my purpose it doesn't matter if decimal numbers appear mid-calculation, so ((9/2) + 1) * 4 == 22 is valid. Sorry for the confusion.
For the special case where the set Y = [1..9] and n > 0:
n <= 9 : 0 operations
n <= 18 : 1 operation (+)
otherwise : divide out any divisors of n found in Y. If this is not enough, recurse on the remainder for all offsets -9 .. +9. Offset 0 can be skipped, as it has already been tried.
Notice how division is not needed in this case. For other Y this does not hold.
This algorithm is exponential in log(n). The exact analysis is a job for somebody with more knowledge of algebra than me.
For more speed, add pruning to eliminate some of the search for larger numbers.
Sample code:

def findop(n, maxlen=9999):
    # Return a short postfix list of numbers and operations

    # Simple solution for small numbers
    if n <= 9: return [n]
    if n <= 18: return [9, n - 9, '+']

    # Find direct multiply
    x = divlist(n)
    if len(x) > 1:
        mults = len(x) - 1
        x[-1:] = findop(x[-1], maxlen - 2 * mults)
        x.extend(['*'] * mults)
        return x

    shortest = 0
    for o in list(range(1, 10)) + list(range(-1, -10, -1)):
        x = divlist(n - o)
        if len(x) == 1: continue
        mults = len(x) - 1
        # We spend len(divlist) + mults + 2 fields for the offset.
        # The last number is expanded by the recursion, so it doesn't count.
        recursion_maxlen = maxlen - len(x) - mults - 2 + 1
        if recursion_maxlen < 1: continue
        x[-1:] = findop(x[-1], recursion_maxlen)
        x.extend(['*'] * mults)
        if o > 0:
            x.extend([o, '+'])
        else:
            x.extend([-o, '-'])
        if shortest == 0 or len(x) < shortest:
            shortest = len(x)
            maxlen = shortest - 1
            solution = x[:]

    if shortest == 0:
        # Fake solution, it will be discarded
        return '#' * (maxlen + 1)
    return solution

def divlist(n):
    l = []
    for d in range(9, 1, -1):
        while n % d == 0:
            l.append(d)
            n = n // d
    if n > 1: l.append(n)
    return l
The basic idea is to test all possibilities with k operations, for k starting from 0. Imagine you build a tree of height k that branches for every possible new operation and operand (4*9 branches per level). You traverse and evaluate the leaves of the tree for each k before moving on to the next k. An untested sketch of this in Python:

import itertools

OPS = [lambda a, b: a + b, lambda a, b: a - b,
       lambda a, b: a * b, lambda a, b: a / b]

def compute(n, j, k, target):
    # True if target is reachable from n using exactly k - j more operations
    if j == k:
        return n == target
    return any(compute(op(n, i), j + 1, k, target)
               for op in OPS for i in range(1, 10))

def min_ops(target):
    # iterative deepening: try k = 0, 1, 2, ... until a solution appears
    for k in itertools.count(0):
        if any(compute(n, 0, k, target) for n in range(1, 10)):
            return k
It doesn't take arithmetic operator precedence or parentheses into account; that would require some rework.
Really cool question :)
Notice that you can start from the end! From your example (9*3)+2 = 29 is equivalent to saying (29-2)/3=9. That way we can avoid the double loop in cyborg's answer. This suggests the following algorithm for set Y and result r:
nextleaves = {r}
nops = 0
while true:
    nops = nops + 1
    leaves = nextleaves
    nextleaves = {}
    for leaf in leaves:
        for y in Y:
            if (leaf+y) or (leaf-y) or (leaf*y) or (leaf/y) is in Y:
                return nops
            else:
                add (leaf+y) and (leaf-y) and (leaf*y) and (leaf/y) to nextleaves
This is the basic idea; performance can certainly be improved, for instance by avoiding "backtracks" such as r+a-a or r*a*b/a (the sketch below does this with a seen set).
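A runnable sketch of this backward search (my elaboration of the pseudo-code above; Fractions allow division mid-calculation, per the question's clarification):

from fractions import Fraction

def min_ops_backward(target, digits=range(1, 10)):
    Y = set(digits)
    start = Fraction(target)
    if start in Y:  # Fraction(9) == 9, so set membership works
        return 0
    frontier, seen, nops = {start}, {start}, 0
    while frontier:
        nops += 1
        nxt = set()
        for leaf in frontier:
            for y in Y:
                for cand in (leaf + y, leaf - y, leaf * y, leaf / y):
                    if cand in Y:
                        return nops
                    if cand not in seen:  # avoid backtracks like r+a-a
                        seen.add(cand)
                        nxt.add(cand)
        frontier = nxt

print(min_ops_backward(29))  # 2, e.g. (29 - 2) / 3 = 9
print(min_ops_backward(18))  # 1, since 18 / 9 = 2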
I guess my idea is similar to Peer Sommerlund's:
For big numbers, you advance fast by multiplying by big digits.
Is the target, e.g. 29, prime? If not, divide it by its largest divisor in 2 to 9. Otherwise, subtract a number to reach a divisible one. 27 works, since it is divisible by 9, so
(29-2)/9 = 3 =>
3*9+2 = 29
So, maybe (I didn't think this through to the end): search for the next number below the target that is divisible by 9. If you don't reach a single digit, repeat.
The formula is the steps reversed.
(I'll try it for some numbers. :) )
I tried it with 2551, which gives
echo $((((3*9+4)*9+4)*9+4))
but I didn't test every intermediate result for primality.
But
echo $((8*8*8*5-9))
is 2 operations less. Maybe I can investigate this later.
