My problem is to match text with math expressions inside.
I saw this topic - What is the best way to index documents which contain mathematical expression in elastic search?
But in my context I can't just create an additional field with the math expression. I have to match text such as math tasks for students, and these tasks can contain multiple math expressions.
I think MathML is the most suitable format here, because I can split the MathML tags into words and match them like ordinary words.
I want to get the results that most closely match the math expressions. What is the proper way to achieve this kind of matching?
Examples:
Solve the equation (2x + 7)^2 = (2x - 1)^2.
Find all values of the parameter a, for each of which the equation |x - a^2 + a + 2| + |x - a^2 + 3a - 1| = 2a - 3 has roots, but none of them belongs to the interval (4; 19).
P.S. graphical representation of equation:
Expected Behaviour of the algorithm
I have two strings a and b, with a being the shorter one. I would like to find the substring of b that has the highest similarity to a. The substring has to have length len(a), or it has to be placed at the end of b.
e.g. for the following two strings:
a = "aa"
b = "bbaba"
the possible substrings of b would be
"bb"
"ba"
"ab"
"ba"
"a"
""
The edit distance is defined as the number of insertions and deletions. Substitutions are not allowed (an insertion plus a deletion has to be used instead). The similarity between the two strings is calculated according to the following equation: norm = 1 - distance / (len(a) + len(substring)).
So the substrings above would provide the following results:
"bb" -> 2 DEL + 2 INS -> 1 - 4 / 4 = 0
"ba" -> 1 DEL + 1 INS -> 1 - 2 / 4 = 0.5
"ab" -> 1 DEL + 1 INS -> 1 - 2 / 4 = 0.5
"ba" -> 1 DEL + 1 INS -> 1 - 2 / 4 = 0.5
"a" -> 1 INS -> 1 - 1 / 3 = 0.66
"" -> 2 INS -> 1 - 2 / 2 = 0
So the algorithm should return 0.66.
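For reference, this definition can be implemented directly as a (deliberately slow) brute force in Python; the names partial_similarity and lcs_length are made up for this sketch. With only insertions and deletions, the edit distance equals len(a) + len(sub) - 2 * LCS(a, sub), so norm simplifies to 2 * LCS / (len(a) + len(sub)).

def lcs_length(a, b):
    # classic O(len(a) * len(b)) longest-common-subsequence DP
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, other in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch == other else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def partial_similarity(a, b):
    # brute force over all windows of len(a) plus the shorter suffixes of b
    if len(a) > len(b):
        a, b = b, a
    candidates = [b[i:i + len(a)] for i in range(len(b) - len(a) + 1)]
    candidates += [b[i:] for i in range(len(b) - len(a) + 1, len(b) + 1)]
    best = 0.0
    for sub in candidates:
        denom = len(a) + len(sub)
        if denom:
            # indel distance = denom - 2 * LCS, hence norm = 2 * LCS / denom
            best = max(best, 2 * lcs_length(a, sub) / denom)
    return best

print(partial_similarity("aa", "bbaba"))  # 0.666...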
Different implementations
A similar ratio is implemented by the Python library FuzzyWuzzy in the form of fuzz.partial_ratio. It calculates the ratio in two steps:
searches for matching subsequences in the longer sequence using difflib.SequenceMatcher.get_matching_blocks
calculates the ratio for substrings of len(shorter_string) starting at the matching subsequences and returns the maximum ratio
This is really slow, so FuzzyWuzzy uses python-Levenshtein for the similarity calculation when it is available. python-Levenshtein performs the same calculation based on the Levenshtein distance, which is faster. However, in edge cases the matching_blocks it computes for the ratio calculation are completely wrong (see issue 16), which makes it unsuitable as a replacement when correctness matters.
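The two steps above can be illustrated with plain difflib (this is only a rough sketch of the idea, not FuzzyWuzzy's actual code, and it returns a 0-1 float instead of a percentage):

from difflib import SequenceMatcher

def partial_ratio_sketch(shorter, longer):
    if len(shorter) > len(longer):
        shorter, longer = longer, shorter
    blocks = SequenceMatcher(None, shorter, longer).get_matching_blocks()
    best = 0.0
    for block in blocks:
        # slide a window of len(shorter) so that the matching block lines up
        start = max(block.b - block.a, 0)
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return best

print(partial_ratio_sketch("aa", "bbaba"))
# 0.5: only windows like "ab" and "ba" are tried, never the end-substring "a"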
Current implementation
I currently use a C++ port of difflib in combination with a fast bit-parallel implementation of the Levenshtein distance with the weights insertion=1, deletion=1 and substitution=2. The current implementation can be found here:
extracting matching_blocks: matching_blocks
calculating weighted Levenshtein: weighted Levenshtein
combining them to calculate the end ratio: partial_ratio
Question
Is there a faster algorithm to calculate this kind of similarity? Requirements are:
only uses insertions/deletions (or gives substitutions the weight 2, which has a similar effect)
allows a gap at the beginning of the longer string
allows a gap at the end of the longer string, as long as the remaining substring does not become shorter than the length of the shorter string
optimally it enforces that the substring has a similar length (when it is not at the end), so it matches the behaviour of FuzzyWuzzy, but it would be fine if it allowed longer subsequences to be matched as well: e.g. for aaba:aaa this would mean that it is allowed to use aaba as the optimal subsequence instead of aab.
I came across the following practice problem.
You are given an expression of numbers separated by + and - operators. You are free to put parentheses anywhere in the expression, and as many as you want, as long as the result is still a valid expression. The question is: how many different values can you make? E.g., for 1 - 2 + 3 - 4 - 5 you can get six unique values, as below:
1 - 2 + 3 - 4 - 5 = -7
1 - (2 + 3) - 4 - 5 = -13
1 - (2 + 3 - 4) - 5 = -5
1 - (2 + 3 - 4 - 5) = 5
1 - 2 + 3 - (4 - 5) = 3
1 - (2 + 3) - (4 - 5) = -3
I can't seem to figure out a dynamic programming formulation for the problem. I just started solving problems involving dynamic programming and don't know how to approach this one.
EDIT: The numbers are in the range 0 <= N <= 100 and the length of the expression is <= 30.
Basic idea
The parentheses are basically interposed between numbers and operators; any imbalance can be fixed at the ends of the entire expression.
Possible placements of parentheses
A ( immediately before either operator is illegal syntax.
A ( immediately after a + is legal but pointless, since it doesn't change the order of evaluation. I'll assume we don't do this.
A ( immediately after a - is legal and important.
A ) immediately before a + is legal and important iff there was a matching ( before.
A ) immediately before a - is legal but pointless, since opening a new pair of parentheses after the following number gives the same sign change and more options later on, because we will have one more open pair we can close. I'll assume we don't do this either.
This means the only parentheses we actually need are opening parentheses before negative numbers and closing parentheses after positive ones. If we stick to those two, the sign the next number is multiplied with in the summation depends only on the number of open parentheses being even or odd.
This gives us the
Substructure
Parsing the expression from left to right, after every number the current sub-problem can be represented as a set of pairs of
the partial sum and
the number of open parentheses.
Working out the specific example
Reading in +1:
(1, 0)
That is, there is only one solution for this sub-problem: The partial sum so far is 1 and the number of open parentheses is 0. From now on in each sub-problem I'll have one line for the pairs arising from every pair in the previous sub-problem.
Reading in -2:
(-1, 1), (-1, 0)
I.e. the partial sum is -1, but we may or may not have inserted an opening parenthesis.
Reading in +3:
(-4,1),(-4,0)
(2,0)
New in this sub-problem: We could optionally close a pair of parentheses, but only if one was open.
Reading in -4:
(0,2), (0,1)
(-8,0), (-8,1)
(-2,0), (-2,1)
Reading in -5:
(-5,2), (-5,3)
(5,1), (5,2)
(-13,0), (-13, 1)
(-3,1), (-3,2)
(-7,0), (-7,1)
(3,1), (3,2)
In the end we get the possible sums by looking only at the first element of each pair and discarding duplicates.
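As a sketch of this procedure in Python (the function name is made up), where each state is a pair of (partial sum, number of open parentheses), the leading number carries an implicit +, and at most one parenthesis is opened or closed per step, exactly as in the worked example:

def reachable_values(terms):
    # terms is a list of (sign, number) pairs, e.g. 1 - 2 + 3 - 4 - 5 becomes
    # [('+', 1), ('-', 2), ('+', 3), ('-', 4), ('-', 5)]
    states = {(0, 0)}  # (partial sum, open parentheses)
    for sign, num in terms:
        nxt = set()
        for total, open_count in states:
            if sign == '-':
                value = -num if open_count % 2 == 0 else num
                nxt.add((total + value, open_count))
                # optionally open a '(' right after the '-'; it changes only
                # the sign of later terms, not of this one
                nxt.add((total + value, open_count + 1))
            else:
                value = num if open_count % 2 == 0 else -num
                nxt.add((total + value, open_count))
                # optionally close a ')' after a positive term, if one is open
                if open_count:
                    nxt.add((total + value, open_count - 1))
        states = nxt
    return {total for total, _ in states}

print(sorted(reachable_values([('+', 1), ('-', 2), ('+', 3), ('-', 4), ('-', 5)])))
# [-13, -7, -5, -3, 3, 5]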
Given a string consisting of digits, add + or - signs to make the expression evaluate to 0. Return the expression.
For example,
123 => 1 + 2 - 3 = 0
173956 => 17 + 39 - 56 = 0
I have no clue how to solve this problem other than by brute force.
Is there any suggestion?
This is a search problem. Search must be performed in the solution space.
Suppose we start from the string '123'. At this point we can add a + or - sign after '1', and as a result we get '1 + 23' or '1 - 23'. Every variant can be split further by adding a sign after the next character. As a result, all possible sign additions form a tree-like structure: the solution space. Your algorithm must search for a solution in this structure. I think A* can be used to do this.
Anders K drew a nice ASCII graph of the solution space; you just need to search it for a solution. A simple breadth-first search or depth-first search can do the work, but I think it will be slow if the solution space is large.
Also, I think it is possible to find a more optimal, specific solution that exploits properties of the solution space, for example its tree-like structure.
You can solve it in many ways, for example using a recursive approach, which becomes obvious if you structure it as a tree.
e.g. 123
since there can be two different signs after a digit (+|-):
1
/ \
+ -
/ \
2 2
/ \ / \
+ - + -
| | | |
3 3 3 3
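A minimal recursive sketch of that tree in Python (names are made up); it adds one extra branch to the tree above, gluing adjacent digits into a multi-digit number, which the 173956 example needs:

def find_zero_expression(digits):
    def rec(pos, total, pending, expr):
        # total: value of the finished terms; pending: the signed term being built
        if pos == len(digits):
            return expr if total + pending == 0 else None
        d = int(digits[pos])
        # branch 1: glue the digit onto the term we are currently building
        grown = pending * 10 + d if pending >= 0 else pending * 10 - d
        found = rec(pos + 1, total, grown, expr + digits[pos])
        if found:
            return found
        # branches 2 and 3: finish the current term and start a new one with + or -
        for sign, text in ((1, ' + '), (-1, ' - ')):
            found = rec(pos + 1, total + pending, sign * d, expr + text + digits[pos])
            if found:
                return found
        return None
    return rec(1, 0, int(digits[0]), digits[0])

print(find_zero_expression("123"))     # 1 + 2 - 3
print(find_zero_expression("173956"))  # one valid expression, e.g. 17 + 39 - 56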
I have a dataset of 25 integer fields and 40k records, e.g.
1:
field1: 0
field2: 3
field3: 1
field4: 2
[...]
field25: 1
2:
field1: 2
field2: 1
field3: 4
field4: 0
[...]
field25: 2
etc.
I'm testing with MySQL but am not tied to it.
Given a single record, I need to retrieve the records most similar to it; something like the lowest average difference of the fields. I started looking at the following, but I don't know how to map this onto the problem of searching for similarities in a large dataset.
https://en.wikipedia.org/wiki/Euclidean_distance
https://en.wikipedia.org/wiki/S%C3%B8rensen_similarity_index
https://en.wikipedia.org/wiki/Similarity_matrix
I know it's an old post, but for anyone who comes by it seeking similar algorithms, one that works particularly well is Cosine Similarity. Find a way to vectorize your records, then look for vectors with the minimum angle between them. If vectorizing a record is not trivial, then you can vectorize the similarity between records via some known algorithm, and then look at the cosine similarity of the similarity vectors to the perfect-match vector (assuming perfect matches aren't the goal, since they're easy to find anyway).

I get tremendous results with this matching, even comparing things like lists of people in various countries working on a particular project with various contributions to the project. Vectorization means looking at the number of country matches, country mismatches, the ratio of people in a matching country between the two datasets, and so on. I use string edit distance functions like Levenshtein distance for getting a numeric value from string dissimilarities, but one could use phonetic matching, etc.

One caveat: the target vector must not be 0 (the angle between the zero vector [0 0 ... 0] and any other vector is undefined). Sometimes, to get around this problem, such as in the case of edit distance, I give a perfect match (edit distance 0) a negative weight, so that perfect matches are really emphasized: -1 and 1 are farther apart than 1 and 2, which makes a lot of sense, since a perfect match is better than anything with even one misspelling.
Cos(theta) = (A dot B) / (Norm(A)*Norm(B)), where dot is the dot product and Norm is the Euclidean magnitude of the vector.
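A small Python sketch of that formula, with made-up example vectors (not the actual 25-field records):

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("undefined for the zero vector")
    return dot / (norm_a * norm_b)

record    = [0, 3, 1, 2, 5]   # made-up field values
candidate = [2, 1, 4, 0, 5]
print(cosine_similarity(record, candidate))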
Good luck!
Here's a possibility with straight average distance between each of the fields (the value after each minus is from the given record needing a match):
SELECT id,
(
ABS(field1-2)
+ ABS(field2-2)
+ ABS(field3-3)
+ ABS(field4-1)
+ ABS(field5-0)
+ ABS(field6-3)
+ ABS(field7-2)
+ ABS(field8-0)
+ ABS(field9-1)
+ ABS(field10-0)
+ ABS(field11-2)
+ ABS(field12-2)
+ ABS(field13-3)
+ ABS(field14-2)
+ ABS(field15-0)
+ ABS(field16-1)
+ ABS(field17-0)
+ ABS(field18-2)
+ ABS(field19-3)
+ ABS(field20-1)
+ ABS(field21-0)
+ ABS(field22-1)
+ ABS(field23-3)
+ ABS(field24-2)
+ ABS(field25-2)
)/25
AS distance
FROM mytable
ORDER BY distance ASC
LIMIT 20;
If you're not familiar with the Rowland prime sequence, you can find out about it here. I've created an ugly, procedural monadic verb in J to generate the first n terms in this sequence, as follows:
rowland =: monad define
  result =. 0 $ 0
  t =. 1 $ 7
  while. (# result) < y do.
    a =. {: t
    n =. 1 + # t
    t =. t , a + n +. a
    d =. | -/ _2 {. t
    if. d > 1 do.
      result =. ~. result , d
    end.
  end.
  result
)
This works perfectly, and it indeed generates the first n terms in the sequence. (By n terms, I mean the first n distinct primes.)
Here is the output of rowland 20:
5 3 11 23 47 101 7 13 233 467 941 1889 3779 7559 15131 53 30323 60647 121403 242807
My question is, how can I write this in more idiomatic J? I don't have a clue, although I did write the following function to find the differences between each successive number in a list of numbers, which is one of the required steps. Here it is, although it too could probably be refactored by a more experienced J programmer than I:
diffs =: monad : '}: (|@-/@(2&{.) , $:@}.) ^: (1 < #) y'
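(For readers who don't know J, here is a rough Python sketch of what rowland computes; it reproduces the twenty primes listed above. The name rowland_primes is made up.)

from math import gcd

def rowland_primes(count):
    # a(1) = 7, a(n) = a(n-1) + gcd(n, a(n-1)); collect the first differences
    # greater than 1, without duplicates
    primes, seen = [], set()
    a, n = 7, 1
    while len(primes) < count:
        n += 1
        d = gcd(n, a)   # the first difference a(n) - a(n-1)
        a += d
        if d > 1 and d not in seen:
            seen.add(d)
            primes.append(d)
    return primes

print(rowland_primes(20))
# [5, 3, 11, 23, 47, 101, 7, 13, 233, 467, 941, 1889, 3779, 7559, 15131,
#  53, 30323, 60647, 121403, 242807]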
While I already marked estanford's answer as the correct one, I've come a long, long way with J since I asked this question. Here's a much more idiomatic way to generate the rowland prime sequence in J:
~. (#~ 1&<) | 2 -/\ (, ({: + ({: +. >:@#)))^:(1000 - #) 7
The expression (, ({: + ({: +. >:@#)))^:(1000 - #) 7 generates the so-called original sequence up to 1000 members. The first differences of this sequence can be generated by | 2 -/\, i.e., the absolute values of the differences of every two elements. (Compare this to my original, long-winded diffs verb from the original question.)
Lastly, we remove the ones and the duplicate primes with ~. (#~ 1&<) to get the sequence of primes.
This is vastly superior to the way I was doing it before. It can easily be turned into a verb that generates n primes with a little recursion.
I don't have a full answer yet, but this essay by Roger Hui has a tacit construct you can use to replace explicit while loops. Another (related) avenue would be to make the inner logic of the block into a tacit expression like so:
FUNWITHTACIT =: ] , {: + {: +. 1 + #
rowland =: monad define
  result =. >a:
  t =. 7x
  while. (# result) < y do.
    t =. FUNWITHTACIT t
    d =. | -/ _2 {. t
    result =. ~.result,((d>1)#d)
  end.
  result
)
(You might want to keep the if block for efficiency, though, since I wrote the code in such a way that result is modified regardless of whether or not the condition was met -- if it wasn't, the modification has no effect. The if logic could even be written back into the tacit expression by using the Agenda operator.)
A complete solution would consist of finding out how to represent all the logic inside the while loop as a single function, and then using Roger's trick to implement the while logic as a tacit expression. I'll see what I can turn up.
As an aside, I got J to build FUNWITHTACIT for me by taking the first few lines of your code, manually substituting in the functions you declared for the variable values (which I could do because they were all operating on a single argument in different ways), replacing every instance of t with y, and telling J to build the tacit equivalent of the resulting expression:
]FUNWITHTACIT =: 13 : 'y,({:y)+(1+#y)+.({:y)'
] , {: + {: +. 1 + #
Using 13 : to declare the monad is how J knows to take a monad (otherwise explicitly declared with 3 : 0, or monad define as you wrote in your program) and convert the explicit expression into a tacit expression.
EDIT:
Here are the functions I wrote for avenue (2) that I mentioned in the comment:
candfun =: 3 : '(] , {: + {: +. 1 + #)^:(y) 7'
firstdiff =: [: | 2 -/\ ]
triage =: [: /:~ [: ~. 1&~: # ]
rowland2 =: triage@firstdiff@candfun
This function generates the first n-many candidate numbers using the Rowland recurrence relation, evaluates their first differences, discards all first-differences equal to 1, discards all duplicates, and sorts them in ascending order. I don't think it's completely satisfactory yet, since the argument sets the number of candidates to try instead of the number of results. But, it's still progress.
Example:
rowland2 1000
3 5 7 11 13 23 47 101 233 467 941
Here's a version of the first function I posted, keeping the size of each argument to a minimum:
NB. rowrec takes y=(n,an) where n is the index and a(n) is the
NB. nth member of the rowland candidate sequence. The base case
NB. is rowrec 1 7. The output is (n+1,a(n+1)).
rowrec =: (1 + {.) , }. + [: +./ 1 0 + ]
rowland3 =: 3 : 0
  result =. >a:
  t =. 1 7
  while. y > #result do.
    ts =. (<"1)2 2 $ t,rowrec t
    fdiff =. |2-/\,>(}.&.>) ts
    if. 1~:fdiff do.
      result =. ~. result,fdiff
    end.
    t =. ,>}.ts
  end.
  /:~ result
)
which finds the first y-many distinct Rowland primes and presents them in ascending order:
rowland3 20
3 5 7 11 13 23 47 53 101 233 467 941 1889 3779 7559 15131 30323 60647 121403 242807
Most of the complexity of this function comes from my handling of boxed arrays. It's not pretty code, but it only keeps 4+#result many data elements (which grows on a log scale) in memory during computation. The original function rowland keeps (#t)+(#result) elements in memory (which grows on a linear scale). rowland2 y builds an array of y-many elements, which makes its memory profile almost the same as rowland even though it never grows beyond a specified bound. I like rowland2 for its compactness, but without a formula to predict the exact size of y needed to generate n-many distinct primes, that task will need to be done on a trial-and-error basis and thus potentially use many more cycles than rowland or rowland3 on redundant calculations. rowland3 is probably more efficient than my version of rowland, since FUNWITHTACIT recomputes #t on every loop iteration -- rowland3 just increments a counter, which is less computationally intensive.
Still, I'm not happy with rowland3's explicit control structures. It seems like there should be a way to accomplish this behavior using recursion or something.