Efficient automatic bracket completion algorithm - algorithm

I'm writing automatic bracket completion feature for a text editor, my current approach is to call a bracket matching procedure (assume it's correct and efficient) on every newline where the last typed symbol is the open bracket, starting from the new line (with that last typed open bracket on the top of the stack) until either the stack is empty or it comes to the end of the source with no matching close bracket found.
However, this approach seems to be slow for big source files (>1 MB), especially when the newline is added at the first half of the source lines (newline at first line = worst case = whole text is scanned). Some IDEs have this capability and could handle it fast, so they must be using different approach. So, I would like to know what algorithm they use or any other approach better than mine.

If matching brackets is an important concern you can handle it using a secondary data structure for brackets alone.
E.g. you can build a tree which stores begin_pos/end_pos for a pair of brackets within a node and as children all pairs of brackets within. Take special care of non-matching brackets (i.e. they start/end at their parents boundaries).
Adding an extra open/close bracket is quite comfortable with such a structure since you can skip all child and sibling trees.
An appropriate implementation should perform with O(log(n)) solution, n being the number of pairs of brackets.

Related

Turing Machine Algorithm

Could you please help me? I need to write code for a one-tape Turing Machine that uses the following two-letter alphabet a and b.
So the programme should show the common prefix of the two words.
For example:
g(aab,aaaba) -> aa; g(_,abab) -> _; g(aaba,baa) -> _; g(_,_) -> _; g(babaab,babb) -> bab
Where g is the function of the Machine and underscore means an Empty word, between words we have space
I tried to implement the following option:
If at the start we see the letter a, then we erase it and move to the beginning of the second word. If we also see a letter a there, we erase it too and after both words we write a through a space. After that we return to the beginning of the first word and repeat this operation. When the first letter of the first word and the first letter of the second no longer match, we erase everything that is left.
But I have some troubles with code, because after each operation a space between two words gets longer and I don't know how to control this. Also there is a trouble when the first or the second word is a common prefix fully, like this:
g(baa,baabab) -> baa
Your approach seems reasonable. From your description it sounds like you just have trouble generalizing some of the individual steps.
For instance, to deal with the growing spaces between the two words, remember that at any time in the program, the two words are separated by one or more spaces. So implement your seek operation for that general case.
For the common prefix case you have to deal with the situation that you eventually run out of characters to compare. So after deleting the character from the first word, while seeking for the beginning of the second word, check whether the first character you pass over is a letter or a space. If it's a space, you're in the prefix case and need to take care that you don't try to seek back to the first word later, because you already erased all of it and there's only spaces left. Similarly, if the second word is the prefix, you can detect this when seeking to the output.
Try breaking the algorithm down into its essential steps and test each of those steps in isolation. It is much easier to make sure you handle the corner cases correctly when you can focus on a simple step in isolation, instead of having to test it as part of the larger algorithm. Doing this is an essential skill in debugging code, so consider this a good exercise for that. Even if it seems painful at first, make sure you have a structured approach to analyzing problems and breaking your code down into smaller parts, and you will be able to fix any problems eventually. Happy coding!

Chained XPath axes with trailing `[1]` or `[last()]` predicates

This question is specifically about using XPath in XSLT 2.0 and Saxon.
XPaths ending with [1]
For XPaths like
following-sibling::foo[1]
descendant::bar[1]
I take it for granted that Saxon will not iterate over the entire axis but stop when it finds the first matching node - crucial in situations like:
following-sibling::foo[some:expensivePredicate(.)][1]
I assume that this is also the case for XPaths like this:
(following-sibling::foo/descendant::bar)[1]
I.e. Saxon will not compile the entire set of nodes matching following-sibling::foo/descendant::bar before picking the first one in the set. Rather, it will (even for chained axes) stop at the first matching node.
XPaths ending with [last()]
Now it gets interesting. When going "backwards" in the tree, I assume that XPaths like
preceding-sibling::foo[1]
work just as efficiently as their following-sibling equivalents. But what happens when chaining axes, e.g.
(preceding-sibling::foo/descendant::bar)[last()]
As we need to use [last()] here instead of [1],
will Saxon compile the entire set of nodes to count them to get a numeric value for last()?
Or will it be smart and stop iterating the preceding-sibling axis when it found a matching descendant?
Or will it be even more clever and iterate the descendant axis in reverse to more efficiently find the last descendant?
Saxon has a variety of strategies for evaluating last(). When used as a predicate, meaning [position()=last()], it is generally translated to an internal function [isLast()] which can be evaluated by a single-item lookahead. (So in your example of (preceding-sibling::foo /descendant::bar)[last()], it doesn't build the node-set in memory, rather it reads the nodes one by one and when it hits the end, returns the last one it found).
In other cases, particularly when used in XSLT match patterns, Saxon will convert child::x[last()] to child::x[not(following-sibling::x)].
When none of these approaches work, for many years Saxon had two strategies for evaluating last() depending on the expression it was applied to: (a) sometimes it would evaluate the expression twice, counting nodes the first time and returning them the second time; (b) in other cases it would read all the nodes into memory. We've recently encountered cases where strategy (a) fails: see https://saxonica.plan.io/issues/3122, and so we're always doing (b).
The last() expression is potentially expensive and it should be avoided where possible. For example the classic "insert a separator between adjacent items" which is often written
xx
if (position() != last()) sep
is much better written as
if (position() != 1) sep
xx
i.e. instead of inserting the separator after every item except the last, insert it before every item except the first. Or use string-join, or xsl:value-of/#separator.

How to get rid of unnecessary parentheses in mathematical expression

Hi I was wondering if there is any known way to get rid of unnecessary parentheses in mathematical formula. The reason I am asking this question is that I have to minimize such formula length
if((-if(([V].[6432])=0;0;(([V].[6432])-([V].[6445]))*(((([V].[6443]))/1000*([V].[6448])
+(([V].[6443]))*([V].[6449])+([V].[6450]))*(1-([V].[6446])))))=0;([V].[6428])*
((((([V].[6443]))/1000*([V].[6445])*([V].[6448])+(([V].[6443]))*([V].[6445])*
([V].[6449])+([V].[6445])*([V].[6450])))*(1-([V].[6446])));
it is basically part of sql select statement. It cannot surpass 255 characters and I cannot modify the code that produces this formula (basically a black box ;) )
As you see many parentheses are useless. Not mentioning the fact that:
((a) * (b)) + (c) = a * b + c
So I want to keep the order of operations Parenthesis, Multiply/Divide, Add/Subtract.
Im working in VB, but solution in any language will be fine.
Edit
I found an opposite problem (add parentheses to a expression) Question.
I really thought that this could be accomplished without heavy parsing. But it seems that some parser that will go through the expression and save it in a expression tree is unevitable.
If you are interested in remove the non-necessary parenthesis in your expression, the generic solution consists in parsing your text and build the associated expression tree.
Then, from this tree, you can find the corresponding text without non-necessary parenthesis, by applying some rules:
if the node is a "+", no parenthesis are required
if the node is a "*", then parenthesis are required for left(right) child only if the left(right) child is a "+"
the same apply for "/"
But if your problem is just to deal with these 255 characters, you can probably just use intermediate variables to store intermediate results
T1 = (([V].[6432])-([V].[6445]))*(((([V].[6443]))/1000*([V].[6448])+(([V].[6443]))*([V].[6449])+([V].[6450]))*(1-([V].[6446])))))
T2 = etc...
You could strip the simplest cases:
([V].[6432]) and (([V].[6443]))
Becomes
v.[6432]
You shouldn't need the [] around the table name or its alias.
You could shorten it further if you can alias the columns:
select v.[6432] as a, v.[6443] as b, ....
Or even put all the tables being queried into a single subquery - then you wouldn't need the table prefix:
if((-if(a=0;0;(a-b)*((c/1000*d
+c*e+f)*(1-g))))=0;h*
(((c/1000*b*d+c*b*
e+b*f))*(1-g));
select [V].[6432] as a, [V].[6445] as b, [V].[6443] as c, [V].[6448] as d,
[V].[6449] as e, [V].[6450] as f,[V].[6446] as g, [V].[6428] as h ...
Obviously this is all a bit psedo-code, but it should help you simplify the full statement
I know this thread is really old, but as it is searchable from google.
I'm writing a TI-83 plus calculator program that addresses similar issues. In my case, I'm trying to actually solve the equation for a specific variable in number, but it may still relate to your problem, although I'm using an array, so it might be easier for me to pick out specific values...
It's not quite done, but it does get rid of the vast majority of parentheses with (I think), a somewhat elegant solution.
What I do is scan through the equation/function/whatever, keeping track of each opening parenthese "(" until I find a closing parenthese ")", at which point I can be assured that I won't run into any more deeply nested parenthese.
y=((3x + (2))) would show the (2) first, and then the (3x + (2)), and then the ((3x + 2))).
What it does then is checks the values immediately before and after each parenthese. In the case above, it would return + and ). Each of these is assigned a number value. Between the two of them, the higher is used. If no operators are found (*,/,+,^, or -) I default to a value of 0.
Next I scan through the inside of the parentheses. I use a similar numbering system, although in this case I use the lowest value found, not the highest. I default to a value of 5 if nothing is found, as would be in the case above.
The idea is that you can assign a number to the importance of the parentheses by subtracting the two values. If you have something like a ^ on the outside of the parentheses
(2+3)^5
those parentheses are potentially very important, and would be given a high value, (in my program I use 5 for ^).
It is possible however that the inside operators would render the parentheses very unimportant,
(2)^5
where nothing is found. In that case the inside would be assigned a value of 5. By subtracting the two values, you can then determine whether or not a set of parentheses is neccessary simply by checking whether the resulting number is greater than 0. In the case of (2+3)^5, a ^ would give a value of 5, and a + would give a value of 1. The resulting number would be 4, which would indicate that the parentheses are in fact needed.
In the case of (2)^5 you would have an inner value of 5 and an outer value of 5, resulting
in a final value of 0, showing that the parentheses are unimportant, and can be removed.
The downside to this is that, (at least on the TI-83) scanning through the equation so many times is ridiculously slow. But if speed isn't an issue...
Don't know if that will help at all, I might be completely off topic. Hope you got everything up and working.
I'm pretty sure that in order to determine what parentheses are unnecessary, you have to evaluate the expressions within them. Because you can nest parentheses, this is is the sort of recursive problem that a regular expression can only address in a shallow manner, and most likely to incorrect results. If you're already evaluating the expression, maybe you'd like to simplify the formula if possible. This also gets kind of tricky, and in some approaches uses techniques that that are also seen in machine learning, such as you might see in the following paper: http://portal.acm.org/citation.cfm?id=1005298
If your variable names don't change significantly from 1 query to the next, you could try a series of replace() commands. i.e.
X=replace([QryString],"(([V].[6443]))","[V].[6443]")
Also, why can't it surpass 255 characters? If you are storing this as a string field in an Access table, then you could try putting half the expression in 1 field and the second half in another.
You could also try parsing your expression using ANTLR, yacc or similar and create a parse tree. These trees usually optimize parentheses away. Then you would just have to create expression back from tree (without parentheses obviously).
It might take you more than a few hours to get this working though. But expression parsing is usually the first example on generic parsing, so you might be able to take a sample and modify it to your needs.

intelligent path truncation/ellipsis for display

I am looking for an existign path truncation algorithm (similar to what the Win32 static control does with SS_PATHELLIPSIS) for a set of paths that should focus on the distinct elements.
For example, if my paths are like this:
Unit with X/Test 3V/
Unit with X/Test 4V/
Unit with X/Test 5V/
Unit without X/Test 3V/
Unit without X/Test 6V/
Unit without X/2nd Test 6V/
When not enough display space is available, they should be truncated to something like this:
...with X/...3V/
...with X/...4V/
...with X/...5V/
...without X/...3V/
...without X/...6V/
...without X/2nd ...6V/
(Assuming that an ellipsis generally is shorter than three letters).
This is just an example of a rather simple, ideal case (e.g. they'd all end up at different lengths now, and I wouldn't know how to create a good suggestion when a path "Thingie/Long Test/" is added to the pool).
There is no given structure of the path elements, they are assigned by the user, but often items will have similar segments. It should work for proportional fonts, so the algorithm should take a measure function (and not call it to heavily) or generate a suggestion list.
Data-wise, a typical use case would contain 2..4 path segments anf 20 elements per segment.
I am looking for previous attempts into that direction, and if that's solvable wiht sensible amount of code or dependencies.
I'm assuming you're asking mainly about how to deal with the set of folder names extracted from the same level of hierarchy, since splitting by rows and path separators and aggregating by hierarchy depth is simple.
Your problem reminds me a lot of the longest common substring problem, with the differences that:
You're interested in many substrings, not just one.
You care about order.
These may appear substantial, but if you examine the dynamic-programming solution in the article you can see that it revolves around creating a table of "character collisions" and then looking for the longest diagonal in this table. I think that you could instead enumerate all diagonals in the table by the order in which they appear, and then for each path replace, by order, all appearances of these strings with ellipses.
Enforcing a minimal substring length of 2 will return a result similar to what you've outlined in your question.
It does seem like it requires some tinkering with the algorithm (for example, ensuring a certain substring is first in all strings), and then you need to invoke it over your entire set... I hope this at least gives you a possible direction.
Well, the "natural number" ordering part is actually easy, simply replace all numbers with formatted number where there is enough leading zeroes, eg. Test 9V -> Test 000009V and Test 12B -> Test 000012B. These are now sortable by standard methods.
For the actual ellipsisizing. Unless this is actually a huge system, I'd just add manual ellipsisizing "list" (of regexes, for flexibility and pain) that'd turn certain words into ellipses. This does requires continuous work, but coming up with the algorithm eats your time too; there are myriads of corner cases.
I'd probably try a "Floodfill" approach. Arrange first level of directories as you would a bitmap, every letter is a pixel. iterate over all characters that are in names of directories. with all of them, "paint" this same character, then "paint" the next character from first string such that it follows this previous character (and so on etc.) Then select the longest painted string that you find.
Example (if prefixed with *, it's painted)
Foo
BarFoo
*Foo
Bar*Foo
*F*oo
Bar*F*oo
...
note that:
*ofoo
b*oo
*o*foo
b*oo
.. painting of first 'o' stops since there are no continuing characters.
of*oo
b*oo
...
And then you get to to second "o" and it will find a substring of at least 2.
So you will have to iterate over most possible character instances (one optimization is to stop in each string at position Length-n, where n is the longest already found common substring. But then there is yet another problem (here with "Beta Beta")
| <- visibility cutout
Alfa Beta Gamma Delta 1
Alfa Beta Gamma Delta 2
Alfa Beta Beta 1
Alfa Beta Beta 2
Beta Beta 1
Beta Beta 2
Beta Beta 3
Beta Beta 4
What do you want to do? Cut Alfa Beta Gamma Delta or Alfa Beta or Beta Beta or Beta?
This is a bit rambling, but might be entertaining :).

How to elegantly compute the anagram signature of a word in ruby?

Arising out of this question, I'm looking for an elegant (ruby) way to compute the word signature suggested in this answer.
The idea suggested is to sort the letters in the word, and also run length encode repeated letters. So, for example "mississippi" first becomes "iiiimppssss", and then could be further shortened by encoding as "4impp4s".
I'm relatively new to ruby and though I could hack something together, I'm sure this is a one liner for somebody with more experience of ruby. I'd be interested to see people's approaches and improve my ruby knowledge.
edit: to clarify, performance of computing the signature doesn't much matter for my application. I'm looking to compute the signature so I can store it with each word in a large database of words (450K words), then query for words which have the same signature (i.e. all anagrams of a given word, that are actual english words). Hence the focus on space. The 'elegant' part is just to satisfy my curiosity.
The fastest way to create a sorted list of the letters is this:
"mississippi".unpack("c*").sort.pack("c*")
It is quite a bit faster than split('') and join(). For comparison it is also best to pack the array back together into a String, so you dont have to compare arrays.
I'm not much of a Ruby person either, but as I noted on the other comment this seems to work for the algorithm described.
s = "mississippi"
s.split('').sort.join.gsub(/(.)\1{2,}/) { |s| s.length.to_s + s[0,1] }
Of course, you'll want to make sure the word is lowercase, doesn't contain numbers, etc.
As requested, I'll try to explain the code. Please forgive me if I don't get all of the Ruby or reg ex terminology correct, but here goes.
I think the split/sort/join part is pretty straightforward. The interesting part for me starts at the call to gsub. This will replace a substring that matches the regular expression with the return value from the block that follows it. The reg ex finds any character and creates a backreference. That's the "(.)" part. Then, we continue the matching process using the backreference "\1" that evaluates to whatever character was found by the first part of the match. We want that character to be found a minimum of two more times for a total minimum number of occurrences of three. This is done using the quantifier "{2,}".
If a match is found, the matching substring is then passed to the next block of code as an argument thanks to the "|s|" part. Finally, we use the string equivalent of the matching substring's length and append to it whatever character makes up that substring (they should all be the same) and return the concatenated value. The returned value replaces the original matching substring. The whole process continues until nothing is left to match since it's a global substitution on the original string.
I apologize if that's confusing. As is often the case, it's easier for me to visualize the solution than to explain it clearly.
I don't see an elegant solution. You could use the split message to get the characters into an array, but then once you've sorted the list I don't see a nice linear-time concatenate primitive to get back to a string. I'm surprised.
Incidentally, run-length encoding is almost certainly a waste of time. I'd have to see some very impressive measurements before I'd think it worth considering. If you avoid run-length encoding, you can anagrammatize any string, not just a string of letters. And if you know you have only letters and are trying to save space, you can pack them 5 bits to a letter.
---Irma Vep
EDIT: the other poster found join which I missed. Nice.

Resources