How to improve performance of Nearley-based parser

How to improve performance of Nearley-based parser - performance

I'm trying to parse SQL INSERT statements like:
INSERT INTO `Album` (`Title`, `ArtistId`) VALUES ('Blue Moods', 89);
using the following grammar written for Nearley:
main -> statement:*
statement -> insert_clause values_clause ";"
insert_clause -> "INSERT INTO" identifier parenthesis
values_clause -> "VALUES" parenthesis
parenthesis -> "(" list ")"
list -> value ("," value):*
value -> identifier | literal
identifier -> %IDENTIFIER | %QUOTED_IDENTIFIER
literal -> %STRING | %NUMBER
I have removed all the post-processor code for clarity as I think it shouldn't have much of a performance impact. The most expensive thing I'm doing there is calling Array.flat() in post-processors of main and list rules.
The generated parser takes over 20 seconds to parse 1.6 MB of SQL. In contrast, a recursive-descent parser which I wrote manually to do the same thing, takes under a second. The parser also gets slower and slower as I give it more input, so it's definitely not linear time.
I found an issue from Nearley Github which mentions 3 possible culprits for performance problems with Nearley:
Not using a Tokenizer - I do use a tokenizer, so not a problem.
Having an ambiguous grammar - originally I had problems with ambiguity, but I've since fixed them and the grammar in here shouldn't IMHO have any issues.
Using right-recursive rules.
The last one I'm not sure of. The list rule seems right-recursive to me, so I tried rewriting it as:
list -> (value ","):* value
But this had zero effect on performance.
I'm out of ideas now.
I'm pretty inexperienced with parser generators, so hopefully this is some sort of silly mistake that I'm making here.
PS. Ignore the "INSERT INTO" token. I know it should really be two separate tokens.

Discovered that the problems was with "expensive" post processor functions. Basically I had this:
main -> statement:* {% flatten %}
Turns out that instead of the post-processor function being executed when all statements are matched it gets executed at every increment:
with 0 statements
with 1 statement
with 2 statements
...
As I was calling flatten() function which loops though the whole array, this resulted in O(n2) complexity.
There's a nearley issue, which describes a similar problem, stating that this behavior is by design, as Nearley is a streaming parser.

Related

Compile error using fast pipe operator after pipe last in ReasonML

The way that the "fast pipe" operator is compared to the "pipe last" in many places implies that they are drop-in replacements for each other. Want to send a value in as the last parameter to a function? Use pipe last (|>). Want to send it as the first parameter? Use fast pipe (once upon a time |., now deprecated in favour of ->).
So you'd be forgiven for thinking, as I did until earlier today, that the following code would get you the first match out of the regular expression match:
Js.String.match([%re "/(\\w+:)*(\\w+)/i"], "int:id")
|> Belt.Option.getExn
-> Array.get(1)
But you'd be wrong (again, as I was earlier today...)
Instead, the compiler emits the following warning:
We've found a bug for you!
OCaml preview 3:10-27
This has type:
'a option -> 'a
But somewhere wanted:
'b array
See this sandbox. What gives?

Looks like they screwed up the precedence of -> so that it's actually interpreted as
Js.String.match([%re "/(\\w+:)*(\\w+)/i"], "int:id")
|> (Belt.Option.getExn->Array.get(1));
With the operators inlined:
Array.get(Belt.Option.getExn, 1, Js.String.match([%re "/(\\w+:)*(\\w+)/i"], "int:id"));
or with the partial application more explicit, since Reason's syntax is a bit confusing with regards to currying:
let f = Array.get(Belt.Option.getExn, 1);
f(Js.String.match([%re "/(\\w+:)*(\\w+)/i"], "int:id"));
Replacing -> with |. works. As does replacing the |> with |..
I'd consider this a bug in Reason, but would in any case advise against using "fast pipe" at all since it invites a lot of confusion for very little benefit.

Also see this discussion on Github, which contains various workarounds. Leaving #glennsl's as the accepted answer because it describes the nature of the problem.
Update: there is also an article on Medium that goes into a lot of depth about the pros and cons of "data first" and "data last" specifically as it applies to Ocaml / Reason with Bucklescript.

Pythonesque blocks and postfix expressions

In JavaScript,
f = function(x) {
return x + 1;
}
(5)
seems at a glance as though it should assign f the successor function, but actually assigns the value 6, because the lambda expression followed by parentheses is interpreted by the parser as a postfix expression, specifically a function call. Fortunately this is easy to fix:
f = function(x) {
return x + 1;
};
(5)
behaves as expected.
If Python allowed a block in a lambda expression, there would be a similar problem:
f = lambda(x):
return x + 1
(5)
but this time we can't solve it the same way because there are no semicolons. In practice Python avoids the problem by not allowing multiline lambda expressions, but I'm working on a language with indentation-based syntax where I do want multiline lambda and other expressions, so I'm trying to figure out how to avoid having a block parse as the start of a postfix expression. Thus far I'm thinking maybe each level of the recursive descent parser should have a parameter along the lines of 'we have already eaten a block in this statement so don't do postfix'.
Are there any existing languages that encounter this problem, and how do they solve it if so?

Python has semicolons. This is perfectly valid (though ugly and not recommended) Python code: f = lambda(x): x + 1; (5).
There are many other problems with multi-line lambdas in otherwise standard Python syntax though. It is completely incompatible with how Python handles indentation (whitespace in general, actually) inside expressions - it doesn't, and that's the complete opposite of what you want. You should read the numerous python-ideas thread about multi-line lambdas. It's somewhere between very hard to impossible.
If you want arbitrarily complex compound statements inside lambdas you can't use the existing rules for multi-line expressions even if you made all statements expressions. You'd have to change the indentation handling (see the language reference for how it works right now) so that expressions can also contain blocks. This is hard to do without breaking perfectly fine Python code, and will certainly result in a language many Python programmers will consider worse in several regards: Harder to understand, more complex to implement, permits some stupid errors, etc.
Most languages don't solve this exact problem at all. Most candidates (Scala, Ruby, Lisps, and variants of these three) have explicit end-of-block tokens. I know of two languages that have the same problem, one of which (Haskell) has been mentioned by another answer. Coffeescript also uses indentation without end-of-block tokens. It parses the transliteration of your example correctly. However, I could not find any specification of how or why it does this (and I won't dig through the parser source code). Both differ significantly from Python in syntax as well as design philosophy, so their solution is of little (if any) use for Python.

In Haskell, there is an implicit semicolon whenever you start a line with the same indentation as a previous one, assuming the parser is in a layout-sensitive mode.
More specifically, after a token is encountered that signals the start of a (layout-sensitive) block, the indentation level of the first token of the first block item is remembered. Each line that is indented more continues the current block item; each line that is indented the same starts a new block item, and the first line that is indented less implies the closure of the block.
How your last example would be treated depends on whether the f = is a block item in some block or not. If it is, then there will be an implicit semicolon between the lambda expression and the (5), since the latter is indented the same as the former. If it is not, then the (5) will be treated as continuing whatever block item the f = is a part of, making it an argument to the lamda function.
The details are a bit messier than this; look at the Haskell 2010 report.

Unwanted evaluation in assignments in Mathematica: why it happens and how to debug it during the package-loading?

I am developing a (large) package which does not load properly anymore.
This happened after I changed a single line of code.
When I attempt to load the package (with Needs), the package starts loading and then one of the setdelayed definitions “comes alive” (ie. Is somehow evaluated), gets trapped in an error trapping routine loaded a few lines before and the package loading aborts.
The error trapping routine with abort is doing its job, except that it should not have been called in the first place, during the package loading phase.
The error message reveals that the wrong argument is in fact a pattern expression which I use on the lhs of a setdelayed definition a few lines later.
Something like this:
……Some code lines
Changed line of code
g[x_?NotGoodQ]:=(Message[g::nogood, x];Abort[])
……..some other code lines
g/: cccQ[g[x0_]]:=True
When I attempt to load the package, I get:
g::nogood: Argument x0_ is not good
As you see the passed argument is a pattern and it can only come from the code line above.
I tried to find the reason for this behavior, but I have been unsuccessful so far.
So I decided to use the powerful Workbench debugging tools .
I would like to see step by step (or with breakpoints) what happens when I load the package.
I am not yet too familiar with WB, but it seems that ,using Debug as…, the package is first loaded and then eventually debugged with breakpoints, ect.
My problem is that the package does not even load completely! And any breakpoint set before loading the package does not seem to be effective.
So…2 questions:
can anybody please explain why these code lines "come alive" during package loading? (there are no obvious syntax errors or code fragments left in the package as far as I can see)
can anybody please explain how (if) is possible to examine/debug
package code while being loaded in WB?
Thank you for any help.
Edit
In light of Leonid's answer and using his EvenQ example:
We can avoid using Holdpattern simply by definying upvalues for g BEFORE downvalues for g
notGoodQ[x_] := EvenQ[x];
Clear[g];
g /: cccQ[g[x0_]] := True
g[x_?notGoodQ] := (Message[g::nogood, x]; Abort[])
Now
?g
Global`g
cccQ[g[x0_]]^:=True
g[x_?notGoodQ]:=(Message[g::nogood,x];Abort[])
In[6]:= cccQ[g[1]]
Out[6]= True
while
In[7]:= cccQ[g[2]]
During evaluation of In[7]:= g::nogood: -- Message text not found -- (2)
Out[7]= $Aborted
So...general rule:
When writing a function g, first define upvalues for g, then define downvalues for g, otherwise use Holdpattern
Can you subscribe to this rule?
Leonid says that using Holdpattern might indicate improvable design. Besides the solution indicated above, how could one improve the design of the little code above or, better, in general when dealing with upvalues?
Thank you for your help

Leaving aside the WB (which is not really needed to answer your question) - the problem seems to have a straightforward answer based only on how expressions are evaluated during assignments. Here is an example:
In[1505]:=
notGoodQ[x_]:=True;
Clear[g];
g[x_?notGoodQ]:=(Message[g::nogood,x];Abort[])
In[1509]:= g/:cccQ[g[x0_]]:=True
During evaluation of In[1509]:= g::nogood: -- Message text not found -- (x0_)
Out[1509]= $Aborted
To make it work, I deliberately made a definition for notGoodQ to always return True. Now, why was g[x0_] evaluated during the assignment through TagSetDelayed? The answer is that, while TagSetDelayed (as well as SetDelayed) in an assignment h/:f[h[elem1,...,elemn]]:=... does not apply any rules that f may have, it will evaluate h[elem1,...,elem2], as well as f. Here is an example:
In[1513]:=
ClearAll[h,f];
h[___]:=Print["Evaluated"];
In[1515]:= h/:f[h[1,2]]:=3
During evaluation of In[1515]:= Evaluated
During evaluation of In[1515]:= TagSetDelayed::tagnf: Tag h not found in f[Null]. >>
Out[1515]= $Failed
The fact that TagSetDelayed is HoldAll does not mean that it does not evaluate its arguments - it only means that the arguments arrive to it unevaluated, and whether or not they will be evaluated depends on the semantics of TagSetDelayed (which I briefly described above). The same holds for SetDelayed, so the commonly used statement that it "does not evaluate its arguments" is not literally correct. A more correct statement is that it receives the arguments unevaluated and does evaluate them in a special way - not evaluate the r.h.s, while for l.h.s., evaluate head and elements but not apply rules for the head. To avoid that, you may wrap things in HoldPattern, like this:
Clear[g,notGoodQ];
notGoodQ[x_]:=EvenQ[x];
g[x_?notGoodQ]:=(Message[g::nogood,x];Abort[])
g/:cccQ[HoldPattern[g[x0_]]]:=True;
This goes through. Here is some usage:
In[1527]:= cccQ[g[1]]
Out[1527]= True
In[1528]:= cccQ[g[2]]
During evaluation of In[1528]:= g::nogood: -- Message text not found -- (2)
Out[1528]= $Aborted
Note however that the need for HoldPattern inside your left-hand side when making a definition is often a sign that the expression inside your head may also evaluate during the function call, which may break your code. Here is an example of what I mean:
In[1532]:=
ClearAll[f,h];
f[x_]:=x^2;
f/:h[HoldPattern[f[y_]]]:=y^4;
This code attempts to catch cases like h[f[something]], but it will obviously fail since f[something] will evaluate before the evaluation comes to h:
In[1535]:= h[f[5]]
Out[1535]= h[25]
For me, the need for HoldPattern on the l.h.s. is a sign that I need to reconsider my design.
EDIT
Regarding debugging during loading in WB, one thing you can do (IIRC, can not check right now) is to use good old print statements, the output of which will appear in the WB's console. Personally, I rarely feel a need for debugger for this purpose (debugging package when loading)
EDIT 2
In response to the edit in the question:
Regarding the order of definitions: yes, you can do this, and it solves this particular problem. But, generally, this isn't robust, and I would not consider it a good general method. It is hard to give a definite advice for a case at hand, since it is a bit out of its context, but it seems to me that the use of UpValues here is unjustified. If this is done for error - handling, there are other ways to do it without using UpValues.
Generally, UpValues are used most commonly to overload some function in a safe way, without adding any rule to the function being overloaded. One advice is to avoid associating UpValues with heads which also have DownValues and may evaluate -by doing this you start playing a game with evaluator, and will eventually lose. The safest is to attach UpValues to inert symbols (heads, containers), which often represent a "type" of objects on which you want to overload a given function.
Regarding my comment on the presence of HoldPattern indicating a bad design. There certainly are legitimate uses for HoldPattern, such as this (somewhat artificial) one:
In[25]:=
Clear[ff,a,b,c];
ff[HoldPattern[Plus[x__]]]:={x};
ff[a+b+c]
Out[27]= {a,b,c}
Here it is justified because in many cases Plus remains unevaluated, and is useful in its unevaluated form - since one can deduce that it represents a sum. We need HoldPattern here because of the way Plus is defined on a single argument, and because a pattern happens to be a single argument (even though it describes generally multiple arguments) during the definition. So, we use HoldPattern here to prevent treating the pattern as normal argument, but this is mostly different from the intended use cases for Plus. Whenever this is the case (we are sure that the definition will work all right for intended use cases), HoldPattern is fine. Note b.t.w., that this example is also fragile:
In[28]:= ff[Plus[a]]
Out[28]= ff[a]
The reason why it is still mostly OK is that normally we don't use Plus on a single argument.
But, there is a second group of cases, where the structure of usually supplied arguments is the same as the structure of patterns used for the definition. In this case, pattern evaluation during the assignment indicates that the same evaluation will happen with actual arguments during the function calls. Your usage falls into this category. My comment for a design flaw was for such cases - you can prevent the pattern from evaluating, but you will have to prevent the arguments from evaluating as well, to make this work. And pattern-matching against not completely evaluated expression is fragile. Also, the function should never assume some extra conditions (beyond what it can type-check) for the arguments.

Why use short-circuit code?

Related Questions: Benefits of using short-circuit evaluation, Why would a language NOT use Short-circuit evaluation?, Can someone explain this line of code please? (Logic & Assignment operators)
There are questions about the benefits of a language using short-circuit code, but I'm wondering what are the benefits for a programmer? Is it just that it can make code a little more concise? Or are there performance reasons?
I'm not asking about situations where two entities need to be evaluated anyway, for example:
if($user->auth() AND $model->valid()){
$model->save();
}
To me the reasoning there is clear - since both need to be true, you can skip the more costly model validation if the user can't save the data.
This also has a (to me) obvious purpose:
if(is_string($userid) AND strlen($userid) > 10){
//do something
};
Because it wouldn't be wise to call strlen() with a non-string value.
What I'm wondering about is the use of short-circuit code when it doesn't effect any other statements. For example, from the Zend Application default index page:
defined('APPLICATION_PATH')
|| define('APPLICATION_PATH', realpath(dirname(__FILE__) . '/../application'));
This could have been:
if(!defined('APPLICATION_PATH')){
define('APPLICATION_PATH', realpath(dirname(__FILE__) . '/../application'));
}
Or even as a single statement:
if(!defined('APPLICATION_PATH'))
define('APPLICATION_PATH', realpath(dirname(__FILE__) . '/../application'));
So why use the short-circuit code? Just for the 'coolness' factor of using logic operators in place of control structures? To consolidate nested if statements? Because it's faster?

For programmers, the benefit of a less verbose syntax over another more verbose syntax can be:
less to type, therefore higher coding efficiency
less to read, therefore better maintainability.
Now I'm only talking about when the less verbose syntax is not tricky or clever in any way, just the same recognized way of doing, but in fewer characters.
It's often when you see specific constructs in one language that you wish the language you use could have, but didn't even necessarily realize it before. Some examples off the top of my head:
anonymous inner classes in Java instead of passing a pointer to a function (way more lines of code).
in Ruby, the ||= operator, to evaluate an expression and assign to it if it evaluates to false or is null. Sure, you can achieve the same thing by 3 lines of code, but why?
and many more...

Use it to confuse people!

I don't know PHP and I've never seen short-circuiting used outside an if or while condition in the C family of languages, but in Perl it's very idiomatic to say:
open my $filehandle, '<', 'filename' or die "Couldn't open file: $!";
One advantage of having it all in one statement is the variable declaration. Otherwise you'd have to say:
my $filehandle;
unless (open $filehandle, '<', 'filename') {
die "Couldn't open file: $!";
}
Hard to claim the second one is cleaner in that case. And it'd be wordier still in a language that doesn't have unless

I think your example is for the coolness factor. There's no reason to write code like that.
EDIT: I have no problem with doing it for idiomatic reasons. If everyone else who uses a language uses short-circuit evaluation to make statement-like entities that everyone understands, then you should too. However, my experience is that code of that sort is rarely written in C-family languages; proper form is just to use the "if" statement as normal, which separates the conditional (which presumably has no side effects) from the function call that the conditional controls (which presumably has many side effects).

Short circuit operators can be useful in two important circumstances which haven't yet been mentioned:
Case 1. Suppose you had a pointer which may or may not be NULL and you wanted to check that it wasn't NULL, and that the thing it pointed to wasn't 0. However, you must not dereference the pointer if it's NULL. Without short-circuit operators, you would have to do this:
if (a != NULL) {
if (*a != 0) {
⋮
}
}
However, short-circuit operators allow you to write this more compactly:
if (a != NULL && *a != 0) {
⋮
}
in the certain knowledge that *a will not be evaluated if a is NULL.
Case 2. If you want to set a variable to a non-false value returned from one of a series of functions, you can simply do:
my $file = $user_filename ||
find_file_in_user_path() ||
find_file_in_system_path() ||
$default_filename;
This sets the value of $file to $user_filename if it's present, or the result of find_file_in_user_path(), if it's true, or … so on. This is seen perhaps more often in Perl than C, but I have seen it in C.
There are other uses, including the rather contrived examples which you cite above. But they are a useful tool, and one which I have missed when programming in less complex languages.

Related to what Dan said, I'd think it all depends on the conventions of each programming language. I can't see any difference, so do whatever is idiomatic in each programming language. One thing that could make a difference that comes to mind is if you had to do a series of checks, in that case the short-circuiting style would be much clearer than the alternative if style.

What if you had a expensive to call (performance wise) function that returned a boolean on the right hand side that you only wanted called if another condition was true (or false)? In this case Short circuiting saves you many CPU cycles. It does make the code more concise because of fewer nested if statements. So, for all the reasons you listed at the end of your question.

The truth is actually performance. Short circuiting is used in compilers to eliminate dead code saving on file size and execution speed. At run-time short-circuiting does not execute the remaining clause in the logical expression if their outcome does not affect the answer, speeding up the evaluation of the formula. I am struggling to remember an example. e.g
a AND b AND c
There are two terms in this formula evaluated left to right.
if a AND b evaluates to FALSE then the next expression AND c can either be FALSE AND TRUE or FALSE AND FALSE. Both evaluate to FALSE no matter what the value of c is. Therefore the compiler does not include AND c in the compiled format hence short-circuiting the code.
To answer the question there are special cases when the compiler cannot determine whether the logical expression has a constant output and hence would not short-circuit the code.

Think of it this way, if you have a statement like
if( A AND B )
chances are if A returns FALSE you'll only ever want to evaluate B in rare special cases. For this reason NOT using short ciruit evaluation is confusing.
Short circuit evaluation also makes your code more readable by preventing another bracketed indentation and brackets have a tendency to add up.

Which method will perform better? (Or is there not enough of a difference to matter?)

Which (if any) of the following will give the smallest performance hit? Or is the difference so small that I should use the most readable?
In the same page I've noted 3 styles used by previous maintainers:
Method 1:
If (strRqMethod = "Forum" or strRqMethod = "URL" or strRqMethod = "EditURL" or strRqMethod = "EditForum") Then
...
End If
Method 2:
Select Case strRqMethod
Case "Reply", "ReplyQuote", "TopicQuote"
'This is the only case in this statement...'
...
End Select
Method 3:
If InArray("Edit,EditTopic,Reply,ReplyQuote,Topic,TopicQuote",strRqMethod) Then
...
End If
.
.
.
'Elsewhere in the code'
function InArray(strArray,strValue)
if strArray <> "" and strArray <> "0" then
if (instr("," & strArray & "," ,"," & strValue & ",") > 0) then
InArray = True
else
InArray = False
end if
else
InArray = False
end if
end function
Moving away from Classic ASP/VBScript is not an option, so those comments need not bother to post.

You can benchmark this yourself to get the best results, as some performance will differ depending on the size of the input string.
However, I will say from a maintenance perspective, the second one is a bit easier to read/understand.

Well Method 3 is clearly going to perform worse than the other two.
Between Method 1 and Method 2 the difference is going to be marginal. Its worth remembering that VBScript doesn't do boolean expression short cutting hence in Method 1 strRqMethod will be compared with all strings even if it matches the first one. The Case statement in Method 2 at least has the option not to do that and likely will stop comparing when the first match is found in the set.
Utimately I would choose Method 2 not because I think it might be faster but because it expresses the intent of the code in the clearest way.

Educated guess:
Performance-wise, first two approaches are roughly equivalent; third method is very likely slower, even if it gets inlined.
Furthermore the differential between the first two are likely in the micro-seconds range, so you can safely consider this to be a bone fide case of premature optimization...
Since we're on the topic of OR-ed boolean evaluation, a few things to know:
Most compilers/interpreters will evaluate boolean expressions with "short circuit optimization", which means that at the first true condition found, the subsequent OR-ed conditions are NOT evaluated (since they wouldn't change the outcome). It is therefore a good idea to list the condition in [rough] decreasing order of probability, i.e. listing all the common cases first. (Also note that short circuit evaluation is also used with AND-ed expressions, but of course in the reverse, i.e. at the first false condition, the evalation stops, hence suggesting to write the expression with the most likely conditions to fail first).
Comparing strings is such a common task that most languages have this done in a very optimized fashion, at a very low level of the language. Most any trick we can think to improve this particular task is typically less efficient than the native operator.

As long as this is not done 100.000 (in other words: a lof of) times in a loop, it makes no difference. Although it is parsed code, we may still assume that the parsing is done swift and quickly enough not to make a difference.
I found severe performance problems only when you are concatenating a lot of strings - like I once found out when running a page, adding debug code to a global string to be able to dispay the debug only at the bottom of the page. The longer the page was, the more code it ran, the more debug code I added, and the longer the time it took to display the page. Since this page was doing some database access, I presumed it was somewhere in that code that the delay occured, only to found out that it was just the debug statements (to be honest, I had a log of debug string concatenated).

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

How to improve performance of Nearley-based parser - performance

Related

Compile error using fast pipe operator after pipe last in ReasonML

Pythonesque blocks and postfix expressions

Unwanted evaluation in assignments in Mathematica: why it happens and how to debug it during the package-loading?

Why use short-circuit code?

Which method will perform better? (Or is there not enough of a difference to matter?)

Categories

Resources