I'm trying to scan through a JavaScript document that has many functions defined throughout it, and to delete a function from the document. I'm hoping that there's a trivial regex trick that I can do here.
Example:
Some JavaScript document:
function wanted_foo(bar) {
...
...
}
function unwanted_foo(bar) {
...
inner_foo {
...
...
}
}
function wanted_foo(bar) {
...
...
inner_foo {
...
...
}
}
The obvious problem here is that I need to match things of the form "function unwanted_foo(bar) { ... }", except that I only need to match up until the last curly brace of the function, and not to the next curly brace of another function. Is there a simple Regex way of doing this?
Normal regular expressions can't do this because they can't keep count of the previously matched braces and therefore can't find the last one, nonetheless it seems many implementations have capabilities that go beyond normal regex's and Ruby is one of those cases.
Hare are a couple of references about that, although it might not be what you would call simple.
Matching braces in ruby with a character in front
Backreferences
One "trick" is to use a counter together with regex.
Initialize your counter to 0
Match something of the form /^\s*function\s+\w+\(.*\)\s*\{ and when you have found this pattern, remember the position.
When you match that pattern, increment your counter
Now match { and } separately and increment or decrement your counter depending on what you've found.
Keep doing this until your counter is 0 again, then you should have found a function
Hope that's useful to you?
Related
I need to check if any elements of a large (60,000+ elements) array are present in a long string of text. My current code looks like this:
if $TARGET_PARTLIST.any? { |target_pn| pdf_content_string.include? target_pn }
self.last_match_code = target_pn
self.is_a_match = true
end
I get a syntax error undefined local variable or method target_pn.
Could someone let me know the correct syntax to use for this block of code? Also, if anyone knows of a quicker way to do this, I'm all ears!
In this case, all your syntax is correct, you've just got a logic error. While target_pn is defined (as a parameter) inside the block passed to any?, it is not defined in the block of the if statement because the scope of the any?-block ends with the closing curly brace, and target_pn is not available outside its scope. A correct (and more idiomatic) version of your code would look like this:
self.is_a_match = $TARGET_PARTLIST.any? do |target_pn|
included = pdf_content_string.include? target_pn
self.last_match_code = target_pn if included
included
end
Alternately, as jvillian so kindly suggests, one could turn the string into an array of words, then do an intersection and see if the resulting set is nonempty. Like this:
self.is_a_match = !($TARGET_PARTLIST &
pdf_content_string.gsub(/[^A-Za-z ]/,"")
.split).empty?
Unfortunately, this approach loses self.last_match_code. As a note, pointed out by Sergio, if you're dealing with non-English languages, the above regex will have to be changed.
Hope that helps!
You should use Enumerable#find rather than Enumerable#any?.
found = $TARGET_PARTLIST.find { |target_pn| pdf_content_string.include? target_pn }
if found
self.last_match_code = found
self.is_a_match = true
end
Note this does not ensure that the string contains a word that is an element of $TARGET_PARTLIST. For example, if $TARGET_PARTLIST contains the word "able", that string will be found in the string, "Are you comfortable?". If you only want to match words, you could do the following.
found = $TARGET_PARTLIST.find { |target_pn| pdf_content_string[/\b#{target_pn}\b/] }
Note this uses the method String#[].
\b is a word break in the regular expression, meaning that the first (last) character of the matched cannot be preceded (followed) by a word character (a letter, digit or underscore).
If speed is important it may be faster to use the following.
found = $TARGET_PARTLIST.find { |target_pn|
pdf_content_string.include?(target_on) && pdf_content_string[/\b#{target_pn}\b/] }
A probably more performant way would be to move all this into native code by letting Regexp search for it.
# needed only once
TARGET_PARTLIST_RE = Regexp.new("\\b(?:#{$TARGET_PARTLIST.sort.map { |pl| Regexp.escape(pl) }.join('|')})\\b")
# to check
self.last_match_code = pdf_content_string[TARGET_PARTLIST_RE]
self.is_a_match = !self.last_match_code.nil?
A much more performant way would be to build a prefix tree and create the regexp using the prefix tree (this optimises the regexp lookup), but this is a bit more work :)
I made an interpreter for a script language containing a loop for, using javacc I have defined the grammar but I can't have a way to back up to line to repeat the execution of the block "for".
how back up the token manager so that loop-bodies can be re-parsed, and thus reevaluated, over and over again ?.
void For(): {ArrayList<String> lst;Token n,v;int i=0;} {
"for" "(" n=<ID> ":" v=<ID> ")" "{"
(actions()";" )+
"}"
}
As explained in the JavaCC FAQ (http://www.engr.mun.ca/~theo/JavaCC-FAQ/javacc-faq-moz.htm#tth_sEc7.3), the best approach is to output some form of intermediate representation and then interpret that. Some common approaches follow.
One way is to output a tree from the parser. Then interpret the tree. The interpreter could use the Interpreter design pattern. https://en.wikipedia.org/wiki/Interpreter_pattern
A second way is to translate to the machine code for a virtual machine. An example of this approach can be seen in the Turtle Talk interpreter at http://www.engr.mun.ca/~theo/Courses/sd/5895-downloads/turtle-talk.zip .
A third way is to translate to another high-level programming language which can then be compiled and executed.
As other answers have said, this is why usually people just build a data structure in memory, where the group of commands is parsed once and can then just be executed repeatedly.
But if you want to do a pure interpreter for now, what is needed is to remember the start position of the token that turned out to be the loop (i.e. the "for" token) and rewind to that position in the token stream. You might also have to scan ahead for the opening and closing brackets ("{" and "}") so that you know the end of the loop.
Once you have that, your "for" command changes from an actual loop into a variant of the "if" statement. If the condition is true, you execute the commands, if the condition is false, you jump over everything, including the closing "}". And when you hit the "}", you jump back to the position of the "for" to check the condition all over again.
We can use the JavaCode Production like below,
provided the action() production takes care of each line.
Please Move semicolon to the actions() Production.
TOKEN : {
<LCURLY : "{" >
<RCURLY : "}" >
}
void For(): {ArrayList<String> lst;Token n,v;int i=0;} {
"for" "(" n=<ID> ":" v=<ID> ")" <LCURLY>
loopBody ();
<RCURLY>
}
JAVACODE
void loopBody () {
Token t = getNextToken();
while (t.kind != RCURLY) {
actions();
t = getNextToken();
}
}
Hope This will help.
I have a struct:
type nameSorter struct {
names []Name
by func(s1, s2 *Name) bool
Which is used in this method. What is going on with that comma? If I remove it there is a syntax error.
func (by By) Sort(names []Name) {
sorter := &nameSorter{
names: names,
by: by, //why does there have to be a comma here?
}
sort.Sort(sorter)
Also, the code below works perfectly fine and seems to be more clear.
func (by By) Sort(names []Name) {
sorter := &nameSorter{names, by}
sort.Sort(sorter)
For more context this code is part of a series of declarations for sorting of a custom type that looks like this:
By(lastNameSort).Sort(Names)
This is how go works, and go is strict with things like comma and parentheses.
The good thing about this notion is that when adding or deleting a line, it does not affect other line. Suppose the last comma can be omitted, if you want to add a field after it, you have to add the comma back.
See this post: https://dave.cheney.net/2014/10/04/that-trailing-comma.
From https://golang.org/doc/effective_go.html#semicolons:
the lexer uses a simple rule to insert semicolons automatically as it scans, so the input text is mostly free of them
In other words, the programmer is unburdened from using semicolons, but Go still uses them under the hood, prior to compilation.
Semicolons are inserted after the:
last token before a newline is an identifier (which includes words like int and float64), a basic literal such as a number or string constant, or one of the tokens break continue fallthrough return ++ -- ) }
Thus, without a comma, the lexer would insert a semicolon and cause a syntax error:
&nameSorter{
names: names,
by: by; // semicolon inserted after identifier, syntax error due to unclosed braces
}
So the title might be a little bit misleading, but I can't think of any better way to phrase it.
Basically, I'm writing a lexical-scanner using cygwin/lex. A part of the code reads a token /* . It the goes into a predefined state C_COMMENT, and ends when C_COMMENT"/*". Below is the actual code
"/*" {BEGIN(C_COMMENT); printf("%d: /*", linenum++);}
<C_COMMENT>"*/" { BEGIN(INITIAL); printf("*/\n"); }
<C_COMMENT>. {printf("%s",yytext);}
The code works when the comment is in a single line, such as
/* * Example of comment */
It will print the current line number, with the comment behind. But it doesn't work if the comment spans multiple lines. Rewriting the 3rd line into
<C_COMMENT>. {printf("%s",yytext);
printf("\n");}
doesn't work. It will result in \n printed for every letter in the comment. I'm guessing it has something to do with C having no strings or maybe I'm using the states wrong.
Hope someone will be able to help me out :)
Also if there's any other info you need, just ask, and I'll provide.
The easiest way to echo the token scanned by a pattern is to use the special action ECHO:
"/*" { printf("%d: ", linenum++); ECHO; BEGIN(C_COMMENT); }
<C_COMMENT>"*/" { ECHO; BEGIN(INITIAL); }
<C_COMMENT>. { ECHO; }
None of the above rules matches a newline inside a comment, because in (f)lex . doesn't match newlines:
<C_COMMENT>\n { linenum++; ECHO; }
A faster way of recognizing C comments is with a single regular expression, although it's a little hard to read:
[/][*][^*]*[*]+([^/*][^*][*]+)*[/]
In this case, you'll have to rescan the comment to count newlines, unless you get flex to do the line number counting.
flex scanners maintain a line number count in yylineno, if you request that feature (using %option yylineno). It's often more efficient and always more reliable than keeping the count yourself. However, in the action, the value of yylineno is the line number count at the end of the pattern, not at the beginning, which can be misleading for multiline patterns. A common workaround is to save the value of yylineno in another variable at the beginning of the token scan.
I would like to find a regex that will pick out all commas that fall outside quote sets.
For example:
'foo' => 'bar',
'foofoo' => 'bar,bar'
This would pick out the single comma on line 1, after 'bar',
I don't really care about single vs double quotes.
Has anyone got any thoughts? I feel like this should be possible with readaheads, but my regex fu is too weak.
This will match any string up to and including the first non-quoted ",". Is that what you are wanting?
/^([^"]|"[^"]*")*?(,)/
If you want all of them (and as a counter-example to the guy who said it wasn't possible) you could write:
/(,)(?=(?:[^"]|"[^"]*")*$)/
which will match all of them. Thus
'test, a "comma,", bob, ",sam,",here'.gsub(/(,)(?=(?:[^"]|"[^"]*")*$)/,';')
replaces all the commas not inside quotes with semicolons, and produces:
'test; a "comma,"; bob; ",sam,";here'
If you need it to work across line breaks just add the m (multiline) flag.
The below regexes would match all the comma's which are present outside the double quotes,
,(?=(?:[^"]*"[^"]*")*[^"]*$)
DEMO
OR(PCRE only)
"[^"]*"(*SKIP)(*F)|,
"[^"]*" matches all the double quoted block. That is, in this buz,"bar,foo" input, this regex would match "bar,foo" only. Now the following (*SKIP)(*F) makes the match to fail. Then it moves on to the pattern which was next to | symbol and tries to match characters from the remaining string. That is, in our output , next to pattern | will match only the comma which was just after to buz . Note that this won't match the comma which was present inside double quotes, because we already make the double quoted part to skip.
DEMO
The below regex would match all the comma's which are present inside the double quotes,
,(?!(?:[^"]*"[^"]*")*[^"]*$)
DEMO
While it's possible to hack it with a regex (and I enjoy abusing regexes as much as the next guy), you'll get in trouble sooner or later trying to handle substrings without a more advanced parser. Possible ways to get in trouble include mixed quotes, and escaped quotes.
This function will split a string on commas, but not those commas that are within a single- or double-quoted string. It can be easily extended with additional characters to use as quotes (though character pairs like « » would need a few more lines of code) and will even tell you if you forgot to close a quote in your data:
function splitNotStrings(str){
var parse=[], inString=false, escape=0, end=0
for(var i=0, c; c=str[i]; i++){ // looping over the characters in str
if(c==='\\'){ escape^=1; continue} // 1 when odd number of consecutive \
if(c===','){
if(!inString){
parse.push(str.slice(end, i))
end=i+1
}
}
else if(splitNotStrings.quotes.indexOf(c)>-1 && !escape){
if(c===inString) inString=false
else if(!inString) inString=c
}
escape=0
}
// now we finished parsing, strings should be closed
if(inString) throw SyntaxError('expected matching '+inString)
if(end<i) parse.push(str.slice(end, i))
return parse
}
splitNotStrings.quotes="'\"" // add other (symmetrical) quotes here
Try this regular expression:
(?:"(?:[^\\"]+|\\(?:\\\\)*[\\"])*"|'(?:[^\\']+|\\(?:\\\\)*[\\'])*')\s*=>\s*(?:"(?:[^\\"]+|\\(?:\\\\)*[\\"])*"|'(?:[^\\']+|\\(?:\\\\)*[\\'])*')\s*,
This does also allow strings like “'foo\'bar' => 'bar\\',”.
MarkusQ's answer worked great for me for about a year, until it didn't. I just got a stack overflow error on a line with about 120 commas and 3682 characters total. In Java, like this:
String[] cells = line.split("[\t,](?=(?:[^\"]|\"[^\"]*\")*$)", -1);
Here's my extremely inelegant replacement that doesn't stack overflow:
private String[] extractCellsFromLine(String line) {
List<String> cellList = new ArrayList<String>();
while (true) {
String[] firstCellAndRest;
if (line.startsWith("\"")) {
firstCellAndRest = line.split("([\t,])(?=(?:[^\"]|\"[^\"]*\")*$)", 2);
}
else {
firstCellAndRest = line.split("[\t,]", 2);
}
cellList.add(firstCellAndRest[0]);
if (firstCellAndRest.length == 1) {
break;
}
line = firstCellAndRest[1];
}
return cellList.toArray(new String[cellList.size()]);
}
#SocialCensus, The example you gave in the comment to MarkusQ, where you throw in ' alongside the ", doesn't work with the example MarkusQ gave right above that if we change sam to sam's: (test, a "comma,", bob, ",sam's,",here) has no match against (,)(?=(?:[^"']|["|'][^"']")$). In fact, the problem itself, "I don't really care about single vs double quotes", is ambiguous. You have to be clear what you mean by quoting either with " or with '. For example, is nesting allowed or not? If so, to how many levels? If only 1 nested level, what happens to a comma outside the inner nested quotation but inside the outer nesting quotation? You should also consider that single quotes happen by themselves as apostrophes (ie, like the counter-example I gave earlier with sam's). Finally, the regex you made doesn't really treat single quotes on par with double quotes since it assumes the last type of quotation mark is necessarily a double quote -- and replacing that last double quote with ['|"] also has a problem if the text doesn't come with correct quoting (or if apostrophes are used), though, I suppose we probably could assume all quotes are correctly delineated.
MarkusQ's regexp answers the question: find all commas that have an even number of double quotes after it (ie, are outside double quotes) and disregard all commas that have an odd number of double quotes after it (ie, are inside double quotes). This is generally the same solution as what you probably want, but let's look at a few anomalies. First, if someone leaves off a quotation mark at the end, then this regexp finds all the wrong commas rather than finding the desired ones or failing to match any. Of course, if a double quote is missing, all bets are off since it might not be clear if the missing one belongs at the end or instead belongs at the beginning; however, there is a case that is legitimate and where the regex could conceivably fail (this is the second "anomaly"). If you adjust the regexp to go across text lines, then you should be aware that quoting multiple consecutive paragraphs requires that you place a single double quote at the beginning of each paragraph and leave out the quote at the end of each paragraph except for at the end of the very last paragraph. This means that over the space of those paragraphs, the regex will fail in some places and succeed in others.
Examples and brief discussions of paragraph quoting and of nested quoting can be found here http://en.wikipedia.org/wiki/Quotation_mark .