Recursive use of regexes in TextMate JSON grammars - syntax-highlighting

I'm trying to write a grammar JSON file for TextMate to use with VSCode highlighting.
Let's say an int is:
([0-9]+|[0-9]+e[0-9]+)
And now I want to use ints to define this new expression:
int:int
example: 33:5 or 2e4:9
or any combination of some ints for that matter. Do I have to define it like:
([0-9]+|[0-9]+e[0-9]+):([0-9]+|[0-9]+e[0-9]+)
Or is there a better way?! Can I use stuff like "begin/end" or "begin/while" with "include" to achieve this? There must be a better way than just repeating my regexes thousands of times for different variations of items.
I'm looking for a way to name a certain pattern and then use that pattern in combination with other regex constructs to create a new pattern, kind of like in a BNF grammar.
I tried using "begin" with (?=\d) and then "end" with (?!\d) and then "include" #int as pattern. It did not work...
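One way to avoid repeating the sub-pattern by hand is to generate the grammar from a small script, so the int pattern is written once and spliced into larger patterns. This is only a minimal sketch: the variable names, the repository key "int-pair" and the scope name are made up for illustration, and whether this fits your build setup is an assumption.

```ruby
require "json"

# The int pattern from the question, written once.
int = "(?:[0-9]+|[0-9]+e[0-9]+)"

# Larger patterns are assembled by string interpolation.
int_pair = "#{int}:#{int}"

# Hypothetical grammar fragment; the key and scope name are illustrative only.
fragment = {
  "repository" => {
    "int-pair" => {
      "match" => int_pair,
      "name"  => "constant.numeric.pair.example"
    }
  }
}

puts JSON.pretty_generate(fragment)
```

The generated "match" string is the same long regex you would have typed by hand, but the repetition now lives in the script instead of the JSON file.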

Related

Named subroutines in Oniguruma regex engine?

In Perl, you can do this:
(?x)
(?(DEFINE)
(?<animal>dog|cat)
)
(?&animal)
In Ruby (Oniguruma engine), it seems that the (?(DEFINE... syntax is not supported. Also, (?&... becomes \g. So, you can do this:
(?x)
(?<animal>dog|cat)
\g<animal>
But of course, this is not equivalent to the Perl example I gave above, because the first (?<animal>dog|cat) is not ignored, since there isn't anything like (?(DEFINE....
If I want to define a large regex with a bunch of named subroutines, what I could once do in Perl can't be done this way.
It does seem that I could hack together a pretty awkward solution by doing something like this:
(?x)
(?:^$DEFINE
(?<animal>dog|cat)
){0}
\g<animal>
But, that is pretty hackish. Is there a better way to do this? Does Oniguruma support a way to define named subroutines without having to try to "match" them first?
Alternatively, if there is a way to get true PCRE to work in Ruby, with ?(DEFINE... and (?&... I'd take that too.
Thanks!
You don't need such a complicated hack. Writing:
(?x)
(?<animal>dog|cat){0}
(?<color>red|green|blue){0}
...
your main pattern here
does exactly the same.
Putting all group definitions inside (?:^$DEFINE ... ){0} is only cosmetic.
Note that a group with the quantifier {0} isn't tried at all (the quantifier is taken into account first). Since the named group is still defined this way, one can conclude that it isn't really a hack, but simply the way to do it with Oniguruma.
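As a rough illustration of the {0} idiom in Ruby, here is a small sketch. The main pattern (a color followed by an animal) and the variable name are assumptions added for the example, not part of the original question.

```ruby
pet = /
  (?<animal>dog|cat){0}          # definition only, never matched at this position
  (?<color>red|green|blue){0}    # definition only
  \A \g<color> \s \g<animal> \z  # main pattern, built from the definitions
/x

puts pet.match?("red dog")     # the definitions are reused by the \g calls
puts pet.match?("purple dog")  # "purple" is not one of the defined colors
```

Because the definitions carry {0}, they consume no input, so the main pattern can start with \A as if they were not there.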

Find filenames with regexp group capture

I'm looking for a way to find matching files by regexp that also supports groups in the regexp. Like:
match_files('/home/(*)/**/(*).txt')
would return something like:
[ ['/home/bob/docs/abc.txt', 'bob', 'abc'], ['/home/sue/archive/docs/def.txt', 'sue', 'def'] ]
Guard does something like this. I'm not looking to match this specific regex; rather to match any arbitrary regex input that might be provided.
Dir.glob() normally returns a flat array and doesn't support groups. I'm trying to locate a library or some technique that would support this kind of thing, for a DSL.
So your question seems to be off topic, because you are asking for a tool or library recommendation to solve your problem.
Also, your question should include valid code examples:
['/home/bob/docs/abc.txt', 'bob', 'readme']
I guess it's supposed to mean
['/home/bob/docs/abc.txt', 'bob', 'abc']
Anyway... I think the question is quite interesting, but I don't think you can solve it with the standard library alone.
Dir.glob:
Returns true if path matches against pattern. The pattern is not a
regular expression; instead it follows rules similar to shell filename
globbing. It may contain the following metacharacters...
The only reasonable thing to do is to allow special characters, parse the string, extract the matches, create a glob and then apply matching to the filenames.
How about this.
regex = %r{/home/([^/]+)/.*/([^/]+)\.txt}
`find .`.split("\n").grep(regex).map { |l| l.match(regex) }.map(&:to_a)
Could certainly be improved.
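Along the same lines, here is a sketch that avoids shelling out: let Dir.glob list the candidates, then apply a regex to each path and keep the captures. The helper names capture_paths and match_files are hypothetical, and taking a glob plus a separate regex (rather than one combined pattern string) is a simplifying assumption.

```ruby
# Hypothetical helper: keep the paths that match and append their captures.
def capture_paths(paths, regex)
  paths.each_with_object([]) do |path, rows|
    m = regex.match(path)
    rows << [path, *m.captures] if m
  end
end

# Hypothetical wrapper: the glob selects candidates, the regex extracts groups.
def match_files(glob, regex)
  capture_paths(Dir.glob(glob), regex)
end

pattern = %r{\A/home/([^/]+)/(?:.*/)?([^/]+)\.txt\z}
paths = [
  "/home/bob/docs/abc.txt",
  "/home/sue/archive/docs/def.txt",
  "/home/bob/notes.md"
]
p capture_paths(paths, pattern)
# => [["/home/bob/docs/abc.txt", "bob", "abc"], ["/home/sue/archive/docs/def.txt", "sue", "def"]]
```

On a real tree this would be called as match_files("/home/**/*.txt", pattern). Translating the (*) and ** syntax from the question into such a regex is left out here.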

Extract function names from function calls in C files

Is it possible to extract function calls in C source files, e.g.,
...
myfunc(1);
...
or
...
myfunc(anotherfunc(1, 2));
....
by just using Ruby regular expression? If not, would a parser generator such as ANTLR be useful?
This is not a foolproof pattern for finding method calls, but it should cover the patterns that you are interested in.
[a-zA-Z\s]*\([a-zA-Z0-9]*(\([a-zA-Z0-9\s]*[\s,]*[\sa-zA-Z0-9]*\))?\);
This regex will match following method call patterns.
1. myfunc(another(one,two));
2. myfunc();
3. myfunc(another());
4. myfunc(oneArg);
You can also use the regular expressions already written for the grammars used by Emacs: imenu, etags, ecb, c-mode, etc.
In the purest sense you can't, because the possibility to nest function calls recursively makes it a non-regular language. That is, you cannot write a regular expression that matches an arbitrary function call and extracts all of the contained function names.
But of course you could search incrementally for sequences of characters allowed in function names (i.e., must start with a letter or underscore, followed by letters, underscores, digits, etc.) followed by a left parenthesis, or something along those lines.
Keep in mind, however, that any such approach is prone to errors: what if a function is referenced in a comment? What if it appears inside a string constant? Really, to catch all the special cases you would have to (almost) properly parse the full C file.
Most modern regular expression engines have features to parse more than regular languages e.g. by means of back-references to subexpressions. But you shouldn't go down that road. With a proper parser such as ANTLR that can parse context-free languages you'll make your own life a lot easier.
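For a quick-and-dirty pass, the "identifier followed by a left parenthesis" idea could be sketched like this in Ruby. The helper name called_names and the keyword list are assumptions for illustration. It will also report function definitions, declarations, macros, and anything that looks like a call inside comments or string constants.

```ruby
# Hypothetical helper: list identifiers that are directly followed by "(".
def called_names(c_source)
  # Control-flow keywords look like calls syntactically, so drop them.
  keywords = %w[if else for while do switch return sizeof]
  c_source.scan(/\b([A-Za-z_]\w*)\s*\(/)
          .flatten
          .reject { |name| keywords.include?(name) }
end

p called_names("myfunc(anotherfunc(1, 2));")
# => ["myfunc", "anotherfunc"]
```

Since scan continues right after each opening parenthesis, nested calls are picked up without any recursion in the regex itself.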

Tokenize (lex? parse?) a regular expression

Using Ruby I'd like to take a Regexp object (or a String representing a valid regex; your choice) and tokenize it so that I may manipulate certain parts.
Specifically, I'd like to take a regex/string like this:
regex = /var (\w+) = '([^']+)';/
parts = ["foo","bar"]
and create a replacement string that replaces each capture with a literal from the array:
"var foo = 'bar';"
A naïve regex-based approach to parsing the regex, such as:
i = -1
result = regex.source.gsub(/\([^)]+\)/){ parts[i+=1] }
…would fail for things like nested capture groups, or non-capturing groups, or a regex that had a parenthesis inside a character class. Hence my desire to properly break the regex into semantically-valid pieces.
Is there an existing Regex parser available for Ruby? Is there a (horror of horrors) known regex that cleanly matches regexes? Is there a gem I've not found?
The motivation for this question is a desire to find a clean and simple answer to this question.
I have a JavaScript project on GitHub called: Dynamic (?:Regex Highlighting)++ with Javascript! you may want to look at. It parses PCRE compatible regular expressions written in both free-spacing and non-free-spacing modes. Since the regexes are written in the less-feature-rich JavaScript syntax, these regexes could be easily converted to Ruby.
Note that regular expressions may contain arbitrarily nested parentheses structures and JavaScript has no recursive regex features, so the code must parse the tree of nested parens from the inside out. It's a bit tricky but works quite well. Be sure to try it out on the highlighter demo page, where you can input and dynamically highlight any regex. The JavaScript regular expressions used to parse regular expressions are documented here.
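For the specific goal in the question (replacing each capture with a literal), a hand-written scanner over Regexp#source may be enough. The following is only a sketch under several assumptions: fill_captures and capturing? are hypothetical names, only outermost capturing groups are replaced, escapes outside the groups are copied through unchanged, and extended-mode comments, (?#...) groups and Ruby's rule that plain groups stop capturing when named groups are present are all ignored.

```ruby
# True if the "(" at index i opens a numbered or named capturing group.
def capturing?(src, i)
  rest = src[(i + 1)..-1] || ""
  return true unless rest.start_with?("?")
  rest.match?(/\A\?(?:<(?![=!])|')/)  # (?<name>...) or (?'name'...)
end

# Hypothetical helper: replace each outermost capture with the next part.
def fill_captures(regex, parts)
  src = regex.source
  out = String.new
  open_groups = 0   # number of currently open groups
  class_depth = 0   # greater than 0 while inside a [...] character class
  skip_from = nil   # group depth at which the capture being replaced began
  next_part = 0
  i = 0
  while i < src.length
    ch = src[i]
    emit = ch
    step = 1
    if ch == "\\"
      emit = src[i, 2]          # keep the escape together with its character
      step = 2
    elsif class_depth > 0
      class_depth += 1 if ch == "["
      class_depth -= 1 if ch == "]"
    elsif ch == "["
      class_depth = 1
    elsif ch == "("
      open_groups += 1
      if skip_from.nil? && capturing?(src, i)
        out << parts[next_part].to_s
        next_part += 1
        skip_from = open_groups
      end
    elsif ch == ")"
      if skip_from == open_groups
        skip_from = nil
        emit = ""               # the closing paren belongs to the replaced group
      end
      open_groups -= 1
    end
    out << emit if skip_from.nil?
    i += step
  end
  out
end

puts fill_captures(/var (\w+) = '([^']+)';/, ["foo", "bar"])
# prints: var foo = 'bar';
```

Unlike the gsub approach, this leaves non-capturing groups and parentheses inside character classes alone, and treats a nested capture as part of its outermost group.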

Regexp in ruby - can I use parenthesis without grouping?

I have a regexp of the form:
/(something complex and boring)?(something complex and interesting)/
I'm interested in the contents of the second parenthesis; the first ones are there only to ensure a correct match (since the boring part might or might not be present but if it is, I'll match it by accident with the regexp for the interesting part).
So I can access the second match using $2. However, for uniformity with other regexps I'm using, I want $1 to somehow contain the contents of the second parenthesis. Is it possible?
Use a non-capturing group:
r = /(?:ab)?(cd)/
This is a general regexp feature, not one specific to Ruby. Use /(?:something complex and boring)?(something complex and interesting)/ (note the ?:) to achieve this.
By the way, in Ruby 1.9, you can do /(something complex and boring)?(?<interesting>something complex and interesting)/ and access the group with $~[:interesting] ;)
Yup, use the ?: syntax:
/(?:something complex and boring)?(something complex and interesting)/
I'm not a Ruby developer; however, I know other regex flavors. So I bet you can use a non-capturing group:
/(?:something complex and boring)?(something complex and interesting)/
There is only one capturing group, hence $1
HTH
Not really, no. But you can use a named group for uniformity, like this:
/(?<group1>something complex and boring)?(?<group2>something complex and interesting)/
You can change the names (the text in the angle brackets) for the uniformity that you want to achieve. You can then access the groups like this:
string.match(/(?<group1>something complex and boring)?(?<group2>something complex and interesting)/) do |m|
  # Do something with the match; m['group2'] can be used to access the interesting group
end
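To compare the suggestions above side by side, here is a small sketch using the short r = /(?:ab)?(cd)/ example. The sample text and variable names are made up for illustration.

```ruby
text = "abcd"

plain         = /(ab)?(cd)/                  # two capturing groups
non_capturing = /(?:ab)?(cd)/                # only the second part captures
named         = /(?:ab)?(?<interesting>cd)/  # capture addressed by name

plain_match = plain.match(text)
nc_match    = non_capturing.match(text)
named_match = named.match(text)

puts plain_match[2]              # interesting part is group 2 here
puts nc_match[1]                 # interesting part is group 1 here
puts named_match[:interesting]   # interesting part by name
```

All three lines print the same text; only the way the group is addressed changes.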
