Lua regex to match pattern in makefile - makefile

I'm writing a script to automate the maintenance of my makefile. I need a Lua pattern that matches the following lines:
# objects {
objects = build/somefile1.o \
build/somefile2.o \
...
build/somefileN.o \
# } objects
I tried with %# objects %{[a-z%.%s%/%\\]+%# %} objects but it doesn't seem to work.

I suggest using:
"\n(# objects %b{} objects)"
To make it work when the match is at the very start of the string, prepend a newline to the input string. Here, a newline is matched first, then # objects, then a space, then %b{} matches a balanced pair of curly braces (including any nested ones), and finally a space and objects are matched.
When running the extraction, the captured part (within (...)) will be returned with string.gmatch.
See the Lua online demo
s = [[ YOUR_TEXT_HERE ]]
for m in string.gmatch("\n" .. s, "\n(# objects %b{} objects)") do
print(m)
end

Related

TCL/TK script issue with string match inside if-statement

I have a bash script that calls a Tcl script for each element on my network, which performs some actions based on the type of the element. This is the part of the code that checks whether the hostname contains a specific pattern (e.g. *CGN01) and then sends the appropriate command to that machine.
if {[string match "{*CGN01}" $hostname] || $hostname == "AthMet1BG01"} {
expect {
"*#" {send "admin show inventory\r"; send "exit\r"; exp_continue}
eof
}
}
With the code I quoted above I get no error, BUT when the hostname is "PhiMSC1CGN01" the code inside the if is not executed, which means that the expression is not correct.
I have tried everything (using "()", "{}" or "[]" inside the if), but when I don't put "" around the pattern I get an error like:
invalid bareword "string"
in expression "(string match {*DR0* *1TS0* *...";
should be "$string" or "{string}" or "string(...)" or ...
(parsing expression "(string match {*DR0* *...")
invoked from within
"if {$hostname == "AthMar1BG03" || [string match *CGN01 $hostname]...
or this:
expected boolean value but got "[string match -nocase "*CGN01" $hostname]==0"
while executing
"if {$hostname == "AthMar1BG03" || {[string match -nocase "*CGN01" $hostname]==0}...
when i tried to use ==0 or ==1 on the expression.
My Tcl version is 8.3 and I can't update it because the machine has no internet connectivity :(
Please help me, I have been trying to fix this for over a month...
If you want to match a string that is either exactly AthMet1BG01 or any string that ends with CGN01, you should use
if {[string match *CGN01 $hostname] || $hostname == "AthMet1BG01"} {
(For Tcl 8.4 or later, use eq instead of ==. Your Tcl 8.3 doesn't have eq, so keep == there.)
Some comments on your attempts:
(The notes about the expression language used by if go for expr and while as well. It is fully described in the documentation for expr.)
To invoke a command inside the condition and substitute its result, it needs to be enclosed in brackets ([ ]). Parentheses (( )) can be used to set the priority of subexpressions within the condition, but don't indicate a command substitution.
Normally, inside the condition strings need to be enclosed in double quotes or braces ({ }). This is because the expression language that is used to express the condition needs to distinguish between e.g. numbers and strings, which Tcl in general doesn't. Inside a command substitution within a condition, you don't need to use quotes or braces, as long as there are no characters in the string that you need to quote.
The string {abc} contains the characters abc. The string "{abc}" contains the characters {abc}, because the double quotes make the braces normal characters (the reverse also holds). [string match "{*bar}" $str] matches the string {foobar} (with the braces as part of the text), but not foobar.
If you put braces around a command substitution, {[incr foo]}, it becomes just the string [incr foo], i.e. the command isn't invoked and no substitution is made. If you use {[incr foo]==1} you get the string [incr foo]==1. The correct way to write this within an expression is [incr foo]==1, with optional whitespace around the ==.
All this is kind of hard to grok, but once you have, it is really easy to use. Tcl is stubborn as a mule about interpreting strings, but carries heavy loads if you treat her right.
ETA an alternate matcher (see comments)
You can write your own alternate string matcher:
proc altmatch {patterns string} {
foreach pattern $patterns {
if {[string match $pattern $string]} {
return 1
}
}
return 0
}
If any of the patterns match, you get 1; if none of the patterns match, you get 0.
% altmatch {*bar f?o} foobar
1
% altmatch {*bar f?o} fao
1
% altmatch {*bar f?o} foa
0
For those who have a modern Tcl version, you can actually add it to the string ensemble so it works like other string commands. Put it in the right namespace:
proc ::tcl::string::altmatch {patterns string} {
... as before ...
and install it like this:
% set map [namespace ensemble configure string -map]
% dict set map altmatch ::tcl::string::altmatch
% namespace ensemble configure string -map $map
Documentation:
expr,
string,
Summary of Tcl language syntax
This command:
if {[string match "{*CGN01}" $hostname] || $hostname == "AthMet1BG01"} {
is syntactically valid but I really don't think that you want to use that pattern with string match. I'd guess that you really want:
if {[string match "*CGN01" $hostname] || $hostname == "AthMet1BG01"} {
The {braces} inside that pattern are not actually meaningful (string match only does a subset of the full capabilities of a glob match) so with your erroneous pattern you're actually trying to match a { at the start of $hostname, any number of characters, and then CGN01} at the end of $hostname. With the literal braces. Simply removing the braces lets PhiMSC1CGN01 match.

Capturing groups don't work as expected with Ruby scan method

I need to get an array of floats (both positive and negative) from the multiline string. E.g.: -45.124, 1124.325 etc
Here's what I do:
text.scan(/(\+|\-)?\d+(\.\d+)?/)
Although it works fine on regex101 (capturing group 0 matches everything I need), it doesn't work in Ruby code.
Any ideas why it's happening and how I can improve that?
See scan documentation:
If the pattern contains no groups, each individual result consists of the matched string, $&. If the pattern contains groups, each individual result is itself an array containing one entry per group.
You should remove capturing groups (if they are redundant), or make them non-capturing (if you just need to group a sequence of patterns to be able to quantify them), or use extra code/group in case a capturing group cannot be avoided.
In this scenario, the capturing group is only used to quantify a pattern sequence, so all you need to do is convert the capturing group into a non-capturing one by replacing every unescaped ( with (?: (there is only one occurrence here, since the sign group is rewritten as the character class [+-]):
text = " -45.124, 1124.325"
puts text.scan(/[+-]?\d+(?:\.\d+)?/)
See demo, output:
-45.124
1124.325
Well, if you need to also match floats like .04 you can use [+-]?\d*\.?\d+. See another demo
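To make the difference concrete, here is a small sketch run on the sample string from the answer. The helper name extract_floats and the last input string are made up for illustration; the to_f conversion is added only because the question asks for floats rather than strings.

```ruby
# Hypothetical helper (not from the original answer): returns the signed
# decimals found in text, converted to Float.
def extract_floats(text)
  # (?:...) groups without capturing, so scan yields the whole matched strings.
  text.scan(/[+-]?\d+(?:\.\d+)?/).map(&:to_f)
end

sample = " -45.124, 1124.325"

# Capturing groups: each result is an array of the groups, not the whole match.
p sample.scan(/(\+|\-)?\d+(\.\d+)?/)   # => [["-", ".124"], [nil, ".325"]]

# Non-capturing group: each result is the matched string, here turned into a Float.
p extract_floats(sample)               # => [-45.124, 1124.325]

# The variant that also accepts a missing integer part (made-up input).
p ".04 -3 12.5".scan(/[+-]?\d*\.?\d+/) # => [".04", "-3", "12.5"]
```

The first call shows what the question ran into: the whole match is still found, but scan reports only the group contents.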
There are cases when you cannot get rid of a capturing group, e.g. when the regex contains a backreference to a capturing group. In that case, you may either a) declare a variable to store all matches and collect them inside a scan block, b) enclose the whole pattern in another capturing group and map the results to get the first item from each match, or c) call gsub with just a regex as its single argument, which returns an Enumerator, and use .to_a to get the array of matches:
text = "11234566666678"
# Variant a:
results = []
text.scan(/(\d)\1+/) { results << Regexp.last_match(0) }
p results # => ["11", "666666"]
# Variant b:
p text.scan(/((\d)\2+)/).map(&:first) # => ["11", "666666"]
# Variant c:
p text.gsub(/(\d)\1+/).to_a # => ["11", "666666"]
See this Ruby demo.
([+-]?\d+\.\d+)
assumes there is a leading digit before the decimal point
see demo at Rubular
If you need capture groups for a complex pattern match, but want the entire expression returned by .scan, this can work for you.
Suppose you want to get the image urls in this string perhaps from a markdown text with html image tags:
str = %(
Before
<img src="https://images.zenhubusercontent.com/11223344e051aa2c30577d9d17/110459e6-915b-47cd-9d2c-1842z4b73d71">
After
<img src="https://user-images.githubusercontent.com/111222333/75255445-f59fb800-57af-11ea-9b7a-a235b84bf150.png">).strip
You may have a regular expression defined to match just the urls, and maybe used a Rubular example like this to build/test your Regexp
image_regex =
/https\:\/\/(user-)?images.(githubusercontent|zenhubusercontent).com.*\b/
Now, if you don't need each sub-capture group but just the entire expression from your .scan, you can wrap the whole pattern inside a capture group and use it like this:
image_regex =
/(https\:\/\/(user-)?images.(githubusercontent|zenhubusercontent).com.*\b)/
str.scan(image_regex).map(&:first)
=> ["https://images.zenhubusercontent.com/11223344e051aa2c30577d9d17/110459e6-915b-47cd-9d2c-1842z4b73d71",
"https://user-images.githubusercontent.com/111222333/75255445-f59fb800-57af-11ea-9b7a-a235b84bf150.png"]
How does this actually work?
Since you have 3 capture groups, .scan alone will return an array of arrays, each holding one entry per capture group:
str.scan(image_regex)
=> [["https://images.zenhubusercontent.com/11223344e051aa2c30577d9d17/110459e6-915b-47cd-9d2c-1842z4b73d71", nil, "zenhubusercontent"],
["https://user-images.githubusercontent.com/111222333/75255445-f59fb800-57af-11ea-9b7a-a235b84bf150.png", "user-", "githubusercontent"]]
Since we only want the 1st (outer) capture group, we can just call .map(&:first)
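If the inner groups are not needed at all, another option (in line with the non-capturing-group advice earlier on this page) is to avoid capturing in the first place, so that scan returns the URLs directly. This is only a sketch: the helper name, the constant and the sample URLs are invented, the dots are escaped, and the tail [^"\s]* is swapped in for the original .*\b so a match stops at the closing quote.

```ruby
# Hypothetical rewrite of the answer's regex with non-capturing groups.
# Assumption: a URL ends at the first double quote or whitespace character.
IMAGE_URL = %r{https://(?:user-)?images\.(?:githubusercontent|zenhubusercontent)\.com[^"\s]*}

# Returns every matching image URL as a plain string.
def image_urls(html)
  html.scan(IMAGE_URL)
end

# Made-up sample input, not the URLs from the answer.
html = '<img src="https://images.zenhubusercontent.com/abc/1"> ' \
       '<img src="https://user-images.githubusercontent.com/42/x.png">'

p image_urls(html)
# => ["https://images.zenhubusercontent.com/abc/1", "https://user-images.githubusercontent.com/42/x.png"]
```

With no capturing groups in the pattern there is nothing to map over afterwards.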

Is backreference available in Parslet?

Is there a way to backreference a previous string in parslet similarly to the \1 functionality in typical regular expressions ?
I want to extract the characters within a block such as:
Marker SomeName
some random text, numbers123
and symbols !#%
SomeName
in which "Marker" is a known string but "SomeName" is not known a-priori, so I believe I need something like:
rule(:name) { ( match('\w') >> match('\w\d') ).repeat(1) }
rule(:text_within_the_block) {
str('Marker') >> name >> any.repeat.as(:text_block) >> backreference_to_name
}
What I don't know is how to write the backreference_to_name rule using Parslet and/or Ruby language.
From http://kschiess.github.io/parslet/parser.html
Capturing input
Sometimes a parser needs to match against something that was already
matched against. Think about Ruby heredocs for example:
str = <<-HERE
This is part of the heredoc.
HERE
The key to matching this kind of document is to capture part of the
input first and then construct the rest of the parser based on the
captured part. This is what it looks like in its simplest form:
match['ab'].capture(:capt) >> # create the capture
dynamic { |s,c| str(c.captures[:capt]) } # and match using the capture
The key here is that the dynamic block returns a lazy parser. It's only evaluated at the point where it's used, and it gets passed its current context to reference at the point of execution.
-- Updated : To add a worked example --
So for your example:
require 'parslet'
require 'parslet/convenience'
class Mini < Parslet::Parser
rule(:name) { match("[a-zA-Z]") >> match('\\w').repeat }
rule(:text_within_the_block) {
str('Marker ') >>
name.capture(:namez).as(:name) >>
str(" ") >>
dynamic { |_,scope|
(str(scope.captures[:namez]).absent? >> any).repeat
}.as(:text_block) >>
dynamic { |src,scope| str(scope.captures[:namez]) }
}
root (:text_within_the_block)
end
puts Mini.new.parse_with_debug("Marker BOB some text BOB").inspect
#=> {:name=>"BOB"@7, :text_block=>"some text "@11}
This required a couple of changes.
I changed rule(:name) to match a single word and added a str(" ") to detect that word had ended. (Note: \w is short for [A-Za-z0-9_] so it includes digits)
I changed the "any" match to be conditional on the text not being the :name text. (otherwise it consumes the 'BOB' and then fails to match, ie. it's greedy!)
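For comparison only, here is roughly what the same idea looks like with the plain \1 backreference the question mentions, using a Ruby Regexp rather than Parslet. It is a sketch under assumptions: the name is a single \w+ word on the Marker line, the closing name sits on a line of its own, and the helper and constant names are made up.

```ruby
# Plain-Regexp analogue (not Parslet): capture the name, lazily consume the
# block, and stop at a line consisting of the captured name (\1).
BLOCK = /^Marker (\w+)\n(.*?)\n\1$/m

# Hypothetical helper: returns the name and the enclosed text, or nil.
def extract_block(text)
  m = BLOCK.match(text)
  m && { name: m[1], text_block: m[2] }
end

text = <<~'TEXT'
  Marker SomeName
  some random text, numbers123
  and symbols !#%
  SomeName
TEXT

result = extract_block(text)
puts result[:name]      # prints: SomeName
p result[:text_block]   # => "some random text, numbers123\nand symbols !#%"
```

The lazy .*? plays the same role as the absent? check in the Parslet version: it stops the block from swallowing the closing name.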
I don't exactly want to support stackoverflow, but as you seem to be a parslet user, here goes: Try asking on the mailing list for a real nice answer. (http://dir.gmane.org/gmane.comp.lang.ruby.parslet)
What you call a back-reference here is called a 'capture' in parslet. Please see the example 'capture.rb' in parslet's source tree.

Ruby Regexp - Matching multiple result when within markup

I have the following string:
nothing to match
<-
this rocks should match as should this still and this rocks and still
->
should not match still or rocks
<- no matches here ->
And i want to find all matches of 'rocks' and 'still', but only when they are within <- ->
The purpose is to markup glossary words but be able to only mark them up in areas of text that are defined by the editor.
I currently have:
<-.*?(rocks|still).*?->
This unfortunately only matches the first 'rocks' and ignores all subsequent instances and all the 'still's
I have this in a Rubular
The usage of this will be something like
Regexp.new( '<-.*?(' + self.all.map{ |gt| gt.name }.join("|") + ').*?->', Regexp::IGNORECASE | Regexp::MULTILINE )
Thanks in advance for any help
There may be a way to do this with a single regex, but it will probably be simpler to just do it in two steps. First match all of the markups, and then search the markups for the glossary words:
text = <<END
nothing to match
<-
this rocks should match as should this still and this rocks and still
->
should not match still or rocks
<- no matches here ->
END
text.scan(/<-.*?->/m).each do |match|
print match.scan(/rocks|still/), "\n"
end
Also, you should probably note that regex is only a good solution here if there is never any nested markup (<-...<-...->...->) and no escaped <- or -> whether it is inside or outside of a markup.
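Since the stated goal is to mark up glossary words, the two-step idea can also be written as a nested gsub. This is a minimal sketch, assuming no nested or escaped markers as noted above; the method name, the [word] output format and the sample text are made up, and Regexp.union is used to build the alternation from the list of names.

```ruby
# Hypothetical helper: wraps each glossary term in [ ] but only inside <- ... -> blocks.
def mark_glossary_terms(text, terms)
  # Regexp.union escapes each term and joins them with |; rebuild it case-insensitively.
  term_regex = Regexp.new(Regexp.union(terms).source, Regexp::IGNORECASE)

  # Step 1: find each <- ... -> block (the m flag lets . match newlines).
  text.gsub(/<-.*?->/m) do |block|
    # Step 2: mark up the terms inside that block only.
    block.gsub(term_regex) { |word| "[#{word}]" }
  end
end

text = "still out <- this Rocks and still -> rocks out <- none ->"
puts mark_glossary_terms(text, %w[rocks still])
# prints: still out <- this [Rocks] and [still] -> rocks out <- none ->
```

Text outside the markers is returned unchanged because the outer gsub never touches it.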
Don't forget your Ruby string methods. Use them first before considering regular expressions
$ ruby -0777 -ne '$_.split("->").each{|x| x.split("<-").each{|y| puts "#{y}" if (y[/rocks.*still/]) } }' file
In Ruby, it depends on what you want to do with the regexp. You're matching a regular expression against a string, so you'll be using String methods. Certain of these will have an effect on all matches (e.g. gsub or rpartition); others will have an effect on only the first match (e.g. rindex, =~).
If you're working with any of the latter (that return only the first match), you'll want to make use of a loop that calls the method again, starting from a certain offset. For example:
# A method to print the indices of all matches
def print_match_indices(string, regex)
  # index searches forward from the given offset (rindex would search backward)
  i = string.index(regex, 0)
  until i.nil?
    puts i
    i = string.index(regex, i + 1)
  end
end
(Yes, you can use split first, but I expect that a regex loop like the foregoing would require fewer system resources.)

Ruby MatchData class is repeating captures, instead of including additional captures as it "should"

Ruby 1.9.1, OSX 10.5.8
I'm trying to write a simple app that parses through a bunch of Java-based HTML template files to replace a period (.) with an underscore if it's contained within a specific tag. I use Ruby all the time for these types of utility apps, and thought it would be no problem to whip up something using Ruby's regex support. So, I create a Regexp.new... object, open a file, read it in line by line, then match each line against the pattern. If I get a match, I create a new string using replaceString = currentMatch.gsub(/\./, '_'), then create another replacement as a whole string by newReplaceRegex = Regexp.escape(currentMatch), and finally replace back into the current line with line.gsub(newReplaceRegex, replaceString). Code below, of course, but first...
The problem I'm having is that when accessing the indexes within the returned MatchData object, I'm getting the first result twice, and it's missing the second sub string it should otherwise be finding. More strange, is that when testing this same pattern and same test text using rubular.com, it works as expected. See results here
My pattern:
(<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+.)+(?:[a-zA-Z0-9]+)(?:>))
Test text:
<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>moreNonMatchingText<WEBOBJECT NAME=admin.SecondLineMatch>AndEvenMoreNonMatchingText
Here's the relevant code:
tagRegex = Regexp.new('(<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+\.)+(?:[a-zA-Z0-9]+)(?:>))+')
testFile = File.open('RegexTestingCompFix.txt', "r+")
lineCount=0
testFile.each{|htmlLine|
lineCount += 1
puts ("Current line: #{htmlLine} at line num: #{lineCount}")
tagMatch = tagRegex.match(htmlLine)
if(tagMatch)
matchesArray = tagMatch.to_a
firstMatch = matchesArray[0]
secondMatch = matchesArray[1]
puts "First match: #{firstMatch} and second match #{secondMatch}"
tagMatch.captures.each {|lineMatchCapture|
puts "Current capture for tagMatches: #{lineMatchCapture} of total match count #{matchesArray.size}"
#create a new regex using the match results; make sure to use auto escape method
originalPatternString = Regexp.escape(lineMatchCapture)
replacementRegex = Regexp.new(originalPatternString)
#replace any periods with underscores in a copy of lineMatchCapture
periodToUnderscoreCorrection = lineMatchCapture.gsub(/\./, '_')
#replace original match with underscore replaced copy within line
htmlLine.gsub!(replacementRegex, periodToUnderscoreCorrection)
puts "The modified htmlLine is now: #{htmlLine}"
}
end
}
I would think that I should get the first tag in matchData[0] and then the second tag in matchData[1], or, what I'm really doing because I don't know how many matches I'll get within any given line, matchData.to_a.each. And in this case, matchData has two captures, but they're both the first tag match
which is: <WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>
So, what the heck am I doing wrong, why does rubular test give me the expected results?
You want to use String#scan instead of Regexp#match:
tag_regex = /<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+\.)+(?:[a-zA-Z0-9]+)(?:>)/
lines = "<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>moreNonMatchingText\
<WEBOBJECT NAME=admin.SecondLineMatch>AndEvenMoreNonMatchingText"
lines.scan(tag_regex)
# => ["<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>", "<WEBOBJECT NAME=admin.SecondLineMatch>"]
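For the actual goal in the question (turning the periods inside each tag into underscores), one possible shortcut is gsub with a block, which hands each matched tag to the block and substitutes whatever the block returns. This is a sketch rather than the asker's code: the names are invented, and \w with the i flag is used in place of the explicit character classes, which also lets underscores through.

```ruby
# Assumption: a tag looks like <WEBOBJECT NAME=a.b.c> with word characters only.
TAG_REGEX = /<webobject name=(?:\w+\.)+\w+>/i

# Hypothetical helper: replaces periods with underscores inside matching tags only.
def underscore_tag_names(line)
  line.gsub(TAG_REGEX) { |tag| tag.tr(".", "_") }
end

line = "<WEBOBJECT NAME=admin.normalMode.more>keep.this<WEBOBJECT NAME=admin.SecondLineMatch>and.this"
puts underscore_tag_names(line)
# prints: <WEBOBJECT NAME=admin_normalMode_more>keep.this<WEBOBJECT NAME=admin_SecondLineMatch>and.this
```

This avoids building a second, escaped regex per match, because the replacement happens at the point where the tag is found.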
A few recommendations for next ruby questions:
newlines and spaces are your friends, you don't lose points for using more lines in your code ;-)
use do-end on blocks instead of {}, improves readability a lot
declare variables in snake case (hello_world) instead of camel case (helloWorld)
Hope this helps
I ended up using the String.scan approach, the only tricky point there was figuring out that this returns an array of arrays, not a MatchData object, so there was some initial confusion on my part, mostly due to my ruby green-ness, but it's working as expected now. Also, I trimmed the regex per Trevoke's suggestion. But snake case? Never...;-) Anyway, here goes:
tagRegex = /(<(?:webobject) (?:name)=(?:\w+\.)+(?:\w+)(?:>))/i
testFile = File.open('RegexTestingCompFix.txt', "r+")
lineCount=0
testFile.each do |htmlLine|
lineCount += 1
puts ("Current line: #{htmlLine} at line num: #{lineCount}")
oldMatches = htmlLine.scan(tagRegex) #oldMatches thusly named due to not explicitly using Regexp or MatchData, as in "the old way..."
if(oldMatches.size > 0)
oldMatches.each_index do |index|
arrayMatch = oldMatches[index]
aMatch = arrayMatch[0]
#create a new regex using the match results; make sure to use auto escape method
replacementRegex = Regexp.new(Regexp.escape(aMatch))
#replace any periods with underscores in a copy of lineMatchCapture
periodToUnderscoreCorrection = aMatch.gsub(/\./, '_')
#replace original match with underscore replaced copy within line, matching against the new escaped literal regex
htmlLine.gsub!(replacementRegex, periodToUnderscoreCorrection)
puts "The modified htmlLine is now: #{htmlLine}"
end # I kind of still prefer the brackets...;-)
end
end
Now, why does MatchData work the way it does? It seems like its behavior is a bug really, and certainly not very useful in general if you can't get it to provide a simple means of accessing all the matches. Just my $.02
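A likely explanation, sketched on a made-up line: Regexp#match only ever reports the first match, and in the resulting MatchData element 0 is the whole match while the following elements are its capture groups. When the entire pattern is wrapped in one group, elements 0 and 1 are therefore the same text. The <b> tags and method names below are illustrative only.

```ruby
# The whole pattern is wrapped in a single capturing group, as in the question.
TAG = /(<b>\w+<\/b>)/

# MatchData#to_a is [whole match, group 1, ...] for the FIRST match only.
def match_parts(line)
  TAG.match(line).to_a
end

# scan walks the entire string; with one group each result is a one-element array.
def all_tags(line)
  line.scan(TAG).flatten
end

line = "<b>one</b> and <b>two</b>"
p match_parts(line)  # => ["<b>one</b>", "<b>one</b>"]
p all_tags(line)     # => ["<b>one</b>", "<b>two</b>"]
```

So the repeated entry is the whole match followed by its own group, not two separate matches.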
Small bits:
This regexp helps you get "normalMode" .. But not "secondLineMatch":
<webobject name=\w+\.((?:\w+)).+> (with option 'i', for "case insensitive")
This regexp helps you get "secondLineMatch" ... But not "normalMode":
<webobject name=\w+\.((?:\w+))> (with option 'i', for "case insensitive").
I'm not really good at regexps but I'll keep toiling at it.. :)
And I don't know if this helps you at all, but here's a way to get both:
<webobject name=admin.(\w+) (with option 'i').
