Why does gsub's '\1' capture group produce this string? - ruby

I am confused as to why I am capturing this pattern via '\1' grouping. I am capturing two digits at a time, but why does it skip here:
"123 456 789".gsub(/(\d)(\d)/, '\1')
=> "13 46 79"
I can understand that '\0' gives me the original string:
"123 456 789".gsub(/(\d)(\d)/, '\0')
=> "123 456 789"
This also confuses me, but I can understand '\2' once I learn what '\1' is doing:
"123 456 789".gsub(/(\d)(\d)/, '\2')
=> "23 56 89"

The regex matches "12", "45", "78", and gsub replaces them with "1", "4", "7", respectively, giving "13 46 79".
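To see this concretely, here is a small sketch (the constant and method names are only illustrative, not from the question). scan lists what each match captured, and the block form of gsub shows that every whole two-digit match is replaced by just one of its groups, while the third digit is never matched and so stays in place.

```ruby
PAIR = /(\d)(\d)/

# What each match captured: two digits per match; the third digit of each
# triple is not part of any match.
def captured_pairs(s)
  s.scan(PAIR)
end

# Replace every whole match with one of its groups (0 means the whole match).
def keep_group(s, n)
  s.gsub(PAIR) { Regexp.last_match(n) }
end

p captured_pairs("123 456 789")    # [["1", "2"], ["4", "5"], ["7", "8"]]
puts keep_group("123 456 789", 1)  # 13 46 79
puts keep_group("123 456 789", 2)  # 23 56 89
puts keep_group("123 456 789", 0)  # 123 456 789
```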

To obtain 12 45 78, you need to use
(\d)\d\b
And replace with \1.
Here, we match a digit and capture it ((\d)), then we match another digit (with \d) that is right before a word boundary \b.
Example:
puts "123 456 789".gsub(/(\d)\d\b/, '\1')
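As a rough check of which digits the word-boundary version consumes, the sketch below (method names are hypothetical) lists the full matches and then applies the replacement. Only the last two digits of each number are matched, so replacing them with group 1 drops the final digit.

```ruby
# Full text of each match: two digits that end at a word boundary.
def boundary_matches(s)
  s.scan(/\d\d\b/)
end

# Keep the first of those two digits, dropping the digit before the boundary.
def drop_last_digit(s)
  s.gsub(/(\d)\d\b/, '\1')
end

p boundary_matches("123 456 789")   # ["23", "56", "89"]
puts drop_last_digit("123 456 789") # 12 45 78
```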

Related

Sed print the items in square brackets that's after the occurence of a text

I have the following Scenarios:
Scenario 1
foo_bar = ["123", "456", "789"]
Scenario 2
foo_bar = [
"123",
"456",
"789"
]
Scenario 3
variable "foo_bar" {
type = list(string)
default = ["123", "456", "789"]
}
So I'm trying to figure out how to print, with sed, the items inside the brackets that belong to foo_bar, including scenario 2, where the list spans multiple lines.
The resulting matches here would be
Scenario 1
"123", "456", "789"
Scenario 2
"123",
"456",
"789"
Scenario 3
"123", "456", "789"
In the case of
not_foo_bar = [
"123",
"456",
"789"
]
This should not match; only foo_bar should match.
This is what I've tried so far
sed -e '1,/foo_bar/d' -e '/]/,$d' test.tf
And this
sed -n 's/.*\foo_bar\(.*\)\].*/\1/p' test.tf
This is a mouthful, but it’s POSIX sed and works.
sed -Ene \
'# scenario 1
s/(([^[:alnum:]_]|^)foo_bar[^[:alnum:]_][[:space:]]*=[[:space:]]*\[)([^]]+)(\]$)/\3/p
# scenario 2 and 3
/([^[:alnum:]_]|^)foo_bar[^[:alnum:]_][[:space:]]*=?[[:space:]]*[[{][[:space:]]*$/,/^[]}]$/ {
//!p
s/(([^[:alnum:]_]|^)default[^[:alnum:]_][[:space:]]*=[[:space:]]*\[)([^]]+)(\]$)/\3/p
}' |
# filter out unwanted lines from scenario 3 ("type =")
sed -n '/^[[:space:]]*"/p'
I couldn’t quite get it all in a single sed.
The first and last lines of the first sed are the same command (using default instead of foo_bar).
edit: in case it confuses someone, I left in that last [[:space:]]*, in the second really long regex, by mistake. I won’t edit it, but it’s not vital, nor consistent - I didn’t allow for any trailing whitespace in line ends in other patterns.
This might work for you (GNU sed):
sed -En '/foo_bar/{:a;/.*\[([^]]*)\].*/!{N;ba};s//\1/p}' file
Turn off implicit printing (-n) and turn on extended regular expressions (-E).
Pattern match on foo_bar, then gather up line(s) between the next [ and ] and print the result.
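For comparison only, the same "find foo_bar, then take what is between the next [ and ]" idea can be sketched in Ruby. This is not part of either sed answer; the helper name foo_bar_items is made up, and it assumes the first [ after foo_bar is the list you want.

```ruby
# Hypothetical helper: returns the text between [ and ] that follows the
# identifier foo_bar, or nil when foo_bar does not occur.
def foo_bar_items(text)
  # \b keeps not_foo_bar from matching, because "_" counts as a word character.
  # The negated classes [^\[] and [^\]] also match newlines, so a list that
  # spans several lines is covered without extra flags.
  text[/\bfoo_bar\b[^\[]*\[([^\]]*)\]/, 1]&.strip
end

scenario_two = "foo_bar = [\n\"123\",\n\"456\",\n\"789\"\n]"
puts foo_bar_items(scenario_two)
```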

How do I write a regex that captures the first non-numeric part of string that also doesn't include 3 or more spaces?

I'm using Ruby 2.4. I want to extract from a string the first run of non-numeric characters that does not contain three or more consecutive spaces. For example, in this string
str = "123 aa bb cc 33 dd"
The first such occurrence is " aa bb ". I thought the below expression would help me
data.split(/[[:space:]][[:space:]][[:space:]]+/).first[/\p{L}\D+\p{L}\p{L}/i]
but if the string is "123 456 aaa", it fails to return " aaa", which I would want it to.
r = /
(?: # begin non-capture group
[ ]{,2} # match 0, 1 or 2 spaces
[^[ ]\d]+ # match 1+ characters that are neither spaces nor digits
)+ # end non-capture group and perform 1+ times
[ ]{,2} # match 0, 1 or 2 spaces
/x # free-spacing regex definition mode
str = "123 aa bb cc 33 dd"
str[r] #=> " aa bb "
Note that [ ] could be replaced by a space if free-spacing regex definition mode is not used:
r = /(?: {,2}[^ \d]+)+ {,2}/
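A small sketch of how this regex behaves, assuming an input in which bb and cc are separated by exactly three spaces (the constant and method names are illustrative). Because the final {,2} is greedy, up to two trailing spaces are part of the match; call strip on the result if you do not want them.

```ruby
RUN = /(?: {,2}[^ \d]+)+ {,2}/

# First run of non-digit text whose internal gaps are at most two spaces wide.
def first_run(s)
  s[RUN]
end

p first_run("123 aa bb   cc 33 dd")  # " aa bb  " (two of the three spaces)
p first_run("123 456 aaa")           # " aaa"
```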
Remove all digits + spaces from the start of a string. Then split with 3 or more whitespaces and grab the first item.
def parse_it(s)
s[/\A(?:[\d[:space:]]*\d)?(\D+)/, 1].split(/[[:space:]]{3,}/).first
end
puts parse_it("123 aa bb cc 33 dd")
# => aa bb
puts parse_it("123 456 aaa")
# => aaa
The first regex \A(?:[\d[:space:]]*\d)?(\D+) matches:
\A - start of a string
(?:[\d[:space:]]*\d)? - an optional sequence of:
[\d[:space:]]* - 0+ digits or whitespaces
\d - a digit
(\D+) - Group 1, capturing 1 or more non-digits
The splitting regex is [[:space:]]{3,}, which matches 3 or more whitespace characters.
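The two steps can be looked at separately. The sketch below uses hypothetical method names and assumes an input with three spaces between bb and cc; it shows the intermediate value after each regex.

```ruby
# Step 1: skip leading digits and whitespace, then capture the non-digit run.
def leading_non_digits(s)
  s[/\A(?:[\d[:space:]]*\d)?(\D+)/, 1]
end

# Step 2: cut that run wherever there are three or more whitespace characters.
def split_on_wide_gaps(chunk)
  chunk.split(/[[:space:]]{3,}/)
end

chunk = leading_non_digits("123 aa bb   cc 33 dd")
p chunk                            # " aa bb   cc "
p split_on_wide_gaps(chunk)        # [" aa bb", "cc "]
p split_on_wide_gaps(chunk).first  # " aa bb"
```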
It looks like this'd do it:
regex = /(?: {1,2}[[:alpha:]]{2,})+/
"123 aa bb cc 33 dd"[regex] # => " aa bb"
"123 456 aaa"[regex] # => " aaa"
(?: ... ) is a non-capturing group.
{1,2} means "find at least one, and at most two".
[[:alpha:]] is a POSIX definition for alphabet characters. It's more comprehensive than [a-z].
You should be able to figure out the rest, which is all documented in the Regexp documentation and String's [] documentation.
Will this work?
str.match(/(?: ?)?(?:[^ 0-9]+(?: ?)?)+/)[0]
or apparently
str[/(?: ?)?(?:[^ 0-9]+(?: ?)?)+/]
or using Cary's nice space match,
str[/ {,2}(?:[^ 0-9]+ {,2})+/]

expr does not return the pattern if not at the beginning of the string

Using this version of bash:
GNU bash, version 4.1.2(1)-release (i386-redhat-linux-gnu)
How can I get expr to find my pattern within a string, if the pattern I'm looking for does not begin this string?
Example:
expr match "123 abc 456 def ghi789" '\([0-9]*\)' #returns 123 as expected
expr match "z 123 abc 456 def ghi789" '\([0-9]*\)' #returns nothing
In the second example, I would expect 123 to be returned.
Further analysis:
If I start from the end of the string by adding .* in my command, I get a weird result:
expr match "123 abc 456 def ghi789" '.*\([0-9]*\)' #returns nothing
expr match "123 abc 456 def ghi789" '.*\([0-9]\)' #returns 9 as expected
expr match "123 abc 456 def ghi789 z" '.*\([0-9]\)' #returns also 9
Here, it seems that the pattern can be found at the end of the string (so at the beginning of my search), and also if it's not at the end of the string. But it does not work if I add the * at the end of the regular expression.
On the other hand, the same does not apply if I start from the beginning of my string:
expr match "z 123 abc 456 def ghi789" '\([0-9]\)' #returns nothing
I think I must misunderstand something obvious, but I cannot find what.
Thank you for your help :)
Would
expr match "123 abc 456 def ghi789 z" '[^0-9]*\([0-9]*\)'
do it? (Just added [^0-9]* instead of .* at the beginning)
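As mentioned in the comments, the expression
expr match "123 abc 456 def ghi789 z" '^[^0-9]*\([0-9]*\)'
would fit better, because the part ^[^0-9]* can be read as "skip all characters that are not digits ([^0-9]), starting from the beginning (the ^ as first character)". Note that expr match always anchors the pattern at the start of the string, so the ^ only makes that explicit; this anchoring is also why '\([0-9]*\)' alone finds nothing when the string starts with z.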
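The anchoring behaviour can be imitated in Ruby to make the results in the question easier to follow. This is only an analogy: anchored_capture is a made-up helper, and the patterns use Ruby syntax (unescaped parentheses) rather than the basic regular expressions that expr expects.

```ruby
# Hypothetical stand-in for `expr match STRING REGEX`: the pattern is anchored
# at the start of the string and the first group is returned ("" if no match).
def anchored_capture(string, pattern)
  string[/\A(?:#{pattern})/, 1].to_s
end

s = "123 abc 456 def ghi789"
p anchored_capture(s, '([0-9]*)')                    # "123"
p anchored_capture("z " + s, '([0-9]*)')             # "" (zero digits at the start)
p anchored_capture(s, '.*([0-9]*)')                  # "" (greedy .* takes everything)
p anchored_capture(s, '.*([0-9])')                   # "9"
p anchored_capture("z " + s, '[^0-9]*([0-9]*)')      # "123"
```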

Is it possible in Ruby to print a part of a regex (group) and instead of the whole matched substring?

Is it possible in sed, or maybe even in Ruby, to remember the matched part of a pattern (a group) and print it instead of the whole string that was matched?
"aaaaaa bbb ccc".strip.gsub(/([a-z])+/, \1) # \1 as a part of the regex which I would like to remember and print then instead of the matched string.
# => "a b c"
I think in sed it should be possible with its h (hold space) command or similar, but what about Ruby?
Not sure what you mean. Here are a few examples:
print "aaaaaa bbb ccc".strip.gsub(/([a-z])+/, '\1')
# => "a b c"
And,
print "aaaaaa bbb ccc".strip.scan(/([a-z])+/).flatten
# => ["a", "b", "c"]
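One detail worth knowing about these examples: when a capturing group is repeated, as in ([a-z])+, the group holds the text of its last repetition. With "aaaaaa bbb ccc" that happens to equal the first letter, but with mixed letters it does not. A short sketch (method names are illustrative):

```ruby
# Repeated group: \1 is the LAST letter of each run.
def last_of_each_run(s)
  s.gsub(/([a-z])+/, '\1')
end

# Capture once, then match the rest outside the group: \1 is the FIRST letter.
def first_of_each_run(s)
  s.gsub(/([a-z])[a-z]*/, '\1')
end

puts last_of_each_run("abc def")          # c f
puts first_of_each_run("abc def")         # a d
puts first_of_each_run("aaaaaa bbb ccc")  # a b c
```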
The shortest answer is grep:
echo "aaaaaa bbb ccc" | grep -o '\<.'
You can do:
a = "aaaaaa bbb ccc".split
and then join that array back together with the first character of each element
[a[0][0,1], a[1][0,1], a[2][0,1], a[3][0,1], ... ].join(" ")
#glennjackman's suggestion: ruby -ne 'puts $_.split.map {|w| w[0]}.join(" ")'

Ruby scan Regular Expression

I'm trying to split the string:
"[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"
into the following array:
[
["test","blah"]
["foo","bar bar bar"]
["test","abc","123","456 789"]
]
I tried the following, but it isn't quite right:
"[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"
.scan(/\[(.*?)\s*\|\s*(.*?)\]/)
# =>
# [
# ["test", "blah"]
# ["foo", "bar bar bar"]
# ["test", "abc |123 | 456 789"]
# ]
I need to split at every pipe instead of the first pipe. What would be the correct regular expression to achieve this?
s = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"
arr = s.scan(/\[(.*?)\]/).map {|m| m[0].split(/ *\| */)}
Two alternatives:
s = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"
s.split(/\s*\n\s*/).map{ |p| p.scan(/[^|\[\]]+/).map(&:strip) }
#=> [["test", "blah"], ["foo", "bar bar bar"], ["test", "abc", "123", "456 789"]]
s.split(/\s*\n\s*/).map do |line|
line.sub(/^\s*\[\s*/,'').sub(/\s*\]\s*$/,'').split(/\s*\|\s*/)
end
#=> [["test", "blah"], ["foo", "bar bar bar"], ["test", "abc", "123", "456 789"]]
Both of them start by splitting on newlines (throwing away surrounding whitespace).
The first one then splits each chunk by looking for anything that is not a [, |, or ] and then throws away extra whitespace (calling strip on each).
The second one then throws away leading [ and trailing ] (with whitespace) and then splits on | (with whitespace).
You cannot get the final result you want with a single scan. About the closest you can get is this:
s.scan /\[(?:([^|\]]+)\|)*([^|\]]+)\]/
#=> [["test", " blah"], ["foo ", "bar bar bar"], ["123 ", " 456 789"]]
…which drops information, or this:
s.scan /\[((?:[^|\]]+\|)*[^|\]]+)\]/
#=> [["test| blah"], ["foo |bar bar bar"], ["test| abc |123 | 456 789"]]
…which captures the contents of each "array" as a single capture, or this:
s.scan /\[(?:([^|\]]+)\|)?(?:([^|\]]+)\|)?(?:([^|\]]+)\|)?([^|\]]+)\]/
#=> [["test", nil, nil, " blah"], ["foo ", nil, nil, "bar bar bar"], ["test", " abc ", "123 ", " 456 789"]]
…which is hardcoded to a maximum of four items, and inserts nil entries that you would need to .compact away.
There is no way to use Ruby's scan to take a regex like /(?:(aaa)b)+/ and get multiple captures for each time the repetition is matched.
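A minimal illustration of that limitation, using a made-up input and hypothetical method names: the repeated group yields one match with only the last repetition captured, whereas scanning for the unrepeated unit returns every occurrence.

```ruby
# One match spans the whole string; the group keeps only its last repetition.
def repeated_group_captures(s)
  s.scan(/(?:(a\d)b)+/)
end

# Scanning for the single unit instead returns each occurrence.
def each_unit(s)
  s.scan(/(a\d)b/).flatten
end

p repeated_group_captures("a1ba2ba3b")  # [["a3"]]
p each_unit("a1ba2ba3b")                # ["a1", "a2", "a3"]
```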
Why the hard path (single regex)? Why not a simple combo of splits? Here are the steps, to visualize the process.
str = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"
arr = str.split("\n").map(&:strip) # => ["[test| blah]", "[foo |bar bar bar]", "[test| abc |123 | 456 789]"]
arr = arr.map{|s| s[1..-2] } # => ["test| blah", "foo |bar bar bar", "test| abc |123 | 456 789"]
arr = arr.map{|s| s.split('|').map(&:strip)} # => [["test", "blah"], ["foo", "bar bar bar"], ["test", "abc", "123", "456 789"]]
This is likely far less efficient than scan, but at least it's simple :)
A "Scan, Split, Strip, and Delete" Train-Wreck
The whole premise seems flawed, since it assumes that you will always find alternation in your sub-arrays and that expressions won't contain character classes. Still, if that's the problem you really want to solve for, then this should do it.
First, str.scan( /\[.*?\]/ ) will net you three array elements, each containing pseudo-arrays. Then you map the sub-arrays, splitting on the alternation character. Each element of the sub-array is then stripped of whitespace, and the square brackets deleted. For example:
str = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"
str.scan( /\[.*?\]/ ).map { |arr| arr.split('|').map { |m| m.strip.delete '[]' }}
#=> [["test", "blah"], ["foo", "bar bar bar"], ["test", "abc", "123", "456 789"]]
Verbosely, Step-by-Step
Mapping nested arrays is not always intuitive, so I've unwound the train-wreck above into more procedural code for comparison. The results are identical, but the following may be easier to reason about.
string = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"
array_of_strings = string.scan( /\[.*?\]/ )
#=> ["[test| blah]", "[foo |bar bar bar]", "[test| abc |123 | 456 789]"]
sub_arrays = array_of_strings.map { |sub_array| sub_array.split('|') }
#=> [["[test", " blah]"],
# ["[foo ", "bar bar bar]"],
# ["[test", " abc ", "123 ", " 456 789]"]]
stripped_sub_arrays = sub_arrays.map { |sub_array| sub_array.map(&:strip) }
#=> [["[test", "blah]"],
# ["[foo", "bar bar bar]"],
# ["[test", "abc", "123", "456 789]"]]
sub_arrays_without_brackets =
stripped_sub_arrays.map { |sub_array| sub_array.map {|elem| elem.delete '[]'} }
#=> [["test", "blah"], ["foo", "bar bar bar"], ["test", "abc", "123", "456 789"]]
