Ruby regex to get text blocks including delimiters - ruby

When using scan in Ruby, we are searching for a block within a text file.
Sample file:
sometextbefore
begin
sometext
end
sometextafter
begin
sometext2
end
sometextafter2
We want the following result in an array:
["begin\nsometext\nend","begin\nsometext2\nend"]
With this scan method:
textfile.scan(/begin\s.(.*?)end/m)
we get:
["sometext","sometext2"]
We want the begin and end still in the output, not cut off.
Any suggestions?

You may remove the capturing group completely:
textfile.scan(/begin\s.*?end/m)
The String#scan method returns only the captured values when the pattern contains capturing groups. With no groups at all (or with a non-capturing group), it returns the whole matches, which fixes the issue.
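To see the difference concretely, here is a small sketch using the sample text from the question. The variable names are illustrative and not part of the original post.

```ruby
text = "sometextbefore\nbegin\nsometext\nend\nsometextafter\n" \
       "begin\nsometext2\nend\nsometextafter2\n"

# With a capturing group, each result holds only the captured text.
with_group = text.scan(/begin\s(.*?)end/m)

# Without a group, or with a non-capturing one, each result is the full match.
without_group = text.scan(/begin\s.*?end/m)
non_capturing = text.scan(/begin\s(?:.*?)end/m)

p with_group     # [["sometext\n"], ["sometext2\n"]]
p without_group  # ["begin\nsometext\nend", "begin\nsometext2\nend"]
```

The non-capturing form is only useful when the group is needed for alternation or quantifiers; otherwise dropping the parentheses is simpler.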
UPDATE
If the lines inside the blocks must be trimmed from leading/trailing whitespace, you can just use a gsub against each matched block of text to remove all the horizontal whitespace (with the help of \p{Zs} Unicode category/property class):
.scan(/begin\s.*?end/m).map { |s| s.gsub(/^\p{Zs}+|\p{Zs}+$/, "") }
Here, each match is passed to a block where /^\p{Zs}+|\p{Zs}+$/ matches either the start of a line with 1+ horizontal whitespace(s) (see ^\p{Zs}+), or 1+ horizontal whitespace(s) at the end of the line (see \p{Zs}+$).
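As a rough illustration of the trimming step, the sketch below runs the same scan and gsub on a made-up string whose inner lines carry stray spaces. The sample text is an assumption, not the asker's file.

```ruby
# Hypothetical input: the inner lines have leading and trailing spaces.
text = "intro\nbegin\n  some \n text\nend\noutro\n"

blocks = text.scan(/begin\s.*?end/m).map do |s|
  # Remove space separators at the start and at the end of every line.
  s.gsub(/^\p{Zs}+|\p{Zs}+$/, "")
end

p blocks  # ["begin\nsome\ntext\nend"]
```

Note that \p{Zs} covers space separators only; a tab is not in that category, so [[:blank:]] may be a better fit if the file is indented with tabs.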

Here's another approach, using Ruby's flip-flop operator. I cannot say I would recommend this approach, but Rubyists should understand how the flip-flop operator works.
First let's create a file.
str = <<_
some
text
at beginning
begin
 some
 text
 1
end
some text
between
begin
 some
 text
 2
end
some text at end
_
#=> "some\ntext\nat beginning\nbegin\n some\n text\n 1\nend\n...at end\n"
FName = "text"
File.write(FName, str)
Now read the file line-by-line into the array lines:
lines = File.readlines(FName)
#=> ["some\n", "text\n", "at beginning\n", "begin\n", " some\n", " text\n",
# " 1\n", "end\n", "some text\n", "between\n", "begin\n", " some\n",
# " text\n", " 2\n", "end\n", "some text at end\n"]
We can obtain the desired result as follows.
lines.chunk { |line| true if line =~ /^begin\s*$/ .. line =~ /^end\s*$/ }.
map { |_,arr| arr.map(&:strip).join("\n") }
#=> ["begin\nsome\ntext\n1\nend", "begin\nsome\ntext\n2\nend"]
The two steps are as follows.
First, select and group the lines of interest, using Enumerable#chunk with the flip-flop operator.
a = lines.chunk { |line| true if line =~ /^begin\s*$/ .. line =~ /^end\s*$/ }
#=> #<Enumerator: #<Enumerator::Generator:0x007ff62b981510>:each>
We can see the objects that will be generated by this enumerator by converting it to an array.
a.to_a
#=> [[true, ["begin\n", " some\n", " text\n", " 1\n", "end\n"]],
# [true, ["begin\n", " some\n", " text\n", " 2\n", "end\n"]]]
Note that the flip-flop operator is distinguished from a range definition by making it part of a logical expression. For that reason we cannot write
lines.chunk { |line| line =~ /^begin\s*$/ .. line =~ /^end\s*$/ }.to_a
#=> ArgumentError: bad value for range
The second step is the following:
b = a.map { |_,arr| arr.map(&:strip).join("\n") }
#=> ["begin\nsome\ntext\n1\nend", "begin\nsome\ntext\n2\nend"]

Ruby has some great methods in Enumerable. slice_before and slice_after can help with this sort of problem:
string = <<EOT
sometextbefore
begin
sometext
end
sometextafter
begin
sometext2
end
sometextafter2
EOT
ary = string.split # => ["sometextbefore", "begin", "sometext", "end", "sometextafter", "begin", "sometext2", "end", "sometextafter2"]
.slice_after(/^end/) # => #<Enumerator: #<Enumerator::Generator:0x007fb1e20b42a8>:each>
.map{ |a| a.shift; a } # => [["begin", "sometext", "end"], ["begin", "sometext2", "end"], []]
ary.pop # => []
ary # => [["begin", "sometext", "end"], ["begin", "sometext2", "end"]]
If you want the resulting sub-arrays joined then that's an easy step:
ary.map{ |a| a.join("\n") } # => ["begin\nsometext\nend", "begin\nsometext2\nend"]
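The shift step above relies on exactly one line preceding each begin. A possible variation, sketched here under the assumption that any number of lines may appear outside the blocks, drops everything before begin in each slice instead. The sample text is made up for the illustration.

```ruby
string = <<~EOT
  intro line 1
  intro line 2
  begin
  sometext
  end
  begin
  sometext2
  end
  outro
EOT

blocks = string.lines(chomp: true)
               .slice_after("end")                                          # cut after each "end" line
               .map { |chunk| chunk.drop_while { |line| line != "begin" } } # discard text before "begin"
               .reject(&:empty?)                                            # trailing text leaves an empty slice
               .map { |chunk| chunk.join("\n") }

p blocks  # ["begin\nsometext\nend", "begin\nsometext2\nend"]
```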

Related

How to use gsub with a file in Ruby?

I have a string array, text_word, and I want to replace some letters in it according to the rules in my file transform.txt. The file looks like this:
/t/ 3
/$/ 1
/a/ !
But when I use gsub I get an Enumerator back, does anyone know how to fix this?
text_transform= Array.new
new_words= Array.new
File.open("transform.txt", "r") do |fi|
fi.each_line do |words|
text_transform << words.chomp
end
end
text_transform.each do |transform|
text_word.each do |words|
new_words << words.gsub(transform)
end
end
See the documentation for String#gsub:
If the second argument is a Hash, and the matched text is one of its
keys, the corresponding value is the replacement string.
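With the rules written out as a hash, a single gsub call per word is enough:
text_word.map { |word| word.gsub(/[t$a]/, 't' => '3', '$' => '1', 'a' => '!') }
Also you can use IO::readlines to read transform.txt into an array of lines:
File.readlines('transform.txt', chomp: true)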
gsub returns an Enumerator when you provide just one argument (the pattern). If you want to replace just add the replacement string:
pry(main)> 'this is my string'.gsub(/i/, '1')
"th1s 1s my str1ng"
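You need to refactor your code so that each line of the file is split into a pattern and a replacement, and every rule is applied to every word:
text_transform = []
File.foreach("transform.txt") do |line|
  next if line.strip.empty?
  pattern, replacement = line.split # "/t/ 3" -> ["/t/", "3"]
  # Drop the surrounding slashes; gsub matches a String pattern literally.
  text_transform << [pattern.delete("/"), replacement]
end
new_words = text_word.map do |word|
  # Apply the rules one after another to the same word.
  text_transform.reduce(word) { |w, (pattern, replacement)| w.gsub(pattern, replacement) }
end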
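The two answers can be combined: build the replacement hash from the lines of the file and let one gsub call per word do the work. This is only a sketch. The rule lines are inlined so that it runs on its own, and the sample words are hypothetical.

```ruby
# Lines as they would come from File.readlines("transform.txt", chomp: true).
rule_lines = ["/t/ 3", "/$/ 1", "/a/ !"]

# Build { "t" => "3", "$" => "1", "a" => "!" }; the slashes are dropped.
rules = rule_lines.to_h do |line|
  pattern, replacement = line.split
  [pattern.delete("/"), replacement]
end

# Regexp.union escapes each key, so "$" is matched literally.
matcher = Regexp.union(rules.keys)

text_word = ["tomato", "ca$h"] # hypothetical sample words
new_words = text_word.map { |word| word.gsub(matcher, rules) }

p new_words  # ["3om!3o", "c!1h"]
```

Because all rules are applied in a single pass, a replacement can never be rewritten by a later rule, which can happen when the rules are applied one after another.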

Using STDIN in ruby , how to print out the multiline comments without REGEX

Given STDIN with the following:
=begin This is a multiline comment and con spwan
as many lines as you like. But =begin and =end
should come in the first line only.
=end
Without using a regex, how do you print the lines in between as well?
(Side question: is ARGF expensive, since it has to wait for all the input?)
This is a start:
starting = "=begin"
ending = "=end"
ARGF.each do | line |
comments = false
if line.include?(starting) && !line.include?(ending)
incomments = true
end
if !line.include?(starting) && line.include?(ending)
puts line
incomments = false
end
if incomments == true
puts line.lstrip
end
end
expected output is:
This is a multiline comment and con spwan
as many lines as you like. But =begin and =end
should come in the first line only.
The generic answer that works for any number of nested levels:
input = "..." # could be taken from ARGF
input.
split($/).
each_with_object(result: Hash.new {|h, k| h[k] = []}, level: 0) do |line, acc|
acc[:level] += 1 if line.include?('=begin')
(1..acc[:level]).each do |level|
acc[:result]["Level: #{level}"] << line
end
acc[:level] -= 1 if line.include?('=end');
end[:result]
#⇒ {
# "Level: 1" => [
# "=begin This is a multiline comment and con spwan as many lines as you like.",
# "But =begin and =end should come in the first line only.",
# "=end"
# ],
# "Level: 2" => [
# "But =begin and =end should come in the first line only."
# ]
# }
If you need the comments on top level, just get the value for "Level: 1" key and join it with $/ delimiter.
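For the simpler, non-nested case in the question, a state flag and String#start_with? are enough. The sketch below is one possible reading of the expected output: the text after =begin on the first line is kept and the =end line is dropped. The method name comment_lines is made up for the illustration.

```ruby
# Collect the body of a =begin/=end comment without using a regex.
def comment_lines(lines)
  in_comment = false
  out = []
  lines.each do |line|
    text = line.chomp
    if !in_comment && text.start_with?("=begin")
      in_comment = true
      rest = text.delete_prefix("=begin").lstrip # keep text after the marker
      out << rest unless rest.empty?
    elsif in_comment && text.start_with?("=end")
      in_comment = false                         # the =end line is not printed
    elsif in_comment
      out << text
    end
  end
  out
end

sample = [
  "=begin This is a multiline comment and con spwan\n",
  "as many lines as you like. But =begin and =end\n",
  "should come in the first line only.\n",
  "=end\n"
]
puts comment_lines(sample)
# With real input: puts comment_lines(ARGF.each_line)
```

Only lines that start with =begin or =end toggle the state, so the markers mentioned in the middle of the second line are treated as ordinary text.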

Split an array by a repetitive value

I have a variable length array of arbitrary strings. The one consistency is the string "hello" is repeated and I want to partition the array in groups by the string "hello".
So given this:
[
"hello\r\n",
"I\r\n",
"am\r\n",
"Bob\r\n",
"hello\r\n",
"How\r\n",
"are you?\r\n"
]
I want this:
[
[
"hello\r\n",
"I\r\n",
"am\r\n",
"Bob\r\n"
],
[
"hello\r\n",
"How\r\n",
"are you?\r\n"
]
]
What I have tried:
partition = []
last = input.size
index = 0
input.each_with_object([]) do |line, acc|
index += 1
if line == "hello\r\n"
acc << partition
partition = []
partition << line
else
partition << line
end
if index == last
acc << partition
end
acc
end.delete_if(&:blank?)
=> [["hello\r\n", "I\r\n", "am\r\n", "Bob\r\n"], ["hello\r\n", "How\r\n", "are you?\r\n"]]
The result is right, but is it possible to do what I want with ruby array iterators? My solution seems clunky.
You can use Enumerable#slice_before
arr.slice_before { |i| i[/hello/] }.to_a
#=> [["hello\r\n", "I\r\n", "am\r\n", "Bob\r\n"],
# ["hello\r\n", "How\r\n", "are you?\r\n"]]
or more succinctly (as pointed out by @tokland):
arr.slice_before(/hello/).to_a
Here is a method that does not use Enumerable#slice_before, which has been available since Ruby v1.9.2. It works with v1.9+ (and would work with v1.8.7+ if each_with_object were replaced with reduce/inject).
Assumptions
I have assumed:
all strings preceding the first string beginning with "hello" are discarded
to match "hello" the string must begin "hello" and cannot be a word merely containing hello (e.g., "hellonfire")
Code
def group_em(arr, target)
arr.each_with_object([]) { |s,a| (s =~ /\A#{target}(?!\p{alpha})/) ?
(a << [s]) : (a.last << s unless a.empty?) }
end
Example
arr = ["Ahem\r\n", "hello\r\n", "I\r\n", "hello again\r\n", "am\r\n",
"Bob\r\n", "hellonfire\r\n", "How\r\n", "are you?\r\n"]
group_em(arr, 'hello')
#=> [["hello\r\n", "I\r\n"],
# ["hello again\r\n", "am\r\n", "Bob\r\n", "hellonfire\r\n",
# "How\r\n", "are you?\r\n"]]
Note that "Ahem\r\n" is not included because it does not follow "hello", and "hellonfire\r\n" does not trigger a new slice because it does not match "hello".
Discussion
In the example, the regular expression was computed to equal
/(?m-ix:\Ahello(?!\p{alpha}))/
It could instead be defined in free-spacing mode to make it self-documenting.
/
\A # match the beginning of the string
#{target} # match target word
(?!\p{alpha}) # do not match a letter (negative lookahead)
/x # free-spacing regex definition mode

ruby multiline scan between ; and negate?

I'm trying to match text between ;-.
I used:
inputx.scan(/;-.+?\n[^\n]*;-/)
but it doesn't work.
My text is:
baseball;-1
norm;4
dad;3
soda;1
robot;-8
mmm;3
fly;-1
cat;4
bird;4
dragon;6
mor;-1
I need to separate the text between ;-.
For example, this is the first element of the resulting array:
baseball;-1
norm;4
dad;3
soda;1
robot;-8
And this is second:
fly;-1
cat;4
bird;4
dragon;6
mor;-1
You may use a regex that will match any line that ends with - and 1 or more digits, and then matches any text up to the first line that ends with - and 1 or more digits:
/.*-\d+$(?m:.*?-\d+$)/
Details:
.*-\d+$ - any 0+ chars other than line breaks, followed with - and 1+ digits
(?m:.*?-\d+$) - a modifier group where . matches line breaks matching:
.*? - any 0+ chars, as few as possible
- - a hyphen
\d+ - 1 or more digits
$ - end of line.
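Applied to the sample text from the question, the pattern can be tried out as follows. This is a minimal sketch; the variable names are illustrative.

```ruby
inputx = "baseball;-1\nnorm;4\ndad;3\nsoda;1\nrobot;-8\nmmm;3\n" \
         "fly;-1\ncat;4\nbird;4\ndragon;6\nmor;-1\n"

# First part: a line ending in -digits. Second part: lazily across
# line breaks up to the next line that ends in -digits.
blocks = inputx.scan(/.*-\d+$(?m:.*?-\d+$)/)

p blocks.size   # 2
puts blocks[0]  # baseball;-1 ... robot;-8
puts blocks[1]  # fly;-1 ... mor;-1
```

The line mmm;3 falls between the two matches and is skipped, because it contains no hyphen and . cannot cross a line break outside the (?m:...) group.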
You can use String#split twice, the first to split by lines, and the second to split based on the presence of either ; or ;- (using the pattern /;-?/)
The pattern /;-?/ matches a semicolon followed by an optional -.
inputx.split("\n").map{|s| s.split(/;-?/)}
#=> [[" baseball", "1"], [" norm", "4"], [" dad", "3"], [" soda", "1"], [" robot", "8"], [" mmm", "3"], [" fly", "1"], [" cat", "4"], [" bird", "4"], [" dragon", "6"], [" mor", "1"]]
A pattern with scan or split results in a regex that is needlessly complicated because it's not the best tool in the box for the problem.
I'd use something like this:
text = <<EOT
 baseball;-1
 norm;4
 dad;3
 soda;1
 robot;-8
 mmm;3
 fly;-1
 cat;4
 bird;4
 dragon;6
 mor;-1
EOT
ary = [[]]
text.lines.each do |l|
if l[';-'] ... l[';-']
ary.last << l
else
ary << []
end
end
ary
# => [[" baseball;-1\n",
# " norm;4\n",
# " dad;3\n",
# " soda;1\n",
# " robot;-8\n"],
# [" fly;-1\n",
# " cat;4\n",
# " bird;4\n",
# " dragon;6\n",
# " mor;-1\n"]]
If you don't want trailing new-lines:
ary = [[]]
text.lines.map(&:chomp).each do |l|
if l[';-'] ... l[';-']
ary.last << l
else
ary << []
end
end
ary
# => [[" baseball;-1", " norm;4", " dad;3", " soda;1", " robot;-8"],
# [" fly;-1", " cat;4", " bird;4", " dragon;6", " mor;-1"]]
If you don't want the whitespace surrounding each element:
ary = [[]]
text.lines.map(&:strip).each do |l|
if l[';-'] ... l[';-']
ary.last << l
else
ary << []
end
end
ary
# => [["baseball;-1", "norm;4", "dad;3", "soda;1", "robot;-8"],
# ["fly;-1", "cat;4", "bird;4", "dragon;6", "mor;-1"]]
How does this work? The .. and ... operators change meaning depending on whether they are used in the context of a Range or in an if condition. In a condition, .. is called a "flip-flop" operator, which changes state when the first condition is met. It will begin returning true at that point, and will continue to do so until the second condition is met, at which point it begins returning false again. That makes it easy to look for something, then begin acting on subsequent lines until the second condition occurs.
Normally we'd use different conditions, such as searching for "begin" and "end" in a block of lines in a file. In this case though, we needed it to not immediately toggle since both the start and end condition were the same, which is where ... comes in. It waits one loop before testing for the second condition, allowing this code to continue through the following lines until it finds the "closing" ';-'.
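The difference between the two forms is easiest to see side by side. The sketch below uses a made-up list of lines in which every other line contains ';-', and the same test for the start and end conditions.

```ruby
lines = ["a;-1", "b;2", "c;-3", "d;4", "e;-5", "f;6", "g;-7"]

# Two dots: the end condition is tested on the same line that switched
# the flip-flop on, so it switches off again immediately.
two_dots = lines.select { |l| true if l.include?(";-") .. l.include?(";-") }

# Three dots: the end condition is first tested on the following line,
# so the flip-flop stays on until the next ';-' line.
three_dots = lines.select { |l| true if l.include?(";-") ... l.include?(";-") }

p two_dots    # ["a;-1", "c;-3", "e;-5", "g;-7"]
p three_dots  # ["a;-1", "b;2", "c;-3", "e;-5", "f;6", "g;-7"]
```

With two dots only the ';-' lines themselves are selected; with three dots each pair of ';-' lines and everything between them is selected, while "d;4" (outside any pair) is left out.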
I have to say, this data set is one of the weirdest I've ever seen. (The weirdest was some binary data for the address book out of an old email program years ago). I'd be concerned about the process that's generating it, and if that generation was under my control I'd change it to use something more standard.
We can use Enumerable#chunk and Ruby's flip-flop operator. This does not require the use of a regular expression. str is the string given by the OP.
arr = str.lines.chunk do |line|
true if line.include?('-') ... line.include?('-')
end.select(&:first).map { |_,a| a.join }
#=> ["baseball;-1\nnorm;4\ndad;3\nsoda;1\nrobot;-8\n",
# "fly;-1\ncat;4\nbird;4\ndragon;6\nmor;-1\n"]
arr.each { |s| puts "\n"; puts s }
baseball;-1
norm;4
dad;3
soda;1
robot;-8
fly;-1
cat;4
bird;4
dragon;6
mor;-1
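It is necessary to use three (not two) dots in the flip-flop expression: with two dots the end condition is tested on the same line that switched the flip-flop on, so every block would close on its own opening line (search for "three dot" in the reference given above).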

How do you strip substrings in ruby?

I'd like to replace/duplicate a substring between two delimiters, e.g.:
"This is (the string) I want to replace"
I'd like to strip out everything between the characters ( and ), and set that substr to a variable -- is there a built in function to do this?
I would just do:
my_string = "This is (the string) I want to replace"
p my_string.split(/[()]/) #=> ["This is ", "the string", " I want to replace"]
p my_string.split(/[()]/)[1] #=> "the string"
Here are two more ways to do it:
/\((?<inside_parenthesis>.*?)\)/ =~ my_string
p inside_parenthesis #=> "the string"
my_new_var = my_string[/\((.*?)\)/,1]
p my_new_var #=> "the string"
Edit - Examples to explain the last method:
my_string = 'hello there'
capture = /h(e)(ll)o/
p my_string[capture] #=> "hello"
p my_string[capture, 1] #=> "e"
p my_string[capture, 2] #=> "ll"
var = "This is (the string) I want to replace"[/(?<=\()[^)]*(?=\))/]
var # => "the string"
str = "This is (the string) I want to replace"
str.match(/\((.*)\)/)
some_var = $1 # => "the string"
As I understand, you want to remove or replace a substring as well as set a variable equal to that substring (sans the parentheses). There are many ways to do this, some of which are slight variants of the other answers. Here's another way that also allows for the possibility of multiple substrings within parentheses, picking up from @sawa's comments:
def doit(str, repl)
  vars = []
  # The block records the text inside the parentheses and returns the replacement.
  new_str = str.gsub(/\(.*?\)/) { |m| vars << m[1..-2]; repl }
  [new_str, vars]
end
new_str, vars = doit("This is (the string) I want to replace", '')
new_str # => "This is  I want to replace"
vars # => ["the string"]
new_str, vars = doit("This is (the string) I (really) want (to replace)", '')
new_str # => "This is  I  want "
vars # => ["the string", "really", "to replace"]
new_str, vars = doit("This (short) string is a () keeper", "hot dang")
new_str # => "This hot dang string is a hot dang keeper"
vars # => ["short", ""]
In the regex, the ? in .*? makes .* "lazy". gsub passes each match m to the block; the block strips the parens and adds it to vars, then returns the replacement string. This regex also works:
/\([^\(]*\)/
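A common variation, shown here only as a sketch and not as part of the answer above, excludes the closing parenthesis in the character class, so the match stops at the first ) without backtracking. Combined with scan and a capturing group, it collects the inner texts directly.

```ruby
str = "This is (the string) I (really) want (to replace)"

# Capture everything between "(" and the next ")".
inside = str.scan(/\(([^)]*)\)/).flatten

# Remove each parenthesised part together with the whitespace before it.
stripped = str.gsub(/\s*\([^)]*\)/, "")

p inside    # ["the string", "really", "to replace"]
p stripped  # "This is I want"
```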
