Replacing lines in a Ruby string

I'm trying to loop through a Ruby string containing many lines using the each_line method, but I also want to change them. I'm using the following code, but it doesn't seem to work:
string.each_line{|line| line=change_line(line)}
I suppose that Ruby is passing a copy of my line and not the line itself, but unfortunately there is no each_line! method. I also tried the gsub! method, using /^.*$/ to detect each line, but it seems to call the change_line method only once and replace all lines with its result. Any ideas how to do that?
Thanks in advance :)

@azlisum: You are not storing the result of your concatenation. Use:
output = string.lines.map{|line|change_line(line)}.join
Comparing four ways to process by line in a string:
# Inject method (proposed by @steenslag)
output = string.each_line.inject(""){|s, line| s << change_line(line)}
# Join method (proposed by @Lars Haugseth)
output = string.lines.map{|line|change_line(line)}.join
# REGEX method (proposed by @olistik)
# gsub is used here instead of gsub!, which returns nil when nothing is replaced
output = string.gsub(/^(.*)$/) {|line| change_line(line)}
# String concatenation += method (proposed by @Erik Hinton)
output = ""
string.each_line{|line| output += change_line(line)}
The timing with Benchmark:
                  user     system      total        real
Inject Time:  7.920000   0.010000   7.930000 (  7.920128)
Join Time:    7.150000   0.010000   7.160000 (  7.155957)
REGEX Time:  11.660000   0.010000  11.670000 ( 11.661059)
+= Time:      7.080000   0.010000   7.090000 (  7.076423)
As @steenslag pointed out, 's += a' will generate a new string for each concatenation and is therefore not usually the best choice.
So given that, and given the times, your best bet is:
output = string.lines.map{|line|change_line(line)}.join
Also, this is the cleaner looking choice IMHO.
Notes:
Using Benchmark
Ruby-Doc: Benchmark
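The comparison above could be reproduced with something along these lines. This is a minimal sketch, not the original benchmark script: change_line is a hypothetical stand-in that just upcases, the input is synthetic, and the non-destructive gsub is used so the input is not altered between runs.

```ruby
require "benchmark"

# Hypothetical stand-in for change_line; the real method is not shown in the question.
def change_line(line)
  line.upcase
end

# Synthetic input (an assumption): 2,000 short lines.
string = "some text\nmore text\n" * 1_000
results = {}

Benchmark.bm(8) do |x|
  x.report("inject:") do
    results[:inject] = string.each_line.inject("".dup) { |s, line| s << change_line(line) }
  end
  x.report("join:") do
    results[:join] = string.lines.map { |line| change_line(line) }.join
  end
  x.report("gsub:") do
    results[:gsub] = string.gsub(/^(.*)$/) { |line| change_line(line) }
  end
  x.report("+=:") do
    output = ""
    string.each_line { |line| output += change_line(line) }
    results[:concat] = output
  end
end
```

One difference to keep in mind: each_line and lines pass each line with its trailing newline, while the gsub block receives the line without it, so a change_line that cares about the newline will behave differently between the approaches.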

You could also start with a blank string, iterate over the original string with each_line, and append each result to the blank string.
output = ""
string.each_line{|line| output += change_line(line)}
In your original example, you are correct. Your changes are occurring, but they are not being saved anywhere. each in Ruby does not alter anything by default.
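To make the point concrete, here is a small sketch. change_line is again a hypothetical stand-in that upcases its argument. Assigning to the block parameter only rebinds a local name, so the original string stays as it was; the results have to be collected somewhere.

```ruby
# Hypothetical stand-in for change_line.
def change_line(line)
  line.upcase
end

string = "first\nsecond\n"

# Assigning to the block parameter rebinds a local name only.
string.each_line { |line| line = change_line(line) }
after_each_line = string.dup

# Appending each result to a separate string keeps the changes.
output = "".dup
string.each_line { |line| output << change_line(line) }
```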

You could use gsub! passing a block to it:
string.gsub!(/^(.*)$/) {|line| change_line(line)}
source: String#gsub!
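A short sketch of why the block form matters. The asker's own gsub! call isn't shown, so this is only one possible explanation of "it calls change_line only once": a replacement passed as an argument is evaluated a single time before gsub runs, whereas a block is evaluated for every match. change_line here is a hypothetical stand-in that wraps the line in brackets.

```ruby
# Hypothetical stand-in for change_line.
def change_line(line)
  "[#{line}]"
end

string = "one\ntwo"

# Argument form: change_line runs once, and its result replaces every line.
replaced_once = string.gsub(/^.*$/, change_line("X"))

# Block form: change_line runs for each matched line.
replaced_each = string.gsub(/^.*$/) { |line| change_line(line) }

# gsub! returns nil when no substitution was made, so don't rely on its return value.
no_match = "abc".gsub!(/z/, "y")
```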

String#each_line is meant for reading lines in a string, not writing them. You can use this to get the result you want like so:
changed_string = ""
string.each_line{ |line| changed_string += change_line(line) }

If you don't give each_line a block, you'll get an enumerator, which has the inject method.
str = <<HERE
smestring dsfg
line 2
HERE
res = str.each_line.inject(""){|m,line|m << line.upcase}

Related

Ruby script which can replace a string in a binary file to a different, but same length string?

I would like to write a Ruby script (repl.rb) which can replace a string in a binary file (string is defined by a regex) to a different, but same length string.
It works like a filter, outputs to STDOUT, which can be redirected (ruby repl.rb data.bin > data2.bin), regex and replacement can be hardcoded. My approach is:
#!/usr/bin/ruby
fn = ARGV[0]
regex = /\-\-[0-9a-z]{32,32}\-\-/
replacement = "--0ca2765b4fd186d6fc7c0ce385f0e9d9--"
blk_size = 1024
File.open(fn, "rb") {|f|
while not f.eof?
data = f.read(blk_size)
data.gsub!(regex, replacement)
print data
end
}
My problem is that the string may be positioned in the file such that it straddles a block boundary used when reading the binary file. For example, when blk_size=1024 and the first occurrence of the string begins at byte position 1000, I will not find it in the "data" variable. The same happens with the next read cycle. Should I process the whole file twice with different block sizes to avoid this worst-case scenario, or is there another approach?
I would posit that a tool like sed might be a better choice for this. That said, here's an idea: Read block 1 and block 2 and join them into a single string, then perform the replacement on the combined string. Split them apart again and print block 1. Then read block 3 and join block 2 and 3 and perform the replacement as above. Split them again and print block 2. Repeat until the end of the file. I haven't tested it, but it ought to look something like this:
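File.open(fn, "rb") do |f|
  last_block = f.read(blk_size).to_s
  until f.eof?
    # Join the previous block with the next one so a match spanning the boundary is seen whole.
    data = (last_block + f.read(blk_size)).gsub(regex, replacement)
    # Print the first block and carry the second one into the next iteration.
    print data.slice!(0, last_block.size)
    last_block = data
  end
  # Covers files that fit into a single block; harmless otherwise.
  print last_block.gsub(regex, replacement)
end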
There's probably a nontrivial performance penalty for doing it this way, but it could be acceptable depending on your use case.
Maybe a cheeky
f.pos = f.pos - replacement.size
at the end of the while loop, just before reading the next chunk.
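A variation on the ideas above, sketched under a few assumptions: instead of seeking backwards in the file, hold back the last (match length - 1) bytes of each round in memory, so that a match straddling a block boundary is complete once the next chunk arrives. The helper name replace_in_chunks is hypothetical, and StringIO stands in for the file so the sketch can run on its own. It relies on the replacement having the same length as the match, as in the question.

```ruby
require "stringio"

# Hypothetical helper: copies input to output in blk_size chunks, replacing
# same-length matches and holding back a short tail between reads.
def replace_in_chunks(input, output, regex, replacement, blk_size)
  keep = replacement.bytesize - 1 # a straddling match must start within this tail
  buffer = "".b
  while (chunk = input.read(blk_size))
    buffer << chunk
    buffer.gsub!(regex, replacement)
    emit = [buffer.bytesize - keep, 0].max
    output.write(buffer.slice!(0, emit))
  end
  output.write(buffer) # whatever is left after the last read
end

regex = /--[0-9a-z]{32}--/
replacement = "--0ca2765b4fd186d6fc7c0ce385f0e9d9--"

# The match starts at byte 1000, so it straddles the 1024-byte boundary.
data = "x" * 1000 + "--" + "a" * 32 + "--" + "y" * 100
out = StringIO.new
replace_in_chunks(StringIO.new(data), out, regex, replacement, 1024)
result = out.string
```

Matches that sit directly next to each other at a block boundary may be handled slightly differently than by a single gsub over the whole file, so this is worth testing against real data before relying on it.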

Reformat string with `Regexp`, named captures, and `String#%`

Does anyone know a way to directly use a MatchData object containing named captures as the input to a String template formatting operation (%)? When I attempt to do so, I get a "positional args mixed with named args" error.
s = "One-Two-Three"
re = /(?<first>.*?)-(?<second>.*?)-(?<third>.*)/
puts "%{second}" % s.match(re)
I found other ways to achieve the functional objective (i.e. by creating an array of the captures in the desired order and using positional templating), but the code is comparatively clunky.
Try this:
s = "One-Two-Three"
re = /(?<first>.*?)-(?<second>.*?)-(?<third>.*)/
match = s.match(re)
[match.names.map(&:to_sym), match.captures].transpose.to_h
# => {:first=>"One", :second=>"Two", :third=>"Three"}
What about using string interpolation directly:
puts "#{s.match(re)['second']}"
For Ruby versions without Array#to_h (before 2.1), use Hash[] instead:
m = s.match re
Hash[m.names.map(&:to_sym).zip m.captures]
#=> {:first=>"One", :second=>"Two", :third=>"Three"}
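Putting the pieces together, a minimal sketch of feeding the named captures to String#%. The format operator wants a Hash with symbol keys, which is why the MatchData is converted first. The second variant assumes a newer Ruby (MatchData#named_captures and Hash#transform_keys, roughly 2.5 and later).

```ruby
s = "One-Two-Three"
re = /(?<first>.*?)-(?<second>.*?)-(?<third>.*)/
m = s.match(re)

# Build a symbol-keyed hash from the capture names and values.
fields = m.names.map(&:to_sym).zip(m.captures).to_h
formatted = "%{second}/%{first}" % fields

# Alternative on newer Rubies: named_captures returns string keys, so convert them.
formatted_alt = "%{third}" % m.named_captures.transform_keys(&:to_sym)
```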

Ruby MatchData class is repeating captures, instead of including additional captures as it "should"

Ruby 1.9.1, OSX 10.5.8
I'm trying to write a simple app that parses a bunch of Java-based HTML template files and replaces each period (.) with an underscore if it's contained within a specific tag. I use Ruby all the time for these types of utility apps, and thought it would be no problem to whip up something using Ruby's regex support. So I create a Regexp.new... object, open a file, read it in line by line, and match each line against the pattern. If I get a match, I create a new string using replaceString = currentMatch.gsub(/\./, '_'), then create another regex from the whole matched string with newReplaceRegex = Regexp.escape(currentMatch), and finally replace back into the current line with line.gsub(newReplaceRegex, replaceString). Code below, of course, but first...
The problem I'm having is that when accessing the indexes within the returned MatchData object, I'm getting the first result twice, and it's missing the second substring it should otherwise be finding. Stranger still, when testing this same pattern and same test text using rubular.com, it works as expected. See results here
My pattern:
(<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+.)+(?:[a-zA-Z0-9]+)(?:>))
Test text:
<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>moreNonMatchingText<WEBOBJECT NAME=admin.SecondLineMatch>AndEvenMoreNonMatchingText
Here's the relevant code:
tagRegex = Regexp.new('(<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+\.)+(?:[a-zA-Z0-9]+)(?:>))+')
testFile = File.open('RegexTestingCompFix.txt', "r+")
lineCount=0
testFile.each{|htmlLine|
lineCount += 1
puts ("Current line: #{htmlLine} at line num: #{lineCount}")
tagMatch = tagRegex.match(htmlLine)
if(tagMatch)
matchesArray = tagMatch.to_a
firstMatch = matchesArray[0]
secondMatch = matchesArray[1]
puts "First match: #{firstMatch} and second match #{secondMatch}"
tagMatch.captures.each {|lineMatchCapture|
puts "Current capture for tagMatches: #{lineMatchCapture} of total match count #{matchesArray.size}"
#create a new regex using the match results; make sure to use auto escape method
originalPatternString = Regexp.escape(lineMatchCapture)
replacementRegex = Regexp.new(originalPatternString)
#replace any periods with underscores in a copy of lineMatchCapture
periodToUnderscoreCorrection = lineMatchCapture.gsub(/\./, '_')
#replace original match with underscore replaced copy within line
htmlLine.gsub!(replacementRegex, periodToUnderscoreCorrection)
puts "The modified htmlLine is now: #{htmlLine}"
}
end
}
I would think that I should get the first tag in matchData[0] and then the second tag in matchData[1]; or, what I'm really doing, because I don't know how many matches I'll get within any given line, is matchData.to_a.each. In this case, matchData has two captures, but they're both the first tag match,
which is: <WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>
So, what the heck am I doing wrong, and why does the rubular test give me the expected results?
You want to use String#scan instead of Regexp#match:
tag_regex = /<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+\.)+(?:[a-zA-Z0-9]+)(?:>)/
lines = "<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>moreNonMatchingText\
<WEBOBJECT NAME=admin.SecondLineMatch>AndEvenMoreNonMatchingText"
lines.scan(tag_regex)
# => ["<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>", "<WEBOBJECT NAME=admin.SecondLineMatch>"]
A few recommendations for your next Ruby questions:
newlines and spaces are your friends, you don't lose points for using more lines in your code ;-)
use do-end on blocks instead of {}, improves readability a lot
declare variables in snake case (hello_world) instead of camel case (helloWorld)
Hope this helps
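One detail that is easy to trip over with scan: its return shape depends on whether the pattern has capture groups. A small sketch, using a shortened, case-insensitive version of the pattern and made-up sample text:

```ruby
line = "<WEBOBJECT NAME=admin.normalMode>text<WEBOBJECT NAME=admin.SecondLineMatch>more"

# No capture groups: scan returns an array of matched strings.
flat = line.scan(/<webobject name=(?:\w+\.)+\w+>/i)

# One capture group: scan returns an array of arrays, one inner array per match.
nested = line.scan(/(<webobject name=(?:\w+\.)+\w+>)/i)
```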
I ended up using the String.scan approach, the only tricky point there was figuring out that this returns an array of arrays, not a MatchData object, so there was some initial confusion on my part, mostly due to my ruby green-ness, but it's working as expected now. Also, I trimmed the regex per Trevoke's suggestion. But snake case? Never...;-) Anyway, here goes:
tagRegex = /(<(?:webobject) (?:name)=(?:\w+\.)+(?:\w+)(?:>))/i
testFile = File.open('RegexTestingCompFix.txt', "r+")
lineCount=0
testFile.each do |htmlLine|
lineCount += 1
puts ("Current line: #{htmlLine} at line num: #{lineCount}")
oldMatches = htmlLine.scan(tagRegex) #oldMatches thusly named due to not explicitly using Regexp or MatchData, as in "the old way..."
if(oldMatches.size > 0)
oldMatches.each_index do |index|
arrayMatch = oldMatches[index]
aMatch = arrayMatch[0]
#create a new regex using the match results; make sure to use auto escape method
replacementRegex = Regexp.new(Regexp.escape(aMatch))
#replace any periods with underscores in a copy of lineMatchCapture
periodToUnderscoreCorrection = aMatch.gsub(/\./, '_')
#replace original match with underscore replaced copy within line, matching against the new escaped literal regex
htmlLine.gsub!(replacementRegex, periodToUnderscoreCorrection)
puts "The modified htmlLine is now: #{htmlLine}"
end # I kind of still prefer the brackets...;-)
end
end
Now, why does MatchData work the way it does? It seems like its behavior is a bug really, and certainly not very useful in general if you can't get it to provide a simple means of accessing all the matches. Just my $.02
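For what it's worth, the behavior can be explained without a bug: a MatchData describes a single match, where index 0 is the whole match and index 1 is the first capture group. With the pattern wrapped in one group, the two are the same text, which is why the first tag shows up twice. The sketch below illustrates this with a shortened pattern and made-up sample text, and also shows gsub with a block as a possible shortcut that avoids building an escaped regex per match.

```ruby
line = "<WEBOBJECT NAME=admin.normalMode>text<WEBOBJECT NAME=admin.SecondLineMatch>more"
tag_regex = /(<webobject name=(?:\w+\.)+\w+>)+/i

m = tag_regex.match(line)
whole_match = m[0] # the entire first match
first_group = m[1] # capture group 1, which here covers the same text

# gsub yields every match in turn, so the periods can be replaced in place.
fixed = line.gsub(/<webobject name=(?:\w+\.)+\w+>/i) { |tag| tag.tr(".", "_") }
```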
Small bits:
This regexp helps you get "normalMode" .. But not "secondLineMatch":
<webobject name=\w+\.((?:\w+)).+> (with option 'i', for "case insensitive")
This regexp helps you get "secondLineMatch" ... But not "normalMode":
<webobject name=\w+\.((?:\w+))> (with option 'i', for "case insensitive").
I'm not really good at regexps, but I'll keep toiling at it.. :)
And I don't know if this helps you at all, but here's a way to get both:
<webobject name=admin.(\w+) (with option 'i').

Getting some elements in a string using a regex

Context
Using Ruby I am parsing strings looking like this:
A type with an ID...
[Image=4b5da003ee133e8368000002]
[Video=679hfpam9v56dh800khfdd32]
...with between 0 and n additional options separated with #...
[Image=4b5da003ee133e8368000002#size:small]
[Image=4b5da003ee133e8368000002#size:small#media:true]
In this example:
[Image=4b5da003ee133e8368000002#size:small#media:true]
I want to retrieve:
[Image=4b5da003ee133e8368000002#size:small#media:true]
Image
4b5da003ee133e8368000002
size:small
media:true
Problem
Right now using this regex:
(\[([a-zA-Z]+)=([a-zA-Z0-9]+)(#[a-zA-Z]+:[a-zA-Z]+)*\])
I get...
[Image=4b5da003ee133e8368000002#size:small#media:true]
Image
4b5da003ee133e8368000002
#media:true
What am I doing wrong? How can I get what I want?
PS: All the results are copied from http://rubular.com/ which is nice to debug regex. Please use it if it can help you help me :)
Edit : if it's impossible to get all options separated, how could I get this:
[Image=4b5da003ee133e8368000002#size:small#media:true]
Image
4b5da003ee133e8368000002
#size:small#media:true
Edit:
Ruby's regex implementation, like most other regex engines, does not seem to support multiple captures for one repeated group. Therefore, you'll have to do it in two steps: first get all the #*:* options in one string, and then split them.
To get all of them, this should work:
(\[([a-zA-Z]+)=([a-zA-Z0-9]+)((?:#[a-zA-Z]+:[a-zA-Z]+)*)\])
To get the "tail" of options, you could fetch it from $4 with
/(\[([a-zA-Z]+)=([a-zA-Z0-9]+)((#[a-zA-Z]+:[a-zA-Z]+)*)\])/
and then split on hash signs (#).
For example:
#! /usr/bin/ruby
str = "[Image=4b5da003ee133e8368000002#size:small#media:true]"
if /(\[([a-zA-Z]+)=([a-zA-Z0-9]+)((#[a-zA-Z]+:[a-zA-Z]+)*)\])/.match(str)
print $1, "\n",
$2, "\n",
$3, "\n",
$4, "\n";
$4[1..-1].split(/#/).each do |s|
print s, "\n";
end
end
Output:
[Image=4b5da003ee133e8368000002#size:small#media:true]
Image
4b5da003ee133e8368000002
#size:small#media:true
size:small
media:true
(\[([a-zA-Z]+)=([a-zA-Z0-9]+)(?:#([a-zA-Z]+:[a-zA-Z]+))*\])
will give you media:true. Note that media:true is overwriting the previous size:small match. I don't think there's a way to get exactly what you want in a single match call.
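A quick sketch of that overwriting behavior, together with String#scan as one way to collect every option in a second step:

```ruby
str = "[Image=4b5da003ee133e8368000002#size:small#media:true]"

# A repeated capture group only keeps its last iteration.
m = str.match(/\[([a-zA-Z]+)=([a-zA-Z0-9]+)(?:#([a-zA-Z]+:[a-zA-Z]+))*\]/)
last_only = m[3]

# scan walks over every occurrence instead.
all_options = str.scan(/#([a-zA-Z]+:[a-zA-Z]+)/).flatten
```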
It looks like the regex only keeps the last match. I think to get the list of matches will require a different approach.
"a=b#c:d#e:f".split(/=|#/)
which creates a list:
["a", "b", "c:d", "e:f"]
which is close to what you want...
Although it can be tricky to do it purely within a regexp, it's not too hard to split it out as a two-step operation:
while (line = DATA.gets)
line.chomp!
if (m = line.match(/\[([a-zA-Z]+)=([a-zA-Z0-9]+)((?:#[a-zA-Z]+:[a-zA-Z]+)*)\]/))
(type, hash, options) = m.to_a[1, 3]
options = options.split(/#/).reject { |s| s.empty? }
puts [ type, hash, options.join(',') ].join(' / ')
end
end
__END__
[Image=4b5da003ee133e8368000002]
[Video=679hfpam9v56dh800khfdd32]
[Image=4b5da003ee133e8368000002#size:small]
[Image=4b5da003ee133e8368000002#size:small#media:true]
[Image=4b5da003ee133e8368000002#size:small#media:true#foo:bar]
This produces the output:
Image / 4b5da003ee133e8368000002 /
Video / 679hfpam9v56dh800khfdd32 /
Image / 4b5da003ee133e8368000002 / size:small
Image / 4b5da003ee133e8368000002 / size:small,media:true
Image / 4b5da003ee133e8368000002 / size:small,media:true,foo:bar
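If the options are needed as key/value pairs rather than a joined string, the same two-step idea could be wrapped up roughly like this. parse_tag is a hypothetical helper name, and the result layout (a hash with :type, :id and :options) is just one possible choice.

```ruby
# Hypothetical helper: returns nil when the text is not a well-formed tag.
def parse_tag(text)
  m = text.match(/\A\[([a-zA-Z]+)=([a-zA-Z0-9]+)((?:#[a-zA-Z]+:[a-zA-Z]+)*)\]\z/)
  return nil unless m

  # Step two: split the option tail on "#" and then each pair on ":".
  options = m[3].split("#").reject(&:empty?).map { |pair| pair.split(":", 2) }.to_h
  { type: m[1], id: m[2], options: options }
end

with_options = parse_tag("[Image=4b5da003ee133e8368000002#size:small#media:true]")
without_options = parse_tag("[Video=679hfpam9v56dh800khfdd32]")
not_a_tag = parse_tag("Image=4b5da003ee133e8368000002")
```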

How to remove all [\d+] except the last [\d+]?

I have a string like
/root/children[2]/header[1]/something/some[4]/table/tr[1]/links/a/b
and
/root/children[2]/header[1]/something/some[4]/table/tr[2]
How can I rewrite the string so that all the /\[\d+\]/ matches are removed except for the last /\[\d+\]/?
So I should end up with
/root/children/header/something/some/table/tr[1]/links/a/b
and
/root/children/header/something/some/table/tr[2]
No loops for you. Use a lookahead assertion (?= ... ):
s.gsub(/\[\d+\](?=.*\[)/, "")
There's a reasonable explanation of the very useful lookaround operators here
We will have to use a while loop, I guess. Here comes a good ol' C-style loop solution:
while s.gsub!(/(\[\d+\])(.*?)(\[\d+\])/, '\2\3'); end
It's a bit hard to read, so I'll explain. The idea is that we match the string against a pattern that requires two [\d+] blocks to be present in the string. In the replacement, we just delete the first one. We repeat this until the string no longer matches (so it contains only one such block), relying on the fact that gsub! returns nil and performs no substitution when the string doesn't match.
I'm absolutely certain there's a more elegant solution, but this ought to get you going:
string = "/root/children[2]/header[1]/something/some[4]/table/tr[1]/links/a/b"
count = string.scan(/\[\d+\]/).size
index = 0
string.gsub(/\[\d+\]/) do |capture|
index += 1
index == count ? capture : ""
end
Try this:
str.scan(/\[\d+\]/)[0..-2].each {|match| str.sub!(match, '')}
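To check that the four suggestions agree, they can be run side by side on the two sample strings from the question. The method names below are made up for this sketch; the bodies follow the answers above, with a dup added where the original mutates its input.

```ruby
def with_lookahead(s)
  s.gsub(/\[\d+\](?=.*\[)/, "")
end

def with_loop(s)
  s = s.dup
  while s.gsub!(/(\[\d+\])(.*?)(\[\d+\])/, '\2\3')
    # keep going until only one bracketed index is left
  end
  s
end

def with_count(s)
  count = s.scan(/\[\d+\]/).size
  index = 0
  s.gsub(/\[\d+\]/) do |capture|
    index += 1
    index == count ? capture : ""
  end
end

def with_scan(s)
  s = s.dup
  s.scan(/\[\d+\]/)[0..-2].each { |match| s.sub!(match, "") }
  s
end

samples = [
  "/root/children[2]/header[1]/something/some[4]/table/tr[1]/links/a/b",
  "/root/children[2]/header[1]/something/some[4]/table/tr[2]"
]

results = %i[with_lookahead with_loop with_count with_scan].map do |name|
  [name, samples.map { |s| send(name, s) }]
end.to_h
```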
