See if the beginning of a line matches a regex character - ruby

There are lines inside a file that contain !. I need all other lines. I only want to print lines within the file that do not start with an exclamation mark.
The line of code which I have written so far is:
unless parts.each_line.split("\n" =~ /^!/)
# other bit of nested code
end
But it doesn't work. How do I do it?

As a start I'd use:
File.foreach('foo.txt') do |li|
next if li[0] == '!'
puts li
end
foreach is extremely fast and allows your code to handle any size file - "scalable" is the term. See "Why is "slurping" a file not a good practice?" for more information.
li[0] is a common idiom in Ruby to get the first character of a string. Again, it's very fast and is my favorite way to get there, however consider these tests:
require 'fruity'
STR = '!' + ('a'..'z').to_a.join # => "!abcdefghijklmnopqrstuvwxyz"
compare do
_slice { STR[0] == '!' }
_start_with { STR.start_with?('!') }
_regex { !!STR[/^!/] }
end
# >> Running each test 32768 times. Test will take about 2 seconds.
# >> _start_with is faster than _slice by 2x ± 1.0
# >> _slice is similar to _regex
Using start_with? (or its String end equivalent end_with?) is twice as fast and it looks like I'll be using start_with? and end_with? from now on.
Combine that with foreach and your code will have a decent chance of being fast and efficient.
See "What is the fastest way to compare the start or end of a String with a sub-string using Ruby?" for more information.

You can use string#start_with to find the lines that start with a particular string.
file = File.open('file.txt').read
file.each_line do |line|
unless line.start_with?('!')
print line
end
end
You can also check the index of the first character
unless line[0] === "!"
You can also do this with Regex
unless line.scan(/^!/).length

Related

Ruby: How to copy a word in a string containing a specific sequence of letters?

I am trying to read in a text file and iterate through every line. If the line contains "_u" then I want to copy that word in that line.
For example:
typedef struct {
reg 1;
reg 2;
} buffer_u;
I want to copy the word buffer_u.
This is what I have so far (everything up to how to copy the word in the string):
f_in = File.open( h_file )
test = h_file.read
text.each_line do |line|
if line.include? "_u"
# copy word
# add to output file
end
end
Thanks in advance for your help!
Don't make it harder than it has to be. If you want to scan a body of text for words that match a criteria, do just that:
text = "
word_u1
something
_u1 foo
bar _u2
another word_u2
typedef struct {
reg 1;
reg 2;
} buffer_u;
"
text.scan(/\w+/).select{ |w| w['_u'] }
# => ["word_u1", "_u1", "_u2", "word_u2", "buffer_u"]
Regex are useful but the more complex ("smarter") they are, they slower they run unless you are very careful to anchor them, as anchors give them hints on where to look. Without those, the engine tries a number of things to determine exactly what you want, and that can really bog down the processing.
I recommend instead simply grabbing the words in the text:
scan(/\w+/)
Then filtering out the ones that match:
select{ |w| w['_u'] }
Using select with a simple sub-string search w['_u'] is extremely fast.
It could probably run faster using split() instead of scan(/\w+/) but you'll have to deal with cleaning up non-word characters.
Note: \w means [a-zA-Z0-9_] so what we generally call a "word" character is actually a "variable" definition for most languages since words generally don't include digits or _.
You can probably reduce your code to:
File.read( h_file ).scan(/\w+/).select{ |w| w['_u'] }
That will return an array of matching words.
Caveat: Using read has scalability issues. If you're concerned about the size of the file being read (which you always should be) then use foreach and iterate over the file line-by-line. You will probably see no change in processing speed.
You can try something like this:
words = []
File.open( h_file ) { |file| file.each_line { |line|
words << line.split.find { |a| a =~ /_u/ }
}}
words.compact!
# => [["buffer_u"]]
puts words
# buffer_u
This regex should catch a word ending with _u
(\w*_u)(?!\w)
The matching group will match a word ending with _u not followed by letters digits or underscores.
If you want _u to appear anywhere in a word use
(\w*_u\w*)
See DEMO here.
This will return all such words in the file, even if there are two or more in a line:
r = /
\w* # match >= 0 word characters
_u # match string
\w* # match >= 0 word characters
/x # extended mode
File.read(fname).scan r
For example:
str = "Cat_u has 9 lives, \n!dog_u has none and \n pig_u_o and cow_u, 3."
fname = 'temp'
File.write(fname, str)
#=> 63
Confirm the file contents:
File.read(fname)
#=> "Cat_u has 9 lives, \n!dog_u has none and \n pig_u_o and cow_u, 3."
Extract strings:
File.read(fname).scan r
#=> ["Cat_u", "dog_u", "pig_u_o", "cow_u"]
It's not difficult to modify this code to return at most one string per line. Simply read the file into an array of lines (or read a line at a time) and execute s = line[r]; arr << s if s for each line, where r is the above regex.

Ruby: line by line match range

Is there a way to do the following Perl structure in Ruby?
while( my $line = $file->each_line() ) {
if($line =~ /first_line/ .. /end_line_I_care_about/) {
do_something;
# this will do something on a line per line basis on the range of the match
}
}
In ruby that would read something like:
file.each_line do |line|
if line.match(/first_line/) .. line.match(/end_line_I_care_about/)
do_something;
# this will only do it based on the first match not the range.
end
end
Reading the whole file into memory is not an option and I don't know how big is the chunk of the range.
EDIT:
Thanks for the answers, the answers I got where basically the same as the code I had in the first place. The problem I was having was " It can test the right operand and become false on the same evaluation it became true (as in awk), but it still returns true once."
"If you don't want it to test the right operand until the next evaluation, as in sed, just use three dots ("...") instead of two. In all other regards, "..." behaves just like ".." does."
I am marking the correct answer as the one that pointed me to see that '..' can be turn off in the same call it is made.
For reference the code I am using is:
file.each_line do |line|
if line.match(/first_line/) ... line.match(/end_line_I_care_about/)
do_something;
end
end
Yes, Ruby supports flip-flops:
str = "aaa
ON
bbb
OFF
cccc
ON
ddd
OFF
eee"
str.each_line do |line|
puts line if line =~ /ON/..line =~ /OFF/
#puts line if line.match(/ON/)..line.match(/OFF/) #works too
end
Output:
ON
bbb
OFF
ON
ddd
OFF
I'm not perfectly clear on the exact semantics of the Perl code, assuming you want exactly the same. Ruby does have something that looks and works similarly, or perhaps identically: a Range as a condition works as a toggle. The code you presented works exactly as I imagine you intend.
There are a few caveats, however:
Even after you reach the end condition, lines will keep being read until you reach the end of the file. This may be a performance consideration if you expect the end condition to be near the beginning of a large file.
The start condition can be triggered multiple times, flipping the "switch" back on, doing your do_something and testing for the end condition again. This may be fine if your condition is specific enough, or if you want that behavior, but it's something to be aware of.
The end condition can be called at the same time the start condition is called giving you true for just one line.
Here's an alternative:
started = false
file.each_line do |line|
started = true if line =~ /first_line_condition/
next unless started
do_something()
break if line =~ /last_line_condition/
end
That code reads each line of the file until the start condition is reached. Then it does whatever processing you like starting with that line until you reach a line that matches your end condition, at which point it breaks out of the loop, reading no more lines from the file.
This solution is the closest to your needs. It almost looks like Perl, but this valid Ruby (although the flip-flop operator is kind of discouraged).
The file is read line by line, it is not fully loaded in memory.
File.open("my_file.txt", "r").each_line do |line|
if (line =~ /first_line/) .. (line =~ /end_line_I_care_about/)
do_something
end
end
The parentheses are optional, but they improve readability.

Regular expression - Ruby vs Perl

I noticed some extreme delays in my Ruby (1.9) scripts and after some digging it boiled down to regular expression matching. I'm using the following test scripts in Perl and in Ruby:
Perl:
$fname = shift(#ARGV);
open(FILE, "<$fname" );
while (<FILE>) {
if ( /(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/ ) {
print "$1: $2\n";
}
}
Ruby:
f = File.open( ARGV.shift )
while ( line = f.gets )
if /(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/.match(line)
puts "#{$1}: #{$2}"
end
end
I use the same input for both scripts, a file with only 44290 lines.
The timing for each one is:
Perl:
xenofon#cpm:~/bin/local/project$ time ./try.pl input >/dev/null
real 0m0.049s
user 0m0.040s
sys 0m0.000s
Ruby:
xenofon#cpm:~/bin/local/project$ time ./try.rb input >/dev/null
real 1m5.106s
user 1m4.910s
sys 0m0.010s
I guess I'm doing something awfully stupid, any suggestions?
Thank you
regex = Regexp.new(/(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/)
f = File.open( ARGV.shift ).each do |line|
if regex .match(line)
puts "#{$1}: #{$2}"
end
end
Or
regex = Regexp.new(/(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/)
f = File.open( ARGV.shift )
f.each_line do |line|
if regex.match(line)
puts "#{$1}: #{$2}"
end
One possible difference is the amount of backtracking being performed. Perl might do a better job of pruning the search tree when backtracking (i.e. noticing when part of a pattern can't possibly match). Its regex engine is highly optimised.
First, adding a leading «^» could make a huge difference. If the pattern doesn't match starting at position 0, it's not going to match at starting position 1 either! So don't try to match at position 1.
Along the same lines, «.*?» isn't as limiting as you might think, and replacing each instance of it with a more limiting pattern could prevent a lot of backtracking.
Why don't you try:
/
^
(.*?) [ ]\|
(?:(?!SENDING[ ]REQUEST).)* SENDING[ ]REQUEST
(?:(?!TID=).)* TID=
([^,]*) ,
/x
(Not sure if it was safe to replace the first «.*?» with «[^|]», so I didn't.)
(At least for patterns that match a single string, (?:(?!PAT).) is to PAT as [^CHAR] is to CHAR.)
Using /s could possibly speed things up if «.» is allowed to match newlines, but I think it's pretty minor.
Using «\space» instead of «[space]» to match a space under /x might be slightly faster in Ruby. (They're the same in recent versions of Perl.) I used the latter because it's far more readable.
From the perlretut chapter: Using regular expressions in Perl section - "Search and replace"
(Even though the regular expression appears in a loop, Perl is smart enough to compile it only once.)
I don't know Ruby very good, but I suspect that it does compile the regex in each cycle.
(Try the code from LaGrandMere's answer to verfiy it).
Try using the (?>re) Extension. See Ruby-Documentation for Details, here a Quote:
This construct [..] inhibits backtracking, which can be a
performance enhancement. For example, the pattern /a.*b.*a/ takes
exponential time when matched against a string containing an a
followed by a number of bs, but with no trailing a. However,
this can be avoided by using a nested regular expression
/a(?>.*b).*a/.
File.open(ARGV.shift) do |f|
while line = f.gets
if /(.*?)(?> \|.*?SENDING REQUEST.*?TID=)(.*?),/.match(line)
puts "#{$1}: #{$2}"
end
end
end
Ruby:
File.open(ARGV.shift).each do |line|
if line =~ /(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/
puts "#{$1}: #{$2}"
end
end
Change match method to =~ operator. It is faster because:
(Ruby has Benchmark. I don't know your file content so I randomly typed something)
require 'benchmark'
def bm(n)
Benchmark.bm do |x|
x.report{n.times{"asdfajdfaklsdjfklajdklfj".match(/fa/)}}
x.report{n.times{"asdfajdfaklsdjfklajdklfj" =~ /fa/}}
x.report{n.times{/fa/.match("asdfajdfaklsdjfklajdklfj")}}
end
end
bm(100000)
Output report:
user system total real
0.141000 0.000000 0.141000 ( 0.140564)
0.047000 0.000000 0.047000 ( 0.046855)
0.125000 0.000000 0.125000 ( 0.124945)
The middle one is using =~. It takes less than 1/3 of others. Other two are using match method. So, use =~ in your code.
Regular expression matching is time-consuming compared to other forms of matching. Since you are expecting a long, static string in the middle of your matching lines, try filtering out lines that don't include that string by using relatively-cheap string operations. That should result in less that needs to go through regular expression parsing (depending on what your input looks like, of course).
f = File.open( ARGV.shift )
my_re = Regexp.new(/(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/)
while ( line = f.gets )
continue if line.index('SENDING REQUEST') == nil
if my_re.match(line)
puts "#{$1}: #{$2}"
end
end
f.close()
I haven't benchmarked this particular version since I don't have your input data. I have had success doing things like this in the past, though, especially with lengthy logfiles where pre-filtering can eliminate the vast majority of the input without running any regular expressions.

Search and replace multiple words in file via Ruby

Good afternoon!
I am pretty new to Ruby and want to code a basic search and replace function in Ruby.
When you call the function, you can pass parameters (search pattern, replacing word).
This works like this: multiedit(pattern1, replacement1, pattern2, replacement2, ...)
Now, I want my function to read a text file, search for pattern1 and replace it with replacement2, search for pattern2 and replace it with replacement2 and so on. Finally, the altered text should be written to another text file.
I've tried to do this with a until loop, but all I get is that only the very first pattern is replaced while all the following patterns are ignored (in this example, only apple is replaced with fruit). I think the problem is that I always reread the original unaltered text? But I can't figure out a solution. Can you help me? Calling the function the way I am doing it is important for me.
def multiedit(*_patterns)
return puts "Number of search patterns does not match number of replacement strings!" if (_patterns.length % 2 > 0)
f = File.open("1.txt", "r")
g = File.open("2.txt", "w")
i = 0
until i >= _patterns.length do
f.each_line {|line|
output = line.sub(_patterns[i], _patterns[i+1])
g.puts output
}
i+=2
end
f.close
g.close
end
multiedit("apple", "fruit", "tomato", "veggie", "steak", "meat")
Can you help me out?
Thank you very much in advance!
Regards
Your loop was kind of inside-out ... do this instead ...
f.each_line do |line|
_patterns.each_slice 2 do |a, b|
line.sub! a, b
end
g.puts line
end
Perhaps the most efficient way to evaluate all the patterns for every line is to build a single regexp from all the search patterns and use the hash replacement form of String#gsub
def multiedit *patterns
raise ArgumentError, "Number of search patterns does not match number of replacement strings!" if (_patterns.length % 2 != 0)
replacements = Hash[ *patterns ].
regexp = Regexp.new replacements.keys.map {|k| Regexp.quote(k) }.join('|')
File.open("2.txt", "w") do |out|
IO.foreach("1.txt") do |line|
out.puts line.gsub regexp, replacements
end
end
end
Easier and better method is to use erb.
http://apidock.com/ruby/ERB

Can using the ruby flip-flop as a filter be made less kludgy?

In order to get part of text, I'm using a true if kludge in front of a flip-flop:
desired_portion_lines = text.each_line.find_all do |line|
true if line =~ /start_regex/ .. line =~ /finish_regex/
end
desired_portion = desired_portion_lines.join
If I remove the true if bit, it complains
bad value for range (ArgumentError)
Is it possible to make it less kludgy, or should I merely do
desired_portion_lines = ""
text.each_line do |line|
desired_portion_lines << line if line =~ /start_regex/ .. line =~ /finish_regex/
end
Or is there a better approach that doesn't use enumeration?
if you are doing it line by line, my preference is something like this
line =~ /finish_regex/ && p=0
line =~ /start_regex/ && p=1
puts line if p
if you have all in one string. I would use split
mystring.split(/finish_regex/).each do |item|
if item[/start_regex/]
puts item.split(/start_regex/)[-1]
end
end
I think
desired_portion_lines = ""
text.each_line do |line|
desired_portion_lines << line if line =~ /start_regex/ .. line =~ /finish_regex/
end
is perfectly acceptable. The .. operator is very powerful, but not used by a lot of people, probably because they don't understand what it does. Possibly it looks weird or awkward to you because you're not used to using it, but it'll grow on you. It's very common in Perl when dealing with ranges of lines in text files, which is where I first encountered it, and eventually was using it a lot.
The only thing I'd do differently is add some parenthesis to visually separate the logical tests from each other, and from the rest of the line:
desired_portion_lines = ""
text.each_line do |line|
desired_portion_lines << line if ( (line =~ /start_regex/) .. (line =~ /finish_regex/) )
end
Ruby (and Perl) coders seem to abhor using parenthesis, but I consider them useful for visually separating the logic tests. For me it's a readability and, by extension, a maintenance thing.
The only other thing I can think of that might help, would be to change desired_portion_lines to an array, and push your selected lines onto it. Currently, using desired_portion_lines << line appends to the string, mutating it each time. It might be faster pushing on the array then joining its elements afterward to build your string.
Back to the first example. I didn't test this but I think you can simplify it to:
desired_portion = text.each_line.find_all { |line| line =~ /start_regex/ .. line =~ /finish_regex/ }.join
The only downside to iterating over all lines in a file using the flip-flop, is that if the start-pattern can occur multiple times, you'll get each found block added to desired_portion.
You can save three characters by replacing true if with !!() (with the flip flop belonging in between the parentheses).

Resources