Quick Advice: How should this be written in Ruby? - ruby

I am a Java/C++ programmer and Ruby is my first scripting language. I sometimes find that I am not using it as productively as I could in some areas, like this one for example:
Objective: to parse only certain lines from a file. The pattern I am going with is that there is one very large line with a size greater than 15, the rest are definitely smaller. I want to ignore all the lines before (and including) the large one.
def do_something(str)
puts str
end
str =
'ignore me
me too!
LARGE LINE ahahahahha its a line!
target1
target2
target3'
flag1 = nil
str.each_line do |line|
do_something(line) if flag1
flag1 = 1 if line.size > 15
end
I wrote this, but I think it could be written a lot better, i.e., there must be a better way than setting a flag. Recommendations for how to write beautiful lines of Ruby are also welcome.
Note/Clarification: I need to print ALL lines AFTER the first appearance of the LARGE LINE.

str.lines.drop_while {|l| l.length <= 15 }.drop(1).each {|l| do_something(l) }
I like this, because if you read it from left to right, it reads almost exactly like your original description:
Split the string into lines and drop lines of 15 characters or fewer. Then drop another line (i.e. the first one with more than 15 characters). Then do something with each remaining line.
You don't even need to necessarily understand Ruby, or programming at all to be able to verify whether this is correct.

require 'enumerator' # Not needed in Ruby 1.9
str.each_line.inject( false ) do |flag, line|
do_something( line ) if flag
flag || line.size > 15
end

lines = str.split($/)
start_index = 1 + lines.find_index {|l| l.size > 15 }
lines[start_index..-1].each do |l|
do_something(l)
end
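One difference between the approaches above is what happens when no line exceeds the threshold: the drop_while chain simply yields nothing, whereas find_index returns nil and 1 + nil raises a TypeError. A minimal sketch, assuming a hypothetical helper named lines_after_large and a made-up SAMPLE constant (neither is from the original posts):

```ruby
# Hypothetical helper: returns the lines after the first line longer than
# `limit`. Lines produced by String#lines keep their trailing "\n", so the
# newline counts toward the size.
def lines_after_large(str, limit = 15)
  str.lines.drop_while { |l| l.size <= limit }.drop(1)
end

SAMPLE = "ignore me\nme too!\nLARGE LINE ahahahahha its a line!\ntarget1\ntarget2\ntarget3"

lines_after_large(SAMPLE).each { |l| puts l }   # prints target1, target2, target3
p lines_after_large("short\nlines only\n")      # => []
```

Returning an empty array for the "no large line" case keeps the caller free of nil checks.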

Related

See if the beginning of a line matches a regex character

There are lines inside a file that contain !. I need all other lines. I only want to print lines within the file that do not start with an exclamation mark.
The line of code which I have written so far is:
unless parts.each_line.split("\n" =~ /^!/)
# other bit of nested code
end
But it doesn't work. How do I do it?
As a start I'd use:
File.foreach('foo.txt') do |li|
next if li[0] == '!'
puts li
end
foreach is extremely fast and allows your code to handle any size file - "scalable" is the term. See "Why is "slurping" a file not a good practice?" for more information.
li[0] is a common idiom in Ruby to get the first character of a string. Again, it's very fast and is my favorite way to get there, however consider these tests:
require 'fruity'
STR = '!' + ('a'..'z').to_a.join # => "!abcdefghijklmnopqrstuvwxyz"
compare do
_slice { STR[0] == '!' }
_start_with { STR.start_with?('!') }
_regex { !!STR[/^!/] }
end
# >> Running each test 32768 times. Test will take about 2 seconds.
# >> _start_with is faster than _slice by 2x ± 1.0
# >> _slice is similar to _regex
Using start_with? (or its end-of-string equivalent, end_with?) is twice as fast, so it looks like I'll be using start_with? and end_with? from now on.
Combine that with foreach and your code will have a decent chance of being fast and efficient.
See "What is the fastest way to compare the start or end of a String with a sub-string using Ruby?" for more information.
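Combining the two suggestions might look like the following. This is only a sketch: the helper name lines_without_bang and the temporary file are assumptions made for illustration, not part of the original answer.

```ruby
require 'tempfile'

# Hypothetical helper: streams the file line by line with File.foreach and
# keeps only the lines that do not start with "!".
def lines_without_bang(path)
  File.foreach(path).reject { |li| li.start_with?('!') }
end

# Demo on a throwaway file so the example is self-contained.
Tempfile.create('demo') do |f|
  f.write("keep 1\n!skip\nkeep 2\n")
  f.flush
  puts lines_without_bang(f.path)
end
```

Note that reject still collects the kept lines into an array; for very large files, passing a block to File.foreach and handling each line as it arrives avoids that.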
You can use String#start_with? to find the lines that start with a particular string.
file = File.read('file.txt')
file.each_line do |line|
unless line.start_with?('!')
print line
end
end
You can also check the first character by index:
unless line[0] == "!"
You can also do this with a regex:
unless line =~ /^!/

How far does .each read? To the end of the line?

Sorry for the newbie question. Was loading a .txt file into the following code:
line_count = 0
File.open("text.txt").each {|line| line_count += 1}
puts line_count
Does .each simply read until the end of a line before passing its value to the code block? Little explanation would be great. Thanks!
You can use .each_line to be more explicit, but yes, each reads a line (see http://www.ruby-doc.org/core-2.0.0/IO.html#method-i-each).
f = File.new("testfile")
f.each {|line| puts "#{f.lineno}: #{line}" }
It's really important to read the documentation, because all sorts of things are explained there. For instance, the documentation for each says:
Executes the block for every line in ios, where lines are separated by sep.
sep defaults to the value of the special $/ global variable, the input record separator, which is "\n" unless you change it. You can tell Ruby to use a different value for the line-end/separator if you know the file uses something else.
Regarding your code:
I'd do it this way:
line_count = 0
File.foreach("text.txt") do |line|
line_count += 1
end
puts line_count
foreach is very self-explanatory, which is important when writing code. You want it to be self-documenting as much as possible. foreach iterates over "each" line in the file. It also assumes the line-ends are the same as $/, but you can force it to be something different, perhaps the letter "z" or "." or " ", depending on your whim and fancy at the moment.
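To illustrate forcing a different separator, here is a small sketch. The helper name records and the temporary file contents are assumptions for the example; the second argument to File.foreach is the separator string.

```ruby
require 'tempfile'

# Hypothetical helper: reads a file as records split on a custom separator.
def records(path, sep)
  File.foreach(path, sep).to_a
end

Tempfile.create('demo') do |f|
  f.write("one.two.three")
  f.flush
  p records(f.path, ".") # => ["one.", "two.", "three"]
end
```

Each record keeps its trailing separator, just as lines normally keep their trailing "\n".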

Ruby: line by line match range

Is there a way to do the following Perl structure in Ruby?
while( my $line = $file->each_line() ) {
if($line =~ /first_line/ .. /end_line_I_care_about/) {
do_something;
# this will do something on a line per line basis on the range of the match
}
}
In ruby that would read something like:
file.each_line do |line|
if line.match(/first_line/) .. line.match(/end_line_I_care_about/)
do_something;
# this will only do it based on the first match not the range.
end
end
Reading the whole file into memory is not an option and I don't know how big is the chunk of the range.
EDIT:
Thanks for the answers, the answers I got were basically the same as the code I had in the first place. The problem I was having was " It can test the right operand and become false on the same evaluation it became true (as in awk), but it still returns true once."
"If you don't want it to test the right operand until the next evaluation, as in sed, just use three dots ("...") instead of two. In all other regards, "..." behaves just like ".." does."
I am marking the correct answer as the one that pointed me to see that '..' can be turned off in the same call in which it is turned on.
For reference the code I am using is:
file.each_line do |line|
if line.match(/first_line/) ... line.match(/end_line_I_care_about/)
do_something;
end
end
Yes, Ruby supports flip-flops:
str = "aaa
ON
bbb
OFF
cccc
ON
ddd
OFF
eee"
str.each_line do |line|
puts line if line =~ /ON/..line =~ /OFF/
#puts line if line.match(/ON/)..line.match(/OFF/) #works too
end
Output:
ON
bbb
OFF
ON
ddd
OFF
I'm not perfectly clear on the exact semantics of the Perl code, assuming you want exactly the same. Ruby does have something that looks and works similarly, or perhaps identically: a Range as a condition works as a toggle. The code you presented works exactly as I imagine you intend.
There are a few caveats, however:
Even after you reach the end condition, lines will keep being read until you reach the end of the file. This may be a performance consideration if you expect the end condition to be near the beginning of a large file.
The start condition can be triggered multiple times, flipping the "switch" back on, doing your do_something and testing for the end condition again. This may be fine if your condition is specific enough, or if you want that behavior, but it's something to be aware of.
The end condition can be called at the same time the start condition is called giving you true for just one line.
Here's an alternative:
started = false
file.each_line do |line|
started = true if line =~ /first_line_condition/
next unless started
do_something()
break if line =~ /last_line_condition/
end
That code reads each line of the file until the start condition is reached. Then it does whatever processing you like starting with that line until you reach a line that matches your end condition, at which point it breaks out of the loop, reading no more lines from the file.
This solution is the closest to your needs. It almost looks like Perl, but this is valid Ruby (although the flip-flop operator is kind of discouraged).
The file is read line by line; it is not fully loaded into memory.
File.foreach("my_file.txt") do |line|
if (line =~ /first_line/) .. (line =~ /end_line_I_care_about/)
do_something
end
end
The parentheses are optional, but they improve readability.
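The difference between the two-dot and three-dot forms quoted in the question's edit can be sketched as follows. The helper names and the LINES sample are made up for illustration; the second element deliberately matches both the start and the end pattern.

```ruby
# Hypothetical helpers for comparing the two flip-flop forms.
def select_two_dot(lines)
  # ".." tests the end condition on the same line that switched it on
  lines.select { |line| true if (line =~ /START/)..(line =~ /END/) }
end

def select_three_dot(lines)
  # "..." waits until the next line before testing the end condition
  lines.select { |line| true if (line =~ /START/)...(line =~ /END/) }
end

LINES = ["a", "START END", "b", "END", "c"]

p select_two_dot(LINES)   # => ["START END"]
p select_three_dot(LINES) # => ["START END", "b", "END"]
```

With two dots the range opens and closes on the same line, so only that line is selected; with three dots it stays open until a later line matches the end pattern.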

Can using the ruby flip-flop as a filter be made less kludgy?

In order to get part of text, I'm using a true if kludge in front of a flip-flop:
desired_portion_lines = text.each_line.find_all do |line|
true if line =~ /start_regex/ .. line =~ /finish_regex/
end
desired_portion = desired_portion_lines.join
If I remove the true if bit, it complains
bad value for range (ArgumentError)
Is it possible to make it less kludgy, or should I merely do
desired_portion_lines = ""
text.each_line do |line|
desired_portion_lines << line if line =~ /start_regex/ .. line =~ /finish_regex/
end
Or is there a better approach that doesn't use enumeration?
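If you are doing it line by line, my preference is something like this, with printing set to false before the loop starts:
printing = false if line =~ /finish_regex/ # use true/false, since 0 is truthy in Ruby
printing = true if line =~ /start_regex/
puts line if printing
If you have it all in one string, I would use split: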
mystring.split(/finish_regex/).each do |item|
if item[/start_regex/]
puts item.split(/start_regex/)[-1]
end
end
I think
desired_portion_lines = ""
text.each_line do |line|
desired_portion_lines << line if line =~ /start_regex/ .. line =~ /finish_regex/
end
is perfectly acceptable. The .. operator is very powerful, but not used by a lot of people, probably because they don't understand what it does. Possibly it looks weird or awkward to you because you're not used to using it, but it'll grow on you. It's very common in Perl when dealing with ranges of lines in text files, which is where I first encountered it, and eventually was using it a lot.
The only thing I'd do differently is add some parentheses to visually separate the logical tests from each other, and from the rest of the line:
desired_portion_lines = ""
text.each_line do |line|
desired_portion_lines << line if ( (line =~ /start_regex/) .. (line =~ /finish_regex/) )
end
Ruby (and Perl) coders seem to abhor using parentheses, but I consider them useful for visually separating the logic tests. For me it's a readability and, by extension, a maintenance thing.
The only other thing I can think of that might help, would be to change desired_portion_lines to an array, and push your selected lines onto it. Currently, using desired_portion_lines << line appends to the string, mutating it each time. It might be faster pushing on the array then joining its elements afterward to build your string.
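A sketch of the array-then-join variant, assuming a hypothetical helper named extract_portion that takes the two patterns as arguments. Whether it is actually faster would need to be measured on your data.

```ruby
# Hypothetical helper: collects the lines between the start and finish
# patterns (inclusive) in an array, then joins them once at the end.
def extract_portion(text, start_re, finish_re)
  selected = []
  text.each_line do |line|
    selected << line if (line =~ start_re)..(line =~ finish_re)
  end
  selected.join
end

puts extract_portion("a\nSTART\nb\nEND\nc\n", /START/, /END/)
# prints START, b and END on separate lines
```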
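Back to the first example. The flip-flop is only recognized inside a conditional, which is why the block needs the true if; without it Ruby tries to build an ordinary Range and raises the ArgumentError you saw. You can still chain the join onto it:
desired_portion = text.each_line.find_all { |line| true if (line =~ /start_regex/) .. (line =~ /finish_regex/) }.join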
The only downside to iterating over all lines in a file using the flip-flop, is that if the start-pattern can occur multiple times, you'll get each found block added to desired_portion.
You can save three characters by replacing true if with !!() (with the flip flop belonging in between the parentheses).

Ruby: Length of a line of a file in bytes?

I'm writing this little HelloWorld as a followup to this and the numbers do not add up
filename = "testThis.txt"
total_bytes = 0
file = File.new(filename, "r")
file.each do |line|
total_bytes += line.unpack("U*").length
end
puts "original size #{File.size(filename)}"
puts "Total bytes #{total_bytes}"
The result is not the same as the file size. I think I just need to know what format I need to plug in... or maybe I've missed the point entirely. How can I measure the file size line by line?
Note: I'm on Windows, and the file is encoded as type ANSI.
Edit: This produces the same results!
filename = "testThis.txt"
total_bytes = 0
file = File.new(filename, "r")
file.each_byte do |whatever|
total_bytes += 1
end
puts "Original size #{File.size(filename)}"
puts "Total bytes #{total_bytes}"
so anybody who can help now...
IO#gets works the same as if you were capturing input from the command line: the "Enter" isn't sent as part of the input; neither is it passed when #gets is called on a File or other subclass of IO, so the numbers are definitely not going to match up.
See the relevant Pickaxe section
May I enquire why you're so concerned about the line lengths summing to the file size? You may be solving a harder problem than is necessary...
Aha. I think I get it now.
Lacking a handy iPod (or any other sort, for that matter), I don't know if you want exactly 4K chunks, in which case IO#read(4000) would be your friend (4000 or 4096?) or if you're happier to break by line, in which case something like this ought to work:
class Chunkifier
def Chunkifier.to_chunks(path)
chunks, current_chunk_size = [""], 0
File.readlines(path).each do |line|
line.chomp! # strips off \n, \r or \r\n depending on OS
if chunks.last.size + line.size >= 4_000 # 4096?
chunks.last.chomp! # remove last line terminator
chunks << ""
end
chunks.last << line + "\n" # or whatever terminator you need
end
chunks
end
end
if __FILE__ == $0
require 'test/unit'
class TestFile < Test::Unit::TestCase
def test_chunking
chs = Chunkifier.to_chunks(PATH)
chs.each do |chunk|
assert 4_000 >= chunk.size, "chunk is #{chunk.size} bytes long"
end
end
end
end
Note the use of IO#readlines to get all the text in one slurp: #each or #each_line would do as well. I used String#chomp! to ensure that whatever the OS is doing, the bytes at the end are removed, so that \n or whatever can be forced into the output.
I would suggest using File#write, rather than #print or #puts for the output, as the latter have a tendency to deliver OS-specific newline sequences.
If you're really concerned about multi-byte characters, consider taking the each_byte or unpack("C*") options and monkey-patching String, something like this:
class String
def size_in_bytes
self.unpack("C*").size
end
end
The unpack version is about 8 times faster than the each_byte one on my machine, btw.
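On Ruby 1.9 and later, String#bytesize reports the byte count directly, so the monkey-patch may not be needed. A minimal sketch (the TEXT constant is just an illustrative value):

```ruby
TEXT = "h\u00e9llo" # "héllo"; the é occupies two bytes in UTF-8

puts TEXT.length            # => 5 (characters)
puts TEXT.bytesize          # => 6 (bytes)
puts TEXT.unpack("C*").size # => 6, same count as bytesize
```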
You might try IO#each_byte, e.g.
total_bytes = 0
file_name = "test_this.txt"
File.open(file_name, "r") do |file|
file.each_byte {|b| total_bytes += 1}
end
puts "Original size #{File.size(file_name)}"
puts "Total bytes #{total_bytes}"
That, of course, doesn't give you a line at a time. Your best option for that is probably to go through the file via each_byte until you encounter \r\n. The IO class provides a bunch of pretty low-level read methods that might be helpful.
You potentially have several overlapping issues here:
Linefeed characters \r\n vs. \n (as per your previous post). Also EOF file character (^Z)?
Definition of "size" in your problem statement: do you mean "how many characters" (taking into account multi-byte character encodings) or do you mean "how many bytes"?
Interaction of the $KCODE global variable (deprecated in ruby 1.9. See String#encoding and friends if you're running under 1.9). Are there, for example, accented characters in your file?
Your format string for #unpack. I think you want C* here if you really want to count bytes.
Note also the existence of IO#each_line (just so you can throw away the while and be a little more ruby-idiomatic ;-)).
The issue is that when you save a text file on Windows, your line breaks are two characters (characters 13 and 10) and therefore 2 bytes; when you save it on Linux there is only 1 (character 10). However, Ruby reports both of these as a single character '\n' (character 10). What's worse, if you're on Linux with a Windows file, Ruby will give you both characters.
So, if you know that your files are always coming from windows text files and executed on windows, every time you get a newline character you can add 1 to your count. Otherwise it's a couple of conditionals and a little state machine.
BTW there's no EOF 'character'.
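One way to sidestep the newline translation described above is to open the file in binary mode, so that every byte, including any "\r", is passed through untouched. The following is a sketch under that assumption; total_line_bytes is a hypothetical helper name and the file contents are made up.

```ruby
require 'tempfile'

# Hypothetical helper: sums the byte length of every line, terminators
# included, reading in binary mode so "\r\n" is not translated.
def total_line_bytes(path)
  File.open(path, "rb") do |file|
    file.each_line.sum { |line| line.bytesize }
  end
end

Tempfile.create('demo') do |f|
  f.binmode
  f.write("first\r\nsecond\r\n") # Windows-style line endings
  f.flush
  puts total_line_bytes(f.path) == File.size(f.path) # => true
end
```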
f = File.new("log.txt")
begin
  while (line = f.readline)
    line = line.chomp # chomp returns a new string, so keep the result
    puts line.length
  end
rescue EOFError
  f.close
end
Here is a simple solution, presuming that the current file pointer is set to the start of a line in the read file:
last_pos = file.pos
next_line = file.gets
current_pos = file.pos
backup_dist = last_pos - current_pos
file.seek(backup_dist, IO::SEEK_CUR)
In this example "file" is the file from which you are reading. To do this in a loop, skip the seek so that each pass advances to the next line, and take the difference between positions as the line's size:
last_pos = file.pos
while (next_line = file.gets)
  current_pos = file.pos
  line_bytes = current_pos - last_pos # bytes in this line, terminator included
  last_pos = current_pos
end

Resources