File comparison Ran out of memory - ruby

Hi, I am using this code, but when the two text files passed as input have more than 8 million lines, it runs out of memory. How can I compare two text files that may each have more than 30 million lines?
fileA1 = ARGV[0]
fileA2 = ARGV[1]
if ARGV.length != 2
raise 'Send Two files pls'
end
cmd = "sort #{fileA1} > Sorted.txt"
`#{cmd}`
aFile = File.open("Sorted.txt", "r");
bFile = File.open(fileA2, "r").readlines;
fileR = File.open("result.txt", "w")
p aFile.class
p bFile.class
p bFile.length
aFile.each do |e|
if(! bFile.include?(e) )
p 'Able to get differences:' + e.to_s
fileR.write('Does not Include:' + e)
end
end
Additional code I tried, without luck:
counterA = counterB = 0
aFile = File.open("Sample1 - Copy.txt", "r");
bFile = File.open("Sample2.txt", "r");
file1lines = aFile.readlines
file2lines = bFile.readlines
file1lines.each do |e|
if(!file2lines.include?(e))
puts e
else
p "Files include these lines:"
end
end
stopTime = Time.now

As a starting point, I would use the diff Unix command (available on Windows as part of Cygwin, etc), and see if that addresses your need:
#!/usr/bin/env ruby
raise "Syntax is comp_files file1 file2" unless ARGV.length == 2
file1, file2 = ARGV
`sort #{file1} > file1_sorted.txt`
`sort #{file2} > file2_sorted.txt`
`diff file1_sorted.txt file2_sorted.txt > diff.txt 2>&1`
puts 'Created diff.txt.' # After running the script, view it w/less, etc.
Here is a similar script that uses temporary files that are automatically deleted before exiting:
#!/usr/bin/env ruby
raise "Syntax is comp_files file1 file2" unless ARGV.length == 2
require 'tempfile'
input_file1, input_file2 = ARGV
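# Keep references to the Tempfile objects so the files are not garbage-collected (and deleted) early
sorted_tempfile1 = Tempfile.new('comp_files_sorted_1')
sorted_tempfile2 = Tempfile.new('comp_files_sorted_2')
sorted_file1, sorted_file2 = sorted_tempfile1.path, sorted_tempfile2.path
puts [sorted_file1, sorted_file2]
`sort #{input_file1} > #{sorted_file1}`
`sort #{input_file2} > #{sorted_file2}`
`diff #{sorted_file1} #{sorted_file2} > diff.txt 2>&1`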
puts 'Created diff.txt.' # After running the script, view it w/less, etc.
# The code below can be used to create sample input files
# File.write('input1.txt', (('a'..'j').to_a.shuffle + %w(s y)).join("\n"))
# File.write('input2.txt', (('a'..'j').to_a.shuffle + %w(s t z)).join("\n"))

I believe your problem is with readlines. This method reads the entire file and returns an array containing every line. Since your file is huge, you risk running out of memory.
To work with large files, don't read the entire contents at once, but read in pieces as needed.
Also, your algorithm has another problem since the comparison really checks whether all lines in aFile are included in bFile without actually checking for order at all. I'm not sure if that is indeed your intent.
If you really want to compare line by line and if order matters, then your comparison should be line-by-line and you don't have to read the entire file into a string. Use the gets method instead, which by default returns the next line in a file or nil at EOF.
Something like this:
aFile.each do |e|
if e != bFile.gets
p 'Able to get differences:' + e.to_s
fileR.write('Does not Include:' + e)
end
end
On the other hand, if you really want to find whether all lines in a are in b, regardless of order, you can use a nested loop where, for each line in a, you iterate over the lines of b. Make sure to stop at the first match to speed things up, since this will be a really expensive operation. The include? call is also expensive, though, so it's probably a tie IMO, apart from the file I/O overhead.
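When order does not matter, two files that have already been sorted (for example by the sort command used in the question) can also be compared in a single streaming pass, so neither file has to be loaded with readlines. The following is a minimal sketch, not code from the answers above: the method name each_line_only_in_first is made up for illustration, and it assumes both inputs were sorted in plain byte order (for example with LC_ALL=C sort), so that Ruby's string comparison agrees with the sort order.

```ruby
require 'stringio'

# Hypothetical helper: yields every line of sorted_a that does not occur in sorted_b.
# Assumption: both IO objects deliver lines sorted in the same bytewise order.
def each_line_only_in_first(sorted_a, sorted_b)
  return enum_for(:each_line_only_in_first, sorted_a, sorted_b) unless block_given?

  b_line = sorted_b.gets&.chomp
  sorted_a.each_line do |raw|
    a_line = raw.chomp
    # Advance b until its current line is no longer smaller than a's line.
    b_line = sorted_b.gets&.chomp while b_line && b_line < a_line
    yield a_line unless a_line == b_line
  end
end

# Small in-memory demonstration; real use would pass File objects instead.
a = StringIO.new("a\nb\nd\ne\n")
b = StringIO.new("b\nc\ne\n")
only_in_a = each_line_only_in_first(a, b).to_a
p only_in_a # prints ["a", "d"]
```

Only one line from each file is held in memory at a time, so memory use should stay flat regardless of file size; the cost moves into the external sort.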

Here is a script that will analyze 2 text files, reporting the first difference, or a difference in the number of lines, or success.
NOTE: THE CODE HERE IS TRUNCATED. PLEASE GO TO https://gist.github.com/keithrbennett/1d043fdf7b685d9692f0181ad68c6307 FOR THE COMPLETE SCRIPT!
#!/usr/bin/env ruby
raise "Syntax is first_diff file1 file2" unless ARGV.size == 2
FILE1, FILE2 = ARGV
ENUM1 = File.new(FILE1).to_enum
ENUM2 = File.new(FILE2).to_enum
def build_unequal_error_message(line_num, line1, line2)
"Difference found at line #{line_num}:
#{FILE1}: #{line1}
#{FILE2}: #{line2}"
end
def build_unequal_line_count_error_message(line_count, file_exhausted)
"All lines up to line #{line_count} were identical, " \
"but #{file_exhausted} has no more text lines."
end
def get_line(file_enumerator)
file_enumerator.next.chomp
end
def has_next(enumerator)
begin
enumerator.peek
true
rescue StopIteration
false
end
end
# Returns an analysis of the results in the form of a string
# if a compare error occurred, else returns nil.
def error_text_or_nil
line_num = 0
loop do
has1 = has_next(ENUM1)
has2 = has_next(ENUM2)
case
when has1 && has2
line1 = get_line(ENUM1)
line2 = get_line(ENUM2)
if line1 != line2
return build_unequal_error_message(line_num, line1, line2)
end
when !has1 && !has2
return nil # if both have no more values, we're done
else # only 1 enum has been exhausted
exhausted_file = has1 ? FILE2 : FILE1
not_exhausted_file = exhausted_file == FILE1 ? FILE2 : FILE1
return build_unequal_line_count_error_message(line_num, exhausted_file)
end
line_num += 1
end
puts "Lines processed successfully: #{line_num}"
end
result = error_text_or_nil
if result
puts result
exit -1
else
puts "Compare successful"
exit 0
end

Related

How to speed up Ruby script? Or shell script alternative?

I have a Ruby script that does the following to a text file:
removes non-ASCII lines
removes lines containing "::" (two colons in a row)
if there is more than one ":" present in the line (which aren't directly next to each other), it only keeps the strings on both sides of the last colon.
removes leading whitespace
removes unusual control characters
The problem is, I'm working with files that have ~20 million lines, and my script says it'll take ~45 minutes to run.
Is there a way to majorly speed this up? Or, is there a significantly quicker way to handle this in shell?
require 'ruby-progressbar'
class String
def strip_control_characters()
chars.each_with_object("") do |char, str|
str << char unless char.ascii_only? and (char.ord < 32 or char.ord == 127)
end
end
def strip_control_and_extended_characters()
chars.each_with_object("") do |char, str|
str << char if char.ascii_only? and char.ord.between?(32,126)
end
end
end
class Numeric
def percent_of(n)
self.to_f / n.to_f * 100.0
end
end
def clean(file_in,file_out)
if !File.exist?(file_in)
puts "File '#{file_in}' does not exist."
return
end
File.delete(file_out) if File.exist?(file_out)
`touch #{file_out}`
deleted = 0
count = 0
line_count = `wc -l "#{file_in}"`.strip.split(' ')[0].to_i
puts "File has #{line_count} lines. Cleaning..."
progressbar = ProgressBar.create(total: line_count, length: 100, format: 'Progress |%B| %a %e')
IO.foreach(file_in) {|x|
if x.ascii_only?
line = x.strip_control_and_extended_characters.strip
if line == ""
deleted += 1
next
end
if line.include?("::")
deleted += 1
next
end
split = line.split(":")
c = split.count
if c == 1
deleted += 1
next
end
if c > 2
line = split.last(2).join(":")
end
if line != ""
File.open(file_out, 'a') { |f| f.puts(line) }
else
deleted += 1
end
else
deleted += 1
end
progressbar.progress += 1
}
puts "Deleted #{deleted} lines."
end
Here is one of your big problems:
if line != ""
File.open(file_out, 'a') { |f| f.puts(line) }
end
So your program needs to open and close the output file millions of times because it is doing that for every single line. Each time it opens it, since it is being opened in append mode, your system might have to do a lot of work to find the end of the file.
You should really change your program to open the output file once at the beginning and only close it at the end. Also, run strace to see what your Ruby I/O operations are doing behind the scenes; it should buffer up the writes and then send them to the OS in blocks of about 4 kilobytes at a time; it shouldn't issue a write system call for every single line.
To further improve the performance, you should use a Ruby profiling tool to see which functions are taking the most time.
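The open-once advice can be sketched roughly as follows. This is a simplified illustration, not the asker's exact logic: the method name clean_stream is hypothetical, it takes already-opened IO objects so the output is opened exactly once by the caller, and it treats any line with fewer than two colon-separated parts as deleted.

```ruby
require 'stringio'

# Hypothetical helper: reads lines from `input`, writes cleaned lines to `output`,
# and returns the number of dropped lines. `output` is opened once, by the caller.
def clean_stream(input, output)
  deleted = 0
  input.each_line do |raw|
    # Non-ASCII lines are dropped; otherwise keep printable ASCII only and trim.
    line = raw.ascii_only? ? raw.delete("^\x20-\x7E").strip : ""
    parts = line.split(":")
    if line.empty? || line.include?("::") || parts.size < 2
      deleted += 1
      next
    end
    # With more than one colon, keep only the two parts around the last colon.
    line = parts.last(2).join(":") if parts.size > 2
    output.puts(line)
  end
  deleted
end

# In-memory demonstration; StringIO stands in for real files here.
input  = StringIO.new("user:pass\nfoo\na::b\nx:y:z\n  lead:ing\n")
output = StringIO.new
deleted = clean_stream(input, output)
```

With real files, the same method could be called as File.open(file_out, 'w') { |out| File.open(file_in) { |inp| clean_stream(inp, out) } }, which opens and closes the output file a single time and lets Ruby buffer the writes.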
You can improve the speed by changing your String additions to variations on:
class String
def strip_control_characters()
gsub(/[[:cntrl:]]+/, '')
end
def strip_control_and_extended_characters()
strip_control_characters.gsub(/[^[:ascii:]]+/, '')
end
end
str = (0..255).to_a.map { |b| b.chr }.join # => "\x00\x01\x02\x03\x04\x05\x06\a\b\t\n\v\f\r\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\e\x1C\x1D\x1E\x1F !\"\#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7F\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F\xA0\xA1\xA2\xA3\xA4\xA5\xA6\xA7\xA8\xA9\xAA\xAB\xAC\xAD\xAE\xAF\xB0\xB1\xB2\xB3\xB4\xB5\xB6\xB7\xB8\xB9\xBA\xBB\xBC\xBD\xBE\xBF\xC0\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD7\xD8\xD9\xDA\xDB\xDC\xDD\xDE\xDF\xE0\xE1\xE2\xE3\xE4\xE5\xE6\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7\xF8\xF9\xFA\xFB\xFC\xFD\xFE\xFF"
str.strip_control_characters
# => " !\"\#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F\xA0\xA1\xA2\xA3\xA4\xA5\xA6\xA7\xA8\xA9\xAA\xAB\xAC\xAD\xAE\xAF\xB0\xB1\xB2\xB3\xB4\xB5\xB6\xB7\xB8\xB9\xBA\xBB\xBC\xBD\xBE\xBF\xC0\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD7\xD8\xD9\xDA\xDB\xDC\xDD\xDE\xDF\xE0\xE1\xE2\xE3\xE4\xE5\xE6\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7\xF8\xF9\xFA\xFB\xFC\xFD\xFE\xFF"
str.strip_control_and_extended_characters
# => " !\"\#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
Use the built-in gsub method along with the POSIX character-sets instead of iterating over the strings and testing each character.
As @Myst said though, monkey-patching is rude. Use refinements, or create some methods and pass in the string:
def strip_control_characters(str)
str.gsub(/[[:cntrl:]]+/, '')
end
def strip_control_and_extended_characters(str)
strip_control_characters(str).gsub(/[^[:ascii:]]+/, '')
end
str = (0..255).to_a.map { |b| b.chr }.join # => "\x00\x01\x02\x03\x04\x05\x06\a\b\t\n\v\f\r\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\e\x1C\x1D\x1E\x1F !\"\#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7F\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F\xA0\xA1\xA2\xA3\xA4\xA5\xA6\xA7\xA8\xA9\xAA\xAB\xAC\xAD\xAE\xAF\xB0\xB1\xB2\xB3\xB4\xB5\xB6\xB7\xB8\xB9\xBA\xBB\xBC\xBD\xBE\xBF\xC0\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD7\xD8\xD9\xDA\xDB\xDC\xDD\xDE\xDF\xE0\xE1\xE2\xE3\xE4\xE5\xE6\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7\xF8\xF9\xFA\xFB\xFC\xFD\xFE\xFF"
strip_control_characters(str)
# => " !\"\#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F\xA0\xA1\xA2\xA3\xA4\xA5\xA6\xA7\xA8\xA9\xAA\xAB\xAC\xAD\xAE\xAF\xB0\xB1\xB2\xB3\xB4\xB5\xB6\xB7\xB8\xB9\xBA\xBB\xBC\xBD\xBE\xBF\xC0\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD7\xD8\xD9\xDA\xDB\xDC\xDD\xDE\xDF\xE0\xE1\xE2\xE3\xE4\xE5\xE6\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7\xF8\xF9\xFA\xFB\xFC\xFD\xFE\xFF"
strip_control_and_extended_characters(str)
# => " !\"\#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
Moving on...
`touch #{file_out}`
is a problem too. You're creating a sub-shell every time that runs, executing touch, then tearing it down, which is a slow operation. Let Ruby do it:
=== Implementation from FileUtils
------------------------------------------------------------------------------
touch(list, noop: nil, verbose: nil, mtime: nil, nocreate: nil)
------------------------------------------------------------------------------
Updates modification time (mtime) and access time (atime) of file(s) in list.
Files are created if they don't exist.
FileUtils.touch 'timestamp'
FileUtils.touch Dir.glob('*.c'); system 'make'
Finally, learn to benchmark code as you develop. Take the time to think of a couple ways to do something, then test them against each other and find out which is the fastest. I use Fruity, because it handles issues that the Benchmark class doesn't, but do one or the other. You can find a lot of tests I did here for various things by searching SO for my user and "benchmark".
require 'fruity'
class String
def strip_control_characters()
chars.each_with_object("") do |char, str|
str << char unless char.ascii_only? and (char.ord < 32 or char.ord == 127)
end
end
def strip_control_and_extended_characters()
chars.each_with_object("") do |char, str|
str << char if char.ascii_only? and char.ord.between?(32,126)
end
end
end
def strip_control_characters2(str)
str.gsub(/[[:cntrl:]]+/, '')
end
def strip_control_and_extended_characters2(str)
strip_control_characters2(str).gsub(/[^[:ascii:]]+/, '')
end
str = (0..255).to_a.map { |b| b.chr }.join
str.strip_control_characters # => " !\"\#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F\xA0\xA1\xA2\xA3\xA4\xA5\xA6\xA7\xA8\xA9\xAA\xAB\xAC\xAD\xAE\xAF\xB0\xB1\xB2\xB3\xB4\xB5\xB6\xB7\xB8\xB9\xBA\xBB\xBC\xBD\xBE\xBF\xC0\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD7\xD8\xD9\xDA\xDB\xDC\xDD\xDE\xDF\xE0\xE1\xE2\xE3\xE4\xE5\xE6\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7\xF8\xF9\xFA\xFB\xFC\xFD\xFE\xFF"
strip_control_characters2(str) # => " !\"\#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F\xA0\xA1\xA2\xA3\xA4\xA5\xA6\xA7\xA8\xA9\xAA\xAB\xAC\xAD\xAE\xAF\xB0\xB1\xB2\xB3\xB4\xB5\xB6\xB7\xB8\xB9\xBA\xBB\xBC\xBD\xBE\xBF\xC0\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD7\xD8\xD9\xDA\xDB\xDC\xDD\xDE\xDF\xE0\xE1\xE2\xE3\xE4\xE5\xE6\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7\xF8\xF9\xFA\xFB\xFC\xFD\xFE\xFF"
str.strip_control_and_extended_characters # => " !\"\#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
strip_control_and_extended_characters2(str) # => " !\"\#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
compare do
scc { str.strip_control_characters }
scc2 { strip_control_characters2(str) }
end
# >> Running each test 512 times. Test will take about 1 second.
# >> scc2 is faster than scc by 10x ± 1.0
and:
compare do
scec { str.strip_control_and_extended_characters }
scec2 { strip_control_and_extended_characters2(str) }
end
# >> Running each test 256 times. Test will take about 1 second.
# >> scec2 is faster than scec by 5x ± 1.0
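If Fruity is not installed, the Benchmark module from Ruby's standard library gives a rougher version of the same comparison. This is only a sketch: the method names and the sample string are made up for illustration, the timings vary by machine, and no particular speedup is implied by it.

```ruby
require 'benchmark'

# Character-by-character version, similar in spirit to the original monkey patch.
def strip_by_chars(str)
  str.each_char.with_object(+"") do |char, out|
    out << char unless char.ord < 32 || char.ord == 127
  end
end

# Regex version using the POSIX [[:cntrl:]] class.
def strip_by_gsub(str)
  str.gsub(/[[:cntrl:]]+/, '')
end

# Assumed sample: all 7-bit ASCII characters, repeated to make the work measurable.
sample = (0..127).map(&:chr).join * 200

chars_time = Benchmark.realtime { 50.times { strip_by_chars(sample) } }
gsub_time  = Benchmark.realtime { 50.times { strip_by_gsub(sample) } }

puts format("each_char: %.4fs, gsub: %.4fs", chars_time, gsub_time)
```

Benchmark.realtime only measures wall-clock time for one run of the block, so it is worth repeating the measurement a few times before drawing conclusions.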
There seem to be only two possible approaches to optimizing this:
Concurrency.
If your machine is a Unix/Linux based machine that has a multi-core CPU, you can take advantage of the multi-cores by using fork, dividing up the work between different processes.
Multi-threading might not work as well as you'd expect with Ruby, since there's a GIL (Global Interpreter Lock) that prevents multiple threads from running Ruby code in parallel.
Code optimizations.
These include minimizing system calls (such as the File.open) and minimizing any temporary objects.
I would start with this approach before I moved on to fork, mainly due to the extra coding required when using fork.
The first approach requires a large rewrite of the script, while the second approach might be more easily achieved.
For example, the following approach minimizes some system calls (such as the File's open, close and write system calls):
require 'ruby-progressbar'
class String
def strip_control_characters()
chars.each_with_object("") do |char, str|
str << char unless char.ascii_only? and (char.ord < 32 or char.ord == 127)
end
end
def strip_control_and_extended_characters()
chars.each_with_object("") do |char, str|
str << char if char.ascii_only? and char.ord.between?(32,126)
end
end
end
class Numeric
def percent_of(n)
self.to_f / n.to_f * 100.0
end
end
def clean(file_in,file_out)
if !File.exist?(file_in)
puts "File '#{file_in}' does not exist."
return
end
File.delete(file_out) if File.exist?(file_out)
`touch #{file_out}`
deleted = 0
count = 0
line_count = `wc -l "#{file_in}"`.strip.split(' ')[0].to_i
puts "File has #{line_count} lines. Cleaning..."
progressbar = ProgressBar.create(total: line_count, length: 100, format: 'Progress |%B| %a %e')
file_fd = File.open(file_out, 'a')
buffer = "".dup
IO.foreach(file_in) {|x|
if x.ascii_only?
line = x.strip_control_and_extended_characters.strip
if line == ""
deleted += 1
next
end
if line.include?("::")
deleted += 1
next
end
split = line.split(":")
c = split.count
if c == 1
deleted += 1
next
end
if c > 2
line = split.last(2).join(":")
end
if line != ""
buffer << line << "\n" # append in place; += would allocate a new string for every line
else
deleted += 1
end
else
deleted += 1
end
if buffer.length >= 2048
file_fd.puts(buffer)
buffer.clear
end
progressbar.progress += 1
}
file_fd.puts(buffer)
buffer.clear
file_fd.close
puts "Deleted #{deleted} lines."
end
P.S.
I would avoid monkey patching - it's rude.
After posting this I read @DavidGrayson's answer, which pinpoints an issue with your code's performance in a much shorter and more succinct answer.
I up-voted his answer, as I think you'll get a big performance gain from this simple change.

How do I detect end of file in Ruby?

I wrote the following script to read a CSV file:
f = File.open("aFile.csv")
text = f.read
text.each_line do |line|
if (f.eof?)
puts "End of file reached"
else
line_num +=1
if(line_num < 6) then
puts "____SKIPPED LINE____"
next
end
end
arr = line.split(",")
puts "line number = #{line_num}"
end
This code runs fine if I take out the line:
if (f.eof?)
puts "End of file reached"
With this line in I get an exception.
I was wondering how I can detect the end of file in the code above.
Try this short example:
f = File.open(__FILE__)
text = f.read
p f.eof? # -> true
p text.class #-> String
With f.read you read the whole file into text and reach EOF.
(Remark: __FILE__ is the script file itself. You may use your CSV file instead.)
In your code you use text.each_line. This executes each_line for the string text. It has no effect on f.
You could use File#each_line without using a variable text. The test for EOF is not necessary. each_line loops on each line and detects EOF on its own.
f = File.open(__FILE__)
line_num = 0
f.each_line do |line|
line_num +=1
if (line_num < 6)
puts "____SKIPPED LINE____"
next
end
arr = line.split(",")
puts "line number = #{line_num}"
end
f.close
You should close the file after reading it. To use blocks for this is more Ruby-like:
line_num = 0
File.open(__FILE__) do | f|
f.each_line do |line|
line_num +=1
if (line_num < 6)
puts "____SKIPPED LINE____"
next
end
arr = line.split(",")
puts "line number = #{line_num}"
end
end
One general remark: There is a CSV library in Ruby. Normally it is better to use that.
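As a rough illustration of that remark, the sketch below skips the first five lines as in the question and lets the CSV library split each remaining line. The helper name each_data_row is hypothetical, and the sketch assumes that no quoted field contains a line break, so every physical line is one complete record.

```ruby
require 'csv'
require 'stringio'

# Hypothetical helper: yields [line_number, fields] for every line after the first `skip` lines.
# Assumption: no quoted field spans several lines.
def each_data_row(io, skip = 5)
  return enum_for(:each_data_row, io, skip) unless block_given?

  io.each_line.with_index(1) do |line, line_num|
    next if line_num <= skip
    yield line_num, CSV.parse_line(line)
  end
end

# In-memory demonstration; a File object opened on the CSV file works the same way.
csv_text = "h1\nh2\nh3\nh4\nh5\na,\"b,c\",d\n1,2,3\n"
rows = each_data_row(StringIO.new(csv_text)).to_a
p rows
```

Unlike line.split(","), CSV.parse_line keeps a quoted field such as "b,c" together as one value.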
https://www.ruby-forum.com/topic/218093#946117 talks about this.
content = File.read("file.txt")
content = File.readlines("file.txt")
The above 'slurps' the entire file into memory.
File.foreach("file.txt") {|line| content << line}
You can also use IO#each_line. These last two options do not read the entire file into memory. The use of the block makes this automatically close your IO object as well. There are other ways as well, IO and File classes are pretty feature rich!
I refer to IO objects, as File is a subclass of IO. I tend to use IO when I don't really need the added methods from File class for the object.
In this way you don't need to deal with EOF, Ruby will for you.
Sometimes the best handling is not to, when you really don't need to.
Of course, Ruby has a method for this.
Without testing this, it seems you should perform a rescue rather than checking.
http://www.ruby-doc.org/core-2.0/EOFError.html
file = File.open("aFile.csv")
begin
loop do
some_line = file.readline
# some stuff
end
rescue EOFError
# You've reached the end. Handle it.
end

Read file into an array excluding the commented out lines

I'm almost a Ruby-nOOb (I have just enough knowledge of Ruby to write some basic .erb templates or Puppet custom facts). My requirements look fairly simple, but I can't get my head around them.
Trying to write a .erb template, where it reads a file (with space delimited lines) to an array and then handle each array element according to the requirements. This is what I got so far:
$fname = "webURI.txt"
def myArray()
#if defined? $fname
if File.exist?($fname) and File.file?($fname)
IO.readlines($fname)
end
end
myArray.each_index do |i|
myLine = myArray[i].split(' ')
puts myLine[0] +"\t=> "+ myLine.last
end
Which works just fine, except (for obvious reasons) for lines that are commented out or blank. I also want to make sure that, when split up (by space), a line doesn't have more than two fields in it. The file looks like this:
# This is a COMMENT
#
# Puppet dashboard
puppet controller-all-local.example.co.uk:80
# Nagios monitoring
nagios controller-all-local.example.co.uk::80/nagios
tac talend-tac-local.example.co.uk:8080/org.talend.admin
mng console talend-mca-local.example.co.uk:8080/amc # Line with three fields
So, basically these two things I'd like to achieve:
Read the lines into array, stripping off everything after the first #
Split each element and print a message if the number of fields is more than two
Any help would be greatly appreciated. Cheers!!
Update 25/02
Thanks guys for your help!!
The blank? thing doesn't work at all; it throws this error, but I kinda failed to understand why:
undefined method `blank?' for "\n":String (NoMethodError)
The array myArray that I get is actually something like this (using p instead of puts):
["\n", "puppet controller-all-local.example.co.uk:80\n", "\n", "\n", "nagios controller-all-local.example.co.uk::80/nagios\n", ..... \n"]
Hence, I had to do this to get around this prob:
$fname = "webURI.txt"
def myArray()
if File.exist?($fname) and File.file?($fname)
IO.readlines($fname).map { |arr| arr.gsub(/#.*/,'') }
end
end
# remove blank lines
SSS = myArray.reject { |ln| ln.start_with?("\n") }
SSS.each_index do |i|
myLine = SSS[i].split(' ')
if myLine.length > 2
puts "Too many arguments!!!"
elsif myLine.length == 1
puts "page"+ i.to_s + "\t=> " + myLine[0]
else
puts myLine[0] +"\t=> "+ myLine.last
end
end
You are most welcome to improve the code. cheers!!
goodArray = myArray.reject do |line|
line.start_with?('#') || line.split(' ').length > 2
end
This rejects every line that either starts with # or splits into more than two elements, returning an array of only the good items.
Edit:
For your inline commenting you can then do
goodArray.map do |line|
line.gsub(/#.*/, '')
end
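Putting the two requirements from the question together (strip everything after the first #, then check the field count), one possible sketch looks like this. The method name parse_uri_lines and its return shape are assumptions made for illustration, not part of the answers here.

```ruby
# Hypothetical helper: returns [entries, rejected].
# entries  - arrays of one or two fields from valid lines
# rejected - original lines (without newline) that had more than two fields
def parse_uri_lines(lines)
  entries = []
  rejected = []
  lines.each do |raw|
    # Drop the comment part, then split on whitespace.
    fields = raw.sub(/#.*/, '').split
    next if fields.empty? # blank or comment-only line
    if fields.length > 2
      rejected << raw.chomp
    else
      entries << fields
    end
  end
  [entries, rejected]
end

sample_lines = [
  "# This is a COMMENT\n",
  "#\n",
  "puppet controller-all-local.example.co.uk:80\n",
  "\n",
  "mng console talend-mca-local.example.co.uk:8080/amc # Line with three fields\n"
]
entries, rejected = parse_uri_lines(sample_lines)
rejected.each { |line| puts "Too many fields: #{line}" }
```

With a real file, the lines could come from File.foreach("webURI.txt"), which avoids building the intermediate array that IO.readlines creates.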

Moving to the last line of a file while reading it in a .each loop in Ruby

I'm reading in a file that can contain any number of rows.
I only need to save the first 1000 or so, passed in as a variable "recordsToParse".
If I reach my 1000 line limit, or whatever it's set to, I need to save the trailer information in the file to verify total_records, total_amount etc.
So, I need a way to move my "pointer" from wherever I am in the file to the last line and run through the loop one more time.
file = File.open(file_name)
parsed_file_rows = Array.new
successful_records, failed_records = 0, 0
file_contract = file_contract['File_Contract']
output_file_name = file_name.gsub(/.TXT|.txt|.dat|.DAT/,'')
file.each do |line|
line.chomp!
line_contract = determine_row_type(file_contract, line)
if line_contract
parsed_row = parse_row_by_contract(line_contract, line)
parsed_file_rows << parsed_row
successful_records += 1
else
failed_records += 1
end
if (not recordsToParse.nil?)
if successful_records > recordsToParse
# move "pointer" to last line and go through loop once more
#break;
end
end
end
store_parsed_file('Parsed_File',"#{output_file_name}_parsed", parsed_file_rows)
[successful_records, failed_records]
Use IO#seek with IO::SEEK_END to move your pointer to the end of the file, then move backwards to the last newline before the final line; what follows it is your last line.
This would only be worthwhile if the file is very big, otherwise just follow the file.each do |line| to the last line or you could read the last line like this IO.readlines("file.txt")[-1].
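The seek-based idea can be sketched as below. This is an assumption-laden illustration rather than a tested production routine: the method name last_line and the 4096-byte chunk size are made up, the file is read in binary mode, and lines are assumed to end in "\n".

```ruby
# Hypothetical helper: returns the last line of the file at `path` (without its
# line ending), reading backwards from the end in chunks. Returns nil for an empty file.
def last_line(path, chunk_size = 4096)
  File.open(path, 'rb') do |f|
    size = f.size
    return nil if size.zero?

    pos = size
    tail = "".b
    loop do
      read_len = [chunk_size, pos].min
      pos -= read_len
      # Position relative to the end of the file, as suggested above.
      f.seek(pos - size, IO::SEEK_END)
      tail = f.read(read_len) + tail
      body = tail.chomp # ignore the newline that terminates the final line
      newline_at = body.rindex("\n")
      return body[(newline_at + 1)..-1] if newline_at
      return body if pos.zero? # the whole file is a single line
    end
  end
end
```

Inside the each loop from the question, one could break once the record limit is reached and then call last_line(file_name) to parse the trailer, so the rows in between are never read.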
The easiest solution is to use a gem like elif
require "elif"
lastline = Elif.open("bigfile.txt") { |f| f.gets }
It reads your lastline in a snap undoubtedly using seek.
This is one of those times I'd take advantage of the OS's head and tail commands using something like:
head = `head -#{ records_to_parse } #{ file_to_read }`.split("\n")
tail = `tail -1 #{ file_to_read }`
head.pop if (head[-1] == tail.chomp)
Then write it all out using something like:
File.open(new_file_to_write, 'w') do |fo|
fo.puts head, tail
end

Increment part of a string in Ruby

I have a method in a Ruby script that is attempting to rename files before they are saved. It looks like this:
def increment (path)
if path[-3,2] == "_#"
print " Incremented file with that name already exists, renaming\n"
count = path[-1].chr.to_i + 1
return path.chop! << count.to_s
else
print " A file with that name already exists, renaming\n"
return path << "_#1"
end
end
Say you have 3 files with the same name being saved to a directory, we'll say the file is called example.mp3. The idea is that the first will be saved as example.mp3 (since it won't be caught by if File.exists?("#{file_path}.mp3") elsewhere in the script), the second will be saved as example_#1.mp3 (since it is caught by the else part of the above method) and the third as example_#2.mp3 (since it is caught by the if part of the above method).
The problem I have is twofold.
1) if path[-3,2] == "_#" won't work for files with an integer of more than one digit (example_#11.mp3 for example) since the character placement will be wrong (you'd need it to be path[-4,2] but then that doesn't cope with 3 digit numbers etc).
2) I'm never reaching problem 1) since the method doesn't reliably catch file names. At the moment it will rename the first to example_#1.mp3 but the second gets renamed to the same thing (causing it to overwrite the previously saved file).
This is possibly too vague for Stack Overflow but I can't find anything that addresses the issue of incrementing a certain part of a string.
Thanks in advance!
Edit/update:
Wayne's method below seems to work on its own but not when included as part of the whole script - it can increment a file once (from example.mp3 to example_#1.mp3) but doesn't cope with taking example_#1.mp3 and incrementing it to example_#2.mp3. To provide a little more context - currently when the script finds a file to save it is passing the name to Wayne's method like this:
file_name = increment(image_name)
File.open("images/#{file_name}.jpeg", 'w') do |output|
open(image_url) do |input|
output << input.read
end
end
I've edited Wayne's script a little so now it looks like this:
def increment (name)
name = name.gsub(/\s{2,}|(http:\/\/)|(www.)/i, '')
if File.exists?("images/#{name}.jpeg")
_, filename, count, extension = *name.match(/(\A.*?)(?:_#(\d+))?(\.[^.]*)?\Z/)
count = (count || '0').to_i + 1
"#{name}_##{count}#{extension}"
else
return name
end
end
Where am I going wrong? Again, thanks in advance.
A regular expression will git 'er done:
#!/usr/bin/ruby1.8
def increment(path)
_, filename, count, extension = *path.match(/(\A.*?)(?:_#(\d+))?(\.[^.]*)?\Z/)
count = (count || '0').to_i + 1
"#{filename}_##{count}#{extension}"
end
p increment('example') # => "example_#1"
p increment('example.') # => "example_#1."
p increment('example.mp3') # => "example_#1.mp3"
p increment('example_#1.mp3') # => "example_#2.mp3"
p increment('example_#2.mp3') # => "example_#3.mp3"
This probably doesn't matter for the code you're writing, but if you ever may have multiple threads or processes using this algorithm on the same files, there's a race condition when checking for existence before saving: Two writers can both find the same filename unused and write to it. If that matters to you, then open the file in a mode that fails if it exists, rescuing the exception. When the exception occurs, pick a different name. Roughly:
loop do
begin
File.open(filename, File::CREAT | File::EXCL | File::WRONLY) do |file|
file.puts "Your content goes here"
end
break
rescue Errno::EEXIST
filename = increment(filename)
redo
end
end
Here's a variation that doesn't accept a file name with an existing count:
def non_colliding_filename( filename )
if File.exist?(filename)
base,ext = /\A(.+?)(\.[^.]+)?\Z/.match( filename ).to_a[1..-1]
i = 1
i += 1 while File.exist?( filename="#{base}_##{i}#{ext}" )
end
filename
end
Proof:
%w[ foo bar.mp3 jim.bob.mp3 ].each do |desired|
3.times{
file = non_colliding_filename( desired )
p file
File.open( file, 'w' ){ |f| f << "tmp" }
}
end
#=> "foo"
#=> "foo_#1"
#=> "foo_#2"
#=> "bar.mp3"
#=> "bar_#1.mp3"
#=> "bar_#2.mp3"
#=> "jim.bob.mp3"
#=> "jim.bob_#1.mp3"
#=> "jim.bob_#2.mp3"
