I have a Ruby script that does the following to a text file:
removes non-ASCII lines
removes lines containing "::" (two colons in a row)
if a line contains more than one ":" (where the colons aren't directly next to each other), it keeps only the strings on either side of the last colon.
removes leading whitespace
removes unusual control characters
The problem is, I'm working with files that have ~20 million lines, and my script says it'll take ~45 minutes to run.
Is there a way to majorly speed this up? Or, is there a significantly quicker way to handle this in shell?
require 'ruby-progressbar'
class String
def strip_control_characters()
chars.each_with_object("") do |char, str|
str << char unless char.ascii_only? and (char.ord < 32 or char.ord == 127)
end
end
def strip_control_and_extended_characters()
chars.each_with_object("") do |char, str|
str << char if char.ascii_only? and char.ord.between?(32,126)
end
end
end
class Numeric
def percent_of(n)
self.to_f / n.to_f * 100.0
end
end
def clean(file_in,file_out)
if !File.exists?(file_in)
puts "File '#{file_in}' does not exist."
return
end
File.delete(file_out) if File.exist?(file_out)
`touch #{file_out}`
deleted = 0
count = 0
line_count = `wc -l "#{file_in}"`.strip.split(' ')[0].to_i
puts "File has #{line_count} lines. Cleaning..."
progressbar = ProgressBar.create(total: line_count, length: 100, format: 'Progress |%B| %a %e')
IO.foreach(file_in) {|x|
if x.ascii_only?
line = x.strip_control_and_extended_characters.strip
if line == ""
deleted += 1
next
end
if line.include?("::")
deleted += 1
next
end
split = line.split(":")
c = split.count
if c == 1
deleted += 1
next
end
if c > 2
line = split.last(2).join(":")
end
if line != ""
File.open(file_out, 'a') { |f| f.puts(line) }
else
deleted += 1
end
else
deleted += 1
end
progressbar.progress += 1
}
puts "Deleted #{deleted} lines."
end
Here is one of your big problems:
if line != ""
File.open(file_out, 'a') { |f| f.puts(line) }
end
So your program needs to open and close the output file millions of times because it is doing that for every single line. Each time it opens it, since it is being opened in append mode, your system might have to do a lot of work to find the end of the file.
You should really change your program to open the output file once at the beginning and only close it at the end. Also, run strace to see what your Ruby I/O operations are doing behind the scenes; it should buffer up the writes and then send them to the OS in blocks of about 4 kilobytes at a time; it shouldn't issue a write system call for every single line.
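For example, the shape of that change is roughly this (a minimal sketch using the variable names from the question; the per-line cleaning is elided):
File.open(file_out, 'w') do |out|
  IO.foreach(file_in) do |x|
    line = x.strip # ... the same filtering and cleaning as in the original loop ...
    out.puts(line) unless line.empty?
  end
end
This way the output file is opened exactly once, and Ruby's buffered IO batches the writes for you.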
To further improve the performance, you should use a Ruby profiling tool to see which functions are taking the most time.
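For example, a flat profile of the clean method could be produced like this (a sketch assuming the ruby-prof gem is installed; the file names are placeholders):
require 'ruby-prof'

result = RubyProf.profile do
  clean('input.txt', 'output.txt')
end
RubyProf::FlatPrinter.new(result).print($stdout) # prints time spent per method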
You can improve the speed by changing your String additions to variations on:
class String
def strip_control_characters()
gsub(/[[:cntrl:]]+/, '')
end
def strip_control_and_extended_characters()
strip_control_characters.gsub(/[^[:ascii:]]+/, '')
end
end
str = (0..255).to_a.map { |b| b.chr }.join # => "\x00\x01\x02\x03\x04\x05\x06\a\b\t\n\v\f\r\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\e\x1C\x1D\x1E\x1F !\"\#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7F\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F\xA0\xA1\xA2\xA3\xA4\xA5\xA6\xA7\xA8\xA9\xAA\xAB\xAC\xAD\xAE\xAF\xB0\xB1\xB2\xB3\xB4\xB5\xB6\xB7\xB8\xB9\xBA\xBB\xBC\xBD\xBE\xBF\xC0\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD7\xD8\xD9\xDA\xDB\xDC\xDD\xDE\xDF\xE0\xE1\xE2\xE3\xE4\xE5\xE6\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7\xF8\xF9\xFA\xFB\xFC\xFD\xFE\xFF"
str.strip_control_characters
# => " !\"\#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F\xA0\xA1\xA2\xA3\xA4\xA5\xA6\xA7\xA8\xA9\xAA\xAB\xAC\xAD\xAE\xAF\xB0\xB1\xB2\xB3\xB4\xB5\xB6\xB7\xB8\xB9\xBA\xBB\xBC\xBD\xBE\xBF\xC0\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD7\xD8\xD9\xDA\xDB\xDC\xDD\xDE\xDF\xE0\xE1\xE2\xE3\xE4\xE5\xE6\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7\xF8\xF9\xFA\xFB\xFC\xFD\xFE\xFF"
str.strip_control_and_extended_characters
# => " !\"\#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
Use the built-in gsub method along with POSIX character classes instead of iterating over the strings and testing each character.
As @Myst said though, monkey-patching is rude. Use refinements, or create some methods and pass the string in:
def strip_control_characters(str)
str.gsub(/[[:cntrl:]]+/, '')
end
def strip_control_and_extended_characters(str)
strip_control_characters(str).gsub(/[^[:ascii:]]+/, '')
end
str = (0..255).to_a.map { |b| b.chr }.join # => "\x00\x01\x02\x03\x04\x05\x06\a\b\t\n\v\f\r\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\e\x1C\x1D\x1E\x1F !\"\#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7F\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F\xA0\xA1\xA2\xA3\xA4\xA5\xA6\xA7\xA8\xA9\xAA\xAB\xAC\xAD\xAE\xAF\xB0\xB1\xB2\xB3\xB4\xB5\xB6\xB7\xB8\xB9\xBA\xBB\xBC\xBD\xBE\xBF\xC0\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD7\xD8\xD9\xDA\xDB\xDC\xDD\xDE\xDF\xE0\xE1\xE2\xE3\xE4\xE5\xE6\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7\xF8\xF9\xFA\xFB\xFC\xFD\xFE\xFF"
strip_control_characters(str)
# => " !\"\#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F\xA0\xA1\xA2\xA3\xA4\xA5\xA6\xA7\xA8\xA9\xAA\xAB\xAC\xAD\xAE\xAF\xB0\xB1\xB2\xB3\xB4\xB5\xB6\xB7\xB8\xB9\xBA\xBB\xBC\xBD\xBE\xBF\xC0\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD7\xD8\xD9\xDA\xDB\xDC\xDD\xDE\xDF\xE0\xE1\xE2\xE3\xE4\xE5\xE6\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7\xF8\xF9\xFA\xFB\xFC\xFD\xFE\xFF"
strip_control_and_extended_characters(str)
# => " !\"\#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
Moving on...
`touch #{file_out}`
is a problem too. You're creating a sub-shell every time that runs, executing touch and then tearing the shell down, which is a slow operation. Let Ruby do it:
=== Implementation from FileUtils
------------------------------------------------------------------------------
touch(list, noop: nil, verbose: nil, mtime: nil, nocreate: nil)
------------------------------------------------------------------------------
Updates modification time (mtime) and access time (atime) of file(s) in list.
Files are created if they don't exist.
FileUtils.touch 'timestamp'
FileUtils.touch Dir.glob('*.c'); system 'make'
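So the backtick call in the question's script could simply become (a small sketch):
require 'fileutils'
FileUtils.touch(file_out) # creates the file if missing, without spawning a shell
And once the output file is opened for writing anyway, even the touch becomes unnecessary, since opening a file in write mode creates it.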
Finally, learn to benchmark code as you develop. Take the time to think of a couple ways to do something, then test them against each other and find out which is the fastest. I use Fruity, because it handles issues that the Benchmark class doesn't, but do one or the other. You can find a lot of tests I did here for various things by searching SO for my user and "benchmark".
require 'fruity'
class String
def strip_control_characters()
chars.each_with_object("") do |char, str|
str << char unless char.ascii_only? and (char.ord < 32 or char.ord == 127)
end
end
def strip_control_and_extended_characters()
chars.each_with_object("") do |char, str|
str << char if char.ascii_only? and char.ord.between?(32,126)
end
end
end
def strip_control_characters2(str)
str.gsub(/[[:cntrl:]]+/, '')
end
def strip_control_and_extended_characters2(str)
strip_control_characters2(str).gsub(/[^[:ascii:]]+/, '')
end
str = (0..255).to_a.map { |b| b.chr }.join
str.strip_control_characters # => " !\"\#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F\xA0\xA1\xA2\xA3\xA4\xA5\xA6\xA7\xA8\xA9\xAA\xAB\xAC\xAD\xAE\xAF\xB0\xB1\xB2\xB3\xB4\xB5\xB6\xB7\xB8\xB9\xBA\xBB\xBC\xBD\xBE\xBF\xC0\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD7\xD8\xD9\xDA\xDB\xDC\xDD\xDE\xDF\xE0\xE1\xE2\xE3\xE4\xE5\xE6\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7\xF8\xF9\xFA\xFB\xFC\xFD\xFE\xFF"
strip_control_characters2(str) # => " !\"\#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F\xA0\xA1\xA2\xA3\xA4\xA5\xA6\xA7\xA8\xA9\xAA\xAB\xAC\xAD\xAE\xAF\xB0\xB1\xB2\xB3\xB4\xB5\xB6\xB7\xB8\xB9\xBA\xBB\xBC\xBD\xBE\xBF\xC0\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD7\xD8\xD9\xDA\xDB\xDC\xDD\xDE\xDF\xE0\xE1\xE2\xE3\xE4\xE5\xE6\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7\xF8\xF9\xFA\xFB\xFC\xFD\xFE\xFF"
str.strip_control_and_extended_characters # => " !\"\#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
strip_control_and_extended_characters2(str) # => " !\"\#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
compare do
scc { str.strip_control_characters }
scc2 { strip_control_characters2(str) }
end
# >> Running each test 512 times. Test will take about 1 second.
# >> scc2 is faster than scc by 10x ± 1.0
and:
compare do
scec { str.strip_control_and_extended_characters }
scec2 { strip_control_and_extended_characters2(str) }
end
# >> Running each test 256 times. Test will take about 1 second.
# >> scec2 is faster than scec by 5x ± 1.0
There seem to be only two possible approaches to optimizing this:
Concurrency.
If your machine is a Unix/Linux based machine that has a multi-core CPU, you can take advantage of the multi-cores by using fork, dividing up the work between different processes.
Multi-threading might not work as well as you'd expect with Ruby, since there's a GIL (Global Interpreter Lock) that prevents multiple threads from running in parallel.
Code optimizations.
These include minimizing system calls (such as the File.open) and minimizing any temporary objects.
I would start with this approach before I moved on to fork, mainly due to the extra coding required when using fork.
The first approach requires a large rewrite of the script, while the second approach might be more easily achieved.
For example, the following approach minimizes some system calls (such as the File's open, close and write system calls):
require 'ruby-progressbar'
class String
def strip_control_characters()
chars.each_with_object("") do |char, str|
str << char unless char.ascii_only? and (char.ord < 32 or char.ord == 127)
end
end
def strip_control_and_extended_characters()
chars.each_with_object("") do |char, str|
str << char if char.ascii_only? and char.ord.between?(32,126)
end
end
end
class Numeric
def percent_of(n)
self.to_f / n.to_f * 100.0
end
end
def clean(file_in,file_out)
if !File.exists?(file_in)
puts "File '#{file_in}' does not exist."
return
end
File.delete(file_out) if File.exist?(file_out)
`touch #{file_out}`
deleted = 0
count = 0
line_count = `wc -l "#{file_in}"`.strip.split(' ')[0].to_i
puts "File has #{line_count} lines. Cleaning..."
progressbar = ProgressBar.create(total: line_count, length: 100, format: 'Progress |%B| %a %e')
file_fd = File.open(file_out, 'a')
buffer = "".dup
IO.foreach(file_in) {|x|
if x.ascii_only?
line = x.strip_control_and_extended_characters.strip
if line == ""
deleted += 1
next
end
if line.include?("::")
deleted += 1
next
end
split = line.split(":")
c = split.count
if c == 1
deleted += 1
next
end
if c > 2
line = split.last(2).join(":")
end
if line != ""
buffer += "\r\n#{line}"
else
deleted += 1
end
else
deleted += 1
end
if buffer.length >= 2048
file_fd.puts(buffer)
buffer.clear
end
progressbar.progress += 1
}
file_fd.puts(buffer)
buffer.clear
file_fd.close
puts "Deleted #{deleted} lines."
end
P.S.
I would avoid monkey patching - it's rude.
After posting this I read @DavidGrayson's answer, which pinpoints an issue with your code's performance in a much shorter, more succinct way.
I up-voted his answer, as I think you'll get a big performance gain from this simple change.
Related
This code is designed for a problem where the user's computer has a bug: every time they hit the backspace key it displays a '<' symbol. The program should fix this and output the intended string, treating '<' as a backspace. The input string can be up to 10^6 characters long, and it will only include lowercase letters and '<'.
My code seems to be executing correctly but, when I submit it, the website says it exceeded the time limit for test 5/25. The amount of time given is 1 second. Also, if there are only '<' symbols it should produce no output.
For example,
"hellooo<< my name is matthe<<"
would output
"hello my name is matt"
and
"ssadfas<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"
would output nothing, etc.
Here is the code:
input = gets.chomp
while input[/[[:lower:]]</]
input.gsub!(/[[:lower:]]</, "")
end
input.gsub!(/</, "")
puts"#{input}"
In the code above I stay in the while loop if there are any instances where a lowercase letter is in front of a '<'. Anywhere a lowercase letter is followed by a '<' it is replaced with nothing. Once the while loop is exited if there are any '<' symbols left, they are replaced with nothing. Then the final string is displayed.
I created a test which I think is a worst-case scenario for my code:
input = ("a" + "<" + "a")*10000000
#input = gets.chomp
while input[/[[:lower:]]</]
input.gsub!(/[[:lower:]]</, "")
end
input.gsub!(/</, "")
puts"#{input}"
I paused the program between creating the string and entering the while loop, then ran it to completion so I could eyeball whether it took longer than a second. It seemed to take much longer than 1 second.
How can it be modified to be faster or is there a much better way to do this?
Your approach is good but you get better performance if you adapt the regular expression.
Cary, I hope you don't mind that I also include your excellent solution in the benchmark?
Benchmark done on a MRI ruby 2.3.0p0 (2015-12-25 revision 53290) [x64-mingw32]
I use .dup on my sample string to make sure none of the methods changes the input sample.
require 'benchmark'
input = ""
10_000_000.times{input << ['a','<'].sample}
def original_method inp
while inp[/[[:lower:]]</]
inp.gsub!(/[[:lower:]]</, "")
end
inp.gsub(/</, "")
end
def better_method inp
tuple = /[^<]</
while inp[tuple]
inp.gsub!(inp[tuple], "")
end
inp.gsub(/</, "")
end
def backspace str
bs_count = 0
str.reverse.each_char.with_object([]) do |s, arr|
if s == '<'
bs_count += 1
else
bs_count.zero? ? arr.unshift(s) : bs_count -= 1
end
end.join
end
puts original_method(input.dup).length
puts better_method(input.dup).length
puts backspace(input.dup).length
Benchmark.bm do |x|
x.report("original_method") { original_method(input.dup) }
x.report("backspace ") { backspace(input.dup) }
x.report("better_method ") { better_method(input.dup) }
end
gives
3640
3640
3640
user system total real
original_method 3.494000 0.016000 3.510000 ( 3.510709)
backspace 1.872000 0.000000 1.872000 ( 1.862550)
better_method 1.155000 0.031000 1.186000 ( 1.187495)
def backspace(str)
bs_count = 0
str.reverse.each_char.with_object([]) do |s, arr|
if s == '<'
bs_count += 1
else
bs_count.zero? ? arr.unshift(s) : bs_count -= 1
end
end.join
end
backspace "Now is the<< tim<e fo<<<r every<<<one to chill ou<<<<t"
#=> "Now is t tier evone to chilt"
I have a 50+ GB XML file which I initially tried to (man)handle with Nokogiri :)
Got killed: 9 - obviously :)
Now I'm into muddy Ruby threaded waters with this stab (at it):
#!/usr/bin/env ruby
def add_vehicle index, str
IO.write "ess_#{index}.xml", str
#file_name = "ess_#{index}.xml"
#fd = File.new file_name, "w"
#fd.write str
#fd.close
#puts file_name
end
begin
record = []
threads = []
counter = 1
file = File.new("../ess2.xml", "r")
while (line = file.gets)
case line
when /<ns:Statistik/
record = []
record << line
when /<\/ns:Statistik/
record << line
puts "file - %s" % counter
threads << Thread.new { add_vehicle counter, record.join }
counter += 1
else
record << line
end
end
file.close
threads.each { |thr| thr.join }
rescue => err
puts "Exception: #{err}"
err
end
Somehow this code 'skips' one or two files when writing the result files - hmmm!?
Okay, you have a problem because your file is huge, and you want to use multithreading.
Now you have two problems.
On a more serious note, I've had very good experience with this code.
It parsed 20GB xml files with almost no memory use.
Download the mentioned code, save it as xml_parser.rb, and this script should work:
require_relative 'xml_parser.rb'
file = "../ess2.xml"
def add_vehicle index, str
filename = "ess_#{index}.xml"
File.open(filename,'w+'){|out| out.puts str}
puts format("%s has been written with %d lines", filename, str.each_line.count)
end
i=0
Xml::Parser.new(Nokogiri::XML::Reader(open(file))) do
for_element 'ns:Statistik' do
i+=1
add_vehicle(i, @node.outer_xml)
end
end
#=> ess_1.xml has been written with 102 lines
#=> ess_2.xml has been written with 102 lines
#=> ...
It will take time, but it should work without error and without using much memory.
By the way, here is the reason why your code missed some files:
threads = []
counter = 1
threads << Thread.new { puts counter }
counter += 1
threads.each { |thr| thr.join }
#=> 2
threads = []
counter = 1
threads << Thread.new { puts counter }
sleep(1)
counter += 1
threads.each { |thr| thr.join }
#=> 1
counter += 1 was faster than the add_vehicle call, so your add_vehicle was often called with the wrong counter. With many millions of nodes, some might be offset by 0 and some by 1. When two add_vehicle calls get the same id, they overwrite each other, and a file goes missing.
You have the same problem with record, with lines getting written in the wrong file.
Perhaps you should try to synchronize counter += 1 with a Mutex.
For example:
@lock = Mutex.new
@counter = 0
def add_vehicle str
@lock.synchronize do
@counter += 1
IO.write "ess_#{@counter}.xml", str
end
end
Mutex implements a simple semaphore that can be used to coordinate access to shared data from multiple concurrent threads.
Or you can go another way from the start and use Ox. It is way faster than Nokogiri; take a look at a comparison. For huge files, use Ox::Sax.
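To give a feel for the Ox::Sax streaming style, here is a minimal, untested sketch that only counts the matching elements; the element symbol :'ns:Statistik' and the file path come from the question, and the record-splitting logic is not reproduced:
require 'ox'

class StatistikCounter < Ox::Sax
  attr_reader :count

  def initialize
    @count = 0
  end

  # called once per opening tag; Ox passes the element name as a Symbol
  def start_element(name)
    @count += 1 if name == :'ns:Statistik'
  end
end

handler = StatistikCounter.new
File.open('../ess2.xml') { |io| Ox.sax_parse(handler, io) }
puts "found #{handler.count} records"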
I want to take a file, read the file into my program and split it into characters, split the resulting character array into a multidimensional array of 5,000 characters each, then write each separate array into a file found in the same location.
I have taken a file, read it, and created the multidimensional array. Now I want to write each separate single dimension array into separate files.
The file is obtained via user input. Then I created a chain of helper methods: the first mixin stores the file in an array, which is then passed to another method that breaks it down into a multidimensional array, which finally hands it off to the end of the chain, currently set up to make a new directory into which I will put these files.
require 'Benchmark/ips'
file = "C:\\test.php"
class String
def file_to_array
file = self
return_file = File.open(file) do |line|
line.each_char.to_a
end
return return_file
end
def file_write
file_to_write = self
if Dir.exist?("I:\\file_to_array")
File.open("I:/file_to_array/tmp.txt", "w") { |file| file.write(file_to_write) }
read_file = File.read("I:/file_to_array/tmp.txt")
else
Dir.mkdir("I:\\file_to_array")
end
end
end
class Array
def file_divider
file_to_divide = self
file_to_separate = []
count = 0
while count != file_to_divide.length
separator = count % 5000
if separator == 0
start = count - 5000
stop = count
file_to_separate << file_to_divide[start..stop]
end
count = count + 1
end
return file_to_separate
end
def file_write
file_to_write = self
if Dir.exist?("I:\\file_to_array")
File.open("I:/file_to_array/tmp.txt", "w") { |file| file.write(file_to_write) }
else
Dir.mkdir("I:\\file_to_array")
end
end
end
Benchmark.ips do |result|
result.report { file.file_to_array.file_divider.file_write }
end
Test.php
<?php
echo "hello world"
?>
This untested code is where I'd start to split text into chunks and save it:
str = "I want to take a file"
str_array = str.scan(/.{1,10}/) # => ["I want to ", "take a fil", "e"]
str_array.each.with_index(1) do |str_chunk, i|
File.write("output#{i}", str_chunk)
end
This doesn't honor word-boundaries.
Reading a separate input file is easy; you can use read if you KNOW the input will never exceed the available memory and you don't care about performance.
Thinking about it further, if you want to read a text file and break its contents into smaller files, then read it in chunks:
input = File.open('input.txt', 'r')
i = 1
until input.eof? do
chunk = input.read(10)
File.write("output#{i}", chunk)
i += 1
end
input.close
Or even better because it automatically closes the input:
File.open('input.txt', 'r') do |input|
i = 1
until input.eof? do
chunk = input.read(10)
File.write("output#{i}", chunk)
i += 1
end
end
Those are not tested but they look about right.
Use standard File API and Serialisation.
File.write('path/to/yourfile.txt', Marshal.dump([1, 2, 3]))
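Reading it back is the mirror image (using File.binread because Marshal output is binary):
data = Marshal.load(File.binread('path/to/yourfile.txt'))
data # => [1, 2, 3]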
I'm looking to run a search through some files to see if they have comments on top of the file.
Here's what I'm searching for:
#++
# app_name/dir/dir/filename
# $Id$
#--
I had this as a REGEX and came up short:
:doc => { :test => '^#--\s+[filename]\s+\$Id'
if @file_text =~ Regexp.new(@rules[rule][:test])
....
Any suggestions?
Check this example:
string = <<EOF
#++
## app_name/dir/dir/filename
## $Id$
##--
foo bar
EOF
puts /#\+\+.*\n##.*\n##.*\n##--/.match(string)
The pattern matches two lines starting with ## between a line starting with #++ and a line starting with ##--, and it includes those boundary lines in the match. If I got the question right, this should be what you want.
You can generalize the pattern to match everything between the first #++ and the first #-- (including them) using the following pattern:
puts /#\+\+.*?##--/m.match(string)
Rather than trying to do it all in a single pattern, which will become difficult to maintain as your file headers change or grow, use several small tests that give you granularity. I'd do something like:
lines = '#++
# app_name/dir/dir/filename
# $Id$
#--
'
Split the text so you can retrieve the lines you want, and normalize them:
l1, l2, l3, l4 = lines.split("\n").map{ |s| s.strip.squeeze(' ') }
This is what they contain now:
[l1, l2, l3, l4] # => ["#++", "# app_name/dir/dir/filename", "# $Id$", "#--"]
Here's a set of tests, one for each line:
!!(l1[/^#\+\+/] && l2[/^#\s[\w\/]+/] && l3[/^#\s\$Id\$/i] && l4[/^#--/]) # => true
Here's what is being tested and what each returns:
l1[/^#\+\+/] # => "#++"
l2[/^#\s[\w\/]+/] # => "# app_name/dir/dir/filename"
l3[/^#\s\$Id\$/i] # => "# $Id$"
l4[/^#--/] # => "#--"
There are many different ways to grab the first "n" rows of a file. Here's a few:
File.foreach('test.txt').to_a[0, 4] # => ["#++\n", "# app_name/dir/dir/filename\n", "# $Id$\n", "#--\n"]
File.readlines('test.txt')[0, 4] # => ["#++\n", "# app_name/dir/dir/filename\n", "# $Id$\n", "#--\n"]
File.read('test.txt').split("\n")[0, 4] # => ["#++", "# app_name/dir/dir/filename", "# $Id$", "#--"]
The downside of these is they all "slurp" the input file, which, on a huge file will cause problems. It's trivial to write a piece of code that'd open a file, read the first four lines, and return them in an array. This is untested but looks about right:
def get_four_lines(path)
ary = []
File.open(path, 'r') do |fi|
4.times do
ary << fi.readline
end
end
ary
end
Here's a quick little benchmark to show why I'd go this way:
require 'fruity'
def slurp_file(path)
File.read(path).split("\n")[0,4] rescue []
end
def read_first_four_from_file(path)
ary = []
File.open(path, 'r') do |fi|
4.times do
ary << fi.readline
end
end
ary
rescue
[]
end
PATH = '/etc/'
FILES = Dir.entries(PATH).reject{ |f| f[/^\./] || Dir.exist?(f) }.map{ |f| File.join(PATH, f) }
compare do
slurp {
FILES.each do |f|
slurp_file(f)
end
}
read_four {
FILES.each do |f|
read_first_four_from_file(f)
end
}
end
Running that as root outputs:
Running each test once. Test will take about 1 second.
read_four is faster than slurp by 2x ± 1.0
That's reading approximately 105 files in my /etc directory.
Modifying the test to actually parse the lines and test to return a true/false:
require 'fruity'
def slurp_file(path)
ary = File.read(path).split("\n")[0,4]
!!(/#\+\+\n(.|\n)*?##\-\-/.match(ary.join("\n")))
rescue
false # return a consistent value to fruity
end
def read_first_four_from_file(path)
ary = []
File.open(path, 'r') do |fi|
4.times do
ary << fi.readline
end
end
l1, l2, l3, l4 = ary
!!(l1[/^#\+\+/] && l2[/^#\s[\w\/]+/] && l3[/^#\s\$Id\$/i] && l4[/^#--/])
rescue
false # return a consistent value to fruity
end
PATH = '/etc/'
FILES = Dir.entries(PATH).reject{ |f| f[/^\./] || Dir.exist?(f) }.map{ |f| File.join(PATH, f) }
compare do
slurp {
FILES.each do |f|
slurp_file(f)
end
}
read_four {
FILES.each do |f|
read_first_four_from_file(f)
end
}
end
Running that again returns:
Running each test once. Test will take about 1 second.
read_four is faster than slurp by 2x ± 1.0
Your benchmark isn't fair.
Here's one that's "fair":
require 'fruity'
def slurp_file(path)
text = File.read(path)
!!(/#\+\+\n(.|\n)*?##\-\-/.match(text))
rescue
false # return a consistent value to fruity
end
def read_first_four_from_file(path)
ary = []
File.open(path, 'r') do |fi|
4.times do
ary << fi.readline
end
end
l1, l2, l3, l4 = ary
!!(l1[/^#\+\+/] && l2[/^#\s[\w\/]+/] && l3[/^#\s\$Id\$/i] && l4[/^#--/])
rescue
false # return a consistent value to fruity
end
PATH = '/etc/'
FILES = Dir.entries(PATH).reject{ |f| f[/^\./] || Dir.exist?(f) }.map{ |f| File.join(PATH, f) }
compare do
slurp {
FILES.each do |f|
slurp_file(f)
end
}
read_four {
FILES.each do |f|
read_first_four_from_file(f)
end
}
end
Which outputs:
Running each test once. Test will take about 1 second.
read_four is similar to slurp
Joining the split strings back into a longer string prior to doing the match was the wrong path, so working from the full file's content is a more even test.
[...] Just read the first four lines and apply the pattern, that's it
That's not just it. A multiline regex written to find information spanning multiple lines can't be passed single text lines and return accurate results, so it needs to get a long string. Determining how many characters make up four lines would only add overhead and slow the algorithm; that's what the previous benchmark did, and it wasn't "fair".
Depends on your input data. If you run this code over a complete (bigger) source code folder, it will slow it down significantly.
There were 105+ files in the directory. That's a reasonably large number of files, but iterating over a large number of files will not show a difference, as Ruby's ability to open files isn't the issue; it's the I/O speed of reading a file in one pass vs. line by line. And, from experience I know the line-by-line I/O is fast. Again, a benchmark says:
require 'fruity'
LITTLEFILE = 'little.txt'
MEDIUMFILE = 'medium.txt'
BIGFILE = 'big.txt'
LINES = '#++
# app_name/dir/dir/filename
# $Id$
#--
'
LITTLEFILE_MULTIPLIER = 1
MEDIUMFILE_MULTIPLIER = 1_000
BIGFILE_MULTIPLIER = 100_000
File.write(BIGFILE, LINES * BIGFILE_MULTIPLIER)
def _slurp_file(path)
File.read(path)
true # return a consistent value to fruity
end
def _read_first_four_from_file(path)
ary = []
File.open(path, 'r') do |fi|
4.times do
ary << fi.readline
end
end
l1, l2, l3, l4 = ary
true # return a consistent value to fruity
end
[
[LITTLEFILE, LITTLEFILE_MULTIPLIER],
[MEDIUMFILE, MEDIUMFILE_MULTIPLIER],
[BIGFILE, BIGFILE_MULTIPLIER]
].each do |file, mult|
File.write(file, LINES * mult)
puts "Benchmarking against #{ file }"
puts "%s is %d bytes" % [ file, File.size(file)]
compare do
slurp { _slurp_file(file) }
read_first_four_from_file { _read_first_four_from_file(file) }
end
puts
end
With the output:
Benchmarking against little.txt
little.txt is 49 bytes
Running each test 128 times. Test will take about 1 second.
slurp is similar to read_first_four_from_file
Benchmarking against medium.txt
medium.txt is 49000 bytes
Running each test 128 times. Test will take about 1 second.
read_first_four_from_file is faster than slurp by 39.99999999999999% ± 10.0%
Benchmarking against big.txt
big.txt is 4900000 bytes
Running each test 128 times. Test will take about 4 seconds.
read_first_four_from_file is faster than slurp by 100x ± 10.0
For a small file of four lines, read is as fast as foreach, but once the file size increases the overhead of reading the entire file starts to impact the times.
Any solution relying on slurping files is known to be a bad thing; it's not scalable, and can actually cause code to halt due to memory allocation if BIG files are encountered. Reading the first four lines will always run at a consistent speed independent of the file sizes, so use that technique EVERY time there is a chance that the file sizes will vary. Or, at least, be very aware of the impact on run times and the potential problems that can be caused by slurping files.
You might want to try the following pattern: \#\+{2}(?:.|[\r\n])*?\#\-{2}
Working demo # regex101
I'm using this code to let the user enter names while the program stores them in an array until they enter an empty string (they must press enter after each name):
people = []
info = 'a' # must fill variable with something, otherwise loop won't execute
while not info.empty?
info = gets.chomp
people += [Person.new(info)] if not info.empty?
end
This code would look much nicer in a do ... while loop:
people = []
do
info = gets.chomp
people += [Person.new(info)] if not info.empty?
while not info.empty?
In this code I don't have to assign info to some random string.
Unfortunately this type of loop doesn't seem to exist in Ruby. Can anybody suggest a better way of doing this?
CAUTION:
The begin <code> end while <condition> is rejected by Ruby's author Matz. Instead he suggests using Kernel#loop, e.g.
loop do
# some code here
break if <condition>
end
Here's an email exchange from 23 Nov 2005 where Matz states:
|> Don't use it please. I'm regretting this feature, and I'd like to
|> remove it in the future if it's possible.
|
|I'm surprised. What do you regret about it?
Because it's hard for users to tell
begin <code> end while <cond>
works differently from
<code> while <cond>
RosettaCode wiki has a similar story:
During November 2005, Yukihiro Matsumoto, the creator of Ruby, regretted this loop feature and suggested using Kernel#loop.
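Applied to the question's input loop, that pattern would look like this (a minimal sketch; Person is the class from the question):
people = []
loop do
  info = gets.chomp
  break if info.empty?
  people << Person.new(info)
end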
I found the following snippet while reading the source for Tempfile#initialize in the Ruby core library:
begin
tmpname = File.join(tmpdir, make_tmpname(basename, n))
lock = tmpname + '.lock'
n += 1
end while @@cleanlist.include?(tmpname) or
File.exist?(lock) or File.exist?(tmpname)
At first glance, I assumed the while modifier would be evaluated before the contents of begin...end, but that is not the case. Observe:
>> begin
?> puts "do {} while ()"
>> end while false
do {} while ()
=> nil
As you would expect, the loop will continue to execute while the modifier is true.
>> n = 3
=> 3
>> begin
?> puts n
>> n -= 1
>> end while n > 0
3
2
1
=> nil
While I would be happy to never see this idiom again, begin...end is quite powerful. The following is a common idiom to memoize a one-liner method with no params:
def expensive
@expensive ||= 2 + 2
end
Here is an ugly, but quick way to memoize something more complex:
def expensive
@expensive ||=
begin
n = 99
buf = ""
begin
buf << "#{n} bottles of beer on the wall\n"
# ...
n -= 1
end while n > 0
buf << "no more bottles of beer"
end
end
Originally written by Jeremy Voorhis. The content has been copied here because it seems to have been taken down from the originating site. Copies can also be found in the Web Archive and at Ruby Buzz Forum. -Bill the Lizard
Like this:
people = []
begin
info = gets.chomp
people += [Person.new(info)] if not info.empty?
end while not info.empty?
Reference: Ruby's Hidden do {} while () Loop
How about this?
people = []
until (info = gets.chomp).empty?
people += [Person.new(info)]
end
This works correctly now:
begin
# statement
end until <condition>
But it may be removed in the future, because the begin statement is counterintuitive. See: http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/6745
Matz (Ruby’s Creator) recommended doing it this way:
loop do
# ...
break if <condition>
end
From what I gather, Matz does not like the construct
begin
<multiple_lines_of_code>
end while <cond>
because its semantics are different from
<single_line_of_code> while <cond>
in that the first construct executes the code first before checking the condition,
and the second construct tests the condition first before it executes the code (if ever). I take it Matz prefers to keep the second construct because it matches the one-line construct of if statements.
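A tiny example of that difference, since the begin/end form is post-test while the plain modifier form is pre-test:
i = 0
begin
  i += 1
end while false
i # => 1, the body ran once before the condition was checked

j = 0
j += 1 while false
j # => 0, the condition was checked first, so the body never ran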
I never liked the second construct even for if statements. In all other cases, the computer executes code left-to-right (e.g. || and &&), top-to-bottom. Humans read code left-to-right, top-to-bottom.
I suggest the following constructs instead:
if <cond> then <one_line_code> # matches case-when-then statement
while <cond> then <one_line_code>
<one_line_code> while <cond>
begin <multiple_line_code> end while <cond> # or something similar but left-to-right
I don't know if those suggestions will parse with the rest of the language. But in any case
I prefer keeping left-to-right execution as well as language consistency.
a = 1
while true
puts a
a += 1
break if a > 10
end
Here's another one:
people = []
1.times do
info = gets.chomp
unless info.empty?
people += [Person.new(info)]
redo
end
end
ppl = []
while (input=gets.chomp)
if !input.empty?
ppl << input
else
p ppl; puts "Goodbye"; break
end
end