I want to search through some files to see whether they have a comment block at the top of the file.
Here's what I'm searching for:
#++
# app_name/dir/dir/filename
# $Id$
#--
I had this as a REGEX and came up short:
:doc => { :test => '^#--\s+[filename]\s+\$Id'
if @file_text =~ Regexp.new(@rules[rule][:test])
....
Any suggestions?
Check this example:
string = <<EOF
#++
## app_name/dir/dir/filename
## $Id$
##--
foo bar
EOF
puts /#\+\+.*\n##.*\n##.*\n##--/.match(string)
The pattern matches two lines starting with ## between a line starting with #++ and a line starting with ##--, and includes those boundary lines in the match. If I understood the question correctly, this should be what you want.
You can generalize the pattern to match everything between the first #++ and the first ##-- (inclusive), using the m flag so that . also matches newlines:
puts /#\+\+.*?##--/m.match(string)
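Since the question asks whether the block sits at the top of the file, anchoring the pattern to the start of the text may also help. This is only a minimal sketch: HEADER and header_at_top? are names I made up, and it assumes the single-# layout shown in the question rather than the ## layout in my example string.

```ruby
# Hypothetical pattern: \A pins the match to the very start of the text,
# so a header that appears further down the file does not count.
HEADER = /\A#\+\+\n# \S+\n# \$Id.*\$\n#--$/

def header_at_top?(text)
  HEADER.match?(text)
end

good = <<~'TEXT'
  #++
  # app_name/dir/dir/filename
  # $Id$
  #--
  puts 'hi'
TEXT
bad = "puts 'hi'\n" + good

puts header_at_top?(good)
puts header_at_top?(bad)
```

The .* between $Id and the closing $ is there on the assumption that the keyword may be expanded by the version control system.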
Rather than try to do it all in a single pattern, which will become difficult to maintain as your file headers change/grow, instead use several small tests which give you granularity. I'd do something like:
lines = '#++
# app_name/dir/dir/filename
# $Id$
#--
'
Split the text so you can retrieve the lines you want, and normalize them:
l1, l2, l3, l4 = lines.split("\n").map{ |s| s.strip.squeeze(' ') }
This is what they contain now:
[l1, l2, l3, l4] # => ["#++", "# app_name/dir/dir/filename", "# $Id$", "#--"]
Here's a set of tests, one for each line:
!!(l1[/^#\+\+/] && l2[/^#\s[\w\/]+/] && l3[/^#\s\$Id\$/i] && l4[/^#--/]) # => true
Here's what is being tested and what each returns:
l1[/^#\+\+/] # => "#++"
l2[/^#\s[\w\/]+/] # => "# app_name/dir/dir/filename"
l3[/^#\s\$Id\$/i] # => "# $Id$"
l4[/^#--/] # => "#--"
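The four tests can be wrapped in one method so callers get a plain true/false. A minimal sketch, assuming the lines arrive as an array; header_ok? is a hypothetical name, and the guard for short input is my addition:

```ruby
# Hypothetical helper wrapping the four per-line tests shown above.
def header_ok?(lines)
  l1, l2, l3, l4 = lines.first(4).map { |s| s.strip.squeeze(' ') }
  return false unless l1 && l2 && l3 && l4 # fewer than four lines

  !!(l1[/^#\+\+/] && l2[/^#\s[\w\/]+/] && l3[/^#\s\$Id\$/i] && l4[/^#--/])
end
```

Keeping each test separate means that when the header format changes, only the affected line's pattern needs to be touched.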
There are many different ways to grab the first "n" rows of a file. Here are a few:
File.foreach('test.txt').to_a[0, 4] # => ["#++\n", "# app_name/dir/dir/filename\n", "# $Id$\n", "#--\n"]
File.readlines('test.txt')[0, 4] # => ["#++\n", "# app_name/dir/dir/filename\n", "# $Id$\n", "#--\n"]
File.read('test.txt').split("\n")[0, 4] # => ["#++", "# app_name/dir/dir/filename", "# $Id$", "#--"]
The downside of these is that they all "slurp" the input file, which will cause problems on a huge file. It's trivial to write code that opens a file, reads the first four lines, and returns them in an array. This is untested but looks about right:
def get_four_lines(path)
  ary = []
  File.open(path, 'r') do |fi|
    4.times do
      ary << fi.readline
    end
  end
  ary
end
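One caveat with get_four_lines: readline raises EOFError when the file has fewer than four lines. A possible alternative, sketched here under the assumption that a shorter array is acceptable for short files (first_lines is a name I made up):

```ruby
# Hypothetical alternative to get_four_lines: File.foreach without a block
# returns an Enumerator, and first(count) stops reading once it has
# collected that many lines, so the rest of the file is never loaded.
def first_lines(path, count = 4)
  File.foreach(path).first(count)
end
```

For a two-line file this returns a two-element array instead of raising.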
Here's a quick little benchmark to show why I'd go this way:
require 'fruity'
def slurp_file(path)
File.read(path).split("\n")[0,4] rescue []
end
def read_first_four_from_file(path)
ary = []
File.open(path, 'r') do |fi|
4.times do
ary << fi.readline
end
end
ary
rescue
[]
end
PATH = '/etc/'
FILES = Dir.entries(PATH).reject{ |f| f[/^\./] || Dir.exist?(f) }.map{ |f| File.join(PATH, f) }
compare do
slurp {
FILES.each do |f|
slurp_file(f)
end
}
read_four {
FILES.each do |f|
read_first_four_from_file(f)
end
}
end
Running that as root outputs:
Running each test once. Test will take about 1 second.
read_four is faster than slurp by 2x ± 1.0
That's reading approximately 105 files in my /etc directory.
Modifying the test to actually parse the lines and test to return a true/false:
require 'fruity'
def slurp_file(path)
ary = File.read(path).split("\n")[0,4]
!!(/#\+\+\n(.|\n)*?##\-\-/.match(ary.join("\n")))
rescue
false # return a consistent value to fruity
end
def read_first_four_from_file(path)
ary = []
File.open(path, 'r') do |fi|
4.times do
ary << fi.readline
end
end
l1, l2, l3, l4 = ary
!!(l1[/^#\+\+/] && l2[/^#\s[\w\/]+/] && l3[/^#\s\$Id\$/i] && l4[/^#--/])
rescue
false # return a consistent value to fruity
end
PATH = '/etc/'
FILES = Dir.entries(PATH).reject{ |f| f[/^\./] || Dir.exist?(f) }.map{ |f| File.join(PATH, f) }
compare do
slurp {
FILES.each do |f|
slurp_file(f)
end
}
read_four {
FILES.each do |f|
read_first_four_from_file(f)
end
}
end
Running that again returns:
Running each test once. Test will take about 1 second.
read_four is faster than slurp by 2x ± 1.0
Your benchmark isn't fair.
Here's one that's "fair":
require 'fruity'
def slurp_file(path)
text = File.read(path)
!!(/#\+\+\n(.|\n)*?##\-\-/.match(text))
rescue
false # return a consistent value to fruity
end
def read_first_four_from_file(path)
ary = []
File.open(path, 'r') do |fi|
4.times do
ary << fi.readline
end
end
l1, l2, l3, l4 = ary
!!(l1[/^#\+\+/] && l2[/^#\s[\w\/]+/] && l3[/^#\s\$Id\$/i] && l4[/^#--/])
rescue
false # return a consistent value to fruity
end
PATH = '/etc/'
FILES = Dir.entries(PATH).reject{ |f| f[/^\./] || Dir.exist?(f) }.map{ |f| File.join(PATH, f) }
compare do
slurp {
FILES.each do |f|
slurp_file(f)
end
}
read_four {
FILES.each do |f|
read_first_four_from_file(f)
end
}
end
Which outputs:
Running each test once. Test will take about 1 second.
read_four is similar to slurp
Joining the split strings back into a longer string prior to doing the match was the wrong path, so working from the full file's content is a more even test.
[...] Just read the first four lines and apply the pattern, that's it
That's not just it. A multiline regex written to find information spanning multiple lines can't be passed single lines of text and return accurate results, so it needs to get one long string. Determining how many characters make up four lines would only add overhead and slow the algorithm; that's what the previous benchmark did, and it wasn't "fair".
Depends on your input data. If you would run this code over a complete (bigger) source code folder, it will slow down it significantly.
There were 105+ files in the directory. That's a reasonably large number of files, but iterating over a large number of files will not show a difference, as Ruby's ability to open files isn't the issue; it's the I/O speed of reading a file in one pass vs. line by line. And from experience I know that line-by-line I/O is fast. Again, a benchmark says:
require 'fruity'
LITTLEFILE = 'little.txt'
MEDIUMFILE = 'medium.txt'
BIGFILE = 'big.txt'
LINES = '#++
# app_name/dir/dir/filename
# $Id$
#--
'
LITTLEFILE_MULTIPLIER = 1
MEDIUMFILE_MULTIPLIER = 1_000
BIGFILE_MULTIPLIER = 100_000
File.write(BIGFILE, LINES * BIGFILE_MULTIPLIER)
def _slurp_file(path)
File.read(path)
true # return a consistent value to fruity
end
def _read_first_four_from_file(path)
ary = []
File.open(path, 'r') do |fi|
4.times do
ary << fi.readline
end
end
l1, l2, l3, l4 = ary
true # return a consistent value to fruity
end
[
[LITTLEFILE, LITTLEFILE_MULTIPLIER],
[MEDIUMFILE, MEDIUMFILE_MULTIPLIER],
[BIGFILE, BIGFILE_MULTIPLIER]
].each do |file, mult|
File.write(file, LINES * mult)
puts "Benchmarking against #{ file }"
puts "%s is %d bytes" % [ file, File.size(file)]
compare do
slurp { _slurp_file(file) }
read_first_four_from_file { _read_first_four_from_file(file) }
end
puts
end
With the output:
Benchmarking against little.txt
little.txt is 49 bytes
Running each test 128 times. Test will take about 1 second.
slurp is similar to read_first_four_from_file
Benchmarking against medium.txt
medium.txt is 49000 bytes
Running each test 128 times. Test will take about 1 second.
read_first_four_from_file is faster than slurp by 39.99999999999999% ± 10.0%
Benchmarking against big.txt
big.txt is 4900000 bytes
Running each test 128 times. Test will take about 4 seconds.
read_first_four_from_file is faster than slurp by 100x ± 10.0
When reading a small file of four lines, read is as fast as reading line by line, but once the file size increases, the overhead of reading the entire file starts to impact the times.
Any solution relying on slurping files is known to be a bad thing; it's not scalable, and can actually cause code to halt due to memory allocation if BIG files are encountered. Reading the first four lines will always run at a consistent speed independent of the file sizes, so use that technique EVERY time there is a chance that the file sizes will vary. Or, at least, be very aware of the impact on run times and the potential problems that can be caused by slurping files.
You might want to try the following pattern: \#\+{2}(?:.|[\r\n])*?\#\-{2}
Working demo @ regex101
Related
I have a Ruby script that does the following to a text file:
removes non-ASCII lines
removes lines containing "::" (two colons in a row)
if there is more than one ":" present in the line (which aren't directly next to each other), it only keeps the strings on both sides of the last colon.
removes leading whitespace
removes unusual control characters
The problem is, I'm working with files that have ~20 million lines, and my script says it'll take ~45 minutes to run.
Is there a way to majorly speed this up? Or, is there a significantly quicker way to handle this in shell?
require 'ruby-progressbar'
class String
def strip_control_characters()
chars.each_with_object("") do |char, str|
str << char unless char.ascii_only? and (char.ord < 32 or char.ord == 127)
end
end
def strip_control_and_extended_characters()
chars.each_with_object("") do |char, str|
str << char if char.ascii_only? and char.ord.between?(32,126)
end
end
end
class Numeric
def percent_of(n)
self.to_f / n.to_f * 100.0
end
end
def clean(file_in,file_out)
if !File.exist?(file_in)
puts "File '#{file_in}' does not exist."
return
end
File.delete(file_out) if File.exist?(file_out)
`touch #{file_out}`
deleted = 0
count = 0
line_count = `wc -l "#{file_in}"`.strip.split(' ')[0].to_i
puts "File has #{line_count} lines. Cleaning..."
progressbar = ProgressBar.create(total: line_count, length: 100, format: 'Progress |%B| %a %e')
IO.foreach(file_in) {|x|
if x.ascii_only?
line = x.strip_control_and_extended_characters.strip
if line == ""
deleted += 1
next
end
if line.include?("::")
deleted += 1
next
end
split = line.split(":")
c = split.count
if c == 1
deleted += 1
next
end
if c > 2
line = split.last(2).join(":")
end
if line != ""
File.open(file_out, 'a') { |f| f.puts(line) }
else
deleted += 1
end
else
deleted += 1
end
progressbar.progress += 1
}
puts "Deleted #{deleted} lines."
end
Here is one of your big problems:
if line != ""
File.open(file_out, 'a') { |f| f.puts(line) }
end
So your program needs to open and close the output file millions of times because it is doing that for every single line. Each time it opens it, since it is being opened in append mode, your system might have to do a lot of work to find the end of the file.
You should really change your program to open the output file once at the beginning and only close it at the end. Also, run strace to see what your Ruby I/O operations are doing behind the scenes; it should buffer up the writes and then send them to the OS in blocks of about 4 kilobytes at a time; it shouldn't issue a write system call for every single line.
To further improve the performance, you should use a Ruby profiling tool to see which functions are taking the most time.
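The open-once change described above could look roughly like this. It is a sketch, not a drop-in replacement: copy_clean_lines is a hypothetical name, and the block stands in for the question's per-line rules.

```ruby
# Sketch only: copy_clean_lines and the block-based rule are hypothetical
# stand-ins for the question's clean method and its per-line checks.
def copy_clean_lines(file_in, file_out)
  kept = 0
  # Open the output once; Ruby buffers the writes and closes the file
  # when the block ends.
  File.open(file_out, 'w') do |out|
    File.foreach(file_in) do |raw|
      line = yield(raw) # the caller's rule returns a String or nil
      next if line.nil? || line.empty?

      out.puts(line)
      kept += 1
    end
  end
  kept
end
```

It might be called as copy_clean_lines(src, dst) { |raw| raw.include?('::') ? nil : raw.strip }. Opening with 'w' also creates or truncates the output file, so the separate delete and touch steps would no longer be needed.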
You can improve the speed by changing your String additions to variations on:
class String
def strip_control_characters()
gsub(/[[:cntrl:]]+/, '')
end
def strip_control_and_extended_characters()
strip_control_characters.gsub(/[^[:ascii:]]+/, '')
end
end
str = (0..255).to_a.map { |b| b.chr }.join # => "\x00\x01\x02\x03\x04\x05\x06\a\b\t\n\v\f\r\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\e\x1C\x1D\x1E\x1F !\"\#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7F\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F\xA0\xA1\xA2\xA3\xA4\xA5\xA6\xA7\xA8\xA9\xAA\xAB\xAC\xAD\xAE\xAF\xB0\xB1\xB2\xB3\xB4\xB5\xB6\xB7\xB8\xB9\xBA\xBB\xBC\xBD\xBE\xBF\xC0\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD7\xD8\xD9\xDA\xDB\xDC\xDD\xDE\xDF\xE0\xE1\xE2\xE3\xE4\xE5\xE6\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7\xF8\xF9\xFA\xFB\xFC\xFD\xFE\xFF"
str.strip_control_characters
# => " !\"\#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F\xA0\xA1\xA2\xA3\xA4\xA5\xA6\xA7\xA8\xA9\xAA\xAB\xAC\xAD\xAE\xAF\xB0\xB1\xB2\xB3\xB4\xB5\xB6\xB7\xB8\xB9\xBA\xBB\xBC\xBD\xBE\xBF\xC0\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD7\xD8\xD9\xDA\xDB\xDC\xDD\xDE\xDF\xE0\xE1\xE2\xE3\xE4\xE5\xE6\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7\xF8\xF9\xFA\xFB\xFC\xFD\xFE\xFF"
str.strip_control_and_extended_characters
# => " !\"\#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
Use the built-in gsub method along with the POSIX character-sets instead of iterating over the strings and testing each character.
As @Myst said though, monkey-patching is rude. Use refinements, or create some methods and pass in the string:
def strip_control_characters(str)
str.gsub(/[[:cntrl:]]+/, '')
end
def strip_control_and_extended_characters(str)
strip_control_characters(str).gsub(/[^[:ascii:]]+/, '')
end
str = (0..255).to_a.map { |b| b.chr }.join # => "\x00\x01\x02\x03\x04\x05\x06\a\b\t\n\v\f\r\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\e\x1C\x1D\x1E\x1F !\"\#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7F\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F\xA0\xA1\xA2\xA3\xA4\xA5\xA6\xA7\xA8\xA9\xAA\xAB\xAC\xAD\xAE\xAF\xB0\xB1\xB2\xB3\xB4\xB5\xB6\xB7\xB8\xB9\xBA\xBB\xBC\xBD\xBE\xBF\xC0\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD7\xD8\xD9\xDA\xDB\xDC\xDD\xDE\xDF\xE0\xE1\xE2\xE3\xE4\xE5\xE6\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7\xF8\xF9\xFA\xFB\xFC\xFD\xFE\xFF"
strip_control_characters(str)
# => " !\"\#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F\xA0\xA1\xA2\xA3\xA4\xA5\xA6\xA7\xA8\xA9\xAA\xAB\xAC\xAD\xAE\xAF\xB0\xB1\xB2\xB3\xB4\xB5\xB6\xB7\xB8\xB9\xBA\xBB\xBC\xBD\xBE\xBF\xC0\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD7\xD8\xD9\xDA\xDB\xDC\xDD\xDE\xDF\xE0\xE1\xE2\xE3\xE4\xE5\xE6\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7\xF8\xF9\xFA\xFB\xFC\xFD\xFE\xFF"
strip_control_and_extended_characters(str)
# => " !\"\#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
Moving on...
`touch #{file_out}`
is a problem too. You're creating a sub-shell every time that runs, executing touch, then tearing it down, which is a slow operation. Let Ruby do it:
=== Implementation from FileUtils
------------------------------------------------------------------------------
touch(list, noop: nil, verbose: nil, mtime: nil, nocreate: nil)
------------------------------------------------------------------------------
Updates modification time (mtime) and access time (atime) of file(s) in list.
Files are created if they don't exist.
FileUtils.touch 'timestamp'
FileUtils.touch Dir.glob('*.c'); system 'make'
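Applied to the question's script, the sub-shell call could be swapped for something like the following. A small sketch only; the path is a made-up temporary file so the example can run anywhere.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical path, used only for this illustration.
file_out = File.join(Dir.tmpdir, 'clean_example_output.txt')

FileUtils.touch(file_out) # creates the file if it is missing, no sub-shell
puts File.exist?(file_out)
```

If the script later opens the file for writing anyway, that open call creates the file too, so the touch may not be needed at all.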
Finally, learn to benchmark code as you develop. Take the time to think of a couple ways to do something, then test them against each other and find out which is the fastest. I use Fruity, because it handles issues that the Benchmark class doesn't, but do one or the other. You can find a lot of tests I did here for various things by searching SO for my user and "benchmark".
require 'fruity'
class String
def strip_control_characters()
chars.each_with_object("") do |char, str|
str << char unless char.ascii_only? and (char.ord < 32 or char.ord == 127)
end
end
def strip_control_and_extended_characters()
chars.each_with_object("") do |char, str|
str << char if char.ascii_only? and char.ord.between?(32,126)
end
end
end
def strip_control_characters2(str)
str.gsub(/[[:cntrl:]]+/, '')
end
def strip_control_and_extended_characters2(str)
strip_control_characters2(str).gsub(/[^[:ascii:]]+/, '')
end
str = (0..255).to_a.map { |b| b.chr }.join
str.strip_control_characters # => " !\"\#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F\xA0\xA1\xA2\xA3\xA4\xA5\xA6\xA7\xA8\xA9\xAA\xAB\xAC\xAD\xAE\xAF\xB0\xB1\xB2\xB3\xB4\xB5\xB6\xB7\xB8\xB9\xBA\xBB\xBC\xBD\xBE\xBF\xC0\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD7\xD8\xD9\xDA\xDB\xDC\xDD\xDE\xDF\xE0\xE1\xE2\xE3\xE4\xE5\xE6\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7\xF8\xF9\xFA\xFB\xFC\xFD\xFE\xFF"
strip_control_characters2(str) # => " !\"\#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F\xA0\xA1\xA2\xA3\xA4\xA5\xA6\xA7\xA8\xA9\xAA\xAB\xAC\xAD\xAE\xAF\xB0\xB1\xB2\xB3\xB4\xB5\xB6\xB7\xB8\xB9\xBA\xBB\xBC\xBD\xBE\xBF\xC0\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD7\xD8\xD9\xDA\xDB\xDC\xDD\xDE\xDF\xE0\xE1\xE2\xE3\xE4\xE5\xE6\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7\xF8\xF9\xFA\xFB\xFC\xFD\xFE\xFF"
str.strip_control_and_extended_characters # => " !\"\#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
strip_control_and_extended_characters2(str) # => " !\"\#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
compare do
scc { str.strip_control_characters }
scc2 { strip_control_characters2(str) }
end
# >> Running each test 512 times. Test will take about 1 second.
# >> scc2 is faster than scc by 10x ± 1.0
and:
compare do
scec { str.strip_control_and_extended_characters }
scec2 { strip_control_and_extended_characters2(str) }
end
# >> Running each test 256 times. Test will take about 1 second.
# >> scec2 is faster than scec by 5x ± 1.0
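In the spirit of testing a couple of ways against each other, another candidate that might be worth adding to the comparison is String#delete with a negated range. I haven't benchmarked it here, so treat this as an untested-for-speed sketch; the method name is made up to follow the numbering above.

```ruby
# Hypothetical third variant: String#delete with a negated character range
# removes everything that is NOT between space and tilde, i.e. it keeps
# only printable ASCII.
def strip_control_and_extended_characters3(str)
  str.delete('^ -~')
end

p strip_control_and_extended_characters3("a\tb\x7Fc:d\n")
```

It could be dropped into the compare block as a third entry to see how it fares on your data.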
There seem to be only two possible approaches to optimizing this:
Concurrency.
If your machine is a Unix/Linux-based machine with a multi-core CPU, you can take advantage of the multiple cores by using fork to divide the work between different processes.
Multi-threading might not work as well as you'd expect with Ruby, since there's a GIL (Global Interpreter Lock) that prevents multiple threads from running Ruby code in parallel.
Code optimizations.
These include minimizing system calls (such as the File.open) and minimizing any temporary objects.
I would start with this approach before I moved on to fork, mainly due to the extra coding required when using fork.
The first approach requires a large rewrite of the script, while the second approach might be more easily achieved.
For example, the following approach minimizes some system calls (such as the File's open, close and write system calls):
require 'ruby-progressbar'
class String
def strip_control_characters()
chars.each_with_object("") do |char, str|
str << char unless char.ascii_only? and (char.ord < 32 or char.ord == 127)
end
end
def strip_control_and_extended_characters()
chars.each_with_object("") do |char, str|
str << char if char.ascii_only? and char.ord.between?(32,126)
end
end
end
class Numeric
def percent_of(n)
self.to_f / n.to_f * 100.0
end
end
def clean(file_in,file_out)
if !File.exist?(file_in)
puts "File '#{file_in}' does not exist."
return
end
File.delete(file_out) if File.exist?(file_out)
`touch #{file_out}`
deleted = 0
count = 0
line_count = `wc -l "#{file_in}"`.strip.split(' ')[0].to_i
puts "File has #{line_count} lines. Cleaning..."
progressbar = ProgressBar.create(total: line_count, length: 100, format: 'Progress |%B| %a %e')
file_fd = File.open(file_out, 'a')
buffer = "".dup
IO.foreach(file_in) {|x|
if x.ascii_only?
line = x.strip_control_and_extended_characters.strip
if line == ""
deleted += 1
next
end
if line.include?("::")
deleted += 1
next
end
split = line.split(":")
c = split.count
if c == 1
deleted += 1
next
end
if c > 2
line = split.last(2).join(":")
end
if line != ""
buffer << line << "\n"
else
deleted += 1
end
else
deleted += 1
end
if buffer.length >= 2048
file_fd.puts(buffer)
buffer.clear
end
progressbar.progress += 1
}
file_fd.puts(buffer)
buffer.clear
file_fd.close
puts "Deleted #{deleted} lines."
end
P.S.
I would avoid monkey patching - it's rude.
After posting this I read @DavidGrayson's answer, which pinpoints an issue with your code's performance in a much shorter and more succinct answer.
I up-voted his answer, as I think you'll get a big performance gain from this simple change.
I want to take a file, read the file into my program and split it into characters, split the resulting character array into a multidimensional array of 5,000 characters each, then write each separate array into a file found in the same location.
I have taken a file, read it, and created the multidimensional array. Now I want to write each separate single dimension array into separate files.
The file is obtained via user input. I then created a chain of helper methods: the first mixin reads the file into an array, which is passed to another method that breaks it down into a multidimensional array, which finally hands it off to the end of the chain, currently set up to make a new directory into which I will put these files.
require 'Benchmark/ips'
file = "C:\\test.php"
class String
def file_to_array
file = self
return_file = File.open(file) do |line|
line.each_char.to_a
end
return return_file
end
def file_write
file_to_write = self
if Dir.exist?("I:\\file_to_array")
File.open("I:/file_to_array/tmp.txt", "w") { |file| file.write(file_to_write) }
read_file = File.read("I:/file_to_array/tmp.txt")
else
Dir.mkdir("I:\\file_to_array")
end
end
end
class Array
def file_divider
file_to_divide = self
file_to_separate = []
count = 0
while count != file_to_divide.length
separator = count % 5000
if separator == 0
start = count - 5000
stop = count
file_to_separate << file_to_divide[start..stop]
end
count = count + 1
end
return file_to_separate
end
def file_write
file_to_write = self
if Dir.exist?("I:\\file_to_array")
File.open("I:/file_to_array/tmp.txt", "w") { |file| file.write(file_to_write) }
else
Dir.mkdir("I:\\file_to_array")
end
end
end
Benchmark.ips do |result|
result.report { file.file_to_array.file_divider.file_write }
end
Test.php
<?php
echo "hello world"
?>
This untested code is where I'd start to split text into chunks and save it:
str = "I want to take a file"
str_array = str.scan(/.{1,10}/) # => ["I want to ", "take a fil", "e"]
str_array.each.with_index(1) do |str_chunk, i|
File.write("output#{i}", str_chunk)
end
This doesn't honor word-boundaries.
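If word boundaries do matter, one possible variation is to let the regex break only at whitespace. This is a sketch under my own assumptions: chunk_on_words is a hypothetical name, width must be at least 2, and a single word longer than the width is kept whole rather than split.

```ruby
# Hypothetical helper: split text into pieces of at most `width` characters,
# breaking only at whitespace. A single word longer than `width` is kept whole.
def chunk_on_words(text, width)
  text.scan(/\S.{0,#{width - 2}}\S(?=\s|\z)|\S+/)
end

p chunk_on_words("I want to take a file", 10)
```

The pieces can then be written out with the same each.with_index loop as above.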
Reading a separate input file is easy: you can use read if you KNOW the input will never exceed the available memory and you don't care about performance.
Thinking about it further, if you want to read a text file and break its contents into smaller files, then read it in chunks:
input = File.open('input.txt', 'r')
i = 1
until input.eof? do
chunk = input.read(10)
File.write("output#{i}", chunk)
i += 1
end
input.close
Or even better because it automatically closes the input:
File.open('input.txt', 'r') do |input|
i = 1
until input.eof? do
chunk = input.read(10)
File.write("output#{i}", chunk)
i += 1
end
end
Those are not tested, but they look about right.
Use standard File API and Serialisation.
File.write('path/to/yourfile.txt', Marshal.dump([1, 2, 3]))
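A round trip might look like the sketch below. The file name and the sample data are made up for illustration, and I've used the binary read/write variants on the assumption that the script may run on Windows, where text mode can alter the marshalled bytes.

```ruby
require 'tmpdir'

# Hypothetical file name, used only for this illustration.
path = File.join(Dir.tmpdir, 'marshal_example.bin')

chunks = [%w[a b c], %w[d e]]

# Marshal produces binary data, so write and read it in binary mode.
File.binwrite(path, Marshal.dump(chunks))
restored = Marshal.load(File.binread(path))

p restored == chunks
```

Marshal.load should only be used on files you wrote yourself, since loading untrusted data is unsafe.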
I wrote the following script to read a CSV file:
f = File.open("aFile.csv")
text = f.read
text.each_line do |line|
if (f.eof?)
puts "End of file reached"
else
line_num +=1
if(line_num < 6) then
puts "____SKIPPED LINE____"
next
end
end
arr = line.split(",")
puts "line number = #{line_num}"
end
This code runs fine if I take out the line:
if (f.eof?)
puts "End of file reached"
With this line in I get an exception.
I was wondering how I can detect the end of file in the code above.
Try this short example:
f = File.open(__FILE__)
text = f.read
p f.eof? # -> true
p text.class #-> String
With f.read you read the whole file into text and reach EOF.
(Remark: __FILE__ is the script file itself. You may use your CSV file instead.)
In your code you use text.each_line. This executes each_line for the string text. It has no effect on f.
You could use File#each_line without using a variable text. The test for EOF is not necessary. each_line loops on each line and detects EOF on its own.
f = File.open(__FILE__)
line_num = 0
f.each_line do |line|
line_num +=1
if (line_num < 6)
puts "____SKIPPED LINE____"
next
end
arr = line.split(",")
puts "line number = #{line_num}"
end
f.close
You should close the file after reading it. Using a block for this is more Ruby-like:
line_num = 0
File.open(__FILE__) do |f|
f.each_line do |line|
line_num +=1
if (line_num < 6)
puts "____SKIPPED LINE____"
next
end
arr = line.split(",")
puts "line number = #{line_num}"
end
end
One general remark: There is a CSV library in Ruby. Normally it is better to use that.
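With the CSV library, the same loop could be sketched as follows. each_csv_row is a hypothetical helper, and the default of skipping five rows simply mirrors the line_num < 6 test above.

```ruby
require 'csv'

# Hypothetical helper: yields each parsed row after skipping the first
# `skip` rows, along with its 1-based row number.
def each_csv_row(path, skip: 5)
  CSV.foreach(path).with_index(1) do |row, row_num|
    next if row_num <= skip

    yield row, row_num
  end
end
```

It might be used as each_csv_row('aFile.csv') { |row, n| puts "line number = #{n}" }. The row arrives already split into fields, so the manual split(",") is no longer needed, and quoted fields containing commas are handled.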
https://www.ruby-forum.com/topic/218093#946117 talks about this.
content = File.read("file.txt")
content = File.readlines("file.txt")
The above 'slurps' the entire file into memory.
File.foreach("file.txt") {|line| content << line}
You can also use IO#each_line. These last two options do not read the entire file into memory. The use of the block makes this automatically close your IO object as well. There are other ways as well, IO and File classes are pretty feature rich!
I refer to IO objects, as File is a subclass of IO. I tend to use IO when I don't really need the added methods from File class for the object.
This way you don't need to deal with EOF; Ruby will do it for you.
Sometimes the best handling is not to, when you really don't need to.
Of course, Ruby has a method for this.
Without testing this, it seems you should perform a rescue rather than checking.
http://www.ruby-doc.org/core-2.0/EOFError.html
file = File.open("aFile.csv")
begin
loop do
some_line = file.readline
# some stuff
end
rescue EOFError
# You've reached the end. Handle it.
end
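If rescuing feels heavy, IO#gets is a possible alternative, since it returns nil at end of file instead of raising. A minimal sketch; count_non_blank_lines is a made-up example of "some stuff":

```ruby
# Hypothetical helper: IO#gets returns nil at end of file, so the loop
# ends on its own and no EOFError needs rescuing.
def count_non_blank_lines(path)
  count = 0
  File.open(path) do |file|
    while (line = file.gets)
      count += 1 unless line.strip.empty?
    end
  end
  count
end
```

The block form of File.open also closes the file when the loop finishes.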
I'm trying to use CSV to calculate the average of three numbers and output it to a separate file. Particularly, open one file, take the first value (name), and then calculate the average of the next three values. Do this multiple times for each person in the file.
Here is my Book1.csv
Tom,90,80,70
Adam,80,85,83
Mike,100,93,89
Dave,100,100,100
Rob,80,70,75
Nick,80,90,70
Justin,100,90,90
Jen,80,90,100
I'm trying to get it to output this:
Tom,80
Adam,83
Mike,94
Dave,100
Rob,75
Nick,80
Justin,93
Jen,90
I have each person in an array and I could get this to work with the basic "pseudo" code I have written, but it does not work.
Here is my code so far:
#!/usr/bin/ruby
require 'csv'
names=[]
grades1=[]
grades2=[]
grades3=[]
average=[]
i = 0
CSV.foreach('Book1.csv') do |students|
names << students.values_at(0)
grades1 << reader.values_at(1)
grades2 << reader.values_at(2)
grades3 << reader.values_at(3)
end
while i<10 do
average[i]= grades1[i] + grades2[i] + grades3[i]
i= i + 1
end
CSV.open('Book2.csv', 'w') do |writer|
rows.each { |record| writer << record }
end
The while loop part is the part that I am most concerned with. Any insight?
If you have an array of values that you want to sum, you can use:
sum = array.inject(:+)
If you change your data structure to:
grades = [ [], [], [] ]
...
grades[0] << students[1].to_i
Then you can do:
names.each_index do |i|
  average[i] = (0..2).map{ |n| grades[n][i] }.inject(:+) / 3
end
There are a variety of ways to improve your data structures, the above being one of the least impactful to your code.
Any time you find yourself writing:
foo1 = ...
foo2 = ...
You should recognize it as a code smell and think about how you could organize your data into better collections.
Here's a rewrite of how I might do this. Notice that it works for any number of scores, not hardcoded to 3:
require 'csv'
averages = CSV.parse(DATA.read).map do |row|
name, *grades = *row
[ name, grades.map(&:to_i).inject(:+) / grades.length ]
end
puts averages.map(&:to_csv)
#=> Tom,80
#=> Adam,82
#=> Mike,94
#=> Dave,100
#=> Rob,75
#=> Nick,80
#=> Justin,93
#=> Jen,90
__END__
Tom,90,80,70
Adam,80,85,83
Mike,100,93,89
Dave,100,100,100
Rob,80,70,75
Nick,80,90,70
Justin,100,90,90
Jen,80,90,100
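Note that the rewrite above uses integer division, so Adam's 82.67 is truncated to 82, whereas the output asked for in the question shows 83. If rounding is what's wanted, a variation could look like this sketch (rounded_averages is a hypothetical name, and it takes the CSV text as an argument instead of reading DATA):

```ruby
require 'csv'

# Hypothetical helper: takes CSV text, returns [name, rounded average] pairs.
def rounded_averages(csv_text)
  CSV.parse(csv_text).map do |name, *grades|
    scores = grades.map(&:to_i)
    # Divide as floats, then round to the nearest integer.
    [name, (scores.sum.to_f / scores.size).round]
  end
end

puts rounded_averages("Tom,90,80,70\nAdam,80,85,83\n").map(&:to_csv)
```

The result can be written to Book2.csv with CSV.open in the same way as in the question.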
I am processing documents in ruby.
I have a document I am extracting specific strings from using regexp and then adding them to another file. When added to the destination file they must be made unique, so if that string already exists in the destination file I'm adding a simple suffix, e.g. <word>_1. Eventually I want to reference the strings by name, so random number generation or a string derived from the date is no good.
At present I am storing each word added in an array, and every time I add a word I check that the string doesn't already exist in the array. That is fine if there is only one duplicate, but there might be two or more, so I need to check for the initial string and then loop, incrementing the suffix until the result doesn't exist. (I have simplified my code, so there may be bugs.)
def add_word(word)
  if @added_words.include? word
    suffix = 1
    suffixed_word = word
    while @added_words.include? suffixed_word
      suffixed_word = word + "_" + suffix.to_s
      suffix += 1
    end
    word = suffixed_word
  end
  @added_words << word
end
It looks messy, is there a better algorithm or ruby way of doing this?
Make @added_words a Set (don't forget to require 'set'). This makes for faster lookup, as sets are implemented with hashes, while you still use include? to check for membership. It's also easy to extract the highest used suffix:
>> s << 'foo'
#=> #<Set: {"foo"}>
>> s << 'foo_1'
#=> #<Set: {"foo", "foo_1"}>
>> word = 'foo'
#=> "foo"
>> s.max_by { |w| w =~ /#{word}_?(\d+)?/ ; $1 || '' }
#=> "foo_1"
>> s << 'foo_12'
#=> #<Set: {"foo", "foo_1", "foo_12"}>
>> s.max_by { |w| w =~ /#{word}_?(\d+)?/ ; $1 || '' }
#=> "foo_12"
Now to get the next value you can insert, you could just do the following (imagine you already had 12 foos, so the next should be a foo_13):
>> s << s.max_by { |w| w =~ /#{word}_?(\d+)?/ ; $1 || '' }.next
#=> #<Set: {"foo", "foo_1", "foo_12", "foo_13"}>
Sorry if the examples are a bit confused, I had anesthesia earlier today. It should be enough to give you an idea of how sets could potentially help you though (most of it would work with array too, but sets have faster lookup).
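One thing to watch in the session above: max_by compares the captured digits as strings, so "foo_9" would rank above "foo_12". A possible fix is to compare the suffixes as integers. This is a sketch; highest_suffix is a hypothetical helper, not part of Set.

```ruby
require 'set'

# Hypothetical helper: find the highest numeric suffix already used for
# `word`, comparing suffixes as integers rather than as strings.
def highest_suffix(words, word)
  pattern = /\A#{Regexp.escape(word)}_(\d+)\z/
  # Non-matching entries yield nil, and nil.to_i is 0.
  words.map { |w| w[pattern, 1].to_i }.max || 0
end

s = Set.new(%w[foo foo_9 foo_12])
s << "foo_#{highest_suffix(s, 'foo') + 1}"
p s.include?('foo_13')
```

Anchoring with \A and \z also keeps words such as "foobar_3" from being counted as suffixes of "foo".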
Change @added_words to a Hash with a default of zero. Then you can do:
@added_words = Hash.new(0)
def add_word(word)
  @added_words[word] += 1
end
# put it to work:
list = %w(test foo bar test bar bar)
names = list.map do |w|
  "#{w}_#{add_word(w)}"
end
p @added_words
#=> {"test"=>2, "foo"=>1, "bar"=>3}
p names
#=>["test_1", "foo_1", "bar_1", "test_2", "bar_2", "bar_3"]
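The counter above suffixes every word, including the first occurrence. If the first occurrence should stay as-is and only duplicates get _1, _2 and so on, as the question describes, the counter could be combined with a Set. A sketch under that assumption; WordRegistry and its method names are made up:

```ruby
require 'set'

# Hypothetical class, not from the question: remembers every name handed out
# and the last suffix tried for each base word.
class WordRegistry
  def initialize
    @seen = Set.new
    @last_suffix = Hash.new(0)
  end

  # Returns the word itself the first time, then word_1, word_2, ...
  def add(word)
    candidate = word
    # Set#add? returns nil when the candidate is already present.
    until @seen.add?(candidate)
      @last_suffix[word] += 1
      candidate = "#{word}_#{@last_suffix[word]}"
    end
    candidate
  end
end

registry = WordRegistry.new
p %w[test foo test test].map { |w| registry.add(w) }
```

Because the last suffix is remembered per word, the loop does not have to restart from 1 for every duplicate, and the Set check still guards against a literal "test_1" that was added earlier.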
In that case, I'd probably use a set or hash:
# in your class:
require 'set'
require 'forwardable'
extend Forwardable # I'm just including this to keep your previous API
# wherever you set up your instance variable (it's probably [] at the moment):
def initialize
  @added_words = Set.new
end
# then, instead of `def add_word(word); @added_words.add(word); end`:
def_delegator :@added_words, :add, :add_word
# or just change whatever loop to use @added_words.add('word') rather than self.add_word('word')
# @added_words.add('word') does nothing if 'word' already exists in the set.
If you've got some attributes that you're grouping via these sections, then a hash might be better:
# wherever you set up your instance variable (it's probably [] at the moment):
def initialize
  @added_words = {}
end
def add_word(word, attrs = {})
  @added_words[word] ||= []
  @added_words[word].push(attrs)
end
Doing it the "wrong way", but in slightly nicer code:
def add_word(word)
  if @added_words.include? word
    suffixed_word = 1.upto(Float::INFINITY) do |suffix|
      candidate = [word, suffix].join("_")
      break candidate unless @added_words.include?(candidate)
    end
    word = suffixed_word
  end
  @added_words << word
end