How do I find the percent complete when parsing a file? - ruby

How can I print what percentage of a file I have already parsed? I am parsing a text file, so I use:
file.each_line do
Is there a method like each_with_index that is available to use with strings?
This is how I currently use each_with_index to find percentage complete:
amount = 10000000
file.each_with_index do |line, index|
  if index == amount
    break
  end
  print "%.1f%% done" % (index/(amount * 1.0) * 100)
  print "\r"
end

To get the number of lines, you can do a couple different things.
If you are on Linux or Mac OS, take advantage of the underlying OS and ask it how many lines are in the file:
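lines_in_file = `wc -l #{ path_to_file_to_read }`.to_i
wc is extremely fast, and can tell you about lines, words and characters. -l specifies lines. wc prints the count followed by the file name, so to_i is needed to turn that output into a number.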
If you want to do it in Ruby, you could use File.readlines('/path/to/file/to/read') or File.read('/path/to/file/to/read').lines, but be very careful. Both will read the entire file into memory, and, if that file is bigger than your available RAM you've just beaten your machine to a slow death. So, don't do that.
Instead use something like:
lines_in_file = 0
File.foreach('/path/to/file/to/read') { lines_in_file += 1 }
After running, lines_in_file will hold the number of lines in the file. File.foreach is VERY fast, pretty much equal to using File.readlines and probably faster than File.read().lines, and it only reads a line at a time so you're not filling your RAM.
If you want to know the current line number of the line you just read from a file, you can use Ruby's $. global variable.
You're concerned about "percentage of a file" though. A potential problem with this is lines are variable length. Depending on what you are doing with them, the line length could have a big effect on your progress meter. You might want to look at the actual length of the file and keep track of the number of characters consumed by reading each line, so your progress is based on percentage of characters, rather than percentage of lines.
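A minimal sketch of that byte-based approach, assuming the file is not modified while it is being read. The method name each_line_with_progress is made up for illustration; the total comes from File.size and the running count from the bytesize of each line:

```ruby
# Yields each line together with the percentage of the file's bytes
# consumed so far (hypothetical helper, not part of Ruby itself).
def each_line_with_progress(path)
  total = File.size(path)
  consumed = 0
  File.foreach(path) do |line|
    consumed += line.bytesize
    yield line, consumed * 100.0 / total
  end
end
```

Calling it with a block such as { |line, pct| print "%.1f%% done\r" % pct } would give the same kind of display as in the question, but long lines advance the meter further than short ones.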

Get all the lines upfront, then display the progress as you perform whatever operation you need on them.
lines = file.readlines
amount = lines.length
lines.each_with_index do |line, index|
if index == amount
break
end
print "%.1f%% done" % (index/(amount * 1.0) * 100)
print "\r"
end

Without having to load the file beforehand, you could employ the size and pos methods:
File.open('myfile') do |f|
  while (line = f.gets)
    puts "#{(f.pos*100)/f.size}%\t#{line}"
  end
end
Fewer lines, less logic, and accurate to a byte.

Rather than reading the whole file and loading it into memory (as with read or readlines), I suggest using File.foreach to read the file as a stream, line by line.
count = 0
File.foreach('your_file') { count += 1 }
idx = 0
File.foreach('your_file') do |line|
puts "#{(idx+1).to_f / count * 100}%"
idx += 1
end

Related

Ruby script which can replace a string in a binary file to a different, but same length string?

I would like to write a Ruby script (repl.rb) which can replace a string in a binary file (string is defined by a regex) to a different, but same length string.
It works like a filter, outputs to STDOUT, which can be redirected (ruby repl.rb data.bin > data2.bin), regex and replacement can be hardcoded. My approach is:
#!/usr/bin/ruby
fn = ARGV[0]
regex = /\-\-[0-9a-z]{32,32}\-\-/
replacement = "--0ca2765b4fd186d6fc7c0ce385f0e9d9--"
blk_size = 1024
File.open(fn, "rb") {|f|
while not f.eof?
data = f.read(blk_size)
data.gsub!(regex, replacement)
print data
end
}
My problem is that when the string is positioned in the file such that it straddles a block boundary, it is never matched. For example, with blk_size=1024, if the first occurrence of the string begins at byte position 1000, it is split across two reads, so I will not find it in the "data" variable, nor in the next read cycle. Should I process the whole file twice with different block sizes to avoid this worst-case scenario, or is there another approach?
I would posit that a tool like sed might be a better choice for this. That said, here's an idea: Read block 1 and block 2 and join them into a single string, then perform the replacement on the combined string. Split them apart again and print block 1. Then read block 3 and join block 2 and 3 and perform the replacement as above. Split them again and print block 2. Repeat until the end of the file. I haven't tested it, but it ought to look something like this:
File.open(fn, "rb") do |f|
  # Prime the window with the first block so each pass sees two blocks joined.
  last_block = f.read(blk_size) || ""
  until f.eof?
    data = (last_block + f.read(blk_size)).gsub(regex, replacement)
    # The replacement has the same length, so the first blk_size bytes are final.
    print data.slice!(0, blk_size)
    last_block = data
  end
  print last_block.gsub(regex, replacement)
end
There's probably a nontrivial performance penalty for doing it this way, but it could be acceptable depending on your use case.
Maybe a cheeky
f.pos = f.pos - replacement.size
at the end of the while loop, just before reading the next chunk.

how do i make my code read random lines 37 different times?

def pick_random_line
chosen_line = nil
File.foreach("id'sForCascade.txt").each_with_index do |line, id|
chosen_line = line if rand < 1.0/(id+1)
end
return chosen_line
end
Hey, I'm trying to make that code pick 37 different lines. How would I do that? I'm stuck and confused.
Assuming you don't want the same line to repeat more than once, I would do it in one line like this:
File.read("test.txt").split("\n").shuffle.first(37)
File.read("test.txt") reads the entire file.
split("\n") splits the file into lines based on the \n delimiter (I assume your file is textual and has lines separated by a newline character).
shuffle is a very convenient method of Array that shuffles the lines randomly. You can read about it here:
http://docs.ruby-lang.org/en/2.0.0/Array.html#method-i-shuffle
Finally, first(37) gives you the first 37 lines out of the shuffled array. These are guaranteed to be random from the shuffle operation.
You can do something like this:
input_lines = File.foreach("test.txt").map(&:to_s)
output_lines = []
37.times do
output_lines << input_lines.delete_at(rand(input_lines.length))
end
puts output_lines
This will ensure that you aren't grabbing duplicate lines and you don't need to do any fancy checking.
However, if your file has fewer than 37 lines this may cause a problem. It also assumes that your file exists.
EDIT:
What is happening is the rand call is now changing the range on which it is called based on the size of the input lines. And since you are deleting at an index when you take the line out, the length shrinks and you do not risk duplicating lines.
If you want to save relatively few lines from a large file, reading the entire file into an array (and then randomly selecting lines) could be costly. It might be better to count the number of lines in the file, randomly select line offsets and then save the lines at those offsets to an array. This approach is no more difficult to implement than the former one, but makes the method more robust, even if the files in the current application are not overly large.1
Suppose your filename were given by FName. Here are three ways to count the numbers of lines in the file:
Count lines, literally
cnt = File.foreach(FName).reduce(0) { |c,_| c+1 }
Use $.
File.foreach(FName) {}
cnt = $.
On Unix-family computers, shell-out to the operating system
cnt = %x{wc -l #{FName}}.split.first.to_i
The third option is very fast.
Random offsets (base 1) for n lines to be saved could be computed as follows:
lines = (1..cnt).to_a.sample(n).sort
Saving the lines at those offsets to an array is straightforward; for example:
File.foreach(FName).with_object([]) do |line,a|
if lines.first == $.
a << line
lines.shift
break a if lines.empty?
end
end
Note that $. #=> 1 after the first line is read, and $. is incremented by 1 after each successive line is read. (Hence base 1 for line offsets.)
1 Moreover, many programmers, not just Rubyists, are repelled by the idea of amassing large numbers of anything and then discarding all but a few.
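The one-line picker in the question is reservoir sampling with a reservoir of size one, so another option is to widen the reservoir to k slots. This is a sketch of that swapped-in technique, not any of the answers above; the name pick_random_lines and the random: keyword are assumptions made for illustration. It reads the file once and never holds more than k lines in memory:

```ruby
# Returns up to k lines from distinct positions in the file, chosen
# uniformly at random in a single pass (hypothetical helper).
def pick_random_lines(path, k, random: Random.new)
  reservoir = []
  File.foreach(path).with_index do |line, index|
    if index < k
      # The first k lines fill the reservoir.
      reservoir << line
    else
      # Keep this line with probability k/(index + 1) by drawing a slot
      # from 0..index and replacing only if the slot is in the reservoir.
      slot = random.rand(index + 1)
      reservoir[slot] = line if slot < k
    end
  end
  reservoir
end
```

pick_random_lines("test.txt", 37) would return 37 lines, or every line if the file is shorter. The lines come from different positions, so they only look identical if the file itself contains repeated text.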

How do I get Ruby to search for a pattern on the tail of a local file?

Say I have a file blah.rb which is constantly written to somehow and has patterns like :
bagtagrag" " hellobello " blah0 blah1 " trag kljesgjpgeagiafw blah2 " gneo" whatttjtjtbvnblah3
Basically, it's garbage. But I want to check for the blah that keeps on coming up and find the latest value i.e. number in front of the blah.
Hence, something like :
grep "blah"{$1} | tail var/test/log
My file is at location var/test/log and as you can see, I need to get the number in front of the blah.
def get_last_blah(filename)
# Code to get the number after the last blah in the file
end
def display_the_last_blah()
puts get_last_blah("var/test/log")
end
Now, I could just keep on reading the file and performing something akin to a string pattern search on the entire file again and again. Having obtained the last match, I can then get the number. But what if I only want to look at the newly added text rather than the entire file?
Moreover, is there a quick one-liner or smart command to get this?
Use File.open to read the file and Enumerable#grep to search for the desired text with a regular expression, as in the following code:
def get_last_blah(filename)
  File.open(filename) { |f| f.grep(/.*blah(\d+)/) { $1 }.last.to_i }
end
puts get_last_blah('var/test/log')
# => 3
The method returns the number that follows the last "blah" in the file. It reads the entire file, but the result is the same as if it were done with tail.
If you want to use a proper tail, take a look at the File::Tail gem.
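For looking only at the text added since the previous check, one possibility is to remember the byte offset where the last read stopped and seek straight to it next time. The following is a rough sketch under that assumption; last_blah_since and its return shape are invented for illustration and are not part of File::Tail:

```ruby
# Scans only the bytes appended after `offset` and returns
# [latest_number_or_nil, new_offset] (hypothetical helper).
def last_blah_since(path, offset, pattern = /blah(\d+)/)
  File.open(path, 'rb') do |f|
    # If the file shrank (truncated or rotated), start again from the top.
    offset = 0 if f.size < offset
    f.seek(offset)
    chunk = f.read || ''
    number = chunk.scan(pattern).last&.first
    [number && number.to_i, f.pos]
  end
end
```

The caller keeps the returned offset and passes it back in on the next poll. A "blah" that is only half written when the poll happens would be missed by this sketch, so a real version would probably need to re-read a few bytes of overlap.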
I presume you wish to avoid reading the entire file each time; rather, you want to start at the end and work backward until you find the last string of interest. Here's a way to do that.
Code
BLOCK_SIZE = 30
MAX_BLAH_NBR = 123
def doit(fname, blah_text)
  @f = File.new(fname)
  @blah_text = blah_text
  @chars_to_read = BLOCK_SIZE + @blah_text.size + MAX_BLAH_NBR.to_s.size
  ptr = @f.size
  block_size = BLOCK_SIZE
  loop do
    (@f.close; return nil) if ptr.zero?
    ptr -= block_size
    if ptr < 0
      block_size += ptr
      ptr = 0
    end
    blah_nbr = read_block(ptr)
    (@f.close; return blah_nbr.to_i) if blah_nbr
  end
end
def read_block(ptr)
  @f.seek(ptr)
  # /m lets .* span newlines, so the last match in the block wins
  @f.read(@chars_to_read)[/.*#{@blah_text}(\d+)/m,1]
end
Demo
Let's first write something interesting to a file.
MY_FILE = 'my_file.txt'
text =<<_
Now is the time
for all blah2 to
come to the aid of
their blah3, blah4 enemy or
perhaps do blagh5 something
else like wash the dishes.
_
File.write(MY_FILE, text)
Now run the program:
p doit(MY_FILE, "blah") #=> 4
We expected it to return 4 and it did.
Explanation
doit first instructs read_block to read up to 37 characters, beginning BLOCK_SIZE (30) characters from the end of the file. That's at the beginning of the string
"ng\nelse like wash the dishes.\n"
which is 30 characters long. (I'll explain the "37" in a moment.) read_block finds no text matching the regex (like "blah3"), so returns nil.
As nil was returned, doit makes the same request of read_block, but this time starting BLOCK_SIZE characters closer to the beginning of the file. This time read_block reads the 37 character string:
"y or\nperhaps do blagh5 something\nelse"
but, again, does not match the regex, so returns nil to doit. Notice that it read the seven characters, "ng\nelse", that it read previously. This overlap is necessary in case one 30-character block ended, "...bla" and the next one began "h3...". Hence the need to read more characters (here 37) than the block size.
read_block next reads the string:
"aid of\ntheir blah3, blah4 enemy or\npe"
and finds that "blah4" matches the regex (not "blah3", because the regex is being "greedy" with .*), so it returns "4" to doit, which converts that to the number 4, which it returns.
doit would return nil if the regex did not match any text in the file.

Extract a single line string having "foo: XXXX"

I have a file with one or more key:value lines, and I want to pull a key:value out if key=foo. How can I do this?
I can get as far as this:
if File.exist?('/file_name')
content = open('/file_name').grep(/foo:??/)
I am unsure about the grep portion, and also once I get the content, how do I extract the value?
People like to slurp the files into memory, which, if the file will always be small, is a reasonable solution. However, slurping isn't scalable, and the practice can lead to excessive CPU and I/O waits as content is read.
Instead, because you could have multiple hits in a file, and you're comparing the content line-by-line, read it line-by-line. Line I/O is very fast and avoids the scalability problems. Ruby's File.foreach is the way to go:
File.foreach('path/to/file') do |li|
puts $1 if li[/foo:\s*(\w+)/]
end
Because there are no samples of actual key/value pairs, we're shooting in the dark for valid regex patterns, but this is the basis for how I'd solve the problem.
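As a slightly fuller sketch of the same line-by-line idea, the lookup can be wrapped in a method that stops at the first hit. The name value_for and the assumed "key: value" layout (optional spaces around the colon, one pair per line) are guesses, since the question shows no sample data:

```ruby
# Returns the value from the first "key: value" line for the given key,
# or nil when no line matches (hypothetical helper).
def value_for(path, key)
  pattern = /\A\s*#{Regexp.escape(key)}\s*:\s*(.*?)\s*\z/
  File.foreach(path) do |line|
    match = pattern.match(line)
    # Returning from inside the block stops reading the rest of the file.
    return match[1] if match
  end
  nil
end
```

value_for('/file_name', 'foo') would then give the text after "foo:" with surrounding whitespace stripped. Anchoring the pattern at the start of the line keeps a key such as "food" from matching.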
Try this:
IO.readlines('key_values.txt').find_all{|line| line.match('key1')}
I would recommend reading the file into an array and selecting only the lines you need:
regex = /\A\s?key\s?:/
results = File.readlines('file').inject([]) do |f,l|
l =~ regex ? f << "key = %s" % l.sub(regex, '') : f
end
This will detect lines starting with key: and add them to results as key = value,
where value is the portion that follows key:
so if you have a file like this:
key:1
foo
key:2
bar
key:3
you'll get results like this:
key = 1
key = 2
key = 3
makes sense?
value = File.open('/file_name').read.match("key:(.*)").captures[0] rescue nil
File.read('file_name')[/foo: (.*)/, 1]
#=> XXXX

Using Ruby to find the first previous occurrence of a string

I'm creating some basic work assistance utilities using Ruby. I've hit a problem that I don't really need to solve, but curiosity has the best of me.
What I would like to be able to do is search the contents of a file, starting from a particular line and find the first PREVIOUS occurrence of a string.
For example, if I have the following text saved in a file, I would like to be able to search for "CREATE PROCEDURE" starting at line 4 and have this return/output "CREATE PROCEDURE sp_MERGE_TABLE"
CREATE PROCEDURE sp_MERGE_TABLE
AS
SOME HORRIBLE STATEMENT
HERE
CREATE PROCEDURE sp_SOMETHING_ELSE
AS
A DIFFERENT STATEMENT
HERE
Searching for content isn't a challenge, but specifying a starting line - no idea. And then searching backwards... well...
Any help at all appreciated!
TIA!
I think you have to read the file line by line; then the following will work:
flag = true
File.foreach('path/to/file') do |line|
  if flag && line.include?("CREATE PROCEDURE")
    puts line
    flag = false
  end
end
If performance isn't a big issue, you could just use a simple loop:
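# Scan forward to start_line, remembering the most recent match.
def last_match_before(path, search_string, start_line)
  last_seen = nil
  File.foreach(path).with_index(1) do |line, line_no|
    break if line_no > start_line
    last_seen = line_no if line.include?(search_string) # or file offset
  end
  last_seen
end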
I'm afraid you will have to work line by line through the file, unless you have some index over it, pointing to the beginnings of the lines. That would make the loop a little bit simpler but working through the file in backwards manner is harder (unless you keep the whole file in memory).
Edit:
I just had a much better idea, but I'm going to include the old solution anyway.
The benefit of searching backwards is that you only have to read the first chunk of the file, up to the specified line number. As you get closer and closer to start_line, if you find a match you just forget the old one. You still read in some redundant data at the beginning, but at least it's O(n).
path = "path/to/file"
start_line = 20
search_string = "findme!"
#assuming file is at least start_line lines long
match_index = nil
f = File.new(path)
start_line.times do |i|
line = f.readline
match_index = i if line.include? search_string
end
puts "Matched #{search_string} on line #{match_index}"
Of course, bear in mind that the size of this file plays an important role in answering your question.
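If you wanted to get really serious, you could look into the IO class. Note that IO#lineno= only sets the line counter and does not move the read position, so it cannot be used to jump to a line. Instead, read the first start_line lines and search that array from the end. Untested, just a thought.
lines = File.foreach(path).first(start_line)
match_index = lines.rindex { |line| line.include?(search_string) }
puts lines[match_index] if match_index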
Original:
For an exhaustive solution, you could try something like the following. The downside is you'd need to read the whole file into memory, but it takes into account continuing from the bottom-up if it gets to the top without a match. Untested.
path = "path/to/file"
start_line = 20
search_string = "findme!"
#get lines of the file into an array (chomp optional)
lines = File.readlines(path).map(&:chomp)
#"cut" the deck, as with playing cards, so start_line is first in the array
lines = lines.slice!(start_line..lines.length) + lines
#searching backwards can just be searching a reversed array forwards
lines.reverse!
#search through the reversed array for the first occurrence
reverse_occurrence = nil
lines.each_with_index do |line,index|
  if line.include?(search_string)
    reverse_occurrence = index
    break
  end
end
#reverse_occurrence is now either "nil" for no match, or a reversed index
#also un-cut the array when calculating the index (modulo wraps past the end)
if reverse_occurrence
  occurrence = (lines.size - reverse_occurrence - 1 + start_line) % lines.size
  line = lines[reverse_occurrence]
  puts "Matched #{search_string} on line #{occurrence}"
  puts line
end
1) Read the entire file into a string.
2) Reverse the file-data string.
3) Reverse the search string.
4) Search forward. Remember to match end-of-line instead of beginning-of-line, and to start from position end-minus-N rather than from N.
Not very fast or efficient, but it's elegant. Or at least clever.
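The four steps above could be sketched roughly as follows. The method name previous_line_with is made up, start_line is taken to be 1-based, and the text is assumed to be already in memory (for example from File.read):

```ruby
# Finds the last line containing `search` at or before start_line by
# searching forward through the reversed text (hypothetical helper).
def previous_line_with(text, search, start_line)
  # Character offset just past the end of start_line.
  limit = text.lines.first(start_line).sum(&:length)
  reversed = text.reverse
  # Offset N from the start corresponds to (length - N) in the reversed text.
  hit = reversed.index(search.reverse, text.length - limit)
  return nil unless hit
  # Map the reversed offset back to an offset in the original text.
  start = text.length - hit - search.length
  line_start = (text.rindex("\n", start) || -1) + 1
  line_end = text.index("\n", start) || text.length
  text[line_start...line_end]
end
```

With the sample from the question, searching for "CREATE PROCEDURE" from line 4 would return the sp_MERGE_TABLE line. String#rindex with a start position would reach the same result without reversing anything, which is probably the more practical choice.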

Resources