find the target string from a large file - ruby

I want to write a class that can find a target string in a text file and output the line number and the position within the line.
class ReadFile
def find_string(filename, string)
line_num = 0
IO.readlines(filename).each do |line|
line_num += 1
if line.include?(string)
puts line_num
puts line.index(string)
end
end
end
end
a= ReadFile.new
a.find_string('test.txt', "abc")
If the text file is very large (1 GB, 10 GB, ...), the performance of this method is very poor.
Is there a better solution?

Use foreach to efficiently read a single line from the file at a time and with_index to track the line number (0-based):
IO.foreach(filename).with_index do |line, index|
if found = line.index(string)
puts "#{index+1}, #{found+1}"
break # skip this if you want to find more than 1 result
end
end
readlines is giving you performance problems because it reads the entire file into an array of lines before the loop even starts, so memory use grows with the size of the file.
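If every occurrence is wanted rather than only the first, the same streaming idea can be extended. The following is a minimal sketch, not part of the original answer: the method name find_all and its return format are assumptions made for illustration.

```ruby
# Hypothetical helper (not from the answer above): collects every match
# instead of stopping at the first one.
# Returns an array of [line_number, column] pairs, both 1-based.
def find_all(filename, string)
  return [] if string.empty? # an empty needle would never advance the offset

  matches = []
  # foreach reads one line at a time; with_index(1) starts counting at 1.
  File.foreach(filename).with_index(1) do |line, line_number|
    offset = 0
    # String#index with a start offset finds repeated hits on the same line.
    while (found = line.index(string, offset))
      matches << [line_number, found + 1]
      offset = found + string.length
    end
  end
  matches
end
```

Calling find_all('test.txt', "abc") would return an empty array when nothing matches, so the caller can decide how to report that.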

This is a variant of @PinnyM's answer. It uses find, which I think is more descriptive than looping and breaking, but does the same thing. It does carry a small penalty: once the line is found, the offset where the string begins has to be determined with a second search of that line.
line, index = IO.foreach(filename).with_index.find { |line,index|
line.include?(string) }
if line
puts "'#{string}' found in line #{index + 1}, " +
"beginning in column #{line.index(string)+1}"
else
puts "'#{string}' not found"
end

Related

No output produced

Can anyone tell me why this program is not producing any output? The output it should be producing is:
Line read: 0
Line read: 1
Line read: 2
Line read: 3
and so on.
So far, I am not getting any output even though I have fixed a number of bugs. Any help or suggestions would be much appreciated.
# takes a number and writes that number to a file then on each line
# increments from zero to the number passed
def write(aFile, number)
# You might need to fix this next line:
aFile.puts(number)
index = 0
while (index < number)
aFile.puts(index.to_s)
index += 1
end
end
# Read the data from the file and print out each line
def read(aFile)
# Defensive programming:
count = aFile.gets
if (is_numeric?(count))
count = count.to_i
index = 0
while (index < count)
line = aFile.gets
puts "line read: " + line
index+=1
end
end
end
# Write data to a file then read it in and print it out
def main
aFile = File.new("mydata.txt", "w") # open for writing
write(aFile, 10)
aFile.close
aFile = File.new("mydata.txt", "r")
read(aFile)
aFile.close
end
# returns true if a string contains only digits
def is_numeric?(obj)
if /[^0-9]/.match(obj) == nil
true
end
false
end
main
Your code isn't written in the Ruby way.
This is how I'd write it if I wanted to closely mimic your code's logic:
# takes a number and writes that number to a file then on each line
# increments from zero to the number passed
def write_data(fname, counter)
File.open(fname, 'w') do |fo|
fo.puts(counter)
counter.times do |n|
fo.puts n
end
end
end
# returns true if a string contains only digits
def is_numeric?(obj)
obj[/^\d+$/]
end
# Read the data from the file and print out each line
def read_data(fname)
File.open(fname) do |fi|
counter = fi.gets.chomp
if is_numeric?(counter)
counter.to_i.times do |n|
line_in = fi.gets
puts 'Line read: %s' % line_in
end
end
end
end
# Write data to a file then read it in and print it out
DATA_FILE = 'mydata.txt'
write_data(DATA_FILE, 10)
read_data(DATA_FILE)
Which outputs:
Line read: 0
Line read: 1
Line read: 2
Line read: 3
Line read: 4
Line read: 5
Line read: 6
Line read: 7
Line read: 8
Line read: 9
Notice these things:
Method (or variable) names are not in camelCase in Ruby, they're snake_case. ItsAReadabilityThing.
Ruby encourages us to use a block when opening files for reading or writing, to automatically close the file when we're finished with it. Leaving dangling file handles open, in a loop, in a long-running program, is a great way for your program to crash in a way that's hard to figure out. SO has many questions that resulted from doing that. This is from the IO.open documentation:
With no associated block, ::open is a synonym for ::new. If the optional code block is given, it will be passed io as an argument, and the IO object will automatically be closed when the block terminates. In this instance, ::open returns the value of the block.
Usually you'll see code use File.open instead of IO.open, mostly out of habit in Ruby coders. File inherits from IO and adds some additional file-oriented methods to the class, so it's a little more full-featured.
Ruby has many methods that help us avoid using while loops. Getting the counters wrong, or missing a condition that should terminate the loop, is all too common in programming, so Ruby makes it easy to loop "n times" or to iterate over all the elements in an array. The times method accomplishes that nicely.
String's [] method is really powerful and makes it easy to look at the contents of a string and apply a pattern or a slice. /^\d+$/ matches a line that consists only of digits, so for a single-line string some_string[/^\d+$/] is a shorter version of what you're doing and accomplishes the same thing: it returns a "truthy" value when all characters are digits and nil otherwise.
We don't use a main method. That's old-school Pascal, C or Java and is artificially structured. Ruby's a little more friendly than that.
Instead of using
3.times do |n|
puts n
end
# >> 0
# >> 1
# >> 2
I'd probably use
puts (0..(3 - 1)).to_a * "\n"
# >> 0
# >> 1
# >> 2
just because I tend to think in Perl terms. It's another old habit.
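One caveat on the String#[] note above: in Ruby, ^ and $ anchor at line boundaries, not at the start and end of the string, so a multi-line string can slip through. The sketch below shows the difference; digits_only? is a hypothetical name introduced here, and using \A and \z is a suggested alternative rather than what the answer uses.

```ruby
# Hypothetical helper: true only when the whole string is digits.
# \A and \z anchor at the very start and very end of the string.
def digits_only?(str)
  str.match?(/\A\d+\z/)
end

p "abc\n123"[/^\d+$/]        # => "123"  (one line matched, so it is truthy)
p digits_only?("abc\n123")   # => false
p digits_only?("123")        # => true
p digits_only?("10\n")       # => false (chomp first, as read_data does)
```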
I found 2 errors. Fixing them gives you the desired output.
Error #1.
Your method is_numeric? always returns false. Even if your condition is true. The last line of the method is false and therefore the whole method ALWAYS returns false.
You can fix it in 2 steps.
Step #1:
if /[^0-9]/.match(obj) == nil
true
else
false
end
It's not good practice to return booleans from within a conditional. You can simplify it this way:
def is_numeric?(obj)
/[^0-9]/.match(obj) == nil
end
or even better
def is_numeric?(obj)
/[^0-9]/.match(obj).nil?
end
Error #2 is inside your read method. If you try to output the value of count after you read it from the file it gives you "10\n". That \n at the end messes you up.
To get rid of \n when you read from the file you could possibly use chomp. So then your reading line would be:
count = aFile.gets.chomp
and the rest works like magic
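Putting the two fixes together might look like the sketch below. The method names numeric? and read_counted_lines are hypothetical, and the reader returns the lines instead of printing them so the behaviour is easy to check; it is an illustration of the two fixes, not the original program.

```ruby
# Returns true only if the string consists entirely of digits.
# \A and \z anchor the whole string, so a trailing "\n" makes it fail.
def numeric?(str)
  str.match?(/\A[0-9]+\z/)
end

# Reads the count from the first line, then that many further lines.
# Returns the lines read, without their trailing newlines.
def read_counted_lines(io)
  first = io.gets
  return [] if first.nil?            # empty input

  count = first.chomp                # fix #2: strip the "\n" before testing
  return [] unless numeric?(count)   # fix #1: the check can now succeed

  Array.new(count.to_i) { io.gets.to_s.chomp }
end
```

With a file opened for reading, read_counted_lines(a_file).each { |line| puts "Line read: " + line } would then print the expected lines.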

How do I detect end of file in Ruby?

I wrote the following script to read a CSV file:
f = File.open("aFile.csv")
text = f.read
text.each_line do |line|
if (f.eof?)
puts "End of file reached"
else
line_num +=1
if(line_num < 6) then
puts "____SKIPPED LINE____"
next
end
end
arr = line.split(",")
puts "line number = #{line_num}"
end
This code runs fine if I take out the line:
if (f.eof?)
puts "End of file reached"
With this line in I get an exception.
I was wondering how I can detect the end of file in the code above.
Try this short example:
f = File.open(__FILE__)
text = f.read
p f.eof? # -> true
p text.class #-> String
With f.read you read the whole file into text and reach EOF.
(Remark: __FILE__ is the script file itself. You may use your CSV file instead.)
In your code you use text.each_line. This executes each_line for the string text. It has no effect on f.
You could use File#each_line without using a variable text. The test for EOF is not necessary. each_line loops on each line and detects EOF on its own.
f = File.open(__FILE__)
line_num = 0
f.each_line do |line|
line_num +=1
if (line_num < 6)
puts "____SKIPPED LINE____"
next
end
arr = line.split(",")
puts "line number = #{line_num}"
end
f.close
You should close the file after reading it. Using a block for this is more Ruby-like:
line_num = 0
File.open(__FILE__) do |f|
f.each_line do |line|
line_num +=1
if (line_num < 6)
puts "____SKIPPED LINE____"
next
end
arr = line.split(",")
puts "line number = #{line_num}"
end
end
One general remark: There is a CSV library in Ruby. Normally it is better to use that.
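As a rough illustration of that remark, the sketch below uses the csv library that ships with Ruby to skip the first five rows and collect the rest. The method name rows_after_header and the skip keyword are assumptions made for this example.

```ruby
require "csv"

# Hypothetical helper: parses a CSV file row by row, skips the first
# `skip` rows, and returns the remaining rows as arrays of fields.
# Note: CSV counts rows, so a quoted field containing a newline is
# still a single row even though it spans two physical lines.
def rows_after_header(path, skip: 5)
  rows = []
  CSV.foreach(path).with_index(1) do |row, row_num|
    next if row_num <= skip
    rows << row
  end
  rows
end
```

Unlike line.split(","), the parser also handles quoted fields that contain commas.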
https://www.ruby-forum.com/topic/218093#946117 talks about this.
content = File.read("file.txt")
content = File.readlines("file.txt")
The above 'slurps' the entire file into memory.
File.foreach("file.txt") {|line| content << line}
You can also use IO#each_line. These last two options do not read the entire file into memory. The use of the block makes this automatically close your IO object as well. There are other ways as well, IO and File classes are pretty feature rich!
I refer to IO objects, as File is a subclass of IO. I tend to use IO when I don't really need the added methods from File class for the object.
In this way you don't need to deal with EOF, Ruby will for you.
Sometimes the best handling is not to, when you really don't need to.
Of course, Ruby has a method for this.
Without testing this, it seems you should perform a rescue rather than checking.
http://www.ruby-doc.org/core-2.0/EOFError.html
file = File.open("aFile.csv")
begin
loop do
some_line = file.readline
# some stuff
end
rescue EOFError
# You've reached the end. Handle it.
end

Moving to the last line of a file while reading it in a .each loop in Ruby

I'm reading in a file that can contain any number of rows.
I only need to save the first 1000 or so, passed in as a variable "recordsToParse".
If I reach my 1000 line limit, or whatever it's set to, I need to save the trailer information in the file to verify total_records, total_amount etc.
So, I need a way to move my "pointer" from where ever I am in the file to the last line and run through one more time.
file = File.open(file_name)
parsed_file_rows = Array.new
successful_records, failed_records = 0, 0
file_contract = file_contract['File_Contract']
output_file_name = file_name.gsub(/.TXT|.txt|.dat|.DAT/,'')
file.each do |line|
line.chomp!
line_contract = determine_row_type(file_contract, line)
if line_contract
parsed_row = parse_row_by_contract(line_contract, line)
parsed_file_rows << parsed_row
successful_records += 1
else
failed_records += 1
end
if (not recordsToParse.nil?)
if successful_records > recordsToParse
# move "pointer" to last line and go through loop once more
#break;
end
end
end
store_parsed_file('Parsed_File',"#{output_file_name}_parsed", parsed_file_rows)
[successful_records, failed_records]
Use IO#seek with IO::SEEK_END to move your pointer to the end of the file, then scan backwards to the last newline; everything after it is your last line.
This would only be worthwhile if the file is very big, otherwise just follow the file.each do |line| to the last line or you could read the last line like this IO.readlines("file.txt")[-1].
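The seek-from-the-end approach could be sketched as follows. This is a minimal illustration under stated assumptions: the name last_line and the chunk size are made up for the example, line endings are assumed to be "\n", and the content is assumed to be UTF-8.

```ruby
# Hypothetical helper: returns the last line of a file without reading
# the whole file, by scanning backwards from the end in fixed-size chunks.
# Returns nil for an empty file.
def last_line(path, chunk_size = 4096)
  File.open(path, "rb") do |file|
    size = file.size
    return nil if size.zero?

    buffer = "".b
    offset = 0 # how many bytes, counted from the end, have been read so far
    while offset < size
      step = [chunk_size, size - offset].min
      offset += step
      file.seek(-offset, IO::SEEK_END)
      buffer = file.read(step) + buffer
      # Ignore the newline that terminates the file; once another "\n"
      # appears, the last line lies entirely inside the buffer.
      break if buffer.chomp.include?("\n")
    end

    text = buffer.chomp
    newline_at = text.rindex("\n")
    line = newline_at ? text[(newline_at + 1)..-1] : text
    line.force_encoding("UTF-8") # assumption: the file is UTF-8 text
  end
end
```

Only the tail of the file is read, so the cost depends on the length of the last line rather than on the size of the file.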
The easiest solution is to use a gem like elif
require "elif"
lastline = Elif.open("bigfile.txt") { |f| f.gets }
It reads your last line in a snap, presumably by seeking from the end of the file.
This is one of those times I'd take advantage of the OS's head and tail commands using something like:
head = `head -#{ records_to_parse } #{ file_to_read }`.split("\n")
tail = `tail -1 #{ file_to_read }`
head.pop if (head[-1] == tail.chomp)
Then write it all out using something like:
File.open(new_file_to_write, 'w') do |fo|
fo.puts head, tail
end

Ruby: What's an elegant way to pick a random line from a text file?

I've seen some really beautiful examples of Ruby and I'm trying to shift my thinking to be able to produce them instead of just admire them. Here's the best I could come up with for picking a random line out of a file:
def pick_random_line
random_line = nil
File.open("data.txt") do |file|
file_lines = file.readlines()
random_line = file_lines[Random.rand(0...file_lines.size())]
end
random_line
end
I feel like it's gotta be possible to do this in a shorter, more elegant way without storing the entire file's contents in memory. Is there?
There is already a random entry selector built into the Ruby Array class: sample().
def pick_random_line
File.readlines("data.txt").sample
end
You can do it without storing anything except the most recently-read line and the current candidate for the returned random line.
def pick_random_line
chosen_line = nil
File.foreach("data.txt").each_with_index do |line, number|
chosen_line = line if rand < 1.0/(number+1)
end
return chosen_line
end
So the first line is chosen with probability 1/1 = 1; the second line is chosen with probability 1/2, so half the time it keeps the first one and half the time it switches to the second.
Then the third line is chosen with probability 1/3 - so 1/3 of the time it picks it, and the other 2/3 of the time it keeps whichever one of the first two it picked. Since each of them had a 50% chance of being chosen as of line 2, they each wind up with a 1/3 chance of being chosen as of line 3.
And so on. At line N, every line from 1-N has an even 1/N chance of being chosen, and that holds all the way through the file (as long as the file isn't so huge that 1/(number of lines in file) is less than epsilon :)). And you only make one pass through the file and never store more than two lines at once.
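The claim that every line ends up equally likely can be checked empirically. The sketch below repeats the same single-pass selection many times over three lines and counts the results; pick_random, the explicit random source, and the seed value are assumptions added for the check.

```ruby
# Same single-pass selection as above, but taking any enumerable of lines
# and an explicit random source so the check is repeatable.
def pick_random(lines, rng = Random.new)
  chosen = nil
  lines.each_with_index do |line, number|
    # Line number+1 replaces the current choice with probability 1/(number+1).
    chosen = line if rng.rand < 1.0 / (number + 1)
  end
  chosen
end

# Empirical check: over many trials each line should be chosen
# roughly a third of the time.
rng = Random.new(42) # fixed seed, an arbitrary choice for repeatability
counts = Hash.new(0)
3000.times { counts[pick_random(%w[a b c], rng)] += 1 }
p counts
```

Each count should land near 1000, since a line that is picked at step i survives each later step k with probability 1 - 1/k, and the product of those factors times 1/i comes out to 1/N.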
EDIT You're not going to get a real concise solution with this algorithm, but you can turn it into a one-liner if you want to:
def pick_random_line
File.foreach("data.txt").each_with_index.reduce(nil) { |picked,pair|
rand < 1.0/(1+pair[1]) ? pair[0] : picked }
end
This function does exactly what you need.
It's not a one-liner. But it works with textfiles of any size (except zero size, maybe :).
def random_line(filename)
blocksize, line = 1024, ""
File.open(filename) do |file|
initial_position = rand(File.size(filename)-1)+1 # random pointer position. Not a line number!
pos = Array.new(2).fill( initial_position ) # array [prev_position, current_position]
# Find beginning of current line
begin
pos.push([pos[1]-blocksize, 0].max).shift # calc new position
file.pos = pos[1] # move pointer backward within file
offset = (n = file.read(pos[0] - pos[1]).rindex(/\n/) ) ? n+1 : nil
end until pos[1] == 0 || offset
file.pos = pos[1] + offset.to_i
# Collect line text till the end
begin
data = file.read(blocksize)
line.concat((p = data.index(/\n/)) ? data[0,p.to_i] : data)
end until file.eof? or p
end
line
end
Try it:
filename = "huge_text_file.txt"
100.times { puts random_line(filename).force_encoding("UTF-8") }
Negligible (imho) drawbacks:
the longer the line, the higher the chance it'll be picked.
doesn't take into account the "\r" line separator ( windows-specific ). Use files with Unix-style line endings!
This is not much better than what you came up with, but at least it's shorter:
def pick_random_line
lines = File.readlines("data.txt")
lines[rand(lines.length)]
end
One thing you can do to make your code more Rubyish is omitting parentheses. Use readlines and size instead of readlines() and size().
A one liner:
def pick_random_line(file)
# Relies on the shell providing $RANDOM (bash does; a strict POSIX sh may not).
`head -$((${RANDOM} % $(wc -l < #{file}) + 1)) #{file} | tail -1`
end
If you protest that it's not Ruby, go find a talk in this year's Euruko titled Ruby is unlike a Banana.
PS: Ignore SO's incorrect syntax highlighting.
Here is a shorter version of Mark's excellent answer, though not as short as Dave's:
def pick_random_line(number = 0, chosen_line = nil)
File.foreach("data.txt") { |line| chosen_line = line if rand < 1.0 / (number += 1) }
chosen_line
end
Stat the file, pick a random number between zero and the size of the file, seek to that byte in the file. Scan until the next newline, then read and return the next line (assuming you're not at the end of the file).
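A minimal sketch of that seek-based idea follows. The name random_line_by_seek is hypothetical, and wrapping around to the first line when the random offset lands in the final line is an assumption added here, since the description leaves that case open. Lines that follow long lines are picked more often, so the result is not uniformly random.

```ruby
# Hypothetical sketch: jump to a random byte, skip the rest of that line,
# and return the line that follows.
def random_line_by_seek(path, rng = Random.new)
  File.open(path) do |file|
    size = file.size
    return nil if size.zero?

    file.seek(rng.rand(size)) # random byte offset in 0...size
    file.gets                 # discard the (probably partial) current line
    line = file.gets          # the next complete line, or nil at end of file
    if line.nil?              # landed in the last line: wrap to the start
      file.rewind
      line = file.gets
    end
    line
  end
end
```

The appeal of this approach is that it touches only a small part of the file, whatever its size.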

Determine last line in Ruby

I'm wondering how I can determine when I am on the last line of a file that I am reading in. My code looks like this:
File.open(file_name).each do |line|
if(someway_to_determine_last_line)
end
end
I noticed that there is a file.eof? method, but how would I call the method as the file is being read? Thanks!
If you're iterating the file with each, then the last line will be passed to the block after the end-of-file is reached, because the last line is, by definition, the line ending with EOF.
So just call file.eof? in the block.
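Calling eof? inside the block could look like the sketch below. The helper name each_line_with_last_flag is an assumption for illustration; the point is only that the file object has to be kept in a variable so the block can ask it.

```ruby
# Hypothetical helper: yields each line together with a flag that is true
# only for the final line of the file.
def each_line_with_last_flag(path)
  File.open(path) do |file|
    file.each_line do |line|
      # Once the final line has been read nothing is left, so eof? is true.
      yield line, file.eof?
    end
  end
end
```

A caller would then write each_line_with_last_flag(file_name) { |line, is_last| ... } and branch on is_last.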
If you'd like to determine if it's the last non-empty line in the file, you'd have to implement some kind of readahead.
Depending on what you need to do with this "last non-empty line", you might be able to do something like this:
last_line = nil
File.open(file_name).each do |line|
last_line = line if(!line.chomp.empty?)
# Do all sorts of other things
end
if(last_line)
# Do things with the last non-empty line.
end
Secret sauce is .to_a
lines = File.open(filename).to_a
Get the first line:
puts lines.first
Get the last line:
puts lines.last
Get the nth line of a file (at is zero-based, so this returns the sixth line):
puts lines.at(5)
Get the count of lines:
puts lines.count
fd.eof? works, but just for fun, here's a generic solution that works with any kind of enumerators (Ruby 1.9):
class Enumerator
def +(other)
Enumerator.new do |yielder|
each { |e| yielder << e }
other.each { |e| yielder << e }
end
end
def with_last
Enumerator.new do |yielder|
(self + [:some_flag_here]).each_cons(2) do |a, b|
yielder << [a, b == :some_flag_here]
end
end
end
end
# a.txt is a file containing "1\n2\n3\n"
# IO#lines was removed in Ruby 3.0; each_line returns the same kind of enumerator
File.open("a.txt").each_line.with_last.each do |line, is_last|
p [line, is_last]
end
Which outputs:
["1\n", false]
["2\n", false]
["3\n", true]
Open your file and use the readlines method:
To simply manipulate last line of file do the following:
f = File.open('example.txt').readlines
f.each do |readline|
if readline.equal?(f.last)
puts "LAST LINE, do something to it"
else
puts "#{readline} "
end
end
Line 1 reads the file in as an array of lines
Line 2 uses that object and iterates over each of them
Line 3 tests if the current line matches the last line
Line 4 acts if it's a match
Line 5 & 6 handle behavior for non-matching circumstance

Resources